ISSN 0253-2778

CN 34-1054/N

Open AccessOpen Access JUSTC Original Paper

Parallel ISOMAP algorithm based on Spark

Cite this:
https://doi.org/10.3969/j.issn.0253-2778.2016.09.001
  • Received Date: 01 March 2016
  • Accepted Date: 17 September 2016
  • Rev Recd Date: 17 September 2016
  • Publish Date: 30 September 2016
  • To achieve quick dimensionality reduction of the nonlinear high dimensional in the big data environment, a parallel ISOMAP algorithm based on Spark was proposed. In the method, a parallel algorithm to search for near neighbors was designed and realized to fast build the neighborhood matrix, which was based on exact Euclidean locality sensitive hashing. A parallel eigenvalue solving method was designed to quickly solve the eigenvalues, which executes the power method and the order reduction method in turn. To further improve the performance of the algorithm, the parallel method was optimized to reduce the memory consumption and data transmissions through using Spark’s sparse vector, broadcast mechanism and caching mechanism according to the characteristics of Spark. Experimental results on Swiss roll data and S-curve data demonstrated that the parallel ISOMAP algorithm based on Spark greatly improved the executing efficiency of the method through parallel executing and optimizing the calculating procedure. It could be suitable for reducing the dimension of large scale data sets.
    To achieve quick dimensionality reduction of the nonlinear high dimensional in the big data environment, a parallel ISOMAP algorithm based on Spark was proposed. In the method, a parallel algorithm to search for near neighbors was designed and realized to fast build the neighborhood matrix, which was based on exact Euclidean locality sensitive hashing. A parallel eigenvalue solving method was designed to quickly solve the eigenvalues, which executes the power method and the order reduction method in turn. To further improve the performance of the algorithm, the parallel method was optimized to reduce the memory consumption and data transmissions through using Spark’s sparse vector, broadcast mechanism and caching mechanism according to the characteristics of Spark. Experimental results on Swiss roll data and S-curve data demonstrated that the parallel ISOMAP algorithm based on Spark greatly improved the executing efficiency of the method through parallel executing and optimizing the calculating procedure. It could be suitable for reducing the dimension of large scale data sets.
  • loading
  • [1]
    TENENBAUM J B, DE SILVA V, LANGFORD J C. A global geometric framework for nonlinear dimensionality reduction[J]. Science, 2000, 290(5): 2319-2323.
    [2]
    王自强,钱旭,孔敏. 流形学习算法综述[J]. 计算机工程与应用,2008,44(35): 9-12.
    WANG Z Q, QIAN X, KONG M. Survey on manifold learning algorithms[J]. Computer Engineering and Applications, 2008, 44(35): 9-12.
    [3]
    PLESS R, SOUVENIR R. A survey of manifold learning for images[J]. IPSJ Transactions on Computer Vision & Applications, 2009, 1(1): 83-94.
    [4]
    IZENMAN A J. Introduction to manifold learning[J]. Wiley Interdisciplinary Reviews Computational Statistics, 2012, 4(5):439-446.
    [5]
    曾宪华, 罗四维. 全局保持的流形学习算法对比研究[J]. 计算机工程与应用, 2010, 46(15): 1-6.
    ZENG X H, LUO S W. Contrasting research of global preserving manifold learning algorithms[J]. Computer Engineering and Applications, 2010, 46(15): 1-6.
    [6]
    尹宏伟,李凡长. 谱机器学习研究综述[J]. 计算机科学与探索,2015,9(12):1409-1419.
    [7]
    任磊,杜一,马帅,等.大数据可视分析综述[J]. 软件学报, 2014, 25(9): 1909-1936.
    [8]
    李毅. 基于Hadoop平台的局部线性嵌入算法研究[D]. 华南理工大学,2011.
    [9]
    卞云龙. 基于云计算平台的大规模流形学习算法研究[D]. 南京理工大学,2012.
    [10]
    薛永坚, 倪志伟. 基于MapReduce的大规模数据集流形学习降维研究[J]. 系统工程理论与实践,2014,34(S): 151-157.
    [11]
    刘勇. 头部姿态估计的监督流形学习研究及其并行化扩展[D]. 厦门大学,2013
    [12]
    GU L, LI H. Memory or time: Performance evaluation for iterative operation on hadoop and spark [C]// International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing. Zhangjiajie, China: IEEE Press, 2013: 721-727.
    [13]
    DATAR M, IMMORLICA N, INDYK P, et al. Locality-sensitive hashing scheme based on p-stable distributions [C]// Proceedings the of 20th Annual symposium on Computational Geometry. New York: ACM Press, 2004: 253-262.
    [14]
    DE SILVA V, TENENBAUM J B. Global versus local methods in nonlinear dimensionality reduction [C] // Proceedings of the Advances in Neural Information Processing Systems. Cambridge: IEEE Press, 2003, 15: 1959-1966.
    [15]
    张瑞杰,郭志刚, 李弼程,等. 基于E2LSH-MKL的视觉语义概念检测[J]. 自动化学报,2012,38(10): 1671-1678.
    ZHANG R J, GUO Z G, LI B C, et al. A visual semantic concept detection algorithm based on E2LSH-MKL[J]. Acta Automatica Sinica, 2012, 38(10): 1671-1678.
    [16]
    王洪峰,刘辛. 基于位置敏感哈希的网络视频重复检测[J]. 计算机应用研究,2012,29(5): 1954-1958.
    WANG H F, LIU X. Near-duplicate Web video detection based on locality sensitive hashing[J]. Application Research of Computers, 2012,29(5): 1954-1958.
    [17]
    INDYK P. Stable distributions, pseudorandom generators, embeddings, and data stream computation [J]. Journal of the ACM, 2006, 53(3): 307-323.
    [18]
    SPARKS E R, TALWALKAR A, SMITH V, et al. MLI: An API for distributed machine learning [C]// Proceedings of the 13th International Conference on Data Mining. Dallas: IEEE Computer Society, 2013:1187-1192.
    [19]
    ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing [C]// Proceedings of the USENIX Conference on Networked Systems Design and Implementation. Berkeley: ACM Press, 2012, 70(2):141-146.
    [20]
    ZAHARIA M, CHOWDHURY M, FRANKLIN M J, et al. Spark: Cluster computing with working sets [J]. USENIX Conference on Hot Topics in Cloud Computing, 2010, 15(1):1765-1773.)
  • 加载中

Catalog

    [1]
    TENENBAUM J B, DE SILVA V, LANGFORD J C. A global geometric framework for nonlinear dimensionality reduction[J]. Science, 2000, 290(5): 2319-2323.
    [2]
    王自强,钱旭,孔敏. 流形学习算法综述[J]. 计算机工程与应用,2008,44(35): 9-12.
    WANG Z Q, QIAN X, KONG M. Survey on manifold learning algorithms[J]. Computer Engineering and Applications, 2008, 44(35): 9-12.
    [3]
    PLESS R, SOUVENIR R. A survey of manifold learning for images[J]. IPSJ Transactions on Computer Vision & Applications, 2009, 1(1): 83-94.
    [4]
    IZENMAN A J. Introduction to manifold learning[J]. Wiley Interdisciplinary Reviews Computational Statistics, 2012, 4(5):439-446.
    [5]
    曾宪华, 罗四维. 全局保持的流形学习算法对比研究[J]. 计算机工程与应用, 2010, 46(15): 1-6.
    ZENG X H, LUO S W. Contrasting research of global preserving manifold learning algorithms[J]. Computer Engineering and Applications, 2010, 46(15): 1-6.
    [6]
    尹宏伟,李凡长. 谱机器学习研究综述[J]. 计算机科学与探索,2015,9(12):1409-1419.
    [7]
    任磊,杜一,马帅,等.大数据可视分析综述[J]. 软件学报, 2014, 25(9): 1909-1936.
    [8]
    李毅. 基于Hadoop平台的局部线性嵌入算法研究[D]. 华南理工大学,2011.
    [9]
    卞云龙. 基于云计算平台的大规模流形学习算法研究[D]. 南京理工大学,2012.
    [10]
    薛永坚, 倪志伟. 基于MapReduce的大规模数据集流形学习降维研究[J]. 系统工程理论与实践,2014,34(S): 151-157.
    [11]
    刘勇. 头部姿态估计的监督流形学习研究及其并行化扩展[D]. 厦门大学,2013
    [12]
    GU L, LI H. Memory or time: Performance evaluation for iterative operation on hadoop and spark [C]// International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing. Zhangjiajie, China: IEEE Press, 2013: 721-727.
    [13]
    DATAR M, IMMORLICA N, INDYK P, et al. Locality-sensitive hashing scheme based on p-stable distributions [C]// Proceedings the of 20th Annual symposium on Computational Geometry. New York: ACM Press, 2004: 253-262.
    [14]
    DE SILVA V, TENENBAUM J B. Global versus local methods in nonlinear dimensionality reduction [C] // Proceedings of the Advances in Neural Information Processing Systems. Cambridge: IEEE Press, 2003, 15: 1959-1966.
    [15]
    张瑞杰,郭志刚, 李弼程,等. 基于E2LSH-MKL的视觉语义概念检测[J]. 自动化学报,2012,38(10): 1671-1678.
    ZHANG R J, GUO Z G, LI B C, et al. A visual semantic concept detection algorithm based on E2LSH-MKL[J]. Acta Automatica Sinica, 2012, 38(10): 1671-1678.
    [16]
    王洪峰,刘辛. 基于位置敏感哈希的网络视频重复检测[J]. 计算机应用研究,2012,29(5): 1954-1958.
    WANG H F, LIU X. Near-duplicate Web video detection based on locality sensitive hashing[J]. Application Research of Computers, 2012,29(5): 1954-1958.
    [17]
    INDYK P. Stable distributions, pseudorandom generators, embeddings, and data stream computation [J]. Journal of the ACM, 2006, 53(3): 307-323.
    [18]
    SPARKS E R, TALWALKAR A, SMITH V, et al. MLI: An API for distributed machine learning [C]// Proceedings of the 13th International Conference on Data Mining. Dallas: IEEE Computer Society, 2013:1187-1192.
    [19]
    ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing [C]// Proceedings of the USENIX Conference on Networked Systems Design and Implementation. Berkeley: ACM Press, 2012, 70(2):141-146.
    [20]
    ZAHARIA M, CHOWDHURY M, FRANKLIN M J, et al. Spark: Cluster computing with working sets [J]. USENIX Conference on Hot Topics in Cloud Computing, 2010, 15(1):1765-1773.)

    Article Metrics

    Article views (34) PDF downloads(90)
    Proportional views

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return