ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Original Paper

A density-based hierarchical clustering algorithm of gene data based on MapReduce

Cite this:
https://doi.org/10.3969/j.issn.0253-2778.2014.07.001
  • Received Date: 21 March 2014
  • Accepted Date: 15 June 2014
  • Rev Recd Date: 15 June 2014
  • Publish Date: 30 July 2014
  • The scale of gene expression data is increasing sharply with the rapid development of bioinformatics technology, which poses a serious challenge for traditional clustering algorithms. Density-based hierarchical clustering (DHC) can resolve the nested classes in gene expression data and is robust, but it is poorly suited to handling huge amounts of data. Therefore, a density-based hierarchical clustering algorithm on MapReduce (DisDHC) is proposed. It partitions the data set into smaller blocks, clusters each block with DHC in parallel, gathers the results for re-clustering, and produces the density centers of every cluster. Experiments on the GAL, Cell cycle, and Serum data sets show that DisDHC reduces clustering time and achieves high performance.
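The partition, parallel clustering, gather, and re-clustering flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `partition`, `local_density_centers`, `dis_cluster`, and the `radius` parameter are hypothetical, the per-block step is a simple greedy density-center picker standing in for DHC, and a thread pool on one machine stands in for a MapReduce cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist


def partition(points, n_blocks):
    """Split the data set into n_blocks roughly equal blocks (round-robin)."""
    return [points[i::n_blocks] for i in range(n_blocks)]


def local_density_centers(block, radius):
    """Stand-in for per-block DHC (assumption): greedily pick dense points.

    The density of a point is the number of block points within `radius`.
    Points are visited from densest to sparsest; a point becomes a center
    only if no already chosen center lies within `radius` of it.
    """
    density = {p: sum(1 for q in block if dist(p, q) <= radius) for p in block}
    centers = []
    for p in sorted(density, key=lambda p: (-density[p], p)):
        if all(dist(p, c) > radius for c in centers):
            centers.append(p)
    return centers


def dis_cluster(points, n_blocks, radius):
    """Partition, cluster blocks in parallel, gather, then re-cluster."""
    blocks = partition(points, n_blocks)
    # "Map" phase: each block is clustered independently of the others.
    with ThreadPoolExecutor() as pool:
        per_block = list(
            pool.map(lambda b: local_density_centers(b, radius), blocks)
        )
    # "Reduce" phase: gather the per-block centers and cluster them again
    # so that centers found in different blocks for the same group merge.
    gathered = [c for centers in per_block for c in centers]
    return local_density_centers(gathered, radius)


if __name__ == "__main__":
    # Two well-separated toy groups; real input would be expression profiles.
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (10.1, 10.1)]
    print(dis_cluster(data, n_blocks=2, radius=1.0))
```

Note that the re-clustering step in this sketch sees only the gathered centers, so its density counts reflect how many blocks reported a nearby center rather than the original point density. How DisDHC carries density information across that step is not stated in the abstract.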