A Density-Based Hierarchical Clustering Algorithm for Gene Data Based on MapReduce

Tu Jinjin (涂金金); Yang Ming (杨明); Guo Lina (郭丽娜)

doi:10.3969/j.issn.0253-2778.2014.07.001

Abstract: With the rapid development of bioinformatics technology, the volume of gene expression data is growing sharply, which poses a serious challenge for traditional gene expression clustering algorithms. Density-based hierarchical clustering (DHC) handles the nested-cluster structure of gene expression data well and is robust, but it is inefficient on massive data sets. To address this, a density-based hierarchical clustering algorithm on MapReduce, DisDHC, is proposed. The algorithm first partitions the data and clusters each subset with DHC to obtain a sparsified representation of the data. It then applies DHC again to the gathered results and finally produces the density centers of the whole data set. Experiments on the yeast (GAL) data set, the yeast cell cycle data set, and the human serum data set show that DisDHC preserves the clustering quality of DHC while greatly reducing clustering time.
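The partition, local clustering, and re-clustering flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the function names, the round-robin partitioning, the `radius` parameter, and in particular `density_centers` (a simple greedy density picker standing in for the real DHC algorithm) are all assumptions made for the example. The map phase runs sequentially here; in a MapReduce setting each block would be handled by a separate mapper.

```python
from math import dist


def partition(points, n_blocks):
    """Split the data into n_blocks roughly equal blocks (round-robin; an assumption)."""
    return [points[i::n_blocks] for i in range(n_blocks)]


def density_centers(points, radius):
    """Hypothetical stand-in for DHC, NOT the real algorithm.

    Repeatedly picks the remaining point with the most neighbours within
    `radius` as a density center, then discards everything within `radius`
    of that center.
    """
    remaining = list(points)
    centers = []
    while remaining:
        # Density of p = number of remaining points within `radius` of p.
        best = max(
            remaining,
            key=lambda p: sum(dist(p, q) <= radius for q in remaining),
        )
        centers.append(best)
        remaining = [q for q in remaining if dist(best, q) > radius]
    return centers


def map_phase(blocks, radius):
    """Cluster each block independently; the output is a sparsified data set."""
    return [c for block in blocks for c in density_centers(block, radius)]


def reduce_phase(local_centers, radius):
    """Re-cluster the gathered local centers to get global density centers."""
    return density_centers(local_centers, radius)


def dis_dhc_sketch(points, n_blocks, radius):
    """Two-phase flow: partition, cluster per block, then re-cluster."""
    blocks = partition(points, n_blocks)
    local_centers = map_phase(blocks, radius)
    return reduce_phase(local_centers, radius)
```

Because the second pass only sees the local centers rather than every original point, its input is much smaller than the full data set, which is the intuition behind the reported reduction in clustering time.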
