ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Original Paper

BDAP: A data mining platform based on Spark

Cite this:
https://doi.org/10.3969/j.issn.0253-2778.2017.04.011
  • Received Date: 28 August 2016
  • Revision Received Date: 08 December 2017
  • Publish Date: 30 April 2017
  • Big data processing systems have become a research hot spot in the field of big data. First, the architecture and functions of the data analysis platform are analyzed, and the platform is divided into a data source layer, data absorption layer, data storage layer, data platform layer, security and monitoring layer, equipment layer, and application layer. The platform includes multiple data preprocessing and algorithm modules, and its architecture provides a foundation for big data analysis. The platform offers a comprehensive set of features that can be freely combined. Coupling between the modules is low, which makes maintenance and further development convenient. From the user's point of view, parameter adjustment, process construction, monitoring, and the data mining process itself are all visual, and workflow and scheduling-stream technologies are available. In terms of performance, the BDAP algorithms perform better than Hive and MLlib. Finally, an example illustrates the application scenarios of this data mining platform: circuit fault and meteorological data are analyzed so that faults can be predicted and classified, and video mining is used to extract useful information.
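    The idea of loosely coupled preprocessing and algorithm modules that can be freely combined into a workflow can be illustrated with a minimal sketch. This is not BDAP's implementation and does not use Spark; it is plain Python under the assumption that a module is simply a function from records to records. All names here (`drop_missing`, `min_max_scale`, `threshold_label`, `Workflow`, the `wind_speed` and `fault_risk` fields) are hypothetical and chosen only for illustration.

    ```python
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    Record = Dict[str, object]
    # Assumption: a module is any function mapping a list of records to a new list.
    Module = Callable[[List[Record]], List[Record]]


    def drop_missing(field: str) -> Module:
        """Preprocessing module: discard records whose field is absent or None."""
        def run(records: List[Record]) -> List[Record]:
            return [r for r in records if r.get(field) is not None]
        return run


    def min_max_scale(field: str) -> Module:
        """Preprocessing module: rescale a numeric field to the range [0, 1]."""
        def run(records: List[Record]) -> List[Record]:
            if not records:
                return []
            values = [float(r[field]) for r in records]
            lo, hi = min(values), max(values)
            span = hi - lo
            # A constant column has no spread, so map every value to 0.0.
            return [
                {**r, field: (float(r[field]) - lo) / span if span else 0.0}
                for r in records
            ]
        return run


    def threshold_label(field: str, cutoff: float, label: str) -> Module:
        """Toy algorithm module: flag records whose field reaches the cutoff."""
        def run(records: List[Record]) -> List[Record]:
            return [{**r, label: float(r[field]) >= cutoff} for r in records]
        return run


    @dataclass
    class Workflow:
        """An ordered chain of modules; each step sees only the previous output."""
        steps: List[Module]

        def run(self, records: List[Record]) -> List[Record]:
            for step in self.steps:
                records = step(records)
            return records


    if __name__ == "__main__":
        rows = [{"wind_speed": 10.0}, {"wind_speed": None}, {"wind_speed": 30.0}]
        flow = Workflow([
            drop_missing("wind_speed"),
            min_max_scale("wind_speed"),
            threshold_label("wind_speed", 0.5, "fault_risk"),
        ])
        for row in flow.run(rows):
            print(row)
    ```

    Because each module depends only on the record format and not on any other module, steps can be reordered, removed, or replaced without touching the rest of the chain, which is the property the abstract attributes to low coupling.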
