BDAP: A data mining platform based on Spark

BU Yao; WU Bin; CHEN Yufeng; BAI Demeng

doi:10.3969/j.issn.0253-2778.2017.04.011


Open Access JUSTC Original Paper

1. Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China
2. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
3. State Grid Shandong Electric Power Research Institute, Jinan 250000, China

Received Date: 28 August 2016
Rev Recd Date: 08 December 2017
Publish Date: 30 April 2017

Abstract

Big data processing systems have become a research hotspot in the field of big data. This paper first analyzes the architecture and functions of a data analysis platform, dividing it into a data source layer, data absorption layer, data storage layer, data platform layer, security and monitoring layer, equipment layer, and application layer. This architecture provides the foundation for big data analysis. The platform includes multiple data preprocessing and algorithm modules and offers a comprehensive set of features that can be freely combined. Coupling between the modules is low, which eases maintenance and further development. From the user's point of view, parameter adjustment, process construction, monitoring, and the data mining process itself are all visual, and workflow and scheduling stream technologies are available. In terms of performance, the BDAP algorithms perform better than Hive and MLlib. Finally, an example illustrates application scenarios of the platform: by analyzing circuit fault data together with meteorological data, faults can be predicted and classified, and video mining can be used to extract useful information.

Volume 47, Issue 4, pages 358-368
