ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Original Paper

Spark/Shark-based OLAP system for smart grid applications

Cite this:
https://doi.org/10.3969/j.issn.0253-2778.2016.01.009
  • Received Date: 27 August 2015
  • Accepted Date: 29 September 2015
  • Rev Recd Date: 29 September 2015
  • Publish Date: 30 January 2016
  • OLAP queries on electricity consumption information in the smart grid have several prominent features: very large data volumes, join operations across multiple tables, and complex SQL structure. For this kind of application, a traditional RDBMS suffers from poor scalability, low write throughput, and unacceptable query performance. A Spark/Shark-based OLAP system for electricity consumption information in the smart grid was therefore designed. The system uses the distributed file system HDFS for data storage, Shark to parse SQL queries, and Spark to execute them. However, Shark does not support fine-grained indexes, which hinders further improvement of query performance. To overcome this limitation, a trie-based fine-grained index technique, TrieIndex, and a data reorganization scheme were proposed. Experiments with real electricity consumption data and queries show that the write throughput of the system is 12 times that of the RDBMS, and its query efficiency is 10 times that of the original Shark.
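
The abstract does not describe the internals of TrieIndex, so the following is only a minimal sketch of the general idea behind a trie-based fine-grained index: a key is mapped to the locations of the matching records, so that a query can read those locations instead of scanning every file. The class name `TrieIndexSketch`, the `meter-...` keys, and the `(file name, byte offset)` location format are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical record location: (data file name, byte offset in that file).
Location = Tuple[str, int]


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    locations: List[Location] = field(default_factory=list)


class TrieIndexSketch:
    """Generic character trie mapping string keys to record locations."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, key: str, location: Location) -> None:
        # Walk or create one node per character, then store the location.
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.locations.append(location)

    def lookup(self, key: str) -> List[Location]:
        # Exact-match lookup: locations stored at the node for the full key.
        node = self._walk(key)
        return list(node.locations) if node is not None else []

    def prefix_lookup(self, prefix: str) -> List[Location]:
        # Collect locations of every key that starts with the prefix.
        node = self._walk(prefix)
        if node is None:
            return []
        found: List[Location] = []
        stack = [node]
        while stack:
            current = stack.pop()
            found.extend(current.locations)
            stack.extend(current.children.values())
        return sorted(found)

    def _walk(self, key: str) -> Optional[TrieNode]:
        # Follow the key character by character; None if the path is missing.
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node


# Illustrative data only: the keys and file names are made up.
index = TrieIndexSketch()
index.insert("meter-0001", ("part-00000", 0))
index.insert("meter-0001", ("part-00003", 4096))
index.insert("meter-0002", ("part-00001", 128))
```

In this sketch, `index.lookup("meter-0001")` returns only the two locations recorded for that key, which is the kind of pruning a fine-grained index is meant to provide. How the actual system builds, stores, and consults its index alongside the data reorganization scheme is not stated in the abstract.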