• 中文核心期刊要目总览
  • 中国科技核心期刊
  • 中国科学引文数据库(CSCD)
  • 中国科技论文与引文数据库(CSTPCD)
  • 中国学术期刊文摘数据库(CSAD)
  • 中国学术期刊(网络版)(CNKI)
  • 中文科技期刊数据库
  • 万方数据知识服务平台
  • 中国超星期刊域出版平台
  • 国家科技学术期刊开放平台
  • 荷兰文摘与引文数据库(SCOPUS)
  • 日本科学技术振兴机构数据库(JST)

基于Spark/Shark的电力用采大数据OLAP分析系统

Spark/Shark-based OLAP system for smart grid applications

  • 摘要: 用电信息大数据上的OLAP查询涉及数据量大,具有多表连接操作频繁、SQL结构复杂等特点,传统关系型数据库面对该类应用,表现出可扩展性弱、数据写入吞吐量低与查询效率低等问题.为此设计了一套基于Spark/Shark的电力大数据OLAP分析系统,该系统采用分布式文件系统HDFS保存电力用电信息采集系统的大数据,通过Shark进行前端SQL解析,Spark进行查询计算;然而,原生Shark只支持粗粒度分区,不支持细粒度的索引技术,难以高效地过滤无关数据,影响了查询性能.为克服这一不足,该系统设计了一种基于前缀树的细粒度索引结构TrieIndex,并通过数据重组技术优化了数据在HDFS的分布,提升了Shark的数据过滤能力以及用电信息大数据OLAP分析的性能.真实用电信息采集系统数据与查询的实验结果表明,该系统比关系型数据库的写入速度提升了12倍,比原生Shark的查询效率提升了10倍以上.

     

    Abstract: The OLAP queries on electricity consumption information in Smart Grid have some prominent features: huge amounts of data, involving multiple tables in a joint operation, complex SQL structure, etc. Faced with this kind of applications, traditional RDBMS always leads to poor scalability, low write throughput, and unacceptable query performance, etc. A Spark/Shark-Based OLAP system for electricity consumption information in smart grid was designed. The system used distributed file system HDFS for data storage, and makes use of Shark to parse the SQL queries and Spark to execute them. However, Shark does not support fine-grained index, which hinders further improvement of query performance. To overcome this limitation, a Trie tree based fine-grained index technique TrieIndex and data re-organization scheme for better query performance was proposed. The experiment results with real electricity consumption information data and query show that the write throughput of the system is 12 times faster than that of RDBMS, and the query efficiency of the system is 10 times greater than that of original Shark.

     

/

返回文章
返回