ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Original Paper

Tabular-oriented data model and its query issues

Cite this:
https://doi.org/10.3969/j.issn.0253-2778.2016.01.008
  • Received Date: 27 August 2015
  • Accepted Date: 29 September 2015
  • Rev Recd Date: 29 September 2015
  • Publish Date: 30 January 2016
  • With the rapid development of information technologies, data from various sources, including not only traditional structured data such as relational and object-oriented databases but also unstructured data such as Excel and CSV documents, are stored and represented in distributed and heterogeneous ways. Such data are high-volume, continuously updated, and of low usability, and therefore fall within the scope of Big Data. However, organizing and managing Excel and similar data with unstructured or semi-structured methods leads to weakly controllable, weakly usable data structures with poor access efficiency. To address this problem, this paper takes Excel as the data source, proposes a new tabular-oriented relational data model, and discusses tabular querying and optimization. First, a formal definition of tabular-form data is given. Second, a PartiPath tree is designed to achieve structural transformation by tabular division, together with its relation schema; the data model is then presented. After that, four basic queries and their optimization by an improved DICE with user interest similarity are described. Finally, an experiment is conducted and a conclusion is drawn.
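    As background for the similarity measure named in the abstract, the following is a minimal sketch of the standard Dice coefficient between two sets, 2|A ∩ B| / (|A| + |B|). It is not the paper's improved DICE, whose details are not given here; the function name `dice_similarity` and the sample attribute sets are assumptions made purely for illustration.

```python
def dice_similarity(a, b):
    """Standard Dice coefficient between two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Hypothetical attribute sets referenced by two user queries.
q1 = {"region", "year", "sales"}
q2 = {"region", "year", "profit"}
score = dice_similarity(q1, q2)  # two shared attributes out of 3 + 3
```

    A score near 1 indicates that two queries touch largely the same attributes, while a score of 0 indicates no overlap.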
