ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Original Paper

A latent semantic analysis classification technique based on optimized categorization information

Cite this:
https://doi.org/10.3969/j.issn.0253-2778.2015.04.009
  • Received Date: 21 March 2014
  • Accepted Date: 04 November 2014
  • Rev Recd Date: 04 November 2014
  • Publish Date: 30 April 2015
  • Latent semantic analysis (LSA) is an effective dimensionality reduction method that has been widely applied to text learning tasks such as information retrieval and text categorization. Focusing on the classification of professional literature, this paper analyzes the features of texts from the same category and from different categories under a strict classification system, taking patent document classification as an example, and proposes an optimized LSA classification technique based on categorization information. Using feature information from texts of different categories, the technique divides the original documents into several pseudo-documents and strengthens the occurrence frequency of the features exclusive to each category, thereby building an optimized latent semantic space and improving the performance of the classification model. Experimental results show that the method effectively improves precision when applied to text categorization.
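The pseudo-document idea summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' algorithm: the abstract does not give the splitting rule, so the sketch assumes that a term is "exclusive" when it occurs in training documents of a single category, that each document is split into one pseudo-document of exclusive terms and one of the remaining terms, and that exclusive terms are repeated `boost` times. The names `exclusive_terms`, `pseudo_documents`, `term_document_matrix`, and `boost` are all hypothetical. The resulting term-by-pseudo-document count matrix is what a truncated SVD would then factor to obtain the latent semantic space; that step is omitted here.

```python
from collections import Counter, defaultdict


def exclusive_terms(docs, labels):
    """Map each category to the terms seen in that category only (assumed definition)."""
    cats_of_term = defaultdict(set)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            cats_of_term[term].add(label)
    exclusive = defaultdict(set)
    for term, cats in cats_of_term.items():
        if len(cats) == 1:
            exclusive[next(iter(cats))].add(term)
    return exclusive


def pseudo_documents(docs, labels, boost=2):
    """Split each document into an exclusive-term part and a shared-term part.

    The exclusive part is repeated `boost` times to raise the frequency of
    category-exclusive terms (an assumed strengthening rule).
    """
    exclusive = exclusive_terms(docs, labels)
    result = []
    for tokens, label in zip(docs, labels):
        own = [t for t in tokens if t in exclusive[label]]
        rest = [t for t in tokens if t not in exclusive[label]]
        if own:
            result.append((own * boost, label))
        if rest:
            result.append((rest, label))
    return result


def term_document_matrix(pseudo_docs):
    """Build a term-by-pseudo-document count matrix as a list of rows."""
    vocab = sorted({t for tokens, _ in pseudo_docs for t in tokens})
    row_of = {term: i for i, term in enumerate(vocab)}
    matrix = [[0] * len(pseudo_docs) for _ in vocab]
    for col, (tokens, _) in enumerate(pseudo_docs):
        for term, count in Counter(tokens).items():
            matrix[row_of[term]][col] = count
    return vocab, matrix


# Toy corpus with two made-up categories; tokens are already extracted.
docs = [
    ["engine", "piston", "valve"],
    ["engine", "circuit", "diode"],
    ["piston", "engine"],
    ["diode", "signal"],
]
labels = ["mech", "elec", "mech", "elec"]

pseudo = pseudo_documents(docs, labels, boost=2)
vocab, matrix = term_document_matrix(pseudo)
```

In this toy corpus, "engine" occurs in both categories and therefore lands in the shared pseudo-documents, while terms such as "piston" or "diode" are isolated and boosted in their own columns. Whether the paper weights or splits features this way cannot be determined from the abstract alone.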
