ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Original Paper

Research on web page automatic categorization based on structural and text information

Cite this:
https://doi.org/10.3969/j.issn.0253-2778.2017.04.002
  • Received Date: 01 March 2016
  • Revision Received Date: 17 September 2016
  • Publish Date: 30 April 2017
  • Abstract: Web pages contain abundant information resources, and categorizing them allows that information to be extracted and managed more effectively. Because web pages combine a complex structure with abundant text, a categorization method based on both structural and text information is proposed. The method combines joint features and atomic features to classify web pages. Experimental results show that the proposed method is feasible to some extent and achieves higher precision and recall than a method that uses text information only.
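    The abstract does not say how the structural and text features are built, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that structure enters as a per-tag weight applied to term counts, and it shows how per-category precision and recall are computed. The tag weights, the whitespace tokenizer, and the names `TAG_WEIGHTS`, `weighted_terms`, and `precision_recall` are hypothetical; the joint and atomic features named in the abstract are not modelled here.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights, chosen for illustration; not taken from the paper.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0


class TagWeightedTerms(HTMLParser):
    """Accumulate term weights, scaled by the heaviest enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []         # tags that are currently open
        self.terms = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
            default=DEFAULT_WEIGHT,
        )
        # Naive whitespace tokenizer (an assumption; Chinese text would
        # need a proper word segmenter).
        for term in data.lower().split():
            self.terms[term] += weight


def weighted_terms(html):
    """Return a Counter mapping each term to its structure-weighted count."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return parser.terms


def precision_recall(true_labels, predicted_labels, category):
    """Precision and recall for one category over paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

    In such a setup the weighted term counts could be fed to any standard text classifier, and a text-only baseline corresponds to setting every tag weight to 1.0.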

