ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Original Paper

A hierarchical classification model for class-imbalanced data

Cite this:
https://doi.org/10.3969/j.issn.0253-2778.2015.01.010
  • Received Date: 09 June 2014
  • Accepted Date: 29 July 2014
  • Revision Received Date: 29 July 2014
  • Publish Date: 30 January 2015
  • Traditional machine learning methods achieve lower classification performance on class-imbalanced data. A hierarchical classification model for class-imbalanced data is therefore proposed. Using AdaBoost as its base classifier, the model is formulated mathematically in terms of the features and false positive rates of the classifiers, and it is shown that the parameters of the hierarchical classification model can be calculated. First, a hierarchical classification tree is adopted as the structure; the classification cost of the tree model is then derived, together with a quantitative mathematical description of the features of each layer. Finally, the classification cost is converted into an optimization problem, and a procedure for solving it is given. Results of the hierarchical classification are also presented. Experiments on UCI datasets show that the proposed method achieves higher AUC and F-measure than many existing class-imbalanced learning methods.
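The abstract does not give the cost formulas, so the following is only an illustrative sketch of how per-layer false positive rates and feature counts can combine into an overall cost. It assumes a simple cascade in which a sample reaches a layer only if every earlier layer accepted it, and in which layers behave independently. The function names, the `(detection_rate, false_positive_rate, n_features)` tuples and the example numbers are hypothetical and are not taken from the paper.

```python
from math import prod

def cascade_rates(layers):
    """Overall detection and false positive rates of a cascade.

    layers: list of (detection_rate, false_positive_rate, n_features)
    tuples, one per layer (hypothetical layout, assumed independent).
    """
    overall_detection = prod(d for d, _, _ in layers)
    overall_false_positive = prod(f for _, f, _ in layers)
    return overall_detection, overall_false_positive

def expected_features_per_negative(layers):
    """Expected number of features evaluated for one majority-class sample.

    A negative sample reaches layer i only if all earlier layers passed
    it on as a false positive, so layer i is charged with that probability.
    """
    reach_probability = 1.0
    expected = 0.0
    for _, false_positive_rate, n_features in layers:
        expected += reach_probability * n_features
        reach_probability *= false_positive_rate
    return expected

# Three illustrative layers: cheap first, more features later.
layers = [(0.99, 0.5, 2), (0.99, 0.5, 10), (0.99, 0.5, 25)]
detection, false_positive = cascade_rates(layers)
cost = expected_features_per_negative(layers)
print(detection, false_positive, cost)
```

Under these assumptions, choosing the number of layers and the per-layer rates to minimize `cost` subject to targets on the overall detection and false positive rates is one way to read the optimization problem mentioned in the abstract; the paper's actual formulation may differ.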