A hierarchical classification model for class-imbalanced data

SHI Peibei; LIU Guiquan; WANG Zhong; WEI Bing

doi:10.3969/j.issn.0253-2778.2015.01.010

Open Access JUSTC Original Paper

1. Department of Public Computer Teaching, Hefei Normal University, Hefei 230601, China
2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
3. Department of Digital Technology, No.38 Research Institute of CETC, Hefei 230088, China

Received Date: 09 June 2014
Accepted Date: 29 July 2014
Rev Recd Date: 29 July 2014
Publish Date: 30 January 2015

Abstract

Traditional machine learning methods achieve lower classification performance on class-imbalanced data. A hierarchical classification model for class-imbalanced data is therefore proposed. Using AdaBoost as its base classifier, the model is formulated mathematically in terms of the features and false positive rates of the classifiers, and it is shown that the parameters of the hierarchical classification model can be calculated. First, a hierarchical classification tree is adopted as the structure; the classification cost of the tree model is then derived, together with a quantitative mathematical description of the features of each layer. Finally, the classification cost is converted into an optimization problem, and the procedure for solving it is given. Results of the hierarchical classification are also presented. Experiments on UCI datasets show that the proposed method achieves higher AUC and F-measure than many existing class-imbalanced learning methods.
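The abstract does not state the cost function itself. As a rough illustration only, the sketch below assumes a cascade in the style of Viola and Jones's boosted cascade, where each layer passes a fraction of the negative samples on to the next layer. Under that assumption the overall false positive rate is the product of the per-layer rates, and the expected number of features evaluated per negative sample is a sum weighted by the probability of reaching each layer. The names `Layer`, `cascade_rates`, and `expected_features_per_negative`, and all numbers used, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Layer:
    """One layer of an assumed cascade (hypothetical structure, not the paper's)."""
    n_features: int             # features evaluated by this layer's classifier
    false_positive_rate: float  # fraction of negatives that pass this layer
    detection_rate: float       # fraction of positives that pass this layer


def cascade_rates(layers: List[Layer]) -> Tuple[float, float]:
    """Return (overall false positive rate, overall detection rate).

    Assumes layers act independently, so the rates multiply.
    """
    fp, det = 1.0, 1.0
    for layer in layers:
        fp *= layer.false_positive_rate
        det *= layer.detection_rate
    return fp, det


def expected_features_per_negative(layers: List[Layer]) -> float:
    """Expected number of features evaluated for one negative sample.

    A negative sample reaches layer i only if every earlier layer
    passed it, which happens with probability prod_{j<i} f_j.
    """
    cost = 0.0
    reach = 1.0  # probability that a negative sample reaches the current layer
    for layer in layers:
        cost += reach * layer.n_features
        reach *= layer.false_positive_rate
    return cost


# Illustrative values only: three layers of growing size.
layers = [
    Layer(n_features=2, false_positive_rate=0.5, detection_rate=0.99),
    Layer(n_features=10, false_positive_rate=0.5, detection_rate=0.99),
    Layer(n_features=50, false_positive_rate=0.5, detection_rate=0.99),
]
overall_fp, overall_det = cascade_rates(layers)
cost = expected_features_per_negative(layers)
print(overall_fp, overall_det, cost)
```

In this toy setting, choosing the per-layer feature counts and false positive rates to minimize `cost` subject to targets on the overall rates is one way such a classification cost could be posed as an optimization problem; the paper's actual formulation may differ.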





Volume 45 Issue 1 page: 61-68
