Machine learning in data envelopment analysis: A smart mechanism for indicator selection

Jie Wu; Yumeng Wu

doi:10.52396/JUSTC-2022-0106

Open Access JUSTC Management 22 November 2022

School of Management, University of Science and Technology of China, Hefei 230026, China

Author Bio:
Jie Wu is currently a Professor at the School of Management, University of Science and Technology of China (USTC). He received his Ph.D. degree in Management Science from USTC in 2008. His research mainly focuses on operations research.

Yumeng Wu is currently a master's student at the School of Management, University of Science and Technology of China. She received her B.S. degree from Northwestern Polytechnical University in 2016. Her research interests focus on operations research and computer science.
Corresponding author: E-mail: wuyumeng@mail.ustc.edu.cn
Received Date: 20 July 2022
Accepted Date: 05 September 2022

Available Online: 22 November 2022

Abstract

Indicator selection has been a compelling problem in data envelopment analysis (DEA). With the advent of the big data era, scholars face increasingly complex indicator selection situations. The boom in machine learning presents an opportunity to address this problem. However, poor-quality indicators may be selected if inappropriate methods are used in overfitting or underfitting scenarios. To date, some scholars have pioneered the use of the least absolute shrinkage and selection operator (LASSO) to select indicators in overfitting scenarios, but researchers have neither classified the big data scenarios encountered by DEA into overfitting and underfitting scenarios nor attempted to develop a complete indicator selection system for both. To fill these research gaps, this study employs machine learning methods and proposes a mean score approach based on them. Our Monte Carlo simulations show that LASSO dominates in overfitting scenarios but fails to select good indicators in underfitting scenarios, whereas the ensemble methods are superior in underfitting scenarios, and the proposed mean score approach performs well in both. Based on the strengths and limitations of the different methods, a smart indicator selection mechanism is proposed to facilitate the selection of DEA indicators.

Graphical abstract

The overall framework of our research.

Public Summary

The purpose of this study is to categorize big data scenarios encountered by data envelopment analysis into overfitting and underfitting scenarios, and to develop a smart indicator selection mechanism based on numerous machine learning approaches.
This study advances the development of data envelopment analysis in the context of big data and combines it with machine learning to provide a convenient and intelligent indicator selection mechanism for scholars in the field of efficiency evaluation.
According to Monte Carlo simulations, most machine learning approaches to indicator selection perform well only in certain scenarios, whereas the proposed mean score methodology performs well in both overfitting and underfitting scenarios.
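
The idea of a mean score can be illustrated with a minimal sketch. The text above does not state how the scores of the individual methods are scaled or combined, so the min-max normalization, the plain arithmetic mean, the function names, and the importance values below are all assumptions made for illustration, not the authors' procedure.

```python
from statistics import mean


def min_max_normalize(scores):
    """Rescale one method's raw importance scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Every indicator ties under this method, so it carries no ranking signal.
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


def mean_score(method_scores):
    """Average each indicator's normalized score across all methods."""
    normalized = [min_max_normalize(s) for s in method_scores.values()]
    indicators = normalized[0].keys()
    return {name: mean(n[name] for n in normalized) for name in indicators}


def select_indicators(method_scores, k):
    """Return the k indicators with the highest mean score."""
    averaged = mean_score(method_scores)
    return sorted(averaged, key=averaged.get, reverse=True)[:k]


# Hypothetical importance scores for four candidate indicators (assumed values).
method_scores = {
    "lasso": {"x1": 0.8, "x2": 0.0, "x3": 0.4, "x4": 0.0},
    "random_forest": {"x1": 0.5, "x2": 0.1, "x3": 0.3, "x4": 0.1},
    "gradient_boosting": {"x1": 0.6, "x2": 0.2, "x3": 0.2, "x4": 0.0},
}

selected = select_indicators(method_scores, k=2)
print(selected)
```

Normalizing before averaging keeps a method with large raw scores from dominating the others; a rank-based average would be another reasonable choice under the same assumptions.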

Figure 1. Flow chart for the mean score methodology.

Figure 2. Comparison of indicator identification ability of methods in underfitting scenarios.

Figure 3. Comparison of indicator identification ability of methods in overfitting scenarios.

Figure 4. AMSE of methods between the overfitting and underfitting scenarios.

Figure 5. MSE of methods in the overfitting scenario with 30 replications.

Figure 6. MSE of methods in underfitting scenario I with 30 replications.

Figure 7. MSE of methods in underfitting scenario II with 30 replications.

Figure 8. A smart indicator selection mechanism for DEA.
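
The captions above refer to MSE over 30 replications and to AMSE, without defining either quantity here. The sketch below assumes that MSE compares estimated with true efficiency scores within one replication and that AMSE is the average of those MSE values across replications; both readings, the function names, and the numbers are assumptions for illustration only.

```python
def mse(estimated, true):
    """Mean squared error between estimated and true efficiency scores."""
    if len(estimated) != len(true):
        raise ValueError("score lists must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimated, true)) / len(true)


def average_mse(replications):
    """Average the per-replication MSE over all replications (assumed AMSE)."""
    return sum(mse(est, true) for est, true in replications) / len(replications)


# Two hypothetical replications, each a pair of (estimated, true) efficiency scores.
replications = [
    ([0.9, 0.7, 1.0], [1.0, 0.7, 0.8]),
    ([0.6, 0.8, 0.9], [0.5, 0.8, 1.0]),
]

amse = average_mse(replications)
print(round(amse, 6))
```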

Volume 52 Issue 12 page: 5
