Haishan Yu is currently a Senior Engineer at the National Synchrotron Radiation Laboratory, University of Science and Technology of China (USTC). He received his Ph.D. degree in Physical Chemistry from USTC in 2020. His research mainly focuses on data processing combined with machine learning.
Lei Cui is currently a Postdoctoral Researcher at the National Synchrotron Radiation Laboratory, University of Science and Technology of China (USTC). He received his Ph.D. degree in Physical Chemistry from USTC in 2020. His research mainly focuses on the development of control systems and databases.
DaDi Zhang is currently a Senior Engineer at the National Synchrotron Radiation Laboratory, University of Science and Technology of China (USTC). He received his Ph.D. degree in Physical Chemistry from USTC in 2017. His research mainly focuses on IT integration technology.
Graphical Abstract
Flowchart of the scientific data management system.
Abstract
Hefei Light Source (HLS) is a synchrotron radiation light source that primarily produces vacuum ultraviolet and soft X-rays. It currently consists of ten experimental stations, including a soft X-ray microscopy station. As part of its ongoing efforts to establish a centralized scientific data management platform, HLS is developing a test system that covers the entire lifecycle of scientific data, including data generation, acquisition, processing, analysis, and destruction. However, the instruments at the soft X-ray microscopy experimental station rely on commercial proprietary software for data acquisition and processing. We therefore developed a semi-automatic data acquisition program to integrate the soft X-ray microscopy station into the centralized scientific data management platform. Additionally, we created an online data processing platform to assist users in analyzing their scientific data. The system we developed and deployed meets the design requirements, successfully bringing the soft X-ray microscopy station into the full lifecycle management of scientific data.
Public Summary
We have developed a semi-automatic data acquisition system to incorporate the data from the soft X-ray microscopy station into full-lifecycle scientific data management.
The integrated platform includes six components: the data acquisition (DAQ) program, the storage system, the processing system, the metadata database, the front end, and the back end.
HDF5-standardized workflows, SciCat-managed metadata, and OpenWhisk computing automate scientific data lifecycle management.
Data envelopment analysis (DEA) has long been used to compare and evaluate the efficiency of a set of decision-making units (DMUs) with multiple inputs and outputs[1-3]. DEA is prone to overfitting (a large number of DMUs are evaluated as efficient) when a limited number of DMUs are used to estimate a high-dimensional frontier[4, 5], which has prompted a series of discussions on the selection of DEA indicators. Golany and Roll[6] asserted that the number of DMUs should be at least twice the number of inputs and outputs considered. Boussofiane et al.[7] stated that to effectively discriminate the DEA weights, the minimum number of DMUs should not be less than the product of the number of inputs and the number of outputs. Bowlin[8] and Cooper et al.[9] claimed that the number of DMUs should be at least 3 times the sum of the number of inputs and outputs to give a meaningful estimate. From a practical point of view, these rules of thumb are reasonable for small data but may not work in big data scenarios[4, 5].

The boom in machine learning techniques has given researchers the opportunity to identify good indicators from noisy big data, but inappropriate selection methods may pick low-quality indicators. Specifically, if the number of indicators is large relative to the number of DMUs, a machine learning method is prone to overfitting, because each DMU provides too little information in the high-dimensional variable space to constrain the model[4, 5]. In contrast, when the number of indicators is much smaller than the number of DMUs, the indicators may carry too little information about the DMUs to recover the true pattern, which is the case of underfitting. In information theory, overfitting describes a scenario in which a method goes beyond learning the true regularities in the data by inadvertently capturing noise in the modeling process, while underfitting describes a scenario in which a method fails to capture enough information to recover the true patterns in the data[10].

To make these two scenarios concrete, consider two real-world settings. In the first, the Ministry of Education of the People’s Republic of China plans to use DEA to evaluate the performance of 20 universities in China with a total of 100 indicators, which is a typical overfitting scenario. In the second, Quacquarelli Symonds (QS) intends to use DEA to evaluate the performance of 1000 universities worldwide to produce the QS World University Rankings, but some of the indicators are not publicly available, so QS obtains only 20 indicators for the 1000 DMUs, corresponding to an underfitting scenario. In both scenarios, indicator selection is essential. Indicator selection in overfitting scenarios has attracted the attention of several scholars. Ueda and Hoshiai[11] and Adler and Golany[12] successively used principal component analysis (PCA) to solve the indicator selection problem in DEA, but many of the weights of the indicators defining the principal components in PCA are negative[11], which can lead to counterintuitive integrated negative inputs[5].
Recently, some scholars[4, 5] have pioneered the marriage of the least absolute shrinkage and selection operator (LASSO)[13] with DEA, showing that LASSO can avoid overfitting and excels at identifying good indicators; however, if the relationship between outputs and inputs is nonlinear, then LASSO (a linear model in the original units) will be misspecified[4]. In underfitting scenarios, the relationship between outputs and inputs tends to be nonlinear, so LASSO may fail to select good indicators. To our knowledge, few studies have focused on underfitting scenarios, and researchers have neither proposed classifying the big data scenarios encountered by DEA into overfitting and underfitting scenarios nor attempted to develop an indicator selection system for both.
To fill these research gaps, this study establishes a smart indicator selection mechanism for overfitting and underfitting scenarios as an alternative to conventional methods, thereby helping scholars in the field of DEA select indicators in the context of big data. The feature selection system (comprising regularization, wrapper, filter, and ensemble methods) is used to prevent overfitting and underfitting: regularization and wrapper methods rely on cross-validation[14] and grid search[15], which are designed to overcome overfitting, while ensemble models[16, 17] and kernel functions are designed to fit nonlinear and complex data, thereby overcoming underfitting. Based on these feature selection methods, we propose a mean score methodology that is expected to perform well in both scenarios. We then examine the strengths and limitations of the different methods by applying them, together with the proposed mean score methodology, in different DEA scenarios to establish a smart indicator selection mechanism. To the best of our knowledge, this paper is the first in the field of DEA to propose an indicator selection process for both overfitting and underfitting scenarios. The remainder of this study unfolds as follows: Section 2 briefly reviews the feature selection system, and a mean score methodology based on these methods is proposed in Section 3. Section 4 compares and analyzes the performance of the indicator selection methods via Monte Carlo simulations in both scenarios and proposes a smart mechanism for DEA indicator selection. Finally, Section 5 draws conclusions.
2. Preliminaries: The feature selection system
Over the past few decades, the prosperity of machine learning methods has turned indicator selection into a complete system, which falls into four categories according to how the selection process relates to the underlying prediction task: regularization, filter, ensemble, and wrapper methods[18].
To develop an indicator selection approach that is expected to perform well in both overfitting and underfitting scenarios, we first introduce this machine learning system for indicator selection.
2.1 Regularization methods
Assuming that a DEA scenario with n DMUs (each DMU has p inputs and one output) is equivalent to a standard regression problem with n observations, where each observation has p standardized input variables x_{ij} (i=1,\cdots,n; j=1,\cdots,p) and a dependent variable y_i (i=1,\cdots,n), the objective of DEA is to estimate the following production function:
y_i=\sum\limits_{j=1}^{p}\beta_j x_{ij}+\alpha, \quad i=1,\cdots,n,
(1)
where \alpha is the intercept and \beta_j is the coefficient of input variable x_{ij}.
Regularization methods include LASSO (a linear model trained with the L_1 penalty), ridge regression (linear least squares with the L_2 penalty), and elastic net (linear regression with a combined L_1 and L_2 penalty). Good indicators can be selected by imposing an L_1 or L_2 penalty on the coefficients and solving the following regression problem:
\min\limits_{\alpha,\beta}\sum\limits_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2+\lambda\sum\limits_{j=1}^{p}\left|\beta_j\right|^{r},
(2)
where we interpret y_i and \hat{y}_i as the true and predicted values, respectively, whereas \lambda is the tuning parameter (or penalty price), chosen by cross-validation. Note that r can be 1 or 2, where r=2 corresponds to ridge regression and r=1 results in LASSO. Elastic net (EN) is a linear regression with a combined L_1 and L_2 penalty, which can be written as follows:
\min\limits_{\alpha,\beta}\sum\limits_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2+\lambda_1\sum\limits_{j=1}^{p}\left|\beta_j\right|+\lambda_2\sum\limits_{j=1}^{p}\beta_j^{2}.
(3)
Among the regularization methods, LASSO shrinks the coefficient of certain indicators to zero so that indicators with nonzero coefficients can be selected, and scholars[4, 5] have demonstrated its powerful ability to select indicators in overfitting scenarios. However, LASSO will be misspecified if the relationship between outputs and inputs is nonlinear and thus may not select good indicators in underfitting scenarios.
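To make the mechanics concrete, the following is a minimal sketch of how such regularization-based indicator scores could be computed with scikit-learn's cross-validated estimators; the function name and the alpha grid for ridge are illustrative choices, and the elastic-net L1 ratio of 0.7 anticipates the setting used in Section 3.1.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler

def regularization_scores(X, y):
    """Return |coefficient|-based importance scores for LASSO, ridge, and elastic net.

    X: (n, p) array of inputs, y: (n,) output. The tuning parameter lambda
    (called alpha in scikit-learn) is chosen by cross-validation.
    """
    X = StandardScaler().fit_transform(X)          # standardize inputs as in Eq. (1)
    models = {
        "LASSO": LassoCV(cv=5),
        "Ridge": RidgeCV(alphas=np.logspace(-3, 3, 13)),
        "EN": ElasticNetCV(l1_ratio=0.7, cv=5),    # L1 ratio of 0.7 as in Section 3.1
    }
    scores = {}
    for name, model in models.items():
        model.fit(X, y)
        coef = np.abs(model.coef_)                 # importance = |beta_j|
        rng = coef.max() - coef.min()
        scores[name] = (coef - coef.min()) / rng if rng > 0 else coef
    return scores
```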
2.2 Filter methods
Filter methods (mutual information [19] and Pearson correlation coefficient[20]) rank indicators without prediction models and have the fastest running time; however, they do not consider variable dependencies and evaluate each variable separately[21]. Moreover, they are not always effective for improving the generalization ability of the model, which is why filter methods are often used as an informative indicator selection method, as is done in this study.
Mutual information (MI) measures how much information one random indicator shares with another[22]. High MI means a large reduction in uncertainty, while low MI indicates a small reduction. In this study, the maximal information coefficient (MIC)[23] is used to measure the degree of association between two variables because it has higher accuracy than MI.
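As an illustration, the sketch below computes the two filter statistics with scipy and scikit-learn; the helper name is illustrative, and MIC itself would require an external package (e.g., minepy), which is an assumption about tooling rather than something stated here.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

def filter_scores(X, y):
    """Rank indicators with model-free filter statistics.

    Returns the absolute Pearson correlation and the estimated mutual
    information of each column of X with y. MIC would be computed analogously
    with an external package such as minepy (not shown here).
    """
    pearson = np.array([abs(pearsonr(X[:, j], y)[0]) for j in range(X.shape[1])])
    mi = mutual_info_regression(X, y, random_state=0)
    return {"Pearson": pearson, "MI": mi}
```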
2.3 Ensemble methods
Gradient boosting regression (GBR) and random forest (RF) are two powerful ensemble methods that reduce the number of indicators through predictive models[18] and are readily available in popular programming software.
Random forest (RF) builds forests of randomized decision trees and handles high-dimensional data well, which is why it is suitable for underfitting scenarios. In RF, the depth at which an indicator is used as a decision node can be employed to measure its importance: this depth ultimately determines the share of samples in the input dataset whose predictions depend on the given indicator, so indicators used near the top of the trees matter for more samples. Indicator selection can then be performed reliably once the variance of these estimates is sufficiently reduced by averaging over the forest.
Gradient boosting (GB) is capable of discovering complex relationships between the indicators and the dependent variable, which suits underfitting scenarios. The GB process scales up model complexity with each iteration: at each stage, the error of the current model (i.e., the gradient of the loss function) is computed, and in the next stage a new weak model is fitted to make up for the shortcomings of the previous one[17].
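A minimal sketch of how the two ensemble importance scores could be obtained with scikit-learn follows; the estimator hyperparameters are illustrative defaults rather than the settings used in the experiments.

```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

def ensemble_scores(X, y):
    """Importance scores from the two ensemble methods (RF and GBR).

    Both estimators expose feature_importances_, i.e., the normalized total
    reduction in variance contributed by splits on each indicator.
    """
    rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
    gbr = GradientBoostingRegressor(n_estimators=500, random_state=0).fit(X, y)
    return {"RF": rf.feature_importances_, "GBR": gbr.feature_importances_}
```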
2.4 Wrapper methods
Wrapper methods[24] evaluate the performance of variable subsets on preselected predictors and offer better generalization and robust interaction between indicator selection and the underlying model[21]. Recursive feature elimination (RFE) is a greedy algorithm for finding the optimal subset of indicators: it investigates and determines the optimal variable subset by repeatedly removing irrelevant indicators, rebuilding the chosen model (a regression or machine learning model) at each round. In this study, random forest-based recursive feature elimination (RF-RFE)[16] and support vector regression-based recursive feature elimination (SVR-RFE)[25] are used. Each time the chosen model is built, the best or worst performing indicator, judged by the model coefficients, is selected and set aside; the remaining indicators are then used to repeat the same process, which stops when no indicators remain. Finally, all indicators are sorted according to the order in which they were eliminated, yielding a subset of the best performing indicators. However, the stability of RFE depends on the model used at the bottom of the iterative process: if a model without regularization is used, RFE is unstable.
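The sketch below illustrates RF-RFE and SVR-RFE with scikit-learn's RFE wrapper; the linear kernel for SVR is assumed because RFE needs coefficients or importances to rank indicators, and the conversion of the elimination order into a [0, 1] score is an illustrative choice.

```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

def rfe_ranking(X, y):
    """Rank indicators by recursive feature elimination (RF-RFE and SVR-RFE).

    RFE repeatedly fits the underlying model and discards the least important
    indicator until none remain; ranking_ records the elimination order
    (1 = kept longest, i.e., the best performing indicator).
    """
    rankings = {}
    for name, est in [("RF-RFE", RandomForestRegressor(n_estimators=200, random_state=0)),
                      ("SVR-RFE", SVR(kernel="linear"))]:
        rfe = RFE(estimator=est, n_features_to_select=1, step=1).fit(X, y)
        # Convert the elimination order into a [0, 1] importance score.
        rankings[name] = 1.0 - (rfe.ranking_ - 1) / (X.shape[1] - 1)
    return rankings
```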
3. Methodology
3.1 A mean score methodology
To build on the strengths and avoid the weaknesses of the above machine learning methods, a mean score methodology (Fig. 1) is proposed in this section. It consists of four types of methods:
Figure 1. Flow chart for the mean score methodology.
Type Ⅰ. Linear methods and regularization: ① Linear regression (Linear); ② LASSO; ③ Ridge regression (Ridge); ④ EN (the ratio of L1 penalty is 0.7).
Type Ⅱ. Filter methods: ⑤ Pearson coefficient (Pearson); ⑥ MIC.
Type Ⅲ. Wrapper methods: ⑦ RF-RFE; ⑧ SVR-RFE.
Type Ⅳ. Ensemble methods: ⑨ RF; ⑩ GBR.
The process of the mean score methodology is as follows:
Step 1: In each run, the score for each indicator is obtained based on the indicator importance (coefficient) under each method. In line with Chen et al.[4], the tuning parameters are chosen automatically for each method (using cross-validation);
Step 2: Normalize the indicator score under each method so that each indicator score is between 0 (lowest-ranked indicator) and 1 (top-ranked indicator);
Step 3: The scores obtained from all methods are averaged to obtain the mean importance score of an indicator, and the indicators are ranked according to the scores.
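Putting Steps 1–3 together, a minimal sketch is given below; it assumes that each method has already produced a raw importance score per indicator (e.g., as in the sketches of Section 2), and the function name is illustrative.

```python
import numpy as np

def mean_score(score_dict):
    """Combine per-method indicator scores into the mean importance score.

    score_dict maps a method name to a length-p array of raw importances.
    Step 2: min-max normalize each method's scores to [0, 1].
    Step 3: average the normalized scores and rank indicators by the mean.
    """
    normalized = []
    for raw in score_dict.values():
        raw = np.asarray(raw, dtype=float)
        rng = raw.max() - raw.min()
        normalized.append((raw - raw.min()) / rng if rng > 0 else np.zeros_like(raw))
    mean = np.mean(normalized, axis=0)
    ranking = np.argsort(-mean)        # indicator indices, best first
    return mean, ranking
```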
3.2 A three-step approach for selecting indicators in the DEA context
To apply the proposed mean score methodology and the above methods to DEA indicator selection, we propose a three-step approach:
① Obtain indicator importance score: Obtain the indicator importance score under each method by running the program.
② Remove indicators: After obtaining the importance score under each method, input indicators are removed on the basis of these scores; since the scores reflect each indicator’s association with the output, indicators with scores below 0.2 are considered insignificant and are eliminated.
③ Run DEA: Based on the indicators selected in the previous step, the efficiency of each DMU is estimated by the following output-oriented variable return-to-scale DEA estimator.
\min\limits_{\theta,\lambda}\;\theta\quad{\rm{s.t.}}\;\sum\limits_{k\in K}\lambda_k x_{ki}\leqslant x_{oi},\;\forall i;\quad\sum\limits_{k\in K}\lambda_k y_{kj}\geqslant y_{oj}/\theta,\;\forall j;\quad\sum\limits_{k\in K}\lambda_k=1;\quad\lambda_k\geqslant 0,\;\forall k\in K,
(4)
where index o\in K indicates a specific DMU (an alias of index k), and the decision variables \theta and \lambda_k are the efficiency estimator and the intensity multipliers for a linear combination of DMUs, respectively[5].
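For the single-output case considered in the simulations below, Eq. (4) can be solved for each DMU as a linear program after substituting \eta = 1/\theta, which makes the output constraint linear. The following is a minimal sketch using scipy.optimize.linprog; the function name and interface are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def vrs_output_efficiency(X, y, o):
    """Output-oriented VRS efficiency of DMU o (Eq. (4)), single output.

    X: (n, p) inputs, y: (n,) outputs. We maximize eta = 1/theta subject to
    sum_k lambda_k x_ki <= x_oi, sum_k lambda_k y_k >= eta * y_o,
    sum_k lambda_k = 1, lambda_k >= 0, and report theta = 1/eta.
    """
    n, p = X.shape
    # Decision variables: [eta, lambda_1, ..., lambda_n]; minimize -eta.
    c = np.r_[-1.0, np.zeros(n)]
    # Input constraints: sum_k lambda_k x_ki <= x_oi for each input i.
    A_inputs = np.c_[np.zeros((p, 1)), X.T]
    # Output constraint: eta * y_o - sum_k lambda_k y_k <= 0.
    A_output = np.r_[y[o], -y].reshape(1, -1)
    A_ub = np.vstack([A_inputs, A_output])
    b_ub = np.r_[X[o, :], 0.0]
    # Convexity constraint (variable returns to scale): sum_k lambda_k = 1.
    A_eq = np.r_[0.0, np.ones(n)].reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1), method="highs")
    return 1.0 / res.x[0]      # theta in (0, 1]; theta = 1 means efficient
```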
4. Monte Carlo results and discussion
The purpose of this section is first to see if the above methods in Section 3 can obtain accurate indicator importance scores to help identify truly relevant indicators and discard irrelevant ones, and then to illustrate the performance of these methods by the proposed three-step approach in different DEA scenarios.
4.1 Indicator identification ability
In this section, we compare the performance of all methods in Section 3 in overfitting and underfitting scenarios to explore their indicator identification capabilities.
4.1.1 Data generating process
Real-world data are complex, with not only linear but also nonlinear relationships between indicators and the output. Therefore, instead of using the DGP of Chen et al.[4], we consider complex scenarios in which both linear and nonlinear relationships exist between the indicators (inputs) and the dependent variable, and some interfering variables associated with the truly relevant indicators are also included. The DGP used in this section is based on the Friedman regression datasets[26, 27], which generate the true output y^{\rm T} and the observed output y. The first 5 indicators x_1,\cdots,x_5 have a real effect on the production function; the next 4 indicators are related to the first 4 true indicators x_1,\cdots,x_4 and are generated by f(x)=x+N(0,\rho) with \rho=0.025; the remaining indicators are independent of the dependent variable. The 4 correlated indicators were added by us, considering that some indicators may be interrelated in real life (the deviation \varepsilon\sim N(0,1)).
Assuming that the researcher does not know which indicators are truly correlated with the dependent variable, we use this DGP to generate 30 datasets (replications) in each of the following scenarios:
① Overfitting scenario Ⅰ (overfitting scenario in a small dataset): The dataset includes 100 indicators (5 true inputs, 4 correlated inputs, and 91 noisy inputs) and one output, but only 30 DMUs. In this scenario, machine learning methods tend to overfit (p=100,\,n=30; hereafter, n denotes the total number of DMUs, and p means the total number of indicators).
② Overfitting scenario Ⅱ (overfitting scenario in a large dataset): There are 100 DMUs in total. The large dataset contains 1000 indicators (5 true inputs, 4 correlated inputs, and 991 noisy inputs) and one output (p=1000,\,n=100).
③ Underfitting scenario Ⅰ (underfitting scenario in a small dataset): The dataset contains 5 true inputs, 4 correlated inputs, 11 noisy inputs, and 1 output (i.e., 20 potential inputs with no clue about the correct model specification). In this scenario, machine learning methods tend to underfit because the number of indicators is only 20, which is too small relative to the 100 DMUs for the indicators to carry enough information (p=20, n=100).
④ Underfitting scenario Ⅱ (underfitting scenario in a large dataset): 5 true inputs, 4 correlated inputs, 91 noisy inputs, and one output, with 1000 DMUs (p=100, n=1000).
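For concreteness, the sketch below generates one replication consistent with this description; it assumes the standard Friedman #1 regression function for the five true inputs (via scikit-learn's make_friedman1), which is an assumption about the exact functional form, and the helper name is illustrative.

```python
import numpy as np
from sklearn.datasets import make_friedman1

def generate_dataset(n_dmus, n_total_inputs, rho=0.025, seed=0):
    """Generate one replication as described in Section 4.1.1 (sketch).

    Five true inputs enter the (assumed) Friedman #1 function, four inputs
    correlated with x1..x4 are appended via f(x) = x + N(0, rho), and the
    remaining columns are independent noise inputs. Returns (X, y_true, y).
    """
    rng = np.random.default_rng(seed)
    n_noise = n_total_inputs - 9                     # 5 true + 4 correlated inputs
    X_true, y_true = make_friedman1(n_samples=n_dmus, n_features=5,
                                    noise=0.0, random_state=seed)
    X_corr = X_true[:, :4] + rng.normal(0.0, rho, size=(n_dmus, 4))
    X_noise = rng.uniform(0.0, 1.0, size=(n_dmus, n_noise))
    X = np.hstack([X_true, X_corr, X_noise])
    y = y_true + rng.normal(0.0, 1.0, size=n_dmus)   # observed output with eps ~ N(0, 1)
    return X, y_true, y

# Example: overfitting scenario I (p = 100, n = 30).
X, y_true, y = generate_dataset(n_dmus=30, n_total_inputs=100)
```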
4.1.2 Comparison and analysis
First, it is necessary to explain our criteria for classifying large and small datasets: if n \ge 1000 or p \ge 1000, then the dataset is considered a large dataset; otherwise, it is a small dataset.
Next, we will describe our process for performing Monte Carlo simulations:
① Our program uses cross-validation to automatically choose the tuning parameters for each method.
② Each method will give every input indicator an importance score, which indicates the degree of association between this indicator and the dependent variable.
③ The final score will be normalized to the interval \left[\mathrm{0,1}\right] for comparison purposes.
④ Mean squared error (MSE) and mean absolute error (MAE) are used to analyze and compare the indicator identification ability of the above methods. For replication m, they are defined over the p indicators as
{\rm MSE}_m=\dfrac{1}{p}\sum\limits_{i=1}^{p}\left(\theta_{im}^{\rm T}-\theta_{im}^{*}\right)^2,
(7)
{\rm MAE}_m=\dfrac{1}{p}\sum\limits_{i=1}^{p}\left|\theta_{im}^{\rm T}-\theta_{im}^{*}\right|,
(8)
where \theta_{im}^{\rm T} is the expected indicator importance score (theoretically, the expected importance score of a true indicator should be 1, while that of the others should be shrunk to 0), and \theta_{im}^{*} is the indicator importance score estimated by an indicator selection method. After M Monte Carlo experiments, the average MSE (AMSE) and the average MAE (AMAE) are as follows:
{\rm AMSE}=\dfrac{1}{M}\sum\limits_{m=1}^{M}{\rm MSE}_m,
(9)
{\rm AMAE}=\dfrac{1}{M}\sum\limits_{m=1}^{M}{\rm MAE}_m.
(10)
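A minimal numpy sketch of these error measures, with the M replications stacked along the first axis, is as follows (the function name and array layout are illustrative).

```python
import numpy as np

def identification_errors(theta_true, theta_est):
    """MSE and MAE between expected and estimated indicator importance scores.

    theta_true, theta_est: (M, p) arrays over M Monte Carlo replications and
    p indicators. Returns the per-replication MSE/MAE and their averages
    (AMSE, AMAE) over the M replications.
    """
    diff = np.asarray(theta_est) - np.asarray(theta_true)
    mse = np.mean(diff ** 2, axis=1)        # one value per replication
    mae = np.mean(np.abs(diff), axis=1)
    return mse, mae, mse.mean(), mae.mean()
```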
To facilitate comparison, underfitting and overfitting scenarios are discussed separately.
First, the above methods for identifying the five true inputs in underfitting scenarios (both large and small datasets) are considered. Fig. 2 shows the box plots for the MSEs in 30 datasets, from which we can see that the difference in performance for these methods is fairly large:
Figure 2. Comparison of indicator identification ability of methods in underfitting scenarios.
① For both large and small datasets, GBR is the best among all methods, as the MSE under GBR is much smaller than that under other methods, especially when we focus on the median (the red line in Fig. 2); the second-best method is the proposed mean score methodology.
② SVR-RFE and RF-RFE have the worst performance, which shows that RFE is not suitable for either large or small datasets in underfitting scenarios. Following them, the MSE of LASSO is larger than that of others, indicating LASSO’s inability to identify true input indicators in underfitting scenarios. In addition, LASSO has the widest data distribution, meaning that LASSO is rather unstable and may select the wrong indicators instead of the true ones in this scenario.
③ Among all the regularization methods, for both large and small datasets, LASSO is the worst performing method, whereas EN and Ridge perform quite well and are more stable. The reason is that there are correlations between indicators in the DGP (known as multicollinearity in statistics, which LASSO is not capable of handling), whereas EN and Ridge both use L_2 regularization, which is more stable than L_1 regularization. Therefore, LASSO is not recommended in underfitting scenarios.
Second, the indicator identification ability of each method in overfitting scenarios is compared. Fig. 3 reports the results and shows that:
Figure 3. Comparison of indicator identification ability of methods in overfitting scenarios.
① The worst performances still belong to SVR-RFE and RF-RFE for both large and small datasets.
② In overfitting scenarios, LASSO becomes the top method, whether the dataset is large or small. Close behind LASSO, the ensemble methods (GBR and RF) and the mean score methodology perform well.
③ For linear regression and regularization methods, the performances of EN, Ridge, and linear regression are good and stable but not as outstanding as LASSO in these two overfitting scenarios. Hence, we suggest using LASSO in preference in overfitting scenarios.
4.1.3 Insights
The following insights are derived based on the above simulation experiments:
① In overfitting scenarios, linear methods (LASSO, Ridge, EN, and linear regression) are preferred. The finding that LASSO outperforms the other machine learning methods provides experimental evidence for the previous literature[4, 5] on the application of LASSO to DEA.
② In underfitting scenarios, LASSO fails to identify good indicators, while ensemble methods (GBR and RF) perform well, which supports the claim in the introduction that LASSO is not suitable for the underfitting scenarios.
③ The proposed mean score methodology can accurately identify the true indicators in both scenarios, which is ideal for “black-box” environments where researchers do not distinguish between underfitting or overfitting scenarios.
4.2 Indicator selection performance in DEA scenarios
The purpose of this section is to compare the indicator selection performance of the four types of methods in Section 3.1 in different scenarios by comparing the true DEA efficiency scores (without noise) with the values estimated after indicator selection through the three-step approach proposed in Section 3.2.
4.2.1 Data generating process
The DGP of Lee and Cai[5] is used to generate the production function with 100 inputs and 1 output.
y_k^{\rm T}=\prod\limits_{i\in I}x_{ki}^{1/\left(\left|I\right|+1\right)},\quad\forall k=1,\cdots,n,
where x_{ki} represents input i of DMU k, and y_k denotes the output. The inputs are generated from a uniform distribution on the interval (10, 20), and the inefficiency term \mu_k is half-normal with a mean of 0 and a variance of 0.7. Here, y_k^{\rm T} is the output on the true frontier, and the observed output y_k is the output affected by inefficiency.
Three sample sizes with 30 replications are considered in our Monte Carlo simulations:
① Overfitting scenario: the DGP has 30 DMUs ( n=30 ), and each DMU contains 101 dimensions ( p=100 , 100 inputs and 1 output);
② Underfitting scenario in a large dataset: this DGP contains 1000 DMUs ( n=1000 ), each with 101 dimensions ( p=100 , 100 inputs and 1 output).
In addition, to make the experimental results more convincing, we add another set of data to describe underfitting scenarios:
③ Underfitting scenario in a small dataset: 100 DMUs ( n=100 ), each with 21 dimensions ( p=20 , 20 inputs and 1 output).
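A sketch of this DGP is given below; the multiplicative form y_k = y_k^T exp(−μ_k) used to attach the half-normal inefficiency term is an assumption, since the text states only that the observed output is affected by inefficiency, and the helper name is illustrative.

```python
import numpy as np

def generate_dea_data(n_dmus, n_inputs, seed=0):
    """Cobb-Douglas-type DGP for the DEA experiments (sketch).

    Inputs are drawn from U(10, 20); the true frontier output is the product
    of the inputs, each raised to the power 1/(|I| + 1); the half-normal
    inefficiency mu_k (variance 0.7) is assumed to enter multiplicatively as
    y_k = y_k_true * exp(-mu_k).
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(10.0, 20.0, size=(n_dmus, n_inputs))
    y_true = np.prod(X ** (1.0 / (n_inputs + 1)), axis=1)
    mu = np.abs(rng.normal(0.0, np.sqrt(0.7), size=n_dmus))   # half-normal inefficiency
    y = y_true * np.exp(-mu)
    return X, y_true, y

# Example: overfitting scenario (n = 30 DMUs, p = 100 inputs).
X, y_true, y = generate_dea_data(n_dmus=30, n_inputs=100)
```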
4.2.2 Comparison and analysis
After running the proposed three-step approach, MSE, MAE, AMSE, and AMAE between each true frontier and estimated frontier under each method can be obtained by Eqs. (7)–(10). Note that {\theta }_{im}^{\rm T} and {\theta }_{im}^{*} indicate the true DEA efficiency score and the estimated score after indicator selection, respectively. Table 1 and Fig. 4 report the results.
Table 1. Comparison of methods in overfitting and underfitting scenarios.
First, the change in the AMSE of each method from the overfitting scenario to the underfitting scenarios in Table 1 is analyzed. Theoretically, if we fix p=100, then when n changes from 30 (an overfitting scenario) to 1000 (an underfitting scenario), the linear model-based methods tend to fail to select good indicators, resulting in a large deviation between the estimated and true efficiency values, as evidenced in Table 1. From Table 1, linear regression, the regularization methods (LASSO, Ridge, and EN), and Pearson show increases in AMSE, with the most pronounced rises for LASSO and EN (see Table 1 and Fig. 4a; LASSO increases from 0.127 to 1.000 and EN rises from 0.120 to 0.882). The main reason may be that LASSO and EN are suited to fitting linear rather than nonlinear data (data relationships tend to be linear in overfitting scenarios but nonlinear in underfitting scenarios). MIC, SVR-RFE, RF-RFE, RF, and GBR, which can fit complex data and identify nonlinear relationships, are then considered in underfitting scenarios, and the results in Table 1 illustrate the superiority of these methods: their AMSE decreases from the overfitting scenario (p=100, n=30) to the underfitting scenario (p=100, n=1000), and our mean score methodology also performs well (its AMSE decreases from 0.105 to 0.103).
The underfitting scenario with a small dataset ( p=20,n=100 ) is considered for comparison with the overfitting scenario ( p=100,n=30 ), and Table 1 and Fig. 4b demonstrate the results. From the overfitting scenario ( p=100,n=30 ) to this underfitting scenario (p=20, n=100), LASSO and EN show significant increases in AMSE, while both RF and GBR show decreases. These findings are consistent with the above analysis when the overfitting scenario ( p=100,n=30 ) is transferred to the underfitting scenario in the large dataset ( p=100,n=1000 ). A different conclusion is that the AMSE increases significantly for SVR-RFE and RF-RFE, indicating that the RFE-based approach is unstable. The reason is that RFE is dependent on its underlying method, and if the underlying method is unstable, RFE will hardly perform well.
In Fig. 4, except for LASSO and EN, there is no significant difference in the performance of most methods between underfitting and overfitting scenarios, meaning that in underfitting scenarios, LASSO and EN (Type Ⅰ) tend to fail in indicator selection, while MIC (Type Ⅱ), RF (Type Ⅳ), GBR (Type Ⅳ), and the mean score methodology are all good performers.
Then, the performance of each method is analyzed by scenario. In the overfitting scenario of Table 1, MIC, the RFE-based methods (RF-RFE and SVR-RFE), and the mean score methodology are all excellent performers in terms of AMSE, which is the same conclusion as in Section 4.1. Not in line with the findings in Section 4.1, the performance of LASSO + DEA is moderate in the overfitting scenario and is not superior to the other machine learning methods. By observing LASSO’s scoring of individual indicators, we find that the reason for LASSO’s mediocre performance is that it gives most indicators a score of 0 (LASSO may be affected by the results of automatic parameter tuning), meaning that it removes most indicators, even those of high importance. An analysis based on AMSE alone may be biased, as the mean tends to be affected by outliers, so the results of the 30 replications are plotted in Fig. 5. From Fig. 5, results similar to those in Table 1 can be obtained: the two RFE-based methods, MIC, and the mean score methodology perform well, and the performance of Ridge is comparable to that of linear regression; the remaining methods perform poorly, with RF and GBR the worst performers, which implies that we should not readily use RF or GBR (Type Ⅳ) for indicator selection in overfitting scenarios.
Figure 5. MSE of methods in the overfitting scenario with 30 replications.
Next, the underfitting scenario with a large dataset (p=100,\,n=1000) is analyzed. From Table 1, LASSO and EN are the two worst-performing methods, while there is little difference in the performance of the other good-performing methods (see Fig. 4a). The results with 30 replications are plotted in Fig. 6a. From Fig. 6a, the performance of LASSO and EN is still the worst. To compare the performance of the well-performing methods, we remove LASSO, EN, and the two RFE-based methods (as LASSO and EN have too large MSEs, and SVR-RFE and RF-RFE are not stable) to obtain Fig. 6b. If we use linear regression as the benchmark, then MIC performs the best, followed by the mean score methodology and Ridge, and finally GBR.
Figure 6. MSE of methods in underfitting scenario I with 30 replications.
The results of the 30 replications in the underfitting scenario with a small dataset (p=20, n=100) are plotted in Fig. 7a, which also shows that LASSO and EN are not suitable for underfitting scenarios. Similarly, we remove LASSO, EN, and the RFE-based methods to obtain Fig. 7b, from which we can draw the same conclusions as in the underfitting scenario with a large dataset: MIC performs the best, followed by the mean score methodology and Ridge, and then GBR.
Figure 7. MSE of methods in underfitting scenario II with 30 replications.
In overfitting scenarios, most indicator selection methods tend to yield good results, except for ensemble methods (GBR and RF) that are prone to overfitting. Considering the nature of LASSO to produce sparse solutions and its overwhelming advantage in indicator identification, LASSO is the most recommended method in overfitting scenarios, as demonstrated in previous studies[4, 5].
In underfitting scenarios, MIC is the best choice in the univariate case (MIC identifies both nonlinear and linear relationships and requires no tuning parameters, so its computational burden is small). In multivariate scenarios, where MIC is no longer applicable, the ensemble method (GBR) is recommended.
The proposed mean score methodology works well in both scenarios, suggesting that it may be a good choice when indicator selection is a black-box process (i.e., when it is difficult to distinguish between overfitting and underfitting scenarios).
4.3 A smart indicator selection mechanism for DEA
Given the different dimensional relationships between DMUs and indicators (different DEA scenarios), indicator selection methods may encounter overfitting or underfitting. To avoid complex and time-consuming indicator engineering, it is urgent to build a process that is superior in different scenarios. Therefore, we propose a smart indicator selection mechanism to address the problem (Fig. 8).
Figure 8. A smart indicator selection mechanism for DEA.
Step 2: Identify scenarios based on the relationship between p and n .
Step 3: Select indicators with the corresponding approach. LASSO or EN is recommended in overfitting scenarios, ensemble methods are suggested in underfitting scenarios, and the mean score methodology is used in the black-box scenarios.
Step 4: Run DEA with the indicators selected in Step 3.
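Read as pseudocode, Steps 2 and 3 amount to the following dispatch; the p >= n cut-off used here to flag an overfitting scenario is an illustrative choice, since the mechanism describes the decision qualitatively in terms of the relationship between p and n.

```python
def recommend_method(p, n, scenario_known=True):
    """Steps 2 and 3 of the smart indicator selection mechanism (sketch).

    p: number of candidate indicators, n: number of DMUs. If the scenario
    cannot be identified ("black box"), fall back to the mean score methodology.
    """
    if not scenario_known:
        return "mean score methodology"
    if p >= n:                      # many indicators relative to DMUs (assumed cut-off)
        return "LASSO (or elastic net)"
    return "ensemble methods (GBR/RF); MIC in the univariate case"
```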
5. Conclusions
The purpose of this study is to categorize big data scenarios encountered by DEA into overfitting and underfitting scenarios and to develop a smart indicator selection mechanism to facilitate the process of indicator selection. The main contributions are as follows:
This study is the first to classify the DEA scenarios into overfitting and underfitting scenarios and to propose a mean score methodology based on machine learning methods, advancing the development of DEA in the context of big data and its integration with machine learning.
The results of Monte Carlo experiments show that in overfitting scenarios, LASSO is superior to other methods, providing evidence for previous literature on the application of LASSO to DEA. In underfitting scenarios, however, LASSO fails, while the ensemble methods perform well, illustrating the importance of distinguishing between the underfitting and overfitting scenarios. Notably, our mean score methodology outperforms other methods in both scenarios and can serve as a good alternative in black-box environments.
However, our study has limitations because only single output scenarios are considered. Future research should consider extending our work to DEA contexts with multiple desirable outputs and undesirable outputs.
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities (WK2310000102).
Conflict of interest
The authors declare that they have no conflict of interest.
Figure 1. Architecture diagram of the HLS scientific data management platform.
Figure 2. The experimental procedure of the soft X-ray microscopy experimental station.
Figure 3. Data acquisition and retrieval procedure.
Figure 4. The data preprocessing procedure of OpenWhisk.
References
[1] Black A P, Sorrentino A, Fauth F, et al. Synchrotron radiation based operando characterization of battery materials. Chemical Science, 2023, 14: 1641–1665. DOI: 10.1039/D2SC04397A
[2] Chen Y L, Wang Y C, Zhang Y, et al. Elastic-beam triboelectric nanogenerator for high-performance multifunctional applications: Sensitive scale, acceleration/force/vibration sensor, and intelligent keyboard. Advanced Energy Materials, 2018, 8: 1802159. DOI: 10.1002/aenm.201802159
[3] Cao L N, Liu W, Luo Q Q, et al. Atomically dispersed iron hydroxide anchored on Pt for preferential oxidation of CO in H2. Nature, 2019, 565: 631–635. DOI: 10.1038/s41586-018-0869-5
[4] Su H, Zhou W L, Zhou W, et al. In-situ spectroscopic observation of dynamic-coupling oxygen on atomically dispersed iridium electrocatalyst for acidic water oxidation. Nature Communications, 2021, 12: 6118. DOI: 10.1038/s41467-021-26416-3
[5] Tan H, Tang B, Lu Y, et al. Engineering a local acid-like environment in alkaline medium for efficient hydrogen evolution reaction. Nature Communications, 2022, 13: 2024. DOI: 10.1038/s41467-022-29710-w
[6] Liu J, Li F, Chen L, et al. Quantitative imaging of Candida utilis and its organelles by soft X-ray Nano-CT. Journal of Microscopy, 2018, 270 (1): 64–70. DOI: 10.1111/jmi.12650
[7] Loconte V, Chen J H, Vanslembrouck B, et al. Soft X-ray tomograms provide a structural basis for whole-cell modeling. The FASEB Journal, 2023, 37: e22681. DOI: 10.1096/fj.202200253R
[8] Bai H, Guan Y, Liu J, et al. Precise correlative method of Cryo-SXT and Cryo-FM for organelle identification. Journal of Synchrotron Radiation, 2020, 27: 176–184. DOI: 10.1107/S1600577519015194
[9] Peng M W, Guan Y, Liu J H, et al. Quantitative three-dimensional nondestructive imaging of whole anaerobic ammonium-oxidizing bacteria. Journal of Synchrotron Radiation, 2020, 27: 753–761. DOI: 10.1107/S1600577520002349
Tian H L, Zhang J R, Yan L L, et al. Distributed data processing and analysis environment for neutron scattering experiments at CSNS. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 2016, 834: 24–29. DOI: 10.1016/j.nima.2016.07.043
[12] The ICAT Collaboration. The ICAT project. 2021. https://icatproject.org/. Accessed August 10, 2023.
Sjöblom P, Amjad A, Bell P J, et al. Stable operation of the MAX IV Laboratory synchrotron facility. In: Proceedings of the 17th International Conference on Accelerator and Large Experimental Physics Control Systems. New York: JACoW Publishing, 2020: 530−535.
[15] Wang C P, Yu F, Liu Y Y, et al. Deploying the Big Data Science Center at the Shanghai Synchrotron Radiation Facility: the first superfacility platform in China. Machine Learning: Science and Technology, 2021, 2: 035003. DOI: 10.1088/2632-2153/abe193
Lin G, Zhang D D, Zhang Z, et al. Development of experimental data management system for HLS-II based on SciCat. Nuclear Techniques, 2022, 45 (2): 020101. DOI: 10.11889/j.0253-3219.2022.hjs.45.020101
[18] Habermann T, Folk M J. The Hierarchical Data Format (HDF): A foundation for sustainable data and software. In: Proceedings of the AGU Fall Meeting Abstracts. San Francisco, CA, USA: American Geophysical Union, 2014: IN21D-07.
[19] Willman J. Overview of PyQt5. In: Modern PyQt. Berkeley, CA, USA: Apress, 2021: 1−42.
[20] Chinese Academy of Sciences. Chinese Academy of Sciences Large Research Infrastructures User Service Platform. Beijing: Chinese Academy of Sciences, 2015. https://lssf.cas.cn/. Accessed August 15, 2023.
[21] Perkel J M. Why Jupyter is data scientists’ computational notebook of choice. Nature, 2018, 563: 145–146. DOI: 10.1038/d41586-018-07196-1