Qi Yin is currently a graduate student at the University of Science and Technology of China. His research interests focus on variable selection.
Lei Shu is currently a Ph.D. student at the University of Science and Technology of China. His research interests focus on high-dimensional statistical inference, including variable selection, change point detection, and factor analysis.
Graphical Abstract
An algorithm that adds an exchange step to improve naive sparse linear discriminant analysis.
Abstract
We consider the problem of interpretable classification in a high-dimensional setting, where the number of features is extremely large and the number of observations is limited. This setting has been extensively studied in the chemometric literature and has recently become pervasive in the biological and medical literature. Linear discriminant analysis (LDA) is a canonical approach for solving this problem. However, in the case of high dimensions, LDA is unsuitable for two reasons. First, the standard estimate of the within-class covariance matrix is singular; therefore, the usual discriminant rule cannot be applied. Second, when p is large, it is difficult to interpret the classification rules obtained from LDA because p features are involved. In this setting, motivated by the success of the primal-dual active set algorithm for best subset selection, we propose a method for sparse linear discriminant analysis via ℓ0 constraint, which imposes a sparsity criterion when performing linear discriminant analysis, allowing classification and feature selection to be performed simultaneously. Numerical results on synthetic and real data suggest that our method obtains competitive results compared with existing alternative methods.
Public Summary
This paper extends LDA to a high-dimensional setting by adding an ℓ0 penalty to produce a discriminant vector involving only a subset of features.
We propose the SLDA-BESS algorithm to directly solve the ℓ0-constrained estimation problem of sparse LDA, avoiding the unnecessary information loss caused by relaxing the constraint in conventional algorithms.
Compared with other estimation methods, the SLDA-BESS algorithm is derived from a natural criterion and is superior in terms of sparsity recovery as well as computational efficiency.
1.
Introduction
Linear discriminant analysis (LDA) is a prevalent supervised classification tool in many applications, owing to its simplicity, robustness, and predictive accuracy. LDA uses label information to learn discriminant projections that maximize the between-class distance while decreasing the within-class distance, thereby improving classification accuracy. At the same time, low-dimensional projections of the data onto the most discriminant directions are valuable for data interpretation. LDA classifiers can be constructed in three different ways: the multivariate Gaussian model, the optimal scoring problem, and Fisher’s discriminant problem (see, for example, Hastie et al.[1]).
LDA is effective and asymptotically optimal when the dimension p is fixed and the number of observations n is large; that is, its excess misclassification rate over the optimal rule converges to 0 as n increases to infinity. Shao et al.[2] showed that LDA remains asymptotically optimal when p diverges to infinity at a rate slower than √n. However, with tremendous advances in data collection, high-dimensional data whose dimension p is potentially larger than the number of observations n are now frequently encountered in a wide range of applications, and the classification of such data has recently attracted considerable attention. Common applications include genomics, functional magnetic resonance imaging, risk management, and web search.
In high-dimensional settings, standard LDA performs poorly and may even fail completely. For example, Bickel and Levina[3] showed that LDA can perform no better than random guessing when p>n. Moreover, in this case the sample covariance matrix is singular and its inverse is not well defined. Consequently, it is challenging to select and extract the most discriminative features for supervised classification. A natural remedy is to replace the inverse with the generalized inverse of the sample covariance matrix. However, such an estimate is highly biased and non-robust, and can contribute to the poor performance of the classifier. When p is large, the resulting classifier is also difficult to interpret because the classification rule involves a linear combination of all p features. Thus, when p≫n, one may desire a classifier with parsimonious features, that is, one that involves only a subset of the p features. Such a sparse classifier is easier to interpret and can reduce overfitting of the training data.
Recently, several studies have extended LDA to high-dimensional settings. Some of this literature addresses non-sparse classifiers. For example, within the multivariate Gaussian model of LDA, Dudoit et al.[4] and Bickel and Levina[3] assumed independence of the features (naive Bayes), and Friedman[5] suggested applying a ridge penalty to the within-class covariance matrix. Xu et al.[6] considered other positive definite estimates of the within-class covariance matrix. In addition, some research on sparse classifiers has been conducted. Tibshirani et al.[7] adapted the naive Bayes classifier by soft thresholding the mean vectors, and Guo et al.[8] combined a ridge-type penalty on the within-class covariance matrix with a soft thresholding operation. Witten and Tibshirani[9] imposed an ℓ1 penalty on Fisher’s discriminant problem to obtain sparse discriminant vectors, although this approach does not generalize to the Gaussian mixture setting and lacks the simplicity of regression-based optimal scoring formulations.
Motivated by the Fisher’s discriminant framework of Witten and Tibshirani[9], we develop a sparse version of LDA with an ℓ0 constraint. Previous sparse LDA methods were typically implemented by imposing an ℓ1 penalty, an ℓ2 penalty, or a mixture of the two, because ℓ0 regularization is non-convex and NP-hard. Which formulation is preferable can be judged in terms of prediction accuracy, false discovery rate, and the interpretability of the resulting sparsity; ℓ0-penalized methods enjoy lower error bounds than ℓ1 methods. Mathematically, for a pre-specified degree of sparsity s, the discriminant vector can be determined by the following optimization problem:
\[
\underset{\beta}{\operatorname{maximize}}\ \beta^\top \Sigma_b \beta \quad \text{subject to}\quad \beta^\top \Sigma_w \beta \leqslant 1,\ \ \|\beta\|_0 \leqslant s,
\]
where $\|\beta\|_0$ denotes the number of nonzero elements of $\beta$, $\Sigma_b$ is the between-class covariance matrix, and $\Sigma_w$ is the within-class covariance matrix. Further details are introduced in Section 2.
This study presents a new iterative thresholding algorithm to estimate the linear discriminant vectors, developed from the primal-dual active set algorithm proposed by Zhu et al.[10] for the best subset selection problem in regression. The contribution of this study is twofold. First, we address the estimation problem of sparse LDA by directly solving the ℓ0-constrained optimization problem, which avoids the unnecessary information loss caused by relaxing the constraint. Second, we construct a polynomial-time algorithm for estimating sparse linear discriminant vectors that is computationally efficient and easy to implement.
The remainder of this paper is organized as follows. In Section 2, we review LDA and classical solution procedures and then propose a new solution for sparse linear discriminant analysis. Sections 3 and 4 compare the results of our proposed method and existing methods in simulated experiments and applications to real data, respectively. Section 5 presents our conclusions.
2.
Methodology
2.1
A review of linear discriminant analysis
Let X be an n×p data matrix with p features measured on n observations, and suppose that each of the n observations belongs to one of K classes. In addition, we assume that each of the p features is centered to have zero mean and, if the features are not measured on the same scale, is scaled to have the same variance. Let $x_i$ denote the ith observation and $C_k$ denote the index set of observations in the kth class.
Consider a simple multivariate Gaussian data-generating process in which the observations in class k follow $N(\mu_k, \Sigma_w)$, where $\mu_k\in\mathbb{R}^p$ is the mean vector of the kth class and $\Sigma_w$ is a p×p within-class covariance matrix common to all K classes. Here, $|C_k|^{-1}\sum_{i\in C_k} x_i$ is the estimate of $\mu_k$ and $n^{-1}\sum_{k=1}^{K}\sum_{i\in C_k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^\top$ is the estimate of $\Sigma_w$, where $|\cdot|$ denotes the cardinality of an index set. The LDA classification rule then follows from applying Bayes' rule to determine the most likely class for a test observation.
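For concreteness, the plug-in Gaussian LDA rule just described can be sketched as follows (our own illustration, not code from the paper); the ridge term eps is an assumption used here only to keep the pooled covariance estimate invertible.

```python
import numpy as np

def gaussian_lda_fit_predict(X, y, X_test, eps=1e-8):
    """Plug-in LDA: class means, pooled within-class covariance, priors, then Bayes' rule."""
    n, p = X.shape
    classes = np.unique(y)
    mu = np.vstack([X[y == k].mean(axis=0) for k in classes])        # K x p class means
    Sw = sum((X[y == k] - mu[i]).T @ (X[y == k] - mu[i])
             for i, k in enumerate(classes)) / n                      # pooled within-class covariance
    Sw_inv = np.linalg.inv(Sw + eps * np.eye(p))                      # eps: assumed ridge for invertibility
    priors = np.array([np.mean(y == k) for k in classes])
    # Linear discriminant scores: x^T Sw^{-1} mu_k - 0.5 mu_k^T Sw^{-1} mu_k + log(pi_k)
    scores = X_test @ Sw_inv @ mu.T \
             - 0.5 * np.einsum('ij,ij->i', mu @ Sw_inv, mu) + np.log(priors)
    return classes[np.argmax(scores, axis=1)]
```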
LDA can also be derived from Fisher’s discriminant problem. We define the between-class covariance matrix $\Sigma_b = \sum_{k=1}^{K}\pi_k \mu_k\mu_k^\top$, where $\pi_k$ is the prior probability of class k. Fisher’s discriminant problem solves for the discriminant vectors $\beta_1,\dots,\beta_{K-1}$, which sequentially maximize the between-class variance relative to the within-class variance. The corresponding optimization problem is
\[
\underset{\beta_k}{\operatorname{maximize}}\ \beta_k^\top \Sigma_b \beta_k \quad \text{subject to}\quad \beta_k^\top \Sigma_w \beta_k \leqslant 1,\ \ \beta_k^\top \Sigma_w \beta_j = 0\ \ \text{for all } j<k.
\]
Because the rank of $\Sigma_b$ is at most K−1, the above generalized eigenproblem has at most K−1 nontrivial solutions and, therefore, at most K−1 discriminant vectors. These solutions are the directions in which the data have maximal between-class variance relative to their within-class variance. Moreover, it has been shown that nearest-centroid classification based on the matrix $(X\beta_1,\dots,X\beta_{K-1})$ produces the same classification rule as the multivariate Gaussian model of LDA described previously (see Hastie et al.[1]). One advantage of Fisher’s discriminant problem over the multivariate Gaussian model of LDA is that it allows for reduced-rank classification by performing nearest-centroid classification on the matrix $(X\beta_1,\dots,X\beta_q)$ with q<K−1; this is exactly equivalent to performing full-rank LDA on that n×q matrix.
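As a small companion sketch (ours, not code from the paper), nearest-centroid classification in the space of the projected data $(X\beta_1,\dots,X\beta_q)$ can be written as:

```python
import numpy as np

def nearest_centroid_predict(Z_train, y_train, Z_test):
    """Nearest-centroid rule on projected data Z = X @ B, where B holds the discriminant vectors."""
    classes = np.unique(y_train)
    centroids = np.vstack([Z_train[y_train == k].mean(axis=0) for k in classes])
    # Squared Euclidean distance from each test point to each class centroid.
    dists = ((Z_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dists, axis=1)]
```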
The standard estimate of the within-class covariance matrix Σw is:
\[
\hat{\Sigma}_w = \frac{1}{n}\sum_{k=1}^{K}\sum_{i\in C_k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^\top, \tag{1}
\]
where ˆμk denotes the sample mean vector for class k. In this subsection, we assume that ˆΣw is nonsingular. Furthermore, the standard estimate for the between-class covariance matrix Σb is given by:
\[
\hat{\Sigma}_b = \frac{1}{n}X^\top X - \hat{\Sigma}_w = \frac{1}{n}\sum_{k=1}^{K} n_k\,\hat{\mu}_k\hat{\mu}_k^\top,
\]
where $n_k = |C_k|$.
In later subsections, we will make use of the fact that
\[
\hat{\Sigma}_b = \frac{1}{n}X^\top Y (Y^\top Y)^{-1} Y^\top X,
\]
where Y is an n×K matrix and Yik is an indicator of whether observation i is in the kth class. Then, the empirical version of LDA can be written as
\[
\underset{\beta_k}{\operatorname{maximize}}\ \beta_k^\top \hat{\Sigma}_b \beta_k \quad \text{subject to}\quad \beta_k^\top \hat{\Sigma}_w \beta_k \leqslant 1,\ \ \beta_k^\top \hat{\Sigma}_w \hat{\beta}_j = 0\ \ \text{for all } j<k. \tag{2}
\]
Problem (2) is commonly written with an equality constraint instead of an inequality constraint, but the two are equivalent when $\hat{\Sigma}_w$ has full rank, as detailed in Ref. [9, Appendix A]. The solution $\hat{\beta}_k$ to problem (2) is generally referred to as the kth discriminant vector. Problem (2) can be solved through the variable substitution $\tilde{\beta}_k = \hat{\Sigma}_w^{1/2}\beta_k$, where $\hat{\Sigma}_w^{1/2}$ is the symmetric matrix square root of $\hat{\Sigma}_w$; Fisher’s discriminant problem then reduces to a standard eigenproblem. However, when p>n, $\hat{\Sigma}_w$ becomes singular, and any discriminant vector lying in the null space of $\hat{\Sigma}_w$ but not in the null space of $\hat{\Sigma}_b$ yields an arbitrarily large value of the objective function.
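The following minimal numpy sketch (our own illustration, not code from the paper) computes the empirical quantities above and solves the full-rank Fisher problem through the substitution $\tilde{\beta} = \hat{\Sigma}_w^{1/2}\beta$; the small ridge term eps is an assumption that keeps the square root invertible.

```python
import numpy as np

def fisher_lda_directions(X, y, q, eps=1e-8):
    """Full-rank Fisher LDA: X is n x p (columns centered), y holds class labels."""
    n, p = X.shape
    classes, idx = np.unique(y, return_inverse=True)
    Y = np.eye(len(classes))[idx]                      # n x K indicator matrix
    Sigma_b = X.T @ Y @ np.linalg.inv(Y.T @ Y) @ Y.T @ X / n
    Sigma_w = X.T @ X / n - Sigma_b                    # total covariance = within + between
    # Symmetric square root of Sigma_w (ridge eps assumed for invertibility).
    vals, vecs = np.linalg.eigh(Sigma_w + eps * np.eye(p))
    W_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    M = W_inv_half @ Sigma_b @ W_inv_half              # standard symmetric eigenproblem
    evals, evecs = np.linalg.eigh(M)
    B_tilde = evecs[:, ::-1][:, :q]                    # top q eigenvectors
    return W_inv_half @ B_tilde                        # back-transform: beta = Sigma_w^{-1/2} beta_tilde
```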
To address this singularity problem, several modifications of Fisher’s discriminant problem have been proposed. Krzanowski et al.[11] considered a modification of problem (2) that seeks a unit vector β maximizing $\beta^\top\hat{\Sigma}_b\beta$ subject to $\beta^\top\hat{\Sigma}_w\beta = 0$. Tebbens and Schlesinger[12] further required that the solution not lie in the null space of $\hat{\Sigma}_b$. It has also been proposed to modify problem (2) using positive definite estimates of $\Sigma_w$. For example, Friedman[5], Dudoit et al.[4], and Bickel and Levina[3] considered the use of a diagonal estimate
\[
\tilde{\Sigma}_w = \mathrm{diag}(\hat{\sigma}_1^2, \dots, \hat{\sigma}_p^2),
\]
where $\hat{\sigma}_j^2$ is the jth diagonal element of $\hat{\Sigma}_w$ in Eq. (1). Other forms of positive definite estimates of $\Sigma_w$ are suggested by Xu et al.[6]. Given a positive definite estimate $\tilde{\Sigma}_w$, the resulting optimization problem is
\[
\underset{\beta_k}{\operatorname{maximize}}\ \beta_k^\top \hat{\Sigma}_b \beta_k \quad \text{subject to}\quad \beta_k^\top \tilde{\Sigma}_w \beta_k \leqslant 1,\ \ \beta_k^\top \tilde{\Sigma}_w \hat{\beta}_j = 0\ \ \text{for all } j<k. \tag{3}
\]
The new optimization problem (3) addresses the singularity issue but not the interpretability issue. We therefore extend problem (3) so that the resulting discriminant vectors are interpretable. To this end, we use Lemma 2.1, which reformulates problem (3) while yielding the same solution.
Lemma 2.1. The solution $\hat{\beta}_k$ to problem (3) is equivalent to the solution to the following problem:
\[
\underset{\beta_k}{\operatorname{maximize}}\ \beta_k^\top \hat{\Sigma}_b^k \beta_k \quad \text{subject to}\quad \beta_k^\top \tilde{\Sigma}_w \beta_k \leqslant 1,
\]
where
\[
\hat{\Sigma}_b^k = \frac{1}{n}X^\top Y (Y^\top Y)^{-1/2} P_k^\perp (Y^\top Y)^{-1/2} Y^\top X.
\]
Here, $P_k^\perp$ is defined as follows: $P_1^\perp = I$, and for $k>1$, $P_k^\perp$ is the orthogonal projection matrix onto the space orthogonal to $(Y^\top Y)^{-1/2} Y^\top X \hat{\beta}_i$ for all $i<k$.
The proof of Lemma 2.1 can be found in Ref. [9]. Once the discriminant vectors $\hat{\beta}_1,\dots,\hat{\beta}_k$ have been obtained, we replace $\hat{\Sigma}_b^k$ with $\hat{\Sigma}_b^{k+1}$ and repeat the same procedure used to compute $\hat{\beta}_1$, until all the discriminant vectors are obtained.
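The deflation of Lemma 2.1 is straightforward to implement. The following sketch (our illustration, not the authors' code) forms $\hat{\Sigma}_b^k$ by projecting out the directions $(Y^\top Y)^{-1/2}Y^\top X\hat{\beta}_i$ associated with the previously computed discriminant vectors, passed in as a list betas_prev.

```python
import numpy as np

def deflated_between_cov(X, Y, betas_prev):
    """Sigma_b^k of Lemma 2.1: deflate the between-class covariance by previous directions."""
    n, _ = X.shape
    K = Y.shape[1]
    counts = Y.sum(axis=0).astype(float)
    M = np.diag(counts ** -0.5) @ Y.T @ X                   # (Y^T Y)^{-1/2} Y^T X, a K x p matrix
    P = np.eye(K)                                           # P_1^perp = I when no directions yet
    if betas_prev:
        U = np.column_stack([M @ b for b in betas_prev])    # directions (Y^T Y)^{-1/2} Y^T X beta_i
        Q, _ = np.linalg.qr(U)
        P = np.eye(K) - Q @ Q.T                             # projection onto their orthogonal complement
    return M.T @ P @ M / n
```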
2.2
Sparse linear discriminant analysis with the ℓ0 constraint
In this study, we modify problem (3) by imposing an ℓ0 constraint on the discriminant vectors. Given a pre-specified degree of sparsity s, we define the kth sparse discriminant vector $\hat{\beta}_k$ as the solution to the following optimization problem:
\[
\underset{\beta_k}{\operatorname{maximize}}\ \beta_k^\top \hat{\Sigma}_b^k \beta_k \quad \text{subject to}\quad \beta_k^\top \tilde{\Sigma}_w \beta_k \leqslant 1,\ \ \|\beta_k\|_0 \leqslant s. \tag{4}
\]
In practice, we work with the Lagrangian-type formulation of problem (4),
\[
\underset{\beta}{\operatorname{minimize}}\ L(\beta) = -\beta^\top \hat{\Sigma}_b^k \beta + \lambda\,(\beta^\top \tilde{\Sigma}_w \beta - 1) \quad \text{subject to}\quad \|\beta\|_0 = s, \tag{5}
\]
where $\lambda>0$ is a parameter that controls the normalization of the discriminant vector $\beta$ with respect to $\tilde{\Sigma}_w$. We denote by $\beta^\diamond$ a coordinate-wise minimizer of problem (5). If the active set $\mathcal{A}^\diamond = \{j : \beta^\diamond_j \neq 0\}$ were known, then, disregarding the constraint $\|\beta\|_0 = s$, we could minimize the objective function $L(\beta)$ without the ℓ0 constraint.
To determine a valid active set $\mathcal{A}$, we introduce the sacrifice $\Delta = \{\Delta_j : j=1,\dots,p\}$, similar to that in Ref. [10]. Specifically, the jth sacrifice $\Delta_j$ measures the change in the value of the objective function $L(\beta)$ when the current $\hat{\beta}_j$ is set to 0 in each round of iterations. Consider the mth iteration. Fixing the other coordinates at their current values, the new marginal optimum of $\beta_j$ is given by $\beta_j^{(m+1)} = \beta_j^{(m)} + d_j^{(m)}$, where
\[
d_j^{(m)} = -\left.\frac{\partial L(\beta)/\partial \beta_j}{\partial^2 L(\beta)/\partial \beta_j^2}\right|_{\beta=\beta^{(m)}} = \frac{\hat{\Sigma}_{j\cdot}\beta^{(m)} - \lambda\hat{\sigma}_j^2\beta_j^{(m)}}{\lambda\hat{\sigma}_j^2 - \hat{\Sigma}_{jj}}
\]
denotes the dual variable of $\beta_j^{(m)}$, $\hat{\Sigma}_{j\cdot}$ denotes the jth row of $\hat{\Sigma}_b^k$, and $\hat{\Sigma}_{jj}$ denotes the $(j,j)$ element of $\hat{\Sigma}_b^k$. Here, the superscript “(m)” indicates the value taken at the mth iteration. The sacrifice $\Delta_j$ is then defined as $\Delta_j^{(m)} = (\lambda\hat{\sigma}_j^2 - \hat{\Sigma}_{jj})(\beta_j^{(m)} + d_j^{(m)})^2$, which approximates the change in the loss. Denote the inactive set $\mathcal{I}^\diamond = (\mathcal{A}^\diamond)^c = \{j : \beta_j^\diamond = 0\}$. By the primal-dual condition, we have the following:
● If $j\in\mathcal{A}^\diamond$, then $\beta_j^\diamond \neq 0$ and $d_j^\diamond = 0$.
● If $j\in\mathcal{I}^\diamond$, then $\beta_j^\diamond = 0$ and $d_j^\diamond = (\hat{\Sigma}_{j\cdot}\beta^\diamond - \lambda\hat{\sigma}_j^2\beta_j^\diamond)/(\lambda\hat{\sigma}_j^2 - \hat{\Sigma}_{jj})$.
The purpose of problem (5) is to minimize the objective function $L(\beta)$, which implies that the sacrifices of the discarded coordinates should be as small as possible. That is, among all candidates, the coordinates with the smallest sacrifices are set to zero. To achieve this, we arrange $\{\Delta_j, j=1,\dots,p\}$ in decreasing order, $\Delta_{[1]} \geqslant \Delta_{[2]} \geqslant \cdots \geqslant \Delta_{[p]}$, where $\Delta_{[i]}$ denotes the ith largest value among $\{\Delta_j, j=1,\dots,p\}$. We then truncate the ordered sacrifice vector at position s, that is, we set the estimate of the active set to $\mathcal{A} = \{j: \Delta_j \geqslant \Delta_{[s]}\}$.
When $\mathcal{A}$ is given, we can extract the corresponding rows and columns of $\hat{\Sigma}_b^k$ and $\tilde{\Sigma}_w$, and the problem reduces to a classical LDA problem. Its solution is obtained by setting $\beta_{\mathcal{A}}$ to the first eigenvector of $(\tilde{\Sigma}_w^{-1/2}\hat{\Sigma}_b^k\tilde{\Sigma}_w^{-1/2})_{\mathcal{A}\mathcal{A}}$ and $\beta_{\mathcal{I}} = 0$. The above discussion is summarized in Algorithm 2.1.
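The sacrifice-and-truncate update just described can be sketched as follows (our own illustration; Sigma_bk stands for $\hat{\Sigma}_b^k$, sigma2 for the diagonal of $\tilde{\Sigma}_w$, and λ is assumed large enough that the denominators are positive):

```python
import numpy as np

def naive_iteration(Sigma_bk, sigma2, beta, lam, s):
    """One sacrifice-and-truncate update of the naive sparse LDA algorithm."""
    p = len(beta)
    # lam is assumed large enough that lam * sigma2_j > Sigma_jj for all j.
    denom = lam * sigma2 - np.diag(Sigma_bk)
    d = (Sigma_bk @ beta - lam * sigma2 * beta) / denom     # dual variables d_j
    delta = denom * (beta + d) ** 2                         # sacrifices Delta_j
    A = np.sort(np.argsort(delta)[::-1][:s])                # keep the s largest sacrifices
    # Restricted problem on A with the diagonal within-class estimate.
    W_inv_half = np.diag(sigma2[A] ** -0.5)
    M = W_inv_half @ Sigma_bk[np.ix_(A, A)] @ W_inv_half
    _, vecs = np.linalg.eigh(M)
    beta_new = np.zeros(p)
    beta_new[A] = vecs[:, -1]                               # leading eigenvector on the active set
    return beta_new, A, delta
```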
2.3
Best subset selection
Owing to the discontinuity of the ℓ0 norm, the naive algorithm for sparse linear discriminant analysis may converge to a local minimum and encounter the problem of periodic iterations. To obtain sparse linear discriminant vectors more efficiently, we propose in this subsection an iterative algorithm based on the primal-dual condition of problem (5). Motivated by Zhu et al.[10], who developed a novel algorithm based on exchanging active sets to avoid periodic iterations when solving the best subset selection problem in linear regression, we extend their work to the sparse LDA problem. An exchange step is added to Algorithm 2.1 to prevent the algorithm from entering a loop; that is, a subset of the active set is exchanged with a subset of the inactive set, and we then decide whether to adopt the new candidate solution by comparing the objective function values before and after the exchange.
Specifically, at the mth iteration of computing the kth discriminant vector in Algorithm 2.1, we obtain the sacrifice $\Delta^{(m)}$. We then identify the variables to be exchanged between the active and inactive sets.
● Subset of the active set to be moved to the inactive set:
\[
\mathcal{B}_{c,1}^{(m)} = \Big\{ j\in\mathcal{A}^{(m)} : \sum_{i\in\mathcal{A}^{(m)}} \mathbb{I}\big(\Delta_j^{(m)} \geqslant \Delta_i^{(m)}\big) \leqslant c \Big\}.
\]
● Subset of the inactive set to be moved to the active set:
\[
\mathcal{B}_{c,2}^{(m)} = \Big\{ j\in\mathcal{I}^{(m)} : \sum_{i\in\mathcal{I}^{(m)}} \mathbb{I}\big(\Delta_j^{(m)} \leqslant \Delta_i^{(m)}\big) \leqslant c \Big\},
\]
where c is an integer ranging from 1 to min{s, p−s}.
We then form the candidate active and inactive sets $\mathcal{A}_c^{(m)} = (\mathcal{A}^{(m)}\setminus\mathcal{B}_{c,1}^{(m)})\cup\mathcal{B}_{c,2}^{(m)}$ and $\mathcal{I}_c^{(m)} = (\mathcal{I}^{(m)}\setminus\mathcal{B}_{c,2}^{(m)})\cup\mathcal{B}_{c,1}^{(m)}$. Given $\mathcal{A}_c^{(m)}$ and $\mathcal{I}_c^{(m)}$, we define $\tilde{\beta}_c^{(m)}$ as the solution to problem (5) restricted to $\mathcal{A}_c^{(m)}$, which is given by
\[
\tilde{\beta}_c^{(m)} = \underset{\beta_{\mathcal{I}_c^{(m)}} = 0,\ \beta^\top \tilde{\Sigma}_w \beta = 1}{\arg\max}\ \beta^\top \hat{\Sigma}_b^k \beta.
\]
Furthermore, for accuracy, we search c over 1 to min{s, p−s} and select the candidate with the smallest value of $L(\beta)$, that is, $\hat{c} = \arg\min_c L(\tilde{\beta}_c^{(m)})$. Algorithm 2.2 displays the process of exchanging variables between the active and inactive sets.
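A sketch of the exchange step is given below (ours, not the authors' implementation); it reuses a small helper for the restricted eigenproblem of Algorithm 2.1, enumerates c, and keeps the candidate with the smallest objective value, falling back to the current estimate when no exchange improves it. A and I are assumed to be integer index arrays for the active and inactive sets.

```python
import numpy as np

def restricted_leading_vector(Sigma_bk, sigma2, A, p):
    """Leading eigenvector of the diagonally whitened Sigma_b^k restricted to the active set A."""
    W_inv_half = np.diag(sigma2[A] ** -0.5)
    M = W_inv_half @ Sigma_bk[np.ix_(A, A)] @ W_inv_half
    _, vecs = np.linalg.eigh(M)
    beta = np.zeros(p)
    beta[A] = vecs[:, -1]                              # eigh sorts ascending; take the last column
    return beta

def exchange_step(Sigma_bk, sigma2, lam, beta, delta, A, I, s):
    """Swap c low-sacrifice active variables with c high-sacrifice inactive ones, keep the best."""
    p = len(beta)

    def loss(b):
        # L(beta) = -beta^T Sigma_b^k beta + lam * (beta^T diag(sigma2) beta - 1)
        return -b @ Sigma_bk @ b + lam * (b @ (sigma2 * b) - 1.0)

    best_beta, best_A, best_loss = beta, A, loss(beta)
    for c in range(1, min(s, p - s) + 1):
        B1 = A[np.argsort(delta[A])[:c]]               # c smallest sacrifices in the active set
        B2 = I[np.argsort(delta[I])[::-1][:c]]         # c largest sacrifices in the inactive set
        A_c = np.sort(np.concatenate([np.setdiff1d(A, B1), B2]))
        beta_c = restricted_leading_vector(Sigma_bk, sigma2, A_c, p)
        if loss(beta_c) < best_loss:
            best_beta, best_A, best_loss = beta_c, A_c, loss(beta_c)
    return best_beta, best_A
```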
Based on the exchange step proposed above, we develop an efficient algorithm that avoids getting trapped in a local minimum and guarantees convergence. Specifically, we add an exchange step after updating the primal and dual variables in Algorithm 2.1. The details are presented in Algorithm 2.3, which is called SLDA-BESS.
Algorithm 2.3. SLDA-BESS for sparse LDA
Input: data matrix X, indicator matrix Y, sparsity s, λ.
Output: $\beta_1,\dots,\beta_q$.
1: $\hat{\Sigma}_b = n^{-1}X^\top Y(Y^\top Y)^{-1}Y^\top X$, $\hat{\Sigma}_w = X^\top X/n - \hat{\Sigma}_b$, and $\tilde{\Sigma}_w = \mathrm{diag}\{\hat{\Sigma}_w\}$.
2: Initialize $\beta^{(0)}$ with normalization $\beta^{(0)\top}\tilde{\Sigma}_w\beta^{(0)} = 1$.
3: For $k=1,\dots,q$: run the iterations of Algorithm 2.1 for $\hat{\Sigma}_b^k$, applying the exchange step of Algorithm 2.2 after each update of the primal and dual variables, until convergence or a maximum number of iterations is reached.
The improved Algorithm 2.3 has the theoretical guarantee of Theorem 2.1, which indicates that such a best subset selection algorithm incorporating the exchange step is feasible.
Theorem 2.1. The SLDA-BESS algorithm terminates after a finite number of iterations.
Proof. The objective value improves monotonically across iterations, and the number of possible active sets is finite. Hence, Algorithm 2.3 terminates after a finite number of iterations.
3.
Numeric studies
We compared SLDA-BESS with penalized LDA-ℓ1[9], nearest shrunken centroids (NSC)[7], and shrunken centroids regularized discriminant analysis (RDA)[8] in simulation studies. LDA-ℓ1 adds an ℓ1 penalty to the Fisher discriminant objective $\beta^\top\hat{\Sigma}_b\beta$; NSC is a simple modification of the nearest centroid method that standardizes the centroid distances by a per-feature standard deviation; and RDA combines a ridge-type penalty on the within-class covariance matrix with soft thresholding, building on NSC. In each simulation, 1200 observations were generated, divided equally among several classes. A random subset of 300 of these 1200 observations was used as the training set, and the remaining 900 formed the test set. Each simulation consisted of measurements of 500 features, i.e., p = 500.
Simulation 1. Consider four classes in which the features are independent of each other and the means are shifted. Given the index sets of the four classes, $C_1$, $C_2$, $C_3$, and $C_4$, $x_i \sim N(\mu_k, I)$ if $i\in C_k$, where $\mu_{1,j}=0.7$ for $1\leqslant j\leqslant 25$, $\mu_{2,j}=0.7$ for $26\leqslant j\leqslant 50$, $\mu_{3,j}=0.7$ for $51\leqslant j\leqslant 75$, $\mu_{4,j}=0.7$ for $76\leqslant j\leqslant 100$, and $\mu_{k,j}=0$ otherwise for $j=1,\dots,500$.
Simulation 2. Consider two classes in which the features are dependent on each other and the means are shifted. Given the index sets of the two classes, $C_1$ and $C_2$, $x_i \sim N(0,\Sigma)$ for $i\in C_1$ and $x_i \sim N(\mu,\Sigma)$ for $i\in C_2$, where $\mu_j = 0.6$ if $j\leqslant 100$ and $\mu_j=0$ otherwise, and $\Sigma = (\Sigma_{ij}) = (0.6^{|i-j|})$. The covariance structure $\Sigma$ is intended to mimic gene expression data, in which genes are positively correlated within a pathway and independent between pathways.
Simulation 3. Consider four classes in which the features are independent of each other and the mean structure is one-dimensional. Given the four index sets $C_1$, $C_2$, $C_3$, and $C_4$, for $i\in C_k$, $x_{ij} \sim N((k-1)/3, 1)$ if $j\leqslant 100$ and $x_{ij}\sim N(0,1)$ otherwise. A one-dimensional projection of the data fully captures the class structure.
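As an illustration of the simulation designs (our own sketch, with a hypothetical random seed), data for Simulation 1 can be generated as follows:

```python
import numpy as np

def simulate_setting_1(n_per_class=300, p=500, shift=0.7, seed=0):
    """Simulation 1: four classes, identity covariance, block mean shifts of 25 features each."""
    rng = np.random.default_rng(seed)
    K = 4
    mu = np.zeros((K, p))
    for k in range(K):
        mu[k, 25 * k:25 * (k + 1)] = shift              # class k shifts features 25k+1,...,25(k+1)
    X = np.vstack([rng.standard_normal((n_per_class, p)) + mu[k] for k in range(K)])
    y = np.repeat(np.arange(K), n_per_class)
    return X, y
```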
For each method, models were fitted to the training set over a range of values of the tuning parameters, and the tuning parameter values were chosen to minimize the error on the validation set. The selected models were then evaluated on the test set. After obtaining the estimated discriminant vectors, we applied the KNN method to the projected data to perform classification. This process was repeated 200 times.
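A minimal version of this evaluation protocol could look like the sketch below (our illustration); fit_discriminant is a hypothetical placeholder for any of the compared methods, and the default of 1 nearest neighbor is an assumption, since the paper does not specify k.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def tune_and_evaluate(fit_discriminant, grid, X_tr, y_tr, X_val, y_val, X_te, y_te, k=1):
    """Choose the tuning parameter by validation error, then classify projected test data with KNN."""
    best_err, best_B = np.inf, None
    for param in grid:
        B = fit_discriminant(X_tr, y_tr, param)            # p x q matrix of discriminant vectors
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr @ B, y_tr)
        err = np.mean(knn.predict(X_val @ B) != y_val)
        if err < best_err:
            best_err, best_B = err, B
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr @ best_B, y_tr)
    test_err = np.mean(knn.predict(X_te @ best_B) != y_te)
    n_features = int(np.sum(np.any(best_B != 0, axis=1)))  # number of non-zero features used
    return test_err, n_features
```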
The classification error on the test set and the number of non-zero features used by the discriminant vectors are listed in Table 1. In the “errors” row of each simulation, the average number of misclassified observations on the test set of 900 observations is presented, with the standard deviation in parentheses. Correspondingly, the “features” row reports the average number of non-zero features used in estimating the discriminant vectors, with the standard deviation in parentheses. As shown in Table 1, our method achieves a smaller error when the numbers of features used are comparable. Moreover, the tuning parameters are easier to adjust with the ℓ0 penalty method.
Algorithm 2.1. Naive algorithm for sparse linear discriminant analysis
Input: data matrix X, indicator matrix Y, sparsity s, λ.
Output: $\hat{\beta}_1,\dots,\hat{\beta}_q$.
1: $\hat{\Sigma}_b = n^{-1}X^\top Y(Y^\top Y)^{-1}Y^\top X$.
2: $\hat{\Sigma}_w = X^\top X/n - \hat{\Sigma}_b$ and $\tilde{\Sigma}_w = \mathrm{diag}\{\hat{\Sigma}_w\}$.
3: for $k=1,\dots,q$ do
4: (a) $\hat{\Sigma}_b^k = n^{-1}X^\top Y(Y^\top Y)^{-1/2}P_k^\perp(Y^\top Y)^{-1/2}Y^\top X$.
5: (b) Initialize $\beta_k$ with normalization $\beta_k^\top\tilde{\Sigma}_w\beta_k = 1$.
6: (c) Iterate until convergence or a maximum number of iterations is reached:
· $d_j = (\hat{\Sigma}_{j\cdot}\beta_k - \lambda\hat{\sigma}_j^2\beta_{k,j})/(\lambda\hat{\sigma}_j^2 - \hat{\Sigma}_{jj})$ for $j=1,\dots,p$;
· $\Delta_j = (\lambda\hat{\sigma}_j^2 - \hat{\Sigma}_{jj})(\beta_{k,j} + d_j)^2$, $\mathcal{A} = \{j: \Delta_j \geqslant \Delta_{[s]}\}$, and $\mathcal{I} = \mathcal{A}^c$;
· $\beta_{k,\mathcal{A}}$ is the first eigenvector of $(\tilde{\Sigma}_w^{-1/2}\hat{\Sigma}_b^k\tilde{\Sigma}_w^{-1/2})_{\mathcal{A}\mathcal{A}}$ and $\beta_{k,\mathcal{I}} = 0$.
4.
Applications to real data
This section compares our SLDA-BESS method with three existing methods: penalized LDA-ℓ1, NSC, and RDA. We applied the four methods to three gene expression datasets.
Ramaswamy dataset[13]: A dataset consisting of 16063 gene expression measurements and 198 samples belonging to 14 distinct cancer subtypes.
Nakayama dataset[14]: A dataset consisting of 105 samples from 10 types of soft tissue tumors, with 22283 gene expression measurements per sample. We limit the analysis to the five tumor types with at least 15 samples each, resulting in a subset of 86 samples.
Sun dataset[15]: A dataset consisting of 180 samples and 54613 expression measurements. The samples fall into four classes: one non-tumor class and three glioma classes.
Each dataset was split into a training set containing 75% of the samples and a test set containing the remaining 25%. A total of 100 replications were performed, each with randomly selected training and test sets. The results are presented in Table 2. They suggest that the four methods perform roughly equally in terms of error, but our SLDA-BESS method uses fewer features. Moreover, our SLDA-BESS method is better suited to markedly sparse models, whereas the other three methods may fail when restricted to only a few features.
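The repeated 75/25 split protocol is straightforward to express; the sketch below is our own illustration, where evaluate is a hypothetical callback that fits a method on the training part and returns its test error, and stratified splitting is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def repeated_split_evaluation(X, y, evaluate, n_reps=100, test_size=0.25, seed=0):
    """Repeat a random 75/25 train/test split and record the error of a given method."""
    errors = []
    for r in range(n_reps):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + r)
        errors.append(evaluate(X_tr, y_tr, X_te, y_te))   # evaluate: hypothetical callback
    return float(np.mean(errors)), float(np.std(errors))
```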
Algorithm 2.2. Exchange step
Input: $\hat{\Sigma}_b^k$, $\tilde{\Sigma}_w$, $\lambda$, current estimate $\beta$, sacrifice $\Delta$, active set $\mathcal{A}$, inactive set $\mathcal{I}$, sparsity s.
Output: discriminant vector $\hat{\beta}$, dual variables d, and sacrifice $\Delta$.
1: $L = -\beta^\top\hat{\Sigma}_b^k\beta + \lambda(\beta^\top\tilde{\Sigma}_w\beta - 1)$.
2: for $c=1,\dots,\min\{s, p-s\}$ do
3: $\mathcal{B}_{c,1} = \{j\in\mathcal{A}: \sum_{i\in\mathcal{A}}\mathbb{I}(\Delta_j\geqslant\Delta_i)\leqslant c\}$ and $\mathcal{B}_{c,2} = \{j\in\mathcal{I}: \sum_{i\in\mathcal{I}}\mathbb{I}(\Delta_j\leqslant\Delta_i)\leqslant c\}$.
4: $\mathcal{A}_c = (\mathcal{A}\setminus\mathcal{B}_{c,1})\cup\mathcal{B}_{c,2}$ and $\mathcal{I}_c = (\mathcal{I}\setminus\mathcal{B}_{c,2})\cup\mathcal{B}_{c,1}$.
5: Calculate $\tilde{\beta}_c$ based on $\mathcal{A}_c$, and $\tilde{L}_c = -\tilde{\beta}_c^\top\hat{\Sigma}_b^k\tilde{\beta}_c + \lambda(\tilde{\beta}_c^\top\tilde{\Sigma}_w\tilde{\beta}_c - 1)$.
6: end for
7: Set $\hat{c} = \arg\min_c \tilde{L}_c$; if $\tilde{L}_{\hat{c}} < L$, update $\hat{\beta} = \tilde{\beta}_{\hat{c}}$, $\mathcal{A} = \mathcal{A}_{\hat{c}}$, and $\mathcal{I} = \mathcal{I}_{\hat{c}}$; otherwise keep $\hat{\beta} = \beta$.
8: Update the dual variables d and the sacrifice $\Delta$ based on $\hat{\beta}$.
5.
Conclusions
Linear discriminant analysis is a commonly used classification method. However, it fails when the number of features is large relative to the number of observations. This study extended LDA to a high-dimensional setting by adding an ℓ0 penalty to produce discriminant vectors involving only a subset of the features. Our extension is based on Fisher’s discriminant problem, formulated as a generalized eigenproblem, and uses the SLDA-BESS algorithm, which combines the naive algorithm with the exchange step, to solve for the sparse discriminant vectors. The resulting sparse discriminant vectors make the classifier more interpretable in practical settings. The results of both the numerical simulations and the analysis of real data demonstrate the superior performance of our SLDA-BESS method. The convergence rate and complexity of the SLDA-BESS algorithm, as well as theoretical properties such as minimax lower and upper bounds for the estimator, can be investigated in future work.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (71771203).
Conflict of interest
The authors declare that they have no conflict of interest.
[1]
Hastie T, Tibshirani R, Friedman J H, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Berlin: Springer, 2009.
[2]
Shao J, Wang Y, Deng X, et al. Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of Statistics,2011, 39 (2): 1241–1265. DOI: 10.1214/10-AOS870
[3]
Bickel P, Levina E. Some theory of Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli,2004, 10 (6): 989–1010. DOI: 10.3150/bj/1106314847
[4]
Dudoit S, Fridlyand J, Speed T P. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association,2002, 97 (457): 77–87. DOI: 10.1198/016214502753479248
[5]
Friedman J H. Regularized discriminant analysis. Journal of the American Statistical Association,1989, 84 (405): 165–175. DOI: 10.1080/01621459.1989.10478752
[6]
Xu P, Brock G N, Parrish R S. Modified linear discriminant analysis approaches for classification of high-dimensional microarray data. Computational Statistics & Data Analysis,2009, 53 (5): 1674–1687. DOI: 10.1016/j.csda.2008.02.005
[7]
Tibshirani R, Hastie T, Narasimhan B, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America,2002, 99 (10): 6567–6572. DOI: 10.1073/pnas.082099299
[8]
Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics,2007, 8 (1): 86–100. DOI: 10.1093/biostatistics/kxj035
[9]
Witten D M, Tibshirani R. Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society:Series B (Statistical Methodology),2011, 73 (5): 753–772. DOI: 10.1111/j.1467-9868.2011.00783.x
[10]
Zhu J, Wen C, Zhu J, et al. A polynomial algorithm for best-subset selection problem. Proceedings of the National Academy of Sciences of the United States of America,2020, 117 (52): 33117–33123. DOI: 10.1073/pnas.2014241117
[11]
Krzanowski W, Jonathan P, McCarthy W, et al. Discriminant analysis with singular covariance matrices: Methods and applications to spectroscopic data. Journal of the Royal Statistical Society:Series C (Applied Statistics),1995, 44 (1): 101–115. DOI: 10.2307/2986198
[12]
Tebbens J D, Schlesinger P. Improving implementation of linear discriminant analysis for the high dimension/small sample size problem. Computational Statistics & Data Analysis,2007, 52 (1): 423–437. DOI: 10.1016/j.csda.2007.02.001
[13]
Ramaswamy S, Tamayo P, Rifkin R, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences of the United States of America,2001, 98 (26): 15149–15154. DOI: 10.1073/pnas.211566398
[14]
Nakayama R, Nemoto T, Takahashi H, et al. Gene expression analysis of soft tissue sarcomas: Characterization and reclassification of malignant fibrous histiocytoma. Modern Pathology,2007, 20 (7): 749–759. DOI: 10.1038/modpathol.3800794
[15]
Sun L, Hui A M, Su Q, et al. Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain. Cancer Cell,2006, 9 (4): 287–300. DOI: 10.1016/j.ccr.2006.03.003