Qi Yin is currently a graduate student at the University of Science and Technology of China. His research interests focus on variable selection.
Lei Shu is currently a Ph.D. student at the University of Science and Technology of China. His research interests focus on high-dimensional statistical inference, including variable selection, change point detection, and factor analysis.
Graphical Abstract
An algorithm that adds an exchange step to improve naive sparse linear discriminant analysis.
Abstract
We consider the problem of interpretable classification in a high-dimensional setting, where the number of features is extremely large and the number of observations is limited. This setting has been extensively studied in the chemometric literature and has recently become pervasive in the biological and medical literature. Linear discriminant analysis (LDA) is a canonical approach for solving this problem. However, in the case of high dimensions, LDA is unsuitable for two reasons. First, the standard estimate of the within-class covariance matrix is singular; therefore, the usual discriminant rule cannot be applied. Second, when p is large, it is difficult to interpret the classification rules obtained from LDA because p features are involved. In this setting, motivated by the success of the primal-dual active set algorithm for best subset selection, we propose a method for sparse linear discriminant analysis via ℓ0 constraint, which imposes a sparsity criterion when performing linear discriminant analysis, allowing classification and feature selection to be performed simultaneously. Numerical results on synthetic and real data suggest that our method obtains competitive results compared with existing alternative methods.
Public Summary
This paper extends LDA to a high-dimensional setting by adding an ℓ0 penalty to produce a discriminant vector involving only a subset of features.
We propose the SLDA-BESS algorithm to directly solve the ℓ0-constrained estimation problem of sparse LDA, avoiding the unnecessary information loss caused by relaxing the constraint in conventional algorithms.
Compared with other estimation methods, the SLDA-BESS algorithm is derived from a natural criterion and is superior in terms of sparsity recovery as well as computational efficiency.
1.
Introduction
Linear discriminant analysis (LDA) is a prevalent supervised classification tool in many applications, owing to its simplicity, robustness, and predictive accuracy. LDA uses label information to learn discriminant projections that maximize the between-class distance while decreasing the within-class distance, thereby improving classification accuracy. At the same time, low-dimensional projections of the data onto the most discriminant directions are valuable for data interpretation. LDA classifiers can be constructed in three different ways: the multivariate Gaussian model, the optimal scoring problem, and Fisher’s discriminant problem (see, for example, Hastie et al.[1]).
LDA is effective and asymptotically optimal when the dimension p is fixed and the number of observations n is large; that is, its excess misclassification rate over the optimal rule converges to 0 as n increases to infinity. Shao et al.[2] showed that LDA remains asymptotically optimal when p diverges to infinity at a rate slower than √n. However, with tremendous advances in data collection, high-dimensional data whose dimension p is potentially larger than the number of observations n are now frequently encountered in a wide range of applications, and the classification of such data has recently attracted considerable attention. Common applications include genomics, functional magnetic resonance imaging, risk management, and web search.
In high-dimensional settings, standard LDA performs poorly and may even fail completely. For example, Bickel and Levina[3] showed that LDA can perform no better than random guessing when p>n. Moreover, in this case the sample covariance matrix is singular and its inverse is not well defined. Consequently, it is challenging to select and extract the most discriminative features for supervised classification. A natural remedy is to replace the inverse with the generalized inverse of the sample covariance matrix. However, such an estimate is highly biased and non-robust, and can contribute to the poor performance of the classifier. When p is large, the resulting classifier is also difficult to interpret because the classification rule involves a linear combination of all p features. Thus, when p≫n, one may desire a classifier with parsimonious features, that is, one that involves only a subset of the p features. Such a sparse classifier is easier to interpret and can reduce overfitting of the training data.
Recently, several studies have extended LDA to high-dimensional settings. Some of this literature addresses non-sparse classifiers. For example, within the multivariate Gaussian model of LDA, Dudoit et al.[4] and Bickel and Levina[3] assumed independence of the features (naive Bayes), and Friedman[5] suggested applying a ridge penalty to the within-class covariance matrix. Xu et al.[6] considered other positive definite estimates of the within-class covariance matrix. In addition, some research on sparse classifiers has been conducted. Tibshirani et al.[7] adapted the naive Bayes classifier by soft thresholding the mean vectors, and Guo et al.[8] combined a ridge-type penalty on the within-class covariance matrix with a soft thresholding operation. Witten and Tibshirani[9] imposed an ℓ1 penalty on Fisher’s discriminant problem to obtain sparse discriminant vectors, although this approach does not generalize to the Gaussian mixture setting and lacks the simplicity of regression-based optimal scoring formulations.
Motivated by the Fisher’s discriminant framework of Witten and Tibshirani[9], we develop a sparse version of LDA with an ℓ0 constraint. Previous sparse LDA methods were typically implemented by imposing an ℓ1 penalty, an ℓ2 penalty, or a mixture of the two, because ℓ0 regularization is non-convex and NP-hard. Which formulation is preferable can be judged in terms of prediction accuracy, false discovery rate, and the interpretability of the resulting sparsity; ℓ0-penalized methods enjoy lower error bounds than ℓ1 methods. Mathematically, for a pre-specified degree of sparsity s, the discriminant vector can be determined by the following optimization problem:
\[
\underset{\beta}{\operatorname{maximize}}\ \beta^\top \Sigma_b \beta \quad \text{subject to}\quad \beta^\top \Sigma_w \beta \leqslant 1,\ \ \|\beta\|_0 \leqslant s,
\]
where $\|\beta\|_0$ denotes the number of nonzero elements of $\beta$, $\Sigma_b$ is the between-class covariance matrix, and $\Sigma_w$ is the within-class covariance matrix. Further details are introduced in Section 2.
This study presents a new iterative thresholding algorithm to estimate the linear discriminant vectors, developed from the primal-dual active set algorithm proposed by Zhu et al.[10] for the best subset selection problem in regression. The contribution of this study is twofold. First, we address the estimation problem of sparse LDA by directly solving the ℓ0-constrained optimization problem, which avoids the unnecessary information loss caused by relaxing the constraint. Second, we construct a polynomial-time algorithm for estimating sparse linear discriminant vectors that is computationally efficient and easy to implement.
The remainder of this paper is organized as follows. In Section 2, we review LDA and classical solution procedures and then propose a new solution for sparse linear discriminant analysis. Sections 3 and 4 compare the results of our proposed method and existing methods in simulated experiments and applications to real data, respectively. Section 5 presents our conclusions.
2.
Methodology
2.1
A review of linear discriminant analysis
Let X be an n×p data matrix with p features measured on n observations, and suppose that each of the n observations belongs to one of K classes. In addition, we assume that each of the p features is centered to have zero mean and, if the features are not measured on the same scale, is scaled to have the same variance. Let $x_i$ denote the ith observation and $C_k$ denote the index set of observations in the kth class.
Consider a simple multivariate Gaussian data-generating process in which the observations in class k follow $N(\mu_k, \Sigma_w)$, where $\mu_k\in\mathbb{R}^p$ is the mean vector of the kth class and $\Sigma_w$ is a p×p within-class covariance matrix common to all K classes. Here, $|C_k|^{-1}\sum_{i\in C_k} x_i$ is the estimate of $\mu_k$ and $n^{-1}\sum_{k=1}^{K}\sum_{i\in C_k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^\top$ is the estimate of $\Sigma_w$, where $|\cdot|$ denotes the cardinality of an index set. The LDA classification rule then follows from applying Bayes' rule to determine the most likely class for a test observation.
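For concreteness, the plug-in Gaussian LDA rule just described can be sketched as follows (our own illustration, not code from the paper); the ridge term eps is an assumption used here only to keep the pooled covariance estimate invertible.

```python
import numpy as np

def gaussian_lda_fit_predict(X, y, X_test, eps=1e-8):
    """Plug-in LDA: class means, pooled within-class covariance, priors, then Bayes' rule."""
    n, p = X.shape
    classes = np.unique(y)
    mu = np.vstack([X[y == k].mean(axis=0) for k in classes])        # K x p class means
    Sw = sum((X[y == k] - mu[i]).T @ (X[y == k] - mu[i])
             for i, k in enumerate(classes)) / n                      # pooled within-class covariance
    Sw_inv = np.linalg.inv(Sw + eps * np.eye(p))                      # eps: assumed ridge for invertibility
    priors = np.array([np.mean(y == k) for k in classes])
    # Linear discriminant scores: x^T Sw^{-1} mu_k - 0.5 mu_k^T Sw^{-1} mu_k + log(pi_k)
    scores = X_test @ Sw_inv @ mu.T \
             - 0.5 * np.einsum('ij,ij->i', mu @ Sw_inv, mu) + np.log(priors)
    return classes[np.argmax(scores, axis=1)]
```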
LDA can also be derived from Fisher’s discriminant problem. We define the between-class covariance matrix $\Sigma_b = \sum_{k=1}^{K}\pi_k \mu_k\mu_k^\top$, where $\pi_k$ is the prior probability of class k. Fisher’s discriminant problem solves for the discriminant vectors $\beta_1,\dots,\beta_{K-1}$, which sequentially maximize the between-class variance relative to the within-class variance. The corresponding optimization problem is
\[
\underset{\beta_k}{\operatorname{maximize}}\ \beta_k^\top \Sigma_b \beta_k \quad \text{subject to}\quad \beta_k^\top \Sigma_w \beta_k \leqslant 1,\ \ \beta_k^\top \Sigma_w \beta_j = 0\ \ \text{for all } j<k.
\]
Because the rank of $\Sigma_b$ is at most K−1, the above generalized eigenproblem has at most K−1 nontrivial solutions and, therefore, at most K−1 discriminant vectors. These solutions are the directions in which the data have maximal between-class variance relative to their within-class variance. Moreover, it has been shown that nearest-centroid classification based on the matrix $(X\beta_1,\dots,X\beta_{K-1})$ produces the same classification rule as the multivariate Gaussian model of LDA described previously (see Hastie et al.[1]). One advantage of Fisher’s discriminant problem over the multivariate Gaussian model of LDA is that it allows for reduced-rank classification by performing nearest-centroid classification on the matrix $(X\beta_1,\dots,X\beta_q)$ with q<K−1; this is exactly equivalent to performing full-rank LDA on that n×q matrix.
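As a small companion sketch (ours, not code from the paper), nearest-centroid classification in the space of the projected data $(X\beta_1,\dots,X\beta_q)$ can be written as:

```python
import numpy as np

def nearest_centroid_predict(Z_train, y_train, Z_test):
    """Nearest-centroid rule on projected data Z = X @ B, where B holds the discriminant vectors."""
    classes = np.unique(y_train)
    centroids = np.vstack([Z_train[y_train == k].mean(axis=0) for k in classes])
    # Squared Euclidean distance from each test point to each class centroid.
    dists = ((Z_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dists, axis=1)]
```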
The standard estimate of the within-class covariance matrix Σw is:
\[
\hat{\Sigma}_w = \frac{1}{n}\sum_{k=1}^{K}\sum_{i\in C_k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^\top, \tag{1}
\]
where ˆμk denotes the sample mean vector for class k. In this subsection, we assume that ˆΣw is nonsingular. Furthermore, the standard estimate for the between-class covariance matrix Σb is given by:
\[
\hat{\Sigma}_b = \frac{1}{n}X^\top X - \hat{\Sigma}_w = \frac{1}{n}\sum_{k=1}^{K} n_k\,\hat{\mu}_k\hat{\mu}_k^\top,
\]
where $n_k = |C_k|$.
In later subsections, we will make use of the fact that
\[
\hat{\Sigma}_b = \frac{1}{n}X^\top Y (Y^\top Y)^{-1} Y^\top X,
\]
where Y is an n×K matrix and Yik is an indicator of whether observation i is in the kth class. Then, the empirical version of LDA can be written as
\[
\underset{\beta_k}{\operatorname{maximize}}\ \beta_k^\top \hat{\Sigma}_b \beta_k \quad \text{subject to}\quad \beta_k^\top \hat{\Sigma}_w \beta_k \leqslant 1,\ \ \beta_k^\top \hat{\Sigma}_w \hat{\beta}_j = 0\ \ \text{for all } j<k. \tag{2}
\]
Problem (2) is commonly written with an equality constraint instead of an inequality constraint, but the two are equivalent when $\hat{\Sigma}_w$ has full rank, as detailed in Ref. [9, Appendix A]. The solution $\hat{\beta}_k$ to problem (2) is generally referred to as the kth discriminant vector. Problem (2) can be solved through the variable substitution $\tilde{\beta}_k = \hat{\Sigma}_w^{1/2}\beta_k$, where $\hat{\Sigma}_w^{1/2}$ is the symmetric matrix square root of $\hat{\Sigma}_w$; Fisher’s discriminant problem then reduces to a standard eigenproblem. However, when p>n, $\hat{\Sigma}_w$ becomes singular, and any discriminant vector lying in the null space of $\hat{\Sigma}_w$ but not in the null space of $\hat{\Sigma}_b$ yields an arbitrarily large value of the objective function.
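The following minimal numpy sketch (our own illustration, not code from the paper) computes the empirical quantities above and solves the full-rank Fisher problem through the substitution $\tilde{\beta} = \hat{\Sigma}_w^{1/2}\beta$; the small ridge term eps is an assumption that keeps the square root invertible.

```python
import numpy as np

def fisher_lda_directions(X, y, q, eps=1e-8):
    """Full-rank Fisher LDA: X is n x p (columns centered), y holds class labels."""
    n, p = X.shape
    classes, idx = np.unique(y, return_inverse=True)
    Y = np.eye(len(classes))[idx]                      # n x K indicator matrix
    Sigma_b = X.T @ Y @ np.linalg.inv(Y.T @ Y) @ Y.T @ X / n
    Sigma_w = X.T @ X / n - Sigma_b                    # total covariance = within + between
    # Symmetric square root of Sigma_w (ridge eps assumed for invertibility).
    vals, vecs = np.linalg.eigh(Sigma_w + eps * np.eye(p))
    W_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    M = W_inv_half @ Sigma_b @ W_inv_half              # standard symmetric eigenproblem
    evals, evecs = np.linalg.eigh(M)
    B_tilde = evecs[:, ::-1][:, :q]                    # top q eigenvectors
    return W_inv_half @ B_tilde                        # back-transform: beta = Sigma_w^{-1/2} beta_tilde
```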
To address this singularity problem, several modifications of Fisher’s discriminant problem have been proposed. Krzanowski et al.[11] considered a modification of problem (2) that seeks a unit vector β maximizing $\beta^\top\hat{\Sigma}_b\beta$ subject to $\beta^\top\hat{\Sigma}_w\beta = 0$. Tebbens and Schlesinger[12] further required that the solution not lie in the null space of $\hat{\Sigma}_b$. It has also been proposed to modify problem (2) using positive definite estimates of $\Sigma_w$. For example, Friedman[5], Dudoit et al.[4], and Bickel and Levina[3] considered the use of a diagonal estimate
\[
\tilde{\Sigma}_w = \mathrm{diag}(\hat{\sigma}_1^2, \dots, \hat{\sigma}_p^2),
\]
where $\hat{\sigma}_j^2$ is the jth diagonal element of $\hat{\Sigma}_w$ in Eq. (1). Other forms of positive definite estimates of $\Sigma_w$ are suggested by Xu et al.[6]. Given a positive definite estimate $\tilde{\Sigma}_w$, the resulting optimization problem is
\[
\underset{\beta_k}{\operatorname{maximize}}\ \beta_k^\top \hat{\Sigma}_b \beta_k \quad \text{subject to}\quad \beta_k^\top \tilde{\Sigma}_w \beta_k \leqslant 1,\ \ \beta_k^\top \tilde{\Sigma}_w \hat{\beta}_j = 0\ \ \text{for all } j<k. \tag{3}
\]
The new optimization problem (3) addresses the singularity issue but not the interpretability issue. We therefore extend problem (3) so that the resulting discriminant vectors are interpretable. To this end, we use Lemma 2.1, which reformulates problem (3) while yielding the same solution.
Lemma 2.1. The solution $\hat{\beta}_k$ to problem (3) is equivalent to the solution to the following problem:
\[
\underset{\beta_k}{\operatorname{maximize}}\ \beta_k^\top \hat{\Sigma}_b^k \beta_k \quad \text{subject to}\quad \beta_k^\top \tilde{\Sigma}_w \beta_k \leqslant 1,
\]
where
\[
\hat{\Sigma}_b^k = \frac{1}{n}X^\top Y (Y^\top Y)^{-1/2} P_k^\perp (Y^\top Y)^{-1/2} Y^\top X.
\]
Here, $P_k^\perp$ is defined as follows: $P_1^\perp = I$, and for $k>1$, $P_k^\perp$ is the orthogonal projection matrix onto the space orthogonal to $(Y^\top Y)^{-1/2} Y^\top X \hat{\beta}_i$ for all $i<k$.
The proof of Lemma 2.1 can be found in Ref. [9]. Once the discriminant vectors $\hat{\beta}_1,\dots,\hat{\beta}_k$ have been obtained, we replace $\hat{\Sigma}_b^k$ with $\hat{\Sigma}_b^{k+1}$ and repeat the same procedure used to compute $\hat{\beta}_1$, until all the discriminant vectors are obtained.
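The deflation of Lemma 2.1 is straightforward to implement. The following sketch (our illustration, not the authors' code) forms $\hat{\Sigma}_b^k$ by projecting out the directions $(Y^\top Y)^{-1/2}Y^\top X\hat{\beta}_i$ associated with the previously computed discriminant vectors, passed in as a list betas_prev.

```python
import numpy as np

def deflated_between_cov(X, Y, betas_prev):
    """Sigma_b^k of Lemma 2.1: deflate the between-class covariance by previous directions."""
    n, _ = X.shape
    K = Y.shape[1]
    counts = Y.sum(axis=0).astype(float)
    M = np.diag(counts ** -0.5) @ Y.T @ X                   # (Y^T Y)^{-1/2} Y^T X, a K x p matrix
    P = np.eye(K)                                           # P_1^perp = I when no directions yet
    if betas_prev:
        U = np.column_stack([M @ b for b in betas_prev])    # directions (Y^T Y)^{-1/2} Y^T X beta_i
        Q, _ = np.linalg.qr(U)
        P = np.eye(K) - Q @ Q.T                             # projection onto their orthogonal complement
    return M.T @ P @ M / n
```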
2.2
Sparse linear discriminant analysis with the ℓ0 constraint
In this study, we modify problem (3) by imposing an ℓ0 constraint on the discriminant vectors. Given a pre-specified degree of sparsity s, we define the kth sparse discriminant vector $\hat{\beta}_k$ as the solution to the following optimization problem:
\[
\underset{\beta_k}{\operatorname{maximize}}\ \beta_k^\top \hat{\Sigma}_b^k \beta_k \quad \text{subject to}\quad \beta_k^\top \tilde{\Sigma}_w \beta_k \leqslant 1,\ \ \|\beta_k\|_0 \leqslant s. \tag{4}
\]
In practice, we work with the Lagrangian-type formulation of problem (4),
\[
\underset{\beta}{\operatorname{minimize}}\ L(\beta) = -\beta^\top \hat{\Sigma}_b^k \beta + \lambda\,(\beta^\top \tilde{\Sigma}_w \beta - 1) \quad \text{subject to}\quad \|\beta\|_0 = s, \tag{5}
\]
where $\lambda>0$ is a parameter that controls the normalization of the discriminant vector $\beta$ with respect to $\tilde{\Sigma}_w$. We denote by $\beta^\diamond$ a coordinate-wise minimizer of problem (5). If the active set $\mathcal{A}^\diamond = \{j : \beta^\diamond_j \neq 0\}$ were known, then, disregarding the constraint $\|\beta\|_0 = s$, we could minimize the objective function $L(\beta)$ without the ℓ0 constraint.
To determine a valid active set $\mathcal{A}$, we introduce the sacrifice $\Delta = \{\Delta_j : j=1,\dots,p\}$, similar to that in Ref. [10]. Specifically, the jth sacrifice $\Delta_j$ measures the change in the value of the objective function $L(\beta)$ when the current $\hat{\beta}_j$ is set to 0 in each round of iterations. Consider the mth iteration. Fixing the other coordinates at their current values, the new marginal optimum of $\beta_j$ is given by $\beta_j^{(m+1)} = \beta_j^{(m)} + d_j^{(m)}$, where
\[
d_j^{(m)} = -\left.\frac{\partial L(\beta)/\partial \beta_j}{\partial^2 L(\beta)/\partial \beta_j^2}\right|_{\beta=\beta^{(m)}} = \frac{\hat{\Sigma}_{j\cdot}\beta^{(m)} - \lambda\hat{\sigma}_j^2\beta_j^{(m)}}{\lambda\hat{\sigma}_j^2 - \hat{\Sigma}_{jj}}
\]
denotes the dual variable of $\beta_j^{(m)}$, $\hat{\Sigma}_{j\cdot}$ denotes the jth row of $\hat{\Sigma}_b^k$, and $\hat{\Sigma}_{jj}$ denotes the $(j,j)$ element of $\hat{\Sigma}_b^k$. Here, the superscript “(m)” indicates the value taken at the mth iteration. The sacrifice $\Delta_j$ is then defined as $\Delta_j^{(m)} = (\lambda\hat{\sigma}_j^2 - \hat{\Sigma}_{jj})(\beta_j^{(m)} + d_j^{(m)})^2$, which approximates the change in the loss. Denote the inactive set $\mathcal{I}^\diamond = (\mathcal{A}^\diamond)^c = \{j : \beta_j^\diamond = 0\}$. By the primal-dual condition, we have the following:
● If $j\in\mathcal{A}^\diamond$, then $\beta_j^\diamond \neq 0$ and $d_j^\diamond = 0$.
● If $j\in\mathcal{I}^\diamond$, then $\beta_j^\diamond = 0$ and $d_j^\diamond = (\hat{\Sigma}_{j\cdot}\beta^\diamond - \lambda\hat{\sigma}_j^2\beta_j^\diamond)/(\lambda\hat{\sigma}_j^2 - \hat{\Sigma}_{jj})$.
The purpose of problem (5) is to minimize the objective function $L(\beta)$, which implies that the sacrifices of the discarded coordinates should be as small as possible. That is, among all candidates, the coordinates with the smallest sacrifices are set to zero. To achieve this, we arrange $\{\Delta_j, j=1,\dots,p\}$ in decreasing order, $\Delta_{[1]} \geqslant \Delta_{[2]} \geqslant \cdots \geqslant \Delta_{[p]}$, where $\Delta_{[i]}$ denotes the ith largest value among $\{\Delta_j, j=1,\dots,p\}$. We then truncate the ordered sacrifice vector at position s, that is, we set the estimate of the active set to $\mathcal{A} = \{j: \Delta_j \geqslant \Delta_{[s]}\}$.
When $\mathcal{A}$ is given, we can extract the corresponding rows and columns of $\hat{\Sigma}_b^k$ and $\tilde{\Sigma}_w$, and the problem reduces to a classical LDA problem. Its solution is obtained by setting $\beta_{\mathcal{A}}$ to the first eigenvector of $(\tilde{\Sigma}_w^{-1/2}\hat{\Sigma}_b^k\tilde{\Sigma}_w^{-1/2})_{\mathcal{A}\mathcal{A}}$ and $\beta_{\mathcal{I}} = 0$. The above discussion is summarized in Algorithm 2.1.
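The sacrifice-and-truncate update just described can be sketched as follows (our own illustration; Sigma_bk stands for $\hat{\Sigma}_b^k$, sigma2 for the diagonal of $\tilde{\Sigma}_w$, and λ is assumed large enough that the denominators are positive):

```python
import numpy as np

def naive_iteration(Sigma_bk, sigma2, beta, lam, s):
    """One sacrifice-and-truncate update of the naive sparse LDA algorithm."""
    p = len(beta)
    # lam is assumed large enough that lam * sigma2_j > Sigma_jj for all j.
    denom = lam * sigma2 - np.diag(Sigma_bk)
    d = (Sigma_bk @ beta - lam * sigma2 * beta) / denom     # dual variables d_j
    delta = denom * (beta + d) ** 2                         # sacrifices Delta_j
    A = np.sort(np.argsort(delta)[::-1][:s])                # keep the s largest sacrifices
    # Restricted problem on A with the diagonal within-class estimate.
    W_inv_half = np.diag(sigma2[A] ** -0.5)
    M = W_inv_half @ Sigma_bk[np.ix_(A, A)] @ W_inv_half
    _, vecs = np.linalg.eigh(M)
    beta_new = np.zeros(p)
    beta_new[A] = vecs[:, -1]                               # leading eigenvector on the active set
    return beta_new, A, delta
```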
2.3
Best subset selection
Owing to the discontinuity of the ℓ0 norm, the naive algorithm for sparse linear discriminant analysis may converge to a local minimum and encounter the problem of periodic iterations. To obtain sparse linear discriminant vectors more efficiently, we propose in this subsection an iterative algorithm based on the primal-dual condition of problem (5). Motivated by Zhu et al.[10], who developed a novel algorithm based on exchanging active sets to avoid periodic iterations when solving the best subset selection problem in linear regression, we extend their work to the sparse LDA problem. An exchange step is added to Algorithm 2.1 to prevent the algorithm from entering a loop; that is, a subset of the active set is exchanged with a subset of the inactive set, and we then decide whether to adopt the new candidate solution by comparing the objective function values before and after the exchange.
Specifically, at the mth iteration of computing the kth discriminant vector in Algorithm 2.1, we obtain the sacrifice $\Delta^{(m)}$. We then identify the variables to be exchanged between the active and inactive sets.
● Subset of the active set to be moved to the inactive set:
\[
\mathcal{B}_{c,1}^{(m)} = \Big\{ j\in\mathcal{A}^{(m)} : \sum_{i\in\mathcal{A}^{(m)}} \mathbb{I}\big(\Delta_j^{(m)} \geqslant \Delta_i^{(m)}\big) \leqslant c \Big\}.
\]
● Subset of the inactive set to be moved to the active set:
\[
\mathcal{B}_{c,2}^{(m)} = \Big\{ j\in\mathcal{I}^{(m)} : \sum_{i\in\mathcal{I}^{(m)}} \mathbb{I}\big(\Delta_j^{(m)} \leqslant \Delta_i^{(m)}\big) \leqslant c \Big\},
\]
where c is an integer ranging from 1 to min{s, p−s}.
We then form the candidate active and inactive sets $\mathcal{A}_c^{(m)} = (\mathcal{A}^{(m)}\setminus\mathcal{B}_{c,1}^{(m)})\cup\mathcal{B}_{c,2}^{(m)}$ and $\mathcal{I}_c^{(m)} = (\mathcal{I}^{(m)}\setminus\mathcal{B}_{c,2}^{(m)})\cup\mathcal{B}_{c,1}^{(m)}$. Given $\mathcal{A}_c^{(m)}$ and $\mathcal{I}_c^{(m)}$, we define $\tilde{\beta}_c^{(m)}$ as the solution to problem (5) restricted to $\mathcal{A}_c^{(m)}$, which is given by
\[
\tilde{\beta}_c^{(m)} = \underset{\beta_{\mathcal{I}_c^{(m)}} = 0,\ \beta^\top \tilde{\Sigma}_w \beta = 1}{\arg\max}\ \beta^\top \hat{\Sigma}_b^k \beta.
\]
Furthermore, for accuracy, we search c over 1 to min{s, p−s} and select the candidate with the smallest value of $L(\beta)$, that is, $\hat{c} = \arg\min_c L(\tilde{\beta}_c^{(m)})$. Algorithm 2.2 displays the process of exchanging variables between the active and inactive sets.
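A sketch of the exchange step is given below (ours, not the authors' implementation); it reuses a small helper for the restricted eigenproblem of Algorithm 2.1, enumerates c, and keeps the candidate with the smallest objective value, falling back to the current estimate when no exchange improves it. A and I are assumed to be integer index arrays for the active and inactive sets.

```python
import numpy as np

def restricted_leading_vector(Sigma_bk, sigma2, A, p):
    """Leading eigenvector of the diagonally whitened Sigma_b^k restricted to the active set A."""
    W_inv_half = np.diag(sigma2[A] ** -0.5)
    M = W_inv_half @ Sigma_bk[np.ix_(A, A)] @ W_inv_half
    _, vecs = np.linalg.eigh(M)
    beta = np.zeros(p)
    beta[A] = vecs[:, -1]                              # eigh sorts ascending; take the last column
    return beta

def exchange_step(Sigma_bk, sigma2, lam, beta, delta, A, I, s):
    """Swap c low-sacrifice active variables with c high-sacrifice inactive ones, keep the best."""
    p = len(beta)

    def loss(b):
        # L(beta) = -beta^T Sigma_b^k beta + lam * (beta^T diag(sigma2) beta - 1)
        return -b @ Sigma_bk @ b + lam * (b @ (sigma2 * b) - 1.0)

    best_beta, best_A, best_loss = beta, A, loss(beta)
    for c in range(1, min(s, p - s) + 1):
        B1 = A[np.argsort(delta[A])[:c]]               # c smallest sacrifices in the active set
        B2 = I[np.argsort(delta[I])[::-1][:c]]         # c largest sacrifices in the inactive set
        A_c = np.sort(np.concatenate([np.setdiff1d(A, B1), B2]))
        beta_c = restricted_leading_vector(Sigma_bk, sigma2, A_c, p)
        if loss(beta_c) < best_loss:
            best_beta, best_A, best_loss = beta_c, A_c, loss(beta_c)
    return best_beta, best_A
```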
Based on the exchange step proposed above, we develop an efficient algorithm that avoids getting trapped in a local minimum and guarantees convergence. Specifically, we add an exchange step after updating the primal and dual variables in Algorithm 2.1. The details are presented in Algorithm 2.3, which is called SLDA-BESS.
Algorithm 2.3. SLDA-BESS for sparse LDA
Input: data matrix X, indicator matrix Y, sparsity s, λ.
Output: $\beta_1,\dots,\beta_q$.
1: $\hat{\Sigma}_b = n^{-1}X^\top Y(Y^\top Y)^{-1}Y^\top X$, $\hat{\Sigma}_w = X^\top X/n - \hat{\Sigma}_b$, and $\tilde{\Sigma}_w = \mathrm{diag}\{\hat{\Sigma}_w\}$.
2: Initialize $\beta^{(0)}$ with normalization $\beta^{(0)\top}\tilde{\Sigma}_w\beta^{(0)} = 1$.
3: For $k=1,\dots,q$: run the iterations of Algorithm 2.1 for $\hat{\Sigma}_b^k$, applying the exchange step of Algorithm 2.2 after each update of the primal and dual variables, until convergence or a maximum number of iterations is reached.
The improved Algorithm 2.3 has the theoretical guarantee of Theorem 2.1, which indicates that such a best subset selection algorithm incorporating the exchange step is feasible.
Theorem 2.1. The SLDA-BESS algorithm terminates after a finite number of iterations.
Proof. The objective value improves monotonically across iterations, and the number of possible active sets is finite. Hence, Algorithm 2.3 terminates after a finite number of iterations.
3.
Numeric studies
We compared SLDA-BESS with penalized LDA-ℓ1[9], nearest shrunken centroids (NSC)[7], and shrunken centroids regularized discriminant analysis (RDA)[8] in simulation studies. LDA-ℓ1 adds an ℓ1 penalty to the Fisher discriminant objective $\beta^\top\hat{\Sigma}_b\beta$; NSC is a simple modification of the nearest centroid method that standardizes the centroid distances by a per-feature standard deviation; and RDA combines a ridge-type penalty on the within-class covariance matrix with soft thresholding, building on NSC. In each simulation, 1200 observations were generated, divided equally among several classes. A random subset of 300 of these 1200 observations was used as the training set, and the remaining 900 formed the test set. Each simulation consisted of measurements of 500 features, i.e., p = 500.
Simulation 1. Consider four classes in which the features are independent of each other and the means are shifted. Given the index sets of the four classes, $C_1$, $C_2$, $C_3$, and $C_4$, $x_i \sim N(\mu_k, I)$ if $i\in C_k$, where $\mu_{1,j}=0.7$ for $1\leqslant j\leqslant 25$, $\mu_{2,j}=0.7$ for $26\leqslant j\leqslant 50$, $\mu_{3,j}=0.7$ for $51\leqslant j\leqslant 75$, $\mu_{4,j}=0.7$ for $76\leqslant j\leqslant 100$, and $\mu_{k,j}=0$ otherwise for $j=1,\dots,500$.
Simulation 2. Consider two classes in which the features are dependent on each other and the means are shifted. Given the index sets of the two classes, $C_1$ and $C_2$, $x_i \sim N(0,\Sigma)$ for $i\in C_1$ and $x_i \sim N(\mu,\Sigma)$ for $i\in C_2$, where $\mu_j = 0.6$ if $j\leqslant 100$ and $\mu_j=0$ otherwise, and $\Sigma = (\Sigma_{ij}) = (0.6^{|i-j|})$. The covariance structure $\Sigma$ is intended to mimic gene expression data, in which genes are positively correlated within a pathway and independent between pathways.
Simulation 3. Consider four classes in which the features are independent of each other and the mean structure is one-dimensional. Given the four index sets $C_1$, $C_2$, $C_3$, and $C_4$, for $i\in C_k$, $x_{ij} \sim N((k-1)/3, 1)$ if $j\leqslant 100$ and $x_{ij}\sim N(0,1)$ otherwise. A one-dimensional projection of the data fully captures the class structure.
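As an illustration of the simulation designs (our own sketch, with a hypothetical random seed), data for Simulation 1 can be generated as follows:

```python
import numpy as np

def simulate_setting_1(n_per_class=300, p=500, shift=0.7, seed=0):
    """Simulation 1: four classes, identity covariance, block mean shifts of 25 features each."""
    rng = np.random.default_rng(seed)
    K = 4
    mu = np.zeros((K, p))
    for k in range(K):
        mu[k, 25 * k:25 * (k + 1)] = shift              # class k shifts features 25k+1,...,25(k+1)
    X = np.vstack([rng.standard_normal((n_per_class, p)) + mu[k] for k in range(K)])
    y = np.repeat(np.arange(K), n_per_class)
    return X, y
```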
For each method, models were fitted to the training set over a range of values of the tuning parameters, and the tuning parameter values were chosen to minimize the error on the validation set. The selected models were then evaluated on the test set. After obtaining the estimated discriminant vectors, we applied the KNN method to the projected data to perform classification. This process was repeated 200 times.
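A minimal version of this evaluation protocol could look like the sketch below (our illustration); fit_discriminant is a hypothetical placeholder for any of the compared methods, and the default of 1 nearest neighbor is an assumption, since the paper does not specify k.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def tune_and_evaluate(fit_discriminant, grid, X_tr, y_tr, X_val, y_val, X_te, y_te, k=1):
    """Choose the tuning parameter by validation error, then classify projected test data with KNN."""
    best_err, best_B = np.inf, None
    for param in grid:
        B = fit_discriminant(X_tr, y_tr, param)            # p x q matrix of discriminant vectors
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr @ B, y_tr)
        err = np.mean(knn.predict(X_val @ B) != y_val)
        if err < best_err:
            best_err, best_B = err, B
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr @ best_B, y_tr)
    test_err = np.mean(knn.predict(X_te @ best_B) != y_te)
    n_features = int(np.sum(np.any(best_B != 0, axis=1)))  # number of non-zero features used
    return test_err, n_features
```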
The classification error on the test set and the number of non-zero features used by the discriminant vectors are listed in Table 1. In the “errors” row of each simulation, the average number of misclassified observations on the test set of 900 observations is presented, with the standard deviation in parentheses. Correspondingly, the “features” row reports the average number of non-zero features used in estimating the discriminant vectors, with the standard deviation in parentheses. As shown in Table 1, our method achieves a smaller error when the numbers of features used are comparable. Moreover, the tuning parameters are easier to adjust with the ℓ0 penalty method.
Algorithm 2.1. Naive algorithm for sparse linear discriminant analysis
Input: data matrix X, indicator matrix Y, sparsity s, λ.
Output: $\hat{\beta}_1,\dots,\hat{\beta}_q$.
1: $\hat{\Sigma}_b = n^{-1}X^\top Y(Y^\top Y)^{-1}Y^\top X$.
2: $\hat{\Sigma}_w = X^\top X/n - \hat{\Sigma}_b$ and $\tilde{\Sigma}_w = \mathrm{diag}\{\hat{\Sigma}_w\}$.
3: for $k=1,\dots,q$ do
4: (a) $\hat{\Sigma}_b^k = n^{-1}X^\top Y(Y^\top Y)^{-1/2}P_k^\perp(Y^\top Y)^{-1/2}Y^\top X$.
5: (b) Initialize $\beta_k$ with normalization $\beta_k^\top\tilde{\Sigma}_w\beta_k = 1$.
6: (c) Iterate until convergence or a maximum number of iterations is reached:
· $d_j = (\hat{\Sigma}_{j\cdot}\beta_k - \lambda\hat{\sigma}_j^2\beta_{k,j})/(\lambda\hat{\sigma}_j^2 - \hat{\Sigma}_{jj})$ for $j=1,\dots,p$;
· $\Delta_j = (\lambda\hat{\sigma}_j^2 - \hat{\Sigma}_{jj})(\beta_{k,j} + d_j)^2$, $\mathcal{A} = \{j: \Delta_j \geqslant \Delta_{[s]}\}$, and $\mathcal{I} = \mathcal{A}^c$;
· $\beta_{k,\mathcal{A}}$ is the first eigenvector of $(\tilde{\Sigma}_w^{-1/2}\hat{\Sigma}_b^k\tilde{\Sigma}_w^{-1/2})_{\mathcal{A}\mathcal{A}}$ and $\beta_{k,\mathcal{I}} = 0$.
4.
Applications to real data
This section compares our SLDA-BESS method with three existing methods: penalized LDA-ℓ1, NSC, and RDA. We applied the four methods to three gene expression datasets.
Ramaswamy dataset[13]: A dataset consisting of 16063 gene expression measurements and 198 samples belonging to 14 distinct cancer subtypes.
Nakayama dataset[14]: A dataset consisting of 105 samples from 10 types of soft tissue tumors, with 22283 gene expression measurements per sample. We limit the analysis to the five tumor types with at least 15 samples each, resulting in a subset of 86 samples.
Sun dataset[15]: A dataset consisting of 180 samples and 54613 expression measurements. The samples fall into four classes: one non-tumor class and three glioma classes.
Each dataset was split into a training set containing 75% of the samples and a test set containing the remaining 25%. A total of 100 replications were performed, each with randomly selected training and test sets. The results are presented in Table 2. They suggest that the four methods perform roughly equally in terms of error, but our SLDA-BESS method uses fewer features. Moreover, our SLDA-BESS method is better suited to markedly sparse models, whereas the other three methods may fail when restricted to only a few features.
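The repeated 75/25 split protocol is straightforward to express; the sketch below is our own illustration, where evaluate is a hypothetical callback that fits a method on the training part and returns its test error, and stratified splitting is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def repeated_split_evaluation(X, y, evaluate, n_reps=100, test_size=0.25, seed=0):
    """Repeat a random 75/25 train/test split and record the error of a given method."""
    errors = []
    for r in range(n_reps):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + r)
        errors.append(evaluate(X_tr, y_tr, X_te, y_te))   # evaluate: hypothetical callback
    return float(np.mean(errors)), float(np.std(errors))
```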
Algorithm 2.2. Exchange step
Input: $\hat{\Sigma}_b^k$, $\tilde{\Sigma}_w$, $\lambda$, current estimate $\beta$, sacrifice $\Delta$, active set $\mathcal{A}$, inactive set $\mathcal{I}$, sparsity s.
Output: discriminant vector $\hat{\beta}$, dual variables d, and sacrifice $\Delta$.
1: $L = -\beta^\top\hat{\Sigma}_b^k\beta + \lambda(\beta^\top\tilde{\Sigma}_w\beta - 1)$.
2: for $c=1,\dots,\min\{s, p-s\}$ do
3: $\mathcal{B}_{c,1} = \{j\in\mathcal{A}: \sum_{i\in\mathcal{A}}\mathbb{I}(\Delta_j\geqslant\Delta_i)\leqslant c\}$ and $\mathcal{B}_{c,2} = \{j\in\mathcal{I}: \sum_{i\in\mathcal{I}}\mathbb{I}(\Delta_j\leqslant\Delta_i)\leqslant c\}$.
4: $\mathcal{A}_c = (\mathcal{A}\setminus\mathcal{B}_{c,1})\cup\mathcal{B}_{c,2}$ and $\mathcal{I}_c = (\mathcal{I}\setminus\mathcal{B}_{c,2})\cup\mathcal{B}_{c,1}$.
5: Calculate $\tilde{\beta}_c$ based on $\mathcal{A}_c$, and $\tilde{L}_c = -\tilde{\beta}_c^\top\hat{\Sigma}_b^k\tilde{\beta}_c + \lambda(\tilde{\beta}_c^\top\tilde{\Sigma}_w\tilde{\beta}_c - 1)$.
6: end for
7: Set $\hat{c} = \arg\min_c \tilde{L}_c$; if $\tilde{L}_{\hat{c}} < L$, update $\hat{\beta} = \tilde{\beta}_{\hat{c}}$, $\mathcal{A} = \mathcal{A}_{\hat{c}}$, and $\mathcal{I} = \mathcal{I}_{\hat{c}}$; otherwise keep $\hat{\beta} = \beta$.
8: Update the dual variables d and the sacrifice $\Delta$ based on $\hat{\beta}$.
5.
Conclusions
Linear discriminant analysis is a commonly used classification method. However, it fails when the number of features is large relative to the number of observations. This study extended LDA to a high-dimensional setting by adding an ℓ0 penalty to produce discriminant vectors involving only a subset of the features. Our extension is based on Fisher’s discriminant problem, formulated as a generalized eigenproblem, and uses the SLDA-BESS algorithm, which combines the naive algorithm with the exchange step, to solve for the sparse discriminant vectors. The resulting sparse discriminant vectors make the classifier more interpretable in practical settings. The results of both the numerical simulations and the analysis of real data demonstrate the superior performance of our SLDA-BESS method. The convergence rate and complexity of the SLDA-BESS algorithm, as well as theoretical properties such as minimax lower and upper bounds for the estimator, can be investigated in future work.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (71771203).
Conflict of interest
The authors declare that they have no conflict of interest.
[1]
Hastie T, Tibshirani R, Friedman J H, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Berlin: Springer, 2009.
[2]
Shao J, Wang Y, Deng X, et al. Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of Statistics,2011, 39 (2): 1241–1265. DOI: 10.1214/10-AOS870
[3]
Bickel P, Levina E. Some theory of Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli,2004, 10 (6): 989–1010. DOI: 10.3150/bj/1106314847
[4]
Dudoit S, Fridlyand J, Speed T P. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association,2002, 97 (457): 77–87. DOI: 10.1198/016214502753479248
[5]
Friedman J H. Regularized discriminant analysis. Journal of the American Statistical Association,1989, 84 (405): 165–175. DOI: 10.1080/01621459.1989.10478752
[6]
Xu P, Brock G N, Parrish R S. Modified linear discriminant analysis approaches for classification of high-dimensional microarray data. Computational Statistics & Data Analysis,2009, 53 (5): 1674–1687. DOI: 10.1016/j.csda.2008.02.005
[7]
Tibshirani R, Hastie T, Narasimhan B, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America,2002, 99 (10): 6567–6572. DOI: 10.1073/pnas.082099299
[8]
Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics,2007, 8 (1): 86–100. DOI: 10.1093/biostatistics/kxj035
[9]
Witten D M, Tibshirani R. Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society:Series B (Statistical Methodology),2011, 73 (5): 753–772. DOI: 10.1111/j.1467-9868.2011.00783.x
[10]
Zhu J, Wen C, Zhu J, et al. A polynomial algorithm for best-subset selection problem. Proceedings of the National Academy of Sciences of the United States of America,2020, 117 (52): 33117–33123. DOI: 10.1073/pnas.2014241117
[11]
Krzanowski W, Jonathan P, McCarthy W, et al. Discriminant analysis with singular covariance matrices: Methods and applications to spectroscopic data. Journal of the Royal Statistical Society:Series C (Applied Statistics),1995, 44 (1): 101–115. DOI: 10.2307/2986198
[12]
Tebbens J D, Schlesinger P. Improving implementation of linear discriminant analysis for the high dimension/small sample size problem. Computational Statistics & Data Analysis,2007, 52 (1): 423–437. DOI: 10.1016/j.csda.2007.02.001
[13]
Ramaswamy S, Tamayo P, Rifkin R, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences of the United States of America,2001, 98 (26): 15149–15154. DOI: 10.1073/pnas.211566398
[14]
Nakayama R, Nemoto T, Takahashi H, et al. Gene expression analysis of soft tissue sarcomas: Characterization and reclassification of malignant fibrous histiocytoma. Modern Pathology,2007, 20 (7): 749–759. DOI: 10.1038/modpathol.3800794
[15]
Sun L, Hui A M, Su Q, et al. Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain. Cancer Cell,2006, 9 (4): 287–300. DOI: 10.1016/j.ccr.2006.03.003