Jingyu Shao is currently a graduate student under the supervision of Professor Zemin Zheng at the University of Science and Technology of China. His research interests focus on statistical learning and variable selection.
Ruipeng Dong is currently a postdoctoral researcher at the Chinese University of Hong Kong, Shenzhen. He received his PhD degree in Statistics from the University of Science and Technology of China. His research interests focus on statistical learning and variable selection.
Graphical Abstract
A penalized likelihood approach is proposed in which the low-rank and sparsity structures are considered simultaneously. A new algorithm, sparse factored gradient descent (SFGD), is proposed to estimate the parameter matrix.
Abstract
The data-driven conditional multinomial logit choice model with customer features performs well in the assortment personalization problem when the low-rank structure of the parameter matrix is considered. However, despite recent theoretical and algorithmic advances, parameter estimation in the choice model still poses a challenging task, especially when there are more predictors than observations. For this reason, we suggest a penalized likelihood approach based on a feature matrix to recover the sparse structure from populations and products toward the assortment. Our proposed method simultaneously considers low-rank and sparsity structures, which can further reduce model complexity and improve estimation and prediction accuracy. A new algorithm, sparse factored gradient descent (SFGD), is proposed to estimate the parameter matrix, which has high interpretability and efficient computing performance. As a first-order method, SFGD works well in high-dimensional scenarios because it does not require the Hessian matrix. Simulation studies show that the SFGD algorithm outperforms state-of-the-art methods in terms of estimation, sparsity recovery, and average regret. We also demonstrate the effectiveness of our proposed method using advertising behavior data analysis.
Public Summary
The data-driven conditional multinomial logit choice model with customer features performs well in the assortment personalization problem when a low-rank structure of the parameter matrix is considered.
Our proposed method considers both low-rank and sparsity structures, which can further reduce model complexity and improve estimation and prediction accuracy.
A new algorithm, sparse factored gradient descent (SFGD), is proposed to estimate the parameter matrix; it enjoys high interpretability and efficient computing performance.
As an important part of revenue management, assortment planning has a wide range of applications in retail, advertising, and e-commerce. Personalization techniques are used to optimize the selection of products or services for specific customers. A key factor in successful assortment optimization is the ability to understand and predict demand or customer preferences. In many online applications, customer-specific data are available to companies, and the feature information in such data is of great significance in modeling the relationship between features and purchase decisions. In Refs. [1, 2], transactional data were used to estimate customer preferences; these approaches rely on a discrete context for each customer type, that is, the customer types are identified before estimation. A practical algorithm for personalization under inventory constraints was proposed in Ref. [3]. The full feature data of customers are considered in Ref. [4], where covariate information is learned from the data.
In practice, logit models are commonly used to understand customer preferences and demand. Their interpretability and simplicity make the logit model a popular choice. Such a framework is widely used in targeted advertising[5], pricing[6], and assortment personalization[1,3,4]. The data-driven logit model framework, as a special type of generalized linear model (GLM), uses the information of customer features to estimate the coefficients, based on which assortment optimization is carried out. In big data applications, it is challenging to learn and infer dependence structures, because the responses and predictors in such a GLM framework may be related through a few latent pathways or a subset of predictors. Furthermore, with the exponential growth in data volume, the curse of dimensionality makes estimation and prediction more difficult. To successfully recover the sparse structure of predictors associated with the response, regularization methods such as the lasso[7], the group lasso[8], and the group lasso for logistic regression[9] are used.
In the multi-response scenario, the data-driven multinomial logit model captures the associations between the predictors and responses via a sparse and low-rank representation of the coefficient matrix. Sparse reduced-rank regression has been studied extensively in the literature: eliminating irrelevant features maintains the interpretability of the estimated matrix, and the low-rank structure helps to reduce the number of free parameters of the model[10–13]. Sparse reduced-rank regression has applications in social network community discovery[14], subspace clustering[15], and motion segmentation[16]. In multitask learning and noisy matrix decomposition, sparse reduced-rank representations have also been considered extensively; see Refs. [17–19]. Note, however, that these works estimate matrices with a low-rank-plus-sparse structure, which differs from ours: we focus on estimating a matrix that is jointly low-rank and sparse (see Refs. [10–12] for similar frameworks). Regarding the sparsity of the parameter matrix, Refs. [20, 21] focused on the co-sparsity structure in the matrix. However, the sparsity in our assortment personalization problem aims to help select the features of customers, so row-wise sparsity is introduced. To the best of our knowledge, in the application of the assortment personalization problem, a simultaneously sparse and low-rank structure in the coefficient matrix of the multinomial logit model has rarely been considered in the literature. To meet this requirement in our multinomial logit model, we choose the penalized likelihood framework.
To derive a sparse reduced-rank approximation of the parameter matrix, it is common to choose l_1 and nuclear norm regularizers. Because these are convex relaxations of the sparsity and low-rankness of a matrix, several methods are available for solving the penalized likelihood problem. The resulting problem is convex and can be solved by the alternating direction method of multipliers (ADMM)[22]; see Ref. [14]. Other methods include the sequential co-sparse unit-rank method[20] and sparse eigenvalue decomposition[23]. All the above sparse reduced-rank approaches have desirable theoretical properties. However, they cannot be directly used in penalized likelihood frameworks. For the GLM problem, the factored gradient descent method[24] is commonly used in problems that can be posed as matrix factorization; see Ref. [25] for precise convergence rate guarantees for a general convex function. Such a first-order method works in an alternating way and does not require the singular value decomposition (SVD) of the parameter matrix at each step, which makes high-efficiency computation possible when solving the penalized likelihood problem.
The main contributions of this study are threefold. First, we provide the framework for the assortment personalization problem, which maximizes the expected revenue over a feasible assortment. We introduce customer features related to the utility model and present our data-driven conditional multinomial logit choice model. For the sparsity of the parameter matrix, we use a group lasso-type penalty to derive row-wise sparsity, which amounts to feature selection for customers. Thus, estimation remains feasible in the high-dimensional feature scenario. Second, to solve the penalized maximum likelihood problem, we propose a first-order sparse factored gradient descent (SFGD) approach, in which both sparsity and low-rank structures are considered. Because of the low rank of the parameter matrix, the SVD can be used to reduce the number of parameters. We illustrate the details of the thresholding rule in SFGD and how it proceeds in the alternating updates of the two matrices derived from the decomposition. Moreover, we show the local convergence of SFGD and present a structure-aware dynamic assortment personalization procedure based on the SFGD method. Third, our simulation, which contains high-dimensional settings, shows that SFGD can consistently estimate the parameter matrix and accurately recover the support of the features. The average regret of different structure settings is compared as the time horizon grows, and the SFGD method, which accounts for the sparse reduced-rank structure, outperforms the sparsity-ignorant methods. We apply the proposed method to advertising behavior data, in which the features of both users and advertisements are considered. Furthermore, the SFGD-based assortment personalization procedure exhibits the best precision.
2. Model specification
In this section, we present our modeling framework for data-driven assortment personalization problems in which customer features are considered. Throughout this paper, bold letters denote matrices and vectors. Unless otherwise specified, {\boldsymbol{z}}_i is the column vector formed by the i th row of {\boldsymbol{Z}} , and z_{ij} is the j th element of the vector {\boldsymbol{z}}_i . For any matrix {\boldsymbol{Z}} = (z_{ij}) , ||{\boldsymbol{Z}}||_F = \sqrt{\sum_{i,j}z_{ij}^2} , ||{\boldsymbol{Z}}||_{2,1} = \sum_{i}||{\boldsymbol{z}}_i||_2 , and ||{\boldsymbol{Z}}||_{1} = \sum_{i,j}|z_{ij}| denote the Frobenius norm, the row-wise l_{2,1} -norm, and the element-wise l_{1} -norm, respectively. Furthermore, \sigma_1({\boldsymbol{Z}}) is the largest singular value of {\boldsymbol{Z}} .
In the assortment personalization problem, the retailer records past transactional data, which contain customer features, the items (products) chosen by customers, and the assortment arrangement provided by the retailer. Over the time horizon T , the decision maker observes customer data at times t = 1,\cdots,T. At time t , the decision maker obtains the customer data {\boldsymbol{x}}_t with p features that include individual information, the assortment S_t\subset\{1,\cdots,q\}, and the item j_t\in\{1,\cdots,q\} chosen by the t th customer.
2.1 Data-driven conditional multinomial logit choice model
In the data-driven assortment problem, we assume that the customer data matrix {\boldsymbol{X}} of size T\times p , also known as the matrix of feature vectors, is obtained directly from the past. We assume that customers choose among the products according to some conditional probability \mathbb{P}_{{\boldsymbol{\varTheta}}}(j|S) when the assortment is shown to the customer. Here, {\boldsymbol{\varTheta}} is the parameter matrix that plays an important role in the conditional multinomial logit choice model. The choice of {\boldsymbol{\varTheta}} will be presented later. For each item j\in \{1,\cdots,q\}, let r_j be the associated revenue, and let r_0 = 0 be the revenue of the no-purchase option. The decision maker then maximizes the expected revenue
\sum_{j\in S}r_j\,\mathbb{P}_{{\boldsymbol{\varTheta}}}(j|S)
(1)
over a feasible assortment S\subset \{1,\cdots,q\}. The assortment personalization problem aims to find an assortment that maximizes this expected revenue.
To obtain a clear view of {\boldsymbol{\varTheta}}, we first introduce the utility of items. A popular way to model customer choice probability is to utilize the random utility model[26]. We assume that a customer with the feature vector {\boldsymbol{x}}\in \mathbb{R}^p has utility
U_j^x = V_j^x + \epsilon_j
(2)
for each product j , where V_j^x can be interpreted as the mean utility of product j for this customer and \epsilon_j is a standard Gumbel random variable with a mean of zero. When a decision maker offers an assortment S\subset\{1,\cdots,q \} to a customer with feature {\boldsymbol{x}} , the customer chooses the product in S with the highest U_j^x . The utility V_0^x of the no-purchase option is set to zero. Here, we assume that the mean utility is given by the linear model V_j^x = \left <{\boldsymbol{x}},{\boldsymbol{\varTheta}}_j^{*}\right > , where {\boldsymbol{\varTheta}}_j^{*}\in \mathbb{R}^p for 1\leqslant j \leqslant q. Hence, we obtain the mean utility matrix for all items, {\boldsymbol{V}}^{x} = {\boldsymbol{X}}{\boldsymbol{\varTheta}}^{*}, where the underlying parameter matrix is {\boldsymbol{\varTheta}}^{*} = ({\boldsymbol{\varTheta}}_1^{*},\cdots,{\boldsymbol{\varTheta}}_q^{*})\in \mathbb{R}^{p\times q}.
The data-driven conditional multinomial logit choice model is defined over times t = 1,\cdots,T and items j = 1,\cdots,q . We introduce two random variables: the customer I and the item (choice) J . Using a well-known result from discrete choice theory[27], given an assortment S\subset \{1,\cdots,q\} , we derive a personalized choice probability
\mathbb{P}(J = j| I = {\boldsymbol{x}};S) = \frac{{\rm{exp}}(V_j^{x})}{1+\sum_{j^{\prime}\in S}{\rm{exp}}(V_{j^{\prime}}^{x})},\quad j\in S.
(3)
We choose the linear model of {\boldsymbol{x}} and {\boldsymbol{\varTheta}}_j^{*} to represent V_j^x ; then, the choice J has the conditional distribution
\begin{split} \mathbb{P}_{{\boldsymbol{\varTheta}}^{*}}(J = j| I = {\boldsymbol{x}}_t;S)& = \frac{1}{1+ \sum_{j^{\prime}\in S}{\rm{exp}}({\boldsymbol{x_t}}^{\rm{T}}{\boldsymbol{\varTheta}}_{j^{\prime}}^{*})}\times \\& \left\{\begin{aligned} &1,&j = 0 \\ &0,&j\neq 0, j\notin S \\ &{\rm{exp}}({\boldsymbol{x_{t}}}^{\rm{T}}{\boldsymbol{\varTheta}}_j^{*}),&j\neq 0, j\in S \end{aligned} \right. \end{split}
(4)
where J = 0 indicates that no product has been purchased in assortment S . A no-purchase option is common in the choice model. In our data-driven framework, the decision maker can observe customer features {\boldsymbol{x}}_t\in \mathbb{X},t = 1,\cdots,T, where \mathbb{X}\subset \mathbb{R}^p is a space of possible contexts. We also assume that {\boldsymbol{x}}_t is scaled to satisfy ||{\boldsymbol{x}}_t||_\infty\leqslant1 , for t = 1,\cdots,T.
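To make the mapping from features to choice probabilities concrete, the following is a minimal NumPy sketch of the conditional distribution in (4); the function name, the 0-based item indexing, and the use of -1 for the no-purchase option are our own conventions, not taken from the paper.

```python
import numpy as np

def choice_probabilities(x, Theta, S):
    """Conditional MNL choice probabilities of model (4).

    x     : (p,) customer feature vector
    Theta : (p, q) parameter matrix
    S     : offered item indices (0-based)
    Returns a dict item -> probability; the key -1 denotes the no-purchase option.
    """
    S = list(S)
    utilities = x @ Theta[:, S]            # mean utilities x^T Theta_j for j in S
    weights = np.exp(utilities)
    denom = 1.0 + weights.sum()            # the "+1" comes from the no-purchase option
    probs = {j: w / denom for j, w in zip(S, weights)}
    probs[-1] = 1.0 / denom
    return probs
```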
2.2 Penalized maximum likelihood approach
We suppose that we have T observations ({\boldsymbol{x}}_t,j_t,S_t) for t = 1,\cdots,T, where S_t comes from the set of subsets of \{1,\cdots,q\} of size K , and the j_t are i.i.d. according to model (4). Based on the specific form of \mathbb{P}_{{\boldsymbol{\varTheta}}^{*}}(J = j| I = {\boldsymbol{x}}_t;S) in (4), we define the loss function constructed from the negative log-likelihood as
{\cal{L}}({\boldsymbol{X}};{\boldsymbol{\varTheta}}) = -\sum_{t = 1}^{T}{\rm{log}}\,\mathbb{P}_{{\boldsymbol{\varTheta}}}(J = j_t| I = {\boldsymbol{x}}_t;S_t).
(5)
Similar to classical methods, we often assume that the underlying parameter matrix {\boldsymbol{\varTheta}}^{*} has a certain special structure, such as a low-rank structure or sparsity[12]. In the customer choice model, it is reasonable to assume that for a customer {\boldsymbol{x}} , only a few features have a significant impact on the utility of choosing different items. Because this sparsity is shared across items, we introduce the row-wise sparsity of {\boldsymbol{\varTheta}}^{*}. To recover the sparse structure of the parameter matrix, regularization methods can be helpful for variable selection as well as sparsity recovery. In sparse reduced-rank learning, we aim to recover the sparsity and low-rank structures simultaneously. Choosing among a large number of features is also a variable selection procedure in our generalized multi-response regression problem. When a large number of predictor variables (i.e., features) is available, some may not be helpful for either estimation or prediction. Therefore, it is important to perform feature selection using a shrinkage method.
Inspired by the regularization methods in regression, we choose a grouped lasso-type[8,12] penalty to avoid overfitting and improve interpretability. Another widely used approach is to derive the sparsity in the matrix using an element-wise lasso penalty (see Refs. [28, 29]). However, note that element-wise sparsity does not imply the row-wise sparsity that we expect, and the model would be unable to select the features of the data {\boldsymbol{X}} . In our problem, setting an entire row of {\boldsymbol{\varTheta}} to zero corresponds to excluding a feature from the customer data. Therefore, we introduce the l_{2,1} norm of {\boldsymbol{\varTheta}} rather than an element-wise l_{1} norm. Let 1\leqslant\tilde{r}\leqslant{\rm{min}}\{p,q\} ; then, our problem takes the form
\hat{{\boldsymbol{\varTheta}}} = \mathop{\arg\min}_{{\rm{rank}}({\boldsymbol{\varTheta}})\leqslant\tilde{r}}\; {\cal{Q}}({\boldsymbol{X}};{\boldsymbol{\varTheta}}),\quad {\cal{Q}}({\boldsymbol{X}};{\boldsymbol{\varTheta}}) = {\cal{L}}({\boldsymbol{X}};{\boldsymbol{\varTheta}})+\lambda||{\boldsymbol{\varTheta}}||_{2,1}.
(6)
In a full-rank problem, \tilde{r} = {\rm min}\{p,q \} must be chosen. However, if the problem has a low-rank structure or if we want to enforce a low rank, then a proper choice of a smaller \tilde{r} reduces storage and computational work; see a similar motivation in Ref. [1]. If {\boldsymbol{\varTheta}}^{*} has rank r , we may factor {\boldsymbol{\varTheta}}^{*} to find the vectors {\boldsymbol{u}}_i = (u_{i1},\cdots,u_{ir})^{\rm{T}} and {\boldsymbol{v}}_j = (v_{j1},\cdots,v_{jr})^{\rm{T}} for i = 1,\cdots,p and j = 1,\cdots,q such that {\boldsymbol{\varTheta}}^{*}_{ij} is approximately equal to \sum_{l = 1}^{r}u_{il}v_{jl} . We denote {\boldsymbol{U}} = (u_{il})_{p\times r} and {\boldsymbol{V}} = (v_{jl})_{q\times r} ; then, the right factors {\boldsymbol{V}} can be considered latent item weights, and the left factors {\boldsymbol{U}} latent features[1]. We now derive an appealing sparse SVD representation of {\boldsymbol{X}}{\boldsymbol{\varTheta}} as \dfrac{1}{\sqrt{T}}{\boldsymbol{X}}{\boldsymbol{\varTheta}} = \bigg(\dfrac{1}{\sqrt{T}}{\boldsymbol{XU}}_{0}{\boldsymbol{D}}_{0}\bigg){\boldsymbol{V}}_{0}^{\rm{T}} = \dfrac{1}{\sqrt{T}}{\boldsymbol{XUV}}^{\rm{T}} , where {\boldsymbol{U}} = {\boldsymbol{U}}_{0}{\boldsymbol{D}}_{0}\in \mathbb{R}^{p\times r} , {\boldsymbol{V}} = {\boldsymbol{V}}_{0}\in \mathbb{R}^{q\times r}, and {\boldsymbol{D}}_{0} = {\rm diag} \{d_1^{0},\cdots,d_r^{0}\}. This encourages us to use {\boldsymbol{U}} and {\boldsymbol{V}} to factorize the matrix {\boldsymbol{\varTheta}}; that is, {\boldsymbol{\varTheta}} = {\boldsymbol{U}}{\boldsymbol{V}}^{\rm{T}}. Thus, it is feasible to introduce a first-order method over {\boldsymbol{U}} and {\boldsymbol{V}} to derive the estimate of {\boldsymbol{\varTheta}}. It is worth mentioning that our goal is to accurately provide a low-rank and row-sparse estimate of {\boldsymbol{\varTheta}} rather than estimates of {\boldsymbol{U}} and {\boldsymbol{V}} independently. This factorial form of {\boldsymbol{\varTheta}} allows our method to approximate {\boldsymbol{\varTheta}} alternately. A similar framework can be found in Refs. [1, 25]. The tuning parameter \lambda , which is chosen based on an information criterion, is discussed later. We can shrink ||{\boldsymbol{\varTheta}}_i||_2 to zero by setting the i th row of {\boldsymbol{U}} to zero and thereby derive the row-wise sparsity of {\boldsymbol{U}} and {\boldsymbol{\varTheta}} accordingly. In our generalized multi-response regression problem, every item has a positive probability of being chosen, which motivates us to introduce row-wise instead of column-wise sparsity.
We define our estimator \hat{{\boldsymbol{\varTheta}}} for {\boldsymbol{\varTheta}}^{*} as the solution to the penalized maximum likelihood problem with the low-rank assumption {\rm rank}({\boldsymbol{\varTheta}}^{*})\ll {\rm min}\{p,q\}. Because the objective in problem (6) is convex, a variety of convex methods can be applied. In the next section, we use a first-order algorithm on its nonconvex, factored form.
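As a concrete reference for the estimator defined above, here is a sketch of the penalized objective under the factored parameterization {\boldsymbol{\varTheta}} = {\boldsymbol{U}}{\boldsymbol{V}}^{\rm{T}}; the unnormalized sum over observations and the coding of no-purchase as -1 are our own conventions, not taken from the paper.

```python
import numpy as np

def negative_log_likelihood(X, choices, assortments, Theta):
    """Negative log-likelihood of model (4) over T observations (x_t, j_t, S_t).

    choices uses -1 to encode the no-purchase option.
    """
    nll = 0.0
    for x_t, j_t, S_t in zip(X, choices, assortments):
        u = x_t @ Theta[:, list(S_t)]          # mean utilities of offered items
        nll += np.log1p(np.exp(u).sum())       # log(1 + sum_{j' in S_t} exp(x_t^T Theta_j'))
        if j_t != -1:
            nll -= x_t @ Theta[:, j_t]         # subtract the utility of the chosen item
    return nll

def penalized_objective(X, choices, assortments, U, V, lam):
    """Q(X; Theta) = L(X; Theta) + lam * ||Theta||_{2,1}, with Theta = U V^T."""
    Theta = U @ V.T
    row_penalty = np.linalg.norm(Theta, axis=1).sum()   # row-wise l_{2,1} penalty
    return negative_log_likelihood(X, choices, assortments, Theta) + lam * row_penalty
```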
3. Algorithm
With the convexity of {\cal{Q}}({\boldsymbol{X}};{\boldsymbol{\varTheta}}), many fast optimization approaches, such as the alternating direction method of multipliers, accelerated projected gradient descent, and factored gradient descent, can perform well. The commonly used method for estimating a parameter matrix with a low-rank structure is the factored gradient descent method[24]. In this section, we introduce a data-driven sparse factored gradient descent (SFGD) algorithm to approximate {\boldsymbol{\varTheta}}^{*} using a low-rank and sparse structure. SFGD is an iterative method in which the row-wise sparsity of {\boldsymbol{U}} is considered and the updates of {\boldsymbol{U}} and {\boldsymbol{V}} overlap. In the update of {\boldsymbol{U}} , we use a subgradient approach combined with gradient descent.
3.1 Sparse factored gradient descent method
In high-dimensional scenarios, computing the Hessian matrix can be difficult or even infeasible. In this section, we introduce a first-order algorithm for computing \hat{{\boldsymbol{\varTheta}}}, which works on the factored form of the low-rank and sparse constrained likelihood optimization problem (6). First, we consider the problem without regularization:
\mathop{\min}_{{\boldsymbol{U}}\in\mathbb{R}^{p\times r},\,{\boldsymbol{V}}\in\mathbb{R}^{q\times r}}\; {\cal{L}}({\boldsymbol{X}};{\boldsymbol{U}}{\boldsymbol{V}}^{\rm{T}}).
(7)
Clearly, the algorithm reduces the computational cost because this model has only r\times (p+q) optimization parameters rather than p\times q . Moreover, our SFGD algorithm works in an alternating manner; that is, we optimize the factors {\boldsymbol{U}} and {\boldsymbol{V}} of the parameter matrix {\boldsymbol{\varTheta}} = {\boldsymbol{UV}}^{\rm{T}} rather than performing an SVD at each step.
From the convexity of {\cal{L}}({\boldsymbol{X}};{\boldsymbol{\varTheta}}) with respect to {\boldsymbol{\varTheta}}, it is feasible to use the gradient descent method. Inspired by the factored form of our problem, we introduce the factored gradient descent procedure, which is a data-driven nonconvex method and a fundamental part of SFGD. The SFGD algorithm first solves the unconstrained problem (7) using the alternate updating rule
{\boldsymbol{U}}^{(m+1)} = {\boldsymbol{U}}^{(m)}-\eta\nabla_{{\boldsymbol{U}}}{\cal{L}}\big({\boldsymbol{X}};{\boldsymbol{U}}^{(m)}{{\boldsymbol{V}}^{(m)}}^{\rm{T}}\big),\qquad {\boldsymbol{V}}^{(m+1)} = {\boldsymbol{V}}^{(m)}-\eta\nabla_{{\boldsymbol{V}}}{\cal{L}}\big({\boldsymbol{X}};{\boldsymbol{U}}^{(m+1)}{{\boldsymbol{V}}^{(m)}}^{\rm{T}}\big),
(8)
which is closely related to the alternating convex search (ACS) method, as in Refs. [20, 30]. The main difference in our SFGD is that {\boldsymbol{U}} and {\boldsymbol{V}} overlap with each other at rank r\geqslant1 , rather than forming a unit-rank problem. We begin the line search with a step size of \eta = 1 , after which the adaptive step size is repeatedly decreased by a shrinkage factor \beta until the objective decreases.
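The line-search logic described above can be sketched as follows; the objective and gradient callables are placeholders, and the exact order in which {\boldsymbol{U}} and {\boldsymbol{V}} are refreshed follows our reading of the alternate rule (8).

```python
def alternate_step(U, V, grad_U_fn, grad_V_fn, objective_fn,
                   beta=0.5, eta0=1.0, eta_min=1e-12):
    """One alternate gradient step on (U, V) with backtracking line search:
    start from eta = 1 and shrink by beta until the objective decreases."""
    obj_old = objective_fn(U, V)
    eta = eta0
    while eta > eta_min:
        U_new = U - eta * grad_U_fn(U, V)          # update U first
        V_new = V - eta * grad_V_fn(U_new, V)      # then update V using the refreshed U
        if objective_fn(U_new, V_new) < obj_old:
            return U_new, V_new, eta
        eta *= beta                                # shrink the step size and retry
    return U, V, eta                               # no improving step found
```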
It is easy to compute the gradients of the objective in (7). According to the chain rule for the differentiable function {\cal{L}} , we have
\nabla_{{\boldsymbol{U}}}{\cal{L}}({\boldsymbol{X}};{\boldsymbol{U}}{\boldsymbol{V}}^{\rm{T}}) = \nabla{\cal{L}}({\boldsymbol{X}};{\boldsymbol{U}}{\boldsymbol{V}}^{\rm{T}})\,{\boldsymbol{V}},\qquad \nabla_{{\boldsymbol{V}}}{\cal{L}}({\boldsymbol{X}};{\boldsymbol{U}}{\boldsymbol{V}}^{\rm{T}}) = \nabla{\cal{L}}({\boldsymbol{X}};{\boldsymbol{U}}{\boldsymbol{V}}^{\rm{T}})^{\rm{T}}{\boldsymbol{U}}.
Here, we do not need to explicitly form \nabla {\cal{L}}({\boldsymbol{X}};{\boldsymbol{UV}}^{\rm{T}}) to compute gradients. Recall that {\boldsymbol{u}} and {\boldsymbol{v}} are the r\times 1 column vectors of rows of {\boldsymbol{U}} and {\boldsymbol{V}} ; then, we have the following form of gradients:
which clarifies the gradient descent direction. See Appendix A for more details.
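For completeness, here is a sketch of the gradient computation: the per-observation gradient of the negative log-likelihood with respect to {\boldsymbol{\varTheta}} takes the standard MNL form, and the chain rule maps it to the factors. The paper computes the same quantities row-wise without forming \nabla {\cal{L}} explicitly; the full-matrix version below is only for illustration.

```python
import numpy as np

def gradient_Theta(X, choices, assortments, Theta):
    """Gradient of the negative log-likelihood with respect to Theta (standard MNL form)."""
    p, q = Theta.shape
    G = np.zeros((p, q))
    for x_t, j_t, S_t in zip(X, choices, assortments):
        S_t = list(S_t)
        w = np.exp(x_t @ Theta[:, S_t])
        probs = w / (1.0 + w.sum())                # P_Theta(j | x_t, S_t) for j in S_t
        G[:, S_t] += np.outer(x_t, probs)          # contribution of the log-denominator
        if j_t != -1:
            G[:, j_t] -= x_t                       # subtract x_t for the chosen item
    return G

# Chain rule for Theta = U V^T:  grad_U = G @ V  and  grad_V = G.T @ U.
```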
We now introduce the row-wise sparsity of {\boldsymbol{\varTheta}}. To solve this problem, in the (m) th step, we use the subgradient method to screen the rows of {\boldsymbol{U}}^{(m)} , aiming to detect the sparsity pattern where ||{\boldsymbol{u}}_i^{(m)}||_2 = 0 . The j th element of {\boldsymbol{x}}_t is denoted as x_{t_j} . For any i = 1,\cdots,p, we apply the subgradient method with respect to {\boldsymbol{u}}_i^{(m)} and set the subgradient to zero, which leads to
Let {\boldsymbol{s}}_i = \frac{{\boldsymbol{u}}_i^{(m)}}{||{\boldsymbol{u}}_i^{(m)}||_2} if ||{\boldsymbol{u}}_i^{(m)}||_2\neq 0 . Further, {\boldsymbol{s}}_i is an r vector satisfying ||{\boldsymbol{s}}_i||_2<1 if ||{\boldsymbol{u}}_i^{(m)}||_2 = 0 ; then, we have
({\rm{i}}) For the (m+1) th iteration of gradient descent, we screen for row-wise sparsity before finalizing the update of {\boldsymbol{U}}^{(m)} .
({\rm{ii}}) For i = 1,\cdots,p, denote by x_{t_j} the j th element of {\boldsymbol{x}}_t and compute {\boldsymbol{s}}_i . Update the i th row of {\boldsymbol{U}}^{(m)} using the threshold rule (a sketch of this screening step is given after the list)
where (z)_{+} = {\rm{max}}\{0,z \} for all z\in \mathbb{R} . Without loss of generality, if ||{\boldsymbol{s}}_i||_2-1 = 0 , we let {\boldsymbol{u}}_i^{(m+1)} = {\boldsymbol{u}}_i^{(m)} .
({\rm{iii}}) After screening all rows of {\boldsymbol{U}}^{(m)} , update {\boldsymbol{U}}^{(m)} to derive {\boldsymbol{U}}^{(m+1)} , and then enter the next iteration or stop.
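A generic row-wise (group) soft-thresholding step is sketched below to illustrate the screening idea; it is not the paper's exact rule, which is stated through the subgradient quantity {\boldsymbol{s}}_i in (10).

```python
import numpy as np

def screen_rows(U, lam, eta):
    """Generic group soft-thresholding of the rows of U for an l_{2,1} penalty.

    Rows whose l_2 norm falls below the threshold lam * eta are set to zero,
    which excludes the corresponding customer feature from the model.
    """
    U_new = U.copy()
    norms = np.linalg.norm(U, axis=1)
    for i, norm in enumerate(norms):
        if norm <= lam * eta:
            U_new[i, :] = 0.0                              # feature i is screened out
        else:
            U_new[i, :] = (1.0 - lam * eta / norm) * U[i, :]
    return U_new
```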
Our SFGD method can be initialized using the technique from Ref. [25], which only requires the gradients of {\cal{L}}({\boldsymbol{X}};{\boldsymbol{\varTheta}}). Taking the SVD of -\nabla{\cal{L}}({\boldsymbol{X}};{\bf{0}}) gives -\nabla{\cal{L}}({\boldsymbol{X}};{\bf{0}}) = \tilde{{\boldsymbol{U}}}{\rm diag}(\tilde{\sigma}_1,\cdots, \tilde{\sigma}_{{\rm{min}}\{p,q\}}) \tilde{{\boldsymbol{V}}}^{\rm{T}}. We denote by \tilde{{\boldsymbol{U}}}_{r} and \tilde{{\boldsymbol{V}}}_{r} the first r columns of \tilde{{\boldsymbol{U}}} and \tilde{{\boldsymbol{V}}} , where r\leqslant{\tilde{ r}} is one of the tuning parameters. Let {\boldsymbol{E}}_1 be the p\times q matrix whose (1,1) element is one and whose other elements are zero. Then we initialize
where \omega = ||\nabla{\cal{L}}({\boldsymbol{X}};{\bf{0}})-(\nabla{\cal{L}}({\boldsymbol{X}};{\boldsymbol{E}}_{\bf{1}})+\lambda{\boldsymbol{E}}_{\bf{1}})||_F . The algorithm terminates when the decrease in the objective function value is smaller than the tolerance \tau .
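Initialization can be sketched as a rank- r spectral start from -\nabla{\cal{L}}({\boldsymbol{X}};{\bf{0}}) in the spirit of Ref. [25]; the square-root scaling below is an assumption, and the constant \omega is simply treated as given, since the paper's exact initialization formula is not reproduced here.

```python
import numpy as np

def initialize_factors(G0, r, omega=1.0):
    """Spectral initialization from G0 = -grad L(X; 0).

    The top-r singular pairs of G0 are split between U and V; scaling each factor by
    sqrt(sigma_i / omega) is an assumption, not the paper's exact formula.
    """
    Ut, sv, Vt = np.linalg.svd(G0, full_matrices=False)
    scale = np.sqrt(sv[:r] / omega)
    U0 = Ut[:, :r] * scale            # columns scaled by sqrt(sigma_i / omega)
    V0 = Vt[:r].T * scale
    return U0, V0
```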
The selection of the tuning parameter \lambda is based on an information criterion, which will be discussed later. Considering the overshooting problem in the line search, the step-size shrinkage factor \beta adjusts \eta to ensure that the local optimum is not missed. The details of SFGD are presented in Algorithm 3.1.
[Algorithm 3.1 (SFGD) appears here; its row-screening step (line 16) applies the thresholding update {\boldsymbol{u}}_i^{'}=\dfrac{1}{||{\boldsymbol{s}}_i||_2-1}\big(||{\boldsymbol{s}}_i||_2-1 \big)_{+}{\boldsymbol{u}}_i^{'} , with {\boldsymbol{s}}_i defined in (10).]
We now present the convergence performance of the SFGD algorithm as follows; the proof is provided in Appendix B.
Theorem 3.1. Let {\boldsymbol{\varTheta}} = {\boldsymbol{U}}{\boldsymbol{V}}^{\rm{T}} and {\boldsymbol{\varTheta}}' = {\boldsymbol{U}}'{\boldsymbol{V}}'^{\rm{T}} denote the input and output of an iteration of SFGD. Then there exists a constant M>0 such that if the step size \eta \leqslant \dfrac{1}{3M( {\sigma _1} ({\boldsymbol{U}}^{\rm{T}}{\boldsymbol{U}})+ {\sigma _1} ({\boldsymbol{V}}^{\rm{T}}{\boldsymbol{V}}))}, then
There exists an optimum \tilde{{\boldsymbol{\varTheta}}} = \tilde{{\boldsymbol{U}}}\tilde{{\boldsymbol{V}}}^{\rm{T}} such that {\cal{Q}}({\boldsymbol{X}};{\boldsymbol{\varTheta}}) converges to {\cal{Q}}({\boldsymbol{X}};\tilde{{\boldsymbol{\varTheta}}}).
The theorem states that the step size \eta adjusted by the shrinkage factor is always below the upper bound. Updating by such a step size ensures that {\cal{Q}}({\boldsymbol{X}};{\boldsymbol{\varTheta}}) decreases towards the optimum value.
3.3 The structure-aware dynamic assortment personalization problem
In the dynamic assortment personalization problem, we learn from the past up to the time horizon T , first by offering random assortments S_t and recording the observations ({\boldsymbol{x}}_t,j_t,S_t) ; this is the exploration procedure. In the next step, we implement the SFGD algorithm to estimate {\boldsymbol{\varTheta}} using both the low-rank and sparse structures. As T increases, the in-sample prediction of the assortment can be derived by maximizing the expected revenue (1). For given p,q , and r , there is a critical value given by a function C(T) that depends on T , such as C(T) = Cr(p+q){\rm{log}}(T) in Ref. [1]. When T reaches C(T) , the procedure switches to exploitation, which is the out-of-sample prediction of the assortment. We denote by {\cal{A}} the collection of observations, and C(T) varies slowly with respect to T . We present the details of our dynamic assortment personalization procedure in Algorithm 3.2.
The exploration step yields the structure-aware estimate {\boldsymbol{ \hat \varTheta }} ; based on {\boldsymbol{ \hat \varTheta }} , we derive the conditional distribution (4) for the new data {\boldsymbol{x}}_t . Thus, our structure-aware dynamic assortment personalization approach serves every incoming individual at time t rather than several types of customers, as in Ref. [1].
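The exploitation step can be sketched as a brute-force search over size- K assortments that maximizes the expected revenue (1) under the fitted {\boldsymbol{ \hat \varTheta }}; it reuses the choice_probabilities sketch above, and a practical implementation would exploit MNL structure rather than enumerating subsets.

```python
from itertools import combinations
import numpy as np

def personalize_assortment(x, Theta_hat, revenues, K):
    """Return the size-K assortment maximizing the expected revenue (1) for feature x."""
    q = len(revenues)
    best_S, best_rev = None, -np.inf
    for S in combinations(range(q), K):
        probs = choice_probabilities(x, Theta_hat, list(S))   # from the earlier sketch
        expected_rev = sum(revenues[j] * probs[j] for j in S)
        if expected_rev > best_rev:
            best_S, best_rev = list(S), expected_rev
    return best_S
```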
In this section, we describe the implementation of the simulation to demonstrate the advantages of the proposed approach. We use the generalized information criterion (GIC)[31] for high-dimensional penalized likelihood settings to select the tuning parameter \lambda and rank r by minimizing
where \alpha_\lambda\subset \{1,\cdots,p\} is the row-wise support of the estimate {\boldsymbol{ \hat \varTheta }} . Denote by \alpha_0 the true row-wise support of {\boldsymbol{\varTheta}}^{*} . Then, there exists a \lambda_0 such that \alpha_{\lambda_0} = \alpha_0 . In addition, a_T is a positive sequence that depends only on T . We choose a modified BIC-type a_T such that a_T = C_T{\rm{log}}(T) with a diverging C_T sequence, as in Ref. [32]. In this study, we use this strategy by letting C_T = c{\rm{log}}({\rm{log}}(T+p+q)) , where c is a positive constant. In the following analysis, we select \lambda and r by minimizing {\rm{GIC}}_{a_T}(\lambda,r) .
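The tuning sequence a_T is fully specified in the text and translates directly into code; the GIC itself is sketched below in the usual deviance-plus-penalty form, whose exact scaling and model-size term in the paper we do not reproduce.

```python
import numpy as np

def a_T(T, p, q, c=5.0):
    """Modified BIC-type sequence a_T = C_T * log(T) with C_T = c * log(log(T + p + q))."""
    C_T = c * np.log(np.log(T + p + q))
    return C_T * np.log(T)

def gic(nll, df, T, p, q, c=5.0):
    """A generic GIC value: 2 * negative log-likelihood + a_T * (model degrees of freedom).
    Here `df` stands in for the paper's model-size term, which would involve the selected
    row support and the rank."""
    return 2.0 * nll + a_T(T, p, q, c) * df
```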
4.1 Estimation accuracy
We generate the true {\boldsymbol{\varTheta}}^{*} as follows: First, we generate the p\times q matrix {\boldsymbol{\varTheta}}_0 with i.i.d. standard normal entries, take the SVD of {\boldsymbol{\varTheta}}_0 as {\boldsymbol{\varTheta}}_0 = {\boldsymbol{U}}{\rm{diag}}(\sigma_1,\sigma_2,\cdots){\boldsymbol{V}}^{\rm{T}}, retain the first r singular values, and derive {\boldsymbol{\varTheta}}_1 = {\boldsymbol{U}}{\rm{diag}}(\sigma_1,\cdots,\sigma_r,0,\cdots,0){\boldsymbol{V}}^{\rm{T}}. Then, {\boldsymbol{\varTheta}}_2 = {\boldsymbol{\varTheta}}_1/{\rm sd}({\rm vec}({\boldsymbol{\varTheta}}_1)). Finally, we impose row-wise sparsity by randomly choosing S\subseteq\{1,\cdots,p \} with |S| = s as the non-sparse rows of {\boldsymbol{\varTheta}}_2 and setting the remaining rows to zero, which yields {\boldsymbol{\varTheta}}^{*}. The rows of the customer data {\boldsymbol{X}} are drawn from the normal distribution N(0,{\boldsymbol{\varSigma}}), where {\boldsymbol{\varSigma}} = (\sigma_{ij})_{p\times p} with \sigma_{ij} = 0.5^{|i-j|} ; the assortments S_t, t = 1,\cdots,T, are uniformly drawn from the size- K subsets of \{1,\cdots,q\} with K = 10 ; and j_t is then drawn according to the conditional distribution (4). We consider different settings of the true rank r = 2,3,5 with \tilde{r} = 2r, s = 10,\tau = 10^{-10},\eta = 0.05, and c = 5 . For tuning parameter selection, we minimize {\rm{GIC}}_{a_T}(\lambda,r) for every fixed value of r\leqslant \tilde{r} , and then finish the tuning by choosing the (\lambda,r) that minimizes {\rm{GIC}}_{a_T}(\lambda,r) globally. In real-world applications, GIC in turn helps approximate the upper bound \tilde{r} of the low-rank constraint, because once the rank exceeds a certain value, the GIC value hardly changes across different \lambda .
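The data-generating recipe above translates directly into code; the sketch below follows that recipe, with the random number generator and the use of the sample standard deviation of {\rm vec}({\boldsymbol{\varTheta}}_1) as our own implementation choices.

```python
import numpy as np

def generate_true_Theta(p, q, r, s, seed=0):
    """Generate the true parameter matrix Theta* as described in Section 4.1."""
    rng = np.random.default_rng(seed)
    Theta0 = rng.standard_normal((p, q))
    Um, sv, Vt = np.linalg.svd(Theta0, full_matrices=False)
    Theta1 = (Um[:, :r] * sv[:r]) @ Vt[:r]           # keep only the first r singular values
    Theta2 = Theta1 / Theta1.std(ddof=1)             # standardize by sd(vec(Theta1))
    Theta_star = np.zeros((p, q))
    support = rng.choice(p, size=s, replace=False)   # s randomly chosen non-sparse rows
    Theta_star[support] = Theta2[support]
    return Theta_star, support
```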
In the method comparison, we consider the ordinary factored gradient descent (OFGD) method, which solves problem (7) using the alternate updating rule (8). OFGD recovers only the low-rank structure of {\boldsymbol{\varTheta}}. Moreover, we introduce the maximum likelihood estimation (MLE) method with a structure-free {\boldsymbol{\varTheta}}; that is, both the sparse and low-rank structures of {\boldsymbol{\varTheta}} are ignored, and the rank of {\boldsymbol{\varTheta}} is chosen as {\rm{min}}\{p,q\} .
The error of estimation is measured by root mean squared error (RMSE)
Moreover, we choose two indicators, the false positive rate {\rm FPR} = \dfrac{{\rm FP}}{{\rm FP}+{\rm TN}} and the false negative rate {\rm FNR} = \dfrac{{\rm FN}}{{\rm FN}+{\rm TP}} , to evaluate the results of sparsity recovery: for row {\boldsymbol{\varTheta}}_i , if {\boldsymbol{\varTheta}}^{*}_i = {\bf{0}} but \hat{{\boldsymbol{\varTheta}}}_i\neq{\bf{0}} , then i is counted as a false positive ( {\rm FP} ), and {\rm FN},{\rm TN},{\rm TP} are counted analogously. We also report the number of iterations k , which counts the total number of updates of the alternate updating rule (8) in SFGD under a proper choice of the step size \eta .
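The evaluation metrics can be computed as below; the normalization of the RMSE by \sqrt{pq} is our assumption, since the displayed formula is not reproduced here.

```python
import numpy as np

def evaluate(Theta_hat, Theta_star, tol=1e-8):
    """RMSE of the estimate plus FPR and FNR of row-support recovery."""
    p, q = Theta_star.shape
    rmse = np.linalg.norm(Theta_hat - Theta_star, 'fro') / np.sqrt(p * q)  # assumed scaling
    est = np.linalg.norm(Theta_hat, axis=1) > tol        # estimated non-sparse rows
    true = np.linalg.norm(Theta_star, axis=1) > tol      # true non-sparse rows
    fp, tn = np.sum(est & ~true), np.sum(~est & ~true)
    fn, tp = np.sum(~est & true), np.sum(est & true)
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    fnr = fn / (fn + tp) if (fn + tp) > 0 else 0.0
    return rmse, fpr, fnr
```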
In Table 1, we compare the performance of OFGD, SFGD, and MLE over 100 replications and report the results, including a high-dimensional setting with p>T . As reported in Table 1, under the setting c = 5 in the tuning of \lambda , the structure-aware SFGD method outperforms all the other methods in terms of both estimation and utility error. Moreover, as expected, the OFGD method, which only considers the low-rank structure, performs better than the structure-ignorant MLE method. SFGD recovers sparsity well, as both {\rm FPR} and {\rm FNR} are well controlled. In the high-dimensional setting with p>T , SFGD maintains its estimation accuracy as well as its sparsity recovery. For different settings of the true rank r , we find that SFGD still enjoys the lowest RMSE and has good control of the FNR, which indicates the robustness of our method to different structures of {\boldsymbol{\varTheta}}. The CPU time and number of iterations k are also reported in Table 1, which demonstrate the efficiency of our sparse reduced-rank method SFGD in both low- and high-dimensional settings.
Table 1. Results of the OFGD, SFGD, and MLE methods under different r,p,q,T settings over 100 replications (standard deviations are shown in parentheses).
Next, we consider the dynamic assortment personalization problem and compare the average regret[33] of the three methods. One alternative is the structure-ignorant algorithm, in which we fit a single multinomial logit (MNL) model by MLE to the entire population, ignoring the low-rank and sparse structure of {\boldsymbol{\varTheta}}. We first define the average regret as follows:
Definition 4.1. Given an instance (p,q,{\boldsymbol{\varTheta}}^{*}), the average regret of algorithm \pi at time T is
In the simulation of the average regret, the feature matrix {\boldsymbol{X}} and underlying {\boldsymbol{\varTheta}}^{*} are generated based on the previous simulation of the estimation accuracy. We now construct the true revenue for each product as follows:
({\rm{i}}) K out of q items have revenue parameters r_i = 1 .
({\rm{ii}}) For the other (q-K) items, the revenues are uniformly distributed in [0.05,0.1] .
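Definition 4.1 can be read as the per-period gap between the oracle expected revenue under {\boldsymbol{\varTheta}}^{*} and the expected revenue of the assortments actually offered by algorithm \pi ; the sketch below assumes that reading and reuses the earlier choice-probability and assortment helpers.

```python
import numpy as np

def average_regret(X, offered, Theta_star, revenues, K):
    """Average regret over T periods under our reading of Definition 4.1."""
    def expected_revenue(x, S):
        probs = choice_probabilities(x, Theta_star, list(S))   # earlier sketch
        return sum(revenues[j] * probs[j] for j in S)
    gaps = []
    for x_t, S_t in zip(X, offered):
        S_star = personalize_assortment(x_t, Theta_star, revenues, K)  # oracle assortment
        gaps.append(expected_revenue(x_t, S_star) - expected_revenue(x_t, S_t))
    return float(np.mean(gaps))
```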
For comparison, we also introduce the average regret O\left( {r{\rm{max}} (p,q)\dfrac{{\rm{log}}(T)}{T}} \right) of Ref. [1] as a baseline that considers customer types rather than feature data.
In Fig. 1, we report the results for different settings of p,q , and T over 100 replications. From Fig. 1, we can see that SFGD has a lower average regret than both OFGD and MLE, which means that exploiting both the low-rank and sparse structures reduces the regret level. Moreover, as the number of features p and the number of items q grow, the method that considers both low-rank and sparse structures moves closer to the baseline.
Figure 1. Comparison of average regret among the OFGD, SFGD, and MLE methods. The time horizon T ranges from 1000 to 20000. The constant of the baseline is chosen as C = 0.12 .
We found that the SFGD method stabilized at an average regret level much lower than that of the OFGD and MLE methods. Before reaching the minimum regret level, the average regret of SFGD decreases as the time horizon T increases, whereas for OFGD and MLE the regret does not decrease with a larger T . Furthermore, in all settings, the OFGD method, which only uses the low-rank structure, achieves better performance than the structure-ignorant MLE. Therefore, our results confirm the necessity of sparsity recovery and the effectiveness of our proposed algorithm for the dynamic assortment personalization problem.
The structure-aware method, which contains low-rank and sparsity structures, is of great significance when handling large-scale customer feature data, especially in high-dimensional scenarios p\geqslant T. We also observe the effective variable selection capability of the grouped lasso-type shrinkage method on {\boldsymbol{\varTheta}}, which is computed iteratively using our proposed SFGD method. Furthermore, with the growth of horizon T , SFGD always enjoys the lowest average regret among different algorithms.
5. Application to advertising behavior data
This section analyzes advertising behavior data collected over seven consecutive days, which contain the information of both users and advertisements; the data are available at Kaggle ①. Targeted advertising[5,34] is a key problem in advertising computation, and increasing the accuracy of personal advertising is crucial for improving the effectiveness of precision marketing. The advertising dataset has 10000 records of the advertisers' information offered, attributes of the advertisements, attributes of the users, and the advertisements clicked by users. We consider 30 features of users that interact with the click behavior, such as user age; city rank (level of the resident city of a user); device name (phone model used by a user); career; gender; net type (network status when a behavior occurs); residence (resident province of a user); app storage size; release time; app rating score; device price; active time by mobile phone; and membership lifecycle. To facilitate the computation of discrete variables in Euclidean space, we introduce one-hot encoding and finally obtain p = 501 features. There are q = 113 advertiser types of clicked ads, denoted by j_t\in \{1,\cdots,q\}.
We begin by splitting the data into a training set with 70% of the records and a test set with the remaining 30%. We fit and evaluate our model 100 times for each training set size T\in \{200,700, 1700,3000,5000,7000\} using the following steps. First, we randomly select T users from the training pool and select the tuning parameters \lambda and r using our GIC method. Noting that the value of the GIC function rarely changes when r>5 , we set \tilde{r} = 5 . We then fit the model to this training set of size T . Finally, we test its performance on a fully held-out test set of size 3000 .
According to the structure-aware dynamic assortment personalization procedure in Algorithm 3.2, in the exploration stage, we fit the model and obtain a sparse reduced-rank representation of {\boldsymbol{\varTheta}}. Without additional knowledge, we treat the rewards of all items (ads) equally by letting r_1 = r_2 = \cdots = r_q. Then, in the exploitation stage, we assign an assortment of size K = 10 to every user in the test set by maximizing the expected revenue. To evaluate our model, we use precision, defined as the percentage of users whose clicked advertiser type j_t is covered by the predicted assortment S_t . We benchmark our SFGD method against OFGD and MLE as in the simulations; the results are shown in Fig. 2.
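The precision metric used here is simply coverage of the clicked advertiser type by the predicted assortment; a one-line sketch, with variable names of our own choosing:

```python
def assortment_precision(clicked, assortments):
    """Share of test users whose clicked type j_t lies in the predicted assortment S_t."""
    return sum(j_t in S_t for j_t, S_t in zip(clicked, assortments)) / len(clicked)
```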
From Fig.2, we observe that the precision of assortment personalization by SFGD increases from 83.1\% to 87.7\% as the size of the training set increases from 200 to 7000 . It is worth mentioning that our SFGD approach reaches 83.1\% precision under a high-dimensional scenario when T = 200, p = 501 , in which features such as age, city rank, career, device price, and consumer purchase are selected in a block-wise manner. Moreover, the advantages of the structure-aware SFGD method will become increasingly evident with increasing sample size. The performance differential seems to grow larger when comparing SFGD with sparsity-ignorant OFGD and structure-ignorant MLE.
6. Discussion
This study focused on the assortment personalization problem using a data-driven conditional multinomial logit choice model in which sparse and low-rank structures of the parameter matrix are considered. We presented the SFGD method for our penalized maximum likelihood problem (i.e., a negative log-likelihood loss plus certain penalties), leading to computational efficiency. Moreover, we proved that SFGD exhibits a local convergence property, and the simulations showed that SFGD achieves good estimation accuracy and feature selection ability with massive and high-dimensional data. A real-world application to advertising behavior data was presented, demonstrating the excellent performance of our assortment personalization procedure.
One interesting direction for future research is the non-asymptotic analysis of the multinomial logit penalized likelihood in a high-dimensional setting, which statistically describes the estimation accuracy. Another research direction is to extend the SFGD method to a co-sparse framework that considers both the row-wise and column-wise sparsity of the parameter matrix.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (72071187, 11671374, 71731010, 71921001) and the Fundamental Research Funds for the Central Universities (WK3470000017, WK2040000027).
Conflict of interest
The authors declare that they have no conflict of interest.
Figure 2. Model performance by dataset size.
References
[1]
Kallus N, Udell M. Dynamic assortment personalization in high dimensions. Operations Research,2020, 68 (4): 1020–1037. DOI: 10.1287/opre.2019.1948
[2]
Bernstein F, Kök A G, Xie L. Dynamic assortment customization with limited inventories. Manufacturing & Service Operations Management,2015, 17 (4): 538–553. DOI: 10.1287/msom.2015.0544
[3]
Golrezaei N, Nazerzadeh H, Rusmevichientong P. Real-time optimization of personalized assortments. Management Science,2014, 60 (6): 1532–1551. DOI: 10.1287/mnsc.2014.1939
[4]
Chen X, Owen Z, Pixton C, et al. A statistical learning approach to personalization in revenue management. Management Science,2021, 68 (3): 1923–1937. DOI: 10.1287/mnsc.2020.3772
[5]
Luo X, Andrews M, Fang Z, et al. Mobile targeting. Management Science,2014, 60 (7): 1738–1756. DOI: 10.1287/mnsc.2013.1836
[6]
Xue Z, Wang Z, Ettl M. Pricing personalized bundles: A new approach and an empirical study. Manufacturing & Service Operations Management,2016, 18 (1): 51–68. DOI: 10.1287/msom.2015.0563
[7]
Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology),1996, 58 (1): 267–288. DOI: 10.1111/j.2517-6161.1996.tb02080.x
[8]
Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology),2006, 68 (1): 49–67. DOI: 10.1111/j.1467-9868.2005.00532.x
[9]
Meier L, Van De Geer S, Bühlmann P. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology),2008, 70 (1): 53–71. DOI: 10.1111/j.1467-9868.2007.00627.x
[10]
Negahban S, Wainwright M J. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics,2011, 39 (2): 1069–1097. DOI: 10.1214/10-AOS850
[11]
Chen K, Chan K S, Stenseth N C. Reduced rank stochastic regression with a sparse singular value decomposition. Journal of the Royal Statistical Society: Series B (Statistical Methodology),2012, 74 (2): 203–221. DOI: 10.1111/j.1467-9868.2011.01002.x
[12]
Chen L, Huang J Z. Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. Journal of the American Statistical Association,2012, 107 (500): 1533–1545. DOI: 10.1080/01621459.2012.734178
[13]
Chen K, Dong H, Chan K S. Reduced rank regression via adaptive nuclear norm penalization. Biometrika,2013, 100 (4): 901–920. DOI: 10.1093/biomet/ast036
[14]
Zhou K, Zha H, Song L. Learning social infectivity in sparse low-rank networks using multi-dimensional hawkes processes. In: Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics. PMLR, 2013: 641-649.
[15]
Wang Y X, Xu H, Leng C. Provable subspace clustering: When LRR meets SSC. In: Proceedings of the 26th International Conference on Neural Information Processing Systems: Volume 1. Red Hook, NY: Curran Associates Inc, 2013: 64–72.
[16]
Feng J, Lin Z, Xu H, et al. Robust subspace segmentation with block-diagonal prior. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2014: 3818–3825.
[17]
Chen J, Liu J, Ye J. Learning incoherent sparse and low-rank patterns from multiple tasks. ACM Transactions on Knowledge Discovery from Data (TKDD),2012, 5 (4): 1–31. DOI: 10.1145/2086737.2086742
[18]
Agarwal A, Negahban S, Wainwright M J. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics,2012, 40 (2): 1171–1197. DOI: 10.1214/12-AOS1000
[19]
Chen J, Zhou J, Ye J. Integrating low-rank and group-sparse structures for robust multi-task learning. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery, 2011: 42–50.
[20]
Mishra A, Dey D K, Chen K. Sequential co-sparse factor regression. Journal of Computational and Graphical Statistics,2017, 26 (4): 814–825. DOI: 10.1080/10618600.2017.1340891
[21]
Mishra A, Dey D K, Chen Y, et al. Generalized co-sparse factor regression. Computational Statistics & Data Analysis,2021, 157: 107127. DOI: 10.1016/j.csda.2020.107127
[22]
Boyd S, Parikh N, Chu E. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning,2011, 3 (1): 1–122. DOI: 10.1561/2200000016
[23]
Zheng Z, Bahadori M T, Liu Y, et al. Scalable interpretable multi-response regression via SEED. Journal of Machine Learning Research,2019, 20 (107): 1–34.
[24]
Jain P, Netrapalli P, Sanghavi S. Low-rank matrix completion using alternating minimization. In: Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing. New York: Association for Computing Machinery, 2013: 665–674.
[25]
Bhojanapalli S, Kyrillidis A, Sanghavi S. Dropping convexity for faster semi-definite optimization. In: 29th Annual Conference on Learning Theory. PMLR, 2016, 49: 530-582.
[26]
Golrezaei N, Nazerzadeh H, Rusmevichientong P. Real-time optimization of personalized assortments. Management Science,2014, 60 (6): 1532–1551. DOI: 10.1287/mnsc.2014.1939
[27]
Train K E. Discrete Choice Methods with Simulation. Cambridge, UK: Cambridge University Press, 2009.
[28]
Song Z, Woodruff D P, Zhong P. Low rank approximation with entrywise l1-norm error. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing. New York: Association for Computing Machinery, 2017: 688–701.
[29]
Klopp O, Lounici K, Tsybakov A B. Robust matrix completion. Probability Theory and Related Fields,2017, 169 (1): 523–564. DOI: 10.1007/s00440-016-0736-y
[30]
Gorski J, Pfeuffer F, Klamroth K. Biconvex sets and optimization with biconvex functions: A survey and extensions. Mathematical Methods of Operations Research,2007, 66 (3): 373–407. DOI: 10.1007/s00186-007-0161-1
[31]
Fan Y, Tang C Y. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B: (Statistical Methodology),2013, 75 (3): 531–552. DOI: 10.1111/rssb.12001
[32]
Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology),2009, 71 (3): 671–683. DOI: 10.1111/j.1467-9868.2008.00693.x
[33]
Chen X, Krishnamurthy A, Wang Y. Robust dynamic assortment optimization in the presence of outlier customers. https://arxiv.org/abs/1910.04183.
[34]
Boerman S C, Kruikemeier S, Zuiderveen Borgesius F J. Online behavioral advertising: A literature review and research agenda. Journal of Advertising,2017, 46 (3): 363–376. DOI: 10.1080/00913367.2017.1339368
[35]
Mirsky L. A trace inequality of John von Neumann. Monatshefte für Mathematik,1975, 79 (4): 303–306. DOI: 10.1007/BF01647331