Dataset | Users | Items | Training | Validation | Test |
Coat | 290 | 300 | 6960 | 232 | 4176 |
Yahoo!R3 | 15400 | 1000 | 311704 | 2700 | 48600 |
Simulation | 500 | 500 | 12500 | 3750 | 67500 |
Semi-supervised learning (SSL) has been applied to many practical applications over the past few years. Recently, distributed graph-based semi-supervised learning (DGSSL) has been shown to have good performance. Current DGSSL algorithms usually have the problems of inefficient graph construction and the straggler effect. This paper proposes a novel coded DGSSL (CDGSSL) to solve these problems. We first provide a novel parallel and distributed solution of matrix completion for efficient graph construction. Then, we develop the CDGSSL algorithm based on coding theory. Specifically, the proposed algorithm consists of two parts separately designed based on the maximum distance separable (MDS) code. In general, the proposed coded distributed algorithm is efficient and straggler tolerant. Moreover, we provide an optimal parameter design for the proposed algorithm. The results of the experiments on the Alibaba Cloud elastic compute service (ECS) demonstrate the superiority of the proposed algorithm.
The proposed algorithm consists of two parts: coded matrix completion and coded label propagation.
Semi-supervised learning (SSL) has been applied to many practical applications over the past few years. Recently, distributed graph-based semi-supervised learning (DGSSL) has been shown to have good performance. Current DGSSL algorithms usually have the problems of inefficient graph construction and the straggler effect. This paper proposes a novel coded DGSSL (CDGSSL) to solve these problems. We first provide a novel parallel and distributed solution of matrix completion for efficient graph construction. Then, we develop the CDGSSL algorithm based on coding theory. Specifically, the proposed algorithm consists of two parts separately designed based on the maximum distance separable (MDS) code. In general, the proposed coded distributed algorithm is efficient and straggler tolerant. Moreover, we provide an optimal parameter design for the proposed algorithm. The results of the experiments on the Alibaba Cloud elastic compute service (ECS) demonstrate the superiority of the proposed algorithm.
In the era of information explosion, recommender systems (RSs) that connect users with the right items have been recognized as the most effective means to alleviate information overload[1]. Existing recommendation methods mainly focus on developing a machine learning model to better fit the collected user behavior data[2, 3]. However, because the behavior data in practice are observational rather than experimental, various biases occur. For example, user behaviors are subject to what was exposed to the users (aka, exposure bias) or what were the public opinions (aka, conformity bias), deviating from reflecting the real interests of users. Consequently, blindly fitting the RS model without tackling the bias issues will lead to unacceptable performance[4, 5].
To alleviate the unfavorable effect of biases, a surge of recent work has been devoted to debiasing solutions. For example, the inverse propensity score (IPS) was proposed to reweight training samples for expectation-unbiased learning[6, 7]; Data imputation was explored to fill missing values[8, 9]; Knowledge distillation was also used to bridge the models trained on biased data and unbiased data. More recently, AutoDebias[10] proposed to optimize objective function on unbiased data has applied a metalearning technique, which has been demonstrated to be a generic and effective solution for recommendation debiasing. AutoDebias gives both pseudolabels and confidence weights for each user-item pair and is flexible enough to address various biases and outperform previous work by a large margin.
Although effective at managing diverse biases, AutoDebias suffers from serious inefficiency. Specifically, the training of AutoDebias involves calculations for each user-item pair. As such, the complexity of AutoDebias is linear with the product of the number of users and items, which can easily scale to one billion or even larger. This crucial defect severely limits the applicability and potential of AutoDebias. Although stochastic gradient descent with a uniform sampler can be employed for acceleration, this treatment significantly deteriorates model convergence and stability, leading to inferior recommendation performance. Hence, how to accelerate AutoDebias without reducing its accuracy is still an open problem.
To this end, in this work, we propose LightAutoDebias (short as LightAD), which equips AutoDebias with a carefully designed adaptive sampling strategy. We make three important improvements: (ⅰ) We adopt a dynamic sampling distribution that favors the informative training instances with large confidence weights. Further theoretical analysis proves that such a distribution could reduce the sampling variance and accelerate convergence. (ⅱ) We develop a customized factorization-based sampling strategy to support fast sampling from the above dynamic distribution. (ⅲ) Because the model may not reliably estimate the confidence weights in the early training stage, we propose a self-paced strategy that adopts uniform sampling at the beginning while gradually tilting toward factorization-based sampling along the training process.
To summarize, the contributions of this work are as follows:
● We identify the serious inefficiency issue of AutoDebias, which is imperative to be conquered.
● We propose a dynamic and adaptive sampler for AutoDebias, which yields provably better convergence and stability than those of the standard uniform sampler.
● We conduct extensive experiments on three benchmark datasets, validating that our LightAD accelerates AutoDebias by several magnitudes while maintaining almost equal recommendation accuracy.
The remainder of this paper is organized as follows. We first provide a review of related works in Section 2. We then define the problem and recap AutoDebias in Section 3. We elaborate upon the proposed method in detail in Section 4. After that, we theoretically demonstrate the unbiasedness and effectiveness of our approach in Section 5. The experimental results and discussion are presented in Section 6. Finally, we conclude the paper and present some directions for future work are presented in Section 7.
The fact that biases exist in practical RSs and can strongly compromise the performance of RSs has been proven by many scholars. Since we study how to apply unbiased sampling strategies to improve the efficiency issues in this paper, we first make a general classification of biases in data and review some related work on tackling them. Then, we included some specific sampling methods in the debiasing area.
Intuitively, biases in an RS indicate that models lose the ability to provide accurate recommendations for users. In general, we can roughly divide them into the following categories:
Selection bias. Selection bias occurs when users are free to choose which items to rate, thus the collected ratings are not a representative sample of all the ratings[11]. Intuitively, the data we used in RS are not missing at random. The research of Marlin et al.[12] fully proved the presence of selection bias. Schnabel et al.[6] considered introducing used inversed propensity scores (IPSs) to reweight the observed data. Steck et al.[13] designed a novel unbiased metric named ATOP to remedy selection bias. Considering that users can freely rate items they like, Hernández et al.[8] proposed imputation-based methods that directly impute the missing entries with pseudolabels and then weights down the contribution of these missing ratings.
Conformity bias. Conformity bias occurs as users tend to align themselves with others in the group regardless of whether it goes against their own will[11]. In other words, our rating of some products may succumb to public opinion and therefore not reflect our real interests. Krishnan et al.[14] conducted comparative experiments suggesting that social influence bias is an important factor in RS. Liu et al.[15] took conformity considered and directly used rating data behavior as an important feature to quantify conformity bias. In addition, Tang et al.[16] and Hao et al.[17] leveraged social factors more directly to produce the final prediction and introduce specific parameters to eliminate the impact of conformity bias.
Exposure bias. Exposure bias exists when users are only exposed to a small portion of the items provided[11]. Exposure bias is readily comprehensible as the unobserved data can be attributed to users not liking the item or simply not seeing the item. One simple and straightforward strategy to address exposure bias is to consider all noninteraction data as negative samples and specify their confidence. Hu et al.[18] put forward the classic WMF model in which the unobserved data are assigned lower weights. Similarly, Pan et al.[19] went a step further and specified the data confidence with examined user behavior. Another way to address exposure bias is to design an exposure-based probabilistic model that is able to capture whether a user has been exposed to an item. Liang introduced the EXMF model of users’ activity through an exposure variable and proposed a generative process of implicit feedback. Chen et al.[20] assumed that users frequently come into contact with specific communities and thus modelled confidence weights using a community-based strategy.
Position bias. Position bias emerges as users are inclined to interact with items in a higher position on the recommendation list[11]. When dealing with position bias, many researchers assuming that users’ click behavior occurs if and only if an item is relevant and examined by users[21, 22]. Another conventional strategy is to apply a model and its variants[23–25], which assumes that users examine items from the top to the bottom of the recommendation list. Hence, the click behavior relies only on the relevance of all the items that displayed to users.
Popularity bias. Popularity bias occurs when popular items are recommended more often than their popularity[11]. Due to the common long tail phenomenon in recommendation data; recommendation models usually pay more attention to popular items and hence give higher scores to popular items than their real value. Kamishima et al.[26] used a mean-match regularizer in their models to offset popularity bias. Zheng et al.[27] applied counterfactual reasoning to address the popularity bias and assumed that users’ click behavior depends mainly on their interest and items’ popularity. Krishnan et al.[28] introduced adversarial learning in RS to seek a trade-off between the popular item-biased reconstruction objective and the accuracy of long-tail item recommendations.
In addition to the methods mentioned above, sampling is a promising debiasing method that has been studied by many scholars. Specifically, we can use various sampling strategies to determine which data are used to update the parameters and how often in the training of our models. Steffen et al.[29] employed the uniform negative sampler as the basic strategy to improve efficiency. Several neural-based methods[30–32] also consider using negative sampler as an efficient way to increase the model performance. In response to exposure bias, Yu et al.[33] proposed to over sampled popular negative samples because they had a greater chance of being exposed. Nevertheless, these samplers follow a predetermined sampling distribution which may dynamically change according to the model’s state in each iteration; increasing the likehood genetating low-quality samples. Dae et al.[34], Steffen et al.[35], and Ding et al.[36] proposed an adaptive negative sampler, for the purpose of over sampling “difficult” instances to increase learning efficiency. One disadvantage of these heuristic methods is that they fail to capture the real negative instances. To address this problem, Ding et al.[37] explored leveraging side information to enhance the studied the performance of samplers; Chen et al.[38] took advantage of social information in their sampling strategy.
In this section, we first formulate the task of recommendation systems and then expose the nature of bias from the point of view of risk discrepancy. Finally, we analyze the temporal and spatial complexity of AutoDebias to demonstrate the challenges we face in deploying this algorithm.
Suppose we have a recommender system with user set
L(f)=EpU(u,i,r)[δ(f(u,i),r)], | (1) |
where
ˆLT(f)=1|DT||DT|∑k=1δ(f(uk,ik),rk). | (2) |
According to the PAC learning theory[39], if and only if
To eliminate the distribution discrepancy, Ref. [10] puts forward AutoDebias, a general proposed a debiasing framework, which aims at addressing various biases. The main idea of AutoDebias transforms the aforementioned empirical risk functions into:
ˆLT(f|ϕ)=1|DT||DT|∑k=1w(1)kδ(f(uk,ik),rk)+∑u∈U,i∈Iw(2)uiδ(f(u,i),mui), | (3) |
where AutoDebias imputes the pseudolabels
AutoDebias also devises a metlearning-based strategy to learn the appropriate
w(1)k=exp(φT1[euk∘eik∘erk]),w(2)ui=exp(φT2[eu∘ei∘eOui]),mui=σ(φT3[erui∘eOui]), | (4) |
where
Now, we conduct time complexity analyses of AutoDebias. The time complexity mainly comes from the following two parts:
Calculation on observed training instances. As depicted by the first part of the Eq. (3), the calculation only involves the instances in
Calculation on imputed training instances. As depicted by the second part of the Eq. (3), the complexity involves the calculation over all user-item pairs, which is
Considering the sparse nature of real-world recommendation data, the second part has become the bottleneck of AutoDebias. A naive solution for acceleration employs stochastic gradient descent and a uniform sampler, where the gradient can be quickly calculated on the sampled portion of the dataset. However, we find that this treatment would significantly deteriorate model convergence and stability, leading to inferior recommendation performance. In fact, not all user-item pairs are equally important in model training. We observe fairly scattered
In this section, we present LightAD which equips AutoDebias with a carefully designed adaptive sampling strategy. LightAD adopts an uneven sampling distribution and factorization-based efficient sampling strategy. We will detail LightAD in the following parts.
As mentioned before, the second part of Eq. (3) has become the bottleneck of AutoDebias. Here we leverage a sampler to accelerate the training, where the gradient can be estimated from a much smaller sampled training dataset. Formally, the learning with the sampler can be formulated as follows:
S∼Multinomial(N,ps(u,i)), | (5) |
which draws a small set of users and items, i.e.,
ˆLT(f|ϕ)=1|DT||DT|∑k=1w(1)kδ(f(uk,ik),rk)+1|S|∑u,i∈Sδ(f(u,i),mui), | (6) |
where the confidence weights
Importance-aware sampling distribution. Note that different data may have different confidence weights
Fast factorization-based sampling. Now the question lies in how to perform fast sampling from the distribution
w(2)ui=exp(φT2[eu∘ei])=exp(φT2[eu])∗exp(φT2[ei])≡w(2)u∗w(2)i, | (7) |
where
Su∼Multinomial(N,pu),Si∼Multinomial(N,pi), | (8) |
where
Note that in the early training stage, the model may not yield a reliable estimate of the confidence weights. Directly learning a model from an informative sampler may lead to suboptimal results. To address this problem, we propose a self-paced strategy that adopts a uniform distribution at the beginning while gradually alters the distribution as the training proceeds. Formally, we adopt the following sampling strategy:
S1∼Multinomial(αN,ps(u,i)),S2∼Multinomial((1−α)N,U(u,i)), | (9) |
where
ˆLP(f|ϕ)=1|DT||DT|∑k=1w(1)kδ(f(uk,ik),rk)+α∑u,iw(2)ui|S1|⋅∑u,i∈S1δ(f(u,i),mui)+(1−α)nm|S2|∑u,i∈S2w(2)uiδ(f(u,i),mui). | (10) |
The former section elaborates how LightAD accelerates AutoDebias. In this section, we conduct theoretical analyses to answer the following questions: Does the unbiasedness still hold with our proposed sampling strategy? Does the proposed informative sampler reduce the sampling variance?
It has been proven that the empirical risk function of AutoDebias can be an unbiased estimator of the true risk. Now we connect the objective of LightAD (Eq. (10)) with AutoDebias, which reveals the unbiasedness of LightAD. In fact, we have:
ES1,S2[ˆLP(f|ϕ)]=1|DT||DT|∑k=1w(1)kδ(f(uk,ik),rk)+ES1[α∑u,iw(2)ui|S1|∑u,i∈S1δ(f(u,i),mui)]+ES2[(1−α)nm|S2|∑u,i∈S2w(2)uiδ(f(u,i),mui)]=1|DT||DT|∑k=1w(1)kδ(f(uk,ik),rk)+α∑u,iw(2)ui|S1||S1|Eu,i∼ps[δ(f(u,i),mui)]+(1−α)nm|S2||S2|Eu,i∼U[w(2)uiδ(f(u,i),mui)]= |
1|DT||DT|∑k=1w(1)kδ(f(uk,ik),rk)+α∑u,iw(2)ui∑u,iw(2)ui∑u,iw(2)uiδ(f(u,i),mui)+(1−α)nm∑u,iw(2)uiδ(f(u,i),mui)=ˆLS(f|ϕ). | (11) |
From the above mathematical deduction, we can safely draw the conclusion that the objective of LightAD is indeed to use an unbiased estimator of AutoDebias and the ideal true risk.
How does the conventional uniform sampler perform? Does the proposed informative sampler reduce the variance? In this section, we aim to provide a theoretical answer to these problems.
Let us first formulate the variance of the estimated objective with the sampler. In fact, we have the following upper-bound of the variance:
VarS[ˆLS]=1|S|Varps[w(2)uips(u,i)δ(f(u,i),mui)]⩽1|S|(∑u,iVarui[w(2)ui]+Eu,i[w(2)ui]2ps(u,i))×Eps[δ(f(u,i),mui)]2−1|S|Eps[w(2)uiδ(f(u,i),mui)]2. | (12) |
For a uniform sampler, the variance of the objective is subject to the variance of weights
To address this problem, we should choose a better sampling distribution to reduce the upper bound of the estimation variance. In fact, we have:
∑ui(w(2)ui)2ps(u,i)=∑ui(w(2)ui)2ps(u,i)∑uips(u,i)⩾(∑ui√(w(2)ui)2ps(u,i)√ps(u,i))2=Eui[w(2)ui]2, | (13) |
where we employ the Cauchy inequality, and the equation holds if and only if
In this section, we first describe the experimental settings, and then present the detailed experiments with the purpose of answering the following questions:
RQ1: How does our LightAD perform compared with AutoDebias in recommendation performance?
RQ2: How does the proposed LightAD accelerate AutoDebias?
RQ3: How does the hyperparameter
RQ4: How does LightAD perform when dealing with the simultaneous presence of various biases?
We conduct our experiments with two publicly accessible datasets and one synthetic dataset: Yahoo!R3, Coat, and Simulation. The concrete statistics of the three datasets are summarized in Table 1.
Dataset | Users | Items | Training | Validation | Test |
Coat | 290 | 300 | 6960 | 232 | 4176 |
Yahoo!R3 | 15400 | 1000 | 311704 | 2700 | 48600 |
Simulation | 500 | 500 | 12500 | 3750 | 67500 |
Yahoo!R3. This dataset contains two types of ratings collected from two sources. The first source collects ratings of users’ normal interactions with music in Yahoo! services (i.e., users pick and rate items according to their own preferences), which can be considered as a stochastic logging policy. Thus, this type of data is obsessed with bias. The second kind of data is ratings from randomly selected songs collected during an online survey (i.e., uniform logging policy). Approximately, it can be regarded as unbiased and reflects the real interest of users.
Coat. This is a dataset collected by Ref. [7], which simulates customers shopping behaviors for coats in an online service. In this dataset, users were first given a web-shop interface and then asked to select the coat they wanted to buy most. Shortly afterwards, they were requested to rate 24 of the self-selected coats and 16 of the randomly picked ones on a five-point scale. Therefore, this dataset consists of self-selected ratings and the uniformly selected ratings which can be regarded as biased data and unbiased data respectively.
Simulation. A great advantage of AutoDebias is that it can handle multiple biases simultaneously, which means good universality in complex scenario. We need to verify that this universality also persists in our model. Since there is no public dataset that satisfies such a scenario, we borrow from AutoDebias and adopt a synthetic dataset Simulation where both position bias and selection bias are taken into account. Unlike the previous two datasets, this one contains the position information of items when the interaction occurs. In short, this dataset also includes a large amount of biased data and a relatively small amount of unbiased data. The rationale for the way of data synthesis can be found in Refs. [40, 41].
To be consistent with AutoDebias, we follow the experimental setup with them in training, validating, and testing. Specifically, we use the biased data
Evaluation protocols. The evaluation metrics in this work are the negative logarithmic loss (NLL), the area under the ROC curve (AUC), and normalized discounted cumulative gain (NDCG):
NLL. This metric evaluates the performance of the predictions:
NLL=−1|Dve|∑(u,i,r)∈Dvelog(1+e−r∗fθ(u,i)), | (14) |
where
AUC. This metric evaluates the performance of rankings:
AUC=∑(u,i)∈D+ve^Ranku,i−(|D+ve|+1)(|D+ve|)/2(|D+ve|)∗(|Dve|−|D+ve|), | (15) |
where
NDCG@k. This metric evaluates the quality of recommendation through discounted importance based on ranking position:
DCGu@k=∑(u,i)∈DveI(^Ranku,i⩽k)log(^Ranku,i+1),NDCG@k=1|U|∑u∈UDCGu@kIDCGu@k, | (16) |
where
We implement our experiment with Pytorch and the code is available at https://github.com/Chris-Mraz/LightAD. We perform grid search to tune the hyperparameters for all candidate methods on the validation set.
Baselines. To demonstrate the effectiveness of our proposed sampled method, we use matrix factorization (MF)[42] as the backbone and the following models as our baselines:
Matrix factorization (MF)[42]. MF is one classical model in RS. In this model, the preference of user u for item i can be formulated as
Inverse propensity score (IPS)[6]. IPS is a counterfactual-based recommendation method that estimates the propensity score via naive bayes.
Doubly robust (DR)[43]. DR combines the ideas of the IPS and data imputation. The debiasing capacity of this model can be ensured if either the imputed data or propensity score are accurately estimated
CausE and KD-label. CausE and KD-label are excellent methods that transfer the unbiased information with a teacher model using knowledge distillation techniques. Both of them are the best performing models in Ref. [5].
AutoDebias. AutoDebias is a generic and effective solution for recommendation debiasing.
We also test three versions of our LightAD for ablation studies: (ⅰ) LightAD-uniform, which improves AutoDebias with a uniform sampler. (ⅱ) LightAD-fixed, where we remove the selfpaced strategy and fix
Table 2 shows the performance of our LightAD and the baselines with respect to three evaluation metrics on two datasets. We make the following two observations:
Model | Yahoo!R3 | Coat | |||||
NLL | AUC | NDCG@5 | NLL | AUC | NDCG@5 | ||
MF (biased) | −0.587 | 0.727 | 0.547 | −0.546 | 0.757 | 0.484 | |
MF (uniform) | −0.503 | 0.568 | 0.435 | −0.641 | 0.540 | 0.362 | |
MF (combined) | −0.579 | 0.729 | 0.551 | −0.538 | 0.750 | 0.492 | |
IPS | −0.450 | 0.725 | 0.556 | −0.531 | 0.755 | 0.484 | |
DR | −0.444 | 0.723 | 0.552 | −0.513 | 0.760 | 0.501 | |
CausE | −0.585 | 0.729 | 0.553 | −0.529 | 0.762 | 0.500 | |
KD-label | −0.587 | 0.727 | 0.547 | −0.546 | 0.757 | 0.484 | |
AutoDebias | −0.419 | 0.748 | 0.648 | −0.529 | 0.766 | 0.501 | |
LightAD-uniform | −0.410 | 0.734 | 0.586 | −0.505 | 0.710 | 0.473 | |
LightAD-fixed | –0.406 | 0.737 | 0.632 | –0.499 | 0.742 | 0.489 | |
LightAD-selfpaced | −0.425 | 0.748 | 0.641 | −0.510 | 0.765 | 0.517 |
(Ⅰ) As a whole, our LightAD-selfpaced outperforms other baselines by a large margin. Contrast with AutoDebias, our model achieves similar performance with AutoDebias in both datasets and some metrics even exceed it. This can be attributed to the fact that AutoDebias gives the same attention to all samples during training, which will hinder the capture of truly useful information. It’s worth noting that the learning of sampling strategy is only conducted on a very small amount of data instances i.e., the same order of magnitude as the biased instances, which is far less than
(Ⅱ) In terms of all the sampling strategies, LightAD with simple uniform sampling obtains the worst performance and convergence. This is consistent with our intuition that uniform sampling suffers from high variance. The fact that LightAD-fixed outperforms LightAD-uniform proves that the informative sampler could partly address this problem. LightAD-selfpaced achieves the best performance which validates the effectiveness of the proposed selfpaced strategy.
As we mention in Section 3.3, the calculation on imputed training instances consumes enormous computing resources and restricts the practicality of the AutoDebias. In this part, we test this notion through ablation experiments. Table 3 shows the time cost of the ablated methods with different components being deleted. From the table, we can conclude that the introduction of these components
Model | w(1) | m | w(2) | Epoch (s) |
MF | × | × | × | 0.02 |
AutoDebias-w(1) | √ | × | × | 0.07 |
AutoDebias-w(1)m | √ | √ | × | 0.92 |
AutoDebias | √ | √ | √ | 1.3 |
To verify the efficiency promotion of our methods, we conduct experiments in two dimensions: the time cost in each epoch and the overall time cost. To ensure the reliability of the experiment as much as possible, we validated the experiment under the same conditions i.e., the same hyperparameter configuration and computing resource. The statistics of training time for these four models are shown in Table 4. From this table, we can find that employing samplers could indeed accelerate the training of AutoDebias. More interestingly, employing a uniform sampler achieves fast calculation in each epoch while requiring more time cost in total. We ascribe it to the fact that informative sampler could improve convergence and reduce the number of the required epochs. To prove this, we plotted the training process of LightAD-fixed and LightAD-uniform. As shown in the Fig. 1, it takes less than 20 epochs for LightAD-fixed to converge while taking more than 40 epochs to achieve the analogous level in LightAD-uniform. The experimental results are consistent with our previous theoretical proof.
Model | Epoch (s) | Total time (s) |
AutoDebias | 1.30 | 1260.50 |
LightAD-uniform | 0.14 | 74.10 |
LightAD-fixed | 0.34 | 107.36 |
LightAD-selfpaced | 0.20 | 29.05 |
Note that in LightAD-selfpaced, we gradually increase the contribution of the informative sampler while hyperparamter
To demonstrate the universality of our LightAD, we conduct experiments on Simulation. We compare our three sampling strategies with two SOTA methods in mitigating both position bias and selection bias:
Dual learning algorithm (DLA)[44]. DLA treated the learning of the propensity function and the ranking list as a dual problem and designed specific EM algorithms to learn the two models, which is one SOTA method in mitigating position bias.
Heckman ensemble (HeckE)[45]. Ascribed bias to the reality that users can only view a truncated list of recommendations. HeckE adopted a two-step method in which it first designed a Probit model to estimate the probability of a item being examined and then took advantage of this probability to modify their model.
Different from the models in Section 6.2, we add position information in the modeling of
NLL | AUC | NDCG@5 | |
DLA | −0.693 | 0.572 | 0.595 |
HeckE | −0.692 | 0.587 | 0.657 |
LightAD-uniform | −0.731 | 0.603 | 0.665 |
LightAD-fixed | −0.745 | 0.604 | 0.680 |
LightAD-selfpaced | −0.685 | 0.611 | 0.700 |
In this work, we identify the inefficiency issue of AutoDebias, and propose to accelerate AutoDebias with a carefully designed sampling strategy. The sampler could adaptively and dynamically draw informative training instances, which brings provably better convergence and stability than the standard uniform sampler. Extensive experiments on three benchmark datasets validate that our LightAD accelerates AutoDebias by several magnitudes while maintaining almost equal recommendation performance and universality.
One promising research direction for future work is to explore pruning strategy for AutoDebias. AutoDebias involves a large number of debiasing parameters, and of course not all parameters are necessary. Leveraging AutoML or lottery hypothesis to locate important parameters while leave others would significantly reduce the time and space cost, mitigate over-fitting and potentially boost recommendation performance.
This work was supported by the National Key R&D Program of China (2021YFB2900302).
The authors declare that they have no conflict of interest.
[1] |
van Engelen J E, Hoos H H. A survey on semi-supervised learning. Machine Learning, 2020, 109: 373–440. DOI: 10.1007/s10994-019-05855-6
|
[2] |
Ang J C, Mirzal A, Haron H, et al. Supervised, unsupervised, and semi-supervised feature selection: A review on gene selection. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016, 13 (5): 971–989. DOI: 10.1109/tcbb.2015.2478454
|
[3] |
Zhu X, Goldberg A B. Introduction to Semi-Supervised Learning. Cham, Switzerland: Springer, 2009.
|
[4] |
Scudder H. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 1965, 11 (3): 363–371. DOI: 10.1109/tit.1965.1053799
|
[5] |
Blum A, Mitchell T. Combining labeled and unlabeled data with co-training. In: COLT' 98: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. New York: ACM, 1998: 92–100.
|
[6] |
Belkin M, Niyogi P, Sindhwani V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 2006, 7: 2399–2434. DOI: 10.5555/1248547.1248632
|
[7] |
Belkin M, Niyogi P. Semi-supervised learning on Riemannian manifolds. Machine Learning, 2004, 56: 209–239. DOI: 10.1023/B:MACH.0000033120.25363.1e
|
[8] |
Chapelle O, Schölkopf B, Zien A, Transductive support vector machines. In: Semi-Supervised Learning. Cambridge: MIT Press. 2006, 105–117.
|
[9] |
Chong Y, Ding Y, Yan Q, et al. Graph-based semi-supervised learning: A review. Neurocomputing, 2020, 408 (30): 216–230. DOI: 10.1016/j.neucom.2019.12.130
|
[10] |
Zhu X, Ghahramani Z, Lafferty J. Semi-supervised learning using Gaussian fields and harmonic functions. In: ICML'03: Proceedings of the Twentieth International Conference on International Conference on Machine Learning. Washington, DC: AAAI Press, 2003: 912–919.
|
[11] |
Zhou D, Bousquet O, Lal T N, et al. Learning with local and global consistency. In: NIPS'03: Proceedings of the 16th International Conference on Neural Information Processing Systems. New York: ACM, 2003: 321–328.
|
[12] |
Chen J, Wang C, Sun Y, et al. Semi-supervised Laplacian regularized least squares algorithm for localization in wireless sensor networks. Computer Networks, 2011, 55 (10): 2481–2491. DOI: 10.1016/j.comnet.2011.04.010
|
[13] |
Szummer M, Jaakkola T. Partially labeled classification with Markov random walks. In: NIPS'01: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. Cambridge: MIT Press, 2001: 945–952.
|
[14] |
Grira N, Crucianu M, Boujemaa N. Active semi-supervised fuzzy clustering for image database categorization. In: MIR '05: Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval. New York: ACM, 2005: 9–16.
|
[15] |
Chapelle O, Zien A. Semi-supervised classification by low density separation. In: AISTATS 2005–Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics. Stuttgart, Germany: Max-Planck-Gesellschaft, 2005: 57–64.
|
[16] |
Kostopoulos G, Karlos S, Kotsiantis S, et al. Semi-supervised regression: A recent review. Journal of Intelligent & Fuzzy Systems, 2018, 35 (2): 1483–1500. DOI: 10.3233/JIFS-169689
|
[17] |
Torii M, Wagholikar K, Liu H. Using machine learning for concept extraction on clinical documents from multiple data sources. Journal of the American Medical Informatics Association, 2011, 18: 580–587 . DOI: 10.1136/amiajnl-2011-000155
|
[18] |
Scardapane S, Fierimonte R, Wang D, et al. Distributed music classification using random vector functional-link nets. In: 2015 International Joint Conference on Neural Networks (IJCNN). Killarney, Ireland: IEEE, 2015: 1–8.
|
[19] |
Shih T K, Distributed multimedia databases In: Shih T K, editor. Distributed Multimedia Databases: Techniques and Applications. Hershey, PA: IGI Global, 2002: 2–12.
|
[20] |
Shen P, Du X, Li C. Distributed semi-supervised metric learning. IEEE Access, 2016, 4: 8558–8571. DOI: 10.1109/ACCESS.2016.2632158
|
[21] |
Scardapane S, Fierimonte R, Di Lorenzo P, et al. Distributed semi-supervised support vector machines. Neural Networks, 2016, 80: 43–52 . DOI: 10.1016/j.neunet.2016.04.007
|
[22] |
Fierimonte R, Scardapane S, Uncini A, et al. Fully decentralized semi-supervised learning via privacy-preserving matrix completion. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28: 2699–2711. DOI: 10.1109/TNNLS.2016.2597444
|
[23] |
Gan H, Li Z, Wu W, et al. Safety-aware graph-based semi-supervised learning. Expert Systems With Applications, 2018, 107: 243–254. DOI: 10.1016/j.eswa.2018.04.031
|
[24] |
Lee K, Lam M, Pedarsani R, et al. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 2018, 64 (3): 1514–1529. DOI: 10.1109/tit.2017.2736066
|
[25] |
Chen L, Han K, Du Y, et al. Block-division-based wireless coded computation. IEEE Wireless Communications Letters, 2022, 11 (2): 283–287. DOI: 10.1109/LWC.2021.3125983
|
[26] |
Agarwal A, Duchi J C. Distributed delayed stochastic optimization. In: 2012 IEEE 51st IEEE Conference on Decision and Control (CDC). Maui, USA: IEEE, 2012: 5451–5452.
|
[27] |
Alfakih A Y, Khandani A K, Wolkowicz H. Solving euclidean distance matrix completion problems via semidefinite programming. Computational Optimization and Applications, 1999, 12: 13–30. DOI: 10.1023/A:1008655427845
|
[28] |
Al-Homidan S, Wolkowicz H. Approximate and exact completion problems for Euclidean distance matrices using semidefinite programming. Linear Algebra and Its Applications, 2005, 406: 109–141. DOI: 10.1016/j.laa.2005.03.021
|
[29] |
Liu W, Chen L, Zhang W. Decentralized federated learning: Balancing communication and computing costs. IEEE Transactions on Signal and Information Processing Over Networks, 2022, 8: 131–143. DOI: 10.1109/TSIPN.2022.3151242
|
[30] |
Liu W, Chen L, Chen Y, et al. Accelerating federated learning via momentum gradient descent. IEEE Transactions on Parallel and Distributed Systems, 2020, 31 (8): 1754–1766. DOI: 10.1109/TPDS.2020.2975189
|
[31] |
Wang Z, Du Y, Wei K, et al. Vision, application scenarios, and key technology trends for 6G mobile communications. Science China Information Sciences, 2022, 65: 151301. DOI: 10.1007/s11432-021-3351-5
|
[1] |
van Engelen J E, Hoos H H. A survey on semi-supervised learning. Machine Learning, 2020, 109: 373–440. DOI: 10.1007/s10994-019-05855-6
|
[2] |
Ang J C, Mirzal A, Haron H, et al. Supervised, unsupervised, and semi-supervised feature selection: A review on gene selection. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016, 13 (5): 971–989. DOI: 10.1109/tcbb.2015.2478454
|
[3] |
Zhu X, Goldberg A B. Introduction to Semi-Supervised Learning. Cham, Switzerland: Springer, 2009.
|
[4] |
Scudder H. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 1965, 11 (3): 363–371. DOI: 10.1109/tit.1965.1053799
|
[5] |
Blum A, Mitchell T. Combining labeled and unlabeled data with co-training. In: COLT' 98: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. New York: ACM, 1998: 92–100.
|
[6] |
Belkin M, Niyogi P, Sindhwani V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 2006, 7: 2399–2434. DOI: 10.5555/1248547.1248632
|
[7] |
Belkin M, Niyogi P. Semi-supervised learning on Riemannian manifolds. Machine Learning, 2004, 56: 209–239. DOI: 10.1023/B:MACH.0000033120.25363.1e
|
[8] |
Chapelle O, Schölkopf B, Zien A, Transductive support vector machines. In: Semi-Supervised Learning. Cambridge: MIT Press. 2006, 105–117.
|
[9] |
Chong Y, Ding Y, Yan Q, et al. Graph-based semi-supervised learning: A review. Neurocomputing, 2020, 408 (30): 216–230. DOI: 10.1016/j.neucom.2019.12.130
|
[10] |
Zhu X, Ghahramani Z, Lafferty J. Semi-supervised learning using Gaussian fields and harmonic functions. In: ICML'03: Proceedings of the Twentieth International Conference on International Conference on Machine Learning. Washington, DC: AAAI Press, 2003: 912–919.
|
[11] |
Zhou D, Bousquet O, Lal T N, et al. Learning with local and global consistency. In: NIPS'03: Proceedings of the 16th International Conference on Neural Information Processing Systems. New York: ACM, 2003: 321–328.
|
[12] |
Chen J, Wang C, Sun Y, et al. Semi-supervised Laplacian regularized least squares algorithm for localization in wireless sensor networks. Computer Networks, 2011, 55 (10): 2481–2491. DOI: 10.1016/j.comnet.2011.04.010
|
[13] |
Szummer M, Jaakkola T. Partially labeled classification with Markov random walks. In: NIPS'01: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. Cambridge: MIT Press, 2001: 945–952.
|
[14] |
Grira N, Crucianu M, Boujemaa N. Active semi-supervised fuzzy clustering for image database categorization. In: MIR '05: Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval. New York: ACM, 2005: 9–16.
|
[15] |
Chapelle O, Zien A. Semi-supervised classification by low density separation. In: AISTATS 2005–Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics. Stuttgart, Germany: Max-Planck-Gesellschaft, 2005: 57–64.
|
[16] |
Kostopoulos G, Karlos S, Kotsiantis S, et al. Semi-supervised regression: A recent review. Journal of Intelligent & Fuzzy Systems, 2018, 35 (2): 1483–1500. DOI: 10.3233/JIFS-169689
|
[17] |
Torii M, Wagholikar K, Liu H. Using machine learning for concept extraction on clinical documents from multiple data sources. Journal of the American Medical Informatics Association, 2011, 18: 580–587 . DOI: 10.1136/amiajnl-2011-000155
|
[18] |
Scardapane S, Fierimonte R, Wang D, et al. Distributed music classification using random vector functional-link nets. In: 2015 International Joint Conference on Neural Networks (IJCNN). Killarney, Ireland: IEEE, 2015: 1–8.
|
[19] |
Shih T K, Distributed multimedia databases In: Shih T K, editor. Distributed Multimedia Databases: Techniques and Applications. Hershey, PA: IGI Global, 2002: 2–12.
|
[20] |
Shen P, Du X, Li C. Distributed semi-supervised metric learning. IEEE Access, 2016, 4: 8558–8571. DOI: 10.1109/ACCESS.2016.2632158
|
[21] |
Scardapane S, Fierimonte R, Di Lorenzo P, et al. Distributed semi-supervised support vector machines. Neural Networks, 2016, 80: 43–52 . DOI: 10.1016/j.neunet.2016.04.007
|
[22] |
Fierimonte R, Scardapane S, Uncini A, et al. Fully decentralized semi-supervised learning via privacy-preserving matrix completion. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28: 2699–2711. DOI: 10.1109/TNNLS.2016.2597444
|
[23] |
Gan H, Li Z, Wu W, et al. Safety-aware graph-based semi-supervised learning. Expert Systems With Applications, 2018, 107: 243–254. DOI: 10.1016/j.eswa.2018.04.031
|
[24] |
Lee K, Lam M, Pedarsani R, et al. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 2018, 64 (3): 1514–1529. DOI: 10.1109/tit.2017.2736066
|
[25] |
Chen L, Han K, Du Y, et al. Block-division-based wireless coded computation. IEEE Wireless Communications Letters, 2022, 11 (2): 283–287. DOI: 10.1109/LWC.2021.3125983
|
[26] |
Agarwal A, Duchi J C. Distributed delayed stochastic optimization. In: 2012 IEEE 51st IEEE Conference on Decision and Control (CDC). Maui, USA: IEEE, 2012: 5451–5452.
|
[27] |
Alfakih A Y, Khandani A K, Wolkowicz H. Solving euclidean distance matrix completion problems via semidefinite programming. Computational Optimization and Applications, 1999, 12: 13–30. DOI: 10.1023/A:1008655427845
|
[28] |
Al-Homidan S, Wolkowicz H. Approximate and exact completion problems for Euclidean distance matrices using semidefinite programming. Linear Algebra and Its Applications, 2005, 406: 109–141. DOI: 10.1016/j.laa.2005.03.021
|
[29] |
Liu W, Chen L, Zhang W. Decentralized federated learning: Balancing communication and computing costs. IEEE Transactions on Signal and Information Processing Over Networks, 2022, 8: 131–143. DOI: 10.1109/TSIPN.2022.3151242
|
[30] |
Liu W, Chen L, Chen Y, et al. Accelerating federated learning via momentum gradient descent. IEEE Transactions on Parallel and Distributed Systems, 2020, 31 (8): 1754–1766. DOI: 10.1109/TPDS.2020.2975189
|
[31] |
Wang Z, Du Y, Wei K, et al. Vision, application scenarios, and key technology trends for 6G mobile communications. Science China Information Sciences, 2022, 65: 151301. DOI: 10.1007/s11432-021-3351-5
|