Coded computing for distributed graph-based semi-supervised learning

Siqi Tan; Li Chen; Weidong Wang

doi:10.52396/JUSTC-2022-0133

JUSTC > 2023 > 53(4): 0401. > DOI: 10.52396/JUSTC-2022-0133 CSTR: 32290.14.JUSTC-2022-0133

Open Access JUSTC Information Science and Technology Article 01 May 2023

School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China

Cite this: JUSTC, 2023, 53(4): 0401

Author Bio:
Siqi Tan received the B.E. degree in Electronic Information Engineering from the University of Science and Technology of China (USTC) in 2020. She is currently pursuing the M.E. degree at the Department of Electronic Engineering and Information Science, USTC. Her research interests include wireless communications and coded distributed computation.

Li Chen received the B.E. degree in Electrical and Information Engineering from the Harbin Institute of Technology in 2009, and the Ph.D. degree in Electrical Engineering from the University of Science and Technology of China (USTC) in 2014. He is currently an Associate Professor at the Department of Electronic Engineering and Information Science, USTC. His research interests include integrated computation and communication, and integrated sensing and communication.
Corresponding author:
Li Chen, E-mail: chenli87@ustc.edu.cn
Received Date: September 16, 2022
Accepted Date: November 14, 2022
Available Online: May 01, 2023

Abstract

Semi-supervised learning (SSL) has been applied to many practical applications over the past few years. Recently, distributed graph-based semi-supervised learning (DGSSL) has been shown to have good performance. Current DGSSL algorithms usually have the problems of inefficient graph construction and the straggler effect. This paper proposes a novel coded DGSSL (CDGSSL) to solve these problems. We first provide a novel parallel and distributed solution of matrix completion for efficient graph construction. Then, we develop the CDGSSL algorithm based on coding theory. Specifically, the proposed algorithm consists of two parts separately designed based on the maximum distance separable (MDS) code. In general, the proposed coded distributed algorithm is efficient and straggler tolerant. Moreover, we provide an optimal parameter design for the proposed algorithm. The results of the experiments on the Alibaba Cloud elastic compute service (ECS) demonstrate the superiority of the proposed algorithm.

Graphical Abstract

The proposed algorithm consists of two parts: coded matrix completion and coded label propagation.

Public Summary
- Existing distributed graph-based semi-supervised learning algorithms suffer from extensive communication and computation. The proposed scheme is more efficient with low complexity by providing a parallel and distributed solution for global Euclidean distance matrix completion.
- The iteration time of distributed graph-based semi-supervised learning is defined by the slowest node, noted as the straggler node. A novel coded computation design for distributed graph-based semi-supervised learning is proposed to decrease the straggler effect.
- We numerically verified the superiority of the proposed scheme on the Alibaba Cloud elastic compute service. In general, the simulation results demonstrate that our proposed algorithm is efficient and straggler tolerant.

1. Introduction

In the era of information explosion, recommender systems (RSs) that connect users with the right items have been recognized as the most effective means to alleviate information overload^[1]. Existing recommendation methods mainly focus on developing a machine learning model to better fit the collected user behavior data^{[2, 3]}. However, because the behavior data in practice are observational rather than experimental, various biases occur. For example, user behaviors are subject to what was exposed to the users (i.e., exposure bias) or to prevailing public opinion (i.e., conformity bias), and therefore deviate from the real interests of users. Consequently, blindly fitting the RS model without tackling the bias issues will lead to unacceptable performance^{[4, 5]}.

To alleviate the unfavorable effect of biases, a surge of recent work has been devoted to debiasing solutions. For example, the inverse propensity score (IPS) was proposed to reweight training samples for expectation-unbiased learning^{[6, 7]}; data imputation was explored to fill missing values^{[8, 9]}; knowledge distillation was also used to bridge the models trained on biased data and unbiased data. More recently, AutoDebias^[10], which applies a metalearning technique to optimize the objective function on unbiased data, has been demonstrated to be a generic and effective solution for recommendation debiasing. AutoDebias gives both pseudolabels and confidence weights for each user-item pair, is flexible enough to address various biases, and outperforms previous work by a large margin.

Although effective at managing diverse biases, AutoDebias suffers from serious inefficiency. Specifically, the training of AutoDebias involves calculations for each user-item pair. As such, the complexity of AutoDebias is linear with the product of the number of users and items, which can easily scale to one billion or even larger. This crucial defect severely limits the applicability and potential of AutoDebias. Although stochastic gradient descent with a uniform sampler can be employed for acceleration, this treatment significantly deteriorates model convergence and stability, leading to inferior recommendation performance. Hence, how to accelerate AutoDebias without reducing its accuracy is still an open problem.

To this end, in this work, we propose LightAutoDebias (short as LightAD), which equips AutoDebias with a carefully designed adaptive sampling strategy. We make three important improvements: (ⅰ) We adopt a dynamic sampling distribution that favors the informative training instances with large confidence weights. Further theoretical analysis proves that such a distribution could reduce the sampling variance and accelerate convergence. (ⅱ) We develop a customized factorization-based sampling strategy to support fast sampling from the above dynamic distribution. (ⅲ) Because the model may not reliably estimate the confidence weights in the early training stage, we propose a self-paced strategy that adopts uniform sampling at the beginning while gradually tilting toward factorization-based sampling along the training process.

To summarize, the contributions of this work are as follows:

● We identify the serious inefficiency issue of AutoDebias, which must be resolved before the method can be applied at scale.

● We propose a dynamic and adaptive sampler for AutoDebias, which yields provably better convergence and stability than those of the standard uniform sampler.

● We conduct extensive experiments on three benchmark datasets, validating that our LightAD accelerates AutoDebias by several orders of magnitude while maintaining almost equal recommendation accuracy.

The remainder of this paper is organized as follows. We first provide a review of related works in Section 2. We then define the problem and recap AutoDebias in Section 3. We elaborate upon the proposed method in detail in Section 4. After that, we theoretically demonstrate the unbiasedness and effectiveness of our approach in Section 5. The experimental results and discussion are presented in Section 6. Finally, we conclude the paper and present some directions for future work in Section 7.

2. Related work

Many studies have shown that biases exist in practical RSs and can strongly compromise their performance. Since this paper studies how to apply unbiased sampling strategies to improve efficiency, we first make a general classification of biases in data and review some related work on tackling them. Then, we review some specific sampling methods in the debiasing area.

2.1 Biases in recommendation

Intuitively, biases in an RS indicate that models lose the ability to provide accurate recommendations for users. In general, we can roughly divide them into the following categories:

Selection bias. Selection bias occurs when users are free to choose which items to rate, so the collected ratings are not a representative sample of all the ratings^[11]. Intuitively, the data used in an RS are not missing at random. The research of Marlin et al.^[12] confirmed the presence of selection bias. Schnabel et al.^[6] introduced inverse propensity scores (IPSs) to reweight the observed data. Steck et al.^[13] designed a novel unbiased metric named ATOP to remedy selection bias. Considering that users can freely rate items they like, Hernández et al.^[8] proposed imputation-based methods that directly impute the missing entries with pseudolabels and then weight down the contribution of these missing ratings.

Conformity bias. Conformity bias occurs as users tend to align themselves with others in the group regardless of whether it goes against their own will^[11]. In other words, our rating of some products may succumb to public opinion and therefore not reflect our real interests. Krishnan et al.^[14] conducted comparative experiments suggesting that social influence bias is an important factor in RS. Liu et al.^[15] took conformity into consideration and directly used rating behavior data as an important feature to quantify conformity bias. In addition, Tang et al.^[16] and Hao et al.^[17] leveraged social factors more directly to produce the final prediction and introduced specific parameters to eliminate the impact of conformity bias.

Exposure bias. Exposure bias exists when users are only exposed to a small portion of the items provided^[11]. Exposure bias is readily comprehensible as the unobserved data can be attributed to users not liking the item or simply not seeing the item. One simple and straightforward strategy to address exposure bias is to consider all noninteraction data as negative samples and specify their confidence. Hu et al.^[18] put forward the classic WMF model in which the unobserved data are assigned lower weights. Similarly, Pan et al.^[19] went a step further and specified the data confidence with examined user behavior. Another way to address exposure bias is to design an exposure-based probabilistic model that is able to capture whether a user has been exposed to an item. Liang introduced the EXMF model of users’ activity through an exposure variable and proposed a generative process of implicit feedback. Chen et al.^[20] assumed that users frequently come into contact with specific communities and thus modelled confidence weights using a community-based strategy.

Position bias. Position bias emerges as users are inclined to interact with items in a higher position on the recommendation list^[11]. When dealing with position bias, many researchers assume that users’ click behavior occurs if and only if an item is relevant and examined by users^{[21, 22]}. Another conventional strategy is to apply a model and its variants^[23–25], which assumes that users examine items from the top to the bottom of the recommendation list. Hence, the click behavior relies only on the relevance of all the items that are displayed to users.

Popularity bias. Popularity bias occurs when popular items are recommended more often than their popularity would warrant^[11]. Due to the common long-tail phenomenon in recommendation data, recommendation models usually pay more attention to popular items and hence give them higher scores than their real value. Kamishima et al.^[26] used a mean-match regularizer in their models to offset popularity bias. Zheng et al.^[27] applied counterfactual reasoning to address the popularity bias and assumed that users’ click behavior depends mainly on their interest and items’ popularity. Krishnan et al.^[28] introduced adversarial learning in RS to seek a trade-off between the popular item-biased reconstruction objective and the accuracy of long-tail item recommendations.

2.2 Sampling

In addition to the methods mentioned above, sampling is a promising debiasing method that has been studied by many scholars. Specifically, we can use various sampling strategies to determine which data are used to update the parameters, and how often, during model training. Steffen et al.^[29] employed the uniform negative sampler as the basic strategy to improve efficiency. Several neural-based methods^[30–32] also use a negative sampler as an efficient way to increase model performance. In response to exposure bias, Yu et al.^[33] proposed oversampling popular negative samples because they had a greater chance of being exposed. Nevertheless, these samplers follow a predetermined sampling distribution that cannot adapt to the model’s state in each iteration, which increases the likelihood of generating low-quality samples. Dae et al.^[34], Steffen et al.^[35], and Ding et al.^[36] proposed adaptive negative samplers that oversample “difficult” instances to increase learning efficiency. One disadvantage of these heuristic methods is that they fail to capture the real negative instances. To address this problem, Ding et al.^[37] explored leveraging side information to enhance the performance of samplers; Chen et al.^[38] took advantage of social information in their sampling strategy.

3. Preliminary

In this section, we first formulate the task of recommendation systems and then expose the nature of bias from the point of view of risk discrepancy. Finally, we analyze the temporal and spatial complexity of AutoDebias to demonstrate the challenges we face in deploying this algorithm.

3.1 Problem definition

Suppose we have a recommender system with user set ${\mathcal {U}}$ (including $n$ users) and item set ${\cal{I}}$ (including $m$ items). Let $r \in {\mathcal {R}}$ be the users’ feedback on items, which can be binary (implicit feedback) or numerical ratings (explicit feedback). Let $D_{\rm{T}}$ denote the historical user behavior, which can be seen as a set of triplets $\{(u_k,i_k,r_k)\}_{1\le k \le |D_{\rm{T}}|}$ generated from an unknown training distribution $p_{\rm{T}}(u,i,r)$ over the space ${\mathcal {U}}\times {\cal{I}} \times {\mathcal {R}}$ . For convenience, we use $\delta (\cdot,\cdot)$ to represent the predefined error function, e.g., the MSE, cross-entropy, or hinge loss. The objective of a recommender system is to learn a proper function $f_{\theta}$ from $D_{\rm{T}}$ that minimizes the following true risk:

$\begin{array}{l} L(f)=\mathbb E_{p_{\rm{U}}(u,i,r)}[\delta (f(u,i),r)], \\ \end{array}$

(1)

where $p_{\rm{U}}(u,i,r)$ denotes the ideal unbiased data distribution for model evaluation. Regrettably, $p_{\rm{U}}$ is not accessible in most cases. Instead, training is often performed on the training dataset $D_{\rm{T}}$ , which optimizes the following empirical risk:

$\begin{array}{l} \hat L_{\rm{T}}(f)=\dfrac{1}{|D_{\rm{T}}|}\displaystyle \sum\limits_{k=1}^{|D_{\rm{T}}|}\delta (f(u_k,i_k),r_{k}). \\ \end{array}$

(2)

According to the PAC learning theory^[39], if and only if $\hat L_{\rm{T}}(f)$ is an unbiased estimator, i.e., $\mathbb E_{p_{\rm{T}}}[\hat L_{\rm{T}}(f)]=L(f)$ , the learned model will be approximately optimal as the training data size increases. However, since the collected behavior data often contain bias, the training data distribution $p_{\rm{T}}$ is generally inconsistent with the unbiased test distribution $p_{\rm{U}}$ . In other words, the unbiasedness does not hold in many cases. Therefore, directly training a recommendation model on $D_{\rm{T}}$ with Eq. (2) without considering the inherent biases would easily lead to inferior results.

3.2 AutoDebias brief

To eliminate the distribution discrepancy, Ref. [10] puts forward AutoDebias, a general debiasing framework that aims at addressing various biases. The main idea of AutoDebias is to transform the aforementioned empirical risk function into:

$\begin{array}{l} {{\hat L}_{\rm{T}}}(f|\phi)=\dfrac{1}{|D_{\rm{T}}|}\displaystyle \sum\limits_{k=1}^{|D_{\rm{T}}|} {w_k^{(1)}\delta (f({u_k},{i_k}),{r_k})}+{\displaystyle \sum\limits_{u \in {\mathcal {U}},i \in {\cal{I}}}} {w_{ui}^{(2)}\delta (f(u,i),{m_{ui}})}, \end{array}$

(3)

where AutoDebias imputes the pseudolabels $m_{ui}$ for each user-item pair, and gives diverse confidence weights ${w_k^{(1)}}$ for each training instance. When the set of hyperparameters $\phi \equiv \{{w^{(1)}},{w^{(2)}},m\}$ is properly specified, $\hat L_{\rm{T}}(f|\phi)$ can be an unbiased estimator, even if the training dataset $D_{\rm{T}}$ contains any type of bias.
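As a rough illustration of Eq. (3), the following sketch evaluates the two terms on dense NumPy arrays. The function names, the array layout, and the choice of squared error for $\delta$ are assumptions made for this example only; they are not taken from the AutoDebias implementation.

```python
import numpy as np

def squared_error(pred, target):
    """Illustrative choice for the error function delta (an assumption)."""
    return (pred - target) ** 2

def autodebias_risk(pred, users, items, ratings, w1, w2, pseudo):
    """Sketch of Eq. (3): weighted observed term plus imputed term.

    pred     : (n, m) array of predictions f(u, i) for every user-item pair
    users, items, ratings : 1-D arrays describing the triplets in D_T
    w1       : confidence weights for the observed triplets, shape (|D_T|,)
    w2       : confidence weights for every pair, shape (n, m)
    pseudo   : pseudolabels m_ui for every pair, shape (n, m)
    """
    observed = np.mean(w1 * squared_error(pred[users, items], ratings))
    # The imputed term touches all n * m pairs, which is the costly part.
    imputed = np.sum(w2 * squared_error(pred, pseudo))
    return observed + imputed

# Tiny hypothetical example with 2 users and 3 items.
pred = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
users = np.array([0, 1])
items = np.array([0, 2])
ratings = np.array([1.0, 1.0])
w1 = np.array([1.0, 2.0])
w2 = np.full((2, 3), 0.1)
pseudo = np.zeros((2, 3))
risk = autodebias_risk(pred, users, items, ratings, w1, w2, pseudo)
```

The second term sums over every user-item pair, so its cost grows with the product of the number of users and items even when $D_{\rm{T}}$ is sparse.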

AutoDebias also devises a metalearning-based strategy to learn the appropriate $\phi$ . Specifically, it first reparameterizes $\phi$ with a concise linear meta model:

$\begin{array}{l} w_k^{(1)} = \exp (\varphi _1^{\rm{T}}[{{\boldsymbol{ e}}_{{u_k}}} \circ {{\boldsymbol{ e}}_{{i_k}}} \circ {{\boldsymbol{ e}}_{r_k}}]),\\ w_{ui}^{(2)} = \exp (\varphi _2^{\rm{T}}[{{\boldsymbol{ e}}_u} \circ {{\boldsymbol{ e}}_i} \circ {{\boldsymbol{ e}}_{{\rm{O}}_{ui}}}]),\\ {m_{ui}} = \sigma (\varphi _3^{\rm{T}}[{{\boldsymbol{ e}}_{r_{ui}}} \circ {{\boldsymbol{ e}}_{{\rm{O}}_{ui}}}]), \end{array}$

(4)

where $\boldsymbol{e}_u$ , $\boldsymbol{e}_i$ , $\boldsymbol{e}_r$ , and $\boldsymbol{e}_{{\rm{O}}_{ui}}$ denote the one-hot vectors of the user $u$ , item $i$ , feedback $r$ and observation ${\rm{O}}_{ui}$ , respectively. This process can reduce the number of parameters and encode useful debiasing information. AutoDebias then leverages a metalearning strategy to optimize $\phi$ by utilizing a small set of uniform data. Specifically, AutoDebias updates $\phi$ and $f_\theta$ in an alternating fashion: (ⅰ) making an assumed update of $\theta$ with $\theta'(\phi) = \theta - \eta_1 {\nabla _\theta }{{\hat L}_{\rm{T}}}({f_\theta }|\phi )$ ; (ⅱ) validating the performance of $\theta'$ on the uniform data with ${{\hat L}_{\rm{U}}}({f_{\theta'}})$ , where ${{\hat L}_{\rm{U}}}({f_\theta })= \dfrac{1}{{|{D_{\rm{U}}}|}}\displaystyle \sum\limits_{l = 1}^{|{D_{\rm{U}}}|} {\delta ({f_\theta }({u_l},{i_l}),{r_l})}$ , which in turn gives a feedback signal (gradient) to update the meta model parameters: ${\varphi} \leftarrow {\varphi - \eta_2 {\nabla _\varphi } {{\hat L}_{\rm{U}}}({f_{\theta'(\phi)} }) }$ ; (ⅲ) given the updated $\phi$ , actually updating $\theta$ : $\theta \leftarrow \theta - \eta_1 {\nabla _\theta }{{\hat L}_{\rm{T}}}({f_\theta }|\phi )$ .

3.3 Complexity analysis of AutoDebias

Now, we conduct time complexity analyses of AutoDebias. The time complexity mainly comes from the following two parts:

Calculation on observed training instances. As depicted by the first part of the Eq. (3), the calculation only involves the instances in $D_{\rm{T}}$ . As such, the complexity of this part is linear with the number of the observed training instances, i.e., $O(|D_{\rm{T}}|)$ .

Calculation on imputed training instances. As depicted by the second part of Eq. (3), this part involves the calculation over all user-item pairs, so its complexity is $O(nm)$ . Because the number of users and items can easily scale to millions or more in practice, AutoDebias would be computationally infeasible.

Considering the sparse nature of real-world recommendation data, where $|D_{\rm{T}}| \ll nm$ , the second part is the bottleneck of AutoDebias. A naive solution for acceleration employs stochastic gradient descent and a uniform sampler, where the gradient can be quickly calculated on the sampled portion of the dataset. However, we find that this treatment would significantly deteriorate model convergence and stability, leading to inferior recommendation performance. In fact, not all user-item pairs are equally important in model training. We observe fairly scattered $w^{(2)}$ in practice. The uniform sampler easily samples uninformative training instances with small $w^{(2)}$ , which contribute little to the training process. As such, we need a new sampler that can adaptively draw informative training instances.

4. Method

In this section, we present LightAD, which equips AutoDebias with a carefully designed adaptive sampling strategy. LightAD adopts a nonuniform sampling distribution and an efficient factorization-based sampling strategy. We detail LightAD in the following parts.

4.1 Fast informative sampler

As mentioned before, the second part of Eq. (3) has become the bottleneck of AutoDebias. Here we leverage a sampler to accelerate the training, where the gradient can be estimated from a much smaller sampled training dataset. Formally, the learning with the sampler can be formulated as follows:

$\begin{array}{l} S \sim \text{Multinomial}(N,p_{{\rm{s}}}(u,i)), \end{array}$

(5)

which draws a small set of users and items, i.e., $S$ , with the sample size $N$ and the distribution $p_{{\rm{s}}}(u,i)$ . In this way, a recommendation model can be trained efficiently with sampled instances, which can be expressed as:

$\begin{array}{l} {{\hat L}_{\rm{T}}}(f|\phi) = \dfrac{1}{|D_{\rm{T}}|}\displaystyle \sum\limits_{k=1}^{|D_{\rm{T}}|} {w_k^{(1)}\delta (f({u_k},{i_k}),{r_k})} + \dfrac{1}{|S|}{\displaystyle \sum\limits_{u,i \in \cal{S}}} {\delta (f(u,i),{m_{ui}})}, \end{array}$

(6)

where the confidence weights $w_{ui}^{(2)}$ are offset by the sampling distribution. Now, the question lies in which sampling distribution to use and how to implement fast sampling.

Importance-aware sampling distribution. Note that different data may have different confidence weights $w^{(2)}$ . This suggests that the data sampling distribution is highly important since it determines which data are used to update parameters and how often. Intuitively, the informative data that have a larger $w^{(2)}$ should be sampled with a larger sampling probability, because they bring larger gradients and make a greater contribution to training, which could speed up model convergence. In fact, our theoretical analyses presented in Section 5.2 show that sampling with the distribution $p_{\rm{s}}(u,i) \propto w^{(2)}_{ui}$ can reduce the sampling variance.

Fast factorization-based sampling. Now the question lies in how to perform fast sampling from the distribution $p_{\rm{s}}(u,i) \propto w^{(2)}_{ui}$ . Direct sampling from $p_{\rm{s}}(u,i)$ would be highly time-consuming, as it involves a large instance space. Moreover, the sampling distribution would evolve as the training proceeds. To address this problem, we propose a subtle fast factorization-based sampling algorithm. The merit of the algorithm lies in the decomposition nature of the hyperparameters $w_{ui}^{(2)}$ . In fact, we have:

$\begin{array}{l} w_{ui}^{(2)} = \exp (\varphi _2^{\rm{T}}[{{\boldsymbol{ e}}_u} \circ {{\boldsymbol{ e}}_i} ])=\exp (\varphi _2^{\rm{T}}[{{\boldsymbol{ e}}_u} ])*\exp (\varphi _2^{\rm{T}}[{{\boldsymbol{ e}}_i} ])\equiv w_{u}^{(2)}*w_{i}^{(2)},\\ \end{array}$

(7)

where $\boldsymbol{e}_u$ and $\boldsymbol{e}_i$ denote the one-hot vectors of user ID $u$ and item ID $i$ . The $w_{ui}^{(2)}$ can be decomposed as the product of the item-based weight $w_{i}^{(2)}$ and user-based weight $w_{u}^{(2)}$ . As such, we could separate the sampling of users and items as follows:

$\begin{array}{l} S_{{{u}}} \sim \text{Multinomial}(N,p_{{u}}),\\ S_{{{i}}} \sim \text{Multinomial}(N,p_{{i}}), \end{array}$

(8)

where $p_{{u}} \propto w_{{{u}}}^{(2)}$ and $p_{{i}} \propto w_{{{i}}}^{(2)}$ . In this way, we can find that the sampling distribution for a user-item pair is proportional to $w_{ui}^{(2)}$ , without requiring direct sampling from the large user-item space. As such, the sampling complexity could be reduced from $O(nm)$ to $O(n+m)$ .
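A minimal sketch of the factorized sampler in Eq. (8), assuming the per-user and per-item weights are already available as NumPy arrays. The helper name `factorized_sample` and the randomly generated weights are hypothetical. Drawing the user and the item of each pair independently yields pairs distributed proportionally to $w_{u}^{(2)} w_{i}^{(2)}$ , which the empirical check at the end illustrates.

```python
import numpy as np

def factorized_sample(w_user, w_item, size, rng):
    """Draw `size` (user, item) pairs with probability proportional to
    w_user[u] * w_item[i], using two independent 1-D draws."""
    p_u = w_user / w_user.sum()   # O(n) normalization
    p_i = w_item / w_item.sum()   # O(m) normalization
    users = rng.choice(len(w_user), size=size, p=p_u)
    items = rng.choice(len(w_item), size=size, p=p_i)
    return users, items

rng = np.random.default_rng(0)
w_user = np.exp(rng.normal(size=4))   # hypothetical user-based weights
w_item = np.exp(rng.normal(size=5))   # hypothetical item-based weights
users, items = factorized_sample(w_user, w_item, 200_000, rng)

# Compare empirical pair frequencies with the target joint distribution.
counts = np.zeros((4, 5))
np.add.at(counts, (users, items), 1)
empirical = counts / counts.sum()
target = np.outer(w_user, w_item)
target /= target.sum()
max_abs_error = np.abs(empirical - target).max()
```

Only the two marginal distributions are ever normalized, so the $n \times m$ joint table built here is needed for the check alone, not for sampling.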

4.2 Self-paced sampling strategy

Note that in the early training stage, the model may not yield a reliable estimate of the confidence weights. Directly learning a model from an informative sampler may lead to suboptimal results. To address this problem, we propose a self-paced strategy that adopts a uniform distribution at the beginning and gradually shifts toward the informative distribution as the training proceeds. Formally, we adopt the following sampling strategy:

$\begin{array}{l} S_1 \sim \text{Multinomial}(\alpha N,p_{s}(u,i)), \\ S_2 \sim \text{Multinomial}((1-\alpha)N,U(u,i)), \end{array}$

(9)

where $U(u,i)$ denotes the uniform distribution, and $\alpha$ controls the ratio of the instances sampled from the informative sampler to those from the uniform sampler, which gradually increases as the training proceeds. In this work, we simply set $\alpha = \text{min}(e/\beta,1)$ , where $e$ denotes the current epoch and $\beta$ controls the increase rate of $\alpha$ , leaving other choices as future work. After obtaining $S_1$ and $S_2$ , the model can be updated with the following objective function:

$\begin{split} {{\hat L}_{\rm{P}}}(f|\phi)=& \frac{1}{|D_{\rm{T}}|}\sum\limits_{k=1}^{|D_{\rm{T}}|} {w_k^{(1)}\delta (f({u_k},{i_k}),{r_k})} + \frac{\alpha \displaystyle \sum\limits_{u,i}{w^{(2)}_{ui}}}{|S_1|}\cdot\\&{\sum\limits_{u,i \in {\cal{S}}_1}} {\delta (f(u,i),{m_{ui}})}+\frac{(1-\alpha)nm}{|S_2|}{\sum\limits_{u,i \in {\cal{S}}_2}} {w_{ui}^{(2)}\delta (f(u,i),{m_{ui}})}. \end{split}$

(10)
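The schedule $\alpha = \text{min}(e/\beta,1)$ and the split of the sampling budget in Eq. (9) can be sketched as follows. The function names are hypothetical, and rounding $\alpha N$ to an integer is an assumption of this example that the text does not specify.

```python
import numpy as np

def alpha_schedule(epoch, beta):
    """Self-paced ratio: 0 at epoch 0, reaching 1 once epoch >= beta."""
    return min(epoch / beta, 1.0)

def self_paced_sample(w_user, w_item, n_samples, epoch, beta, rng):
    """Split the budget between the informative and the uniform sampler (Eq. (9))."""
    alpha = alpha_schedule(epoch, beta)
    n_informative = int(round(alpha * n_samples))  # rounding is an assumption
    n_uniform = n_samples - n_informative
    n_users, n_items = len(w_user), len(w_item)
    # S_1: informative pairs via the factorized sampler of Eq. (8).
    s1 = (rng.choice(n_users, size=n_informative, p=w_user / w_user.sum()),
          rng.choice(n_items, size=n_informative, p=w_item / w_item.sum()))
    # S_2: pairs drawn uniformly over the user-item space.
    s2 = (rng.integers(n_users, size=n_uniform),
          rng.integers(n_items, size=n_uniform))
    return alpha, s1, s2

rng = np.random.default_rng(0)
w_user = np.array([0.2, 1.0, 3.0])
w_item = np.array([0.5, 0.5, 2.0, 4.0])
alpha, s1, s2 = self_paced_sample(w_user, w_item, n_samples=100, epoch=5, beta=10, rng=rng)
```

With `beta=10`, half of the budget goes to the informative sampler at epoch 5, and all of it from epoch 10 onward.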

5. Theoretical analyses

The former section elaborates how LightAD accelerates AutoDebias. In this section, we conduct theoretical analyses to answer the following questions: Does the unbiasedness still hold with our proposed sampling strategy? Does the proposed informative sampler reduce the sampling variance?

5.1 Unbiasedness of LightAD

It has been proven that the empirical risk function of AutoDebias can be an unbiased estimator of the true risk. Now we connect the objective of LightAD (Eq. (10)) with AutoDebias, which reveals the unbiasedness of LightAD. In fact, we have:

$\begin{split} E_{S_1,S_2}[{{\hat L}_{\rm{P}}}(f|\phi)] =& \dfrac{1}{|D_{\rm{T}}|}\displaystyle \sum\limits_{k=1}^{|D_{\rm{T}}|} {w_k^{(1)}\delta (f({u_k},{i_k}),{r_k})} + E_{S_1}\left[\dfrac{\alpha \displaystyle \sum\limits_{u,i}{w^{(2)}_{ui}}}{|S_1|}{\displaystyle \sum\limits_{u,i \in {{\cal{S}}_1}}} {\delta (f(u,i),{m_{ui}})}\right] +\\& E_{S_2}\left[\dfrac{(1-\alpha)nm}{|S_2|}{\displaystyle \sum\limits_{u,i \in {{\cal {S}}_2}}} {w_{ui}^{(2)}\delta (f(u,i),{m_{ui}})}\right] \\ =& \dfrac{1}{|D_{\rm{T}}|}\displaystyle \sum\limits_{k=1}^{|D_{\rm{T}}|} {w_k^{(1)}\delta (f({u_k},{i_k}),{r_k})} + \dfrac{\alpha \displaystyle \sum\limits_{u,i}{w^{(2)}_{ui}}|S_1|}{|S_1|}E_{u,i \sim p_{\rm{s}}} [{\delta (f(u,i),{m_{ui}})}] +\\& \dfrac{(1-\alpha)nm|S_2|}{|S_2|}E_{u,i \sim U} [{w_{ui}^{(2)}\delta (f(u,i),{m_{ui}})}] \\ =& \dfrac{1}{|D_{\rm{T}}|}\displaystyle \sum\limits_{k=1}^{|D_{\rm{T}}|} {w_k^{(1)}\delta (f({u_k},{i_k}),{r_k})} + \alpha \displaystyle \sum\limits_{u,i}{\dfrac{w^{(2)}_{ui}}{\displaystyle \sum\limits_{u',i'}{w^{(2)}_{u'i'}}}\left(\displaystyle \sum\limits_{u',i'}{w^{(2)}_{u'i'}}\right) {\delta (f(u,i),{m_{ui}})}}+\\& (1-\alpha)nm \displaystyle \sum\limits_{u,i} {\dfrac{1}{nm}w_{ui}^{(2)}\delta (f(u,i),{m_{ui}})} \\ =& \dfrac{1}{|D_{\rm{T}}|}\displaystyle \sum\limits_{k=1}^{|D_{\rm{T}}|} {w_k^{(1)}\delta (f({u_k},{i_k}),{r_k})} + \displaystyle \sum\limits_{u,i} {w_{ui}^{(2)}\delta (f(u,i),{m_{ui}})} ={{\hat L}_{\rm{T}}}(f|\phi). \end{split}$

(11)

From the above deduction, we can conclude that the objective of LightAD is an unbiased estimator of the AutoDebias objective in Eq. (3) and, in turn, of the ideal true risk.
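The unbiasedness of the two sampled terms in Eq. (10) can also be checked numerically. The sketch below is a Monte Carlo sanity check under assumed inputs: the weights and error values are randomly generated, the function name is hypothetical, and the sample sizes are chosen for accuracy rather than tied to $\alpha$ as in Eq. (9), which does not affect the expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 30
# Hypothetical factorized weights w_ui and error values delta(f(u,i), m_ui).
w = np.outer(np.exp(rng.normal(0, 0.5, n)), np.exp(rng.normal(0, 0.5, m)))
delta = rng.uniform(0.0, 1.0, size=(n, m))
target = np.sum(w * delta)  # imputed term of Eq. (3)

def lightad_imputed_estimate(w, delta, alpha, n1, n2, rng):
    """Sampled version of the last two terms of Eq. (10)."""
    w_flat, d_flat = w.ravel(), delta.ravel()
    total_w = w_flat.sum()
    s1 = rng.choice(w_flat.size, size=n1, p=w_flat / total_w)  # informative
    s2 = rng.integers(w_flat.size, size=n2)                    # uniform
    informative = alpha * total_w * d_flat[s1].mean()
    uniform = (1 - alpha) * w_flat.size * (w_flat[s2] * d_flat[s2]).mean()
    return informative + uniform

estimate = lightad_imputed_estimate(w, delta, alpha=0.3,
                                    n1=200_000, n2=200_000, rng=rng)
rel_error = abs(estimate - target) / target
```

With a large number of samples the estimate approaches the full sum over all user-item pairs for any $\alpha \in [0,1]$ , as the derivation predicts.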

5.2 Variance reduction

How does the conventional uniform sampler perform? Does the proposed informative sampler reduce the variance? In this section, we aim to provide a theoretical answer to these problems.

Let us first formulate the variance of the estimated objective with the sampler. In fact, we have the following upper-bound of the variance:

$\begin{array}{l} {\text{Var}}{_S}[{{\hat L}_S}] = \dfrac{1}{{|S|}}\text{Var}{_{p_{\rm{s}}}}\left[ {\dfrac{{w_{ui}^{(2)}}}{{{p_{\rm{s}}}(u,i)}}\delta \left({f(u,i),{m_{ui}}} \right)} \right]\leqslant\\ \dfrac{1}{{|S|}}\left( {\displaystyle \sum\limits_{u,i} {\dfrac{{{\rm{Va}}{{\rm{r}}_{ui}}[w_{ui}^{(2)}] + {E_{u,i}}{{[w_{ui}^{(2)}]}^2}}}{{{p_{\rm{s}}}(u,i)}}} } \right) \times {E_{{p_{\rm{s}}}}}{{[\delta \left({f(u,i),{m_{ui}}} \right)]}^2}-\\ \dfrac{1}{{|S|}}{E_{{p_{\rm{s}}}}}{[w_{ui}^{(2)}\delta \left({f(u,i),{m_{ui}}} \right)]^2}. \end{array}$

(12)

For a uniform sampler, the variance of the objective is subject to the variance of weights $w_{ui}^{(2)}$ , which is unsatisfactory. As the training proceeds, we observe a diverse distribution of $w_{ui}^{(2)}$ with large variance. The gradient from ${\hat L}_S$ with a uniform sampler fluctuates heavily, making the training process highly unstable.

To address this problem, we should choose a better sampling distribution to reduce the upper bound of the estimation variance. In fact, we have:

$\begin{split} &\displaystyle \sum\limits_{ui} {\dfrac{{{{(w_{ui}^{(2)})}^2}}}{{{p_{\rm{s}}}(u,i)}}} = \displaystyle \sum\limits_{ui} {\dfrac{{{{(w_{ui}^{(2)})}^2}}}{{{p_{\rm{s}}}(u,i)}}} \displaystyle \sum\limits_{ui} {{p_{\rm{s}}}(u,i)} \geqslant\\& {\left( {\displaystyle\sum\limits_{ui} {\sqrt {\dfrac{{{{(w_{ui}^{(2)})}^2}}}{{{p_{\rm{s}}}(u,i)}}} \sqrt {{p_{\rm{s}}}(u,i)} } } \right)^2} = {E_{ui}}{[w_{ui}^{(2)}]^2}, \end{split}$

(13)

where we employ the Cauchy inequality, and equality holds if and only if $p_{\rm{s}}(u,i)\propto w_{ui}^{(2)}$ . This means that for the informative sampler, whose sampling distribution is proportional to the confidence weights, the estimation variance achieves the lowest upper bound. In this case, the bound depends only on the mean of the confidence weights and is independent of their variance. The informative sampler therefore brings better stability and convergence.
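To make the comparison concrete, the following sketch computes the exact per-sample variance $\text{Var}_{p}[w_{ui}^{(2)}\delta/p(u,i)]$ (the quantity on the left of Eq. (12) before dividing by $|S|$ , not the upper bound) for a uniform sampler and for $p_{\rm{s}} \propto w^{(2)}$ . The weights and error values are small hypothetical numbers chosen for illustration.

```python
import numpy as np

def per_sample_variance(w, delta, p):
    """Exact variance of (w / p) * delta when a pair is drawn with probability p."""
    second_moment = np.sum((w * delta) ** 2 / p)
    mean = np.sum(w * delta)
    return second_moment - mean ** 2

# Hypothetical factorized weights with a wide spread, plus error values.
w = np.outer(np.array([0.1, 1.0, 10.0]), np.array([0.5, 2.0]))
delta = np.array([[0.9, 0.8], [0.7, 0.6], [0.5, 0.4]])

p_uniform = np.full(w.shape, 1.0 / w.size)
p_informative = w / w.sum()

var_uniform = per_sample_variance(w, delta, p_uniform)
var_informative = per_sample_variance(w, delta, p_informative)
```

In this toy setting the informative sampler gives a much smaller variance than the uniform one, because the widely spread weights no longer multiply the sampled error values.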

6. Experiments

In this section, we first describe the experimental settings, and then present the detailed experiments with the purpose of answering the following questions:

RQ1: How does our LightAD perform compared with AutoDebias in recommendation performance?

RQ2: How does the proposed LightAD accelerate AutoDebias?

RQ3: How does the hyperparameter $\beta$ affect the recommendation performance?

RQ4: How does LightAD perform when dealing with the simultaneous presence of various biases?

6.1 Experimental setup

We conduct our experiments with two publicly accessible datasets and one synthetic dataset: Yahoo!R3, Coat, and Simulation. The concrete statistics of the three datasets are summarized in Table 1.

Table 1. Statistics of the datasets.

Dataset	Users	Items	Training	Validation	Test
Coat	290	300	6960	232	4176
Yahoo!R3	15400	1000	311704	2700	48600
Simulation	500	500	12500	3750	67500

Yahoo!R3. This dataset contains two types of ratings collected from two sources. The first source collects ratings of users’ normal interactions with music in Yahoo! services (i.e., users pick and rate items according to their own preferences), which can be considered as a stochastic logging policy. Thus, this type of data is affected by bias. The second source collects ratings of randomly selected songs during an online survey (i.e., a uniform logging policy). These ratings can be regarded as approximately unbiased and as reflecting the real interests of users.

Coat. This dataset, collected by Ref. [7], simulates customers’ shopping behavior for coats in an online service. Users were first given a web-shop interface and asked to select the coat they most wanted to buy. Shortly afterwards, they were asked to rate 24 of the self-selected coats and 16 of the randomly picked ones on a five-point scale. The dataset therefore consists of self-selected ratings and uniformly selected ratings, which can be regarded as biased data and unbiased data, respectively.

Simulation. A great advantage of AutoDebias is that it can handle multiple biases simultaneously, which means good universality in complex scenarios. We need to verify that this universality also holds for our model. Since no public dataset satisfies such a scenario, we follow AutoDebias and adopt a synthetic dataset, Simulation, in which both position bias and selection bias are taken into account. Unlike the previous two datasets, this one records the position of each item when the interaction occurs. Like them, it includes a large amount of biased data and a relatively small amount of unbiased data. The rationale for the data synthesis can be found in Refs. [40, 41].

To be consistent with AutoDebias, we follow its experimental setup for training, validation, and testing. Specifically, we use the biased data ${D}_{\rm{T}}$ to train our base model and the uniform data ${D}_{\rm{U}}$ to assist in debiasing and testing. Given the paucity of uniform data, we split ${D}_{\rm{U}}$ into three parts: 5 $\%$ ( $D_\text{ut}$ ) to assist in training the hyperparameters, 5 $\%$ ( $D_\text{uv}$ ) for validation to tune the hyperparameters, and the remaining 90 $\%$ ( $D_\text{ue}$ ) for evaluating the model. Additionally, we binarize the ratings in the three datasets with a threshold of 3, i.e., we treat rating values larger than 3 as positive feedback and label them with $r=1$ .
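The split and the binarization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the random seed, and the encoding of non-positive feedback as $-1$ are assumptions.

```python
import numpy as np

def binarize(ratings, threshold=3):
    """Ratings above the threshold become +1; the rest become -1 (assumed encoding)."""
    ratings = np.asarray(ratings)
    return np.where(ratings > threshold, 1, -1)

def split_uniform(n, seed=0, frac_train=0.05, frac_valid=0.05):
    """Shuffle indices 0..n-1 and split them into D_ut, D_uv and D_ue."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_ut = int(round(frac_train * n))
    n_uv = int(round(frac_valid * n))
    return idx[:n_ut], idx[n_ut:n_ut + n_uv], idx[n_ut + n_uv:]

# 54000 = 2700 + 2700 + 48600, consistent with the Yahoo!R3 row of Table 1
ut, uv, ue = split_uniform(54000)
labels = binarize([1, 3, 4, 5])
print(len(ut), len(uv), len(ue), labels.tolist())
```

With 54000 uniform ratings, the 5/5/90 split yields 2700 validation and 48600 test instances, which matches the Yahoo!R3 sizes reported in Table 1.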

Evaluation protocols. The evaluation metrics in this work are the negative logarithmic loss (NLL), the area under the ROC curve (AUC), and normalized discounted cumulative gain (NDCG):

NLL. This metric evaluates the performance of the predictions:

$\begin{array}{l} {\rm{NLL}} = - \dfrac{1}{|D_\text{ve}|} \displaystyle \sum\limits_{(u,i,r)\in D_\text{ve}} \log \left(1+{\rm{e}}^{-r*f_\theta(u,i)}\right), \end{array}$

(14)

where $D_\text{ve}$ denotes the validation set $D_\text{uv}$ or the test set $D_\text{ue}$ .
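Eq. (14) can be evaluated as in the sketch below. It assumes labels $r\in\{-1,+1\}$ and real-valued scores $f_\theta(u,i)$; the function name `nll` is hypothetical.

```python
import numpy as np

def nll(scores, labels):
    """Eq. (14): negative mean of log(1 + exp(-r * f)), with r in {-1, +1} (assumed)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # logaddexp(0, x) computes log(1 + exp(x)) without overflow
    return float(-np.mean(np.logaddexp(0.0, -labels * scores)))

# An uninformative model (all scores zero) gives -log(2)
print(nll([0.0, 0.0], [1, -1]))
```

With this sign convention the metric is never positive, and values closer to zero indicate better predictions.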

AUC. This metric evaluates the performance of rankings:

$\begin{array}{l} {\rm{AUC}} = \dfrac{\displaystyle \sum\limits_{(u,i)\in D_\text{ve}^+}\widehat {\text{Rank}}_{u,i} - (|D_\text{ve}^+|+1)(|D_\text{ve}^+|)/2}{(|D_\text{ve}^+|) * (|D_\text{ve}| - |D_\text{ve}^+|)}, \end{array}$

(15)

where $|D_\text{ve}^+|$ denotes the number of positive data in $D_\text{ve}$ , and $\widehat {\text{Rank}}_{u,i}$ denotes the rank position of a positive feedback $(u,i)$ .
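A minimal sketch of Eq. (15) follows. It assumes that ranks are assigned in ascending order of the predicted score (rank 1 is the lowest score), which is the convention under which the rank-sum formula equals the usual AUC; ties are broken by position rather than averaged. The function name `auc` is hypothetical.

```python
import numpy as np

def auc(scores, labels):
    """Rank-sum AUC of Eq. (15); assumes ascending ranks and no tied scores."""
    scores = np.asarray(scores, dtype=float)
    positive = np.asarray(labels) > 0
    n_pos = int(positive.sum())
    n_neg = scores.size - n_pos
    # rank 1 goes to the lowest score, rank n to the highest
    order = np.argsort(scores, kind="stable")
    ranks = np.empty(scores.size, dtype=float)
    ranks[order] = np.arange(1, scores.size + 1)
    rank_sum = ranks[positive].sum()
    return float((rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

print(auc([0.1, 0.4, 0.35, 0.8], [-1, -1, 1, 1]))
```

In this example three of the four positive-negative pairs are ordered correctly, so the result is 0.75.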

NDCG@k. This metric evaluates the quality of recommendation through discounted importance based on ranking position:

$\begin{array}{l} \text{DCG}{_u}@k = \displaystyle \sum\limits_{(u,i) \in {D_\text{ve}}} {\dfrac{{\mathbf I({{\widehat {\text{Rank}}}_{u,i}} \leqslant k)}}{{\log({{\widehat {\text{Rank}}}_{u,i}} + 1)}}}, \\ \text{NDCG}@k = \dfrac{1}{|\cal{{U}}|} \displaystyle \sum\limits_{u\in \cal{{U}}} \dfrac{\text{DCG}_u@k}{\text{IDCG}_u@k}, \end{array}$

(16)

where $\text{IDCG}_u@k$ is a normalizer that ensures NDCG is between 0 and 1.
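Eq. (16) can be sketched for binary relevance as below. The code assumes that the sum runs over the positive items of each user, that rank 1 is the highest-scored item, and that users without positive items are skipped; these choices and the function names are assumptions. The logarithm base cancels in the ratio DCG/IDCG, so base 2 is used without loss of generality.

```python
import numpy as np

def ndcg_at_k(scores, labels, k=5):
    """NDCG@k for one user with binary relevance; None if the user has no positives."""
    scores = np.asarray(scores, dtype=float)
    positive = np.asarray(labels) > 0
    n_pos = int(positive.sum())
    if n_pos == 0:
        return None
    order = np.argsort(-scores, kind="stable")   # rank 1 = highest score
    top_hits = positive[order][:k]
    discounts = 1.0 / np.log2(np.arange(2, top_hits.size + 2))  # 1 / log(rank + 1)
    dcg = float(np.sum(discounts[top_hits]))
    ideal_hits = min(k, n_pos)                   # all positives ranked first
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg / idcg

def mean_ndcg_at_k(per_user, k=5):
    """Average NDCG@k over users given as (scores, labels) pairs."""
    values = [ndcg_at_k(s, l, k) for s, l in per_user]
    values = [v for v in values if v is not None]
    return float(np.mean(values))

users = [([0.9, 0.8, 0.1], [1, -1, -1]), ([0.9, 0.8, 0.1], [-1, 1, -1])]
print(mean_ndcg_at_k(users, k=5))
```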

We implement our experiments with PyTorch, and the code is available at https://github.com/Chris-Mraz/LightAD. We perform a grid search to tune the hyperparameters of all candidate methods on the validation set.

6.2 Performance comparison (RQ1)

Baselines. To demonstrate the effectiveness of our proposed sampling method, we use matrix factorization (MF)^[42] as the backbone and the following models as our baselines:

Matrix factorization (MF)^[42]. MF is a classical model in RS. In this model, the preference of user $u$ for item $i$ is formulated as $R_{u,i} = U_u^{\rm{T}}V_i + b_u + b_i + \mu$ , where $b_u$ and $b_i$ are the user and item bias terms and $\mu$ is the global offset. In our experimental setup, we train the MF model on the biased, uniform, and combined datasets, i.e., ${D}_{\rm{T}}$ , ${D}_{\rm{U}}$ , and $D_{\rm{T}} + D_{\rm{U}}$ , respectively.
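As an illustration of the MF scoring rule (an inner product of user and item embeddings plus a user bias, an item bias, and a global offset), the following NumPy sketch computes predictions for a batch of user-item pairs. The class name, embedding dimension, and initialization are hypothetical and do not reproduce the released PyTorch implementation.

```python
import numpy as np

class BiasedMF:
    """Minimal matrix factorization scorer with bias terms (illustrative only)."""

    def __init__(self, n_users, n_items, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.U = 0.1 * rng.standard_normal((n_users, dim))  # user embeddings
        self.V = 0.1 * rng.standard_normal((n_items, dim))  # item embeddings
        self.user_bias = np.zeros(n_users)
        self.item_bias = np.zeros(n_items)
        self.mu = 0.0                                       # global offset

    def predict(self, users, items):
        """Score each (user, item) pair in the batch."""
        users = np.asarray(users)
        items = np.asarray(items)
        dot = np.sum(self.U[users] * self.V[items], axis=1)
        return dot + self.user_bias[users] + self.item_bias[items] + self.mu

model = BiasedMF(n_users=3, n_items=4)
pred = model.predict([0, 1], [2, 3])
print(pred.shape)
```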

Inverse propensity score (IPS)^[6]. IPS is a counterfactual-based recommendation method that estimates the propensity score via naive Bayes.

Doubly robust (DR)^[43]. DR combines the ideas of IPS and data imputation. The debiasing capacity of this model is ensured if either the imputed data or the propensity score is accurately estimated.

CausE and KD-label. CausE and KD-label are excellent methods that transfer the unbiased information with a teacher model using knowledge distillation techniques. Both of them are the best performing models in Ref. [5].

AutoDebias. AutoDebias is a generic and effective solution for recommendation debiasing.

We also test three versions of our LightAD for ablation studies: (ⅰ) LightAD-uniform, which improves AutoDebias with a uniform sampler; (ⅱ) LightAD-fixed, where we remove the self-paced strategy and fix $\alpha$ as a constant (i.e., $\alpha=0.5$ ); (ⅲ) LightAD-selfpaced, the complete model proposed in Section 4.2.

Table 2 shows the performance of our LightAD and the baselines with respect to three evaluation metrics on two datasets. We make the following two observations:

Table 2. Comparison results on explicit feedback data. The boldface font denotes the winner in that column.

Model	Yahoo!R3 NLL	Yahoo!R3 AUC	Yahoo!R3 NDCG@5	Coat NLL	Coat AUC	Coat NDCG@5
MF (biased)	−0.587	0.727	0.547	−0.546	0.757	0.484
MF (uniform)	−0.503	0.568	0.435	−0.641	0.540	0.362
MF (combined)	−0.579	0.729	0.551	−0.538	0.750	0.492
IPS	−0.450	0.725	0.556	−0.531	0.755	0.484
DR	−0.444	0.723	0.552	−0.513	0.760	0.501
CausE	−0.585	0.729	0.553	−0.529	0.762	0.500
KD-label	−0.587	0.727	0.547	−0.546	0.757	0.484
AutoDebias	−0.419	0.748	0.648	−0.529	0.766	0.501
LightAD-uniform	−0.410	0.734	0.586	−0.505	0.710	0.473
LightAD-fixed	−0.406	0.737	0.632	−0.499	0.742	0.489
LightAD-selfpaced	−0.425	0.748	0.641	−0.510	0.765	0.517


(Ⅰ) Overall, LightAD-selfpaced outperforms the other baselines by a large margin. Compared with AutoDebias, our model achieves similar performance on both datasets and even exceeds it on some metrics. This can be attributed to the fact that AutoDebias pays equal attention to all samples during training, which hinders the capture of truly useful information. It is worth noting that the sampling strategy is learned on only a very small number of data instances, i.e., the same order of magnitude as the biased instances, which is far fewer than $O(nm)$ .

(Ⅱ) Among the sampling strategies, LightAD with simple uniform sampling obtains the worst performance and convergence. This is consistent with our intuition that uniform sampling suffers from high variance. The fact that LightAD-fixed outperforms LightAD-uniform indicates that the informative sampler can partly address this problem. LightAD-selfpaced achieves the best performance, which validates the effectiveness of the proposed self-paced strategy.

6.3 Efficiency comparison (RQ2)

As mentioned in Section 3.3, the calculation on imputed training instances consumes enormous computing resources and restricts the practicality of AutoDebias. In this part, we test this claim through ablation experiments. Table 3 shows the time cost of the ablated methods with different components removed. From the table, we can conclude that the components $\{{w^{(1)}},{w^{(2)}},m\}$ affect the efficiency of the model to different degrees: $w^{(1)}$ has much less impact on efficiency than the other two. This is consistent with our previous analysis and demonstrates the necessity of our sampling.

Table 3. Ablation study of time cost per training epoch on Yahoo!R3.

Model	$w^{(1)}$	m	$w^{(2)}$	Epoch (s)
MF	$\times$	$\times$	$\times$	0.02
AutoDebias-w⁽¹⁾	√	$\times$	$\times$	0.07
AutoDebias-w⁽¹⁾m	√	√	$\times$	0.92
AutoDebias	√	√	√	1.3


To verify the efficiency improvement of our methods, we conduct experiments along two dimensions: the time cost per epoch and the overall time cost. To make the comparison as reliable as possible, we run all methods under the same conditions, i.e., the same hyperparameter configuration and computing resources. The training times of the four models are shown in Table 4. From this table, we find that employing samplers indeed accelerates the training of AutoDebias. More interestingly, employing a uniform sampler achieves fast calculation in each epoch while requiring more time cost in total. We ascribe this to the fact that the informative sampler improves convergence and reduces the number of required epochs. To verify this, we plot the training process of LightAD-fixed and LightAD-uniform. As shown in Fig. 1, LightAD-fixed takes fewer than 20 epochs to converge, while LightAD-uniform takes more than 40 epochs to reach a comparable level. The experimental results are consistent with our previous theoretical analysis.

Table 4. Training time on Yahoo!R3.

Model	Epoch (s)	Total time (s)
AutoDebias	1.30	1260.50
LightAD-uniform	0.14	74.10
LightAD-fixed	0.34	107.36
LightAD-selfpaced	0.20	29.05
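The speedups implied by Table 4 can be computed directly from the reported times. The short sketch below only divides the AutoDebias times by those of each LightAD variant; the variable names are hypothetical.

```python
# (time per epoch in s, total training time in s), copied from Table 4
times = {
    "AutoDebias": (1.30, 1260.50),
    "LightAD-uniform": (0.14, 74.10),
    "LightAD-fixed": (0.34, 107.36),
    "LightAD-selfpaced": (0.20, 29.05),
}

base_epoch, base_total = times["AutoDebias"]
speedups = {
    name: (base_epoch / epoch, base_total / total)
    for name, (epoch, total) in times.items()
    if name != "AutoDebias"
}

for name, (per_epoch, overall) in speedups.items():
    print(f"{name}: {per_epoch:.1f}x per epoch, {overall:.1f}x in total")
```

For LightAD-selfpaced, for instance, this gives a speedup of 6.5 times per epoch and about 43 times in total training time.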


Figure 1. Training process of LightAD-fixed and LightAD-uniform.


6.4 Hyperparameter analysis (RQ3)

Note that in LightAD-selfpaced, we gradually increase the contribution of the informative sampler, and the hyperparameter $\beta$ controls the rate of increase. In this subsection, we adjust the value of $\beta$ and explore how it affects recommendation performance. The results are presented in Fig. 2. As $\beta$ becomes larger, with few exceptions, the performance first increases and then drops once $\beta$ surpasses a threshold. The reason is that a too small (or too large) $\beta$ makes the model focus on the informative sampler too early (or too late). Setting a proper $\beta$ lets the model achieve the best performance.

Figure 2. The influence of hyperparameter β on self-paced sampling strategy.


6.5 Universality verification (RQ4)

To demonstrate the universality of our LightAD, we conduct experiments on Simulation. We compare our three sampling strategies with two SOTA methods in mitigating both position bias and selection bias:

Dual learning algorithm (DLA)^[44]. DLA treats the learning of the propensity function and the ranking list as a dual problem and designs specific EM algorithms to learn the two models. It is a SOTA method for mitigating position bias.

Heckman ensemble (HeckE)^[45]. HeckE ascribes the bias to the fact that users can only view a truncated list of recommendations. It adopts a two-step method: it first designs a Probit model to estimate the probability of an item being examined and then uses this probability to correct the model.

Different from the models in Section 6.2, we add position information to the modeling of $w_k^{(1)}$ and leave the other experimental settings unchanged, i.e., $w_k^{(1)} = \exp (\varphi_1^{\rm{T}}[{\boldsymbol x_{{u_k}}} \circ {\boldsymbol x_{i_k}} \circ {{\boldsymbol{ e}}_{r_k} } \circ {{\boldsymbol{ e}}_{p_k}} ]),$ where ${{\boldsymbol{ e}}_{p_k}}$ denotes the feature vector of the position information. Table 5 shows the performance of our LightAD and the baselines on Simulation. We find that LightAD outperforms the SOTA methods, which verifies that LightAD can handle various biases and their combination.
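The position-aware weight $w_k^{(1)}$ can be sketched as follows. The code assumes that $\circ$ denotes vector concatenation and that users, items, ratings, and positions are encoded as one-hot vectors; both assumptions, together with the helper names and the toy sizes, are for illustration only.

```python
import numpy as np

def one_hot(index, size):
    """Return a one-hot vector of the given size (assumed feature encoding)."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def confidence_weight(phi, x_user, x_item, e_rating, e_position):
    """w^(1) = exp(phi^T [x_u, x_i, e_r, e_p]), reading the circle as concatenation."""
    features = np.concatenate([x_user, x_item, e_rating, e_position])
    return float(np.exp(phi @ features))

# Toy sizes: 3 users, 4 items, 2 rating values, 5 positions
n_users, n_items, n_ratings, n_positions = 3, 4, 2, 5
phi = np.zeros(n_users + n_items + n_ratings + n_positions)
phi[-1] = 1.0  # only the last position slot carries a non-zero parameter

w_last = confidence_weight(phi, one_hot(0, n_users), one_hot(1, n_items),
                           one_hot(1, n_ratings), one_hot(4, n_positions))
w_first = confidence_weight(phi, one_hot(0, n_users), one_hot(1, n_items),
                            one_hot(1, n_ratings), one_hot(0, n_positions))
print(w_last, w_first)
```

In this toy setting, the same user-item-rating triple receives a different weight depending only on the position feature, which is the effect the added term is meant to capture.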

Table 5. Comparison results on the Simulation dataset.

Model	NLL	AUC	NDCG@5
DLA	−0.693	0.572	0.595
HeckE	−0.692	0.587	0.657
LightAD-uniform	−0.731	0.603	0.665
LightAD-fixed	−0.745	0.604	0.680
LightAD-selfpaced	−0.685	0.611	0.700


7. Conclusions

In this work, we identify the inefficiency issue of AutoDebias and propose to accelerate AutoDebias with a carefully designed sampling strategy. The sampler adaptively and dynamically draws informative training instances, which brings provably better convergence and stability than the standard uniform sampler. Extensive experiments on three benchmark datasets validate that our LightAD accelerates AutoDebias by more than an order of magnitude in total training time (Table 4) while maintaining almost equal recommendation performance and universality.

One promising direction for future work is to explore pruning strategies for AutoDebias. AutoDebias involves a large number of debiasing parameters, not all of which are necessary. Leveraging AutoML or the lottery ticket hypothesis to locate the important parameters while pruning the others would significantly reduce the time and space cost, mitigate over-fitting, and potentially boost recommendation performance.

Acknowledgements

This work was supported by the National Key R&D Program of China (2021YFB2900302).

Conflict of interest

The authors declare that they have no conflict of interest.

Existing distributed graph-based semi-supervised learning algorithms suffer from extensive communication and computation. The proposed scheme is more efficient, with low complexity, because it provides a parallel and distributed solution for global Euclidean distance matrix completion. The iteration time of distributed graph-based semi-supervised learning is determined by the slowest node, referred to as the straggler node. A novel coded computation design for distributed graph-based semi-supervised learning is proposed to mitigate the straggler effect. We numerically verified the superiority of the proposed scheme on the Alibaba Cloud elastic compute service. In general, the simulation results demonstrate that our proposed algorithm is efficient and straggler tolerant.

References

[1]	van Engelen J E, Hoos H H. A survey on semi-supervised learning. Machine Learning, 2020, 109: 373–440. DOI: 10.1007/s10994-019-05855-6
[2]	Ang J C, Mirzal A, Haron H, et al. Supervised, unsupervised, and semi-supervised feature selection: A review on gene selection. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016, 13 (5): 971–989. DOI: 10.1109/tcbb.2015.2478454
[3]	Zhu X, Goldberg A B. Introduction to Semi-Supervised Learning. Cham, Switzerland: Springer, 2009.
[4]	Scudder H. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 1965, 11 (3): 363–371. DOI: 10.1109/tit.1965.1053799
[5]	Blum A, Mitchell T. Combining labeled and unlabeled data with co-training. In: COLT '98: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. New York: ACM, 1998: 92–100.
[6]	Belkin M, Niyogi P, Sindhwani V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 2006, 7: 2399–2434. DOI: 10.5555/1248547.1248632
[7]	Belkin M, Niyogi P. Semi-supervised learning on Riemannian manifolds. Machine Learning, 2004, 56: 209–239. DOI: 10.1023/B:MACH.0000033120.25363.1e
[8]	Chapelle O, Schölkopf B, Zien A. Transductive support vector machines. In: Semi-Supervised Learning. Cambridge: MIT Press, 2006: 105–117.
[9]	Chong Y, Ding Y, Yan Q, et al. Graph-based semi-supervised learning: A review. Neurocomputing, 2020, 408 (30): 216–230. DOI: 10.1016/j.neucom.2019.12.130
[10]	Zhu X, Ghahramani Z, Lafferty J. Semi-supervised learning using Gaussian fields and harmonic functions. In: ICML'03: Proceedings of the Twentieth International Conference on International Conference on Machine Learning. Washington, DC: AAAI Press, 2003: 912–919.
[11]	Zhou D, Bousquet O, Lal T N, et al. Learning with local and global consistency. In: NIPS'03: Proceedings of the 16th International Conference on Neural Information Processing Systems. New York: ACM, 2003: 321–328.
[12]	Chen J, Wang C, Sun Y, et al. Semi-supervised Laplacian regularized least squares algorithm for localization in wireless sensor networks. Computer Networks, 2011, 55 (10): 2481–2491. DOI: 10.1016/j.comnet.2011.04.010
[13]	Szummer M, Jaakkola T. Partially labeled classification with Markov random walks. In: NIPS'01: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. Cambridge: MIT Press, 2001: 945–952.
[14]	Grira N, Crucianu M, Boujemaa N. Active semi-supervised fuzzy clustering for image database categorization. In: MIR '05: Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval. New York: ACM, 2005: 9–16.
[15]	Chapelle O, Zien A. Semi-supervised classification by low density separation. In: AISTATS 2005–Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics. Stuttgart, Germany: Max-Planck-Gesellschaft, 2005: 57–64.
[16]	Kostopoulos G, Karlos S, Kotsiantis S, et al. Semi-supervised regression: A recent review. Journal of Intelligent & Fuzzy Systems, 2018, 35 (2): 1483–1500. DOI: 10.3233/JIFS-169689
[17]	Torii M, Wagholikar K, Liu H. Using machine learning for concept extraction on clinical documents from multiple data sources. Journal of the American Medical Informatics Association, 2011, 18: 580–587. DOI: 10.1136/amiajnl-2011-000155
[18]	Scardapane S, Fierimonte R, Wang D, et al. Distributed music classification using random vector functional-link nets. In: 2015 International Joint Conference on Neural Networks (IJCNN). Killarney, Ireland: IEEE, 2015: 1–8.
[19]	Shih T K. Distributed multimedia databases. In: Shih T K, editor. Distributed Multimedia Databases: Techniques and Applications. Hershey, PA: IGI Global, 2002: 2–12.
[20]	Shen P, Du X, Li C. Distributed semi-supervised metric learning. IEEE Access, 2016, 4: 8558–8571. DOI: 10.1109/ACCESS.2016.2632158
[21]	Scardapane S, Fierimonte R, Di Lorenzo P, et al. Distributed semi-supervised support vector machines. Neural Networks, 2016, 80: 43–52. DOI: 10.1016/j.neunet.2016.04.007
[22]	Fierimonte R, Scardapane S, Uncini A, et al. Fully decentralized semi-supervised learning via privacy-preserving matrix completion. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28: 2699–2711. DOI: 10.1109/TNNLS.2016.2597444
[23]	Gan H, Li Z, Wu W, et al. Safety-aware graph-based semi-supervised learning. Expert Systems With Applications, 2018, 107: 243–254. DOI: 10.1016/j.eswa.2018.04.031
[24]	Lee K, Lam M, Pedarsani R, et al. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 2018, 64 (3): 1514–1529. DOI: 10.1109/tit.2017.2736066
[25]	Chen L, Han K, Du Y, et al. Block-division-based wireless coded computation. IEEE Wireless Communications Letters, 2022, 11 (2): 283–287. DOI: 10.1109/LWC.2021.3125983
[26]	Agarwal A, Duchi J C. Distributed delayed stochastic optimization. In: 2012 IEEE 51st IEEE Conference on Decision and Control (CDC). Maui, USA: IEEE, 2012: 5451–5452.
[27]	Alfakih A Y, Khandani A K, Wolkowicz H. Solving euclidean distance matrix completion problems via semidefinite programming. Computational Optimization and Applications, 1999, 12: 13–30. DOI: 10.1023/A:1008655427845
[28]	Al-Homidan S, Wolkowicz H. Approximate and exact completion problems for Euclidean distance matrices using semidefinite programming. Linear Algebra and Its Applications, 2005, 406: 109–141. DOI: 10.1016/j.laa.2005.03.021
[29]	Liu W, Chen L, Zhang W. Decentralized federated learning: Balancing communication and computing costs. IEEE Transactions on Signal and Information Processing Over Networks, 2022, 8: 131–143. DOI: 10.1109/TSIPN.2022.3151242
[30]	Liu W, Chen L, Chen Y, et al. Accelerating federated learning via momentum gradient descent. IEEE Transactions on Parallel and Distributed Systems, 2020, 31 (8): 1754–1766. DOI: 10.1109/TPDS.2020.2975189
[31]	Wang Z, Du Y, Wei K, et al. Vision, application scenarios, and key technology trends for 6G mobile communications. Science China Information Sciences, 2022, 65: 151301. DOI: 10.1007/s11432-021-3351-5


Figure 1. Illustration of the distributed SSL setup with N clients and 1 server. Each client owns a labeled data set ${{{\boldsymbol{S}}}_{i}}$ and an unlabeled data set ${{{\boldsymbol{U}}}_{i}}$ . The server collects information from the clients to run a distributed SSL task that estimates all the unknown labels and returns the estimated labels to the clients.

Figure 2. Illustration of distributed computation for distributed SSL.

Figure 3. Single iteration of coded distributed matrix completion.

Figure 4. Single iteration of coded distributed label propagation.

Figure 5. The round-trip time of a task once between the central server and the client on ECS instance.

Figure 6. Matrix completion accuracy and classification accuracy with rank ratio $\rho ={r}/{(K+2)}$ .

Figure 7. Illustration of the random 100 images from the MNIST data set and their label after the CDGSSL task, as the rank $r$ increased. The highlighted label is the incorrect label after the training process.

Figure 8. Comparison of the average running time of the uncoded algorithm and the MDS-coded algorithm with $k=9$ and $k=8$ under different delay conditions.

References

[1]	van Engelen J E, Hoos H H. A survey on semi-supervised learning. Machine Learning, 2020, 109: 373–440. DOI: 10.1007/s10994-019-05855-6
[2]	Ang J C, Mirzal A, Haron H, et al. Supervised, unsupervised, and semi-supervised feature selection: A review on gene selection. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016, 13 (5): 971–989. DOI: 10.1109/tcbb.2015.2478454
[3]	Zhu X, Goldberg A B. Introduction to Semi-Supervised Learning. Cham, Switzerland: Springer, 2009.
[4]	Scudder H. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 1965, 11 (3): 363–371. DOI: 10.1109/tit.1965.1053799
[5]	Blum A, Mitchell T. Combining labeled and unlabeled data with co-training. In: COLT' 98: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. New York: ACM, 1998: 92–100.
[6]	Belkin M, Niyogi P, Sindhwani V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 2006, 7: 2399–2434. DOI: 10.5555/1248547.1248632
[7]	Belkin M, Niyogi P. Semi-supervised learning on Riemannian manifolds. Machine Learning, 2004, 56: 209–239. DOI: 10.1023/B:MACH.0000033120.25363.1e
[8]	Chapelle O, Schölkopf B, Zien A, Transductive support vector machines. In: Semi-Supervised Learning. Cambridge: MIT Press. 2006, 105–117.
[9]	Chong Y, Ding Y, Yan Q, et al. Graph-based semi-supervised learning: A review. Neurocomputing, 2020, 408 (30): 216–230. DOI: 10.1016/j.neucom.2019.12.130
[10]	Zhu X, Ghahramani Z, Lafferty J. Semi-supervised learning using Gaussian fields and harmonic functions. In: ICML'03: Proceedings of the Twentieth International Conference on International Conference on Machine Learning. Washington, DC: AAAI Press, 2003: 912–919.
[11]	Zhou D, Bousquet O, Lal T N, et al. Learning with local and global consistency. In: NIPS'03: Proceedings of the 16th International Conference on Neural Information Processing Systems. New York: ACM, 2003: 321–328.
[12]	Chen J, Wang C, Sun Y, et al. Semi-supervised Laplacian regularized least squares algorithm for localization in wireless sensor networks. Computer Networks, 2011, 55 (10): 2481–2491. DOI: 10.1016/j.comnet.2011.04.010
[13]	Szummer M, Jaakkola T. Partially labeled classification with Markov random walks. In: NIPS'01: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. Cambridge: MIT Press, 2001: 945–952.
[14]	Grira N, Crucianu M, Boujemaa N. Active semi-supervised fuzzy clustering for image database categorization. In: MIR '05: Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval. New York: ACM, 2005: 9–16.
[15]	Chapelle O, Zien A. Semi-supervised classification by low density separation. In: AISTATS 2005–Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics. Stuttgart, Germany: Max-Planck-Gesellschaft, 2005: 57–64.
[16]	Kostopoulos G, Karlos S, Kotsiantis S, et al. Semi-supervised regression: A recent review. Journal of Intelligent & Fuzzy Systems, 2018, 35 (2): 1483–1500. DOI: 10.3233/JIFS-169689
[17]	Torii M, Wagholikar K, Liu H. Using machine learning for concept extraction on clinical documents from multiple data sources. Journal of the American Medical Informatics Association, 2011, 18: 580–587 . DOI: 10.1136/amiajnl-2011-000155
[18]	Scardapane S, Fierimonte R, Wang D, et al. Distributed music classification using random vector functional-link nets. In: 2015 International Joint Conference on Neural Networks (IJCNN). Killarney, Ireland: IEEE, 2015: 1–8.
[19]	Shih T K, Distributed multimedia databases In: Shih T K, editor. Distributed Multimedia Databases: Techniques and Applications. Hershey, PA: IGI Global, 2002: 2–12.
[20]	Shen P, Du X, Li C. Distributed semi-supervised metric learning. IEEE Access, 2016, 4: 8558–8571. DOI: 10.1109/ACCESS.2016.2632158
[21]	Scardapane S, Fierimonte R, Di Lorenzo P, et al. Distributed semi-supervised support vector machines. Neural Networks, 2016, 80: 43–52 . DOI: 10.1016/j.neunet.2016.04.007
[22]	Fierimonte R, Scardapane S, Uncini A, et al. Fully decentralized semi-supervised learning via privacy-preserving matrix completion. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28: 2699–2711. DOI: 10.1109/TNNLS.2016.2597444
[23]	Gan H, Li Z, Wu W, et al. Safety-aware graph-based semi-supervised learning. Expert Systems With Applications, 2018, 107: 243–254. DOI: 10.1016/j.eswa.2018.04.031
[24]	Lee K, Lam M, Pedarsani R, et al. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 2018, 64 (3): 1514–1529. DOI: 10.1109/tit.2017.2736066
[25]	Chen L, Han K, Du Y, et al. Block-division-based wireless coded computation. IEEE Wireless Communications Letters, 2022, 11 (2): 283–287. DOI: 10.1109/LWC.2021.3125983
[26]	Agarwal A, Duchi J C. Distributed delayed stochastic optimization. In: 2012 IEEE 51st IEEE Conference on Decision and Control (CDC). Maui, USA: IEEE, 2012: 5451–5452.
[27]	Alfakih A Y, Khandani A K, Wolkowicz H. Solving euclidean distance matrix completion problems via semidefinite programming. Computational Optimization and Applications, 1999, 12: 13–30. DOI: 10.1023/A:1008655427845
[28]	Al-Homidan S, Wolkowicz H. Approximate and exact completion problems for Euclidean distance matrices using semidefinite programming. Linear Algebra and Its Applications, 2005, 406: 109–141. DOI: 10.1016/j.laa.2005.03.021
[29]	Liu W, Chen L, Zhang W. Decentralized federated learning: Balancing communication and computing costs. IEEE Transactions on Signal and Information Processing Over Networks, 2022, 8: 131–143. DOI: 10.1109/TSIPN.2022.3151242
[30]	Liu W, Chen L, Chen Y, et al. Accelerating federated learning via momentum gradient descent. IEEE Transactions on Parallel and Distributed Systems, 2020, 31 (8): 1754–1766. DOI: 10.1109/TPDS.2020.2975189
[31]	Wang Z, Du Y, Wei K, et al. Vision, application scenarios, and key technology trends for 6G mobile communications. Science China Information Sciences, 2022, 65: 151301. DOI: 10.1007/s11432-021-3351-5
