Dataset | Users | Items | Training | Validation | Test |
Coat | 290 | 300 | 6960 | 232 | 4176 |
Yahoo!R3 | 15400 | 1000 | 311704 | 2700 | 48600 |
Simulation | 500 | 500 | 12500 | 3750 | 67500 |
When domains, which represent underlying data distributions, differ between training and test datasets, traditional deep neural networks suffer from a substantial drop in their performance. Domain generalization methods aim to boost generalizability on an unseen target domain by using only training data from source domains. Mainstream domain generalization algorithms usually make modifications on some popular feature extraction networks such as ResNet, or add more complex parameter modules after the feature extraction networks. Popular feature extraction networks are usually well pre-trained on large-scale datasets, so they have strong feature extraction abilities, while modifications can weaken such abilities. Adding more complex parameter modules results in a deeper network and is much more computationally demanding. In this paper, we propose a novel feature transfer model based on popular feature extraction networks in domain generalization, without making any changes or adding any module. The generalizability of this feature transfer model is boosted by incorporating a contrastive loss and a data augmentation strategy (i.e., Mixup), and a new sample selection strategy is proposed to coordinate Mixup and contrastive loss. Experiments on the benchmarks PACS and Domainnet demonstrate the superiority of our proposed method against conventional domain generalization methods.
The proposed model with feature transfer and contrastive loss.
When domains, which represent underlying data distributions, differ between training and test datasets, traditional deep neural networks suffer from a substantial drop in their performance. Domain generalization methods aim to boost generalizability on an unseen target domain by using only training data from source domains. Mainstream domain generalization algorithms usually make modifications on some popular feature extraction networks such as ResNet, or add more complex parameter modules after the feature extraction networks. Popular feature extraction networks are usually well pre-trained on large-scale datasets, so they have strong feature extraction abilities, while modifications can weaken such abilities. Adding more complex parameter modules results in a deeper network and is much more computationally demanding. In this paper, we propose a novel feature transfer model based on popular feature extraction networks in domain generalization, without making any changes or adding any module. The generalizability of this feature transfer model is boosted by incorporating a contrastive loss and a data augmentation strategy (i.e., Mixup), and a new sample selection strategy is proposed to coordinate Mixup and contrastive loss. Experiments on the benchmarks PACS and Domainnet demonstrate the superiority of our proposed method against conventional domain generalization methods.
In the era of information explosion, recommender systems (RSs) that connect users with the right items have been recognized as the most effective means to alleviate information overload[1]. Existing recommendation methods mainly focus on developing a machine learning model to better fit the collected user behavior data[2, 3]. However, because the behavior data in practice are observational rather than experimental, various biases occur. For example, user behaviors are subject to what was exposed to the users (aka, exposure bias) or what were the public opinions (aka, conformity bias), deviating from reflecting the real interests of users. Consequently, blindly fitting the RS model without tackling the bias issues will lead to unacceptable performance[4, 5].
To alleviate the unfavorable effect of biases, a surge of recent work has been devoted to debiasing solutions. For example, the inverse propensity score (IPS) was proposed to reweight training samples for expectation-unbiased learning[6, 7]; Data imputation was explored to fill missing values[8, 9]; Knowledge distillation was also used to bridge the models trained on biased data and unbiased data. More recently, AutoDebias[10] proposed to optimize objective function on unbiased data has applied a metalearning technique, which has been demonstrated to be a generic and effective solution for recommendation debiasing. AutoDebias gives both pseudolabels and confidence weights for each user-item pair and is flexible enough to address various biases and outperform previous work by a large margin.
Although effective at managing diverse biases, AutoDebias suffers from serious inefficiency. Specifically, the training of AutoDebias involves calculations for each user-item pair. As such, the complexity of AutoDebias is linear with the product of the number of users and items, which can easily scale to one billion or even larger. This crucial defect severely limits the applicability and potential of AutoDebias. Although stochastic gradient descent with a uniform sampler can be employed for acceleration, this treatment significantly deteriorates model convergence and stability, leading to inferior recommendation performance. Hence, how to accelerate AutoDebias without reducing its accuracy is still an open problem.
To this end, in this work, we propose LightAutoDebias (short as LightAD), which equips AutoDebias with a carefully designed adaptive sampling strategy. We make three important improvements: (ⅰ) We adopt a dynamic sampling distribution that favors the informative training instances with large confidence weights. Further theoretical analysis proves that such a distribution could reduce the sampling variance and accelerate convergence. (ⅱ) We develop a customized factorization-based sampling strategy to support fast sampling from the above dynamic distribution. (ⅲ) Because the model may not reliably estimate the confidence weights in the early training stage, we propose a self-paced strategy that adopts uniform sampling at the beginning while gradually tilting toward factorization-based sampling along the training process.
To summarize, the contributions of this work are as follows:
● We identify the serious inefficiency issue of AutoDebias, which is imperative to be conquered.
● We propose a dynamic and adaptive sampler for AutoDebias, which yields provably better convergence and stability than those of the standard uniform sampler.
● We conduct extensive experiments on three benchmark datasets, validating that our LightAD accelerates AutoDebias by several magnitudes while maintaining almost equal recommendation accuracy.
The remainder of this paper is organized as follows. We first provide a review of related works in Section 2. We then define the problem and recap AutoDebias in Section 3. We elaborate upon the proposed method in detail in Section 4. After that, we theoretically demonstrate the unbiasedness and effectiveness of our approach in Section 5. The experimental results and discussion are presented in Section 6. Finally, we conclude the paper and present some directions for future work are presented in Section 7.
The fact that biases exist in practical RSs and can strongly compromise the performance of RSs has been proven by many scholars. Since we study how to apply unbiased sampling strategies to improve the efficiency issues in this paper, we first make a general classification of biases in data and review some related work on tackling them. Then, we included some specific sampling methods in the debiasing area.
Intuitively, biases in an RS indicate that models lose the ability to provide accurate recommendations for users. In general, we can roughly divide them into the following categories:
Selection bias. Selection bias occurs when users are free to choose which items to rate, thus the collected ratings are not a representative sample of all the ratings[11]. Intuitively, the data we used in RS are not missing at random. The research of Marlin et al.[12] fully proved the presence of selection bias. Schnabel et al.[6] considered introducing used inversed propensity scores (IPSs) to reweight the observed data. Steck et al.[13] designed a novel unbiased metric named ATOP to remedy selection bias. Considering that users can freely rate items they like, Hernández et al.[8] proposed imputation-based methods that directly impute the missing entries with pseudolabels and then weights down the contribution of these missing ratings.
Conformity bias. Conformity bias occurs as users tend to align themselves with others in the group regardless of whether it goes against their own will[11]. In other words, our rating of some products may succumb to public opinion and therefore not reflect our real interests. Krishnan et al.[14] conducted comparative experiments suggesting that social influence bias is an important factor in RS. Liu et al.[15] took conformity considered and directly used rating data behavior as an important feature to quantify conformity bias. In addition, Tang et al.[16] and Hao et al.[17] leveraged social factors more directly to produce the final prediction and introduce specific parameters to eliminate the impact of conformity bias.
Exposure bias. Exposure bias exists when users are only exposed to a small portion of the items provided[11]. Exposure bias is readily comprehensible as the unobserved data can be attributed to users not liking the item or simply not seeing the item. One simple and straightforward strategy to address exposure bias is to consider all noninteraction data as negative samples and specify their confidence. Hu et al.[18] put forward the classic WMF model in which the unobserved data are assigned lower weights. Similarly, Pan et al.[19] went a step further and specified the data confidence with examined user behavior. Another way to address exposure bias is to design an exposure-based probabilistic model that is able to capture whether a user has been exposed to an item. Liang introduced the EXMF model of users’ activity through an exposure variable and proposed a generative process of implicit feedback. Chen et al.[20] assumed that users frequently come into contact with specific communities and thus modelled confidence weights using a community-based strategy.
Position bias. Position bias emerges as users are inclined to interact with items in a higher position on the recommendation list[11]. When dealing with position bias, many researchers assuming that users’ click behavior occurs if and only if an item is relevant and examined by users[21, 22]. Another conventional strategy is to apply a model and its variants[23–25], which assumes that users examine items from the top to the bottom of the recommendation list. Hence, the click behavior relies only on the relevance of all the items that displayed to users.
Popularity bias. Popularity bias occurs when popular items are recommended more often than their popularity[11]. Due to the common long tail phenomenon in recommendation data; recommendation models usually pay more attention to popular items and hence give higher scores to popular items than their real value. Kamishima et al.[26] used a mean-match regularizer in their models to offset popularity bias. Zheng et al.[27] applied counterfactual reasoning to address the popularity bias and assumed that users’ click behavior depends mainly on their interest and items’ popularity. Krishnan et al.[28] introduced adversarial learning in RS to seek a trade-off between the popular item-biased reconstruction objective and the accuracy of long-tail item recommendations.
In addition to the methods mentioned above, sampling is a promising debiasing method that has been studied by many scholars. Specifically, we can use various sampling strategies to determine which data are used to update the parameters and how often in the training of our models. Steffen et al.[29] employed the uniform negative sampler as the basic strategy to improve efficiency. Several neural-based methods[30–32] also consider using negative sampler as an efficient way to increase the model performance. In response to exposure bias, Yu et al.[33] proposed to over sampled popular negative samples because they had a greater chance of being exposed. Nevertheless, these samplers follow a predetermined sampling distribution which may dynamically change according to the model’s state in each iteration; increasing the likehood genetating low-quality samples. Dae et al.[34], Steffen et al.[35], and Ding et al.[36] proposed an adaptive negative sampler, for the purpose of over sampling “difficult” instances to increase learning efficiency. One disadvantage of these heuristic methods is that they fail to capture the real negative instances. To address this problem, Ding et al.[37] explored leveraging side information to enhance the studied the performance of samplers; Chen et al.[38] took advantage of social information in their sampling strategy.
In this section, we first formulate the task of recommendation systems and then expose the nature of bias from the point of view of risk discrepancy. Finally, we analyze the temporal and spatial complexity of AutoDebias to demonstrate the challenges we face in deploying this algorithm.
Suppose we have a recommender system with user set
L(f)=EpU(u,i,r)[δ(f(u,i),r)], | (1) |
where
ˆLT(f)=1|DT||DT|∑k=1δ(f(uk,ik),rk). | (2) |
According to the PAC learning theory[39], if and only if
To eliminate the distribution discrepancy, Ref. [10] puts forward AutoDebias, a general proposed a debiasing framework, which aims at addressing various biases. The main idea of AutoDebias transforms the aforementioned empirical risk functions into:
ˆLT(f|ϕ)=1|DT||DT|∑k=1w(1)kδ(f(uk,ik),rk)+∑u∈U,i∈Iw(2)uiδ(f(u,i),mui), | (3) |
where AutoDebias imputes the pseudolabels
AutoDebias also devises a metlearning-based strategy to learn the appropriate
w(1)k=exp(φT1[euk∘eik∘erk]),w(2)ui=exp(φT2[eu∘ei∘eOui]),mui=σ(φT3[erui∘eOui]), | (4) |
where
Now, we conduct time complexity analyses of AutoDebias. The time complexity mainly comes from the following two parts:
Calculation on observed training instances. As depicted by the first part of the Eq. (3), the calculation only involves the instances in
Calculation on imputed training instances. As depicted by the second part of the Eq. (3), the complexity involves the calculation over all user-item pairs, which is
Considering the sparse nature of real-world recommendation data, the second part has become the bottleneck of AutoDebias. A naive solution for acceleration employs stochastic gradient descent and a uniform sampler, where the gradient can be quickly calculated on the sampled portion of the dataset. However, we find that this treatment would significantly deteriorate model convergence and stability, leading to inferior recommendation performance. In fact, not all user-item pairs are equally important in model training. We observe fairly scattered
In this section, we present LightAD which equips AutoDebias with a carefully designed adaptive sampling strategy. LightAD adopts an uneven sampling distribution and factorization-based efficient sampling strategy. We will detail LightAD in the following parts.
As mentioned before, the second part of Eq. (3) has become the bottleneck of AutoDebias. Here we leverage a sampler to accelerate the training, where the gradient can be estimated from a much smaller sampled training dataset. Formally, the learning with the sampler can be formulated as follows:
S∼Multinomial(N,ps(u,i)), | (5) |
which draws a small set of users and items, i.e.,
ˆLT(f|ϕ)=1|DT||DT|∑k=1w(1)kδ(f(uk,ik),rk)+1|S|∑u,i∈Sδ(f(u,i),mui), | (6) |
where the confidence weights
Importance-aware sampling distribution. Note that different data may have different confidence weights
Fast factorization-based sampling. Now the question lies in how to perform fast sampling from the distribution
w(2)ui=exp(φT2[eu∘ei])=exp(φT2[eu])∗exp(φT2[ei])≡w(2)u∗w(2)i, | (7) |
where
Su∼Multinomial(N,pu),Si∼Multinomial(N,pi), | (8) |
where
Note that in the early training stage, the model may not yield a reliable estimate of the confidence weights. Directly learning a model from an informative sampler may lead to suboptimal results. To address this problem, we propose a self-paced strategy that adopts a uniform distribution at the beginning while gradually alters the distribution as the training proceeds. Formally, we adopt the following sampling strategy:
S1∼Multinomial(αN,ps(u,i)),S2∼Multinomial((1−α)N,U(u,i)), | (9) |
where
ˆLP(f|ϕ)=1|DT||DT|∑k=1w(1)kδ(f(uk,ik),rk)+α∑u,iw(2)ui|S1|⋅∑u,i∈S1δ(f(u,i),mui)+(1−α)nm|S2|∑u,i∈S2w(2)uiδ(f(u,i),mui). | (10) |
The former section elaborates how LightAD accelerates AutoDebias. In this section, we conduct theoretical analyses to answer the following questions: Does the unbiasedness still hold with our proposed sampling strategy? Does the proposed informative sampler reduce the sampling variance?
It has been proven that the empirical risk function of AutoDebias can be an unbiased estimator of the true risk. Now we connect the objective of LightAD (Eq. (10)) with AutoDebias, which reveals the unbiasedness of LightAD. In fact, we have:
ES1,S2[ˆLP(f|ϕ)]=1|DT||DT|∑k=1w(1)kδ(f(uk,ik),rk)+ES1[α∑u,iw(2)ui|S1|∑u,i∈S1δ(f(u,i),mui)]+ES2[(1−α)nm|S2|∑u,i∈S2w(2)uiδ(f(u,i),mui)]=1|DT||DT|∑k=1w(1)kδ(f(uk,ik),rk)+α∑u,iw(2)ui|S1||S1|Eu,i∼ps[δ(f(u,i),mui)]+(1−α)nm|S2||S2|Eu,i∼U[w(2)uiδ(f(u,i),mui)]= |
1|DT||DT|∑k=1w(1)kδ(f(uk,ik),rk)+α∑u,iw(2)ui∑u,iw(2)ui∑u,iw(2)uiδ(f(u,i),mui)+(1−α)nm∑u,iw(2)uiδ(f(u,i),mui)=ˆLS(f|ϕ). | (11) |
From the above mathematical deduction, we can safely draw the conclusion that the objective of LightAD is indeed to use an unbiased estimator of AutoDebias and the ideal true risk.
How does the conventional uniform sampler perform? Does the proposed informative sampler reduce the variance? In this section, we aim to provide a theoretical answer to these problems.
Let us first formulate the variance of the estimated objective with the sampler. In fact, we have the following upper-bound of the variance:
VarS[ˆLS]=1|S|Varps[w(2)uips(u,i)δ(f(u,i),mui)]⩽1|S|(∑u,iVarui[w(2)ui]+Eu,i[w(2)ui]2ps(u,i))×Eps[δ(f(u,i),mui)]2−1|S|Eps[w(2)uiδ(f(u,i),mui)]2. | (12) |
For a uniform sampler, the variance of the objective is subject to the variance of weights
To address this problem, we should choose a better sampling distribution to reduce the upper bound of the estimation variance. In fact, we have:
∑ui(w(2)ui)2ps(u,i)=∑ui(w(2)ui)2ps(u,i)∑uips(u,i)⩾(∑ui√(w(2)ui)2ps(u,i)√ps(u,i))2=Eui[w(2)ui]2, | (13) |
where we employ the Cauchy inequality, and the equation holds if and only if
In this section, we first describe the experimental settings, and then present the detailed experiments with the purpose of answering the following questions:
RQ1: How does our LightAD perform compared with AutoDebias in recommendation performance?
RQ2: How does the proposed LightAD accelerate AutoDebias?
RQ3: How does the hyperparameter
RQ4: How does LightAD perform when dealing with the simultaneous presence of various biases?
We conduct our experiments with two publicly accessible datasets and one synthetic dataset: Yahoo!R3, Coat, and Simulation. The concrete statistics of the three datasets are summarized in Table 1.
Dataset | Users | Items | Training | Validation | Test |
Coat | 290 | 300 | 6960 | 232 | 4176 |
Yahoo!R3 | 15400 | 1000 | 311704 | 2700 | 48600 |
Simulation | 500 | 500 | 12500 | 3750 | 67500 |
Yahoo!R3. This dataset contains two types of ratings collected from two sources. The first source collects ratings of users’ normal interactions with music in Yahoo! services (i.e., users pick and rate items according to their own preferences), which can be considered as a stochastic logging policy. Thus, this type of data is obsessed with bias. The second kind of data is ratings from randomly selected songs collected during an online survey (i.e., uniform logging policy). Approximately, it can be regarded as unbiased and reflects the real interest of users.
Coat. This is a dataset collected by Ref. [7], which simulates customers shopping behaviors for coats in an online service. In this dataset, users were first given a web-shop interface and then asked to select the coat they wanted to buy most. Shortly afterwards, they were requested to rate 24 of the self-selected coats and 16 of the randomly picked ones on a five-point scale. Therefore, this dataset consists of self-selected ratings and the uniformly selected ratings which can be regarded as biased data and unbiased data respectively.
Simulation. A great advantage of AutoDebias is that it can handle multiple biases simultaneously, which means good universality in complex scenario. We need to verify that this universality also persists in our model. Since there is no public dataset that satisfies such a scenario, we borrow from AutoDebias and adopt a synthetic dataset Simulation where both position bias and selection bias are taken into account. Unlike the previous two datasets, this one contains the position information of items when the interaction occurs. In short, this dataset also includes a large amount of biased data and a relatively small amount of unbiased data. The rationale for the way of data synthesis can be found in Refs. [40, 41].
To be consistent with AutoDebias, we follow the experimental setup with them in training, validating, and testing. Specifically, we use the biased data
Evaluation protocols. The evaluation metrics in this work are the negative logarithmic loss (NLL), the area under the ROC curve (AUC), and normalized discounted cumulative gain (NDCG):
NLL. This metric evaluates the performance of the predictions:
NLL=−1|Dve|∑(u,i,r)∈Dvelog(1+e−r∗fθ(u,i)), | (14) |
where
AUC. This metric evaluates the performance of rankings:
AUC=∑(u,i)∈D+ve^Ranku,i−(|D+ve|+1)(|D+ve|)/2(|D+ve|)∗(|Dve|−|D+ve|), | (15) |
where
NDCG@k. This metric evaluates the quality of recommendation through discounted importance based on ranking position:
\begin{array}{l} \text{DCG}{_u}@k = \displaystyle \sum\limits_{(u,i) \in {D_\text{ve}}} {\dfrac{{\mathbf I({{\widehat {\text{Rank}}}_{u,i}} \leqslant k)}}{{\log({{\widehat {\text{Rank}}}_{u,i}} + 1)}}}, \\ \text{NDCG}@k = \dfrac{1}{|\cal{{U}}|} \displaystyle \sum\limits_{u\in \cal{{U}}} \dfrac{\text{DCG}_u@k}{\text{IDCG}_u@k}, \end{array} | (16) |
where
We implement our experiment with Pytorch and the code is available at https://github.com/Chris-Mraz/LightAD. We perform grid search to tune the hyperparameters for all candidate methods on the validation set.
Baselines. To demonstrate the effectiveness of our proposed sampled method, we use matrix factorization (MF)[42] as the backbone and the following models as our baselines:
Matrix factorization (MF)[42]. MF is one classical model in RS. In this model, the preference of user u for item i can be formulated as
Inverse propensity score (IPS)[6]. IPS is a counterfactual-based recommendation method that estimates the propensity score via naive bayes.
Doubly robust (DR)[43]. DR combines the ideas of the IPS and data imputation. The debiasing capacity of this model can be ensured if either the imputed data or propensity score are accurately estimated
CausE and KD-label. CausE and KD-label are excellent methods that transfer the unbiased information with a teacher model using knowledge distillation techniques. Both of them are the best performing models in Ref. [5].
AutoDebias. AutoDebias is a generic and effective solution for recommendation debiasing.
We also test three versions of our LightAD for ablation studies: (ⅰ) LightAD-uniform, which improves AutoDebias with a uniform sampler. (ⅱ) LightAD-fixed, where we remove the selfpaced strategy and fix
Table 2 shows the performance of our LightAD and the baselines with respect to three evaluation metrics on two datasets. We make the following two observations:
Model | Yahoo!R3 | Coat | |||||
NLL | AUC | NDCG@5 | NLL | AUC | NDCG@5 | ||
MF (biased) | −0.587 | 0.727 | 0.547 | −0.546 | 0.757 | 0.484 | |
MF (uniform) | −0.503 | 0.568 | 0.435 | −0.641 | 0.540 | 0.362 | |
MF (combined) | −0.579 | 0.729 | 0.551 | −0.538 | 0.750 | 0.492 | |
IPS | −0.450 | 0.725 | 0.556 | −0.531 | 0.755 | 0.484 | |
DR | −0.444 | 0.723 | 0.552 | −0.513 | 0.760 | 0.501 | |
CausE | −0.585 | 0.729 | 0.553 | −0.529 | 0.762 | 0.500 | |
KD-label | −0.587 | 0.727 | 0.547 | −0.546 | 0.757 | 0.484 | |
AutoDebias | −0.419 | 0.748 | 0.648 | −0.529 | 0.766 | 0.501 | |
LightAD-uniform | −0.410 | 0.734 | 0.586 | −0.505 | 0.710 | 0.473 | |
LightAD-fixed | –0.406 | 0.737 | 0.632 | –0.499 | 0.742 | 0.489 | |
LightAD-selfpaced | −0.425 | 0.748 | 0.641 | −0.510 | 0.765 | 0.517 |
(Ⅰ) As a whole, our LightAD-selfpaced outperforms other baselines by a large margin. Contrast with AutoDebias, our model achieves similar performance with AutoDebias in both datasets and some metrics even exceed it. This can be attributed to the fact that AutoDebias gives the same attention to all samples during training, which will hinder the capture of truly useful information. It’s worth noting that the learning of sampling strategy is only conducted on a very small amount of data instances i.e., the same order of magnitude as the biased instances, which is far less than
(Ⅱ) In terms of all the sampling strategies, LightAD with simple uniform sampling obtains the worst performance and convergence. This is consistent with our intuition that uniform sampling suffers from high variance. The fact that LightAD-fixed outperforms LightAD-uniform proves that the informative sampler could partly address this problem. LightAD-selfpaced achieves the best performance which validates the effectiveness of the proposed selfpaced strategy.
As we mention in Section 3.3, the calculation on imputed training instances consumes enormous computing resources and restricts the practicality of the AutoDebias. In this part, we test this notion through ablation experiments. Table 3 shows the time cost of the ablated methods with different components being deleted. From the table, we can conclude that the introduction of these components
Model | w^{(1)} | m | w^{(2)} | Epoch (s) |
MF | \times | \times | \times | 0.02 |
AutoDebias-w(1) | √ | \times | \times | 0.07 |
AutoDebias-w(1)m | √ | √ | \times | 0.92 |
AutoDebias | √ | √ | √ | 1.3 |
To verify the efficiency promotion of our methods, we conduct experiments in two dimensions: the time cost in each epoch and the overall time cost. To ensure the reliability of the experiment as much as possible, we validated the experiment under the same conditions i.e., the same hyperparameter configuration and computing resource. The statistics of training time for these four models are shown in Table 4. From this table, we can find that employing samplers could indeed accelerate the training of AutoDebias. More interestingly, employing a uniform sampler achieves fast calculation in each epoch while requiring more time cost in total. We ascribe it to the fact that informative sampler could improve convergence and reduce the number of the required epochs. To prove this, we plotted the training process of LightAD-fixed and LightAD-uniform. As shown in the Fig. 1, it takes less than 20 epochs for LightAD-fixed to converge while taking more than 40 epochs to achieve the analogous level in LightAD-uniform. The experimental results are consistent with our previous theoretical proof.
Model | Epoch (s) | Total time (s) |
AutoDebias | 1.30 | 1260.50 |
LightAD-uniform | 0.14 | 74.10 |
LightAD-fixed | 0.34 | 107.36 |
LightAD-selfpaced | 0.20 | 29.05 |
Note that in LightAD-selfpaced, we gradually increase the contribution of the informative sampler while hyperparamter
To demonstrate the universality of our LightAD, we conduct experiments on Simulation. We compare our three sampling strategies with two SOTA methods in mitigating both position bias and selection bias:
Dual learning algorithm (DLA)[44]. DLA treated the learning of the propensity function and the ranking list as a dual problem and designed specific EM algorithms to learn the two models, which is one SOTA method in mitigating position bias.
Heckman ensemble (HeckE)[45]. Ascribed bias to the reality that users can only view a truncated list of recommendations. HeckE adopted a two-step method in which it first designed a Probit model to estimate the probability of a item being examined and then took advantage of this probability to modify their model.
Different from the models in Section 6.2, we add position information in the modeling of
NLL | AUC | NDCG@5 | |
DLA | −0.693 | 0.572 | 0.595 |
HeckE | −0.692 | 0.587 | 0.657 |
LightAD-uniform | −0.731 | 0.603 | 0.665 |
LightAD-fixed | −0.745 | 0.604 | 0.680 |
LightAD-selfpaced | −0.685 | 0.611 | 0.700 |
In this work, we identify the inefficiency issue of AutoDebias, and propose to accelerate AutoDebias with a carefully designed sampling strategy. The sampler could adaptively and dynamically draw informative training instances, which brings provably better convergence and stability than the standard uniform sampler. Extensive experiments on three benchmark datasets validate that our LightAD accelerates AutoDebias by several magnitudes while maintaining almost equal recommendation performance and universality.
One promising research direction for future work is to explore pruning strategy for AutoDebias. AutoDebias involves a large number of debiasing parameters, and of course not all parameters are necessary. Leveraging AutoML or lottery hypothesis to locate important parameters while leave others would significantly reduce the time and space cost, mitigate over-fitting and potentially boost recommendation performance.
This work was supported by the National Natural Science Foundation of China (12171451) and Anhui Center for Applied Mathematics.
The authors declare that they have no conflict of interest.
[1] |
Shao R, Lan X, Li J, et al. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2020 : 10015–10023.
|
[2] |
Ouyang C, Chen C, Li S, et al. Causality-inspired single-source domain generalization for medical image segmentation. IEEE Transactions on Medical Imaging, 2023, 42: 1095–1106. DOI: 10.1109/TMI.2022.3224067
|
[3] |
Guo L L, Pfohl S R, Fries J, et al. Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine. Scientific Reports, 2022, 12: 2726. DOI: 10.1038/s41598-022-06484-1
|
[4] |
Motiian S, Piccirilli M, Adjeroh D A, et al. Unified deep supervised domain adaptation and generalization. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017 : 5716–5726.
|
[5] |
Li H, Pan S J, Wang S, et al. Domain generalization with adversarial feature learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018 : 5400–5409.
|
[6] |
Li D, Yang Y, Song Y Z, et al. Learning to generalize: Meta-learning for domain generalization. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32 (1). DOI: 10.1609/aaai.v32i1.11596
|
[7] |
Mancini M, Bulò S R, Caputo B, et al. Robust place categorization with deep domain generalization. IEEE Robotics and Automation Letters, 2018, 3: 2093–2100. DOI: 10.1109/LRA.2018.2809700
|
[8] |
Mancini M, Bulò S R, Caputo B, et al. Best sources forward: Domain generalization through source-specific nets. In: 2018 25th IEEE International Conference on Image Processing (ICIP). Athens, Greece: IEEE, 2018 : 1353–1357.
|
[9] |
Segu M, Tonioni A, Tombari F. Batch normalization embeddings for deep domain generalization. Pattern Recognition, 2023, 135: 109115. DOI: 10.1016/j.patcog.2022.109115
|
[10] |
Zhang H, Cisse M, Dauphin Y N, et al. Mixup: Beyond empirical risk minimization. arXiv: 1710.09412, 2017.
|
[11] |
Ding Z, Fu Y. Deep domain generalization with structured low-rank constraint. IEEE Transactions on Image Processing, 2018, 27: 304–313. DOI: 10.1109/TIP.2017.2758199
|
[12] |
Nalisnick E, Matsukawa A, Teh Y W, et al. Do deep generative models know what they don’t know? arXiv: 1810.09136, 2019 .
|
[13] |
Li Y, Tian X, Gong M, et al. Deep domain generalization via conditional invariant adversarial networks. In: Ferrari V, Hebert M, Sminchisescu C, editors. Computer Vision–ECCV 2018. Cham: Springer, 2018 , 11219: 647–663.
|
[14] |
Long M, Cao Y, Wang J, et al. Learning transferable features with deep adaptation networks. In: Proceedings of the 32nd International Conference on Machine Learning. New York: ACM, 2015 , 37: 97–105.
|
[15] |
Ganin Y, Lempitsky V. Unsupervised domain adaptation by backpropagation. In: Proceedings of the 32nd International Conference on Machine Learning. New York: ACM, 2015 , 37: 1180–1189.
|
[16] |
Bousmalis K, Silberman N, Dohan D, et al. Unsupervised pixel-level domain adaptation with generative adversarial networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017 : 95–104.
|
[17] |
Xu R, Chen Z, Zuo W, et al. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018 : 3964–3973.
|
[18] |
Zhou K, Yang Y, Hospedales T, et al. Learning to generate novel domains for domain generalization. In: Vedaldi A, Bischof H, Brox T, editors. Computer Vision–ECCV 2020. Cham: Springer, 2020 , 12361: 561–578.
|
[19] |
Bai H, Sun R, Hong L, et al. DecAug: Out-of-distribution generalization via decomposed feature representation and semantic augmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35: 6705–6713. DOI: 10.1609/aaai.v35i8.16829
|
[20] |
Chattopadhyay P, Balaji Y, Hoffman J. Learning to balance specificity and invariance for in and out of domain generalization. In: European Conference on Computer Vision. Cham: Springer, 2020 : 301–318.
|
[21] |
Peng X, Bai Q, Xia X, et al. Moment matching for multi-source domain adaptation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, 2020 : 1406–1415.
|
[22] |
Li D, Zhang J, Yang Y, et al. Episodic training for domain generalization. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, 2020 : 1446–1455.
|
[23] |
S Seo, Y Suh, D Kim, G Kim, J Han, and B Han. Learning to optimize domain specific normalization for domain generalization. In: Vedaldi A, Bischof H, Brox T, et al. editors. Computer Vision─ECCV 2020. Cham: Springer, 2020 : 68–83.
|
[24] |
K Zhou, Y Yang, Y Qiao, et al. Domain generalization with mixstyle. arXiv: 2104.02008, 2021 .
|
[25] |
Cai Q, Wang Y, Pan Y, et al. Joint contrastive learning with infinite possibilities. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. New York: ACM, 2020 : 12638–12648.
|
[26] |
Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning. New York: ACM, 2020 : 1597–1607.
|
[27] |
Mitrovic J, McWilliams B, Walker J, et al. Representation learning via invariant causal mechanisms. arXiv: 2010.07922, 2020 .
|
[28] |
He K, Fan H, Wu Y, et al. Momentum contrast for unsupervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020 : 9726–9735.
|
[29] |
Wu Z, Xiong Y, Yu S X, et al. Unsupervised feature learning via non-parametric instance discrimination. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018 : 3733–3742.
|
[30] |
Yao T, Zhang Y, Qiu Z, et al. SeCo: Exploring sequence supervision for unsupervised representation learning. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35: 10656–10664. DOI: 10.1609/aaai.v35i12.17274
|
[31] |
Li D, Yang Y, Song Y Z, et al. Deeper, broader and artier domain generalization. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017 : 5543–5551.
|
[32] |
Carlucci F M, D'Innocente A, Bucci S, et al. Domain generalization by solving jigsaw puzzles. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2020 : 2224–2233.
|
[33] |
Matsuura T, Harada T. Domain generalization using a mixture of multiple latent domains. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 11749–11756. DOI: 10.1609/aaai.v34i07.6846
|
[34] |
Mahajan D, Tople S, Sharma A. Domain generalization using causal matching. In: Proceedings of the 38th International Conference on Machine Learning. Volume 139 of Proceedings of Machine Learning Research. PMLR, 2021 , 139: 7313–7324.
|
[35] |
Nam H, Lee H, Park J, et al. Reducing domain gap by reducing style bias. arXiv: 1910.11645, 2019.
|
[36] |
Zhou K, Yang Y, Hospedales T, et al. Deep domain-adversarial image generation for domain generalisation. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 13025–13032. DOI: 10.1609/aaai.v34i07.7003
|
[37] |
Balaji Y, Sankaranarayanan S, Chellappa R. MetaReg: towards domain generalization using meta-regularization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. New York: ACM, 2018 : 1006–1016.
|
[38] |
Chattopadhyay P, Balaji Y, Hoffman J. Learning to balance specificity and invariance for in and out of domain generalization. In: Vedaldi A, Bischof H, Brox T, editors. Computer Vision–ECCV 2020. Cham: Springer, 2020 : 301–318.
|
Figure
2.
A causal model underlying our proposed method.
[1] |
Shao R, Lan X, Li J, et al. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2020 : 10015–10023.
|
[2] |
Ouyang C, Chen C, Li S, et al. Causality-inspired single-source domain generalization for medical image segmentation. IEEE Transactions on Medical Imaging, 2023, 42: 1095–1106. DOI: 10.1109/TMI.2022.3224067
|
[3] |
Guo L L, Pfohl S R, Fries J, et al. Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine. Scientific Reports, 2022, 12: 2726. DOI: 10.1038/s41598-022-06484-1
|
[4] |
Motiian S, Piccirilli M, Adjeroh D A, et al. Unified deep supervised domain adaptation and generalization. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017 : 5716–5726.
|
[5] |
Li H, Pan S J, Wang S, et al. Domain generalization with adversarial feature learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018 : 5400–5409.
|
[6] |
Li D, Yang Y, Song Y Z, et al. Learning to generalize: Meta-learning for domain generalization. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32 (1). DOI: 10.1609/aaai.v32i1.11596
|
[7] |
Mancini M, Bulò S R, Caputo B, et al. Robust place categorization with deep domain generalization. IEEE Robotics and Automation Letters, 2018, 3: 2093–2100. DOI: 10.1109/LRA.2018.2809700
|
[8] |
Mancini M, Bulò S R, Caputo B, et al. Best sources forward: Domain generalization through source-specific nets. In: 2018 25th IEEE International Conference on Image Processing (ICIP). Athens, Greece: IEEE, 2018 : 1353–1357.
|
[9] |
Segu M, Tonioni A, Tombari F. Batch normalization embeddings for deep domain generalization. Pattern Recognition, 2023, 135: 109115. DOI: 10.1016/j.patcog.2022.109115
|
[10] |
Zhang H, Cisse M, Dauphin Y N, et al. Mixup: Beyond empirical risk minimization. arXiv: 1710.09412, 2017.
|
[11] |
Ding Z, Fu Y. Deep domain generalization with structured low-rank constraint. IEEE Transactions on Image Processing, 2018, 27: 304–313. DOI: 10.1109/TIP.2017.2758199
|
[12] |
Nalisnick E, Matsukawa A, Teh Y W, et al. Do deep generative models know what they don’t know? arXiv: 1810.09136, 2019 .
|
[13] |
Li Y, Tian X, Gong M, et al. Deep domain generalization via conditional invariant adversarial networks. In: Ferrari V, Hebert M, Sminchisescu C, editors. Computer Vision–ECCV 2018. Cham: Springer, 2018 , 11219: 647–663.
|
[14] |
Long M, Cao Y, Wang J, et al. Learning transferable features with deep adaptation networks. In: Proceedings of the 32nd International Conference on Machine Learning. New York: ACM, 2015 , 37: 97–105.
|
[15] |
Ganin Y, Lempitsky V. Unsupervised domain adaptation by backpropagation. In: Proceedings of the 32nd International Conference on Machine Learning. New York: ACM, 2015 , 37: 1180–1189.
|
[16] |
Bousmalis K, Silberman N, Dohan D, et al. Unsupervised pixel-level domain adaptation with generative adversarial networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017 : 95–104.
|
[17] |
Xu R, Chen Z, Zuo W, et al. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018 : 3964–3973.
|
[18] |
Zhou K, Yang Y, Hospedales T, et al. Learning to generate novel domains for domain generalization. In: Vedaldi A, Bischof H, Brox T, editors. Computer Vision–ECCV 2020. Cham: Springer, 2020 , 12361: 561–578.
|
[19] |
Bai H, Sun R, Hong L, et al. DecAug: Out-of-distribution generalization via decomposed feature representation and semantic augmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35: 6705–6713. DOI: 10.1609/aaai.v35i8.16829
|
[20] |
Chattopadhyay P, Balaji Y, Hoffman J. Learning to balance specificity and invariance for in and out of domain generalization. In: European Conference on Computer Vision. Cham: Springer, 2020 : 301–318.
|
[21] |
Peng X, Bai Q, Xia X, et al. Moment matching for multi-source domain adaptation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, 2020 : 1406–1415.
|
[22] |
Li D, Zhang J, Yang Y, et al. Episodic training for domain generalization. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, 2020 : 1446–1455.
|
[23] |
S Seo, Y Suh, D Kim, G Kim, J Han, and B Han. Learning to optimize domain specific normalization for domain generalization. In: Vedaldi A, Bischof H, Brox T, et al. editors. Computer Vision─ECCV 2020. Cham: Springer, 2020 : 68–83.
|
[24] |
K Zhou, Y Yang, Y Qiao, et al. Domain generalization with mixstyle. arXiv: 2104.02008, 2021 .
|
[25] |
Cai Q, Wang Y, Pan Y, et al. Joint contrastive learning with infinite possibilities. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. New York: ACM, 2020 : 12638–12648.
|
[26] |
Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning. New York: ACM, 2020 : 1597–1607.
|
[27] |
Mitrovic J, McWilliams B, Walker J, et al. Representation learning via invariant causal mechanisms. arXiv: 2010.07922, 2020 .
|
[28] |
He K, Fan H, Wu Y, et al. Momentum contrast for unsupervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020 : 9726–9735.
|
[29] |
Wu Z, Xiong Y, Yu S X, et al. Unsupervised feature learning via non-parametric instance discrimination. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018 : 3733–3742.
|
[30] |
Yao T, Zhang Y, Qiu Z, et al. SeCo: Exploring sequence supervision for unsupervised representation learning. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35: 10656–10664. DOI: 10.1609/aaai.v35i12.17274
|
[31] |
Li D, Yang Y, Song Y Z, et al. Deeper, broader and artier domain generalization. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017 : 5543–5551.
|
[32] |
Carlucci F M, D'Innocente A, Bucci S, et al. Domain generalization by solving jigsaw puzzles. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2020 : 2224–2233.
|
[33] |
Matsuura T, Harada T. Domain generalization using a mixture of multiple latent domains. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 11749–11756. DOI: 10.1609/aaai.v34i07.6846
|
[34] |
Mahajan D, Tople S, Sharma A. Domain generalization using causal matching. In: Proceedings of the 38th International Conference on Machine Learning. Volume 139 of Proceedings of Machine Learning Research. PMLR, 2021 , 139: 7313–7324.
|
[35] |
Nam H, Lee H, Park J, et al. Reducing domain gap by reducing style bias. arXiv: 1910.11645, 2019.
|
[36] |
Zhou K, Yang Y, Hospedales T, et al. Deep domain-adversarial image generation for domain generalisation. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 13025–13032. DOI: 10.1609/aaai.v34i07.7003
|
[37] |
Balaji Y, Sankaranarayanan S, Chellappa R. MetaReg: towards domain generalization using meta-regularization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. New York: ACM, 2018 : 1006–1016.
|
[38] |
Chattopadhyay P, Balaji Y, Hoffman J. Learning to balance specificity and invariance for in and out of domain generalization. In: Vedaldi A, Bischof H, Brox T, editors. Computer Vision–ECCV 2020. Cham: Springer, 2020 : 301–318.
|