
ISSN 0253-2778

CN 34-1054/N

Open Access | JUSTC | Information Science and Technology | Article | 16 June 2024

A feature transfer model with Mixup and contrastive loss in domain generalization

Cite this: JUSTC, 2024, 54(4): 0404
https://doi.org/10.52396/JUSTC-2023-0010
CSTR: 32290.14.JUSTC-2023-0010
More Information
  • Author Bio:

    Yuesong Wang is currently a graduate student under the supervision of Prof. Hong Zhang at the University of Science and Technology of China. His research interests focus on machine learning.

    Hong Zhang is a Full Professor with the University of Science and Technology of China (USTC). He received his Bachelor’s degree in Mathematics and Ph.D. degree in Statistics from USTC in 1997 and 2003, respectively. His major research interests include statistical genetics, causal inference, and machine learning.

  • Corresponding author:

    Hong Zhang, E-mail: zhangh@ustc.edu.cn

  • Received Date: January 19, 2023
  • Accepted Date: May 03, 2023
  • Available Online: June 16, 2024
  • When domains, which represent underlying data distributions, differ between training and test datasets, traditional deep neural networks suffer from a substantial drop in their performance. Domain generalization methods aim to boost generalizability on an unseen target domain by using only training data from source domains. Mainstream domain generalization algorithms usually make modifications on some popular feature extraction networks such as ResNet, or add more complex parameter modules after the feature extraction networks. Popular feature extraction networks are usually well pre-trained on large-scale datasets, so they have strong feature extraction abilities, while modifications can weaken such abilities. Adding more complex parameter modules results in a deeper network and is much more computationally demanding. In this paper, we propose a novel feature transfer model based on popular feature extraction networks in domain generalization, without making any changes or adding any module. The generalizability of this feature transfer model is boosted by incorporating a contrastive loss and a data augmentation strategy (i.e., Mixup), and a new sample selection strategy is proposed to coordinate Mixup and contrastive loss. Experiments on the benchmarks PACS and DomainNet demonstrate the superiority of our proposed method over conventional domain generalization methods.

    The proposed model with feature transfer and contrastive loss.


    • We propose a feature transfer model for domain generalization and a new sampling strategy based on Mixup in cooperation with contrastive loss.
    • Experiments on two mainstream datasets demonstrate the superiority of our method.
  • In the era of information explosion, recommender systems (RSs), which connect users with the right items, have been recognized as the most effective means of alleviating information overload[1]. Existing recommendation methods mainly focus on developing a machine learning model that better fits the collected user behavior data[2, 3]. However, because behavior data in practice are observational rather than experimental, various biases occur. For example, user behaviors are subject to what was exposed to the users (aka exposure bias) or to prevailing public opinion (aka conformity bias), and thus deviate from reflecting the real interests of users. Consequently, blindly fitting the RS model without tackling these bias issues will lead to unacceptable performance[4, 5].

    To alleviate the unfavorable effects of biases, a surge of recent work has been devoted to debiasing solutions. For example, the inverse propensity score (IPS) was proposed to reweight training samples for expectation-unbiased learning[6, 7]; data imputation was explored to fill in missing values[8, 9]; and knowledge distillation was used to bridge models trained on biased and unbiased data. More recently, AutoDebias[10], which applies a meta-learning technique to optimize the objective function on unbiased data, has been demonstrated to be a generic and effective solution for recommendation debiasing. AutoDebias gives pseudolabels and confidence weights for each user-item pair and is flexible enough to address various biases, outperforming previous work by a large margin.

    Although effective at managing diverse biases, AutoDebias suffers from serious inefficiency. Specifically, the training of AutoDebias involves calculations for each user-item pair. As such, the complexity of AutoDebias is linear with the product of the number of users and items, which can easily scale to one billion or even larger. This crucial defect severely limits the applicability and potential of AutoDebias. Although stochastic gradient descent with a uniform sampler can be employed for acceleration, this treatment significantly deteriorates model convergence and stability, leading to inferior recommendation performance. Hence, how to accelerate AutoDebias without reducing its accuracy is still an open problem.

    To this end, in this work, we propose LightAutoDebias (short as LightAD), which equips AutoDebias with a carefully designed adaptive sampling strategy. We make three important improvements: (ⅰ) We adopt a dynamic sampling distribution that favors the informative training instances with large confidence weights. Further theoretical analysis proves that such a distribution could reduce the sampling variance and accelerate convergence. (ⅱ) We develop a customized factorization-based sampling strategy to support fast sampling from the above dynamic distribution. (ⅲ) Because the model may not reliably estimate the confidence weights in the early training stage, we propose a self-paced strategy that adopts uniform sampling at the beginning while gradually tilting toward factorization-based sampling along the training process.

    To summarize, the contributions of this work are as follows:

    ● We identify the serious inefficiency issue of AutoDebias, which it is imperative to overcome.

    ● We propose a dynamic and adaptive sampler for AutoDebias, which yields provably better convergence and stability than those of the standard uniform sampler.

    ● We conduct extensive experiments on three benchmark datasets, validating that our LightAD accelerates AutoDebias by several orders of magnitude while maintaining almost equal recommendation accuracy.

    The remainder of this paper is organized as follows. We first review related work in Section 2. We then define the problem and recap AutoDebias in Section 3. We elaborate on the proposed method in detail in Section 4. After that, we theoretically demonstrate the unbiasedness and effectiveness of our approach in Section 5. The experimental results and discussion are presented in Section 6. Finally, we conclude the paper and present some directions for future work in Section 7.

    The fact that biases exist in practical RSs and can strongly compromise RS performance has been established by many scholars. Since this paper studies how to apply unbiased sampling strategies to improve efficiency, we first give a general classification of biases in data and review related work on tackling them. We then review specific sampling methods in the debiasing area.

    Intuitively, biases in an RS cause models to lose the ability to provide accurate recommendations for users. In general, we can roughly divide them into the following categories:

    Selection bias. Selection bias occurs when users are free to choose which items to rate, so the collected ratings are not a representative sample of all ratings[11]; intuitively, the data used in RSs are not missing at random. The research of Marlin et al.[12] fully demonstrated the presence of selection bias. Schnabel et al.[6] introduced inverse propensity scores (IPSs) to reweight the observed data. Steck et al.[13] designed a novel unbiased metric named ATOP to remedy selection bias. Considering that users can freely rate the items they like, Hernández et al.[8] proposed imputation-based methods that directly impute the missing entries with pseudolabels and then weight down the contribution of these missing ratings.

    Conformity bias. Conformity bias occurs because users tend to align themselves with others in a group, even when doing so goes against their own will[11]. In other words, one’s rating of a product may succumb to public opinion and therefore fail to reflect one’s real interest. Krishnan et al.[14] conducted comparative experiments suggesting that social influence bias is an important factor in RSs. Liu et al.[15] took conformity into account and directly used rating behavior as an important feature to quantify conformity bias. In addition, Tang et al.[16] and Hao et al.[17] leveraged social factors more directly to produce the final prediction and introduced specific parameters to eliminate the impact of conformity bias.

    Exposure bias. Exposure bias exists because users are only exposed to a small portion of the items provided[11]. It is readily comprehensible, since unobserved data can be attributed either to users not liking an item or simply to users not seeing it. One simple and straightforward strategy to address exposure bias is to treat all non-interaction data as negative samples and specify their confidence. Hu et al.[18] put forward the classic WMF model, in which the unobserved data are assigned lower weights. Similarly, Pan et al.[19] went a step further and specified the data confidence using examined user behavior. Another way to address exposure bias is to design an exposure-based probabilistic model that captures whether a user has been exposed to an item. Liang et al. introduced the EXMF model, which describes users’ activity through an exposure variable, and proposed a generative process for implicit feedback. Chen et al.[20] assumed that users frequently come into contact with specific communities and thus modeled the confidence weights with a community-based strategy.

    Position bias. Position bias emerges because users are inclined to interact with items at higher positions in the recommendation list[11]. When dealing with position bias, many researchers assume that a user’s click occurs if and only if the item is both relevant and examined by the user[21, 22]. Another conventional strategy is to apply a click model and its variants[23–25], which assume that users examine items from the top to the bottom of the recommendation list; click behavior then relies only on the relevance of the items displayed to users.

    Popularity bias. Popularity bias occurs when popular items are recommended even more often than their popularity would warrant[11]. Due to the common long-tail phenomenon in recommendation data, recommendation models usually pay more attention to popular items and hence give them higher scores than their real value merits. Kamishima et al.[26] used a mean-match regularizer in their models to offset popularity bias. Zheng et al.[27] applied counterfactual reasoning to address popularity bias, assuming that users’ click behavior depends mainly on their interest and on items’ popularity. Krishnan et al.[28] introduced adversarial learning in RSs to seek a trade-off between the popularity-biased reconstruction objective and the accuracy of long-tail item recommendations.

    In addition to the methods mentioned above, sampling is a promising debiasing method that has been studied by many scholars. Specifically, various sampling strategies can be used to determine which data are used to update the parameters, and how often, during model training. Steffen et al.[29] employed the uniform negative sampler as a basic strategy to improve efficiency. Several neural methods[30–32] also use negative sampling as an efficient way to increase model performance. In response to exposure bias, Yu et al.[33] proposed oversampling popular negative samples because they have a greater chance of having been exposed. Nevertheless, these samplers follow a predetermined sampling distribution, whereas the ideal distribution may change dynamically with the model’s state at each iteration, increasing the likelihood of generating low-quality samples. Dae et al.[34], Steffen et al.[35], and Ding et al.[36] proposed adaptive negative samplers that oversample “difficult” instances to increase learning efficiency. One disadvantage of these heuristic methods is that they fail to capture the real negative instances. To address this problem, Ding et al.[37] explored leveraging side information to enhance the performance of samplers, and Chen et al.[38] took advantage of social information in their sampling strategy.

    In this section, we first formulate the recommendation task, then expose the nature of bias from the viewpoint of risk discrepancy, and finally analyze the time and space complexity of AutoDebias to demonstrate the challenges faced in deploying this algorithm.

    Suppose we have a recommender system with a user set $U$ (containing $n$ users) and an item set $I$ (containing $m$ items). Let $r \in R$ be the users’ feedback on items, which can be binary (implicit feedback) or numerical ratings (explicit feedback). Let $D_T$ denote the historical user behavior, which can be seen as a set of triplets $\{(u_k, i_k, r_k)\}_{1 \leqslant k \leqslant |D_T|}$ generated from an unknown training distribution $p_T(u,i,r)$ over the space $U \times I \times R$. For convenience, we use $\delta(\cdot,\cdot)$ to represent a predefined error function, e.g., the MSE, cross-entropy, or hinge loss. The objective of the recommender system is to learn a proper function $f_\theta$ from $D_T$ that minimizes the following true risk:

    $L(f) = \mathbb{E}_{p_U(u,i,r)}[\delta(f(u,i), r)],$ (1)

    where $p_U(u,i,r)$ denotes the ideal unbiased data distribution for model evaluation. Regrettably, $p_U$ is not accessible in most cases. Instead, training is often performed on the training dataset $D_T$, which optimizes the following empirical risk:

    $\hat{L}_T(f) = \dfrac{1}{|D_T|}\displaystyle\sum\limits_{k=1}^{|D_T|}\delta(f(u_k,i_k),r_k).$ (2)

    According to PAC learning theory[39], if and only if $\hat{L}_T(f)$ is an unbiased estimator, i.e., $\mathbb{E}_{p_T}[\hat{L}_T(f)] = L(f)$, will the learned model be approximately optimal as the training data size increases. However, since the collected behavior data often contain biases, the training data distribution $p_T$ cannot be consistent with the unbiased test distribution $p_U$. In other words, unbiasedness fails to hold in many cases. Therefore, directly training a recommendation model on $D_T$ with Eq. (2), without considering the inherent biases, would easily lead to inferior results.

    To eliminate the distribution discrepancy, Ref. [10] put forward AutoDebias, a general debiasing framework that aims at addressing various biases. The main idea of AutoDebias is to transform the aforementioned empirical risk function into:

    $\hat{L}_T(f|\phi) = \dfrac{1}{|D_T|}\displaystyle\sum\limits_{k=1}^{|D_T|} w_k^{(1)}\delta(f(u_k,i_k),r_k) + \displaystyle\sum\limits_{u\in U,\, i\in I} w_{ui}^{(2)}\delta(f(u,i),m_{ui}),$ (3)

    where AutoDebias imputes a pseudolabel $m_{ui}$ for each user-item pair and gives a confidence weight $w_k^{(1)}$ to each training instance. When the set of hyperparameters $\phi \triangleq \{w^{(1)}, w^{(2)}, m\}$ is properly specified, $\hat{L}_T(f|\phi)$ can be an unbiased estimator, even if the training dataset $D_T$ contains any type of bias.

    AutoDebias also devises a meta-learning-based strategy to learn an appropriate $\phi$. Specifically, it first reparameterizes $\phi$ with a concise linear meta model:

    $w_k^{(1)} = \exp(\varphi_1^{\rm{T}}[{\boldsymbol e}_{u_k} \circ {\boldsymbol e}_{i_k} \circ {\boldsymbol e}_{r_k}]), \quad w_{ui}^{(2)} = \exp(\varphi_2^{\rm{T}}[{\boldsymbol e}_u \circ {\boldsymbol e}_i \circ {\boldsymbol e}_{O_{ui}}]), \quad m_{ui} = \sigma(\varphi_3^{\rm{T}}[{\boldsymbol e}_{r_{ui}} \circ {\boldsymbol e}_{O_{ui}}]),$ (4)

    where ${\boldsymbol e}_u$, ${\boldsymbol e}_i$, ${\boldsymbol e}_r$, and ${\boldsymbol e}_{O_{ui}}$ denote the one-hot vectors of the user $u$, item $i$, feedback $r$, and observation $O_{ui}$, respectively. This reparameterization reduces the number of parameters and encodes useful debiasing information. AutoDebias then leverages a meta-learning strategy to optimize $\phi$ by utilizing a small set of uniform data $D_U$. Specifically, it updates $\phi$ and $f_\theta$ in an alternating fashion: (ⅰ) make an assumed update of $\theta$ with $\theta'(\phi) = \theta - \eta_1 \nabla_\theta \hat{L}_T(f_\theta|\phi)$; (ⅱ) validate the performance of $\theta'$ on the uniform data with $\hat{L}_U(f_{\theta'}) = \dfrac{1}{|D_U|}\sum_{l=1}^{|D_U|}\delta(f_{\theta'}(u_l,i_l),r_l)$, which in turn gives a feedback signal (gradient) to update the meta model: $\varphi \leftarrow \varphi - \eta_2 \nabla_\varphi \hat{L}_U(f_{\theta'(\phi)})$; (ⅲ) given the updated $\phi$, actually update $\theta$: $\theta \leftarrow \theta - \eta_1 \nabla_\theta \hat{L}_T(f_\theta|\phi)$.
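    This alternating update can be condensed into a short PyTorch-style sketch. It is illustrative only, not the authors’ released implementation: `weighted_loss` (Eq. (3) evaluated on a biased batch with the meta model’s weights) and `loss_with_params` (the uniform-data loss evaluated at a given parameter list) are hypothetical helper methods.

```python
import torch

def autodebias_step(model, meta_model, opt_theta, opt_phi, batch_T, batch_U, eta1):
    """One alternating AutoDebias update (hypothetical interfaces)."""
    # (i) Assumed update: theta'(phi) = theta - eta1 * grad_theta L_T(f_theta | phi).
    loss_T = model.weighted_loss(batch_T, meta_model)      # Eq. (3) on the biased batch
    grads = torch.autograd.grad(loss_T, list(model.parameters()), create_graph=True)
    theta_prime = [p - eta1 * g for p, g in zip(model.parameters(), grads)]

    # (ii) Validate theta' on uniform data; because create_graph=True kept the
    # graph of step (i), the gradient of loss_U flows back into the meta model phi.
    loss_U = model.loss_with_params(batch_U, theta_prime)  # L_U(f_theta')
    opt_phi.zero_grad()
    loss_U.backward()
    opt_phi.step()                                         # phi <- phi - eta2 * grad_phi L_U

    # (iii) Actual update of theta under the refreshed phi.
    opt_theta.zero_grad()
    model.weighted_loss(batch_T, meta_model).backward()
    opt_theta.step()                                       # theta <- theta - eta1 * grad_theta L_T
```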

    Now, we conduct time complexity analyses of AutoDebias. The time complexity mainly comes from the following two parts:

    Calculation on observed training instances. As shown in the first part of Eq. (3), this calculation involves only the instances in $D_T$. As such, its complexity is linear in the number of observed training instances, i.e., $O(|D_T|)$.

    Calculation on imputed training instances. As shown in the second part of Eq. (3), this calculation runs over all user-item pairs, with complexity $O(nm)$. Since the numbers of users and items can easily scale to millions or more in practice, AutoDebias becomes computationally infeasible.

    Considering the sparse nature of real-world recommendation data, the second part becomes the bottleneck of AutoDebias. A naive solution for acceleration employs stochastic gradient descent with a uniform sampler, where the gradient can be quickly calculated on a sampled portion of the dataset. However, we find that this treatment significantly deteriorates model convergence and stability, leading to inferior recommendation performance. In fact, not all user-item pairs are equally important in model training: we observe fairly scattered $w^{(2)}$ values in practice. A uniform sampler easily draws uninformative training instances with small $w^{(2)}$, which contribute little to the training process. As such, we need a new sampler that can adaptively draw informative training instances.

    In this section, we present LightAD, which equips AutoDebias with a carefully designed adaptive sampling strategy. LightAD adopts an uneven sampling distribution and a fast factorization-based sampling strategy. We detail LightAD in the following parts.

    As mentioned before, the second part of Eq. (3) is the bottleneck of AutoDebias. Here we leverage a sampler to accelerate training, so that the gradient can be estimated from a much smaller sampled training dataset. Formally, learning with the sampler can be formulated as follows:

    $S \sim {\rm Multinomial}(N, p_s(u,i)),$ (5)

    which draws a small set $S$ of user-item pairs with sample size $N$ from the distribution $p_s(u,i)$. In this way, a recommendation model can be trained efficiently on the sampled instances, which can be expressed as:

    $\hat{L}_S(f|\phi) = \dfrac{1}{|D_T|}\displaystyle\sum\limits_{k=1}^{|D_T|} w_k^{(1)}\delta(f(u_k,i_k),r_k) + \dfrac{1}{|S|}\displaystyle\sum\limits_{(u,i)\in S}\delta(f(u,i),m_{ui}),$ (6)

    where the confidence weights $w_{ui}^{(2)}$ are offset by the sampling distribution. Now, the questions lie in which sampling distribution to use and how to implement fast sampling.

    Importance-aware sampling distribution. Note that different data may have different confidence weights $w^{(2)}$. This suggests that the sampling distribution is highly important, since it determines which data are used to update the parameters and how often. Intuitively, informative data with larger $w^{(2)}$ should be sampled with larger probability. The reason lies in the fact that they bring larger gradients and make a greater contribution to training, which speeds up model convergence. In fact, our theoretical analyses in Section 5.2 prove that sampling from the distribution $p_s(u,i) \propto w_{ui}^{(2)}$ can reduce the sampling variance.

    Fast factorization-based sampling. Now the question lies in how to perform fast sampling from the distribution $p_s(u,i) \propto w_{ui}^{(2)}$. Direct sampling from $p_s(u,i)$ would be highly time-consuming, as it involves a large instance space; moreover, the sampling distribution evolves as training proceeds. To address this problem, we propose a subtle fast factorization-based sampling algorithm. The merit of the algorithm lies in the decomposable nature of the hyperparameters $w_{ui}^{(2)}$. In fact, we have:

    $w_{ui}^{(2)} = \exp(\varphi_2^{\rm{T}}[{\boldsymbol e}_u \circ {\boldsymbol e}_i]) = \exp(\varphi_2^{\rm{T}}[{\boldsymbol e}_u])\cdot\exp(\varphi_2^{\rm{T}}[{\boldsymbol e}_i]) \triangleq w_u^{(2)} w_i^{(2)},$ (7)

    where ${\boldsymbol e}_u$ and ${\boldsymbol e}_i$ denote the one-hot vectors of user $u$ and item $i$. Thus $w_{ui}^{(2)}$ can be decomposed as the product of the user-based weight $w_u^{(2)}$ and the item-based weight $w_i^{(2)}$. As such, we can separate the sampling of users and items as follows:

    $S_u \sim {\rm Multinomial}(N, p_u), \quad S_i \sim {\rm Multinomial}(N, p_i),$ (8)

    where $p_u \propto w_u^{(2)}$ and $p_i \propto w_i^{(2)}$. In this way, the sampling probability of a user-item pair is proportional to $w_{ui}^{(2)}$, without requiring direct sampling from the large user-item space. As such, the sampling complexity is reduced from $O(nm)$ to $O(n+m)$.
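    A minimal sketch of this factorized sampler, assuming the per-user and per-item weight vectors have already been computed from $\varphi_2$ as in Eq. (7):

```python
import torch

def factorized_sample(w_user, w_item, num_samples):
    """Draw user-item pairs with probability proportional to w_u^(2) * w_i^(2).

    w_user: (n,) tensor of nonnegative user weights; w_item: (m,) item weights.
    Sampling users and items independently avoids materializing the O(n*m)
    joint distribution; torch.multinomial normalizes the weights internally.
    """
    users = torch.multinomial(w_user, num_samples, replacement=True)
    items = torch.multinomial(w_item, num_samples, replacement=True)
    return users, items  # pair (u, i) is drawn with p(u, i) ∝ w_user[u] * w_item[i]
```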

    Note that in the early training stage, the model may not yield reliable estimates of the confidence weights, so directly learning from the informative sampler may lead to suboptimal results. To address this problem, we propose a self-paced strategy that adopts a uniform distribution at the beginning and gradually alters the distribution as training proceeds. Formally, we adopt the following sampling strategy:

    $S_1 \sim {\rm Multinomial}(\alpha N, p_s(u,i)), \quad S_2 \sim {\rm Multinomial}((1-\alpha)N, U(u,i)),$ (9)

    where $U(u,i)$ denotes the uniform distribution and $\alpha$ controls the ratio of instances drawn from the informative sampler versus the uniform sampler; $\alpha$ gradually increases as training proceeds. In this work, we simply set $\alpha = \min(e/\beta, 1)$, where $e$ denotes the current epoch and $\beta$ controls the increase rate of $\alpha$, leaving other choices as future work. After obtaining $S_1$ and $S_2$, the model can be updated with the following objective function:

    $\hat{L}_P(f|\phi) = \dfrac{1}{|D_T|}\displaystyle\sum\limits_{k=1}^{|D_T|} w_k^{(1)}\delta(f(u_k,i_k),r_k) + \alpha\dfrac{\sum_{u,i}w_{ui}^{(2)}}{|S_1|}\displaystyle\sum\limits_{(u,i)\in S_1}\delta(f(u,i),m_{ui}) + (1-\alpha)\dfrac{nm}{|S_2|}\displaystyle\sum\limits_{(u,i)\in S_2} w_{ui}^{(2)}\delta(f(u,i),m_{ui}).$ (10)
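    The self-paced schedule and the two sampled imputation terms of Eq. (10) can be sketched as follows, with assumed names: `model.delta` stands for the error function $\delta$ and `m_fn` for the imputed pseudolabels $m_{ui}$.

```python
import torch

def self_paced_alpha(epoch, beta):
    # alpha = min(e / beta, 1): mostly uniform sampling early, informative later.
    return min(epoch / beta, 1.0)

def imputation_loss(model, m_fn, w_user, w_item, alpha, N, n, m):
    """Sampled estimate of the two imputation terms in Eq. (10) (assumed interfaces)."""
    # S1: informative sampler, p_s(u, i) ∝ w_u^(2) * w_i^(2), as in Eq. (8).
    n1 = max(int(alpha * N), 1)
    u1 = torch.multinomial(w_user, n1, replacement=True)
    i1 = torch.multinomial(w_item, n1, replacement=True)
    w_sum = w_user.sum() * w_item.sum()  # equals the sum of w_ui^(2) over all pairs
    loss_s1 = alpha * w_sum / n1 * model.delta(u1, i1, m_fn(u1, i1)).sum()

    # S2: uniform sampler, with each sampled term reweighted by w_ui^(2).
    n2 = max(N - n1, 1)
    u2 = torch.randint(0, n, (n2,))
    i2 = torch.randint(0, m, (n2,))
    w2 = w_user[u2] * w_item[i2]
    loss_s2 = (1 - alpha) * n * m / n2 * (w2 * model.delta(u2, i2, m_fn(u2, i2))).sum()
    return loss_s1 + loss_s2
```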

    The previous section elaborated on how LightAD accelerates AutoDebias. In this section, we conduct theoretical analyses to answer the following questions: Does unbiasedness still hold with our proposed sampling strategy? Does the proposed informative sampler reduce the sampling variance?

    It has been proven that the empirical risk function of AutoDebias can be an unbiased estimator of the true risk. We now connect the objective of LightAD (Eq. (10)) with that of AutoDebias, which reveals the unbiasedness of LightAD. In fact, we have:

    $\begin{aligned} \mathbb{E}_{S_1,S_2}[\hat{L}_P(f|\phi)] =\;& \dfrac{1}{|D_T|}\sum\limits_{k=1}^{|D_T|} w_k^{(1)}\delta(f(u_k,i_k),r_k) + \mathbb{E}_{S_1}\bigg[\alpha\dfrac{\sum_{u,i}w_{ui}^{(2)}}{|S_1|}\sum\limits_{(u,i)\in S_1}\delta(f(u,i),m_{ui})\bigg] + \mathbb{E}_{S_2}\bigg[(1-\alpha)\dfrac{nm}{|S_2|}\sum\limits_{(u,i)\in S_2} w_{ui}^{(2)}\delta(f(u,i),m_{ui})\bigg] \\ =\;& \dfrac{1}{|D_T|}\sum\limits_{k=1}^{|D_T|} w_k^{(1)}\delta(f(u_k,i_k),r_k) + \alpha\dfrac{\sum_{u,i}w_{ui}^{(2)}}{|S_1|}\,|S_1|\,\mathbb{E}_{(u,i)\sim p_s}[\delta(f(u,i),m_{ui})] + (1-\alpha)\dfrac{nm}{|S_2|}\,|S_2|\,\mathbb{E}_{(u,i)\sim U}[w_{ui}^{(2)}\delta(f(u,i),m_{ui})] \\ =\;& \dfrac{1}{|D_T|}\sum\limits_{k=1}^{|D_T|} w_k^{(1)}\delta(f(u_k,i_k),r_k) + \alpha\sum\limits_{u,i} w_{ui}^{(2)}\delta(f(u,i),m_{ui}) + (1-\alpha)\sum\limits_{u,i} w_{ui}^{(2)}\delta(f(u,i),m_{ui}) = \hat{L}_T(f|\phi). \end{aligned}$ (11)

    From the above derivation, we can safely conclude that the objective of LightAD is indeed an unbiased estimator of the AutoDebias objective, and hence of the ideal true risk.

    How does the conventional uniform sampler perform? Does the proposed informative sampler reduce the variance? In this section, we aim to answer these questions theoretically.

    Let us first formulate the variance of the objective estimated with a sampler. In fact, we have the following upper bound on the variance:

    ${\rm Var}_S[\hat{L}_S] = \dfrac{1}{|S|}{\rm Var}_{p_s}\bigg[\dfrac{w_{ui}^{(2)}}{p_s(u,i)}\delta(f(u,i),m_{ui})\bigg] \leqslant \dfrac{1}{|S|}\bigg(\sum\limits_{u,i}\dfrac{{\rm Var}_{u,i}[w_{ui}^{(2)}] + \mathbb{E}_{u,i}[w_{ui}^{(2)}]^2}{p_s(u,i)}\bigg)\times\mathbb{E}_{p_s}[\delta(f(u,i),m_{ui})]^2 - \dfrac{1}{|S|}\mathbb{E}_{p_s}[w_{ui}^{(2)}\delta(f(u,i),m_{ui})]^2.$ (12)

    For a uniform sampler, the variance of the objective is governed by the variance of the weights $w_{ui}^{(2)}$, which is unsatisfactory. As training proceeds, we observe a diverse distribution of $w_{ui}^{(2)}$ with large variance; the gradient of $\hat{L}_S$ under a uniform sampler therefore fluctuates heavily, making the training process highly unstable.

    To address this problem, we should choose a better sampling distribution to reduce the upper bound of the estimation variance. In fact, we have:

    $\displaystyle\sum\limits_{u,i}\dfrac{(w_{ui}^{(2)})^2}{p_s(u,i)} = \sum\limits_{u,i}\dfrac{(w_{ui}^{(2)})^2}{p_s(u,i)}\sum\limits_{u,i}p_s(u,i) \geqslant \bigg(\sum\limits_{u,i}\sqrt{\dfrac{(w_{ui}^{(2)})^2}{p_s(u,i)}}\sqrt{p_s(u,i)}\bigg)^2 = \bigg(\sum\limits_{u,i} w_{ui}^{(2)}\bigg)^2,$ (13)

    where we employ the Cauchy inequality, and equality holds if and only if $p_s(u,i) \propto w_{ui}^{(2)}$. This means that for the informative sampler, whose sampling distribution is proportional to the confidence weights, the estimation variance achieves its lowest upper bound. The estimation variance then depends only on the mean of the confidence weights and is independent of their variance. The informative sampler therefore brings better stability and convergence.
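    As a quick numerical sanity check (not one of the paper’s experiments), the following NumPy snippet compares the variance of the importance-sampled estimate of $\sum_{u,i} w_{ui}^{(2)}\delta_{ui}$ under a uniform sampler and under $p_s \propto w^{(2)}$, using synthetic weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, n_draws, trials = 10_000, 256, 2_000

w = rng.lognormal(sigma=1.5, size=n_pairs)    # scattered confidence weights w^(2)
delta = rng.uniform(0.0, 1.0, size=n_pairs)   # fixed per-pair error values

def estimate(p):
    # importance-sampling estimate of sum_{u,i} w * delta under sampler p
    idx = rng.choice(n_pairs, size=n_draws, p=p)
    return (w[idx] * delta[idx] / p[idx]).mean()

p_uniform = np.full(n_pairs, 1.0 / n_pairs)
p_informative = w / w.sum()                   # p_s ∝ w^(2), as in Section 4.1

var_uniform = np.var([estimate(p_uniform) for _ in range(trials)])
var_informative = np.var([estimate(p_informative) for _ in range(trials)])
print(var_uniform, var_informative)           # the informative estimate varies far less
```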

    In this section, we first describe the experimental settings and then present detailed experiments aimed at answering the following questions:

    RQ1: How does our LightAD perform compared with AutoDebias in recommendation performance?

    RQ2: How does the proposed LightAD accelerate AutoDebias?

    RQ3: How does the hyperparameter β affect the recommendation performance?

    RQ4: How does LightAD perform when dealing with the simultaneous presence of various biases?

    We conduct our experiments on two publicly accessible datasets and one synthetic dataset: Yahoo!R3, Coat, and Simulation. The statistics of the three datasets are summarized in Table 1.

    Table  1.  Statistics of the datasets.
    Dataset Users Items Training Validation Test
    Coat 290 300 6960 232 4176
    Yahoo!R3 15400 1000 311704 2700 48600
    Simulation 500 500 12500 3750 67500

    Yahoo!R3. This dataset contains ratings collected from two sources. The first source consists of ratings from users’ normal interactions with music in Yahoo! services (i.e., users pick and rate items according to their own preferences), which can be considered a stochastic logging policy; this type of data is thus riddled with bias. The second source consists of ratings on randomly selected songs collected during an online survey (i.e., a uniform logging policy); it can be approximately regarded as unbiased and as reflecting the real interests of users.

    Coat. This dataset was collected by Ref. [7] and simulates customers’ shopping behavior for coats in an online service. Users were first given a web-shop interface and asked to select the coat they most wanted to buy. Shortly afterwards, they were requested to rate 24 self-selected coats and 16 randomly picked ones on a five-point scale. Therefore, this dataset consists of self-selected ratings and uniformly selected ratings, which can be regarded as biased and unbiased data, respectively.

    Simulation. A great advantage of AutoDebias is that it can handle multiple biases simultaneously, which means good universality in complex scenarios, and we need to verify that this universality persists in our model. Since no public dataset satisfies such a scenario, we borrow from AutoDebias and adopt the synthetic dataset Simulation, in which both position bias and selection bias are present. Unlike the previous two datasets, this one contains the position information of items at the time of interaction. In short, this dataset also includes a large amount of biased data and a relatively small amount of unbiased data. The rationale for the data synthesis procedure can be found in Refs. [40, 41].

    To be consistent with AutoDebias, we follow its experimental setup for training, validation, and testing. Specifically, we use the biased data $D_T$ to train our base model and the uniform data $D_U$ to assist in debiasing and testing. Given the paucity of uniform data, we split $D_U$ into three parts: 5% ($D_{ut}$) to assist in training the hyperparameters, 5% ($D_{uv}$) for validation to tune the hyperparameters, and the remaining 90% ($D_{ue}$) for evaluating the model. Additionally, we binarize the ratings in the three datasets with threshold 3, i.e., we treat rating values larger than 3 as positive feedback and label them with $r=1$.
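    A sketch of this preprocessing, assuming the uniform data arrive as (user, item, rating) rows; labeling non-positive feedback as −1 is our assumption, since the text only specifies the positive label:

```python
import numpy as np

def binarize_and_split(uniform_data, seed=0):
    """Binarize D_U at threshold 3 and split it into 5% / 5% / 90% parts.

    uniform_data: int array of shape (N, 3) with (user, item, rating) rows.
    Returns (D_ut, D_uv, D_ue) for meta-training, validation, and evaluation.
    """
    data = uniform_data.copy()
    data[:, 2] = np.where(data[:, 2] > 3, 1, -1)  # r = 1 for positive feedback; -1 assumed otherwise
    idx = np.random.default_rng(seed).permutation(len(data))
    n_ut, n_uv = int(0.05 * len(data)), int(0.10 * len(data))
    return data[idx[:n_ut]], data[idx[n_ut:n_uv]], data[idx[n_uv:]]
```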

    Evaluation protocols. The evaluation metrics in this work are the negative logarithmic loss (NLL), the area under the ROC curve (AUC), and normalized discounted cumulative gain (NDCG):

    NLL. This metric evaluates the performance of the predictions:

    ${\rm NLL} = -\dfrac{1}{|D_{\rm ve}|}\displaystyle\sum\limits_{(u,i,r)\in D_{\rm ve}}\log(1+{\rm e}^{-r\cdot f_\theta(u,i)}),$ (14)

    where $D_{\rm ve}$ denotes the validation set $D_{uv}$ or the test set $D_{ue}$.

    AUC. This metric evaluates the performance of rankings:

    ${\rm AUC} = \dfrac{\displaystyle\sum\limits_{(u,i)\in D_{\rm ve}^+}\widehat{{\rm Rank}}_{u,i} - (|D_{\rm ve}^+|+1)|D_{\rm ve}^+|/2}{|D_{\rm ve}^+|\,(|D_{\rm ve}|-|D_{\rm ve}^+|)},$ (15)

    where $|D_{\rm ve}^+|$ denotes the number of positive instances in $D_{\rm ve}$, and $\widehat{{\rm Rank}}_{u,i}$ denotes the rank position of a positive feedback $(u,i)$.

    NDCG@k. This metric evaluates the quality of recommendation through discounted importance based on ranking position:

    \begin{array}{l} \text{DCG}_u@k = \displaystyle \sum\limits_{(u,i) \in D_{\text{ve}}} \dfrac{\mathbf{I}({\widehat {\text{Rank}}}_{u,i} \leqslant k)}{\log({\widehat {\text{Rank}}}_{u,i} + 1)}, \\ \text{NDCG}@k = \dfrac{1}{|\mathcal{U}|} \displaystyle \sum\limits_{u\in \mathcal{U}} \dfrac{\text{DCG}_u@k}{\text{IDCG}_u@k}, \end{array} (16)

    where \text{IDCG}_u@k is a normalizer that ensures NDCG is between 0 and 1.
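    A compact NumPy sketch of the three metrics, under two assumptions the text leaves open: labels take values in {−1, +1}, and the NDCG logarithm is base 2 (the paper writes an unqualified log). `nll` and `auc` operate on one evaluation set; `ndcg_at_k` scores a single user’s list.

```python
import numpy as np

def nll(scores, labels):
    # Eq. (14): negative logarithmic loss, labels in {-1, +1}
    return -np.mean(np.log1p(np.exp(-labels * scores)))

def auc(scores, labels):
    # Eq. (15): rank-based AUC; rank 1 is assigned to the lowest score
    ranks = scores.argsort().argsort() + 1
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def ndcg_at_k(scores, labels, k):
    # Eq. (16) for one user's list; I(rank <= k) / log2(rank + 1), log base 2 assumed
    rank = (-scores).argsort().argsort() + 1   # rank 1 is the highest score
    hits = (labels == 1) & (rank <= k)
    dcg = (1.0 / np.log2(rank[hits] + 1)).sum()
    ideal = np.arange(1, min(k, int((labels == 1).sum())) + 1)
    idcg = (1.0 / np.log2(ideal + 1)).sum()
    return dcg / idcg if idcg > 0 else 0.0
```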

    We implement our experiments with PyTorch, and the code is available at https://github.com/Chris-Mraz/LightAD. We perform a grid search on the validation set to tune the hyperparameters for all candidate methods.

    Baselines. To demonstrate the effectiveness of our proposed sampling method, we use matrix factorization (MF)[42] as the backbone and the following models as our baselines:

    Matrix factorization (MF)[42]. MF is a classical model in RS, in which the preference of user $u$ for item $i$ is formulated as $R_{u,i} = U_u^{\rm{T}}V_i + b_u + b_i + \mu$, with user and item bias terms $b_u$ and $b_i$ and global mean $\mu$. In our experimental setup, we train the MF model on the biased, uniform, and combined datasets, i.e., $D_{\rm{T}}$, $D_{\rm{U}}$, and $D_{\rm{T}} + D_{\rm{U}}$, respectively.

    Inverse propensity score (IPS)[6]. IPS is a counterfactual-based recommendation method that estimates the propensity score via naive Bayes.

    Doubly robust (DR)[43]. DR combines the ideas of IPS and data imputation. The debiasing capacity of this model is ensured if either the imputed data or the propensity scores are accurately estimated.

    CausE and KD-label. CausE and KD-label are strong methods that transfer unbiased information through a teacher model using knowledge distillation techniques. Both are the best-performing models in Ref. [5].

    AutoDebias. AutoDebias is a generic and effective solution for recommendation debiasing.

    We also test three versions of our LightAD for ablation studies: (ⅰ) LightAD-uniform, which equips AutoDebias with a uniform sampler; (ⅱ) LightAD-fixed, which removes the self-paced strategy and fixes $\alpha$ as a constant (i.e., $\alpha=0.5$); (ⅲ) LightAD-selfpaced, the complete model proposed in Section 4.2.

    Table 2 shows the performance of our LightAD and the baselines with respect to three evaluation metrics on two datasets. We make the following two observations:

    Table  2.  Comparison results on explicit feedback data. The boldface font denotes the winner in that column.
    Model Yahoo!R3 Coat
    NLL AUC NDCG@5 NLL AUC NDCG@5
    MF (biased) −0.587 0.727 0.547 −0.546 0.757 0.484
    MF (uniform) −0.503 0.568 0.435 −0.641 0.540 0.362
    MF (combined) −0.579 0.729 0.551 −0.538 0.750 0.492
    IPS −0.450 0.725 0.556 −0.531 0.755 0.484
    DR −0.444 0.723 0.552 −0.513 0.760 0.501
    CausE −0.585 0.729 0.553 −0.529 0.762 0.500
    KD-label −0.587 0.727 0.547 −0.546 0.757 0.484
    AutoDebias −0.419 0.748 0.648 −0.529 0.766 0.501
    LightAD-uniform −0.410 0.734 0.586 −0.505 0.710 0.473
    LightAD-fixed −0.406 0.737 0.632 −0.499 0.742 0.489
    LightAD-selfpaced −0.425 0.748 0.641 −0.510 0.765 0.517

    (Ⅰ) Overall, our LightAD-selfpaced outperforms the other baselines by a large margin. Compared with AutoDebias, our model achieves similar performance on both datasets and even exceeds it on some metrics. This can be attributed to the fact that AutoDebias gives the same attention to all samples during training, which hinders the capture of truly useful information. It is worth noting that the learning of the sampling strategy is conducted on only a very small number of data instances, i.e., of the same order of magnitude as the biased instances, which is far less than $O(nm)$.

    (Ⅱ) Among the sampling strategies, LightAD with simple uniform sampling obtains the worst performance and convergence. This is consistent with our intuition that uniform sampling suffers from high variance. The fact that LightAD-fixed outperforms LightAD-uniform shows that the informative sampler can partly address this problem. LightAD-selfpaced achieves the best performance, which validates the effectiveness of the proposed self-paced strategy.

    As mentioned in Section 3.3, the calculation on imputed training instances consumes enormous computing resources and restricts the practicality of AutoDebias. In this part, we test this notion through ablation experiments. Table 3 shows the time cost of the ablated methods with different components deleted. From the table, we can conclude that the components $\{w^{(1)}, w^{(2)}, m\}$ have different effects on the efficiency of our model: $w^{(1)}$ has much less impact on efficiency than the other two. This is consistent with our previous analysis and demonstrates the necessity of our sampling.

    Table  3.  Ablation study of time cost per training epoch on Yahoo!R3.
    Model w^{(1)} m w^{(2)} Epoch (s)
    MF × × × 0.02
    AutoDebias-w(1) √ × × 0.07
    AutoDebias-w(1)m √ √ × 0.92
    AutoDebias √ √ √ 1.3

    To verify the efficiency improvement of our methods, we conduct experiments along two dimensions: the time cost per epoch and the overall time cost. To ensure the reliability of the experiments, we ran them under identical conditions, i.e., the same hyperparameter configuration and computing resources. The training times of the four models are shown in Table 4. From this table, we find that employing samplers indeed accelerates the training of AutoDebias. More interestingly, the uniform sampler achieves fast calculation in each epoch but requires more time in total. We ascribe this to the fact that the informative sampler improves convergence and reduces the number of required epochs. To verify this, we plot the training processes of LightAD-fixed and LightAD-uniform. As shown in Fig. 1, LightAD-fixed converges in fewer than 20 epochs, while LightAD-uniform takes more than 40 epochs to reach a comparable level. The experimental results are consistent with our earlier theoretical analysis.

    Table  4.  Training time on Yahoo!R3.
    Model Epoch (s) Total time (s)
    AutoDebias 1.30 1260.50
    LightAD-uniform 0.14 74.10
    LightAD-fixed 0.34 107.36
    LightAD-selfpaced 0.20 29.05
    Figure  1.  Training process of LightAD-fixed and LightAD-uniform.

    Note that in LightAD-selfpaced, we gradually increase the contribution of the informative sampler, with the hyperparameter $\beta$ controlling the increase rate. In this subsection, we vary the value of $\beta$ and explore how it affects recommendation performance. The results are presented in Fig. 2. As $\beta$ becomes larger, with few exceptions, the performance first increases and then drops once $\beta$ surpasses a threshold. The reason is that a too small (or too large) $\beta$ makes the model focus on the informative sampler too early (or too late). Setting a proper $\beta$ allows the model to achieve the best performance.

    Figure  2.  The influence of hyperparameter β on self-paced sampling strategy.

    To demonstrate the universality of our LightAD, we conduct experiments on Simulation. We compare our three sampling strategies with two SOTA methods in mitigating both position bias and selection bias:

    Dual learning algorithm (DLA)[44]. DLA treats the learning of the propensity function and the ranking list as a dual problem and designs specific EM algorithms to learn the two models; it is a SOTA method for mitigating position bias.

    Heckman ensemble (HeckE)[45]. HeckE ascribes bias to the fact that users can only view a truncated list of recommendations. It adopts a two-step method: it first designs a Probit model to estimate the probability of an item being examined and then takes advantage of this probability to modify the model.

    Different from the models in Section 6.2, we add position information to the modeling of $w_k^{(1)}$ and leave the other experimental settings unchanged, i.e., $w_k^{(1)} = \exp (\varphi_1^{\rm{T}}[{\boldsymbol x_{{u_k}}} \circ {\boldsymbol x_{i_k}} \circ {{\boldsymbol{ e}}_{r_k} } \circ {{\boldsymbol{ e}}_{p_k}} ])$, where ${{\boldsymbol{ e}}_{p_k}}$ denotes the feature vector of the position information. Table 5 shows the performance of our LightAD and the baselines on Simulation. We find that LightAD vastly outperforms the SOTA methods, which verifies that LightAD can handle various biases and their combinations.

    Table  5.  Comparison results on simulation dataset.
    Model NLL AUC NDCG@5
    DLA −0.693 0.572 0.595
    HeckE −0.692 0.587 0.657
    LightAD-uniform −0.731 0.603 0.665
    LightAD-fixed −0.745 0.604 0.680
    LightAD-selfpaced −0.685 0.611 0.700

    In this work, we identify the inefficiency issue of AutoDebias and propose to accelerate it with a carefully designed sampling strategy. The sampler adaptively and dynamically draws informative training instances, which brings provably better convergence and stability than the standard uniform sampler. Extensive experiments on three benchmark datasets validate that our LightAD accelerates AutoDebias by several orders of magnitude while maintaining almost equal recommendation performance and universality.

    One promising research direction for future work is to explore pruning strategies for AutoDebias. AutoDebias involves a large number of debiasing parameters, and not all of them are necessary. Leveraging AutoML or the lottery ticket hypothesis to locate the important parameters while pruning the others would significantly reduce time and space costs, mitigate overfitting, and potentially boost recommendation performance.

    This work was supported by the National Natural Science Foundation of China (12171451) and Anhui Center for Applied Mathematics.

    The authors declare that they have no conflict of interest.


  • [1]
    Shao R, Lan X, Li J, et al. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2020 : 10015–10023.
    [2]
    Ouyang C, Chen C, Li S, et al. Causality-inspired single-source domain generalization for medical image segmentation. IEEE Transactions on Medical Imaging, 2023, 42: 1095–1106. DOI: 10.1109/TMI.2022.3224067
    [3]
    Guo L L, Pfohl S R, Fries J, et al. Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine. Scientific Reports, 2022, 12: 2726. DOI: 10.1038/s41598-022-06484-1
    [4]
    Motiian S, Piccirilli M, Adjeroh D A, et al. Unified deep supervised domain adaptation and generalization. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017 : 5716–5726.
    [5]
    Li H, Pan S J, Wang S, et al. Domain generalization with adversarial feature learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018 : 5400–5409.
    [6]
    Li D, Yang Y, Song Y Z, et al. Learning to generalize: Meta-learning for domain generalization. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32 (1). DOI: 10.1609/aaai.v32i1.11596
    [7]
    Mancini M, Bulò S R, Caputo B, et al. Robust place categorization with deep domain generalization. IEEE Robotics and Automation Letters, 2018, 3: 2093–2100. DOI: 10.1109/LRA.2018.2809700
    [8]
    Mancini M, Bulò S R, Caputo B, et al. Best sources forward: Domain generalization through source-specific nets. In: 2018 25th IEEE International Conference on Image Processing (ICIP). Athens, Greece: IEEE, 2018 : 1353–1357.
    [9]
    Segu M, Tonioni A, Tombari F. Batch normalization embeddings for deep domain generalization. Pattern Recognition, 2023, 135: 109115. DOI: 10.1016/j.patcog.2022.109115
    [10]
    Zhang H, Cisse M, Dauphin Y N, et al. Mixup: Beyond empirical risk minimization. arXiv: 1710.09412, 2017.
    [11]
    Ding Z, Fu Y. Deep domain generalization with structured low-rank constraint. IEEE Transactions on Image Processing, 2018, 27: 304–313. DOI: 10.1109/TIP.2017.2758199
    [12]
    Nalisnick E, Matsukawa A, Teh Y W, et al. Do deep generative models know what they don’t know? arXiv: 1810.09136, 2019 .
    [13]
    Li Y, Tian X, Gong M, et al. Deep domain generalization via conditional invariant adversarial networks. In: Ferrari V, Hebert M, Sminchisescu C, editors. Computer Vision–ECCV 2018. Cham: Springer, 2018 , 11219: 647–663.
    [14]
    Long M, Cao Y, Wang J, et al. Learning transferable features with deep adaptation networks. In: Proceedings of the 32nd International Conference on Machine Learning. New York: ACM, 2015 , 37: 97–105.
    [15]
    Ganin Y, Lempitsky V. Unsupervised domain adaptation by backpropagation. In: Proceedings of the 32nd International Conference on Machine Learning. New York: ACM, 2015 , 37: 1180–1189.
    [16]
    Bousmalis K, Silberman N, Dohan D, et al. Unsupervised pixel-level domain adaptation with generative adversarial networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017 : 95–104.
    [17]
    Xu R, Chen Z, Zuo W, et al. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018 : 3964–3973.
    [18]
    Zhou K, Yang Y, Hospedales T, et al. Learning to generate novel domains for domain generalization. In: Vedaldi A, Bischof H, Brox T, editors. Computer Vision–ECCV 2020. Cham: Springer, 2020 , 12361: 561–578.
    [19]
    Bai H, Sun R, Hong L, et al. DecAug: Out-of-distribution generalization via decomposed feature representation and semantic augmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35: 6705–6713. DOI: 10.1609/aaai.v35i8.16829
    [20]
    Chattopadhyay P, Balaji Y, Hoffman J. Learning to balance specificity and invariance for in and out of domain generalization. In: European Conference on Computer Vision. Cham: Springer, 2020 : 301–318.
    [21]
    Peng X, Bai Q, Xia X, et al. Moment matching for multi-source domain adaptation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, 2020 : 1406–1415.
    [22]
    Li D, Zhang J, Yang Y, et al. Episodic training for domain generalization. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, 2020 : 1446–1455.
    [23]
    Seo S, Suh Y, Kim D, et al. Learning to optimize domain specific normalization for domain generalization. In: Vedaldi A, Bischof H, Brox T, editors. Computer Vision–ECCV 2020. Cham: Springer, 2020: 68–83.
    [24]
    Zhou K, Yang Y, Qiao Y, et al. Domain generalization with MixStyle. arXiv: 2104.02008, 2021.
    [25]
    Cai Q, Wang Y, Pan Y, et al. Joint contrastive learning with infinite possibilities. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. New York: ACM, 2020 : 12638–12648.
    [26]
    Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning. New York: ACM, 2020 : 1597–1607.
    [27]
    Mitrovic J, McWilliams B, Walker J, et al. Representation learning via invariant causal mechanisms. arXiv: 2010.07922, 2020 .
    [28]
    He K, Fan H, Wu Y, et al. Momentum contrast for unsupervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020 : 9726–9735.
    [29]
    Wu Z, Xiong Y, Yu S X, et al. Unsupervised feature learning via non-parametric instance discrimination. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018 : 3733–3742.
    [30]
    Yao T, Zhang Y, Qiu Z, et al. SeCo: Exploring sequence supervision for unsupervised representation learning. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35: 10656–10664. DOI: 10.1609/aaai.v35i12.17274
    [31]
    Li D, Yang Y, Song Y Z, et al. Deeper, broader and artier domain generalization. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017 : 5543–5551.
    [32]
    Carlucci F M, D'Innocente A, Bucci S, et al. Domain generalization by solving jigsaw puzzles. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2020 : 2224–2233.
    [33]
    Matsuura T, Harada T. Domain generalization using a mixture of multiple latent domains. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 11749–11756. DOI: 10.1609/aaai.v34i07.6846
    [34]
    Mahajan D, Tople S, Sharma A. Domain generalization using causal matching. In: Proceedings of the 38th International Conference on Machine Learning. Volume 139 of Proceedings of Machine Learning Research. PMLR, 2021 , 139: 7313–7324.
    [35]
    Nam H, Lee H, Park J, et al. Reducing domain gap by reducing style bias. arXiv: 1910.11645, 2019.
    [36]
    Zhou K, Yang Y, Hospedales T, et al. Deep domain-adversarial image generation for domain generalisation. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 13025–13032. DOI: 10.1609/aaai.v34i07.7003
    [37]
    Balaji Y, Sankaranarayanan S, Chellappa R. MetaReg: towards domain generalization using meta-regularization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. New York: ACM, 2018 : 1006–1016.
    [38]
    Chattopadhyay P, Balaji Y, Hoffman J. Learning to balance specificity and invariance for in and out of domain generalization. In: Vedaldi A, Bischof H, Brox T, editors. Computer Vision–ECCV 2020. Cham: Springer, 2020 : 301–318.


    Figure  1.   Illustration of our proposed method.

    Figure  2.   A causal model underlying our proposed method. y : label variable; f_y : label feature; d : domain variable; f_d : domain feature; z : hidden variable; f_z : hidden feature; x : sample; f_x : sample feature; f(x) : domain invariable feature (extracted from CNN); \hat d : new domain variable; f_{\hat d} : new domain feature; f_{x,\hat d} : new feature generated by f(x) and f_{\hat d} . Solid arrow: causal relationship; dotted arrow: prediction.

