Domain Invariant Representation Learning with Domain Density Transformations
A. Tuan Nguyen, Toan Tran, Yarin Gal, Atılım Güneş Baydin, arXiv:2102.05082
PR-320, Presented by Eddie
Domain Invariant Representation Learning with Domain Density Transformations
1. Domain Generalization
2. Domain Generalization vs. Domain Adaptation
Domain generalization is a learning approach that aims to build a model that does not depend on any particular domain, so that it can cope with previously unseen (out-of-distribution) domains.
The difference is that Domain Adaptation can extract information from unlabeled data of the target domain, whereas Domain Generalization cannot.
Example: train on painting data from the Baroque period, then test on painting data from the Modern period. The model trained on Baroque paintings recognizes "Caravaggio"; what it predicts on Modern-period paintings is unknown.
Thanh-Dat Truong, et al., Recognition in Unseen Domains: Domain Generalization via Universal Non-volume Preserving Models
"It is a challenging task because predictions must be made without any information about the target domain."
3. [Domain Invariance] Marginal and Conditional Alignment

Definition 1 (Marginal Distribution Alignment). The representation z is said to satisfy the marginal distribution alignment condition if p(z|d) is invariant w.r.t. d.

Definition 2 (Conditional Distribution Alignment). The representation z is said to satisfy the conditional distribution alignment condition if p(y|z,d) is invariant w.r.t. d.

4. Proposed Method
The proposed learning objective over the source domains (Eq. 13 of the paper):

E_{d,d'∈Ds, p(x,y|d)} [ l(y, gθ(x)) + ||gθ(x) − gθ(f_{d,d'}(x))||²₂ ]   (13)

Notation:
- D: the set of possible domains; Ds = {d1, d2, ..., dK} ⊂ D: the set of K source domains; d, d' ∈ Ds: a domain and another domain
- X: the data space (x ∈ X); Y: the label space (y ∈ Y); Z: the representation space (z ∈ Z)
- gθ: X → Z: the function that maps an input x to a representation z (domain representation function)
- f_{d,d'}: X → X: the function that transforms an input x from domain d to domain d' (density transformation function)
- l(y, gθ(x)): the loss between the prediction made from gθ(x) and the label y
- ||gθ(x) − gθ(f_{d,d'}(x))||²₂: the distance between the representation of x and the representation of x after the domain transformation
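To make the objective concrete, below is a minimal PyTorch-style sketch of one training step for Eq. 13 (not code from the paper). The representation network g_theta, the classifier head h, the dictionary f of pre-trained per-pair transformation functions (e.g., a frozen StarGAN generator, see Sections 4-5), and the weight lambda_inv are all assumed names; lambda_inv = 1 recovers the equal weighting written in Eq. 13.

```python
import random
import torch
import torch.nn.functional as F

def dirgan_step(g_theta, h, f, batch, domains, optimizer, lambda_inv=1.0):
    """One training step of the Eq. 13 objective (sketch, under assumed interfaces).

    g_theta : nn.Module mapping images x -> representation z
    h       : nn.Module mapping z -> class logits
    f       : dict {(d, d_prime): callable} transforming x from domain d to d_prime
              (e.g., a frozen StarGAN generator G(., d, d_prime))
    batch   : (x, y, d) with x, y tensors and d the domain index of this minibatch
              (one domain per minibatch is an assumption for simplicity)
    """
    x, y, d = batch
    d_prime = random.choice([dom for dom in domains if dom != d])   # another source domain d'

    z = g_theta(x)                        # representation of the original image
    with torch.no_grad():
        x_t = f[(d, d_prime)](x)          # transform x from domain d to d'
    z_t = g_theta(x_t)                    # representation of the transformed image

    pred_loss = F.cross_entropy(h(z), y)              # l(y, g_theta(x))
    inv_loss = (z - z_t).pow(2).sum(dim=1).mean()     # ||g_theta(x) - g_theta(f_{d,d'}(x))||_2^2

    loss = pred_loss + lambda_inv * inv_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```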
Domain Invariant Representation Learning with Domain Density Transformations
A. Tuan Nguyen¹, Toan Tran², Yarin Gal¹, Atılım Güneş Baydin¹
Abstract
Domain generalization refers to the problem where we aim to train a model on data from a set of source domains so that the model can generalize to unseen target domains. Naively training a model on the aggregate set of data (pooled from all source domains) has been shown to perform suboptimally, since the information learned by that model might be domain-specific and generalize imperfectly to target domains. To tackle this problem, a predominant approach is to find and learn some domain-invariant information in order to use it for the prediction task. In this paper, we propose a theoretically grounded method to learn a domain-invariant representation by enforcing the representation network to be invariant under all transformation functions among domains. We also show how to use generative adversarial networks to learn such domain transformations to implement our method in practice. We demonstrate the effectiveness of our method on several widely used datasets for the domain generalization problem, on all of which we achieve competitive results with state-of-the-art models.
1. Introduction
Domain generalization refers to the machine learning scenario […]

[The main difference between domain generaliza]tion (Khosla et al., 2012; Muandet et al., 2013; Ghifary et al., 2015) and domain adaptation (Zhao et al., 2019; Zhang et al., 2019; Combes et al., 2020; Tanwani, 2020) is that, in domain generalization, the learner does not have access to (even a small amount of) data of the target domain, making the problem much more challenging.

One of the most common domain generalization approaches is to learn an invariant representation across domains, aiming at a good generalization performance on target domains. [Marginal] alignment refers to making the representation distribution p(z) the same across domains. This is essential since if p(z) for the target domain is different from that of the source domains, the classification network h(z) would face out-of-distribution data because the representation z it receives as input at test time would be different from the ones it was trained with in the source domains. Conditional alignment refers to aligning p(y|z), the conditional distribution of the label given the representation, since if this conditional for the target domain is different from that of the source domains, the classification network (trained on the source domains) would give inaccurate predictions at test time. The formal definition of the two alignments is discussed in Section 3.

Figure 1. An example of two domains. For each domain, x is uniformly distributed on the outer circle (radius 2 for domain 1 and radius 3 for domain 2), with the color indicating class label y. After the transformation z = x/||x||₂, the marginal of z is aligned (uniformly distributed on the unit circle for both domains), but the conditional p(y|z) is not aligned. Thus, using this representation for predicting y would not generalize well across domains.
In the representation learning framework, the prediction y = f(x), where x is data and y is a label, is obtained as a composition y = h ∘ g(x) of a deep representation network z = g(x), where z is a learned representation of data x, and a smaller classifier y = h(z), predicting label y given representation z, both of which are shared across domains. Current "domain-invariance"-based methods in domain generalization focus on either the marginal distribution alignment (Muandet et al., 2013) or the conditional distribution alignment (Li et al., 2018b;c), which are still prone to distri- […]

¹University of Oxford, ²VinAI Research. Correspondence to: A. Tuan Nguyen <tuan.nguyen@cs.ox.ac.uk>.
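The Figure 1 example can be checked numerically. The sketch below (not from the paper) samples the two circular domains, applies z = x/||x||₂, and verifies that the marginal of z matches across domains while p(y|z) does not; the radii follow the caption, but the per-domain labeling rules are assumptions chosen only so that the classes occupy different arcs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def sample_domain(radius, label_fn):
    """Sample points uniformly on a circle of the given radius and label them."""
    theta = rng.uniform(0, 2 * np.pi, size=n)
    x = radius * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    return x, label_fn(theta)

# Assumed labeling rules (the paper only says the two domains color the circle differently).
x1, y1 = sample_domain(2.0, lambda t: (t < np.pi).astype(int))                                 # domain 1
x2, y2 = sample_domain(3.0, lambda t: ((t + np.pi / 2) % (2 * np.pi) < np.pi).astype(int))     # domain 2

# Representation z = x / ||x||_2 maps both domains onto the unit circle.
z1 = x1 / np.linalg.norm(x1, axis=1, keepdims=True)
z2 = x2 / np.linalg.norm(x2, axis=1, keepdims=True)

# Marginal alignment: the angular distribution of z is (approximately) the same per domain.
angles1, angles2 = np.arctan2(z1[:, 1], z1[:, 0]), np.arctan2(z2[:, 1], z2[:, 0])
print(np.histogram(angles1, bins=8, range=(-np.pi, np.pi))[0] / n)   # each bin ~ 0.125
print(np.histogram(angles2, bins=8, range=(-np.pi, np.pi))[0] / n)   # each bin ~ 0.125

# Conditional misalignment: at the same region of z, p(y|z) differs between the domains.
mask1 = np.abs(angles1 - 3 * np.pi / 4) < 0.3    # z near angle 3π/4
mask2 = np.abs(angles2 - 3 * np.pi / 4) < 0.3
print("p(y=1 | z near 3π/4, domain 1) ≈", y1[mask1].mean())   # ≈ 1.0
print("p(y=1 | z near 3π/4, domain 2) ≈", y2[mask2].mean())   # ≈ 0.0
```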
Q1) Does a domain-invariant representation function exist?
Q2) Is the representation gθ(x) satisfying gθ(x) − gθ(f_{d,d'}(x)) = 0 for all d, d' actually domain-invariant?
Theorem 1: Does a domain-invariant representation function exist?

Theorem 1. The invariance of p(y|d) across domains is the necessary and sufficient condition for the existence of a domain-invariant representation (that aligns both the marginal and conditional distribution).

In other words, "p(y|d) is invariant across domains" and "a domain-invariant representation (function) exists" are equivalent.

("a domain-invariant representation (function) exists" ⇒ "p(y|d) is invariant across domains"): if a representation z aligns both the marginal and conditional distributions, then

p(y,z|d) = p(y|z,d) p(z|d) = p(y|z,d') p(z|d') = p(y,z|d')
∴ p(y|d) = p(y|d')   (∵ marginalizing over z)
Since A ⇔ B means A ⇒ B and B ⇒ A, both directions must hold, with A = "a domain-invariant representation (function) exists" and B = "p(y|d) is invariant across domains".

(B ⇒ A): If p(y|d) is unchanged w.r.t. the domain d, then we can always find a domain-invariant representation (this is trivial). For example, take p(z|x) = δ₀(z|x) for the deterministic case (a representation that maps all x to 0), or p(z|x) = N(z; 0, 1) for the probabilistic case.

(A ⇒ B): "A domain-invariant representation (function) exists" means that there exists a representation z satisfying both the marginal and conditional distribution alignment, and, as shown above, such a z implies p(y|d) = p(y|d').
Theorem 2: Is the representation satisfying gθ(x) − gθ(f_{d,d'}(x)) = 0 domain-invariant?

Theorem 2. Given an invertible and differentiable function f_{d,d'} that transforms the data density from domain d to domain d' (with the inverse f_{d',d} that transforms the data density from d' to d), as described above. Assuming that the representation z satisfies p(z|f_{d,d'}(x)) = p(z|x) ∀x, then it aligns both the marginal and conditional of the data distribution for domains d and d' (i.e., it satisfies both Marginal Alignment and Conditional Alignment).

Figure 3. Domain density transformation. If we know the function f_{1,2} that transforms the data density from domain 1 to domain 2, we can learn a domain-invariant representation network gθ(x) by enforcing it to be invariant under f_{1,2}, i.e., gθ(x1) = gθ(x2) for any x2 = f_{1,2}(x1).
Change of variables (Eqs. 4 and 6 of the paper): with x' = f_{d,d'}(x),
p(x'|d') = p(x|d) |det J_{f_{d,d'}}(x)|⁻¹   (Eq. 6)   and   p(x'|y,d') = p(x|y,d) |det J_{f_{d,d'}}(x)|⁻¹   (Eq. 4),
where J_{f_{d,d'}}(x) is the Jacobian matrix of the function f_{d,d'} evaluated at x. The assumed invariance condition on the representation (Eq. 7) is p(z|f_{d,d'}(x)) = p(z|x) ∀x.

i) Marginal alignment: ∀z we have
p(z|d) = ∫ p(x|d) p(z|x) dx
= ∫ p(f_{d',d}(x')|d) p(z|f_{d',d}(x')) |det J_{f_{d',d}}(x')| dx'   (by applying variable substitution in the multiple integral: x = f_{d',d}(x'))
= ∫ p(x'|d') |det J_{f_{d',d}}(x')|⁻¹ p(z|x') |det J_{f_{d',d}}(x')| dx'   (since p(f_{d',d}(x')|d) = p(x'|d') |det J_{f_{d',d}}(x')|⁻¹ due to Eq. 6, and p(z|f_{d',d}(x')) = p(z|x') due to the definition of z in Eq. 7)
= ∫ p(x'|d') p(z|x') dx' = p(z|d')   (8)

ii) Conditional alignment: ∀z, y we have
p(z|y,d) = ∫ p(x|y,d) p(z|x) dx
= ∫ p(f_{d',d}(x')|y,d) p(z|f_{d',d}(x')) |det J_{f_{d',d}}(x')| dx'
= ∫ p(x'|y,d') |det J_{f_{d',d}}(x')|⁻¹ p(z|x') |det J_{f_{d',d}}(x')| dx'   (since p(f_{d',d}(x')|y,d) = p(x'|y,d') |det J_{f_{d',d}}(x')|⁻¹ due to Eq. 4, and p(z|f_{d',d}(x')) = p(z|x') due to Eq. 7)
= ∫ p(x'|y,d') p(z|x') dx' = p(z|y,d')   (9)

Note that
p(y|z,d) = p(y,z|d) / p(z|d) = p(y|d) p(z|y,d) / p(z|d)   (10)

Since p(y|d) = p(y) = p(y|d'), p(z|y,d) = p(z|y,d') and p(z|d) = p(z|d'), we have
p(y|z,d) = p(y|d') p(z|y,d') / p(z|d') = p(y|z,d')   (11)
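As a quick numerical illustration of Theorem 2 (not from the paper), the sketch below builds a 1-D toy pair of domains related by an invertible map f_{1,2}(x) = 2x + 3 (which pushes N(0, 1) onto N(3, 4)), picks a representation g that is exactly invariant under that map, and checks by Monte Carlo that the marginal of z = g(x) is aligned across the two domains. The specific f and g are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def f_12(x):
    """Invertible map pushing the domain-1 density N(0, 1) onto the domain-2 density N(3, 4)."""
    return 2.0 * x + 3.0

def g(x):
    """A representation that is invariant under f_12: g(f_12(x)) == g(x) for all x != -3."""
    return np.log2(np.abs(x + 3.0)) % 1.0

x1 = rng.normal(0.0, 1.0, size=n)    # samples from domain 1
x2 = rng.normal(3.0, 2.0, size=n)    # samples from domain 2 (= the f_12 pushforward of domain 1)

# Invariance condition of Theorem 2, checked with a circular distance on [0, 1).
diff = np.abs(g(f_12(x1)) - g(x1))
diff = np.minimum(diff, 1.0 - diff)
print(diff.max())                    # tiny: invariance holds up to floating-point rounding

# Marginal alignment: the distribution of z = g(x) is the same in both domains.
h1, _ = np.histogram(g(x1), bins=20, range=(0, 1), density=True)
h2, _ = np.histogram(g(x2), bins=20, range=(0, 1), density=True)
print(np.max(np.abs(h1 - h2)))       # small (Monte Carlo error only)
```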
This theorem indicates that, if we can find the functions f's that transform the data densities among the domains, we can learn a domain-invariant representation z by encouraging the representation to be invariant under all the transformations f's. This idea is illustrated in Figure 3. We therefore can use the following learning objective to learn a domain-invariant representation z = gθ(x):

E_d [ E_{p(x,y|d)} [ l(y, gθ(x)) + E_{d'} [ ||gθ(x) − gθ(f_{d,d'}(x))||²₂ ] ] ]   (12)

where l(y, gθ(x)) is the prediction loss of a network that predicts y given z = gθ(x), and the second term is to enforce the invariance condition in Eq. 7.
Assume that we have a set of K source domains Ds = {d1, d2, ..., dK}; the objective function in Eq. 12 becomes:

E_{d,d'∈Ds, p(x,y|d)} [ l(y, gθ(x)) + ||gθ(x) − gθ(f_{d,d'}(x))||²₂ ]   (13)
In the next section, we show how one can incorporate this
idea into real-world domain generalization problems with
generative adversarial networks.
4. Domain Generalization with Generative Adversarial Networks
In practice, we will learn the functions f's that transform the data distributions between domains, and one can use several generative modeling frameworks, e.g., normalizing flows (Grover et al., 2020) or GANs (Zhu et al., 2017; Choi et al., 2018; 2020), to learn such functions. One advantage of normalizing flows is that this transformation is naturally invertible by design of the neural network. In addition, the determinant of the Jacobian of that transformation can be efficiently computed. However, due to the fact that we do not need access to the Jacobian when the training process of the generative model is completed, we propose the use of GANs to inherit its rich network capacity. In particular, we use the StarGAN (Choi et al., 2018) model, which is designed for image domain transformations.

The goal of StarGAN is to learn a unified network G that transforms the data density among multiple domains. In particular, the network G(x, d, d') (i.e., G is conditioned on the image x and the two different domains d, d') transforms an image x from domain d to domain d'. Different from the original StarGAN model that only takes the image x and the desired destination domain d' as its input, in our implementation, we feed both the original domain d and the desired destination domain d' together with the original image x to the generator G.

The generator's goal is to fool a discriminator D into thinking that the transformed image belongs to the destination domain d'. In other words, the equilibrium state of StarGAN, in which G completely fools D, is when G successfully transforms the data density of the original domain to that of the destination domain. After training, we use G(., d, d') as the function f_{d,d'}(.) described in the previous section and perform the representation learning via the objective function in Eq. 13.

Three important loss functions of the StarGAN architecture are:

• Domain classification loss Lcls that encourages the generator G to generate images that correctly belong to the desired destination domain d'.
• The adversarial loss Ladv that is the classification loss of a discriminator D that tries to distinguish between real images and the fake images generated by G. The equilibrium state of StarGAN is when G completely fools D, which means the distribution of the generated images (via G(x, d, d'), x ∼ p(x|d)) becomes the distribution of the real images of the destination domain p(x'|d'). This is our objective, i.e., to learn a function that transforms domains' densities.

• Reconstruction loss Lrec = E_{x,d,d'}[||x − G(x', d', d)||₁], where x' = G(x, d, d'), to ensure that the transformations preserve the image's content. Note that this also aligns with our interest since we want G(., d', d) to be the inverse of G(., d, d'), which will minimize Lrec to zero.

We can enforce the generator G to transform the data distribution within the class y (e.g., p(x|y, d) to p(x'|y, d') ∀y) by sampling each minibatch with data from the same class y, so that the discriminator will distinguish the transformed images from the real images of class y and domain d'. However, we found that this constraint can be relaxed in practice, and the generator almost always transforms the image within the original class y.
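A compact sketch of the three StarGAN-style losses as used here, with G conditioned on both the source and the destination domain (not the authors' implementation). The discriminator interface, the binary-cross-entropy form of the adversarial term, and the λ weights are assumptions; λ_cls = 1 and λ_rec = 10 follow the StarGAN defaults quoted in the excerpt below.

```python
import torch
import torch.nn.functional as F

def stargan_losses(G, D, x, d, d_prime, lambda_cls=1.0, lambda_rec=10.0):
    """Sketch of the three losses used to train the domain-translation network G.

    D(x) is assumed to return (src_logit, domain_logits): a real/fake score and a
    classification over domain labels; d and d_prime are LongTensors of domain indices.
    """
    x_fake = G(x, d, d_prime)             # translate x from domain d to d'
    x_rec = G(x_fake, d_prime, d)         # translate back: should reconstruct x

    src_real, cls_real = D(x)
    src_fake, cls_fake = D(x_fake)

    # Adversarial loss (non-saturating BCE form, an assumption).
    ones, zeros = torch.ones_like(src_real), torch.zeros_like(src_fake)
    d_adv = F.binary_cross_entropy_with_logits(src_real, ones) + \
            F.binary_cross_entropy_with_logits(src_fake.detach(), zeros)
    g_adv = F.binary_cross_entropy_with_logits(src_fake, ones)

    # Domain classification loss: D classifies real images to d; G makes fakes classified as d'.
    d_cls = F.cross_entropy(cls_real, d)
    g_cls = F.cross_entropy(cls_fake, d_prime)

    # Reconstruction (cycle) loss: G(., d', d) should invert G(., d, d').
    g_rec = (x - x_rec).abs().mean()

    loss_D = d_adv + lambda_cls * d_cls
    loss_G = g_adv + lambda_cls * g_cls + lambda_rec * g_rec
    return loss_D, loss_G
```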
5. Domain Generalization with Generative Adversarial Networks (StarGAN; PR-152)
[Figure from the StarGAN paper: comparison between cross-domain models and StarGAN. (a) To handle multiple domains, cross-domain models must be built for every pair of image domains. (b) StarGAN is capable of learning mappings among multiple domains using a single generator; the figure represents a star topology connecting multi-domains.]

Figure 3 of StarGAN. Overview of StarGAN, consisting of two modules, a discriminator D and a generator G. (a) D learns to distinguish between real and fake images and classify the real images to its corresponding domain. (b) G takes in as input both the image and target domain label and generates a fake image. The target domain label is spatially replicated and concatenated with the input image. (c) G tries to reconstruct the original image from the fake image given the original domain label. (d) G tries to generate images indistinguishable from real images and classifiable as target domain by D.

[Related work, excerpted from the StarGAN paper:]

Conditional GANs. [Prior studies have pro]vided both the discriminator and generator with class information in order to generate samples conditioned on the class [20, 21, 22]. Other recent approaches focused on generating particular images highly relevant to a given text description [25, 30]. The idea of conditional image generation has also been successfully applied to domain transfer [9, 28], super-resolution imaging [14], and photo editing [2, 27]. In this paper, we propose a scalable GAN framework that can flexibly steer the image translation to various target domains, by providing conditional domain information.

Image-to-Image Translation. Recent work have achieved impressive results in image-to-image translation [7, 9, 17, 33]. For instance, pix2pix [7] learns this task in a supervised manner using cGANs [20]. It combines an adversarial loss with a L1 loss, thus requires paired data samples. To alleviate the problem of obtaining data pairs, un- […] distribution of images in cross domains. CycleGAN [33] and DiscoGAN [9] preserve key attributes between the input and the translated image by utilizing a cycle consistency loss. However, all these frameworks are only capable of learning the relations between two different domains at a time. Their approaches have limited scalability in handling multiple domains since different models should be trained for each pair of domains. Unlike the aforementioned approaches, our framework can learn the relations among multiple domains using only a single model.

3. Star Generative Adversarial Networks
We first describe our proposed StarGAN, a framework to address multi-domain image-to-image translation within a single dataset. Then, we discuss how StarGAN incorporates multiple datasets containing different label sets to flexibly perform image translations using any of these labels.

3.1. Multi-Domain Image-to-Image Translation
Our goal is to train a single generator G that learns mappings among multiple domains. To achieve this, we train G to translate an input image x into an output image y conditioned on the target domain label c, G(x, c) → y. We randomly generate the target domain label c so that G learns to flexibly translate the input image. We also introduce an auxiliary classifier [22] that allows a single discriminator to control multiple domains. That is, our discriminator pro[duces probability distributions over both sources and domain labels].
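To make the conditioning concrete, here is a minimal sketch (an assumption, not the StarGAN or DIR-GAN code) of a generator that takes both the source and the destination domain, following the recipe of spatially replicating the one-hot domain labels and depth-wise concatenating them with the input image, as in the StarGAN overview above. The convolutional stack and channel sizes are placeholders.

```python
import torch
import torch.nn as nn

class DomainConditionedGenerator(nn.Module):
    """Minimal sketch of a generator G(x, d, d') conditioned on source and destination domains."""

    def __init__(self, num_domains, img_channels=3):
        super().__init__()
        in_ch = img_channels + 2 * num_domains          # image + one-hot(d) + one-hot(d')
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, img_channels, 3, padding=1), nn.Tanh(),
        )
        self.num_domains = num_domains

    def forward(self, x, d, d_prime):
        b, _, h, w = x.shape
        onehot = torch.eye(self.num_domains, device=x.device)
        # Spatially replicate the one-hot domain labels and concatenate them with the image.
        d_map = onehot[d].view(b, -1, 1, 1).expand(b, self.num_domains, h, w)
        dp_map = onehot[d_prime].view(b, -1, 1, 1).expand(b, self.num_domains, h, w)
        return self.net(torch.cat([x, d_map, dp_map], dim=1))
```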
Adversarial Loss. To make the generated images indistinguishable from real images, we adopt an adversarial loss

Ladv = Ex[log Dsrc(x)] + Ex,c[log(1 − Dsrc(G(x, c)))],   (1)

where G generates an image G(x, c) conditioned on both the input image x and the target domain label c, while D tries to distinguish between real and fake images. In this paper, we refer to the term Dsrc(x) as a probability distribution over sources given by D. The generator G tries to minimize this objective, while the discriminator D tries to maximize it.

Domain Classification Loss. For a given input image x and a target domain label c, our goal is to translate x into an output image y, which is properly classified to the target domain c. To achieve this condition, we add an auxiliary classifier on top of D and impose the domain classification loss when optimizing both D and G. That is, we decompose the objective into two terms: a domain classification loss of real images used to optimize D, and a domain classification loss of fake images used to optimize G. In detail, the former is defined as

Lr_cls = Ex,c'[−log Dcls(c'|x)],   (2)

where the term Dcls(c'|x) represents a probability distribution over domain labels computed by D. By minimizing this objective, D learns to classify a real image x to its corresponding original domain c'. We assume that the input image and domain label pair (x, c') is given by the training data. On the other hand, the loss function for the domain classification of fake images is defined as

Lf_cls = Ex,c[−log Dcls(c|G(x, c))].   (3)

In other words, G tries to minimize this objective to generate images that can be classified as the target domain c.

Reconstruction Loss. By minimizing the adversarial and classification losses, G is trained to generate images that are realistic and classified to its correct target domain. However, minimizing the losses (Eqs. (1) and (3)) does not guarantee that translated images preserve the content of its input images while changing only the domain-related part of the inputs. To alleviate this problem, we apply a cycle consistency loss [9, 33] to the generator, defined as

Lrec = Ex,c,c'[||x − G(G(x, c), c')||₁],   (4)

where G takes in the translated image G(x, c) and the original domain label c' as input and tries to reconstruct the original image x. We adopt the L1 norm as our reconstruction loss. Note that we use a single generator twice, first to translate an original image into an image in the target domain and then to reconstruct the original image from the translated image.

Full Objective. Finally, the objective functions to optimize G and D are written, respectively, as

LD = −Ladv + λcls Lr_cls,   (5)
LG = Ladv + λcls Lf_cls + λrec Lrec,   (6)

where λcls and λrec are hyper-parameters that control the relative importance of domain classification and reconstruction losses, respectively, compared to the adversarial loss. We use λcls = 1 and λrec = 10 in all of our experiments.

3.2. Training with Multiple Datasets
An important advantage of StarGAN is that it simultaneously incorporates multiple datasets containing different types of labels, so that StarGAN can control all the labels at the test phase. An issue when learning from multiple datasets, however, is that the label information is only partially known to each dataset. In the case of CelebA [19] and RaFD [13], while the former contains labels for attributes such as hair color and gender, it does not have any labels for facial expressions such as 'happy' and 'angry', and vice versa for the latter. This is problematic because the complete information on the label vector c' is required when reconstructing the input image x from the translated image G(x, c) (see Eq. (4)).

Mask Vector. To alleviate this problem, we introduce a mask vector m that allows StarGAN to ignore unspecified labels and focus on the explicitly known label provided by a particular dataset. In StarGAN, we use an n-dimensional one-hot vector to represent m, with n being the number of datasets. In addition, we define a unified version of the label as a vector

c̃ = [c1, ..., cn, m],   (7)

where [·] refers to concatenation, and ci represents a vector for the labels of the i-th dataset. The vector of the known label ci can be represented as either a binary vector for binary attributes or a one-hot vector for categorical attributes. For the remaining n−1 unknown labels we […]

Training Strategy. When training StarGAN with multiple datasets, we use the domain label c̃ defined in Eq. (7) as input to the generator. By doing so, the generator learns to ignore the unspecified labels, which are zero vectors, and focus on the explicitly given label. The structure of the generator is exactly the same as in training with a single dataset, except for the dimension of the input label c̃. On the other hand, we extend the auxiliary classifier of the discriminator to generate probability distributions over labels for all datasets. Then, we train the model in a multi-task learning setting, where the discriminator tries to minimize only the classification error associated to the known label. For example, when training with images in CelebA, the discriminator minimizes only classification errors for labels related to CelebA attributes, and not facial expressions related to RaFD. Under these settings, by alternating between CelebA and RaFD the discriminator learns all of the discriminative features for both datasets, and the generator learns to control all the labels in both datasets.
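For the multi-dataset case, the following tiny sketch shows how the unified label c̃ = [c1, ..., cn, m] of Eq. (7) could be assembled for two datasets (n = 2). The attribute and expression dimensionalities are made up, and filling unknown labels with zeros follows the training-strategy description above.

```python
import torch

def unified_label(celeba_attrs=None, rafd_expr=None):
    """Sketch of StarGAN's unified label c̃ = [c1, ..., cn, m] for two datasets (n = 2).

    celeba_attrs : binary attribute vector (assumed 5 attributes) or None if unknown
    rafd_expr    : one-hot expression vector (assumed 8 classes) or None if unknown
    Unknown labels are filled with zeros; m is a one-hot mask over the n datasets.
    """
    c1 = celeba_attrs if celeba_attrs is not None else torch.zeros(5)
    c2 = rafd_expr if rafd_expr is not None else torch.zeros(8)
    m = torch.tensor([1.0, 0.0]) if celeba_attrs is not None else torch.tensor([0.0, 1.0])
    return torch.cat([c1, c2, m])

# A CelebA sample with known attributes and unknown facial expression:
print(unified_label(celeba_attrs=torch.tensor([1., 0., 1., 0., 1.])))
```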
As mentioned earlier, after training the StarGAN model, we can use the generator G(., d, d') as our f_{d,d'}(.) function and learn a domain-invariant representation via the learning objective in Eq. 13. We name this implementation of our method DIR-GAN (domain-invariant representation learning with generative adversarial networks).

5. Experiments
5.1. Datasets
To evaluate our method, we perform experiments on three datasets that are commonly used in the literature for domain generalization.

Rotated MNIST. In this dataset by Ghifary et al. (2015), 1,000 MNIST images (100 per class) (LeCun & Cortes, 2010) are chosen to form the first domain (denoted M0), then rotations of 15°, 30°, 45°, 60° and 75° are applied to create five additional domains, denoted M15, M30, M45, M60 and M75. The task is classification with ten classes (digits 0 to 9).

PACS (Li et al., 2017) contains 9,991 images from four different domains: art painting, cartoon, photo, sketch. The task is classification with seven classes.

OfficeHome (Venkateswara et al., 2017) has 15,500 images of daily objects from four domains: art, clipart, product and real. There are 65 classes in this classification dataset.
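A sketch of how the Rotated MNIST domains described above could be constructed with torchvision (an illustration, not the authors' preprocessing). The data path, the per-class subsampling, and the use of TF.rotate are assumptions.

```python
import torch
from torchvision import datasets, transforms
from torchvision.transforms import functional as TF

# 100 images per class form domain M0; rotations of 15..75 degrees form M15..M75.
mnist = datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())

per_class = 100
indices = []
for digit in range(10):
    digit_idx = (mnist.targets == digit).nonzero(as_tuple=True)[0][:per_class]
    indices.extend(digit_idx.tolist())

base_images = torch.stack([mnist[i][0] for i in indices])     # domain M0 (1,000 images)
labels = torch.tensor([int(mnist.targets[i]) for i in indices])

domains = {0: base_images}
for angle in (15, 30, 45, 60, 75):                            # domains M15 ... M75
    domains[angle] = torch.stack([TF.rotate(img, angle) for img in base_images])

# Leave-one-domain-out: train on five domains, test on the held-out one.
held_out = 30
train_domains = {a: imgs for a, imgs in domains.items() if a != held_out}
test_images, test_labels = domains[held_out], labels
```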
Table 1. Rotated MNIST leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs.
Domains: Model | M0 | M15 | M30 | M45 | M60 | M75 | Average
image and domain label pair (x, c ) is given by the training
data. On the other hand, the loss function for the domain
classification of fake images is defined as
Lf
cls = Ex,c[− log Dcls(c|G(x, c))]. (3)
In other words, G tries to minimize this objective to gener-
ate images that can be classified as the target domain c.
Reconstruction Loss. By minimizing the adversarial and
classification losses, G is trained to generate images that
are realistic and classified to its correct target domain. How-
ever, minimizing the losses (Eqs. (1) and (3)) does not guar-
antee that translated images preserve the content of its input
images while changing only the domain-related part of the
inputs. To alleviate this problem, we apply a cycle consis-
tency loss [9, 33] to the generator, defined as
Lrec = Ex,c,c [||x − G(G(x, c), c
)||1], (4)
where G takes in the translated image G(x, c) and the origi-
nal domain label c
as input and tries to reconstruct the orig-
inal image x. We adopt the L1 norm as our reconstruction
loss. Note that we use a single generator twice, first to trans-
late an original image into an image in the target domain
and then to reconstruct the original image from the trans-
lated image.
classifier on top of D and impose the domain classification
loss when optimizing both D and G. That is, we decompose
the objective into two terms: a domain classification loss of
real images used to optimize D, and a domain classification
loss of fake images used to optimize G. In detail, the former
is defined as
Lr
cls = Ex,c [− log Dcls(c
|x)], (2)
where the term Dcls(c
|x) represents a probability distribu-
tion over domain labels computed by D. By minimizing
this objective, D learns to classify a real image x to its cor-
responding original domain c
. We assume that the input
image and domain label pair (x, c
) is given by the training
data. On the other hand, the loss function for the domain
classification of fake images is defined as
Lf
cls = Ex,c[− log Dcls(c|G(x, c))]. (3)
In other words, G tries to minimize this objective to gener-
ate images that can be classified as the target domain c.
Reconstruction Loss. By minimizing the adversarial and
classification losses, G is trained to generate images that
are realistic and classified to its correct target domain. How-
ever, minimizing the losses (Eqs. (1) and (3)) does not guar-
antee that translated images preserve the content of its input
tially known to each dataset. In the ca
RaFD [13], while the former contain
such as hair color and gender, it doe
for facial expressions such as ‘happy’
versa for the latter. This is problem
plete information on the label vecto
reconstructing the input image x from
G(x, c) (See Eq. (4)).
Mask Vector. To alleviate this pro
mask vector m that allows StarGAN
labels and focus on the explicitly kno
a particular dataset. In StarGAN, we
one-hot vector to represent m, with n
datasets. In addition, we define a unifi
as a vector
c̃ = [c1, ..., cn, m
where [·] refers to concatenation, and
for the labels of the i-th dataset. Th
label ci can be represented as either
nary attributes or a one-hot vector for
For the remaining n−1 unknown lab
mize this objective, while the discriminator D tries to
mize it.
ain Classification Loss. For a given input image x
target domain label c, our goal is to translate x into
put image y, which is properly classified to the target
n c. To achieve this condition, we add an auxiliary
fier on top of D and impose the domain classification
hen optimizing both D and G. That is, we decompose
jective into two terms: a domain classification loss of
mages used to optimize D, and a domain classification
f fake images used to optimize G. In detail, the former
ned as
Lr
cls = Ex,c [− log Dcls(c
|x)], (2)
the term Dcls(c
|x) represents a probability distribu-
ver domain labels computed by D. By minimizing
bjective, D learns to classify a real image x to its cor-
nding original domain c
. We assume that the input
and domain label pair (x, c
) is given by the training
On the other hand, the loss function for the domain
fication of fake images is defined as
3.2. Training with Multiple Datasets
An important advantage of StarGAN is that it simulta-
neously incorporates multiple datasets containing different
types of labels, so that StarGAN can control all the labels
at the test phase. An issue when learning from multiple
datasets, however, is that the label information is only par-
tially known to each dataset. In the case of CelebA [19] and
RaFD [13], while the former contains labels for attributes
such as hair color and gender, it does not have any labels
for facial expressions such as ‘happy’ and ‘angry’, and vice
versa for the latter. This is problematic because the com-
plete information on the label vector c
is required when
reconstructing the input image x from the translated image
G(x, c) (See Eq. (4)).
Mask Vector. To alleviate this problem, we introduce a
mask vector m that allows StarGAN to ignore unspecified
labels and focus on the explicitly known label provided by
a particular dataset. In StarGAN, we use an n-dimensional
one-hot vector to represent m, with n being the number of
datasets. In addition, we define a unified version of the label
Domain Invariant Representation Learning with Domain Density Transformations
both qualitative and quantitative results on
ute transfer and facial expression synthe-
ng StarGAN, showing its superiority over
dels.
ork
ersarial Networks. Generative adversar-
ANs) [3] have shown remarkable results
ter vision tasks such as image generation
age translation [7, 9, 33], super-resolution
d face image synthesis [10, 16, 26, 31]. A
del consists of two modules: a discrimina-
or. The discriminator learns to distinguish
fake samples, while the generator learns to
mples that are indistinguishable from real
proach also leverages the adversarial loss
rated images as realistic as possible.
Ns. GAN-based conditional image gener-
n actively studied. Prior studies have pro-
distribution of images in cross domains. CycleGAN [33]
and DiscoGAN [9] preserve key attributes between the in-
put and the translated image by utilizing a cycle consistency
loss. However, all these frameworks are only capable of
learning the relations between two different domains at a
time. Their approaches have limited scalability in handling
multiple domains since different models should be trained
for each pair of domains. Unlike the aforementioned ap-
proaches, our framework can learn the relations among mul-
tiple domains using only a single model.
Adversarial Loss. To make the generated images indistin-
guishable from real images, we adopt an adversarial loss
Ladv = Ex [log Dsrc(x)] +
Ex,c[log (1 − Dsrc(G(x, c)))],
(1)
where G generates an image G(x, c) conditioned on both
the input image x and the target domain label c, while D
tries to distinguish between real and fake images. In this
paper, we refer to the term Dsrc(x) as a probability distri-
bution over sources given by D. The generator G tries to
3
In practice, we will learn the functions f’s that transform
the data distributions between domains and one can use
several generative modeling frameworks, e.g., normalizing
flows (Grover et al., 2020) or GANs (Zhu et al., 2017; Choi
et al., 2018; 2020) to learn such functions. One advantage
of normalizing flows is that this transformation is naturally
invertible by design of the neural network. In addition, the
determinant of the Jacobian of that transformation can be
efficiently computed. However, since we do not need access to the Jacobian once the training of the generative model is completed, we propose to use GANs instead, inheriting their rich network capacity. In particular,
we use the StarGAN (Choi et al., 2018) model, which is
designed for image domain transformations.
The goal of StarGAN is to learn a unified network G that transforms the data density among multiple domains. In particular, the network G(x, d, d′) (i.e., G is conditioned on the image x and the two different domains d, d′) transforms an image x from domain d to domain d′. Different from the original StarGAN model, which only takes the image x and the desired destination domain d′ as its input, in our implementation we feed both the original domain d and the desired destination domain d′, together with the original image x, to the generator G.
The generator’s goal is to fool a discriminator D into thinking that the transformed image belongs to the destination domain d′. In other words, the equilibrium state of StarGAN, in which G completely fools D, is when G successfully transforms the data density of the original domain to that of the destination domain. After training, we use G(., d, d′) as the function fd,d′(.) described in the previous section and perform the representation learning via the objective function in Eq. 13.
Three important loss functions of the StarGAN architecture are:
• Domain classification loss Lcls, which encourages the generator G to generate images that correctly belong to the desired destination domain d′.
• Adversarial loss Ladv, which encourages the generated images to be indistinguishable from real images of the destination domain.
• Reconstruction (cycle consistency) loss Lrec: we want G(., d′, d) to be the inverse of G(., d, d′), which will minimize Lrec to zero.
We can enforce the generator G to transform the data distribution within the class y (e.g., p(x|y, d) to p(x′|y, d′) ∀y) by sampling each minibatch with data from the same class y, so that the discriminator will distinguish the transformed images from the real images of class y and domain d′. However, we found that this constraint can be relaxed in practice, and the generator almost always transforms the image within the original class y.
As mentioned earlier, after training the StarGAN model, we can use the generator G(., d, d′) as our fd,d′(.) function and learn a domain-invariant representation via the learning objective in Eq. 13. We name this implementation of our method DIR-GAN (domain-invariant representation learning with generative adversarial networks).
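A minimal sketch of how such a training step could be wired up, assuming the Eq. 13 objective combines the prediction loss with an L2 penalty that forces the representation of x and of its domain-translated version G(x, d, d′) to match; g_theta, classifier, G and the weight lam are placeholder names, not the paper's code:

import torch
import torch.nn.functional as F

def dir_gan_step(g_theta, classifier, G, x, y, d, d_prime, lam=1.0):
    z = g_theta(x)                                  # representation of the original image
    with torch.no_grad():                           # the trained StarGAN generator is frozen
        x_translated = G(x, d, d_prime)             # f_{d,d'}(x): x mapped from domain d to d'
    z_translated = g_theta(x_translated)            # representation of the translated image
    cls_loss = F.cross_entropy(classifier(z), y)    # prediction loss l(y, g_theta(x))
    inv_loss = (z - z_translated).pow(2).sum(dim=1).mean()   # invariance penalty
    return cls_loss + lam * inv_loss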
6. Experiments / Results
1) Dataset
2) Results
5. Experiments
5.1. Datasets
To evaluate our method, we perform experiments in three
datasets that are commonly used in the literature for domain
generalization.
Rotated MNIST. In this dataset by Ghifary et al. (2015), 1,000 MNIST images (100 per class) (LeCun & Cortes, 2010) are chosen to form the first domain (denoted M0); rotations of 15°, 30°, 45°, 60° and 75° are then applied to create five additional domains, denoted M15, M30, M45, M60 and M75. The task is classification with ten classes (digits 0 to 9).
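A minimal sketch of how the five rotated domains could be generated from the selected images, assuming torchvision is available (the 100-per-class sampling and exact preprocessing follow the protocol above; the helper name is only illustrative):

import torchvision.transforms.functional as TF

def make_rotated_domains(images, angles=(0, 15, 30, 45, 60, 75)):
    # images: tensor of shape (N, 1, 28, 28) holding the 1,000 selected MNIST digits
    return {f"M{a}": TF.rotate(images, angle=float(a)) for a in angles}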
PACS (Li et al., 2017) contains 9,991 images from four
different domains: art painting, cartoon, photo, sketch. The
task is classification with seven classes.
OfficeHome (Venkateswara et al., 2017) has 15,500 im-
ages of daily objects from four domains: art, clipart, product
and real. There are 65 classes in this classification dataset.
Table 1. Rotated MNIST leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs
Domains
Model M0 M15 M30 M45 M60 M75 Average
HIR (Wang et al., 2020) 90.34 99.75 99.40 96.17 99.25 91.26 96.03
DIVA (Ilse et al., 2020) 93.5 99.3 99.1 99.2 99.3 93.0 97.2
DGER (Zhao et al., 2020) 90.09 99.24 99.27 99.31 99.45 90.81 96.36
DA (Ganin et al., 2016) 86.7 98.0 97.8 97.4 96.9 89.1 94.3
LG (Shankar et al., 2018) 89.7 97.8 98.0 97.1 96.6 92.1 95.3
HEX (Wang et al., 2019) 90.1 98.9 98.9 98.8 98.3 90.0 95.8
ADV (Wang et al., 2019) 89.9 98.6 98.8 98.7 98.6 90.4 95.2
DIR-GAN (ours) 97.2(±0.3) 99.4(±0.1) 99.3(±0.1) 99.3(±0.1) 99.2(±0.1) 97.1(±0.3) 98.6
Table 2. PACS leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs
PACS
Model Backbone Art Painting Cartoon Photo Sketch Average
DGER (Zhao et al., 2020) Resnet18 80.70 76.40 96.65 71.77 81.38
JiGen (Carlucci et al., 2019) Resnet18 79.42 75.25 96.03 71.35 79.14
MLDG (Li et al., 2018a) Resnet18 79.50 77.30 94.30 71.50 80.70
MetaReg (Balaji et al., 2018) Resnet18 83.70 77.20 95.50 70.40 81.70
CSD (Piratla et al., 2020) Resnet18 78.90 75.80 94.10 76.70 81.40
DMG (Chattopadhyay et al., 2020) Resnet18 76.90 80.38 93.35 75.21 81.46
DIR-GAN (ours) Resnet18 82.56(± 0.4) 76.37(± 0.3) 95.65(± 0.5) 79.89(± 0.2) 83.62
Table 3. OfficeHome leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs
OfficeHome
Model Backbone Art ClipArt Product Real Average
D-SAM (D’Innocente & Caputo, 2018) Resnet18 58.03 44.37 69.22 71.45 60.77
JiGen (Carlucci et al., 2019) Resnet18 53.04 47.51 71.47 72.79 61.20
DIR-GAN (ours) Resnet18 56.69(±0.4) 50.49(±0.2) 71.32(±0.4) 74.23(±0.5) 63.18
5.2. Experimental Setting
For all datasets, we perform “leave-one-domain-out” exper-
iments, where we choose one domain as the target domain,
train the model on all remaining domains and evaluate it
on the chosen domain. Following standard practice, we use
90% of available data as training data and 10% as validation
data, except for the Rotated MNIST experiment where we
do not use a validation set and just report the performance
of the last epoch.
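A sketch of this protocol, with train_and_evaluate standing in for a hypothetical training routine (PACS domain names shown; the 90/10 split applies to the source domains only):

domains = ["art_painting", "cartoon", "photo", "sketch"]
results = {}
for target in domains:
    sources = [d for d in domains if d != target]
    # hypothetical helper: train on the source domains, report accuracy on the held-out target
    results[target] = train_and_evaluate(train_domains=sources, test_domain=target)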
For the Rotated MNIST dataset, we use a network of two
3x3 convolutional layers and a fully connected layer as the
representation network gθ to get a representation z of 64
dimensions. A single linear layer is then used to map the
representation z to the ten output classes. This architecture
is the deterministic version of the network used by Ilse et al.
(2020). We train our network for 500 epochs with the Adam
optimizer (Kingma & Ba, 2014), using the learning rate
0.001 and minibatch size 64, and report performance on the
test domain after the last epoch.
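A plausible instantiation of this representation network, assuming PyTorch; the channel widths and pooling are assumptions, and only the two 3x3 convolutions, the 64-dimensional representation and the single linear output layer are specified above:

import torch.nn as nn

g_theta = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 5 * 5, 64),      # 28x28 input -> 26 -> 13 -> 11 -> 5 after the two poolings
)
classifier = nn.Linear(64, 10)       # single linear layer to the ten digit classes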
For the PACS and OfficeHome datasets, we use a Resnet18
(He et al., 2016) network as the representation network gθ.
As a standard practice, the Resnet18 backbone is pre-trained
on ImageNet. We replace the last fully connected layer of
the Resnet with a linear layer of dimensions (512, 256) so
that our representation has 256 dimensions. As with the
Rotated MNIST experiment, we use a single layer to map
from the representation z to the output. We train the network
for 100 epochs with plain stochastic gradient descent (SGD)
using learning rate 0.001, momentum 0.9, minibatch size
64, and weight decay 0.001. Data augmentation is also
standard practice for real-world computer vision datasets
like PACS and OfficeHome, and during the training we
augment our data as follows: crops of random size and
aspect ratio, resizing to 224 × 224 pixels, random horizontal
flips, random color jitter, randomly converting the image
tile to grayscale with 10% probability, and normalization
using the ImageNet channel means and standard deviations.
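A sketch of this setup with torchvision, under the stated settings; the color-jitter strengths are assumptions, while the crop size, flip, 10% grayscale, ImageNet normalization and the 512 -> 256 head follow the description above:

import torch.nn as nn
from torchvision import transforms, models

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),                     # crops of random size and aspect ratio
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),            # jitter strengths are assumptions
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

g_theta = models.resnet18(pretrained=True)                 # ImageNet pre-trained backbone
g_theta.fc = nn.Linear(512, 256)                           # 256-dimensional representation
num_classes = 7                                            # 7 for PACS, 65 for OfficeHome
classifier = nn.Linear(256, num_classes)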
Reconstruction Loss. By minimizing the adversarial and classification losses, G is trained to generate images that are realistic and classified to the correct target domain. However, minimizing the losses (Eqs. (1) and (3)) does not guarantee that translated images preserve the content of their input images while changing only the domain-related part of the inputs. To alleviate this problem, we apply a cycle consistency loss [9, 33] to the generator, defined as

Lrec = Ex,c,c′ [||x − G(G(x, c), c′)||1],   (4)

where G takes in the translated image G(x, c) and the original domain label c′ as input and tries to reconstruct the original image x. We adopt the L1 norm as our reconstruction loss. Note that we use a single generator twice, first to translate an original image into an image in the target domain and then to reconstruct the original image from the translated image.

Full Objective. Finally, the objective functions to optimize G and D are written, respectively, as

LD = −Ladv + λcls L^r_cls,   (5)
LG = Ladv + λcls L^f_cls + λrec Lrec,   (6)

where λcls and λrec are hyper-parameters that control the relative importance of domain classification and reconstruction losses, respectively, compared to the adversarial loss. We use λcls = 1 and λrec = 10 in all of our experiments.
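A minimal sketch of the cycle consistency term and the weighted totals, assuming the adversarial and classification terms are computed elsewhere (e.g., as in the earlier adversarial-loss sketch); all names are placeholders:

def reconstruction_loss(G, x, c_target, c_orig):
    # Translate into the target domain, then back with the original domain label (Eq. 4).
    x_fake = G(x, c_target)
    x_rec = G(x_fake, c_orig)
    return (x - x_rec).abs().mean()              # L1 reconstruction loss

def full_objectives(d_adv, g_adv, cls_real, cls_fake, rec, lambda_cls=1.0, lambda_rec=10.0):
    loss_D = d_adv + lambda_cls * cls_real       # Eq. (5); d_adv already equals -L_adv
    loss_G = g_adv + lambda_cls * cls_fake + lambda_rec * rec   # Eq. (6)
    return loss_D, loss_G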
When StarGAN is trained on multiple datasets at once, the domain label is extended to a unified vector

c̃ = [c1, ..., cn, m],   (7)

where [·] refers to concatenation, ci represents a vector for the labels of the i-th dataset, and m is a mask vector indicating which dataset the known label comes from. The vector of the known label ci can be represented as either a binary vector for binary attributes or a one-hot vector for categorical attributes. For the remaining n−1 unknown labels we simply assign zero values. In our experiments, we utilize the CelebA and RaFD datasets, where n is two.
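A minimal sketch of building c̃, assuming each dataset's label is a fixed-length vector and the mask m is a one-hot indicator of the known dataset (the helper name and the one-hot choice for m are assumptions):

import torch

def unified_label(known_label, dataset_idx, label_dims):
    # c~ = [c1, ..., cn, m]: one label slot per dataset plus the mask vector m.
    n = len(label_dims)
    parts = [known_label if j == dataset_idx else torch.zeros(label_dims[j]) for j in range(n)]
    mask = torch.zeros(n)
    mask[dataset_idx] = 1.0          # m marks which dataset the label comes from
    return torch.cat(parts + [mask])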
Training Strategy. When training StarGAN with multiple
datasets, we use the domain label c̃ defined in Eq. (7) as in-
put to the generator. By doing so, the generator learns to
ignore the unspecified labels, which are zero vectors, and
focus on the explicitly given label. The structure of the gen-
erator is exactly the same as in training with a single dataset,
except for the dimension of the input label c̃. On the other
hand, we extend the auxiliary classifier of the discrimina-
tor to generate probability distributions over labels for all
datasets. Then, we train the model in a multi-task learning
setting, where the discriminator tries to minimize only the
classification error associated to the known label. For example, when training with images in CelebA, the discriminator minimizes only the classification errors related to CelebA attributes, and not those related to facial expressions from RaFD. Under these settings, by alternating between CelebA and RaFD, the discriminator learns the discriminative features for both datasets, and the generator learns to control all the labels of both datasets.
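A brief sketch of this multi-task discriminator loss, assuming categorical labels and one classifier head per dataset (CelebA's binary attributes would use a binary cross entropy instead; names are placeholders):

import torch.nn.functional as F

def masked_classification_loss(logits_per_dataset, label, known_idx):
    # Only the head of the dataset whose label is known contributes; the other heads are ignored.
    return F.cross_entropy(logits_per_dataset[known_idx], label)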
StarGAN proposes a novel and scalable approach capable of learning mappings among multiple domains. As demonstrated in Fig. 2 (b), our model takes in training data of multiple domains, and learns the mappings between all available domains using only a single generator. The idea is simple. Instead of learning a fixed translation (e.g., black-to-blond hair), our generator takes in as inputs both image and domain information, and learns to flexibly translate the image into the corresponding domain. We use a label (e.g., binary or one-hot vector) to represent domain information. During training, we randomly generate a target domain label and train the model to flexibly translate an input image into the target domain. By doing so, we can control the domain label and translate the image into any desired domain at the testing phase. We also introduce a simple but effective approach that enables joint training between domains of different datasets by adding a mask vector to the domain label. Our proposed method ensures that the model can ignore unknown labels and focus on the label provided by a particular dataset. In this manner, our model can perform well on tasks such as synthesizing facial expressions of CelebA images.
Domain Invariant Representation Learning with Domain Density Transformations
3) Visualization of Representation
Figure 4. Visualization of the representation space. Each point indicates a representation z of an image x in the two dimensional space and its color indicates the label y. Two left figures are for our method DIR-GAN and two right figures are for the naive model DeepAll.
The StarGAN (Choi et al., 2018) model implementation
is taken from the authors’ original source code with no
significant modifications. For each set of source domains,
we train the StarGAN model for 100,000 iterations with a
minibatch of 16 images per iteration.
The code for all of our experiments will be released for
reproducibility. Please also refer to the source code for any
other architecture and implementation details.
Figure 4 suggests that the DIR-GAN representation aligns both the marginal distribution (i.e., the general distribution of the points) and the conditional distribution (for example, the distributions of blue points and green points) across domains.
PACS and OfficeHome. To the best of our knowledge, domain invariant representation learning methods have not been applied widely and successfully to real-world computer vision datasets (e.g., PACS and OfficeHome) with very deep neural networks such as Resnet, so the only relevant baselines are the Resnet18-based domain generalization methods compared in Tables 2 and 3.

06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 

Recently uploaded (20)

ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 

Domain Invariant Representation Learning with Domain Density Transformations

  • 1. Domain Invariant Representation Learning with Domain Density Transformations A. Tuan Nguyen, Toan Tran, Yarin Gal, Atılım Güneş Baydin, arxiv 2102.05082 PR-320, Presented by Eddie
  • 2. Domain Invariant Representation Learning with Domain Density Transformations
1. Domain Generalization
Domain generalization is a learning setup whose goal is to build a model that does not depend on any particular domain, so that it can handle previously unseen (out-of-distribution) domains.
2. Domain Generalization vs. Domain Adaptation
The difference is that domain adaptation can extract information from unlabeled data of the target domain, whereas domain generalization cannot.
Train on the painting data in the Baroque period. Test on the painting data in the Modern period. Model: "Caravaggio" / Model: ?
Thanh-Dat Truong, et al., Recognition in Unseen Domains: Domain Generalization via Universal Non-volume Preserving Models
"It is a challenging task because predictions must be made without any information about the target domain."
  • 3. Domain Invariant Representation Learning with Domain Density Transformations
3. [Domain Invariance] Marginal and Conditional Alignment
Definition 1. Marginal Distribution Alignment. The representation z is said to satisfy the marginal distribution alignment condition if p(z|d) is invariant w.r.t. d.
Definition 2. Conditional Distribution Alignment. The representation z is said to satisfy the conditional distribution alignment condition if p(y|z,d) is invariant w.r.t. d.
4. Proposed Method
$\mathbb{E}_{d,d' \in \mathcal{D}_s,\, p(x,y|d)}\big[\,\ell(y, g_\theta(x)) + \lVert g_\theta(x) - g_\theta(f_{d,d'}(x))\rVert_2^2\,\big]$
where
- $\mathcal{D}$: the set of domains, with source domains $d, d' \in \mathcal{D}_s = \{d_1, d_2, \dots, d_K\}$ (a domain and a different domain)
- $\mathcal{X}$: the data space (the domain of $g_\theta$), $x \in \mathcal{X}$; $\mathcal{Z}$: the representation space (the codomain of $g_\theta$), $z \in \mathcal{Z}$; $\mathcal{Y}$: the label space, $y \in \mathcal{Y}$
- $g_\theta: \mathcal{X} \to \mathcal{Z}$: the domain representation function that maps an input x to a representation z
- $f_{d,d'}: \mathcal{X} \to \mathcal{X}$: the density transformation function that maps an input x of domain d to domain d'
- $\ell(y, g_\theta(x))$: the loss between the prediction and the label; $g_\theta(x)$: the prediction from x; $g_\theta(f_{d,d'}(x))$: the prediction after transforming x to domain d'
Q1) Does a domain-invariant representation function exist?
Q2) Is a representation $g_\theta$ with $g_\theta(x) - g_\theta(f_{d,d'}(x)) = 0$ domain-invariant?
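For readers who prefer code, the following is a minimal PyTorch-style sketch of this objective. The names `rep_net` (for $g_\theta$), `classifier` (for the prediction head), and `transform_fn` (standing in for $f_{d,d'}$) are illustrative assumptions, not identifiers from the paper or its released code.

```python
import torch
import torch.nn.functional as F

def dir_loss(rep_net, classifier, transform_fn, x, y, d, d_prime, lam=1.0):
    """Minibatch version of the invariance-regularized objective (Eq. 13).

    rep_net      : g_theta, maps inputs to representations z
    classifier   : maps z to class logits
    transform_fn : stands in for f_{d,d'}, maps x from domain d to domain d'
    lam          : penalty weight; Eq. 13 corresponds to an unweighted sum (lam = 1)
    """
    z = rep_net(x)                                   # z = g_theta(x)
    pred_loss = F.cross_entropy(classifier(z), y)    # l(y, g_theta(x))

    with torch.no_grad():                            # the transformation is fixed here
        x_trans = transform_fn(x, d, d_prime)
    z_trans = rep_net(x_trans)                       # g_theta(f_{d,d'}(x))

    inv_penalty = ((z - z_trans) ** 2).sum(dim=1).mean()   # ||.||_2^2, batch mean
    return pred_loss + lam * inv_penalty
```

In the full method, `transform_fn` would be the StarGAN-style generator described later in the deck, trained beforehand and kept fixed while the representation is learned.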
  • 4. Domain Invariant Representation Learning with Domain Density Transformations
Theorem 1: Does a domain-invariant representation function exist?
Theorem 1. The invariance of p(y|d) across domains is the necessary and sufficient condition for the existence of a domain-invariant representation (one that aligns both the marginal and the conditional distribution).
In other words, "p(y|d) is invariant w.r.t. the domain" and "a domain-invariant representation (function) exists" are equivalent.
1) A ⇔ B means A ⇒ B and B ⇒ A.
2) "A domain-invariant representation (function) exists" means there exists a representation z satisfying both the marginal and the conditional distribution alignment conditions.
(⇒) If a domain-invariant representation exists:
$p(y,z|d) = p(y|z,d)\,p(z|d) = p(y|z,d')\,p(z|d') = p(y,z|d')$, and marginalizing over z gives $p(y|d) = p(y|d')$.
(⇐) If p(y|d) is unchanged w.r.t. the domain d, then we can always find a domain-invariant representation (this is trivial). For example, $p(z|x) = \delta_0(z)$ for the deterministic case (which maps all x to 0), or $p(z|x) = \mathcal{N}(z; 0, 1)$ for the probabilistic case.
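As a short worked check of the converse direction sketched above (an expansion of the slide's "trivial" remark, not text from the paper), take the probabilistic choice $p(z|x) = \mathcal{N}(z; 0, 1)$ and assume $p(y|d) = p(y)$ for all domains:

```latex
% Both alignment conditions hold for the trivial representation p(z|x) = N(z; 0, 1),
% assuming the label distribution p(y|d) = p(y) is invariant across domains d.
\begin{align}
p(z \mid d)    &= \int p(z \mid x)\, p(x \mid d)\, dx
                = \mathcal{N}(z; 0, 1) \int p(x \mid d)\, dx
                = \mathcal{N}(z; 0, 1)
                && \text{(marginal alignment)} \\
p(y \mid z, d) &= \frac{p(y \mid d)\, p(z \mid y, d)}{p(z \mid d)}
                = \frac{p(y)\, \mathcal{N}(z; 0, 1)}{\mathcal{N}(z; 0, 1)}
                = p(y)
                && \text{(conditional alignment)}
\end{align}
```

Both conditions are satisfied because z is drawn independently of x, so every conditional of z collapses to the same standard normal.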
  • 5. Domain Invariant Representation Learning with Domain Density Transformations
Theorem 2: Is a representation with $g_\theta(x) - g_\theta(f_{d,d'}(x)) = 0$ domain-invariant?
Theorem 2. Given an invertible and differentiable function $f_{d,d'}$ that transforms the data density from domain d to domain d' (with the inverse $f_{d',d}$ that transforms the data density from d' to d, as described above), and assuming the representation z satisfies $p(z|x) = p(z|f_{d,d'}(x))$ for all x, then z aligns both the marginal and the conditional of the data distributions of domains d and d'.
Density transformation relations, where $J_{f}(x')$ denotes the Jacobian matrix of the function f evaluated at x':
- $p(x|d) = p(x'|d')\,|\det J_{f_{d',d}}(x')|^{-1}$ with $x = f_{d',d}(x')$
- $p(x|y,d) = p(x'|y,d')\,|\det J_{f_{d',d}}(x')|^{-1}$
- $p(z|x) = p(z|f_{d,d'}(x))\ \forall x$ (the invariance condition)
i) Marginal alignment:
$p(z|d) = \int p(x|d)\,p(z|x)\,dx = \int p(f_{d',d}(x')|d)\,p(z|f_{d',d}(x'))\,|\det J_{f_{d',d}}(x')|\,dx'$ (substituting $x = f_{d',d}(x')$)
$= \int p(x'|d')\,|\det J_{f_{d',d}}(x')|^{-1}\,p(z|x')\,|\det J_{f_{d',d}}(x')|\,dx' = \int p(x'|d')\,p(z|x')\,dx' = p(z|d')$
ii) Conditional alignment: by the same substitution, $p(z|y,d) = \int p(x|y,d)\,p(z|x)\,dx = \int p(x'|y,d')\,p(z|x')\,dx' = p(z|y,d')$.
Note that $p(y|z,d) = \dfrac{p(y|d)\,p(z|y,d)}{p(z|d)}$. Since $p(y|d) = p(y) = p(y|d')$, $p(z|y,d) = p(z|y,d')$ and $p(z|d) = p(z|d')$, we get $p(y|z,d) = p(y|z,d')$.
[Figure 3. Domain density transformation. If we know the function $f_{1,2}$ that transforms the data density from domain 1 to domain 2, we can learn a domain-invariant representation network $g_\theta(x)$ by enforcing it to be invariant under $f_{1,2}$, i.e., $g_\theta(x_1) = g_\theta(x_2)$ for any $x_2 = f_{1,2}(x_1)$.]
This theorem indicates that, if we can find the functions f that transform the data densities among the domains, we can learn a domain-invariant representation z by encouraging the representation to be invariant under all such transformations, using the learning objective
$\mathbb{E}_d\,\mathbb{E}_{p(x,y|d)}\big[\,\ell(y, g_\theta(x)) + \mathbb{E}_{d'}\lVert g_\theta(x) - g_\theta(f_{d,d'}(x))\rVert_2^2\,\big]$  (Eq. 12)
where $\ell(y, g_\theta(x))$ is the prediction loss of a network that predicts y given $z = g_\theta(x)$, and the second term enforces the invariance condition.
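The derivation above rests entirely on the change-of-variables identity between the two domain densities. The snippet below is a small self-contained numerical check of that identity for a toy invertible map; the map, densities, and sample counts are made up for illustration and are not taken from the paper.

```python
import numpy as np

# Toy "domain transformation": an invertible affine map f(x) = 2x + 1 in 1D.
# If x ~ p(x|d) = N(0, 1), then x' = f(x) has density
# p(x'|d') = p(f^{-1}(x')|d) * |det J_f|^{-1} = N((x'-1)/2; 0, 1) / 2.

def p_source(x):                  # p(x|d): standard normal density
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def f(x):                         # f_{d,d'}: the toy transformation
    return 2.0 * x + 1.0

def f_inv(x_prime):               # f_{d',d}: its inverse
    return (x_prime - 1.0) / 2.0

jac_det = 2.0                     # |det J_f(x)| = 2 for this affine map

def p_target_cov(x_prime):        # change-of-variables prediction of p(x'|d')
    return p_source(f_inv(x_prime)) / jac_det

# Monte-Carlo check: histogram of transformed samples should match p_target_cov.
rng = np.random.default_rng(0)
samples = f(rng.standard_normal(200_000))
hist, edges = np.histogram(samples, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_err = np.max(np.abs(hist - p_target_cov(centers)))
print(f"max |histogram - change-of-variables density| = {max_err:.3f}")  # small, ~0.01
```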
  • 6. Domain Invariant Representation Learning with Domain Density Transformations d ,d applying variable substitution in multiple inte- = fd,d (x)) p(x |y, d ) det Jfd,d (x ) −1 p(z|x ) det Jfd,d (x ) dx couraging the representation to be invariant under all the transformations f’s. This idea is illustrated in Figure 3. We therefore can use the following learning objective to learn a domain-invariant representation z = gθ(x): Ed Ep(x,y|d) l(y, gθ(x)) + Ed [||gθ(x) − gθ(fd,d (x))||2 2] (12) where l(y, gθ(x)) is the prediction loss of a network that pre- dicts y given z = gθ(x), and the second term is to enforce the invariant condition in Eq 7. gral: x = fd,d (x)) = p(x |y, d ) det Jfd,d (x ) −1 p(z|x ) det Jfd,d (x ) dx domain-invariant representation z = gθ(x): Ed Ep(x,y|d) l(y, gθ(x)) + Ed [||gθ(x) − gθ(fd,d (x))||2 2] (12) where l(y, gθ(x)) is the prediction loss of a network that pre- dicts y given z = gθ(x), and the second term is to enforce the invariant condition in Eq 7. (by applying variable substitution in multiple inte- gral: x = fd,d (x)) = p(x |y, d ) det Jfd,d (x ) −1 p(z|x ) det Jfd,d (x ) dx transformations f’s. This idea is illustrated in Figure 3. We therefore can use the following learning objective to learn a domain-invariant representation z = gθ(x): Ed Ep(x,y|d) l(y, gθ(x)) + Ed [||gθ(x) − gθ(fd,d (x))||2 2] (12) where l(y, gθ(x)) is the prediction loss of a network that pre- dicts y given z = gθ(x), and the second term is to enforce the invariant condition in Eq 7. f’s that transform the data densities among the domains, we can learn a domain-invariant representation z by en- couraging the representation to be invariant under all the transformations f’s. This idea is illustrated in Figure 3. We therefore can use the following learning objective to learn a domain-invariant representation z = gθ(x): Ed Ep(x,y|d) l(y, gθ(x)) + Ed [||gθ(x) − gθ(fd,d (x))||2 2] (12) where l(y, gθ(x)) is the prediction loss of a network that pre- dicts y given z = gθ(x), and the second term is to enforce the invariant condition in Eq 7. 5. Domain Generalization with Generative Adversarial Networks (StarGAN; PR-152) models 2 3 2 G32 G23 3 2 1 5 4 3 (b) StarGAN on between cross-domain models and our pro- GAN. (a) To handle multiple domains, cross- ould be built for every pair of image domains. able of learning mappings among multiple do- e generator. The figure represents a star topol- lti-domains. ned from RaFD, as shown in the right- Fig. 1. As far as our knowledge goes, our o successfully perform multi-domain im- ross different datasets. ontributions are as follows: StarGAN, a novel generative adversarial learns the mappings among multiple do- only a single generator and a discrimina- effectively from images of all domains. rate how we can successfully learn multi- G Input image Target domain Depth-wise concatenation Fake image G Original domain Fake image Depth-wise concatenation Reconstructed image D Fake image Domain classification Real / Fake (b) Original-to-target domain (c) Target-to-original domain (d) Fooling the discriminator D Domain classification Real / Fake Fake image Real image (a) Training the discriminator (1) (2) (1), (2) (1) Figure 3. Overview of StarGAN, consisting of two modules, a discriminator D and a generator G. (a) D learns to distinguish between real and fake images and classify the real images to its corresponding domain. (b) G takes in as input both the image and target domain label and generates an fake image. 
The target domain label is spatially replicated and concatenated with the input image. (c) G tries to reconstruct the original image from the fake image given the original domain label. (d) G tries to generate images indistinguishable from real images and classifiable as target domain by D. vided both the discriminator and generator with class infor- mation in order to generate samples conditioned on the class [20, 21, 22]. Other recent approaches focused on generating particular images highly relevant to a given text description [25, 30]. The idea of conditional image generation has also been successfully applied to domain transfer [9, 28], super- resolution imaging[14], and photo editing [2, 27]. In this paper, we propose a scalable GAN framework that can flex- ibly steer the image translation to various target domains, by providing conditional domain information. Image-to-Image Translation. Recent work have achieved impressive results in image-to-image translation [7, 9, 17, 33]. For instance, pix2pix [7] learns this task in a super- vised manner using cGANs[20]. It combines an adver- sarial loss with a L1 loss, thus requires paired data sam- ples. To alleviate the problem of obtaining data pairs, un- 3. Star Generative Adversarial Networks We first describe our proposed StarGAN, a framework to address multi-domain image-to-image translation within a single dataset. Then, we discuss how StarGAN incorporates multiple datasets containing different label sets to flexibly perform image translations using any of these labels. 3.1. Multi-Domain Image-to-Image Translation Our goal is to train a single generator G that learns map- pings among multiple domains. To achieve this, we train G to translate an input image x into an output image y condi- tioned on the target domain label c, G(x, c) → y. We ran- domly generate the target domain label c so that G learns to flexibly translate the input image. We also introduce an auxiliary classifier [22] that allows a single discriminator to control multiple domains. That is, our discriminator pro- To alleviate this problem, we apply a cycle consis- ss [9, 33] to the generator, defined as Lrec = Ex,c,c [||x − G(G(x, c), c )||1], (4) G takes in the translated image G(x, c) and the origi- ain label c as input and tries to reconstruct the orig- ge x. We adopt the L1 norm as our reconstruction te that we use a single generator twice, first to trans- original image into an image in the target domain n to reconstruct the original image from the trans- age. jective. Finally, the objective functions to optimize D are written, respectively, as LD = −Ladv + λcls Lr cls, (5) LG = Ladv + λcls Lf cls + λrec Lrec, (6) λcls and λrec are hyper-parameters that control the importance of domain classification and reconstruc- ses, respectively, compared to the adversarial loss. λcls = 1 and λrec = 10 in all of our experiments. RaFD datasets, where n is two. Training Strategy. When training StarGAN with multiple datasets, we use the domain label c̃ defined in Eq. (7) as in- put to the generator. By doing so, the generator learns to ignore the unspecified labels, which are zero vectors, and focus on the explicitly given label. The structure of the gen- erator is exactly the same as in training with a single dataset, except for the dimension of the input label c̃. On the other hand, we extend the auxiliary classifier of the discrimina- tor to generate probability distributions over labels for all datasets. 
Then, we train the model in a multi-task learning setting, where the discriminator tries to minimize only the classification error associated to the known label. For ex- ample, when training with images in CelebA, the discrimi- nator minimizes only classification errors for labels related to CelebA attributes, and not facial expressions related to RaFD. Under these settings, by alternating between CelebA and RaFD the discriminator learns all of the discriminative features for both datasets, and the generator learns to con- trol all the labels in both datasets. 4 Lrec = Ex,c,c [||x − G(G(x, c), c )||1], (4) where G takes in the translated image G(x, c) and the origi- nal domain label c as input and tries to reconstruct the orig- inal image x. We adopt the L1 norm as our reconstruction loss. Note that we use a single generator twice, first to trans- late an original image into an image in the target domain and then to reconstruct the original image from the trans- lated image. Full Objective. Finally, the objective functions to optimize G and D are written, respectively, as LD = −Ladv + λcls Lr cls, (5) LG = Ladv + λcls Lf cls + λrec Lrec, (6) where λcls and λrec are hyper-parameters that control the relative importance of domain classification and reconstruc- tion losses, respectively, compared to the adversarial loss. We use λcls = 1 and λrec = 10 in all of our experiments. Training Strategy. When training StarGAN datasets, we use the domain label c̃ defined i put to the generator. By doing so, the gene ignore the unspecified labels, which are ze focus on the explicitly given label. The struc erator is exactly the same as in training with a except for the dimension of the input label c hand, we extend the auxiliary classifier of tor to generate probability distributions ove datasets. Then, we train the model in a mul setting, where the discriminator tries to min classification error associated to the known ample, when training with images in CelebA nator minimizes only classification errors fo to CelebA attributes, and not facial express RaFD. Under these settings, by alternating b and RaFD the discriminator learns all of the features for both datasets, and the generator trol all the labels in both datasets. 4 of the generative model is completed, we propose the use of GANs to inherit its rich network capacity. In particular, we use the StarGAN (Choi et al., 2018) model, which is designed for image domain transformations. The goal of StarGAN is to learn a unified network G that transforms the data density among multiple domains. In particular, the network G(x, d, d ) (i.e., G is conditioned on the image x and the two different domains d, d ) transforms an image x from domain d to domain d . Different from the original StarGAN model that only takes the image x and the desired destination domain d as its input, in our implementation, we feed both the original domain d and desired destination domain d together with the original image x to the generator G. The generator’s goal is to fool a discriminator D into think- ing that the transformed image belongs to the destination do- main d . In other words, the equilibrium state of StarGAN, in which G completely fools D, is when G successfully transforms the data density of the original domain to that of the destination domain. After training, we use G(., d, d ) as the function fd,d (.) described in the previous section and perform the representation learning via the objective function in Eq 13. 
Three important loss functions of the StarGAN architecture are: • Domain classification loss Lcls that encourages the generator G to generate images that correctly belongs to the desired destination domain d . image within the original class y. As mentioned earlier, after training the StarGAN model, we can use the generator G(., d, d ) as our fd,d (.) function and learn a domain-invariant representation via the learning objective in Eq 13. We name this implementation of our method DIR-GAN (domain-invariant representation learn- ing with generative adversarial networks). 5. Experiments 5.1. Datasets To evaluate our method, we perform experiments in three datasets that are commonly used in the literature for domain generalization. Rotated MNIST. In this dataset by Ghifary et al. (2015), 1,000 MNIST images (100 per class) (LeCun Cortes, 2010) are chosen to form the first domain (de- noted M0), then rotations of 15◦ , 30◦ , 45◦ , 60◦ and 75◦ are applied to create five additional domains, denoted M15, M30, M45, M60 and M75. The task is classifica- tion with ten classes (digits 0 to 9). PACS (Li et al., 2017) contains 9,991 images from four different domains: art painting, cartoon, photo, sketch. The task is classification with seven classes. OfficeHome (Venkateswara et al., 2017) has 15,500 im- ages of daily objects from four domains: art, clipart, product and real. There are 65 classes in this classification dataset. Domain Invariant Representation Learning with Domain Density Transformations Assume that we have a set of K sources domain Ds = {d1, d2, ..., dK}, the objective function in Eq. 12 becomes: Ed,d∈Ds,p(x,y|d) l(y, gθ(x)) + ||gθ(x) − gθ(fd,d (x))||2 2 (13) In the next section, we show how one can incorporate this idea into real-world domain generalization problems with generative adversarial networks. 4. Domain Generalization with Generative Adversarial Networks In practice, we will learn the functions f’s that transform the data distributions between domains and one can use several generative modeling frameworks, e.g., normalizing flows (Grover et al., 2020) or GANs (Zhu et al., 2017; Choi et al., 2018; 2020) to learn such functions. One advantage of normalizing flows is that this transformation is naturally invertible by design of the neural network. In addition, the • The adversarial loss Ladv that is the classification loss of a discriminator D that tries to distinguish between real images and the fake images generated by G. The equilibrium state of StarGAN is when G completely fools D, which means the distribution of the generated images (via G(x, d, d ), x ∼ p(x|d)) becomes the dis- tribution of the real images of the destination domain p(x |d ). This is our objective, i.e., to learn a function that transforms domains’ densities. • Reconstruction loss Lrec = Ex,d,d [||x − G(x , d , d)||1] where x = G(x, d, d ) to ensure that the transformations preserve the image’s content. Note that this also aligns with our interest since we want G(., d , d) to be the inverse of G(., d, d ), which will minimize Lrec to zero. We can enforce the generator G to transform the data distri- bution within the class y (e.g., p(x|y, d) to p(x |y, d ) ∀y) by sampling each minibatch with data from the same class y, so that the discriminator will distinguish the transformed Domain Invariant Representation Learning with Domain Density Transformations Assume that we have a set of K sources domain Ds = {d1, d2, ..., dK}, the objective function in Eq. 
12 becomes: Ed,d∈Ds,p(x,y|d) l(y, gθ(x)) + ||gθ(x) − gθ(fd,d (x))||2 2 (13) In the next section, we show how one can incorporate this idea into real-world domain generalization problems with generative adversarial networks. • The adversarial loss Ladv that is the classification loss of a discriminator D that tries to distinguish between real images and the fake images generated by G. The equilibrium state of StarGAN is when G completely fools D, which means the distribution of the generated images (via G(x, d, d ), x ∼ p(x|d)) becomes the dis- tribution of the real images of the destination domain p(x |d ). This is our objective, i.e., to learn a function that transforms domains’ densities. • Reconstruction loss Lrec = Ex,d,d [||x − ation Learning with Domain Density Transformations ain Ds = becomes: d (x))||2 2 (13) porate this lems with ative transform e can use rmalizing 017; Choi advantage s naturally dition, the on can be hat we do g process se the use particular, , which is • The adversarial loss Ladv that is the classification loss of a discriminator D that tries to distinguish between real images and the fake images generated by G. The equilibrium state of StarGAN is when G completely fools D, which means the distribution of the generated images (via G(x, d, d ), x ∼ p(x|d)) becomes the dis- tribution of the real images of the destination domain p(x |d ). This is our objective, i.e., to learn a function that transforms domains’ densities. • Reconstruction loss Lrec = Ex,d,d [||x − G(x , d , d)||1] where x = G(x, d, d ) to ensure that the transformations preserve the image’s content. Note that this also aligns with our interest since we want G(., d , d) to be the inverse of G(., d, d ), which will minimize Lrec to zero. We can enforce the generator G to transform the data distri- bution within the class y (e.g., p(x|y, d) to p(x |y, d ) ∀y) by sampling each minibatch with data from the same class y, so that the discriminator will distinguish the transformed images with the real images from class y and domain d . However, we found that this constraint can be relaxed in practice, and the generator almost always transforms the image within the original class y. As mentioned earlier, after training the StarGAN model, we can use the generator G(., d, d ) as our fd,d (.) function Domain Invariant Representation Learning with Domain Density Transformations Assume that we have a set of K sources domain Ds = {d1, d2, ..., dK}, the objective function in Eq. 12 becomes: Ed,d∈Ds,p(x,y|d) l(y, gθ(x)) + ||gθ(x) − gθ(fd,d (x))||2 2 (13) In the next section, we show how one can incorporate this idea into real-world domain generalization problems with generative adversarial networks. 4. Domain Generalization with Generative Adversarial Networks In practice, we will learn the functions f’s that transform the data distributions between domains and one can use several generative modeling frameworks, e.g., normalizing flows (Grover et al., 2020) or GANs (Zhu et al., 2017; Choi et al., 2018; 2020) to learn such functions. One advantage of normalizing flows is that this transformation is naturally invertible by design of the neural network. In addition, the determinant of the Jacobian of that transformation can be efficiently computed. However, due to the fact that we do not need access to the Jacobian when the training process of the generative model is completed, we propose the use of GANs to inherit its rich network capacity. 
In particular, we use the StarGAN (Choi et al., 2018) model, which is designed for image domain transformations. The goal of StarGAN is to learn a unified network G that transforms the data density among multiple domains. In particular, the network G(x, d, d ) (i.e., G is conditioned on the image x and the two different domains d, d ) transforms an image x from domain d to domain d . Different from the original StarGAN model that only takes the image x • The adversarial loss Ladv that is the classification loss of a discriminator D that tries to distinguish between real images and the fake images generated by G. The equilibrium state of StarGAN is when G completely fools D, which means the distribution of the generated images (via G(x, d, d ), x ∼ p(x|d)) becomes the dis- tribution of the real images of the destination domain p(x |d ). This is our objective, i.e., to learn a function that transforms domains’ densities. • Reconstruction loss Lrec = Ex,d,d [||x − G(x , d , d)||1] where x = G(x, d, d ) to ensure that the transformations preserve the image’s content. Note that this also aligns with our interest since we want G(., d , d) to be the inverse of G(., d, d ), which will minimize Lrec to zero. We can enforce the generator G to transform the data distri- bution within the class y (e.g., p(x|y, d) to p(x |y, d ) ∀y) by sampling each minibatch with data from the same class y, so that the discriminator will distinguish the transformed images with the real images from class y and domain d . However, we found that this constraint can be relaxed in practice, and the generator almost always transforms the image within the original class y. As mentioned earlier, after training the StarGAN model, we can use the generator G(., d, d ) as our fd,d (.) function and learn a domain-invariant representation via the learning objective in Eq 13. We name this implementation of our method DIR-GAN (domain-invariant representation learn- ing with generative adversarial networks). 5. Experiments 5.1. Datasets Domain Invariant Representation Learning with Domain Density Transformations Assume that we have a set of K sources domain Ds = {d1, d2, ..., dK}, the objective function in Eq. 12 becomes: Ed,d∈Ds,p(x,y|d) l(y, gθ(x)) + ||gθ(x) − gθ(fd,d (x))||2 2 (13) In the next section, we show how one can incorporate this idea into real-world domain generalization problems with generative adversarial networks. 4. Domain Generalization with Generative Adversarial Networks In practice, we will learn the functions f’s that transform the data distributions between domains and one can use several generative modeling frameworks, e.g., normalizing flows (Grover et al., 2020) or GANs (Zhu et al., 2017; Choi et al., 2018; 2020) to learn such functions. One advantage of normalizing flows is that this transformation is naturally invertible by design of the neural network. In addition, the determinant of the Jacobian of that transformation can be efficiently computed. However, due to the fact that we do not need access to the Jacobian when the training process of the generative model is completed, we propose the use of GANs to inherit its rich network capacity. In particular, we use the StarGAN (Choi et al., 2018) model, which is designed for image domain transformations. The goal of StarGAN is to learn a unified network G that transforms the data density among multiple domains. 
In particular, the network G(x, d, d ) (i.e., G is conditioned on the image x and the two different domains d, d ) transforms an image x from domain d to domain d . Different from the original StarGAN model that only takes the image x and the desired destination domain d as its input, in our implementation, we feed both the original domain d and desired destination domain d together with the original image x to the generator G. • The adversarial loss Ladv that is the classification loss of a discriminator D that tries to distinguish between real images and the fake images generated by G. The equilibrium state of StarGAN is when G completely fools D, which means the distribution of the generated images (via G(x, d, d ), x ∼ p(x|d)) becomes the dis- tribution of the real images of the destination domain p(x |d ). This is our objective, i.e., to learn a function that transforms domains’ densities. • Reconstruction loss Lrec = Ex,d,d [||x − G(x , d , d)||1] where x = G(x, d, d ) to ensure that the transformations preserve the image’s content. Note that this also aligns with our interest since we want G(., d , d) to be the inverse of G(., d, d ), which will minimize Lrec to zero. We can enforce the generator G to transform the data distri- bution within the class y (e.g., p(x|y, d) to p(x |y, d ) ∀y) by sampling each minibatch with data from the same class y, so that the discriminator will distinguish the transformed images with the real images from class y and domain d . However, we found that this constraint can be relaxed in practice, and the generator almost always transforms the image within the original class y. As mentioned earlier, after training the StarGAN model, we can use the generator G(., d, d ) as our fd,d (.) function and learn a domain-invariant representation via the learning objective in Eq 13. We name this implementation of our method DIR-GAN (domain-invariant representation learn- ing with generative adversarial networks). 5. Experiments 5.1. Datasets To evaluate our method, we perform experiments in three datasets that are commonly used in the literature for domain generalization. Domain Invariant Representation Learning with Domain Density Transformations Table 1. Rotated Mnist leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs Domains Model M0 M15 M30 M45 M60 M75 Average Domain Invariant Representation Learning with Domain Density Transformations Table 1. Rotated Mnist leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs image and domain label pair (x, c ) is given by the training data. On the other hand, the loss function for the domain classification of fake images is defined as Lf cls = Ex,c[− log Dcls(c|G(x, c))]. (3) In other words, G tries to minimize this objective to gener- ate images that can be classified as the target domain c. Reconstruction Loss. By minimizing the adversarial and classification losses, G is trained to generate images that are realistic and classified to its correct target domain. How- ever, minimizing the losses (Eqs. (1) and (3)) does not guar- antee that translated images preserve the content of its input images while changing only the domain-related part of the inputs. 
Reconstruction Loss. Minimizing the adversarial and classification losses trains G to generate images that are realistic and classified to the correct target domain. However, minimizing the losses in Eqs. (1) and (3) does not guarantee that translated images preserve the content of their inputs while changing only the domain-related part. To alleviate this problem, a cycle consistency loss [9, 33] is applied to the generator, defined as
$\mathcal{L}_{rec} = \mathbb{E}_{x,c,c'}[\,\lVert x - G(G(x,c), c') \rVert_1\,]$  (4)
where G takes the translated image G(x, c) and the original domain label c′ as input and tries to reconstruct the original image x; the L1 norm is adopted as the reconstruction loss. Note that a single generator is used twice: first to translate an original image into the target domain, and then to reconstruct the original image from the translated one.
Full Objective. Finally, the objective functions to optimize G and D are written, respectively, as
$\mathcal{L}_D = -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^{r}_{cls}$  (5)
$\mathcal{L}_G = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^{f}_{cls} + \lambda_{rec}\,\mathcal{L}_{rec}$  (6)
where λ_cls and λ_rec are hyper-parameters that control the relative importance of the domain classification and reconstruction losses compared to the adversarial loss; λ_cls = 1 and λ_rec = 10 are used in all of the StarGAN experiments.
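Continuing the sketch above, Eqs. (4)–(6) could be combined as follows, with λ_cls = 1 and λ_rec = 10 as stated; this reuses the hypothetical helpers from the previous snippet.

```python
def stargan_full_losses(D, G, x_real, c_org, c_trg, lambda_cls=1.0, lambda_rec=10.0):
    """Combine Eqs. (4)-(6); stargan_d_losses / stargan_g_losses are the sketches above."""
    # Eq. (5): L_D = -L_adv + lambda_cls * L^r_cls
    # (d_adv below is already the minimization form of -L_adv)
    d_adv, d_cls_real = stargan_d_losses(D, G, x_real, c_org, c_trg)
    loss_D = d_adv + lambda_cls * d_cls_real

    # Eq. (4): cycle-consistency reconstruction, translating to d' and back to d
    x_fake = G(x_real, c_org, c_trg)
    x_rec = G(x_fake, c_trg, c_org)
    loss_rec = (x_real - x_rec).abs().mean()

    # Eq. (6): L_G = L_adv + lambda_cls * L^f_cls + lambda_rec * L_rec
    g_adv, g_cls_fake = stargan_g_losses(D, G, x_real, c_org, c_trg)
    loss_G = g_adv + lambda_cls * g_cls_fake + lambda_rec * loss_rec
    return loss_D, loss_G
```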
3.2. Training with Multiple Datasets
An important advantage of StarGAN is that it simultaneously incorporates multiple datasets containing different types of labels, so that it can control all the labels at test time. An issue when learning from multiple datasets, however, is that the label information is only partially known to each dataset. In the case of CelebA [19] and RaFD [13], the former contains labels for attributes such as hair color and gender but no labels for facial expressions such as 'happy' or 'angry', and vice versa for the latter. This is problematic because the complete information on the label vector c is required when reconstructing the input image x from the translated image G(x, c) (see Eq. (4)).
Mask Vector. To alleviate this problem, StarGAN introduces a mask vector m that allows the model to ignore unspecified labels and focus on the explicitly known label provided by a particular dataset. An n-dimensional one-hot vector represents m, with n being the number of datasets, and a unified version of the label is defined as
$\tilde{c} = [c_1, \ldots, c_n, m]$  (7)
where [·] refers to concatenation and c_i represents a vector for the labels of the i-th dataset. The known label c_i can be represented as either a binary vector for binary attributes or a one-hot vector for categorical attributes; for the remaining n−1 unknown labels, zero values are simply assigned. In the StarGAN experiments, the CelebA and RaFD datasets are used, so n is two.
Training Strategy. When training StarGAN with multiple datasets, the domain label c̃ defined in Eq. (7) is used as input to the generator. By doing so, the generator learns to ignore the unspecified labels, which are zero vectors, and to focus on the explicitly given label. The structure of the generator is exactly the same as in training with a single dataset, except for the dimension of the input label c̃. The auxiliary classifier of the discriminator is extended to produce probability distributions over the labels of all datasets, and the model is trained in a multi-task learning setting where the discriminator minimizes only the classification error associated with the known label. For example, when training with CelebA images, the discriminator minimizes only the classification errors related to CelebA attributes, not to RaFD facial expressions. By alternating between CelebA and RaFD, the discriminator learns the discriminative features for both datasets, and the generator learns to control all the labels in both datasets.
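A small sketch of the unified label vector of Eq. (7) for the two-dataset CelebA + RaFD case; the label dimensions are illustrative assumptions.

```python
import torch

def unified_label(c_celeba=None, c_rafd=None, dim_celeba=5, dim_rafd=8):
    """Build c_tilde = [c_1, c_2, m] (Eq. 7) for n = 2 datasets.
    Unknown labels are zero vectors; m is a one-hot mask over datasets."""
    assert (c_celeba is None) != (c_rafd is None), "exactly one dataset label must be known"
    c1 = c_celeba if c_celeba is not None else torch.zeros(dim_celeba)
    c2 = c_rafd if c_rafd is not None else torch.zeros(dim_rafd)
    m = torch.tensor([1.0, 0.0]) if c_celeba is not None else torch.tensor([0.0, 1.0])
    return torch.cat([c1, c2, m])        # the concatenation [.] in Eq. (7)

# Example: a CelebA image with 5 binary attribute labels (dimensions are illustrative)
c_tilde = unified_label(c_celeba=torch.tensor([1., 0., 1., 0., 1.]))
```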
• 7. Domain Invariant Representation Learning with Domain Density Transformations
Generative Adversarial Networks. GANs [3] have shown remarkable results in various computer vision tasks such as image generation, image-to-image translation [7, 9, 33], super-resolution imaging [14], and face image synthesis [10, 16, 26, 31]. A typical GAN model consists of two modules: a discriminator and a generator. The discriminator learns to distinguish real from fake samples, while the generator learns to generate fake samples that are indistinguishable from real ones; the adversarial loss is what makes the generated images as realistic as possible.
Conditional GANs. GAN-based conditional image generation has also been actively studied. CycleGAN [33] and DiscoGAN [9] preserve key attributes between the input and the translated image by utilizing a cycle consistency loss, but these frameworks can only learn the relation between two domains at a time, so their scalability is limited: a different model must be trained for each pair of domains. StarGAN, in contrast, learns the mappings among multiple domains with a single generator that takes both an image and domain information as input; during training, a target domain label is randomly sampled and the model learns to flexibly translate the input image into that domain.
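The method above feeds the original domain d, the destination domain d′ and the image x to the generator; a common way to do this for a convolutional generator (and the convention used in StarGAN) is to tile the one-hot domain labels spatially and concatenate them with the image channels. A sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def condition_on_domains(x, d_src, d_dst, num_domains):
    """Concatenate one-hot maps for the source and destination domains
    to the image channels, so G sees (x, d, d') as a single tensor."""
    b, _, h, w = x.shape
    src = F.one_hot(d_src, num_domains).float()          # (B, K)
    dst = F.one_hot(d_dst, num_domains).float()          # (B, K)
    labels = torch.cat([src, dst], dim=1)                # (B, 2K)
    label_maps = labels.view(b, -1, 1, 1).expand(b, 2 * num_domains, h, w)
    return torch.cat([x, label_maps], dim=1)             # (B, C + 2K, H, W)

# Example: a batch of 4 RGB images, 3 source domains
x = torch.randn(4, 3, 224, 224)
d_src = torch.tensor([0, 1, 2, 0])
d_dst = torch.tensor([1, 2, 0, 2])
g_input = condition_on_domains(x, d_src, d_dst, num_domains=3)   # shape (4, 9, 224, 224)
```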
6. Experiments / Results
1) Dataset
To evaluate our method, we perform experiments on three datasets that are commonly used in the literature for domain generalization.
Rotated MNIST. In this dataset by Ghifary et al. (2015), 1,000 MNIST images (100 per class) (LeCun & Cortes, 2010) are chosen to form the first domain (denoted M0); rotations of 15°, 30°, 45°, 60° and 75° are then applied to create five additional domains, denoted M15, M30, M45, M60 and M75. The task is classification with ten classes (digits 0 to 9).
PACS (Li et al., 2017) contains 9,991 images from four different domains: art painting, cartoon, photo and sketch. The task is classification with seven classes.
OfficeHome (Venkateswara et al., 2017) has 15,500 images of daily objects from four domains: art, clipart, product and real. There are 65 classes in this classification dataset.
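A sketch of how the six Rotated MNIST domains described above could be constructed; the exact image subset and sampling order are not specified in the text, so they are assumptions here.

```python
import torch
from torchvision import datasets, transforms
from torchvision.transforms import functional as TF

def build_rotated_mnist(root="./data", angles=(0, 15, 30, 45, 60, 75), per_class=100):
    """Pick 100 images per digit (1,000 total) and rotate them to form M0..M75."""
    mnist = datasets.MNIST(root, train=True, download=True,
                           transform=transforms.ToTensor())
    idx, counts = [], {c: 0 for c in range(10)}
    for i, (_, y) in enumerate(mnist):
        if counts[y] < per_class:
            idx.append(i)
            counts[y] += 1
        if len(idx) == 10 * per_class:
            break
    base = [mnist[i] for i in idx]                       # the base domain M0
    domains = {}
    for a in angles:                                     # M0, M15, ..., M75
        domains[f"M{a}"] = [(TF.rotate(x, a), y) for x, y in base]
    return domains
```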
2) Results
Table 1. Rotated MNIST leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation over 5 runs.
Model | M0 | M15 | M30 | M45 | M60 | M75 | Average
HIR (Wang et al., 2020) | 90.34 | 99.75 | 99.40 | 96.17 | 99.25 | 91.26 | 96.03
DIVA (Ilse et al., 2020) | 93.5 | 99.3 | 99.1 | 99.2 | 99.3 | 93.0 | 97.2
DGER (Zhao et al., 2020) | 90.09 | 99.24 | 99.27 | 99.31 | 99.45 | 90.81 | 96.36
DA (Ganin et al., 2016) | 86.7 | 98.0 | 97.8 | 97.4 | 96.9 | 89.1 | 94.3
LG (Shankar et al., 2018) | 89.7 | 97.8 | 98.0 | 97.1 | 96.6 | 92.1 | 95.3
HEX (Wang et al., 2019) | 90.1 | 98.9 | 98.9 | 98.8 | 98.3 | 90.0 | 95.8
ADV (Wang et al., 2019) | 89.9 | 98.6 | 98.8 | 98.7 | 98.6 | 90.4 | 95.2
DIR-GAN (ours) | 97.2 (±0.3) | 99.4 (±0.1) | 99.3 (±0.1) | 99.3 (±0.1) | 99.2 (±0.1) | 97.1 (±0.3) | 98.6

Table 2. PACS leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation over 5 runs.
Model | Backbone | Art Painting | Cartoon | Photo | Sketch | Average
DGER (Zhao et al., 2020) | Resnet18 | 80.70 | 76.40 | 96.65 | 71.77 | 81.38
JiGen (Carlucci et al., 2019) | Resnet18 | 79.42 | 75.25 | 96.03 | 71.35 | 79.14
MLDG (Li et al., 2018a) | Resnet18 | 79.50 | 77.30 | 94.30 | 71.50 | 80.70
MetaReg (Balaji et al., 2018) | Resnet18 | 83.70 | 77.20 | 95.50 | 70.40 | 81.70
CSD (Piratla et al., 2020) | Resnet18 | 78.90 | 75.80 | 94.10 | 76.70 | 81.40
DMG (Chattopadhyay et al., 2020) | Resnet18 | 76.90 | 80.38 | 93.35 | 75.21 | 81.46
DIR-GAN (ours) | Resnet18 | 82.56 (±0.4) | 76.37 (±0.3) | 95.65 (±0.5) | 79.89 (±0.2) | 83.62
Table 3. OfficeHome leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation over 5 runs.
Model | Backbone | Art | ClipArt | Product | Real | Average
D-SAM (D'Innocente & Caputo, 2018) | Resnet18 | 58.03 | 44.37 | 69.22 | 71.45 | 60.77
JiGen (Carlucci et al., 2019) | Resnet18 | 53.04 | 47.51 | 71.47 | 72.79 | 61.20
DIR-GAN (ours) | Resnet18 | 56.69 (±0.4) | 50.49 (±0.2) | 71.32 (±0.4) | 74.23 (±0.5) | 63.18

5.2. Experimental Setting
For all datasets, we perform "leave-one-domain-out" experiments: we choose one domain as the target domain, train the model on all remaining domains, and evaluate it on the chosen domain.
Following standard practice, we use 90% of the available data as training data and 10% as validation data, except for the Rotated MNIST experiment, where we do not use a validation set and simply report the performance of the last epoch.
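A sketch of the leave-one-domain-out protocol with the 90/10 split; train_dir_gan and evaluate are hypothetical placeholders for the actual training and evaluation routines.

```python
import random

def leave_one_domain_out(domains, train_dir_gan, evaluate, val_frac=0.1, seed=0):
    """domains: dict mapping a domain name to a list of (x, y) samples."""
    results = {}
    rng = random.Random(seed)
    for target in domains:
        sources = {d: list(s) for d, s in domains.items() if d != target}
        train_split, val_split = {}, {}
        for d, samples in sources.items():
            rng.shuffle(samples)
            n_val = int(val_frac * len(samples))
            val_split[d] = samples[:n_val]        # 10% held out for validation
            train_split[d] = samples[n_val:]      # 90% used for training
        model = train_dir_gan(train_split, val_split)
        results[target] = evaluate(model, domains[target])
    return results
```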
For the Rotated MNIST dataset, we use a network of two 3×3 convolutional layers and a fully connected layer as the representation network g_θ, giving a representation z of 64 dimensions. A single linear layer then maps the representation z to the ten output classes. This architecture is the deterministic version of the network used by Ilse et al. (2020). We train the network for 500 epochs with the Adam optimizer (Kingma & Ba, 2014), using learning rate 0.001 and minibatch size 64, and report performance on the test domain after the last epoch.
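A sketch of this Rotated MNIST setup; the channel widths, pooling and activations are not given in the text and are assumptions.

```python
import torch
import torch.nn as nn

class MnistRepresentation(nn.Module):
    """g_theta for Rotated MNIST: two 3x3 conv layers + one FC layer -> 64-dim z."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 7 * 7, z_dim)

    def forward(self, x):                 # x: (B, 1, 28, 28)
        return self.fc(self.features(x).flatten(1))

g_theta = MnistRepresentation()
clf = nn.Linear(64, 10)                   # single linear layer to the ten classes
optimizer = torch.optim.Adam(list(g_theta.parameters()) + list(clf.parameters()), lr=0.001)
```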
For the PACS and OfficeHome datasets, we use a Resnet18 (He et al., 2016) network as the representation network g_θ. As is standard practice, the Resnet18 backbone is pre-trained on ImageNet.
We replace the last fully connected layer of the Resnet with a linear layer of dimensions (512, 256), so that our representation has 256 dimensions. As in the Rotated MNIST experiment, a single linear layer maps the representation z to the output. We train the network for 100 epochs with plain stochastic gradient descent (SGD), using learning rate 0.001, momentum 0.9, minibatch size 64, and weight decay 0.001.
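A sketch of the PACS/OfficeHome setup described above, using torchvision's Resnet18 with its final layer replaced by a (512, 256) linear layer and the stated SGD hyper-parameters.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 7                                   # 7 for PACS, 65 for OfficeHome
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)  # ImageNet-pretrained
backbone.fc = nn.Linear(512, 256)                 # representation z has 256 dimensions
clf = nn.Linear(256, num_classes)                 # single linear layer on top of z

optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(clf.parameters()),
    lr=0.001, momentum=0.9, weight_decay=0.001)
```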
Data augmentation is also standard practice for real-world computer vision datasets like PACS and OfficeHome; during training we augment the data as follows: crops of random size and aspect ratio resized to 224 × 224 pixels, random horizontal flips, random color jitter, random conversion of the image tile to grayscale with 10% probability, and normalization using the ImageNet channel means and standard deviations.
The StarGAN (Choi et al., 2018) model implementation is taken from the authors' original source code with no significant modifications. For each set of source domains, we train the StarGAN model for 100,000 iterations with a minibatch of 16 images per iteration. The code for all of our experiments will be released for reproducibility; please also refer to the source code for any other architecture and implementation details.
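A sketch of the augmentation pipeline described earlier in this section with torchvision; the color-jitter strengths and crop scale range are not specified and are placeholders.

```python
from torchvision import transforms

imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                  # crops of random size/aspect ratio -> 224x224
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),         # random color jitter (strengths assumed)
    transforms.RandomGrayscale(p=0.1),                  # grayscale with 10% probability
    transforms.ToTensor(),
    transforms.Normalize(imagenet_mean, imagenet_std),  # ImageNet channel statistics
])
```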
• 8. Domain Invariant Representation Learning with Domain Density Transformations
3) Visualization of Representation
Figure 4. Visualization of the representation space. Each point indicates a representation z of an image x in the two-dimensional space, and its color indicates the label y. The two left figures are for our method DIR-GAN and the two right figures are for the naive model DeepAll.
With DIR-GAN, the representations from different domains are aligned both in the marginal distribution (the general distribution of the points) and in the conditional distribution (for example, the distributions of the blue points and the green points).
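The extracted text does not say how the representations are projected to two dimensions for Figure 4; a t-SNE projection colored by class label is one plausible way to produce such a plot. A hedged sketch:

```python
import matplotlib.pyplot as plt
import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def plot_representations(g_theta, loader, title="representation space"):
    """Embed all images with g_theta, project to 2D (projection method assumed), color by label."""
    zs, ys = [], []
    for x, y in loader:
        zs.append(g_theta(x))
        ys.append(y)
    z = torch.cat(zs).numpy()
    y = torch.cat(ys).numpy()
    z2d = TSNE(n_components=2, init="pca").fit_transform(z)
    plt.scatter(z2d[:, 0], z2d[:, 1], c=y, s=5, cmap="tab10")
    plt.title(title)
    plt.show()
```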
PACS and OfficeHome. To the best of our knowledge, domain invariant representation learning methods have not been applied widely and successfully for real-world computer vision datasets (e.g., PACS and OfficeHome) with very deep neural networks such as Resnet, so the only rel-