A curated list of GAN variants that provided insights to the community (GANs, Improved GANs, DCGAN, Unrolled GAN, InfoGAN, f-GAN, EBGAN, WGAN).
After a short introduction to GANs, we review the remaining difficulties of standard GANs and the interim fixes proposed for them (Improved GANs). The slides then walk through approaches that attack these problems in different ways: careful architecture selection (DCGAN), a small change to the update rule (Unrolled GAN), an additional constraint (InfoGAN), generalization of the loss function to various divergences (f-GAN), a new energy-based framework (EBGAN), and a further generalization of the loss function (WGAN).
[CVPR2020] Simple but effective image enhancement techniques - JaeJun Yoo
The document discusses several image enhancement techniques:
1. WCT2, which uses wavelet transforms for photorealistic style transfer, achieving faster and lighter models than previous techniques.
2. CutBlur, a new data augmentation method for super-resolution and other low-level vision tasks that cuts a patch and pastes it between the low-resolution and high-resolution versions of an image (see the sketch after this list).
3. SimUSR, a simple but strong baseline for unsupervised super-resolution that achieves state-of-the-art results using only a single low-resolution image during training.
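The CutBlur idea fits in a few lines. Below is a minimal sketch (my own illustration, not the authors' released implementation) that assumes the low-resolution input has already been upsampled to the high-resolution size; the function name `cutblur` and the patch-size ratio `alpha` are illustrative.

```python
import numpy as np

def cutblur(hr, lr_up, alpha=0.7, rng=None):
    """Sketch of CutBlur: swap a random rectangular patch between the HR image
    and its upsampled LR counterpart (hr, lr_up: float arrays, shape (H, W, C))."""
    rng = rng or np.random.default_rng()
    h, w = hr.shape[:2]
    ch, cw = int(h * alpha), int(w * alpha)                    # patch size
    cy, cx = rng.integers(0, h - ch + 1), rng.integers(0, w - cw + 1)
    out = lr_up.copy()
    if rng.random() < 0.5:
        out[cy:cy+ch, cx:cx+cw] = hr[cy:cy+ch, cx:cx+cw]       # HR patch into LR
    else:
        out = hr.copy()
        out[cy:cy+ch, cx:cx+cw] = lr_up[cy:cy+ch, cx:cx+cw]    # LR patch into HR
    return out  # used as the network input in place of lr_up
```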
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be... - Jinwon Lee
This is the 258th paper review from PR12, the TensorFlow Korea paper reading group.
This paper is From ImageNet to Image Classification: Contextualizing Progress on Benchmarks, from MIT.
Anyone doing deep learning knows ImageNet. This paper discusses the limitations and problems of ImageNet's labeling procedure and points out that evaluation based on top-1 accuracy can also be problematic.
More than 20% of ImageNet images contain multiple objects, yet only one of them is accepted as the ground truth, and due to the limitations of the annotation process many images are labeled with a class that differs from what a person would actually choose. There are also many labels that are hard to judge without expert knowledge, such as the more than 20 kinds of terrier. Through a variety of experiments, the paper combines quantitative analysis with a human-in-the-loop evaluation to assess how far current models have come and which data-labeling challenges must be solved to push performance further. The paper is fairly long but contains little heavy technical material, so it is an easy read; if you are curious about the details, please see the video!
Paper link: https://arxiv.org/abs/2005.11295
Presentation video: https://youtu.be/CPMgX5ikL_8
This paper proposes AmbientGAN, which trains a generative adversarial network using partial or noisy observations rather than fully observed samples. AmbientGAN trains the discriminator on the measurement domain rather than the raw data domain, allowing the generator to be trained without needing large amounts of good training data. The paper proves it is theoretically possible to recover the original data distribution even when the measurement process is not invertible. It presents experimental results showing AmbientGAN can generate high quality samples and recover the underlying data distribution from various types of lossy and noisy measurements.
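A rough sketch of that training loop follows, with additive Gaussian noise as a stand-in measurement operator; `G`, `D` (one logit per sample), `z_dim`, and the optimizers are assumed to be user-defined. This illustrates the idea, not the paper's code.

```python
import torch
import torch.nn as nn

def measure(x):
    # Stand-in measurement operator f: AmbientGAN only needs to *simulate* f,
    # not invert it (here: additive Gaussian noise).
    return x + 0.2 * torch.randn_like(x)

def ambient_gan_step(G, D, x_measured, z_dim, opt_g, opt_d):
    """One update in which the discriminator only ever sees measurements."""
    bce = nn.BCEWithLogitsLoss()
    n = x_measured.size(0)
    real, fake = torch.ones(n, 1), torch.zeros(n, 1)
    y_fake = measure(G(torch.randn(n, z_dim)))     # measure the generated sample
    opt_d.zero_grad()
    d_loss = bce(D(x_measured), real) + bce(D(y_fake.detach()), fake)
    d_loss.backward(); opt_d.step()
    opt_g.zero_grad()
    g_loss = bce(D(y_fake), real)                  # fool D in measurement space
    g_loss.backward(); opt_g.step()
```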
[PR12] understanding deep learning requires rethinking generalization - JaeJun Yoo
The document discusses a paper that argues traditional theories of generalization may not fully explain why large neural networks generalize well in practice. It summarizes the paper's key points:
1) The paper shows neural networks can easily fit random labels, calling into question traditional measures of complexity (see the sketch after this list).
2) Regularization helps but is not the fundamental reason for generalization. Neural networks have sufficient capacity to memorize data.
3) Implicit biases in algorithms like SGD may better explain generalization by driving solutions toward minimum norm.
4) The paper suggests rethinking generalization as the effective capacity of neural networks may differ from theoretical measures. Understanding finite sample expressivity is important.
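Point 1's randomization test is straightforward to reproduce. Below is a minimal sketch (assuming CIFAR-10 via torchvision; any sufficiently large standard model works): scramble the labels and watch training accuracy still approach 100%.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Randomization test: replace every CIFAR-10 label with uniform noise. A
# sufficiently large network still drives *training* accuracy toward 100%,
# even though generalization is impossible by construction.
train = datasets.CIFAR10("./data", train=True, download=True,
                         transform=transforms.ToTensor())
train.targets = torch.randint(0, 10, (len(train.targets),)).tolist()
loader = DataLoader(train, batch_size=128, shuffle=True)
# ...train any standard model on `loader` and monitor training accuracy.
```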
Super resolution in deep learning era - Jaejun Yoo
1) The document discusses super-resolution techniques in deep learning, including inverse problems, image restoration problems, and different deep learning models.
2) Early models like SRCNN used convolutional networks for super-resolution but were shallow, while later models incorporated residual learning (VDSR; see the sketch after this list), recursive learning (DRCN), and became very deep and dense (SRResNet).
3) Key developments included EDSR which provided a strong backbone model and GAN-based approaches like SRGAN which aimed to generate more realistic textures but require new evaluation metrics.
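The residual-learning idea behind VDSR fits in a few lines. The sketch below is an illustration under my own simplifications (not the paper's exact 20-layer configuration): the network predicts only the residual between the bicubic-upsampled input and the high-resolution target.

```python
import torch.nn as nn

class VDSRLike(nn.Module):
    """Residual learning for SR: the network predicts only the difference
    between the bicubic-upsampled input and the HR target (global skip)."""
    def __init__(self, depth=8, channels=64):
        super().__init__()
        layers = [nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(channels, 3, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, x_up):            # x_up: bicubic-upsampled LR image
        return x_up + self.body(x_up)   # add the predicted residual back
```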
Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Unce... - MLAI2
Numerous recent works utilize bi-Lipschitz regularization of neural network layers to preserve relative distances between data instances in the feature spaces of each layer. This distance sensitivity with respect to the data aids in tasks such as uncertainty calibration and out-of-distribution (OOD) detection. In previous works, features extracted with a distance sensitive model are used to construct feature covariance matrices which are used in deterministic uncertainty estimation or OOD detection. However, in cases where there is a distribution over tasks, these methods result in covariances which are sub-optimal, as they may not leverage all of the meta information which can be shared among tasks. With the use of an attentive set encoder, we propose to meta learn either diagonal or diagonal plus low-rank factors to efficiently construct task specific covariance matrices. Additionally, we propose an inference procedure which utilizes scaled energy to achieve a final predictive distribution which is well calibrated under a distributional dataset shift.
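The "diagonal plus low-rank" construction the abstract refers to can be written out directly. In this sketch, `d` and `U` stand in for outputs of the attentive set encoder; the shapes and values are illustrative, not the paper's.

```python
import torch

# Diagonal-plus-low-rank covariance: Sigma = diag(d) + U @ U.T.
# For d > 0 this is positive definite, needs O(p*r) parameters instead of
# O(p^2), and is cheap to construct per task from meta-learned factors.
p, r = 16, 3
d = torch.rand(p) + 0.1   # task-specific diagonal (e.g., from the set encoder)
U = torch.randn(p, r)     # task-specific low-rank factor
Sigma = torch.diag(d) + U @ U.t()
```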
Introduction to Interpretable Machine Learning - Nguyen Giang
This document discusses interpretable machine learning and explainable AI. It begins with definitions of key terms and an overview of interpretable methods. Deep learning models are often treated as "black boxes" that are difficult to interpret. Interpretability can be achieved by using inherently interpretable models like linear models or decision trees, adding attention mechanisms, or interpreting models before, during or after building them. Later sections discuss specific interpretable techniques like understanding data through examples, MMD-Critic for learning prototypes and criticisms, and visualizing convolutional neural networks to understand predictions. The document emphasizes the importance of interpretability and explains several approaches to make machine learning models more transparent to humans.
Efficient Neural Network Architecture for Image Classification - Yogendra Tamang
The document outlines the objectives, methodology, and work accomplished for a project involving designing an efficient convolutional neural network architecture for image classification. The objectives were to classify images using CNNs and design an effective CNN architecture. The methodology involved designing convolution and pooling layers, and using gradient descent to train the network. Work accomplished included GPU configuration, designing CNN architectures for CIFAR-10 and MNIST datasets, and tracking training loss, validation loss, and accuracy over epochs.
Online Coreset Selection for Rehearsal-based Continual Learning - MLAI2
A dataset is a shred of crucial evidence to describe a task. However, each data point in the dataset does not have the same potential, as some of the data points can be more representative or informative than others. This unequal importance among the data points may have a large impact in rehearsal-based continual learning, where we store a subset of the training examples (coreset) to be replayed later to alleviate catastrophic forgetting. In continual learning, the quality of the samples stored in the coreset directly affects the model's effectiveness and efficiency. The coreset selection problem becomes even more important under realistic settings, such as imbalanced continual learning or noisy data scenarios. To tackle this problem, we propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration and trains them in an online manner. Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting. We validate the effectiveness of our coreset selection mechanism over various standard, imbalanced, and noisy datasets against strong continual learning baselines, demonstrating that it improves task adaptation and prevents catastrophic forgetting in a sample-efficient manner.
Representational Continuity for Unsupervised Continual Learning - MLAI2
Continual learning (CL) aims to learn a sequence of tasks without forgetting the previously acquired knowledge. However, recent CL advances are restricted to supervised continual learning (SCL) scenarios. Consequently, they are not scalable to real-world applications where the data distribution is often biased and unannotated. In this work, we focus on unsupervised continual learning (UCL), where we learn the feature representations on an unlabelled sequence of tasks and show that reliance on annotated data is not necessary for continual learning. We conduct a systematic study analyzing the learned feature representations and show that unsupervised visual representations are surprisingly more robust to catastrophic forgetting, consistently achieve better performance, and generalize better to out-of-distribution tasks than SCL. Furthermore, we find that UCL achieves a smoother loss landscape through qualitative analysis of the learned representations and learns meaningful feature representations. Additionally, we propose Lifelong Unsupervised Mixup (Lump), a simple yet effective technique that interpolates between the current task and previous tasks' instances to alleviate catastrophic forgetting for unsupervised representations.
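The interpolation at the heart of Lump is a single mixup step between current-task and rehearsal instances; a minimal sketch follows (the Beta-distributed mixing coefficient is an assumption carried over from standard mixup).

```python
import torch

def lump_mix(x_current, x_rehearsal, alpha=0.4):
    """Interpolate a current-task batch with a replayed previous-task batch;
    the mixed batch then feeds the unsupervised objective."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * x_current + (1 - lam) * x_rehearsal
```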
This is the 243rd paper review from PR12, the TensorFlow Korea paper reading group.
This paper is Designing Network Design Spaces from Facebook AI Research, known as RegNet.
When designing a CNN, are bottleneck layers really beneficial? Do more layers always mean higher performance? When the width and height of the activation map are halved (stride 2 or pooling), the number of channels is doubled; is that really optimal? Might it be better to have no bottleneck layer at all, is there a magic number of layers that gives peak performance, and when the activations are halved, might tripling the channels instead of doubling them work better?
Rather than designing a single good neural network, this paper is about designing a good design space: a space populated by good networks, from which techniques such as AutoML can then find one. The authors propose narrowing an almost unconstrained design space down to a good one through a human-in-the-loop process. In the video below you can see which design space produced RegNet, which outperforms EfficientNet, and which of the design choices we took for granted turn out to be wrong.
Video link: https://youtu.be/bnbKQRae_u4
Paper link: https://arxiv.org/abs/2003.13678
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
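For reference, the described architecture maps onto a few lines of modern framework code. This is a rough skeleton assuming 3x224x224 inputs; layer sizes are approximate and it omits details such as local response normalization and the paper's two-GPU split.

```python
import torch.nn as nn

# AlexNet-style skeleton: five conv layers (some followed by max-pooling),
# then three fully-connected layers ending in 1000 logits. ReLU is the
# "non-saturating" nonlinearity; dropout regularizes the FC layers.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 5 * 5, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),  # logits; softmax is applied inside the loss
)
```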
In this presentation we discuss the convolution operation, the architecture of a convolutional neural network, and different layers such as pooling. This presentation draws heavily from A. Karpathy's Stanford course CS 231n.
Image classification with Deep Neural Networks - Yogendra Tamang
This document discusses image classification using deep neural networks. It provides background on image classification and convolutional neural networks. The document outlines techniques like activation functions, pooling, dropout and data augmentation to prevent overfitting. It summarizes a paper on ImageNet classification using CNNs with multiple convolutional and fully connected layers. The paper achieved state-of-the-art results on ImageNet in 2010 and 2012 by training CNNs on a large dataset using multiple GPUs.
Explores the type of structure learned by Convolutional Neural Networks, the applications where they're most valuable and a number of appropriate mental models for understanding deep learning.
Speaker: Jeong-Mo Hong (Professor, Dongguk University)
Date: May 2018
The latest machine learning technology, exemplified by deep learning, is opening breakthroughs toward artificial intelligence software across a vast range of applications, and it is expected to play an especially large role in fields related to image processing and computer graphics. This seminar surveys how deep learning is evolving around three-dimensional geometric data and considers its impact on related industries and how they might respond.
Jeong-Mo Hong has been a professor in the Department of Computer Science and Engineering at Dongguk University since 2008. He received his B.S. and M.S. in Mechanical Engineering from KAIST, where, during his master's, he studied what is now called a 4D virtual-reality simulator and developed a ride-on robot simulation game. After receiving his Ph.D. in Computer Science from Korea University for research on fluid simulation for visual effects, he carried out full-scale VFX research on destruction, explosions, and flames as a researcher at Stanford University. He has put great effort into industry-academia collaboration, serving as technical advisor on films such as 'Haeundae', 'Sector 7', and 'Detective Dee 2'. Expanding his research into digital manufacturing, he developed the modeling software 'Lithopia' (리쏘피아), which is used by 3D-printer users and founders around the world. In the process he came to feel the limits of traditional software techniques and is now looking for breakthroughs in modeling and content creation using deep learning and machine learning. He has released a video lecture series, 'Deep Learning in C++', and strives to train advanced software engineers for the era of the Fourth Industrial Revolution by proactively bringing the latest technology into his university courses.
Transformer based approaches for visual representation learning - Ryohei Suzuki
1) Transformer-based approaches for visual representation learning such as Vision Transformers (ViTs) have shown promising performance compared to CNNs on image classification tasks.
2) A pure Transformer architecture pre-trained on a very large dataset like JFT-300M can outperform modern CNNs without any convolutions.
3) Self-supervised pre-training methods like DINO that leverage knowledge distillation have been shown to obtain comparable performance to supervised pre-training of ViTs using only unlabeled ImageNet data.
Learning to learn unlearned feature for segmentation - NAVER Engineering
Meta-learning, an active area of machine learning research, addresses the enormous data requirements that limit conventional gradient-descent-based training, aiming to let a model reach sufficient performance from only a few samples. Among meta-learning techniques, Model-Agnostic Meta-Learning (MAML) showed that, regardless of the target model's architecture, a new gradient-descent-based algorithm can quickly train high-performing models for classification and reinforcement learning tasks. However, MAML does not perform well on tasks with complex network models such as image segmentation. This talk therefore examines MAML-style training methods applicable to segmentation, and in particular introduces a meta-learning technique that can be used when a segmentation network must be fine-tuned, as in re-training or transfer learning. The proposed technique, called active meta-tune, uses an active-learning-based algorithm to order the training data consumed during meta-learning so that structurally complex segmentation tasks, unlike classification, can be learned well. The talk covers the theoretical background on how active learning and meta-learning can be combined, the active meta-tune algorithm, and its practical applications.
AlexNet achieved unprecedented results on the ImageNet dataset by using a deep convolutional neural network with over 60 million parameters. It achieved top-1 and top-5 error rates of 37.5% and 17.0%, significantly outperforming previous methods. The network architecture included 5 convolutional layers, some with max pooling, and 3 fully-connected layers. Key aspects were the use of ReLU activations for faster training, dropout to reduce overfitting, and parallelizing computations across two GPUs. This dramatic improvement demonstrated the potential of deep learning for computer vision tasks.
This document discusses domain transfer and domain adaptation in deep learning. It begins with introductions to domain transfer, which learns a mapping between domains, and domain adaptation, which learns a mapping between domains with labels. It then covers several approaches for domain transfer, including neural style transfer, instance normalization, and GAN-based methods. It also discusses general approaches for domain adaptation such as source/target feature matching and target data augmentation.
This document is a project report submitted by Shubham Jain and Vikas Jain for their course CS676A. The project aims to learn relative attributes associated with face images using the PubFig dataset. Convolutional neural network features and the RankNet model were used to predict attribute rankings. RankNet achieved better performance than RankSVM and GIST features. Zero-shot learning for unseen classes was explored by building probabilistic class models, but performance was poor. Future work could improve the modeling of unseen classes.
Survey on contrastive self-supervised learning - Anirudh Ganguly
Contrastive self-supervised learning aims to group similar images together and dissimilar images apart by randomly augmenting each image and training the model to group originals with their augmentations but not with other images. Pretext tasks generate pseudo-labels for self-supervised learning from data attributes without human labels; major pretext tasks include color/geometric transformations and context-based or cross-modal tasks. Encoders output representations for downstream tasks like classification, and the contrastive loss updates encoder parameters to bring positive samples closer and push negative samples farther apart in latent space.
This document provides an overview of deep feedforward networks. It begins with an example of using a network to solve the XOR problem. It then discusses gradient-based learning and backpropagation. Hidden units with rectified linear activations are commonly used. Deeper networks can more efficiently represent functions and generalize better than shallow networks. Architecture design considerations include width, depth, and number of hidden layers. Backpropagation efficiently computes gradients using the chain rule and dynamic programming.
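The XOR example mentioned above is small enough to run directly. A minimal sketch follows: one hidden layer of rectified linear units, which a purely linear model provably cannot match (with only two hidden units, training may need a couple of random restarts).

```python
import torch
import torch.nn as nn

# XOR: not linearly separable, but solvable with one small ReLU hidden layer.
x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])
net = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
opt = torch.optim.Adam(net.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
for _ in range(2000):
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
print(net(x).detach().round())  # ideally recovers [0, 1, 1, 0]
```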
This document discusses generative adversarial networks (GANs) and their relationship to reinforcement learning. It begins with an introduction to GANs, explaining how they can generate images without explicitly defining a probability distribution by using an adversarial training process. The second half discusses how GANs are related to actor-critic models and inverse reinforcement learning in reinforcement learning. It explains how GANs can be viewed as training a generator to fool a discriminator, similar to how policies are trained in reinforcement learning.
The document proposes a novel domain adaptation method called cyclically disentangled feature translation network (CDFTN) for face anti-spoofing. CDFTN aims to generate pseudo-labeled samples that possess source domain-invariant liveness features and target domain-specific content features, which are disentangled through domain adversarial training. A robust classifier is then trained on the synthetic pseudo-labeled images under the supervision of source domain labels to improve generalization to the target domain. The method is extended to leverage multiple unlabeled target domains by allowing cross-domain transfer of domain-invariant liveness features.
Paper introduction: Learning With Neighbor Consistency for Noisy Labels - Toru Tamaki
Ahmet Iscen, Jack Valmadre, Anurag Arnab, Cordelia Schmid, "Learning With Neighbor Consistency for Noisy Labels" CVPR2022
https://openaccess.thecvf.com/content/CVPR2022/html/Iscen_Learning_With_Neighbor_Consistency_for_Noisy_Labels_CVPR_2022_paper.html
Adversarial Variational Autoencoders to extend and improve generative model -... - Loc Nguyen
Generative artificial intelligence (GenAI) has been developing with many incredible achievements like ChatGPT and Bard. The deep generative model (DGM) is a branch of GenAI that is preeminent at generating raster data such as images and sound, thanks to the strengths of deep neural networks (DNNs) in inference and recognition. The built-in inference mechanism of a DNN, which simulates the synaptic plasticity of human neural networks, fosters the generation ability of a DGM, which produces surprising results with the support of statistical flexibility. Two popular approaches in DGM are Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN). Both VAE and GAN have their own strong points, although they share the same underlying statistical theory as well as an incredible complexity in the hidden layers of the DNN, which acts as an effective encoding/decoding function without a concrete specification. In this research, I try to unify VAE and GAN into a consistent and consolidated model called Adversarial Variational Autoencoders (AVA), in which VAE and GAN complement each other: VAE is a good data generator, encoding data via the excellent ideology of Kullback-Leibler divergence, and GAN is an important method for assessing the reliability of data, i.e., whether it is realistic or fake. In other words, AVA aims to improve the accuracy of generative models, and it also extends the functionality of simple generative models. Methodologically, this research focuses on combining applied mathematical concepts with skillful computer-programming techniques in order to implement and solve complicated problems as simply as possible.
(DL Hacks reading group) How to Train Deep Variational Autoencoders and Probabilistic Lad... - Masahiro Suzuki
This document discusses techniques for training deep variational autoencoders and probabilistic ladder networks. It proposes three advances: 1) Using an inference model similar to ladder networks with multiple stochastic layers, 2) Adding a warm-up period to keep units active early in training, and 3) Using batch normalization. These advances allow training models with up to five stochastic layers and achieve state-of-the-art log-likelihood results on benchmark datasets. The document explains variational autoencoders, probabilistic ladder networks, and how the proposed techniques parameterize the generative and inference models.
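Of the three advances, the warm-up period is the easiest to illustrate: the KL term's weight is annealed from 0 to 1 over the first epochs so stochastic units are not switched off early in training. A minimal sketch (the names and the schedule length are illustrative):

```python
def kl_weight(epoch, warmup_epochs=100):
    """Linear KL warm-up: weight grows 0 -> 1 over the first warmup_epochs."""
    return min(1.0, epoch / warmup_epochs)

# Per-batch objective during training (recon_loss and kl_divergence assumed given):
# loss = recon_loss + kl_weight(epoch) * kl_divergence
```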
Adversarial Variational Autoencoders to extend and improve generative model - Loc Nguyen
Generative artificial intelligence (GenAI) has been developing with many incredible achievements like ChatGPT and Bard. The deep generative model (DGM) is a branch of GenAI that is preeminent at generating raster data such as images and sound, thanks to the strengths of deep neural networks (DNNs) in inference and recognition. The built-in inference mechanism of a DNN, which simulates the synaptic plasticity of human neural networks, fosters the generation ability of a DGM, which produces surprising results with the support of statistical flexibility. Two popular approaches in DGM are Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN). Both VAE and GAN have their own strong points, although they share the same underlying statistical theory as well as an incredible complexity in the hidden layers of the DNN, which acts as an effective encoding/decoding function without a concrete specification. In this research, I try to unify VAE and GAN into a consistent and consolidated model called Adversarial Variational Autoencoders (AVA), in which VAE and GAN complement each other: VAE is a good generator, encoding data via the excellent ideology of Kullback-Leibler divergence, and GAN is an important method for assessing the reliability of data, i.e., whether it is realistic or fake. In other words, AVA aims to improve the accuracy of generative models, and it also extends the functionality of simple generative models. Methodologically, this research focuses on combining applied mathematical concepts with skillful computer-programming techniques in order to implement and solve complicated problems as simply as possible.
This document proposes a novel framework called smooth sparse coding for learning sparse representations of data. It incorporates feature similarity or temporal information present in data sets via non-parametric kernel smoothing. The approach constructs codes that represent neighborhoods of samples rather than individual samples, leading to lower reconstruction error. It also proposes using marginal regression rather than lasso for obtaining sparse codes, providing a dramatic speedup of up to two orders of magnitude without sacrificing accuracy. The document contributes a framework for incorporating domain information into sparse coding, sample complexity results for dictionary learning using smooth sparse coding, an efficient marginal regression training procedure, and successful application to classification tasks with improved accuracy and speed.
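The marginal-regression step the summary credits with the speedup can be sketched as a single matrix product followed by hard thresholding. This is my illustration under assumed details (unit-norm dictionary columns; `s` is the target sparsity), not the paper's exact procedure.

```python
import numpy as np

def marginal_regression_code(D, x, s):
    """Score each atom by its correlation with the signal and keep only the s
    largest-magnitude coefficients: one matrix-vector product, no lasso solve.
    D: (n_features, n_atoms) with unit-norm columns; x: (n_features,)."""
    scores = D.T @ x
    code = np.zeros_like(scores)
    top = np.argsort(np.abs(scores))[-s:]   # indices of the s strongest atoms
    code[top] = scores[top]
    return code
```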
MTH 2001 Project 2 Instructions • Each group must choos....docx - gilpinleeanna
MTH 2001: Project 2
Instructions
• Each group must choose one problem to do, using material from chapter 14 in the textbook.
• Write up a solution including explanations in complete sentences of each step and drawings or computer graphics if helpful. Cite any sources you use and mention how you made any diagrams.
• Write at a level that will be comprehensible to someone who is mathematically competent, but may not have taken Calculus 3. Use calculus, but explain your method in simple terms. Your report should consist of 80-90% explanation and 10-20% equations. If you find yourself with more equations than words, then you do not have nearly enough explanation. See the checklist at the end of this document.
• One person from each group must present the work orally to Naveed or Ali. Presenters must make an appointment. Visit the Calc 3 tab: http://www.fit.edu/mac/group_projects_presentations.php
• Submit written work to the Canvas dropbox for Project 2 by October 7 at 9:55PM. The deadline for the oral presentation is October 7 at 2PM.
Problems
1. You probably studied Newton's method for approximating the roots of a function (i.e. approximating values of x such that f(x) = 0) in Calculus 1:
(1) Guess the solution, x_j.
(2) Find the tangent line of f at x_j:
    y = f'(x_j)(x - x_j) + f(x_j)    (1)
(3) Find the tangent line's x-intercept, call it x_{j+1}:
    0 = f'(x_j)(x_{j+1} - x_j) + f(x_j)
    x_{j+1} f'(x_j) = x_j f'(x_j) - f(x_j)
    x_{j+1} = x_j - f(x_j)/f'(x_j)    (2)
(4) If f(x_{j+1}) is sufficiently close to 0, stop; x_{j+1} is an approximate solution. Otherwise, return to step (2) with x_{j+1} as the guess.
See this animation (http://upload.wikimedia.org/wikipedia/commons/e/e0/NewtonIteration_Ani.gif) for a geometric view of the process. It simply follows the tangent line to the curve at a starting point to its x-intercept, and repeats with this new x value until we (hopefully) find a good approximation of the solution.
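Steps (1)-(4) translate directly into code, with equation (2) as the update rule (the example function here is my own):

```python
def newton(f, fprime, x0, tol=1e-10, max_iter=50):
    """1-D Newton iteration, equation (2): x_{j+1} = x_j - f(x_j)/f'(x_j)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:          # step (4): close enough to a root
            return x
        x = x - fx / fprime(x)     # follow the tangent line to its x-intercept
    return x

# Example: sqrt(2) as the positive root of f(x) = x^2 - 2.
print(newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))  # ~1.41421356
```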
Newton's method can be generalized to two dimensions to approximate the points (x, y) where the surfaces z = f(x,y) and z = g(x,y) simultaneously touch the xy-plane. (In other words, it can approximate solutions to the system of equations f(x,y) = 0 and g(x,y) = 0.) Here, the method is:
(1) Guess the solution (x_j, y_j).
(2) Find the tangent planes to each f and g at this point:
    z = f(x,y) =
    z = g(x,y) =
(3) Find the line of intersection of the planes.
(4) Find the line's xy-intercept, call this point (x_{j+1}, y_{j+1}):
    x_{j+1} =
    y_{j+1} =
(5) If f(x_{j+1}, y_{j+1}) < ε and g(x_{j+1}, y_{j+1}) < ε for some small number ε (error tolerance), stop; (x_{j+1}, y_{j+1}) is an approximate solution. Otherwise, return to step (2) with (x_{j+1}, y_{j+1}) as the guess.
(a) Find equations of the tangent planes for step (2), an equation for their line of intersection for step (3), and find formulas for x_{j+1} and y_{j+1} for step (4).
(b) What assumptions must we make about f and g in order for the method to work? How might the method fail? Explain in words h ...
A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction - Sean Golliher
This document presents a unifying probabilistic perspective for spectral dimensionality reduction methods. It introduces the Maximum Entropy Unfolding (MEU) algorithm as a unified approach that other methods like Local Linear Embedding (LLE) are special cases of. MEU models dimensionality reduction as a density estimation problem with constraints, using a Gaussian random field to represent the density. It also introduces the Acyclic Locally Linear Embedding (ALLE) and Dimensionality Reduction through Regularization of the Inverse covariance in the Log Likelihood (DRILL) algorithms. Experimental results on motion capture and robot navigation data are presented to compare the performance of these methods.
Paper Summary of Disentangling by Factorising (Factor-VAE) - 준식 최
The paper proposes Factor-VAE, which aims to learn disentangled representations in an unsupervised manner. Factor-VAE enhances disentanglement over the β-VAE by encouraging the latent distribution to be factorial (independent across dimensions) using a total correlation penalty. This penalty is optimized using a discriminator network. Experiments on various datasets show that Factor-VAE achieves better disentanglement than β-VAE, as measured by a proposed disentanglement metric, while maintaining good reconstruction quality. Latent traversals qualitatively demonstrate disentangled factors of variation.
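The total correlation penalty hinges on one trick: permuting each latent dimension independently across the batch yields samples from the product of marginals. A minimal sketch of that piece (the discriminator itself and the weight gamma are assumed, as in the summary):

```python
import torch

def permute_dims(z):
    """Shuffle each latent dimension independently across the batch: the
    result approximates a sample from the product of marginals."""
    return torch.stack([z[torch.randperm(z.size(0)), j]
                        for j in range(z.size(1))], dim=1)

# With a discriminator D(z) -> 2 logits for (joint, permuted), the
# density-ratio estimate of total correlation added to the VAE loss is roughly
#   gamma * (D(z)[:, 0] - D(z)[:, 1]).mean()
# while D itself is trained to tell q(z) samples from permute_dims(z) samples.
```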
An Efficient Method of Partitioning High Volumes of Multidimensional Data for... - IJERA Editor
An optimal data partitioning in parallel/distributed implementations of clustering algorithms is a necessary computation, as it ensures independent task completion, fair distribution, fewer affected points, and better and faster merging. Though partitioning using a Kd-Tree is conventionally used in academia, it suffers from performance drops and bias (unequal distribution) as the dimensionality of the data increases, and hence is not suitable for practical use in industry, where dimensionality can be of the order of 100s to 1000s. To address these issues we propose two new partitioning techniques using existing mathematical models and study their feasibility, performance (bias and partitioning speed), and possible variants in choosing initial seeds. The first method uses an n-dimensional hashed-grid-based approach, which maps the points in space to a set of cubes that hash the points. The second method uses a tree of Voronoi planes, where each plane corresponds to a partition. We found that the grid-based approach was computationally impractical, while a tree of Voronoi planes (using scalable K-Means++ initial seeds) drastically outperformed the Kd-Tree method as dimensionality increased.
[BMVC 2022] DA-CIL: Towards Domain Adaptive Class-Incremental 3D Object Detec... - Ziyuan Zhao
This document proposes a new domain adaptive class-incremental learning (DA-CIL) paradigm for 3D object detection. The method uses dual-domain copy-paste data augmentation to address data scarcity and domain shifts. It also employs dual-teacher knowledge distillation with multi-level consistency regularization between domains. Experimental results on the ScanNet and SUN RGB-D datasets show the method outperforms other class-incremental learning and domain adaptation baselines, and ablation studies validate the contributions of the dual-domain augmentation and consistency losses.
Web image annotation by diffusion maps manifold learning algorithm - ijfcstjournal
Automatic image annotation is one of the most challenging problems in machine vision. The goal is to automatically predict keywords for images captured in real data. Many methods rely on visual features to compute similarities between image samples, but their computational cost is very high and they require many training samples to be stored in memory. To lessen this burden, a number of techniques have been developed to reduce the number of features in a dataset. Manifold learning is a popular approach to nonlinear dimensionality reduction. In this paper, we investigate the diffusion maps manifold learning method for the web-image auto-annotation task, using it to reduce the dimension of several visual features. Extensive experiments and analysis on the NUS-WIDE-LITE web image dataset with different visual features show how this manifold-learning dimensionality-reduction method can be applied effectively to image annotation.
I did this work for the Fields Institute Machine Learning Graduate Course. It covers the basics of adversarial domain adaptation and the mathematical formulation behind it, such as the use of domain divergence, and how to implement it using a neural network. It also covers the subsequent development of GANs from the idea of adversarial learning, including descriptions of CoGAN and CyCADA.
Slides were prepared by referring to the text Machine Learning by Tom M. Mitchell (McGraw Hill, Indian Edition) and to video tutorials on NPTEL.
A simple framework for contrastive learning of visual representations - Devansh16
Link: https://machine-learning-made-simple.medium.com/learnings-from-simclr-a-framework-contrastive-learning-for-visual-representations-6c145a5d8e99
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.
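The contrastive loss at the center of SimCLR (NT-Xent) is compact; here is a minimal sketch for a batch of N images seen under two augmentations (my illustration of the standard formulation, not the authors' released code):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent sketch. z1, z2: (N, d) projections of two augmented views of
    the same N images; the positive for view i is its counterpart at i +/- N."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)        # 2N unit vectors
    sim = z @ z.t() / tau                              # scaled cosine similarity
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)               # positives vs. all others
```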
Comments: ICML 2020. Code and pretrained models at this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Cite as: arXiv:2002.05709v3 [cs.LG]
GRAPH BASED LOCAL RECODING FOR DATA ANONYMIZATION - ijdms
This document summarizes a research paper that proposes a new local recoding approach for data anonymization based on minimum spanning tree partitioning. The approach aims to achieve k-anonymity while minimizing information loss. It involves constructing a minimum spanning tree using distances between data points based on attribute hierarchies, removing edges to form initial clusters, and generating equivalence classes that satisfy the anonymity requirement k. Experiments showed the proposed local recoding framework produced better quality anonymized tables than existing global recoding and clustering approaches.
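The core step, cutting heavy edges out of a minimum spanning tree to form initial clusters, can be sketched with SciPy. This is an assumed reading of the summary; the subsequent merging until every cluster reaches size k is omitted.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_partition(dist, n_cuts):
    """Cut the n_cuts (>= 1) heaviest edges of the MST over a dense pairwise
    distance matrix `dist` and return a cluster label for each point."""
    mst = minimum_spanning_tree(csr_matrix(dist)).toarray()
    threshold = np.sort(mst[mst > 0])[-n_cuts]  # n_cuts-th heaviest edge weight
    mst[mst >= threshold] = 0                   # remove the heaviest edges
    _, labels = connected_components(csr_matrix(mst), directed=False)
    return labels
```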
Lucas Theis - Compressing Images with Neural Networks - Creative AI meetup - Luba Elliott
This talk by Lucas Theis from Twitter/Magic Pony on "Compressing Images with Neural Networks" was presented at the Learning Image Representations event on 30th August at Twitter as part of the Creative AI meetup.
Similar to Domain Invariant Representation Learning with Domain Density Transformations
Super tickets in pre-trained language models - HyunKyu Jeon
This document discusses finding "super tickets" in pre-trained language models through pruning attention heads and feedforward layers. It shows that lightly pruning BERT models can improve generalization without degrading accuracy (phase transition phenomenon). The authors propose a new pruning approach for multi-task fine-tuning of language models called "ticket sharing" where pruned weights are shared across tasks. Experiments on GLUE benchmarks show their proposed super ticket and ticket sharing methods consistently outperform unpruned baselines, with more significant gains on smaller tasks. Analysis indicates pruning reduces model variance and some tasks share more task-specific knowledge than others.
Synthesizer: rethinking self-attention for transformer models - HyunKyu Jeon
The document simply thanks the reader for taking the time to listen; it provides no further details, context, or information.
This document summarizes Meta Back-Translation, a method for improving back-translation by training the backward model to directly optimize the performance of the forward model during training. The key points are:
1. Back-translation typically relies on a fixed backward model, which can lead the forward model to overfit to its outputs. Meta back-translation instead continually trains the backward model to generate pseudo-parallel data that improves the forward model.
2. Experiments show Meta back-translation generates translations with fewer pathological outputs like greatly differing in length from references. It also avoids both overfitting and underfitting of the forward model by flexibly controlling the diversity of pseudo-parallel data.
3. Related work leverages mon
Maxmin Q-learning: controlling the estimation bias of Q-learning - HyunKyu Jeon
This document summarizes the Maxmin Q-learning paper published at ICLR 2020. Maxmin Q-learning aims to address the overestimation bias of Q-learning and underestimation bias of Double Q-learning by maintaining multiple Q-functions and using the minimum value across them for the target in the Q-learning update. It defines the action selection and target construction for the update based on taking the maximum over the minimum Q-value for each action. The algorithm initializes multiple Q-functions, selects a random subset to update using the maxmin target constructed from the minimum Q-values. This approach reduces the biases seen in prior methods.
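The target construction described above reduces to a min over the N estimates followed by a max over actions. A tabular sketch (array shapes are assumptions for illustration):

```python
import numpy as np

def maxmin_target(Qs, s_next, r, gamma):
    """Maxmin Q-learning target. Qs: list of N Q-tables, each of shape
    (n_states, n_actions). Taking the element-wise min over the N estimates
    before maxing over actions tempers Q-learning's overestimation bias."""
    q_min = np.min([Q[s_next] for Q in Qs], axis=0)  # min over the N Q-functions
    return r + gamma * q_min.max()                    # greedy w.r.t. the min
```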
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Domain Invariant Representation Learning with Domain Density Transformations
1. Domain Invariant Representation Learning with Domain Density Transformations
A. Tuan Nguyen, Toan Tran, Yarin Gal, Atılım Güneş Baydin, arXiv:2102.05082
PR-320, Presented by Eddie
2. Domain Invariant Representation Learning with Domain Density Transformations
1. Domain Generalization
2. Domain Generalization vs. Domain Adaptation
Domain generalization is a learning approach that aims to build a model that is not tied to any particular domain, so that it can cope with previously unseen (out-of-distribution) domains.
The difference is that domain adaptation can extract information from unlabeled data of the target domain, whereas domain generalization cannot.
Train on painting data from the Baroque period; test on painting data from the Modern period.
(Figure: the model trained on Baroque paintings predicts "Caravaggio"; what will it predict on the Modern domain?)
Thanh-Dat Truong, et al., Recognition in Unseen Domains: Domain Generalization via Universal Non-volume Preserving Models
"It is a challenging problem because the model must make predictions without any information about the target domain."
3. Domain Invariant Representation Learning with Domain Density Transformations
Definition 1 (Marginal Distribution Alignment). The representation z is said to satisfy the marginal distribution alignment condition if p(z|d) is invariant w.r.t. d.
Definition 2 (Conditional Distribution Alignment). The representation z is said to satisfy the conditional distribution alignment condition if p(y|z, d) is invariant w.r.t. d.
3. [Domain Invariance] Marginal and Conditional Alignment
4. Proposed Method
The proposed learning objective (Eq. 12 of the paper):

E_d[ E_{p(x,y|d)}[ l(y, gθ(x)) + E_{d'}[ ||gθ(x) − gθ(f_{d,d'}(x))||²₂ ] ] ]

Notation:
- X: the data space; Y: the label space; Z: the representation space; D: the space of possible domains; x ∈ X, y ∈ Y, z ∈ Z, d, d' ∈ D.
- gθ: X → Z, the function that maps an input x to a representation z (domain representation function).
- l(y, gθ(x)): the loss between the prediction and the label.
- d: a domain; d': another domain.
- gθ(x): the prediction; gθ(f_{d,d'}(x)): the prediction after the domain transformation.
- f_{d,d'}: the function that transforms an input x from domain d to domain d' (density transformation function).

Given a set of K source domains Ds = {d1, d2, ..., dK}, the objective becomes (Eq. 13):

E_{d,d'∈Ds, p(x,y|d)}[ l(y, gθ(x)) + ||gθ(x) − gθ(f_{d,d'}(x))||²₂ ]
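As a concrete reading of Eq. 13, the following PyTorch-style sketch computes the two terms for one minibatch. All names here (dir_loss, f_transform, the weight argument) are illustrative assumptions, not the authors' code; Eq. 13 itself weights the two terms equally, which corresponds to weight = 1.

```python
import torch
import torch.nn.functional as F

def dir_loss(g_theta, classifier, f_transform, x, y, d, d_prime, weight=1.0):
    """Minimal sketch of the objective in Eq. 13 for one minibatch.

    g_theta:     representation network, g_theta(x) -> z
    classifier:  predicts class logits from z (the classifier h on top of g_theta)
    f_transform: plays the role of f_{d,d'}, a density transformation between
                 domains (e.g. a separately trained image-to-image generator)
    """
    z = g_theta(x)                                   # representation of the input
    pred_loss = F.cross_entropy(classifier(z), y)    # l(y, g_theta(x))

    with torch.no_grad():                            # f is fixed while training g_theta
        x_t = f_transform(x, d, d_prime)             # f_{d,d'}(x): x moved to domain d'
    z_t = g_theta(x_t)

    inv_loss = (z - z_t).pow(2).sum(dim=1).mean()    # ||g(x) - g(f_{d,d'}(x))||_2^2
    return pred_loss + weight * inv_loss
```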
Domain Invariant Representation Learning with Domain Density Transformations
A. Tuan Nguyen¹, Toan Tran², Yarin Gal¹, Atılım Güneş Baydin¹
arXiv:2102.05082v2 [cs.LG], 14 Feb 2021

Abstract

Domain generalization refers to the problem where we aim to train a model on data from a set of source domains so that the model can generalize to unseen target domains. Naively training a model on the aggregate set of data (pooled from all source domains) has been shown to perform suboptimally, since the information learned by that model might be domain-specific and generalize imperfectly to target domains. To tackle this problem, a predominant approach is to find and learn some domain-invariant information in order to use it for the prediction task. In this paper, we propose a theoretically grounded method to learn a domain-invariant representation by enforcing the representation network to be invariant under all transformation functions among domains. We also show how to use generative adversarial networks to learn such domain transformations to implement our method in practice. We demonstrate the effectiveness of our method on several widely used datasets for the domain generalization problem, on all of which we achieve competitive results with state-of-the-art models.

1. Introduction

Domain generalization refers to the machine learning sce…

Figure 1. An example of two domains. For each domain, x is uniformly distributed on the outer circle (radius 2 for domain 1 and radius 3 for domain 2), with the color indicating class label y. After the transformation z = x/||x||₂, the marginal of z is aligned (uniformly distributed on the unit circle for both domains), but the conditional p(y|z) is not aligned. Thus, using this representation for predicting y would not generalize well across domains.

In the representation learning framework, the prediction y = f(x), where x is data and y is a label, is obtained as a composition y = h ∘ g(x) of a deep representation network z = g(x), where z is a learned representation of data x, and a smaller classifier y = h(z), predicting label y given representation z, both of which are shared across domains. Current "domain-invariance"-based methods in domain generalization focus on either the marginal distribution alignment (Muandet et al., 2013) or the conditional distribution alignment (Li et al., 2018b;c), which are still prone to distri…
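The Figure 1 example is easy to reproduce numerically. The snippet below is a minimal sketch assuming a concrete labeling (upper vs. lower half of each circle, with the colors reversed between the two domains), which is one way to realize the misaligned conditional the caption describes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(radius, flip):
    """Points on a circle; label = upper/lower half-plane, flipped per domain."""
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    x = radius * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    y = (np.sin(theta) > 0).astype(int)
    return x, (1 - y) if flip else y

x1, y1 = sample(2.0, flip=False)   # domain 1: radius 2
x2, y2 = sample(3.0, flip=True)    # domain 2: radius 3, colors reversed

z1 = x1 / np.linalg.norm(x1, axis=1, keepdims=True)   # z = x / ||x||_2
z2 = x2 / np.linalg.norm(x2, axis=1, keepdims=True)

# Marginal p(z|d) is aligned: both domains land on the unit circle.
print(np.linalg.norm(z1, axis=1).mean(), np.linalg.norm(z2, axis=1).mean())  # ~1.0, ~1.0

# Conditional p(y|z, d) is not aligned: for z in the upper half-plane,
# domain 1 says y = 1 almost surely while domain 2 says y = 0.
up1, up2 = z1[:, 1] > 0, z2[:, 1] > 0
print(y1[up1].mean(), y2[up2].mean())   # ~1.0 vs ~0.0
```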
Q1) Does a domain-invariant representation function exist?
Q2) Is gθ satisfying gθ(x) − gθ(f_{d,d'}(x)) = 0 domain-invariant?
4. Domain Invariant Representation Learning with Domain Density Transformations
Theorem 1: Does a domain-invariant representation function exist?
Theorem 1. The invariance of p(y|d) across domains is the necessary and sufficient condition for the existence of a domain-invariant representation (that aligns both the marginal and conditional distribution).
That is, "p(y|d) is invariant across domains" and "a domain-invariant representation (function) exists" are equivalent.
("A domain-invariant representation (function) exists" ⇒ "p(y|d) is invariant across domains"): if z aligns both distributions, then p(y, z|d) = p(y|z, d) p(z|d) = p(y|z, d') p(z|d') = p(y, z|d'), and marginalizing over z gives p(y|d) = p(y|d').
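A quick discrete sanity check of this marginalization step (a toy example with made-up distributions, not from the paper): if both domains share p(z|d) and p(y|z, d), then p(y|d) must coincide.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy discrete model: z takes 3 values, y takes 2 values.
# Both domains share p(z|d) (marginal alignment) and p(y|z,d)
# (conditional alignment).
p_z = rng.dirichlet(np.ones(3))                    # p(z|d) = p(z|d')
p_y_given_z = rng.dirichlet(np.ones(2), size=3)    # p(y|z,d) = p(y|z,d'), shape (3, 2)

# p(y,z|d) = p(y|z,d) p(z|d); marginalize over z to get p(y|d).
p_yz = p_y_given_z * p_z[:, None]
p_y_d = p_yz.sum(axis=0)       # p(y|d)
p_y_dp = p_yz.sum(axis=0)      # p(y|d'): same shared factors, hence identical

assert np.allclose(p_y_d, p_y_dp)
print(p_y_d)                   # a valid distribution over the 2 labels
```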
The main difference between domain generalization (Khosla et al., 2012; Muandet et al., 2013; Ghifary et al., 2015) and domain adaptation (Zhao et al., 2019; Zhang et al., 2019; Combes et al., 2020; Tanwani, 2020) is that, in domain generalization, the learner does not have access to (even a small amount of) data of the target domain, making the problem much more challenging.

One of the most common domain generalization approaches is to learn an invariant representation across domains, aiming at a good generalization performance on target domains. Marginal alignment refers to making the representation distribution p(z) the same across domains. This is essential since, if p(z) for the target domain is different from that of the source domains, the classification network h(z) would face out-of-distribution data, because the representation z it receives as input at test time would be different from the ones it was trained with in the source domains. Conditional alignment refers to aligning p(y|z), the conditional distribution of the label given the representation, since if this conditional for the target domain is different from that of the source domains, the classification network (trained on the source domains) would give inaccurate predictions at test time. The formal definition of the two alignments is discussed in Section 3.

(¹University of Oxford, ²VinAI Research. Correspondence to: A. Tuan Nguyen <tuan.nguyen@cs.ox.ac.uk>.)
1) A ⇔ B means A ⇒ B and B ⇒ A: we show both "a domain-invariant representation (function) exists" ⇒ "p(y|d) is invariant across domains" (above) and the converse.
If p(y|d) is unchanged w.r.t. the domain d, then we can always find a domain-invariant representation (this is trivial): for example, p(z|x) = δ₀(z|x) for the deterministic case (mapping all x to 0), or p(z|x) = N(z; 0, 1) for the probabilistic case.
2) Here "a domain-invariant representation (function) exists" means that there exists a representation z satisfying both the marginal and conditional distribution alignment.
5. Domain Invariant Representation Learning with Domain Density Transformations
Theorem 2: Is gθ satisfying gθ(x) − gθ(f_{d,d'}(x)) = 0 domain-invariant?
Theorem 2. Given an invertible and differentiable function f_{d,d'} that transforms the data density from domain d to domain d' (with the inverse f_{d',d}, as described above), and assuming that the representation z satisfies gθ(x) = gθ(f_{d,d'}(x)), then z aligns both the marginal and conditional of the data distribution for domains d and d' (marginal alignment and conditional alignment).
'
!
"#$ !"#$%&'"(%)*#!#)*+$%,!-)&"'()*'(#,$).)!')/
01,!2)!2+),$3+"%+)!
$#"4
%$ & !"#$'%"(
))%" ))%$
)*
+
, '%" (
+,
'%$
(
5'(#,$). 5'(#,$)/
main density transformation. If we know the function f1,2 that transforms the data density from domain 1 to domain 2,
a domain invariant representation network gθ(x) by enforcing it to be invariant under f1,2, i.e., gθ(x1) = gθ(x2) for any
1) .
applying variable substitution in multiple inte-
= fd,d (x))
p(x
|d
)
det Jfd,d
(x
)
−1
p(z|x
)
det Jfd,d
(x
)
dx
ce p(fd,d(x
)|d) = p(x
|d
)
det Jfd,d
(x
)
−1
Eq 6 and p(z|fd,d(x
)) = p(z|x
) due to defini-
z in Eq 7)
p(x
|d
)p(z|x
)dx
|d
) (8)
ional alignment: ∀z, y we have:
(since p(fd,d(x
)|y, d) =
p(x
|y, d
)
det Jfd,d
(x
)
−1
due to Eq 4 and
p(z|fd,d(x
)) = p(z|x
) due to definition of z in
Eq 7)
=
p(x
|y, d
)p(z|x
)dx
= p(z|y, d
) (9)
Note that
p(y|z, d) =
p(y, z|d)
p(z|d)
=
p(y|d)p(z|y, d)
p(z|d)
(10)
Since p(y|d) = p(y) = p(y|d
), p(z|y, d) = p(z|y, d
)
and p(z|d) = p(z|d
), we have:
p(y|z, d) =
p(y|d
)p(z|y, d
)
p(z|d)
= p(y|z, d
) (11)
01,!2)!2+),$3+%+)!
$#4
%$ !#$'%(
))% ))%$
)*
+
, '% (
+,
'%$
(
Figure 3. Domain density transformation. If we know the function f1,2 that transforms the data density from domain 1 to domain 2,
we can learn a domain invariant representation network gθ(x) by enforcing it to be invariant under f1,2, i.e., gθ(x1) = gθ(x2) for any
x2 = f1,2(x1) .
(by applying variable substitution in multiple inte-
gral: x
= fd,d (x))
=
p(x
|d
)
det Jfd,d
(x
)
−1
p(z|x
)
det Jfd,d
(x
)
dx
(since p(fd,d(x
)|d) = p(x
|d
)
det Jfd,d
(x
)
−1
due to Eq 6 and p(z|fd,d(x
)) = p(z|x
) due to defini-
tion of z in Eq 7)
=
p(x
|d
)p(z|x
)dx
= p(z|d
) (8)
ii) Conditional alignment: ∀z, y we have:
p(z|y, d) =
p(x|y, d)p(z|x)dx
=
p(fd,d(x
)|y, d)p(z|fd,d(x
))
det Jfd,d
(x
)
dx
(since p(fd,d(x
)|y, d) =
p(x
|y, d
)
det Jfd,d
(x
)
−1
due to Eq 4 and
p(z|fd,d(x
)) = p(z|x
) due to definition of z in
Eq 7)
=
p(x
|y, d
)p(z|x
)dx
= p(z|y, d
) (9)
Note that
p(y|z, d) =
p(y, z|d)
p(z|d)
=
p(y|d)p(z|y, d)
p(z|d)
(10)
Since p(y|d) = p(y) = p(y|d
), p(z|y, d) = p(z|y, d
)
and p(z|d) = p(z|d
), we have:
p(y|z, d) =
p(y|d
)p(z|y, d
)
p(z|d)
= p(y|z, d
) (11)
This theorem indicates that, if we can find the functions
f’s that transform the data densities among the domains,
we can learn a domain-invariant representation z by en-
!
#$ !#$%'(%)*#!#)*+$%,!-)'()*'(#,$).)!')/
01,!2)!2+),$3+%+)!
$#4
%$ !#$'%(
))% ))%$
)*
+
, '% (
+,
'%$
(
Figure 3. Domain density transformation. If we know the function f1,2 that transforms the data density from domain 1 to domain 2,
we can learn a domain invariant representation network gθ(x) by enforcing it to be invariant under f1,2, i.e., gθ(x1) = gθ(x2) for any
x2 = f1,2(x1) .
(by applying variable substitution in multiple inte-
gral: x
= fd,d (x))
=
p(x
|d
)
det Jfd,d
(x
)
−1
p(z|x
)
det Jfd,d
(x
)
dx
(since p(fd,d(x
)|d) = p(x
|d
)
det Jfd,d
(x
)
−1
due to Eq 6 and p(z|fd,d(x
)) = p(z|x
) due to defini-
tion of z in Eq 7)
=
p(x
|d
)p(z|x
)dx
= p(z|d
) (8)
ii) Conditional alignment: ∀z, y we have:
p(z|y, d) =
p(x|y, d)p(z|x)dx
(since p(fd,d(x
)|y, d) =
p(x
|y, d
)
det Jfd,d
(x
)
−1
due to Eq 4 and
p(z|fd,d(x
)) = p(z|x
) due to definition of z in
Eq 7)
=
p(x
|y, d
)p(z|x
)dx
= p(z|y, d
) (9)
Note that
p(y|z, d) =
p(y, z|d)
p(z|d)
=
p(y|d)p(z|y, d)
p(z|d)
(10)
Since p(y|d) = p(y) = p(y|d
), p(z|y, d) = p(z|y, d
)
and p(z|d) = p(z|d
), we have:
p(y|z, d) =
p(y|d
)p(z|y, d
)
p(z|d)
= p(y|z, d
) (11)
This theorem indicates that, if we can find the functions
ng with Domain Density Transformations
$%,!-)'()*'(#,$).)!')/
3+%+)!
$#4
%$ !#$'%(
))%$
)*
+,
'%$
(
5'(#,$)/
on f1,2 that transforms the data density from domain 1 to domain 2,
enforcing it to be invariant under f1,2, i.e., gθ(x1) = gθ(x2) for any
(since p(fd,d(x
)|y, d) =
p(x
|y, d
)
det Jfd,d
(x
)
−1
due to Eq 4 and
p(z|fd,d(x
)) = p(z|x
) due to definition of z in
Eq 7)
=
p(x
|y, d
)p(z|x
)dx
= p(z|y, d
) (9)
Note that
p(y|z, d) =
p(y, z|d)
p(z|d)
=
p(y|d)p(z|y, d)
p(z|d)
(10)
Since p(y|d) = p(y) = p(y|d
), p(z|y, d) = p(z|y, d
)
and p(z|d) = p(z|d
), we have:
p(y|z, d) =
p(y|d
)p(z|y, d
)
p(z|d)
= p(y|z, d
) (11)
d,d'
p(z|x)
-1
-1
p(z|d) ∫
= p(x|d)p(z|x) |d)
dx ∫
= p( (x')
fd',d z )
|
p( (x')
fd',d
p(z|y,d) p(z|y,d')
∫
= =
p(x|y,d)p(z|x)dx =
dx'
fd',d
det (x')
J
|y,d)
∫ p( (x')
fd',d z )
|
p( (x')
fd',d dx' x'
fd',d
det (x')
J = |y,d')
x'
∫ p( z )
|
p( dx'
fd',d
det (x')
J
fd',d
det (x')
J x'
= |y,d')
x'
∫ p( z )
|
p( dx'
|d')
x'
∫
= =
=
p( |x')
z
p( dx'
fd',d
det (x')
J fd',d
det (x')
J |d')
x'
∫ p( |d')
z
p(
|x')
z
p( dx'
f x'
d,d'
where is Jacobian matrix of the function evaluated at
p(x|y,d) (∵ marginalizing over z)
⇒
p(y|d) p(y|d')
p(x'|y,d')
= fd',d
det
-1
(x')
J p(x|d) p(x'|d')
= fd',d
det
-1
(x')
J
p(x|y,d) p(x'|y,d')
= fd',d
det
-1
(x')
J fd,d'(x')
J
(z|
= )
p ∀x
f (x),
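The change-of-variables identity that drives both steps of the proof can be checked numerically on a toy pair of 1-D domains (an assumed example, not from the paper): take d' = N(0, 1) and the affine map f_{d',d}(x') = 2x' + 1, so that d = N(1, 4) and |det J| = 2 everywhere.

```python
import numpy as np

rng = np.random.default_rng(2)

def p_dprime(x):                     # density of domain d' = N(0, 1)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def p_d(x):                          # density of domain d = N(1, 4)
    return np.exp(-(x - 1)**2 / 8) / np.sqrt(8 * np.pi)

def f(xp):                           # f_{d',d}: transports d' onto d; |det J| = 2
    return 2 * xp + 1

# Identity used in the proof: p(f_{d',d}(x') | d) = p(x' | d') |det J|^{-1}.
xp = rng.normal(size=5)
assert np.allclose(p_d(f(xp)), p_dprime(xp) / 2)

# Monte Carlo check that f really transports the density of d' to that of d.
xs = f(rng.normal(size=100_000))
print(xs.mean(), xs.std())           # close to 1.0 and 2.0, as for N(1, 4)
```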
6. Domain Invariant Representation Learning with Domain Density Transformations
This theorem indicates that, if we can find the functions f's that transform the data densities among the domains, we can learn a domain-invariant representation z by encouraging the representation to be invariant under all the transformations f's. This idea is illustrated in Figure 3. We therefore can use the following learning objective to learn a domain-invariant representation z = gθ(x):

E_d[ E_{p(x,y|d)}[ l(y, gθ(x)) + E_{d'}[ ||gθ(x) − gθ(f_{d,d'}(x))||²₂ ] ] ]    (12)

where l(y, gθ(x)) is the prediction loss of a network that predicts y given z = gθ(x), and the second term is to enforce the invariant condition in Eq 7.

Assume that we have a set of K source domains Ds = {d1, d2, ..., dK}; the objective function in Eq. 12 becomes:

E_{d,d'∈Ds, p(x,y|d)}[ l(y, gθ(x)) + ||gθ(x) − gθ(f_{d,d'}(x))||²₂ ]    (13)

In the next section, we show how one can incorporate this idea into real-world domain generalization problems with generative adversarial networks.
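In code, the move from Eq. 12 to Eq. 13 amounts to sampling a domain pair (d, d') from Ds at each step. A hedged sketch follows; the dictionary-based generators and batches interfaces are illustrative, not the authors' implementation.

```python
import random
import torch
import torch.nn.functional as F

def train_step(g_theta, classifier, generators, batches, opt):
    """One optimization step of Eq. 13 over the source domains Ds.

    generators[(d, dp)] plays the role of f_{d,d'} (e.g. a frozen,
    pre-trained domain-translation generator); batches[d] is an
    (x, y) minibatch drawn from p(x, y | d).
    """
    d, dp = random.sample(list(batches.keys()), 2)   # a pair d != d' from Ds
    x, y = batches[d]

    z = g_theta(x)
    loss = F.cross_entropy(classifier(z), y)         # l(y, g_theta(x))
    with torch.no_grad():
        x_dp = generators[(d, dp)](x)                # f_{d,d'}(x)
    loss = loss + (z - g_theta(x_dp)).pow(2).sum(dim=1).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```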
5. Domain Generalization with Generative Adversarial Networks (StarGAN; PR-152)
Figure 2. Comparison between cross-domain models and our proposed model, StarGAN. (a) To handle multiple domains, cross-domain models should be built for every pair of image domains. (b) StarGAN is capable of learning mappings among multiple domains using a single generator. The figure represents a star topology connecting multi-domains.

…trained from RaFD, as shown in the right-most columns of Fig. 1. As far as our knowledge goes, ours is the first to successfully perform multi-domain image translation across different datasets. Our contributions are as follows: we propose StarGAN, a novel generative adversarial network that learns the mappings among multiple domains using only a single generator and a discriminator, training effectively from images of all domains; and we demonstrate how we can successfully learn multi-domain image translation…
Figure 3. Overview of StarGAN, consisting of two modules, a discriminator D and a generator G. (a) D learns to distinguish between real and fake images and classify the real images to its corresponding domain. (b) G takes in as input both the image and the target domain label and generates a fake image. The target domain label is spatially replicated and concatenated with the input image. (c) G tries to reconstruct the original image from the fake image given the original domain label. (d) G tries to generate images indistinguishable from real images and classifiable as the target domain by D.
…provided both the discriminator and generator with class information in order to generate samples conditioned on the class [20, 21, 22]. Other recent approaches focused on generating particular images highly relevant to a given text description [25, 30]. The idea of conditional image generation has also been successfully applied to domain transfer [9, 28], super-resolution imaging [14], and photo editing [2, 27]. In this paper, we propose a scalable GAN framework that can flexibly steer the image translation to various target domains, by providing conditional domain information.

Image-to-Image Translation. Recent works have achieved impressive results in image-to-image translation [7, 9, 17, 33]. For instance, pix2pix [7] learns this task in a supervised manner using cGANs [20]. It combines an adversarial loss with an L1 loss, and thus requires paired data samples. To alleviate the problem of obtaining data pairs, un…

3. Star Generative Adversarial Networks

We first describe our proposed StarGAN, a framework to address multi-domain image-to-image translation within a single dataset. Then, we discuss how StarGAN incorporates multiple datasets containing different label sets to flexibly perform image translations using any of these labels.

3.1. Multi-Domain Image-to-Image Translation

Our goal is to train a single generator G that learns mappings among multiple domains. To achieve this, we train G to translate an input image x into an output image y conditioned on the target domain label c, G(x, c) → y. We randomly generate the target domain label c so that G learns to flexibly translate the input image. We also introduce an auxiliary classifier [22] that allows a single discriminator to control multiple domains. That is, our discriminator produces probability distributions over both sources and domain labels.
…G tries to minimize this objective, while the discriminator D tries to maximize it.

Domain Classification Loss. For a given input image x and a target domain label c, our goal is to translate x into an output image y, which is properly classified to the target domain c. To achieve this condition, we add an auxiliary classifier on top of D and impose the domain classification loss when optimizing both D and G. That is, we decompose the objective into two terms: a domain classification loss of real images used to optimize D, and a domain classification loss of fake images used to optimize G. In detail, the former is defined as

L^r_cls = E_{x,c'}[−log D_cls(c'|x)],    (2)

where the term D_cls(c'|x) represents a probability distribution over domain labels computed by D. By minimizing this objective, D learns to classify a real image x to its corresponding original domain c'. We assume that the input image and domain label pair (x, c') is given by the training data. On the other hand, the loss function for the domain classification of fake images is defined as

L^f_cls = E_{x,c}[−log D_cls(c|G(x, c))].    (3)

In other words, G tries to minimize this objective to generate images that can be classified as the target domain c.

Reconstruction Loss. By minimizing the adversarial and classification losses, G is trained to generate images that are realistic and classified to its correct target domain. However, minimizing the losses (Eqs. (1) and (3)) does not guarantee that translated images preserve the content of its input images while changing only the domain-related part of the inputs. To alleviate this problem, we apply a cycle consistency loss [9, 33] to the generator, defined as

L_rec = E_{x,c,c'}[||x − G(G(x, c), c')||₁],    (4)

where G takes in the translated image G(x, c) and the original domain label c' as input and tries to reconstruct the original image x. We adopt the L1 norm as our reconstruction loss. Note that we use a single generator twice, first to translate an original image into an image in the target domain and then to reconstruct the original image from the translated image.

Full Objective. Finally, the objective functions to optimize G and D are written, respectively, as

L_D = −L_adv + λ_cls L^r_cls,    (5)
L_G = L_adv + λ_cls L^f_cls + λ_rec L_rec,    (6)

where λ_cls and λ_rec are hyper-parameters that control the relative importance of domain classification and reconstruction losses, respectively, compared to the adversarial loss. We use λ_cls = 1 and λ_rec = 10 in all of our experiments.
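Putting Eqs. (2)-(6) together, here is a condensed sketch of the two objectives. It assumes categorical domain labels so that cross-entropy applies, and it uses a vanilla GAN loss for L_adv as a simplification; the published StarGAN actually uses a Wasserstein objective with gradient penalty there.

```python
import torch
import torch.nn.functional as F

def stargan_losses(G, D, x, c_org, c_trg, lambda_cls=1.0, lambda_rec=10.0):
    """Sketch of Eqs. (2)-(6); D returns (real/fake logits, domain logits)."""
    # --- discriminator side (Eq. 5) ---
    out_src_real, out_cls_real = D(x)
    fake = G(x, c_trg)
    out_src_fake, _ = D(fake.detach())
    d_adv = (F.binary_cross_entropy_with_logits(out_src_real, torch.ones_like(out_src_real))
             + F.binary_cross_entropy_with_logits(out_src_fake, torch.zeros_like(out_src_fake)))
    d_cls = F.cross_entropy(out_cls_real, c_org)          # Eq. (2): classify real images
    loss_D = d_adv + lambda_cls * d_cls

    # --- generator side (Eq. 6) ---
    out_src_fake, out_cls_fake = D(fake)
    g_adv = F.binary_cross_entropy_with_logits(out_src_fake, torch.ones_like(out_src_fake))
    g_cls = F.cross_entropy(out_cls_fake, c_trg)          # Eq. (3): classify fakes as target
    rec = (x - G(fake, c_org)).abs().mean()               # Eq. (4): L1 cycle-consistency
    loss_G = g_adv + lambda_cls * g_cls + lambda_rec * rec
    return loss_D, loss_G
```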
3.2. Training with Multiple Datasets

An important advantage of StarGAN is that it simultaneously incorporates multiple datasets containing different types of labels, so that StarGAN can control all available domain labels. An issue when learning from multiple datasets, however, is that the label information is only partially known to each dataset. In the case of CelebA [19] and RaFD [13], while the former contains labels for attributes such as hair color and gender, it does not have any labels for facial expressions such as 'happy' and 'angry', and vice versa for the latter. This is problematic because the complete information on the label vector c' is required when reconstructing the input image x from the translated image G(x, c) (see Eq. (4)).

Mask Vector. To alleviate this problem, we introduce a mask vector m that allows StarGAN to ignore unspecified labels and focus on the explicitly known label provided by a particular dataset. In StarGAN, we use an n-dimensional one-hot vector to represent m, with n being the number of datasets. In addition, we define a unified version of the label as a vector

c̃ = [c1, ..., cn, m],    (7)

where [·] refers to concatenation, and ci represents a vector for the labels of the i-th dataset. The vector of the known label ci can be represented as either a binary vector for binary attributes or a one-hot vector for categorical attributes. For the remaining n−1 unknown labels we simply assign zero values.

Training Strategy. When training StarGAN with multiple datasets, we use the domain label c̃ defined in Eq. (7) as input to the generator. By doing so, the generator learns to ignore the unspecified labels, which are zero vectors, and focus on the explicitly given label. The structure of the generator is exactly the same as in training with a single dataset, except for the dimension of the input label c̃. On the other hand, we extend the auxiliary classifier of the discriminator to generate probability distributions over labels for all datasets. Then, we train the model in a multi-task learning setting, where the discriminator tries to minimize only the classification error associated to the known label. For example, when training with images in CelebA, the discriminator minimizes only classification errors for labels related to CelebA attributes, and not facial expressions related to RaFD. Under these settings, by alternating between CelebA and RaFD the discriminator learns all of the discriminative features for both datasets, and the generator learns to control all the labels in both datasets.
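A small sketch of the mask-vector construction in Eq. (7); the attribute dimensions below are made up for illustration.

```python
import numpy as np

def unified_label(ci_list, dataset_idx):
    """Builds c~ = [c1, ..., cn, m] from Eq. (7) for a sample of one dataset.

    ci_list:      per-dataset label vectors; datasets whose labels are unknown
                  for this sample get zero vectors (matching "unspecified
                  labels, which are zero vectors" in the training strategy).
    dataset_idx:  which dataset the sample comes from (sets the one-hot m).
    """
    n = len(ci_list)
    m = np.zeros(n)
    m[dataset_idx] = 1.0                      # n-dimensional one-hot mask vector
    return np.concatenate(ci_list + [m])

# Example with n = 2 datasets: CelebA-style binary attributes plus a
# RaFD-style one-hot expression label (dimensions are illustrative).
celeba_c = np.array([1.0, 0.0, 1.0])          # e.g. [black_hair, blond, young]
rafd_c = np.zeros(8)                          # unknown for a CelebA sample
print(unified_label([celeba_c, rafd_c], dataset_idx=0))
```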
4. Domain Generalization with Generative Adversarial Networks

In practice, we will learn the functions f's that transform the data distributions between domains, and one can use several generative modeling frameworks, e.g., normalizing flows (Grover et al., 2020) or GANs (Zhu et al., 2017; Choi et al., 2018; 2020), to learn such functions. One advantage of normalizing flows is that this transformation is naturally invertible by design of the neural network. In addition, the determinant of the Jacobian of that transformation can be efficiently computed. However, due to the fact that we do not need access to the Jacobian when the training process of the generative model is completed, we propose the use of GANs to inherit their rich network capacity. In particular, we use the StarGAN (Choi et al., 2018) model, which is designed for image domain transformations.

The goal of StarGAN is to learn a unified network G that transforms the data density among multiple domains. In particular, the network G(x, d, d') (i.e., G is conditioned on the image x and the two different domains d, d') transforms an image x from domain d to domain d'. Different from the original StarGAN model that only takes the image x and the desired destination domain d' as its input, in our implementation we feed both the original domain d and the desired destination domain d' together with the original image x to the generator G.

The generator's goal is to fool a discriminator D into thinking that the transformed image belongs to the destination domain d'. In other words, the equilibrium state of StarGAN, in which G completely fools D, is when G successfully transforms the data density of the original domain to that of the destination domain. After training, we use G(., d, d') as the function f_{d,d'}(.) described in the previous section and perform the representation learning via the objective function in Eq 13.

Three important loss functions of the StarGAN architecture are:

- The adversarial loss L_adv, which is the classification loss of a discriminator D that tries to distinguish between real images and the fake images generated by G. The equilibrium state of StarGAN is when G completely fools D, which means the distribution of the generated images (via G(x, d, d'), x ∼ p(x|d)) becomes the distribution of the real images of the destination domain p(x'|d'). This is our objective, i.e., to learn a function that transforms domains' densities.

- The domain classification loss L_cls, which encourages the generator G to generate images that correctly belong to the desired destination domain d'.

- The reconstruction loss L_rec = E_{x,d,d'}[||x − G(x', d', d)||₁], where x' = G(x, d, d'), to ensure that the transformations preserve the image's content. Note that this also aligns with our interest, since we want G(., d', d) to be the inverse of G(., d, d'), which will minimize L_rec to zero.

We can enforce the generator G to transform the data distribution within the class y (e.g., p(x|y, d) to p(x'|y, d') ∀y) by sampling each minibatch with data from the same class y, so that the discriminator will distinguish the transformed images from the real images of class y and domain d'. However, we found that this constraint can be relaxed in practice, and the generator almost always transforms the image within the original class y.

As mentioned earlier, after training the StarGAN model, we can use the generator G(., d, d') as our f_{d,d'}(.) function and learn a domain-invariant representation via the learning objective in Eq 13. We name this implementation of our method DIR-GAN (domain-invariant representation learning with generative adversarial networks).

5. Experiments

5.1. Datasets

To evaluate our method, we perform experiments on three datasets that are commonly used in the literature for domain generalization.

Rotated MNIST. In this dataset by Ghifary et al. (2015), 1,000 MNIST images (100 per class) (LeCun & Cortes, 2010) are chosen to form the first domain (denoted M0); then rotations of 15°, 30°, 45°, 60° and 75° are applied to create five additional domains, denoted M15, M30, M45, M60 and M75. The task is classification with ten classes (digits 0 to 9).

PACS (Li et al., 2017) contains 9,991 images from four different domains: art painting, cartoon, photo, sketch. The task is classification with seven classes.

OfficeHome (Venkateswara et al., 2017) has 15,500 images of daily objects from four domains: art, clipart, product and real. There are 65 classes in this classification dataset.
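For concreteness, the Rotated MNIST domains can be built with a few lines. This is a sketch assuming scipy for the rotation; selecting the fixed 1,000-image subset is omitted.

```python
import numpy as np
from scipy.ndimage import rotate

def make_rotated_domains(images, angles=(0, 15, 30, 45, 60, 75)):
    """Builds the M0..M75 domains of the Rotated MNIST benchmark by rotating
    a fixed subset of MNIST images (images: array of shape (N, 28, 28))."""
    domains = {}
    for a in angles:
        domains[f"M{a}"] = np.stack(
            [rotate(img, angle=a, reshape=False, mode="constant") for img in images]
        )
    return domains

# usage: domains = make_rotated_domains(mnist_subset)
```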
Table 1. Rotated MNIST leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs.
Model | M0 | M15 | M30 | M45 | M60 | M75 | Average
The following background, restated from the StarGAN paper (Choi et al., 2018), specifies the StarGAN losses and its multi-dataset training scheme in more detail. Unlike CycleGAN [33] and DiscoGAN [9], which preserve key attributes between the input and the translated image through a cycle consistency loss but can only learn the relation between two domains at a time (so that a different model must be trained for each pair of domains), StarGAN learns the mappings among all available domains with a single generator. Instead of learning a fixed translation (e.g., black-to-blond hair), the generator takes both the image and domain information as input and learns to flexibly translate the image into the corresponding domain, where the domain is represented by a label (e.g., a binary or one-hot vector).

Adversarial Loss. To make the generated images indistinguishable from real images, StarGAN adopts an adversarial loss

$\mathcal{L}_{adv} = \mathbb{E}_x[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1 - D_{src}(G(x, c)))],$  (1)

where G generates an image G(x, c) conditioned on both the input image x and the target domain label c, while D tries to distinguish between real and fake images. The term D_src(x) denotes a probability distribution over sources given by D. The generator G tries to minimize this objective, while the discriminator D tries to maximize it.

Domain Classification Loss. For a given input image x and a target domain label c, the goal is to translate x into an output image y that is properly classified to the target domain c. To achieve this condition, an auxiliary classifier is added on top of D and the domain classification loss is imposed when optimizing both D and G. That is, the objective is decomposed into two terms: a domain classification loss of real images used to optimize D, and a domain classification loss of fake images used to optimize G. In detail, the former is defined as

$\mathcal{L}^r_{cls} = \mathbb{E}_{x,c'}[-\log D_{cls}(c'|x)],$  (2)

where the term D_cls(c′|x) represents a probability distribution over domain labels computed by D. By minimizing this objective, D learns to classify a real image x to its corresponding original domain c′. The input image and domain label pair (x, c′) is given by the training data. On the other hand, the loss function for the domain classification of fake images is defined as

$\mathcal{L}^f_{cls} = \mathbb{E}_{x,c}[-\log D_{cls}(c|G(x, c))].$  (3)

In other words, G tries to minimize this objective to generate images that can be classified as the target domain c.

Reconstruction Loss. By minimizing the adversarial and classification losses, G is trained to generate images that are realistic and classified to their correct target domain. However, minimizing these losses (Eqs. (1) and (3)) does not guarantee that translated images preserve the content of their input images while changing only the domain-related part of the inputs. To alleviate this problem, a cycle consistency loss [9, 33] is applied to the generator, defined as

$\mathcal{L}_{rec} = \mathbb{E}_{x,c,c'}[\lVert x - G(G(x, c), c')\rVert_1],$  (4)

where G takes in the translated image G(x, c) and the original domain label c′ as input and tries to reconstruct the original image x, using the L1 norm as the reconstruction loss. Note that a single generator is used twice: first to translate an original image into an image in the target domain, and then to reconstruct the original image from the translated image.

Full Objective. Finally, the objective functions to optimize G and D are written, respectively, as

$\mathcal{L}_D = -\mathcal{L}_{adv} + \lambda_{cls}\, \mathcal{L}^r_{cls},$  (5)
$\mathcal{L}_G = \mathcal{L}_{adv} + \lambda_{cls}\, \mathcal{L}^f_{cls} + \lambda_{rec}\, \mathcal{L}_{rec},$  (6)

where λ_cls and λ_rec are hyper-parameters that control the relative importance of the domain classification and reconstruction losses, respectively, compared to the adversarial loss. StarGAN uses λ_cls = 1 and λ_rec = 10 in all of its experiments.
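Putting Eqs. (1)-(6) together, one simplified discriminator/generator update could look as follows. This is a sketch with assumptions: D is taken to return a real/fake logit and domain-classification logits, domain labels are categorical, and the non-saturating form of the generator's adversarial term replaces Eq. (1)'s log(1 − D_src(G(x, c))); StarGAN's released implementation differs in such details (e.g., it stabilizes training with a Wasserstein objective and gradient penalty).

```python
import torch
import torch.nn.functional as F

lambda_cls, lambda_rec = 1.0, 10.0  # weights from Eqs. (5)-(6)

def train_step(G, D, x, c_org, c_trg, opt_G, opt_D):
    """One simplified StarGAN update; D(x) returns (src_logit, cls_logits)."""
    # Discriminator: minimize -L_adv + lambda_cls * L^r_cls   (Eq. 5)
    src_real, cls_real = D(x)
    with torch.no_grad():
        x_fake = G(x, c_trg)
    src_fake, _ = D(x_fake)
    l_adv = (torch.log(torch.sigmoid(src_real) + 1e-8).mean()
             + torch.log(1 - torch.sigmoid(src_fake) + 1e-8).mean())
    loss_d = -l_adv + lambda_cls * F.cross_entropy(cls_real, c_org)  # Eq. (2)
    opt_D.zero_grad(); loss_d.backward(); opt_D.step()

    # Generator: minimize L_adv + lambda_cls * L^f_cls + lambda_rec * L_rec  (Eq. 6)
    x_fake = G(x, c_trg)
    src_fake, cls_fake = D(x_fake)
    l_adv_g = -torch.log(torch.sigmoid(src_fake) + 1e-8).mean()  # non-saturating form
    l_cls_f = F.cross_entropy(cls_fake, c_trg)                   # Eq. (3)
    l_rec = (x - G(x_fake, c_org)).abs().mean()                  # Eq. (4), L1 norm
    loss_g = l_adv_g + lambda_cls * l_cls_f + lambda_rec * l_rec
    opt_G.zero_grad(); loss_g.backward(); opt_G.step()
    return loss_d.item(), loss_g.item()
```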
3.2. Training with Multiple Datasets

An important advantage of StarGAN is that it simultaneously incorporates multiple datasets containing different types of labels, so that it can control all the labels at the test phase. An issue when learning from multiple datasets, however, is that the label information is only partially known to each dataset. In the case of CelebA [19] and RaFD [13], while the former contains labels for attributes such as hair color and gender, it does not have any labels for facial expressions such as 'happy' and 'angry', and vice versa for the latter. This is problematic because the complete information on the label vector c′ is required when reconstructing the input image x from the translated image G(x, c) (see Eq. (4)).

Mask Vector. To alleviate this problem, a mask vector m is introduced that allows StarGAN to ignore unspecified labels and focus on the explicitly known label provided by a particular dataset. StarGAN uses an n-dimensional one-hot vector to represent m, with n being the number of datasets, and defines a unified version of the label as a vector

$\tilde{c} = [c_1, ..., c_n, m],$  (7)

where [·] refers to concatenation and c_i represents a vector for the labels of the i-th dataset. The vector of the known label c_i can be represented as either a binary vector for binary attributes or a one-hot vector for categorical attributes. For the remaining n−1 unknown labels, zero values are simply assigned. The StarGAN experiments use the CelebA and RaFD datasets, where n is two.

Training Strategy. When training StarGAN with multiple datasets, the domain label c̃ defined in Eq. (7) is used as input to the generator. By doing so, the generator learns to ignore the unspecified labels, which are zero vectors, and to focus on the explicitly given label. The structure of the generator is exactly the same as in training with a single dataset, except for the dimension of the input label c̃. On the other hand, the auxiliary classifier of the discriminator is extended to generate probability distributions over the labels of all datasets, and the model is trained in a multi-task learning setting where the discriminator tries to minimize only the classification error associated with the known label. For example, when training with images in CelebA, the discriminator minimizes only the classification errors related to CelebA attributes, and not the facial expressions of RaFD. Under these settings, by alternating between CelebA and RaFD, the discriminator learns all of the discriminative features for both datasets, and the generator learns to control all the labels in both datasets.
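As a small sketch of Eq. (7) for two datasets (n = 2), with hypothetical label dimensions and zero vectors standing in for the unknown labels:

```python
import torch

def unified_label(c, dataset_idx, label_dims=(5, 8), n=2):
    """Build c~ = [c1, ..., cn, m] (Eq. 7) for a batch from one dataset.

    c: (batch, label_dims[dataset_idx]) labels of the known dataset;
    label_dims: per-dataset label sizes (illustrative values only);
    the other datasets' label slots are filled with zero vectors.
    """
    batch = c.size(0)
    parts = [c if i == dataset_idx else torch.zeros(batch, dim)
             for i, dim in enumerate(label_dims)]
    m = torch.zeros(batch, n)
    m[:, dataset_idx] = 1.0          # one-hot mask vector selects the dataset
    return torch.cat(parts + [m], dim=1)

# e.g., a CelebA-style batch (dataset 0) with 5 binary attributes:
# c_tilde = unified_label(torch.zeros(16, 5), dataset_idx=0)
```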
5. Experiments

5.1. Datasets

To evaluate our method, we perform experiments on three datasets that are commonly used in the domain generalization literature.
Rotated MNIST. In this dataset by Ghifary et al. (2015), 1,000 MNIST images (100 per class) (LeCun & Cortes, 2010) are chosen to form the first domain (denoted M0); rotations of 15°, 30°, 45°, 60° and 75° are then applied to create five additional domains, denoted M15, M30, M45, M60 and M75. The task is classification with ten classes (digits 0 to 9).
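The rotated domains are straightforward to reproduce; the sketch below uses torchvision and, as a simplification, takes the first 100 images of each digit rather than the exact subset of Ghifary et al. (2015):

```python
import torch
from torchvision import datasets
from torchvision.transforms import functional as TF

mnist = datasets.MNIST("data", train=True, download=True)

# M0: 1,000 images, 100 per class (here simply the first 100 of each digit)
idx = torch.cat([(mnist.targets == c).nonzero()[:100, 0] for c in range(10)])
base = mnist.data[idx].unsqueeze(1).float() / 255.0   # (1000, 1, 28, 28)
labels = mnist.targets[idx]

domains = {0: base}                                   # M0
for angle in (15, 30, 45, 60, 75):                    # M15 ... M75
    domains[angle] = torch.stack([TF.rotate(img, float(angle)) for img in base])
```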
PACS (Li et al., 2017) contains 9,991 images from four different domains: art painting, cartoon, photo, sketch. The task is classification with seven classes.

OfficeHome (Venkateswara et al., 2017) has 15,500 images of daily objects from four domains: art, clipart, product and real. There are 65 classes in this classification dataset.
Table 1. Rotated MNIST leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs.

Model                              M0           M15          M30          M45          M60          M75          Average
HIR (Wang et al., 2020)            90.34        99.75        99.40        96.17        99.25        91.26        96.03
DIVA (Ilse et al., 2020)           93.5         99.3         99.1         99.2         99.3         93.0         97.2
DGER (Zhao et al., 2020)           90.09        99.24        99.27        99.31        99.45        90.81        96.36
DA (Ganin et al., 2016)            86.7         98.0         97.8         97.4         96.9         89.1         94.3
LG (Shankar et al., 2018)          89.7         97.8         98.0         97.1         96.6         92.1         95.3
HEX (Wang et al., 2019)            90.1         98.9         98.9         98.8         98.3         90.0         95.8
ADV (Wang et al., 2019)            89.9         98.6         98.8         98.7         98.6         90.4         95.2
DIR-GAN (ours)                     97.2(±0.3)   99.4(±0.1)   99.3(±0.1)   99.3(±0.1)   99.2(±0.1)   97.1(±0.3)   98.6

Table 2. PACS leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs.

Model                              Backbone   Art Painting   Cartoon       Photo         Sketch        Average
DGER (Zhao et al., 2020)           Resnet18   80.70          76.40         96.65         71.77         81.38
JiGen (Carlucci et al., 2019)      Resnet18   79.42          75.25         96.03         71.35         79.14
MLDG (Li et al., 2018a)            Resnet18   79.50          77.30         94.30         71.50         80.70
MetaReg (Balaji et al., 2018)      Resnet18   83.70          77.20         95.50         70.40         81.70
CSD (Piratla et al., 2020)         Resnet18   78.90          75.80         94.10         76.70         81.40
DMG (Chattopadhyay et al., 2020)   Resnet18   76.90          80.38         93.35         75.21         81.46
DIR-GAN (ours)                     Resnet18   82.56(±0.4)    76.37(±0.3)   95.65(±0.5)   79.89(±0.2)   83.62

Table 3. OfficeHome leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs.

Model                                Backbone   Art           ClipArt       Product       Real          Average
D-SAM (D'Innocente & Caputo, 2018)   Resnet18   58.03         44.37         69.22         71.45         60.77
JiGen (Carlucci et al., 2019)        Resnet18   53.04         47.51         71.47         72.79         61.20
DIR-GAN (ours)                       Resnet18   56.69(±0.4)   50.49(±0.2)   71.32(±0.4)   74.23(±0.5)   63.18
5.2. Experimental Setting

For all datasets, we perform “leave-one-domain-out” experiments, where we choose one domain as the target domain, train the model on all remaining domains, and evaluate it on the chosen domain. Following standard practice, we use 90% of the available data as training data and 10% as validation data, except for the Rotated MNIST experiment, where we do not use a validation set and just report the performance of the last epoch.
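The protocol itself reduces to a short loop; in this sketch, train_fn and eval_fn are placeholders for the training and evaluation routines described below:

```python
from typing import Callable, Dict, List

def leave_one_domain_out(
    domains: List[str],
    train_fn: Callable[[List[str]], object],   # trains on sources (with a 90/10 split)
    eval_fn: Callable[[object, str], float],   # evaluates on the held-out domain
) -> Dict[str, float]:
    """For each target domain, train on all remaining domains, test on the target."""
    results = {}
    for target in domains:
        sources = [d for d in domains if d != target]
        model = train_fn(sources)
        results[target] = eval_fn(model, target)
    return results

# e.g., leave_one_domain_out(["art_painting", "cartoon", "photo", "sketch"], ...)
```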
For the Rotated MNIST dataset, we use a network of two 3×3 convolutional layers and a fully connected layer as the representation network gθ, producing a representation z of 64 dimensions. A single linear layer is then used to map the representation z to the ten output classes. This architecture is the deterministic version of the network used by Ilse et al. (2020). We train our network for 500 epochs with the Adam optimizer (Kingma & Ba, 2014), using learning rate 0.001 and minibatch size 64, and report performance on the test domain after the last epoch.
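A sketch of that network follows; the channel widths and pooling are our assumptions, while the two 3×3 convolutions, the 64-dimensional representation, and the single linear output layer come from the text:

```python
import torch.nn as nn

class MnistRep(nn.Module):
    """Two 3x3 conv layers + a fully connected layer -> 64-d representation z."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 7 * 7, 64)    # representation z has 64 dimensions

    def forward(self, x):                      # x: (B, 1, 28, 28)
        return self.fc(self.features(x).flatten(1))

g_theta = MnistRep()
classifier = nn.Linear(64, 10)   # single linear layer to the ten classes
# per the text: Adam, learning rate 0.001, minibatch size 64, 500 epochs
```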
For the PACS and OfficeHome datasets, we use a Resnet18 (He et al., 2016) network as the representation network gθ. As is standard practice, the Resnet18 backbone is pre-trained on ImageNet. We replace the last fully connected layer of the Resnet with a linear layer of dimensions (512, 256), so that our representation has 256 dimensions. As in the Rotated MNIST experiment, we use a single linear layer to map the representation z to the output. We train the network for 100 epochs with plain stochastic gradient descent (SGD) using learning rate 0.001, momentum 0.9, minibatch size 64, and weight decay 0.001. Data augmentation is also standard practice for real-world computer vision datasets like PACS and OfficeHome, and during training we augment our data as follows: crops of random size and aspect ratio, resizing to 224 × 224 pixels, random horizontal flips, random color jitter, randomly converting the image tile to grayscale with 10% probability, and normalization using the ImageNet channel means and standard deviations.
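These choices map almost directly onto torchvision (assuming a recent version for the pre-trained weights API); only the color-jitter strengths are placeholders, since the text does not specify them:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms

num_classes = 7  # PACS; OfficeHome would use 65

# Resnet18 pre-trained on ImageNet, last FC replaced by a (512, 256) linear layer
g_theta = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
g_theta.fc = nn.Linear(512, 256)           # representation z: 256 dimensions
classifier = nn.Linear(256, num_classes)   # single linear layer from z to the output

opt = optim.SGD(
    list(g_theta.parameters()) + list(classifier.parameters()),
    lr=0.001, momentum=0.9, weight_decay=0.001,   # settings stated in the text
)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),           # random size/aspect-ratio crops
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),  # jitter strengths: our placeholders
    transforms.RandomGrayscale(p=0.1),           # grayscale with 10% probability
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],  # ImageNet channel means
                         [0.229, 0.224, 0.225]), # and standard deviations
])
```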
Figure 4. Visualization of the representation space. Each point indicates a representation z of an image x in the two-dimensional space, and its color indicates the label y. The two left figures are for our method DIR-GAN and the two right figures are for the naive model DeepAll.
The StarGAN (Choi et al., 2018) model implementation is taken from the authors' original source code with no significant modifications. For each set of source domains, we train the StarGAN model for 100,000 iterations with a minibatch of 16 images per iteration.

The code for all of our experiments will be released for reproducibility. Please also refer to the source code for any other architecture and implementation details.
Figure 4 shows that DIR-GAN aligns the representations across the source domains, both in terms of the general distribution of the points and the conditional distribution (for example, the distributions of blue points and green points).
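A two-dimensional view like Figure 4 can be produced by projecting the learned representations, e.g. with t-SNE (our assumption; the text does not state which projection was used):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_representations(z, y, title):
    """Project representations z (N, dim) to 2-D and color points by class label y."""
    z2 = TSNE(n_components=2).fit_transform(z)
    plt.scatter(z2[:, 0], z2[:, 1], c=y, cmap="tab10", s=5)
    plt.title(title)
    plt.show()

# e.g., plot_representations(z_dir_gan, labels, "DIR-GAN")  # and likewise "DeepAll"
```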
PACS and OfficeHome. To the best of our knowledge, domain-invariant representation learning methods have not been applied widely and successfully to real-world computer vision datasets (e.g., PACS and OfficeHome) with very deep neural networks such as Resnet, so the only rel-