VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization
Seunghwan Choi*, Sunghyun Park*, Minsoo Lee*, Jaegul Choo
Korea Advanced Institute of Science and Technology (KAIST)
Image-based Virtual Try-on
• Virtual try-on refers to an image generation task of changing the clothing item on a
person into a different item.
• Various image-based virtual try-on methods commonly follow two steps:
1. Warping the clothing image to fit the human body.
2. Fusing the warped clothes with the person image, including pixel-level refinement, to generate the final result.
Limitation of Existing Virtual Try-On Approach
• Misalignment occurs between the warped clothes and the person’s body.
• The misaligned regions become more noticeable as the image size increases.
• A U-Net alone is insufficient for synthesizing initially occluded body parts in the final high-resolution image.
Key Contributions
• We propose a novel virtual try-on network called VITON-HD, the first model to successfully synthesize high-resolution virtual try-on images.
• We introduce a clothing-agnostic person representation that removes the
dependency on the clothing item originally worn.
• To address the misalignment problem, we propose ALIAS normalization and the ALIAS generator, which are effective in preserving clothing details.
• We demonstrate the superior performance of our method through experiments with
baselines on the newly collected dataset.
VITON-HD
VITON-HD
• Given a reference image 𝐼 containing a target person, we predict the segmentation
map 𝑆 and pose map 𝑃.
• We utilize 𝑆 and 𝑃 to construct a clothing-agnostic person image 𝐼𝑎 and segmentation map 𝑆𝑎, in which the information of the originally worn clothing item is fully eliminated.
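A minimal sketch of how such a clothing-agnostic image and segmentation might be built. The label id and the extra arm mask here are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def clothing_agnostic_inputs(image, seg, arms_mask, clothes_label=5):
    """Build a clothing-agnostic person image/segmentation sketch:
    erase pixels labelled as the originally worn clothes (plus arm
    regions likely covered by it) so the model cannot copy the old
    garment.

    image         : (H, W, 3) reference person image
    seg           : (H, W) integer segmentation map S
    arms_mask     : (H, W) boolean mask of arm regions to also erase
    clothes_label : id of the upper-clothes class (assumed)
    """
    clothes_mask = (seg == clothes_label) | arms_mask
    I_a = image.copy()
    I_a[clothes_mask] = 0   # blank out the originally worn clothes
    S_a = seg.copy()
    S_a[clothes_mask] = 0   # relabel erased pixels as background
    return I_a, S_a
```

Erasing both the garment and the arms it may occlude is what removes the dependency on the original clothing's shape.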
VITON-HD
• Given the clothing-agnostic person representation (𝑆𝑎, 𝑃) and the target clothing item 𝑐, the segmentation generator predicts the segmentation map 𝑆̂.
VITON-HD
• Using the clothing-agnostic person representation (𝐼𝑎, 𝑃) and 𝑆̂𝑐 as inputs, the Geometric Matching Module predicts the TPS transformation parameters 𝜃, and then 𝑐 is warped by 𝜃.
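As a rough sketch of the warping step: the TPS transform itself is omitted here, and we assume a dense sampling grid has already been derived from 𝜃. Nearest-neighbour sampling keeps the sketch short, whereas in practice differentiable bilinear sampling is used:

```python
import numpy as np

def warp_with_grid(c, grid):
    """Warp clothing image c by sampling it at grid coordinates.

    c    : (H, W, 3) clothing image
    grid : (H, W, 2) per-pixel sampling coords (x, y) in pixel space,
           e.g. produced by evaluating a TPS transform with parameters
           theta at every output pixel (assumed precomputed here).
    """
    H, W = c.shape[:2]
    ys = np.clip(np.round(grid[..., 1]).astype(int), 0, H - 1)
    xs = np.clip(np.round(grid[..., 0]).astype(int), 0, W - 1)
    return c[ys, xs]   # advanced indexing gathers one pixel per output location
```

With the identity grid this returns the input unchanged; any smooth deformation of the grid (such as a fitted TPS) bends the garment onto the body.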
VITON-HD
• The ALIAS generator synthesizes the final output image 𝐼̂ via ALIAS normalization and multi-scale refinement.
• The warped clothing image 𝑊(𝑐, 𝜃) and (𝐼𝑎, 𝑃) are injected into each layer of the generator to preserve the clothing details.
• In addition, each layer of the generator utilizes ALIAS normalization.
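A simplified stand-in for the layer-wise injection: resize the external conditions to each feature resolution and concatenate them channel-wise, so clothing details are re-supplied at every scale. The function name and shapes are assumptions for illustration:

```python
import numpy as np

def inject_conditions(h, warped_cloth, agnostic_inputs):
    """Concatenate resized conditioning inputs onto a feature map.

    h               : (C, H, W) generator feature map at the current scale
    warped_cloth    : (3, H0, W0) warped clothing image W(c, theta)
    agnostic_inputs : (K, H0, W0) clothing-agnostic image/pose channels
    """
    _, H, W = h.shape

    def resize(x):
        # nearest-neighbour resize via index mapping (dependency-free sketch)
        ys = np.arange(H) * x.shape[1] // H
        xs = np.arange(W) * x.shape[2] // W
        return x[:, ys][:, :, xs]

    return np.concatenate(
        [h, resize(warped_cloth), resize(agnostic_inputs)], axis=0
    )
```

Feeding the conditions at every scale, rather than only at the input, is what lets fine garment texture survive through the upsampling stack.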
ALIAS Normalization
• ALIAS normalization standardizes the regions of 𝑀misalign and the other regions in ℎ𝑖 separately, and then modulates the standardized activations using affine transformation parameters.
• The rationale behind this strategy is to remove the misleading information in the misaligned regions.
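The region-wise standardization above can be sketched as follows. This is a minimal NumPy sketch, assuming the modulation parameters 𝛾 and 𝛽 are already computed (in the paper they are inferred from the synthetic segmentation map):

```python
import numpy as np

def alias_normalize(h, m_misalign, gamma, beta, eps=1e-5):
    """Standardize the misaligned region and the remaining region of an
    activation map separately (per channel), then apply a spatially
    varying affine modulation.

    h          : (C, H, W) activation map h_i
    m_misalign : (H, W) binary mask of misaligned pixels M_misalign
    gamma,beta : (C, H, W) modulation parameters (assumed precomputed)
    """
    out = np.empty_like(h)
    mask = m_misalign.astype(bool)
    for region in (mask, ~mask):          # misaligned region, then the rest
        if region.sum() == 0:
            continue
        vals = h[:, region]                      # (C, N) pixels in region
        mu = vals.mean(axis=1, keepdims=True)    # per-channel region mean
        sigma = vals.std(axis=1, keepdims=True)  # per-channel region std
        out[:, region] = (vals - mu) / (sigma + eps)
    return gamma * out + beta
```

Standardizing the misaligned region with its own statistics prevents its (misleading) activations from shifting the statistics of the well-aligned pixels, which is the stated rationale of the normalization.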
Experiments
Figure: Qualitative comparison of the segmentation generator of ACGPN and VITON-HD.
The clothing-agnostic segmentation map used by each model is also reported.
Experiments
Experiments
Table: Quantitative comparison with baselines across different resolutions. VITON-HD* is a VITON-HD variant where the standardization in ALIAS normalization is replaced by channel-wise standardization as in the original instance normalization. For SSIM, higher is better; for LPIPS and FID, lower is better.
Experiments
Figure: Effects of ALIAS normalization. The orange-colored areas in the enlarged images indicate the misaligned regions.
Thank you!

[CVPR 2021] VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization
