Cross-domain Complementary Learning
with Synthetic Data for Multi-Person
Part Segmentation
Kevin Lin, Lijuan Wang, Kun Luo, Yinpeng Chen, Zicheng Liu, Ming-Ting Sun
University of Washington, Seattle
Microsoft, Redmond
International Conference on Computer Vision (ICCV), Demonstration, 2019
1
Outline
• Introduction
• Related works
• Proposed method
• Experiments
• On-going work and Conclusion
2
Human part segmentation
• Human part segmentation aims to partition each person in an image
into multiple semantically consistent regions.
Typically 14 parts: head, torso, left upper-arm, right upper-arm, left lower-arm, right lower-arm,
left hand, right hand, left thigh, right thigh, left shank, right shank, left foot, right foot
Input Image Part Segmentation
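One way to make the 14-part label set concrete is a small label map. This is only a sketch: the part names come from the list above, but the integer IDs, the extra BACKGROUND class, and the helper `parts_present` are assumptions made for illustration, not the dataset's actual encoding.

```python
from enum import IntEnum


class BodyPart(IntEnum):
    """Hypothetical label IDs for the 14 parts (ID order is an assumption)."""
    BACKGROUND = 0  # assumed extra class for non-person pixels
    HEAD = 1
    TORSO = 2
    LEFT_UPPER_ARM = 3
    RIGHT_UPPER_ARM = 4
    LEFT_LOWER_ARM = 5
    RIGHT_LOWER_ARM = 6
    LEFT_HAND = 7
    RIGHT_HAND = 8
    LEFT_THIGH = 9
    RIGHT_THIGH = 10
    LEFT_SHANK = 11
    RIGHT_SHANK = 12
    LEFT_FOOT = 13
    RIGHT_FOOT = 14


def parts_present(label_map):
    """Return the sorted body parts found in a 2D label map, background excluded."""
    ids = {value for row in label_map for value in row}
    return sorted(BodyPart(v) for v in ids if v != BodyPart.BACKGROUND)


# A tiny 2x2 "segmentation" containing background, head, and torso pixels.
print(parts_present([[0, 1], [2, 2]]))
```

A part segmentation output is then simply an image-sized grid of these IDs, one per pixel.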
3
Challenges
• Pixel-level labeling of training data is very expensive and labor-intensive.
4
Previous works
• People have been exploring synthetic data as an alternative.
• They trained deep CNNs on the synthetic data.
Samples of the synthetic training data and the synthetic labels [CVPR17]
5
Previous works
Their method works well only in well-controlled, single-person
scenarios.
Learning from Synthetic Humans, CVPR 2017
Input
images
Output
results
6
The domain gap
• The discrepancy of pixel-value distributions between the
synthetic and real data makes transferring the knowledge
from the synthetic to real domain challenging.
Synthetic image Real images
7
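As a rough illustration of what a discrepancy between pixel-value distributions means, the sketch below compares normalized intensity histograms of two images using total variation distance. The choice of histogram bins and of total variation as the measure is an assumption made here for illustration; the slides do not specify how the gap is quantified.

```python
def intensity_histogram(pixels, bins=8, max_value=255):
    """Normalized histogram of integer pixel intensities in [0, max_value]."""
    counts = [0] * bins
    for p in pixels:
        index = min(p * bins // (max_value + 1), bins - 1)
        counts[index] += 1
    return [c / len(pixels) for c in counts]


def total_variation(hist_a, hist_b):
    """Total variation distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(hist_a, hist_b))


# Toy "images" given as flat lists of grayscale values (made-up numbers).
synthetic_pixels = [0, 10, 20, 30]
real_pixels = [200, 210, 220, 230]

gap = total_variation(intensity_histogram(synthetic_pixels),
                      intensity_histogram(real_pixels))
print(gap)
```

A value near 1.0 indicates that the two images occupy almost entirely different intensity ranges, which is the kind of mismatch that makes synthetic-to-real transfer hard.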
Related works on street-view segmentation
• People are also trying to use graphics simulation for training a
segmentation model for street-view images.
• They also observe the domain-gap issue.
Zhang et al, Fully Convolutional Adaptation Networks for Semantic Segmentation, CVPR 2018.
8
Related works on street-view segmentation
• Previous studies tried to address the domain-gap issue by using
adversarial training.
• They use a discriminator to distinguish whether the input is from the
source or target domain.
[Tsai et al, ICCV2019], [Tsai et al, CVPR 2018], [Ren et al, CVPR2018], [Tzeng et al, CVPR2017], [Ganin et al, ICML2015]
Graphics simulation
Real-world images
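The discriminator idea can be sketched generically as two opposing losses. This is a minimal illustration of adversarial domain adaptation in general, not the loss of any cited paper; the scalar "scores" stand in for discriminator outputs, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for a single probability and a 0/1 label."""
    return -(label * math.log(prob + eps) + (1.0 - label) * math.log(1.0 - prob + eps))


def discriminator_loss(source_scores, target_scores):
    """The discriminator tries to output 1 for source inputs and 0 for target inputs."""
    losses = [bce(sigmoid(s), 1.0) for s in source_scores]
    losses += [bce(sigmoid(t), 0.0) for t in target_scores]
    return sum(losses) / len(losses)


def adversarial_loss(target_scores):
    """The segmentation network tries to make target inputs look like source (label 1)."""
    return sum(bce(sigmoid(t), 1.0) for t in target_scores) / len(target_scores)


# A discriminator that separates the domains well has a low loss,
# while the segmentation network is penalized heavily for the same scores.
print(discriminator_loss([5.0], [-5.0]))
print(adversarial_loss([-5.0]))
```

Training alternates between lowering the first loss (discriminator) and the second (segmentation network), which pushes the two domains toward indistinguishable outputs.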
9
Challenges
• Can we learn human part segmentation without data labeling?
• How can we learn human part segmentation from graphics simulations
and make the resulting model work well in real-world scenarios?
We propose a new approach, named cross-domain complementary
learning (CDCL), to address these challenges.
10
Our multi-person synthetic data
• We create a new multi-person synthetic dataset which contains multiple
persons performing various actions in a 3D room.
11
The idea
• We observe that real and synthetic humans both have
a skeleton (pose) representation.
12
Proposed method
• We propose to bridge the domains with skeletons and learn part
segmentation from synthetic data.
13
Proposed network: Module 1
[Figure: Real Inputs → Backbone (ResNet101) → Head networks → Keypoint Maps and Part Affinity Fields → Skeletons]
The network architecture is similar to “Realtime Multi-Person
2D Pose Estimation using Part Affinity Fields,” in CVPR 2017.
14
Proposed network: Module 2
[Figure: Synthetic Inputs → Backbone (ResNet101) → Head networks → Keypoint Maps, Part Affinity Fields, and Body Part Maps → Skeletons and Body Part Segmentation]
15
Two modules are trained interchangeably
[Figure: Module 1 (Real Inputs → Backbone (ResNet101) → Head networks → Keypoint Maps and Part Affinity Fields → Skeletons) and Module 2 (Synthetic Inputs → Backbone (ResNet101) → Head networks → Keypoint Maps, Part Affinity Fields, and Body Part Maps → Skeletons and Body Part Segmentation), with parameter sharing between the two modules]
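The training scheme can be illustrated with a toy sketch, assuming that "interchangeably" means alternating between a real batch (pose labels only) and a synthetic batch (pose and part labels) while both modules share the same parameters. Everything here is hypothetical: the model is a scalar stand-in (one shared "backbone" weight and two "head" weights), not the ResNet101 network from the slides.

```python
def train_step(params, batch, lr=0.01):
    """One gradient step on the shared backbone and on the heads labeled in this batch."""
    w = params["backbone"]
    grads = {"backbone": 0.0}
    loss = 0.0
    n = len(batch["x"])
    for head in ("pose", "part"):
        if head not in batch:
            continue  # real batches carry no part labels, so the part head is untouched
        h = params[head]
        grads[head] = 0.0
        for x, y in zip(batch["x"], batch[head]):
            err = h * w * x - y                      # prediction = head * backbone * input
            loss += err * err / n                    # mean squared error
            grads[head] += 2 * err * w * x / n
            grads["backbone"] += 2 * err * h * x / n
    for name, g in grads.items():
        params[name] -= lr * g
    return loss


def train_alternately(params, real_batch, synthetic_batch, steps):
    """Alternate real and synthetic batches; the same params dict is shared by both."""
    history = []
    for step in range(steps):
        if step % 2 == 0:
            domain, batch = "real", real_batch
        else:
            domain, batch = "synthetic", synthetic_batch
        history.append((domain, train_step(params, batch)))
    return history


params = {"backbone": 1.0, "pose": 0.5, "part": 0.5}
real = {"x": [1.0, 2.0], "pose": [2.0, 4.0]}
synthetic = {"x": [1.0, 2.0], "pose": [2.0, 4.0], "part": [3.0, 6.0]}
history = train_alternately(params, real, synthetic, steps=200)
print(history[1], history[-1])
```

The point of the sketch is the data flow: pose supervision from both domains shapes the shared weights, while part supervision comes only from synthetic batches.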
16
Evaluation metric
• Intersection over Union (IoU) is one of the most commonly used
metrics in semantic segmentation.
• IoU is calculated for each body part category separately.
• We average over all categories to provide a mean IoU.
\[
\mathrm{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}} = \frac{|\text{Prediction} \cap \text{Ground truth}|}{|\text{Prediction} \cup \text{Ground truth}|}
\]
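The per-category IoU and its mean can be sketched in a few lines for label maps stored as nested lists. Two details are assumptions not stated on the slide: categories absent from both prediction and ground truth are skipped when averaging, and the function names are illustrative.

```python
def per_class_iou(pred, truth, num_classes):
    """IoU for each class ID that appears in the prediction or the ground truth."""
    ious = {}
    for c in range(num_classes):
        intersection = union = 0
        for pred_row, truth_row in zip(pred, truth):
            for p, t in zip(pred_row, truth_row):
                in_pred, in_truth = (p == c), (t == c)
                intersection += in_pred and in_truth
                union += in_pred or in_truth
        if union > 0:  # assumption: skip classes absent from both maps
            ious[c] = intersection / union
    return ious


def mean_iou(pred, truth, num_classes):
    """Average of the per-class IoU values."""
    ious = per_class_iou(pred, truth, num_classes)
    return sum(ious.values()) / len(ious)


pred = [[0, 1],
        [1, 1]]
truth = [[0, 1],
         [0, 1]]
# Class 0: intersection 1, union 2. Class 1: intersection 2, union 3.
print(per_class_iou(pred, truth, num_classes=2))
print(mean_iou(pred, truth, num_classes=2))
```

In this toy case the mean IoU is the average of 1/2 and 2/3, that is 7/12.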
17
Evaluation benchmarks
• Pascal-Person-Parts dataset
• 1716 training images
• 1817 test images
• COCO-DensePose dataset
• 26151 training images
• 1508 test images
18
Comparison on Pascal and COCO (mIoU, %)
[Bar chart. Methods: Synthetic Only; Adversarial Training; Fang et al. (CVPR18); Ours; Chen et al. (TPAMI18); Gong et al. (CVPR17); Ours + Real part labels. Group labels: "Use real part labels"; "Use additional real part labels"; "Ideal".]
19
Comparison on Pascal and COCO (mIoU, %)
[Bar chart with the same methods as on the previous slide, annotated: "Performance Gap".]
20
Comparison on Pascal and COCO (mIoU, %)
[Bar chart with the same methods as on the previous slides, annotated: "Relax labeling requirements!"]
22
Comparison on Pascal and COCO (mIoU, %)
[Bar chart with the same methods as on the previous slides, annotated: "Our performance upper bound".]
23
Qualitative comparison
Training with Synthetic Data Only
[CVPR17]
Ours
24
Qualitative comparison
Domain Adaptation with
Adversarial Training
[CVPR18]
Ours
25
Ablation study
26
Synthetic training data analysis
27
Qualitative comparison
[1] Learning from Synthetic Humans, CVPR17.
28
Qualitative comparison
[1] Learning from Synthetic Humans, CVPR17.
29
General approach
• Our proposed cross-domain training approach is general and can be
extended to other applications, such as novel keypoint detection.
We can simply generate new labels on the synthetic data.
30
Novel keypoint detection
• Some applications need to detect additional keypoints (e.g., joints) such
as hand tips, toes, the pelvis, and the spine.
• We create novel keypoints using the graphics simulator and train our model
to detect a new human skeleton that includes keypoints on the hands and feet.
The definition of our newly
created novel keypoints
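One plausible way a graphics simulator can produce such labels is by projecting known 3D joint positions into the image. The sketch below assumes a simple pinhole camera; the camera model, the joint names, and the functions `project_point` and `keypoint_labels` are all hypothetical, and occlusion is ignored. The slides do not describe how the labels were actually generated.

```python
def project_point(point_3d, focal_length, principal_point):
    """Pinhole projection of a camera-space 3D point; None if it is behind the camera."""
    x, y, z = point_3d
    if z <= 0:
        return None
    cx, cy = principal_point
    return (focal_length * x / z + cx, focal_length * y / z + cy)


def keypoint_labels(joints_3d, focal_length, principal_point, image_size):
    """Map each joint name to its 2D pixel position, or None if outside the image."""
    width, height = image_size
    labels = {}
    for name, point in joints_3d.items():
        uv = project_point(point, focal_length, principal_point)
        inside = uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height
        labels[name] = uv if inside else None
    return labels


# Made-up camera-space joint positions (meters) for a single synthetic person.
joints = {
    "pelvis": (0.0, 0.0, 2.0),
    "left_toe": (0.5, 0.5, 2.0),
    "right_hand_tip": (2.0, 0.0, 2.0),  # projects outside the 100x100 image
}
print(keypoint_labels(joints, focal_length=100.0,
                      principal_point=(50.0, 50.0), image_size=(100, 100)))
```

Because the simulator knows every joint position, adding a new keypoint type only requires adding another entry to the joint dictionary, with no manual annotation.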
31
Qualitative results
32
Conclusion
• We find that human pose is very effective for bridging the real and
synthetic domains for multi-person part segmentation.
• We introduce an effective framework to leverage information in
both real and synthetic images for multi-person part segmentation.
• Our method can be extended to generate labels for keypoints such
as those on hands and feet in real images without human labeling.
33
On-going work and future directions
• Reconstruct a 3D human mesh from a single image
without ground-truth training labels
34
On-going work and future directions
• Training data labeling for 3D body shape is very expensive.
First stage:
Ask workers to label parts
Second stage:
Ask workers to label the corresponding
points on 3D human model
Sampled points: uniformly sampled points within the part
Guler et al., “DensePose: Learning image-to-surface correspondence,” CVPR 2018.
35
On-going work and future directions
• We plan to explore different approaches to learning
3D human body shape from graphics simulations.
36
Thank you
37
