This document proposes a cross-domain complementary learning method with synthetic data for multi-person part segmentation. The method trains two modules interchangeably: one on synthetic data to predict keypoints and part segmentation, and one on real data to predict keypoints. By sharing parameters between the modules and leveraging the common skeleton representation in both domains, the method is able to transfer knowledge between synthetic and real data to improve part segmentation performance without requiring real part labels. Experimental results show the method outperforms alternatives that only use synthetic or real data, demonstrating it can relax labeling requirements for multi-person part segmentation tasks.
Cross-domain complementary learning with synthetic data for multi-person part segmentation
1. Cross-domain Complementary Learning with Synthetic Data for Multi-Person Part Segmentation
Kevin Lin, Lijuan Wang, Kun Luo, Yinpeng Chen, Zicheng Liu, Ming-Ting Sun
University of Washington, Seattle
Microsoft, Redmond
International Conference on Computer Vision (ICCV), Demonstration, 2019
3. Human part segmentation
• Human part segmentation aims at partitioning the persons in an image
into multiple semantically consistent regions.
Typically 14 parts: head, torso, left upper-arm, right upper-arm, left lower-arm, right lower-arm,
left hand, right hand, left thigh, right thigh, left shank, right shank, left foot, right foot.
[Figure: Input Image | Part Segmentation]
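As a rough illustration of the 14-part label set listed above, the part names can be mapped to integer class IDs for a segmentation label map. This is a minimal sketch: the names `PART_NAMES`, `PART_TO_ID`, and `ID_TO_PART`, the ordering, and the use of 0 for background are assumptions made here, not conventions stated on the slides.

```python
# Hypothetical label table for the 14 body parts named on the slide.
# Assumption: ID 0 is reserved for background and parts are numbered 1..14
# in the order they are listed; the actual dataset may use another order.
PART_NAMES = [
    "head", "torso",
    "left upper-arm", "right upper-arm",
    "left lower-arm", "right lower-arm",
    "left hand", "right hand",
    "left thigh", "right thigh",
    "left shank", "right shank",
    "left foot", "right foot",
]

PART_TO_ID = {name: idx for idx, name in enumerate(PART_NAMES, start=1)}
ID_TO_PART = {idx: name for name, idx in PART_TO_ID.items()}


def parts_present(label_map):
    """Return the sorted part names that occur in a 2D list of class IDs."""
    ids = {value for row in label_map for value in row if value != 0}
    return [ID_TO_PART[i] for i in sorted(ids)]


# A tiny 2x3 label map containing background, head, and torso pixels.
example = [[0, 1, 1],
           [0, 2, 2]]
print(parts_present(example))
```

A part segmentation output would then be an image-sized array of these IDs, one per pixel.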
5. Previous works
• People have been exploring synthetic data as an alternative.
• They trained deep CNNs on the synthetic data.
Samples of the synthetic training data and the synthetic labels [CVPR17]
6. Previous works
Their method works well only in well-controlled, single-person
scenarios.
Learning from Synthetic Humans, CVPR 2017
[Figure: Input images | Output results]
7. The domain gap
• The discrepancy of pixel-value distributions between the
synthetic and real data makes transferring the knowledge
from the synthetic to real domain challenging.
[Figure: Synthetic image | Real images]
8. Related works on street-view segmentation
• People are also trying to use graphics simulation for training a
segmentation model for street-view images.
• They also observe the domain-gap issue.
Zhang et al, Fully Convolutional Adaptation Networks for Semantic Segmentation, CVPR 2018.
9. Related works on street-view segmentation
• Previous studies tried to address the domain-gap issue by using
adversarial training.
• They use a discriminator to distinguish whether the input is from the
source or target domain.
[Tsai et al, ICCV2019], [Tsai et al, CVPR 2018], [Ren et al, CVPR2018], [Tzeng et al, CVPR2017], [Ganin et al, ICML2015]
[Figure: Graphics simulation | Real-world images]
10. Challenges
• Can we learn human part segmentation without data labeling?
• How can we learn human part segmentation from graphics simulations
and make the resulting model work well in real-world scenarios?
We propose a new approach, named cross-domain complementary
learning (CDCL), to address these challenges.
11. Our multi-person synthetic data
• We create a new multi-person synthetic dataset which contains multiple
persons performing various actions in a 3D room.
12. The idea
• We observe that real and synthetic humans both have
a skeleton (pose) representation.
13. Proposed method
• We propose to bridge the domains with skeletons and learn part
segmentation from synthetic data.
14. Proposed network: Module 1
[Diagram: Real Inputs → Backbone (ResNet101) → Head networks → Keypoint Maps and Part Affinity Fields → Skeletons]
The network architecture is similar to “Realtime Multi-Person
2D Pose Estimation using Part Affinity Fields,” in CVPR 2017.
15. Proposed network: Module 2
[Diagram: Synthetic Inputs → Backbone (ResNet101) → Head networks → Keypoint Maps, Part Affinity Fields, and Body Part Maps → Skeletons and Body Part Segmentation]
16. Two modules are trained interchangeably
[Diagram: Module 1 (Real Inputs → Backbone (ResNet101) → Head networks → Keypoint Maps and Part Affinity Fields → Skeletons) and Module 2 (Synthetic Inputs → Backbone (ResNet101) → Head networks → Keypoint Maps, Part Affinity Fields, and Body Part Maps → Skeletons and Body Part Segmentation), linked by Parameter Sharing]
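The training scheme on this slide can be sketched with a toy example. The sketch below reads "trained interchangeably" as alternating one synthetic-data update with one real-data update, and it replaces the ResNet101 network with a hypothetical three-parameter scalar model, so it only illustrates how shared parameters let real keypoint supervision influence a part head that is trained on synthetic labels alone. The function name `train_alternating`, the toy data, the learning rate, and the choice of which parameters are shared are all assumptions, not details from the slides.

```python
def train_alternating(synthetic, real, steps=2000, lr=0.02):
    """Alternate synthetic and real updates on a toy scalar model.

    Hypothetical toy model (not the network shown on the slides):
      feature  = w * x        # shared "backbone"
      keypoint = a * feature  # keypoint head, used in both domains
      part     = b * feature  # part head, supervised on synthetic data only
    """
    w, a, b = 0.5, 0.5, 0.5
    for step in range(steps):
        if step % 2 == 0:
            # Synthetic step: keypoint and part labels are both available.
            x, kp_target, part_target = synthetic[(step // 2) % len(synthetic)]
        else:
            # Real step: only keypoint labels are available.
            x, kp_target = real[(step // 2) % len(real)]
            part_target = None

        feature = w * x
        kp_err = a * feature - kp_target
        # Gradients of the squared keypoint error.
        grad_w = 2 * kp_err * a * x
        grad_a = 2 * kp_err * feature
        grad_b = 0.0
        if part_target is not None:
            # Add gradients of the squared part error (synthetic only).
            part_err = b * feature - part_target
            grad_w += 2 * part_err * b * x
            grad_b = 2 * part_err * feature

        # The shared parameters w and a are updated in both domains.
        w -= lr * grad_w
        a -= lr * grad_a
        b -= lr * grad_b
    return w, a, b


# Toy data: the keypoint target is 2*x and the part target is 3*x.
synthetic = [(0.5, 1.0, 1.5), (1.0, 2.0, 3.0)]
real = [(0.5, 1.0), (1.0, 2.0)]
w, a, b = train_alternating(synthetic, real)
print(round(a * w, 3), round(b * w, 3))
```

In this toy, the part head `b` never sees a real example, yet its prediction depends on the shared `w`, which is updated by both domains. That is the mechanism the slide labels "Parameter Sharing".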
17. Evaluation metric
• Intersection over Union (IoU) is one of the most commonly used
metrics in semantic segmentation.
• IoU is calculated for each body part category separately.
• We average over all categories to provide a mean IoU.
$$\mathrm{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}} = \frac{|\text{Prediction} \cap \text{Ground truth}|}{|\text{Prediction} \cup \text{Ground truth}|}$$
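The metric can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the evaluation code used for the reported results: the function names `per_class_iou` and `mean_iou` are hypothetical, label maps are assumed to be 2D lists of integer class IDs, and categories absent from both prediction and ground truth are skipped when averaging, which is an assumption the slides do not address.

```python
def per_class_iou(pred, gt, num_classes):
    """Return one IoU per class; None if the class appears in neither map."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p == g:
                # Pixel is in both prediction and ground truth for class p.
                inter[p] += 1
                union[p] += 1
            else:
                # Pixel is in the union of class p and of class g,
                # but in the intersection of neither.
                union[p] += 1
                union[g] += 1
    return [inter[c] / union[c] if union[c] else None
            for c in range(num_classes)]


def mean_iou(pred, gt, num_classes):
    """Average the per-class IoU over the classes that occur."""
    ious = [v for v in per_class_iou(pred, gt, num_classes) if v is not None]
    if not ious:
        raise ValueError("no class occurs in prediction or ground truth")
    return sum(ious) / len(ious)


# Tiny example with three classes on a 2x2 image.
pred = [[0, 1],
        [1, 2]]
gt = [[0, 1],
      [2, 2]]
print(per_class_iou(pred, gt, 3))  # [1.0, 0.5, 0.5]
print(mean_iou(pred, gt, 3))
```

Whether the background class is included in the average depends on the benchmark protocol. Here every class index below `num_classes` is treated alike.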
19. Comparison on Pascal and COCO (mIOU, %)
[Bar chart; values not recoverable. Methods compared: Synthetic Only; Adversarial Training; Fang et al., CVPR18; Ours; Chen et al., TPAMI18; Gong et al., CVPR17; Ours + Real part labels. Group labels: "Use real part labels", "Use additional real part labels". Annotation: "Ideal".]
20. Comparison on Pascal and COCO (mIOU, %)
[Same bar chart as slide 19, with an added annotation: "Performance Gap".]
21. Comparison on Pascal and COCO (mIOU, %)
[Same bar chart as slide 19, with an added annotation: "Performance Gap".]
22. Comparison on Pascal and COCO (mIOU, %)
[Same bar chart as slide 19, with an added annotation: "Relax labeling requirements!"]
23. Comparison on Pascal and COCO (mIOU, %)
[Same bar chart as slide 19, with an added annotation: "Our performance upper bound".]
30. General approach
• Our proposed cross-domain training approach is general and can be
extended to other applications, such as novel keypoint detection.
We can simply generate new labels on the synthetic data.
31. Novel keypoint detection
• In some applications, we need to detect other keypoints (e.g., joints), such
as hand tips, toes, pelvis, and spine.
• We create novel keypoints using the graphics simulator and train our model
to detect a new human skeleton that includes keypoints on the hands and feet.
[Figure: The definition of our newly created novel keypoints]
33. Conclusion
• We find that human pose is very effective for bridging the real and
synthetic domains for multi-person part segmentation.
• We introduce an effective framework to leverage information in
both real and synthetic images for multi-person part segmentation.
• Our method can be extended to generate labels for keypoints such
as those on hands and feet in real images without human labeling.
34. On-going work and future directions
• Reconstruct 3D human mesh from a single image
without ground truth training labels
35. On-going work and future directions
• Training data labeling for 3D body shape is very expensive.
First stage: ask workers to label parts.
Second stage: ask workers to label the corresponding points on a 3D human model.
Sampled points: uniformly sampled points within the part.
Güler et al., “DensePose: Dense Human Pose Estimation in the Wild,” CVPR 2018.
36. On-going work and future directions
• We plan to explore different approaches to learn
human 3D body shape from graphics simulations.