Cross-domain complementary learning with synthetic data for multi-person part segmentation

Cross-domain Complementary Learning
with Synthetic Data for Multi-Person
Part Segmentation
Kevin Lin, Lijuan Wang, Kun Luo, Yinpeng Chen, Zicheng Liu, Ming-Ting Sun
University of Washington, Seattle
Microsoft, Redmond
International Conference on Computer Vision (ICCV), Demonstration, 2019
1

Outline
• Introduction
• Related works
• Proposed method
• Experiments
• On-going work and Conclusion
2

Human part segmentation
• Human part segmentation aims to partition the persons in an image
into multiple semantically consistent regions.
Typically 14 parts: head, torso, left upper-arm, right upper-arm, left lower-arm,
right lower-arm, left hand, right hand, left thigh, right thigh, left shank,
right shank, left foot, right foot
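To make the output format concrete, here is a minimal sketch of how a part-segmentation label map could be split into one binary mask per part. The numbering is an assumption made for illustration (0 for background, 1 to 14 in the order listed on this slide); the slides do not specify the label encoding, and `part_masks` is a hypothetical helper name.

```python
import numpy as np

# The 14 parts named on the slide. Assumption: label 0 is background and
# parts are numbered 1..14 in this order (the slides do not specify this).
PART_NAMES = [
    "head", "torso",
    "left upper-arm", "right upper-arm",
    "left lower-arm", "right lower-arm",
    "left hand", "right hand",
    "left thigh", "right thigh",
    "left shank", "right shank",
    "left foot", "right foot",
]


def part_masks(label_map):
    """Split an integer label map (H x W) into one boolean mask per part."""
    label_map = np.asarray(label_map)
    return {name: label_map == idx for idx, name in enumerate(PART_NAMES, start=1)}


# Toy 2x2 label map: background, head, torso, right foot.
toy = np.array([[0, 1], [2, 14]])
masks = part_masks(toy)
```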
Figure: input image and its part segmentation
3

Challenges
• Labeling training data at the pixel level is very expensive and labor-intensive.
4

Previous works
• Prior work has explored synthetic data as an alternative to manual labeling.
• These methods train a deep CNN on the synthetic data.
Samples of the synthetic training data and the synthetic labels [CVPR17]
5

Previous works
This method works well only in well-controlled, single-person
scenarios.
Learning from Synthetic Humans, CVPR 2017
Figure: input images and output results
6

The domain gap
• The discrepancy of pixel-value distributions between the
synthetic and real data makes transferring the knowledge
from the synthetic to real domain challenging.
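As a rough illustration of what a discrepancy in pixel-value distributions means, the sketch below compares normalized intensity histograms of two images with an L1 distance. This is only a crude proxy chosen for illustration, not a measurement used in this work; the function names and the bin count are assumptions.

```python
import numpy as np


def intensity_histogram(image, bins=32):
    """Normalized histogram of 8-bit pixel intensities (sums to 1)."""
    hist, _ = np.histogram(np.asarray(image), bins=bins, range=(0, 256))
    return hist / hist.sum()


def histogram_l1_distance(image_a, image_b, bins=32):
    """L1 distance between two intensity histograms; ranges from 0 to 2."""
    diff = intensity_histogram(image_a, bins) - intensity_histogram(image_b, bins)
    return float(np.abs(diff).sum())


# Toy stand-ins for a "synthetic" and a "real" image (not actual data).
dark = np.zeros((4, 4), dtype=np.uint8)
bright = np.full((4, 4), 255, dtype=np.uint8)
gap = histogram_l1_distance(dark, bright)
```

A distance of 0 means identical histograms and 2 means the histograms do not overlap at all.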
Figure: synthetic image vs. real images
7

Related works on street-view segmentation
• Graphics simulation has also been used to train segmentation
models for street-view images.
• These works also observe the domain-gap issue.
Zhang et al., Fully Convolutional Adaptation Networks for Semantic Segmentation, CVPR 2018.
8

Related works on street-view segmentation
• Previous studies tried to address the domain-gap issue by using
adversarial training.
• They use a discriminator to distinguish whether the input is from the
source or target domain.
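The discriminator idea can be sketched with two binary cross-entropy losses. This is a generic, simplified formulation under our own assumptions (source labeled 1, target labeled 0, hypothetical function names); the cited papers differ in their exact losses and architectures.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy between predicted probabilities and one label."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob)))


def discriminator_loss(logits_source, logits_target):
    """The discriminator tries to output 1 for source and 0 for target inputs."""
    return bce(sigmoid(logits_source), 1.0) + bce(sigmoid(logits_target), 0.0)


def adversarial_loss(logits_target):
    """The segmentation network tries to make target inputs look like source."""
    return bce(sigmoid(logits_target), 1.0)


# An undecided discriminator (logit 0, probability 0.5) on both domains.
undecided = discriminator_loss(np.zeros(4), np.zeros(4))
```

In training, the two losses are minimized in turn: the discriminator with `discriminator_loss`, the segmentation network with its task loss plus a weighted `adversarial_loss`.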
[Tsai et al, ICCV2019], [Tsai et al, CVPR 2018], [Ren et al, CVPR2018], [Tzeng et al, CVPR2017], [Ganin et al, ICML2015]
Graphics simulation
Real-world images
9

Challenges
• Can we learn human part segmentation without data labeling?
• How can we learn human part segmentation from graphics simulations
so that the resulting model works well in real-world scenarios?
We propose a new approach, cross-domain complementary learning
(CDCL), to address these challenges.
10

Our multi-person synthetic data
• We create a new multi-person synthetic dataset that contains multiple
persons performing various actions in a 3D room.
11

The idea
• We observe that real and synthetic humans both have
a skeleton (pose) representation.
12

Proposed method
• We propose to bridge the domains with skeletons and learn part
segmentation from synthetic data.
13

Proposed network: Module 1
Diagram: Real inputs → Backbone (ResNet101) → Head networks →
Keypoint maps and Part affinity fields → Skeletons
The network architecture is similar to “Realtime Multi-Person
2D Pose Estimation using Part Affinity Fields,” in CVPR 2017.
14

Proposed network: Module 2
Diagram: Synthetic inputs → Backbone (ResNet101) → Head networks →
Keypoint maps, Part affinity fields, and Body part maps →
Skeletons and Body part segmentation
15

Two modules are trained interchangeably
Diagram:
Module 1: Real inputs → Backbone (ResNet101) → Head networks →
Keypoint maps and Part affinity fields → Skeletons
Module 2: Synthetic inputs → Backbone (ResNet101) → Head networks →
Keypoint maps, Part affinity fields, and Body part maps →
Skeletons and Body part segmentation
Parameter sharing between Module 1 and Module 2
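The training scheme can be sketched with a toy linear model. This is not the actual network: we read "interchangeably" as alternating between the two modules, and we assume the backbone and the skeleton head are the shared parameters while the part head belongs to Module 2 only. All sizes, names, and the squared-error loss are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, F, K, P = 8, 6, 4, 3  # toy sizes: input, feature, skeleton output, part output

# One parameter set used by both modules (assumed sharing scheme).
params = {
    "backbone": rng.normal(scale=0.1, size=(F, D)),
    "skeleton_head": rng.normal(scale=0.1, size=(K, F)),
    "part_head": rng.normal(scale=0.1, size=(P, F)),  # Module 2 only
}


def step(params, x, skeleton_target, part_target=None, lr=0.05):
    """One SGD step on a squared-error loss; returns the loss before the update."""
    n = x.shape[1]
    W, A, B = params["backbone"], params["skeleton_head"], params["part_head"]
    feat = W @ x
    err_s = A @ feat - skeleton_target
    loss = 0.5 * np.sum(err_s ** 2) / n
    grad_feat = A.T @ err_s / n
    if part_target is not None:  # synthetic batch: part labels are available
        err_p = B @ feat - part_target
        loss += 0.5 * np.sum(err_p ** 2) / n
        grad_feat += B.T @ err_p / n
        params["part_head"] = B - lr * (err_p @ feat.T / n)
    params["skeleton_head"] = A - lr * (err_s @ feat.T / n)
    params["backbone"] = W - lr * (grad_feat @ x.T)
    return float(loss)


def make_batch(n=16, with_parts=False):
    """Random stand-in data; real training would load images and labels."""
    x = rng.normal(size=(D, n))
    ys = rng.normal(size=(K, n))
    yp = rng.normal(size=(P, n)) if with_parts else None
    return x, ys, yp


history = []
for it in range(6):
    if it % 2 == 0:  # Module 1: real inputs, skeleton labels only
        x, ys, _ = make_batch()
        history.append(("real", step(params, x, ys)))
    else:            # Module 2: synthetic inputs, skeleton and part labels
        x, ys, yp = make_batch(with_parts=True)
        history.append(("synthetic", step(params, x, ys, yp)))
```

Because the backbone is updated by both kinds of batches, the part head (trained only on synthetic data) operates on features that are also shaped by real images.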
16

Evaluation metric
• Intersection over Union (IoU) is one of the most commonly used
metrics in semantic segmentation.
• IoU is calculated for each body part category separately.
• We average over all categories to provide a mean IoU.
$$\mathrm{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}} = \frac{|\text{Prediction} \cap \text{Ground truth}|}{|\text{Prediction} \cup \text{Ground truth}|}$$
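A minimal sketch of the metric on integer label maps follows. Skipping categories that appear in neither the prediction nor the ground truth is a convention we assume here; benchmark toolkits may handle that case differently, and the function names are ours.

```python
import numpy as np


def per_class_iou(pred, gt, num_classes):
    """IoU for each category; NaN where a category is absent from both maps."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union > 0:
            ious[c] = np.logical_and(p, g).sum() / union
    return ious


def mean_iou(pred, gt, num_classes):
    """Average of the per-category IoUs, ignoring absent categories."""
    return float(np.nanmean(per_class_iou(pred, gt, num_classes)))


# Toy 2x2 example with three categories.
pred = np.array([[0, 1], [1, 2]])
gt = np.array([[0, 1], [2, 2]])
ious = per_class_iou(pred, gt, num_classes=3)  # [1.0, 0.5, 0.5]
miou = mean_iou(pred, gt, num_classes=3)       # 2/3
```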
17

Evaluation benchmarks
• Pascal-Person-Parts dataset
  • 1716 training images
  • 1817 test images
• COCO-DensePose dataset
  • 26151 training images
  • 1508 test images
18

Comparison on Pascal and COCO (mIoU, %)
[Bar chart. Methods shown: Synthetic Only; Adversarial Training; Ours;
Fang et al., CVPR18; Chen et al., TPAMI18; Gong et al., CVPR17;
Ours + Real part labels; Ideal.
Group labels: "Use real part labels", "Use additional real part labels".]
19

[Comparison chart repeated, with annotation: "Performance Gap"]
20

[Comparison chart repeated, with annotation: "Performance Gap"]
21

[Comparison chart repeated, with annotation: "Relax labeling requirements!"]
22

[Comparison chart repeated, with annotation: "Our performance upper bound"]
23

Qualitative comparison
Training with Synthetic Data Only
[CVPR17]
Ours
24

Domain Adaptation with
Adversarial Training
[CVPR18]
Ours
25

Synthetic training data analysis
27

[1] Learning from Synthetic Humans, CVPR17.
28

[1] Learning from Synthetic Humans, CVPR17.
29

General approach
• Our proposed cross-domain training approach is general and can be
extended to other applications, such as novel keypoint detection.
We can simply generate new labels on the synthetic data.
30

Novel keypoint detection
• In some applications, we need to detect additional keypoints (e.g., joints),
such as hand tips, toes, pelvis, and spine.
• We create novel keypoints using the graphics simulator and train our model
to detect a new human skeleton that includes keypoints on the hands and feet.
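One way a simulator could turn 3D joint positions into 2D keypoint labels is a pinhole projection. The sketch below assumes the simulator exposes each joint's 3D position in camera coordinates and that a simple pinhole camera with a single focal length applies; the slides do not describe the actual labeling pipeline, and `project_keypoints` is a hypothetical helper.

```python
import numpy as np


def project_keypoints(points_3d, focal, cx, cy):
    """Project 3D points (N x 3, camera coordinates, z > 0) to pixel coordinates."""
    points_3d = np.asarray(points_3d, dtype=float)
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    u = focal * x / z + cx  # horizontal pixel coordinate
    v = focal * y / z + cy  # vertical pixel coordinate
    return np.stack([u, v], axis=1)


# Two made-up joints, 2 units in front of the camera.
joints_3d = [[0.0, 0.0, 2.0], [1.0, -1.0, 2.0]]
labels_2d = project_keypoints(joints_3d, focal=100.0, cx=320.0, cy=240.0)
```

Under these assumptions, adding a new keypoint only requires reading one more 3D point from the simulated body, with no manual annotation.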
Figure: definition of our newly created keypoints
31

Conclusion
• We find that human pose is very effective for bridging the real and
synthetic domains for multi-person part segmentation.
• We introduce an effective framework that leverages information in
both real and synthetic images for multi-person part segmentation.
• Our method can be extended to generate labels for keypoints, such
as those on hands and feet, in real images without human labeling.
33

On-going work and future directions
• Reconstruct 3D human mesh from a single image
without ground truth training labels
34

• Labeling training data for 3D body shape is very expensive.
First stage:
Ask workers to label parts
Second stage:
Ask workers to label the corresponding
points on 3D human model
Sampled points: uniformly sampled points within the part
Güler et al., “DensePose: Dense Human Pose Estimation in the Wild,” CVPR 2018.
35

• We plan to explore different approaches to learn
human 3D body shape from graphics simulations.
36
