5. Which part corresponds to a body part?
• Local evidence is weak
• Part context is a strong cue
• Top-down cues are helpful
2016/8/11 5
6. Using Local Image Evidence
Multi-Class Classification of Patches
[Figure: Input Image → Image Features x_z at Image Location z → predictor g1 → part labels (e.g. hands, feet)]
Requires a high-capacity supervised predictor capable of handling multi-modal data
7. Using Local Image Evidence
A Classical Sliding Window Detection Pipeline
[Figure: Image → Feature Extraction → Classification]
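The pipeline above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the mean-intensity feature in `extract_feature` and the threshold rule in `classify` are hypothetical stand-ins for a real feature extractor and a trained classifier, and none of these names come from the slides.

```python
def extract_feature(image, row, col, size):
    """Hypothetical feature: mean intensity of the size x size window at (row, col)."""
    window = [image[r][c]
              for r in range(row, row + size)
              for c in range(col, col + size)]
    return sum(window) / len(window)


def classify(feature, threshold=0.5):
    """Hypothetical classifier: score 1.0 for bright windows, 0.0 otherwise."""
    return 1.0 if feature > threshold else 0.0


def sliding_window_detect(image, size=2):
    """Slide a window over the image and return a map of classifier scores."""
    rows = len(image) - size + 1
    cols = len(image[0]) - size + 1
    return [[classify(extract_feature(image, r, c, size)) for c in range(cols)]
            for r in range(rows)]


image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
scores = sliding_window_detect(image, size=2)
```

Only the window that fully covers the bright square scores 1.0; every other window sees at most half of it and stays at 0.0.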
8. Local Image Evidence is Weak
• Certain parts are easier to detect than others
[Figure panels: head, neck, l.shoulder, l.elbow, l.wrist]
9. Part Context is a Strong Cue
Part detection confidences provide spatial context cues
[Figure panels: Image, Neck, L-Shoulder, L-Elbow]
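One simple way to turn part confidences into a spatial cue is to record, for a given location, the offset to the peak of every part's confidence map. The sketch below assumes this particular feature; the slides do not specify how context features are built, and the function names and toy maps are hypothetical.

```python
def peak_location(conf_map):
    """Return (row, col) of the highest value in a 2D confidence map."""
    best = max((value, r, c)
               for r, row in enumerate(conf_map)
               for c, value in enumerate(row))
    return best[1], best[2]


def offset_context_features(conf_maps, row, col):
    """For each part (in sorted name order), the offset from (row, col) to its peak."""
    features = []
    for part in sorted(conf_maps):
        peak_r, peak_c = peak_location(conf_maps[part])
        features.extend([peak_r - row, peak_c - col])
    return features


conf_maps = {
    "neck":       [[0.1, 0.8, 0.1], [0.0, 0.1, 0.0], [0.0, 0.0, 0.0]],
    "l_shoulder": [[0.0, 0.1, 0.2], [0.0, 0.2, 0.9], [0.0, 0.0, 0.1]],
}
features = offset_context_features(conf_maps, row=2, col=2)
```

A classifier for, say, the elbow could consume these offsets alongside its image features, so that a confident shoulder detection narrows down where the elbow is likely to be.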
10. Tree Structures vs Loopy Graphs
Tree Structures
• Fast and exact inference
• Double counting
Loopy Graphs
• Rich context
• Approximate inference
2015/9/11 10
12. [Figure: three-stage pipeline. Stage I: Image Features → g1 → Stage I Confidence Maps. Stages II and III: Image Features plus Context Features from the previous stage's confidence maps → g2, g3 → Confidence Maps. Panels: Head, Neck, L-Shoulder, L-Elbow, L-Wrist]
15. Top Down Cues are Helpful
Larger Composite Parts can be Easier to Detect
[Figure: Level 1 parts, Level 2 poselet, Level 3 full body]
[Bourdev et al., CVPR2009][Sun et al., CVPR2012]
[Duan et al., BMVC 2012][Singh et al., ECCV2012]
[Pishchulin et al., CVPR2013] etc.
16. Incorporating Hierarchical Cues
[Figure: grid of predictors, one per level (Level 1, Level 2, ..., Level L) and per stage (Stage t = 1, Stage t = 2, ..., Stage t = T, with T = 3). Every predictor receives Image Features; predictors after Stage 1 also receive Context Features]
• Each level of the hierarchy uses a separate predictor
• Context features are computed on the outputs of the previous stage
• Spatial context information is passed across layers via context features
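The three points above can be sketched as a loop over stages in which every level has its own predictor and all levels share the context computed from the previous stage. This is a toy illustration only: scalars stand in for confidence maps, and `mean_context`, `make_predictor` and the weight 0.5 are hypothetical choices, not part of the slides.

```python
def run_hierarchical_stages(image_features, predictors, context_fn):
    """predictors[t][l] is the predictor for stage t and level l.

    Returns a list with one entry per stage, each holding one output per level.
    """
    previous = None  # outputs of the previous stage, one per level
    history = []
    for stage in predictors:
        context = None if previous is None else context_fn(previous)
        current = [g(image_features[level], context)
                   for level, g in enumerate(stage)]
        history.append(current)
        previous = current
    return history


def mean_context(previous_outputs):
    """Hypothetical context feature: mean of the previous stage's outputs over levels."""
    return sum(previous_outputs) / len(previous_outputs)


def make_predictor(weight):
    """Hypothetical predictor: image feature plus weighted context (if any)."""
    def g(feature, context):
        return feature if context is None else feature + weight * context
    return g


image_features = [1.0, 2.0]  # one toy scalar per level
predictors = [[make_predictor(0.5) for _ in range(2)] for _ in range(3)]
history = run_hierarchical_stages(image_features, predictors, mean_context)
```

Because the context is computed from the outputs of every level, information from a coarse level can influence a fine level at the next stage, and vice versa.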
18. Level 2 Confidence Maps
[Figure: Level 2 confidence maps at Stage I, Stage II and Stage III for Head+Sho, L.Arm, R.Arm, Torso, L.Leg, R.Leg and Bkgd., shown next to the predictor grid (Level 1 to Level L, Stage t = 1 to Stage t = T, with T = 3) with its Image Features and Context Features inputs]
19. Level 3 Confidence Maps
[Figure: Level 3 confidence maps at Stage I, Stage II and Stage III for Torso and Bkgd., shown next to the predictor grid (Level 1 to Level L, Stage t = 1 to Stage t = T, with T = 3) with its Image Features and Context Features inputs]
21. Pose Machines
Sequential Prediction with Spatial Context
[Figure: Image Features → g1 → Stage I Confidence Maps; Image Features + Context Features → g2 → Stage II Confidence Maps; Image Features + Context Features → g3 → Stage III Confidence Maps]
Training reduces to training multiple supervised classifiers
No structured loss function
No specialized solvers
No handcrafted spatial model
Spatial model is learned implicitly by the classifiers in a data-driven fashion
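To make "training reduces to training multiple supervised classifiers" concrete, here is a minimal sketch of stage-by-stage training. Everything in it is an assumption for illustration: `NearestMeanClassifier` is a toy stand-in for whatever supervised predictor is actually used, `neighbour_context` is a hypothetical context feature on a 1D row of locations, and the data are made up. It shows only the mechanics, not the accuracy of a real system.

```python
class NearestMeanClassifier:
    """Toy supervised classifier: predicts the class with the closest mean feature vector."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.means = {label: [s / counts[label] for s in acc]
                      for label, acc in sums.items()}
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.means, key=lambda label: dist(x, self.means[label]))
                for x in X]


def neighbour_context(previous, i):
    """Hypothetical context: previous-stage labels of the left and right neighbours."""
    left = previous[i - 1] if i > 0 else 0
    right = previous[i + 1] if i + 1 < len(previous) else 0
    return [float(left), float(right)]


def train_pose_machine(image_features, labels, num_stages=3):
    """Train one ordinary classifier per stage; later stages also see context features."""
    stages, previous = [], None
    for _ in range(num_stages):
        if previous is None:
            X = image_features
        else:
            X = [f + neighbour_context(previous, i)
                 for i, f in enumerate(image_features)]
        clf = NearestMeanClassifier().fit(X, labels)
        previous = clf.predict(X)
        stages.append(clf)
    return stages, previous


image_features = [[0.1], [0.9], [0.4], [0.8], [0.2]]  # made-up 1D features
labels = [0, 1, 1, 1, 0]
stages, predictions = train_pose_machine(image_features, labels)
```

Each stage is fitted with a plain `fit` call on a feature matrix and a label vector. The only coupling between stages is that the inputs of stage t include features computed from the predictions of stage t - 1, so no structured loss or specialized solver is involved.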
26. Learning
• Joint Training with Intermediate Supervision
Loss at stage t: squared Euclidean distance between groundtruth and prediction
$f_t = \lVert \text{groundtruth} - \text{prediction} \rVert_2^2$
A network without intermediate supervision suffers from vanishing gradients
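The loss can be illustrated with a short sketch. It assumes that the joint objective is simply the sum of the per-stage losses f_t, which the slide does not state explicitly; the function names and the toy 2 × 2 maps are hypothetical.

```python
def squared_l2(prediction, groundtruth):
    """Squared Euclidean distance between two 2D maps given as nested lists."""
    return sum((p - g) ** 2
               for p_row, g_row in zip(prediction, groundtruth)
               for p, g in zip(p_row, g_row))


def intermediate_supervision_loss(stage_predictions, groundtruth):
    """Sum the per-stage losses f_t so that every stage gets its own training signal."""
    per_stage = [squared_l2(pred, groundtruth) for pred in stage_predictions]
    return sum(per_stage), per_stage


groundtruth = [[0.0, 1.0], [0.0, 0.0]]
stage_predictions = [
    [[0.5, 0.5], [0.0, 0.0]],    # stage 1: ambiguous
    [[0.25, 0.75], [0.0, 0.0]],  # stage 2: sharper
    [[0.0, 1.0], [0.0, 0.0]],    # stage 3: matches the groundtruth
]
total, per_stage = intermediate_supervision_loss(stage_predictions, groundtruth)
```

Since every stage contributes its own term, early layers receive a gradient from the loss of their own stage rather than only from the final output.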
27. Learning
Intermediate Supervision Addresses Vanishing Gradients
[Figure: Histograms of Gradient Magnitude During Training, With Intermediate Supervision and Without Intermediate Supervision. Panels: Layer 1 (Input), Layers 3, 6, 7, 9, 12, 13, 15 and Layer 18 (Output), grouped into Stage 1, Stage 2 and Stage 3, for Epochs 1 to 3. x-axis: Gradient (× 10^-3), from −0.5 to 0.5; other axis: log scale, 10^0 to 10^4. Supervision is marked once per stage]
[Figure: network diagrams for Stage 1, Stage 2 and Stage 3 at level 1. Input Image: h × w × 3. Layers: convolutions (C) with 5×5, 9×9, 13×13, 15×15 and 1×1 kernels and 2× pooling (P). Each stage has an h′ × w′ output and its own Loss (f_1, f_2, f_3)]
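A toy calculation shows why attaching a loss to each stage helps. The sketch assumes a scalar chain of 18 layers, h_{i+1} = w · h_i with w = 0.5, and losses that are simply the activations at the supervised layers; the supervised layer indices 6, 12 and 18 are hypothetical, and the real network is of course not a scalar chain.

```python
def gradient_at_layer(layer, loss_layers, weight):
    """Gradient of the summed losses with respect to h_layer in the scalar chain
    h_{i+1} = weight * h_i, where each loss equals the activation at a loss layer."""
    return sum(weight ** (loss_layer - layer)
               for loss_layer in loss_layers
               if loss_layer >= layer)


weight = 0.5
without_is = gradient_at_layer(1, [18], weight)         # loss only at the output
with_is = gradient_at_layer(1, [6, 12, 18], weight)     # loss after every stage
ratio = with_is / without_is
```

With a single loss at the output, the gradient reaching layer 1 is 0.5^17. With a loss after every stage, the shortest path to layer 1 spans only five layers, so the gradient is dominated by 0.5^5 and is several thousand times larger.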
28. Learning
Intermediate Supervision Addresses Vanishing Gradients
[Figure: Histograms of Gradient Magnitude During Training, With Intermediate Supervision and Without Intermediate Supervision, for Layer 1 (Input) through Layer 18 (Output) across Stage 1, Stage 2 and Stage 3 and Epochs 1 to 3. x-axis: Gradient (× 10^-3), from −0.5 to 0.5; other axis: log scale, 10^0 to 10^4]
[Figure: network diagrams for Stage 1, Stage 2 and Stage 3 at level 1, each built from convolution (C) and 2× pooling (P) layers on an h × w × 3 Input Image, with a Loss at the end of each stage]
29. Learning
Comparison of Learning Methods
[Figure: PCK total, LSP OC. x-axis: Normalized distance (0 to 0.2); y-axis: Detection rate % (0 to 100). Curves: (i) With Intermediate Supervision (IS); (ii) Stagewise; (iii) IS + Stagewise Pretrain; (iv) Without Intermediate Supervision]
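The curves above plot detection rate against a normalized distance threshold. As a rough sketch of how such a PCK (Percentage of Correct Keypoints) curve can be computed: a keypoint counts as detected if its error is within a fraction of some reference length. The choice of reference length (here a fixed value of 50.0) and the toy coordinates are assumptions, and benchmark protocols differ in how they normalize.

```python
import math


def pck(predicted, groundtruth, reference_length, threshold):
    """Fraction of keypoints whose error is within threshold * reference_length."""
    hits = 0
    for (px, py), (gx, gy) in zip(predicted, groundtruth):
        if math.hypot(px - gx, py - gy) <= threshold * reference_length:
            hits += 1
    return hits / len(groundtruth)


groundtruth = [(10.0, 10.0), (20.0, 10.0), (30.0, 40.0), (50.0, 60.0)]
predicted = [(10.0, 11.0), (23.0, 13.0), (30.0, 40.0), (58.0, 60.0)]

# One point of the curve per normalized distance threshold.
curve = [(t, pck(predicted, groundtruth, reference_length=50.0, threshold=t))
         for t in (0.05, 0.1, 0.2)]
```

Multiplying each value by 100 gives the detection rate in percent, as on the plot's vertical axis.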
34. Ablative Spatial Analysis
[Figure: Predicted Pose and Level 1 Part Confidences at Stage I, Stage II and Stage III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank. and Bkgd.]
35. Ablative Spatial Analysis
Context from the confidence map of head is removed
Predicted confidences are resilient to missing context (of one part)
[Figure: Predicted Pose and Level 1 Part Confidences at Stage I, Stage II and Stage III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank. and Bkgd.]
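The slides do not say how a part's context is removed. One plausible way, assumed here purely for illustration, is to zero that part's confidence map before the next stage computes its context features. The function name and the toy maps are hypothetical.

```python
def ablate_part_context(conf_maps, part):
    """Return a copy of the confidence maps with one part's map set to zero."""
    ablated = {name: [row[:] for row in conf_map]
               for name, conf_map in conf_maps.items()}
    ablated[part] = [[0.0] * len(row) for row in conf_maps[part]]
    return ablated


conf_maps = {
    "head": [[0.9, 0.1], [0.0, 0.0]],
    "neck": [[0.1, 0.2], [0.7, 0.0]],
}
ablated = ablate_part_context(conf_maps, "head")
```

Feeding `ablated` instead of `conf_maps` into the next stage and comparing the resulting confidences would show how much each prediction depends on the removed part.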
36-48. Ablative Spatial Analysis
Predicted confidences are resilient to missing context (of one part)
[Figure, repeated on each of these slides: Predicted Pose and Level 1 Part Confidences at Stage I, Stage II and Stage III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank. and Bkgd.]