Pose Machines
Estimating Articulated Pose from Images
Robotics Institute
Carnegie Mellon University
Convolutional Pose Machines. Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.
Pose Machines: Articulated Pose Estimation via Inference Machines. Varun Ramakrishna, Daniel Munoz, Martial Hebert, J.A. Bagnell,
Yaser Sheikh. In ECCV 2014 (Oral presentation).
2016/8/11 1
Goal: Articulated Pose Estimation
2016/8/11 2
Goal: Articulated Pose Estimation
https://www.youtube.com/watch?v=Oi_ycvFHd64&index=6&list=PLNh5A7HtLRcpsMfvyG0DED-Dr4zW5Lpcg
2016/8/11 3
Goal: Articulated Pose Estimation
https://www.youtube.com/watch?v=MsZkLK0Wcmk&list=PLNh5A7HtLRcpsMfvyG0DED-Dr4zW5Lpcg&index=1
2016/8/11 4
Which part corresponds to a body part?
• Local evidence is weak
• Part context is a strong cue
• Top-down cues are helpful2016/8/11 5
Using Local Image Evidence
Multi-ClassClassificationof Patches
g1
Image
Features
1xz
ImageLocation z
Input Image
handsfeet
Requires a high-capacity
supervised predictor capable of
handling multi-modal data2016/8/11 6
Using Local Image Evidence
A ClassicalSlidingWindowDetectionPipeline
Image
Feature
Extraction
Classification
2016/8/11 7
Local Image Evidence is Weak
• Certain parts are easier to detect than others
head neck l.shoulder l.elbow l.wrist
2016/8/11 8
Part Context is a Strong Cue
Part detection confidences provide spatial context cues
L-ShoulderL-ElbowImage Neck
2016/8/11 9
Tree Structures vs Loopy Graphs
Tree Structures
• Fast and exact
inference
• Double counting
Loopy Graphs
• Rich context
• Approximate inference
2015/9/11 10
Designing Context Representations
Context features encode responses of aprevious prediction stage
Offset
Features
Patch
Features
Image
2016/8/11 L b11
Context
Features
g2 g3
Stage II Stage III
Confidence Maps Confidence Maps
g1
Context
Features
Stage I
Confidence Maps
Stage I Confidence
Image
Features
Head Neck L-Shoulder L-Elbow L-Wrist
2016/8/11 L b12
g2g1
Context
Features
g3
Image
Features
Context
Features
Stage I
Confidence Maps
Stage II
Confidence Maps
Stage III
Confidence Maps
Stage II Confidence
Head Neck L-Shoulder L-Elbow L-Wrist
2016/8/11 L b13
g2g1 g3
Context
Features
Context
Features
Stage I
Confidence Maps
Stage II
Confidence Maps
Stage III
Confidence Maps
Image
Features
Stage III Confidence
Head Neck L-Shoulder L-Elbow L-Wrist
2016/8/11 L b14
Level 1 parts Level 2 poselet Level 3 full body
[Bourdev et al., CVPR2009][Sun et al., CVPR2012]
[Duan et al., BMVC 2012][Singh et al., ECCV2012]
[Pishchulin et al., CVPR2013] etc.
Top Down Cues are Helpful
Larger Composite Parts can be Easier to detect
2016/8/11 15
2
gT
1
gT
Stage t = (T = 3)
Context
Features
Context
Context
Features
Image
Features
Features
Context
Features
Context
Features
Context
Features
Image
Features
Image
Features
Image
Features
2g1
L g1
Stage t = 1
1
g1
Level
1
Level
2
LevelL
Image
Features
Image
Features
Image
Features
L
g2
2
g2
1
g
Stage t = 2
Incorporating Hierarchical Cues
• Each level of the hierarchy uses a separate predictor
• Context features are computed on the outputs of the previous stage
• Spatial context information is passed across layers via context features
L
gT
2016/8/11 16
1g2
1g1
Level
1
1
gT
Image
Features
Image
Features
Image
Features
Context
Features
Context
Features
Level
2
2g1
L g1
L g2
2g2
Stage t = 1 Stage t = 2
Level I Confidence Maps
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
L gT
2gT
Stage t = (T = 3)
Context
Features
Context
Features
Context
Features
Context
Features
LevelL
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
StageIStageIIStageIII
2016/8/11 L b17
Stage t = 2
Level 2 Confidence Maps
StageIStageIIStageIII
Head+Sho L.Arm R.Arm Torso L.Leg Bkgd.R.Leg
1g2
1g1
Level
1
1
gT
Image
Features
Image
Features
Image
Features
Context
Features
Context
Features
Level
2
2g1
L g1
L g2
2g2
Stage t = 1
L gT
2gT
Stage t = (T = 3)
Context
Features
Context
Features
Context
Features
Context
Features
LevelL
2016/8/11 L b18
Stage t = 2
Level 3 Confidence Maps
Torso Bkgd.
Stage
I
Stage
II
Stage
III
1g2
1g1
Level
1
1
gT
Image
Features
Image
Features
Image
Features
Context
Features
Context
Features
Level
2
2g1
L g1
L g2
2g2
Stage t = 1
L gT
2gT
Stage t = (T = 3)
Context
Features
Context
Features
Context
Features
Context
Features
LevelL
2016/8/11 L b19
1g2
1g1
Level
1
1gT
Image
Features
Image
Features
Image
Features
Context
Features
Context
Features
Level
2
2g1
L g1
L g2
2g2
Stage t = 1 Stage t = 2
L gT
2gT
Stage t = (T = 3)
Context
Features
Context
Features
Context
Features
Context
Features
LevelL
Fully Connected Model
2016/8/11 L b20
Pose Machines
Sequential Prediction with Spatial Context
Training reduces to training multiple
supervised classifiers
g2g1 g3
Context
Features
Context
Features
Stage I
Confidence Maps
Stage II
Confidence Maps
Stage III
Confidence MapsImage
Features
Image
Features
Image
Features
No structured loss function
No specialized solvers
No handcrafted spatial model
Spatial model is learned implicitly by the
classifiers in a data-driven fashion
2016/8/11 21
Learning Feature Representations
• Convolutional Architectures for Feature Embedding
2016/8/11 22
Learning Context Representations
• Large Receptive Fields as a Design Criterion
2016/8/11 23
Learning Context Representations
• Large Receptive Fields Improve Pose Estimation
2016/8/11 24
Convolutional Pose Machines
• Designing a Convolutional Architecture
2016/8/11 25
Learning
• Joint Training with Intermediate Supervision
𝑓𝑡 = − 2
2 Loss: Euclidean distance
groundtruth prediction
Network without Intermediate Supervision leads vanishing gradients
2016/8/11 26
Input Stage 1
Layer 1 Layer 3 Layer 6
4
1 10
3
10
Epoch
10
2
1
10
0
10
Output
Layer 18
Stage 2
Layer 7 Layer 9 Layer 12 Layer 13
Stage 3
Layer 15
4
2 10
3
10
Epoch
10
2
1
10
0
10
4
3 10
3
10
Epoch
10
2
1
10
0
10
−0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5
Supervision Supervision
Histograms of Gradient Magnitude During Training
Supervision
Learning
Intermediate Supervision Addresses Vanishing Gradients
Gradient Magnitude
10
Gradient (× 10−3
) With Intermediate Supervision Without Intermediate Supervision
0
10
1
10
2
10
3
10
4
Input
Image
h w 3
5⇥5
C
5⇥5
C
2⇥ 5⇥5 9⇥9 1⇥1 1⇥1
P C C C C
9⇥9
C
9⇥9 13⇥13 13⇥13 15⇥15 1⇥1 1⇥1
C C C C C C
2⇥
P
5⇥5
C
5⇥5
C
5⇥5
C
2⇥
P
2⇥
P
Input
Image
h w 3
h0 w0
P1+1 P1+1
9⇥9
C
Loss
1
f 2
Loss
1f 1
x1
1
x1
2 9⇥9 13⇥13 13⇥13 15⇥15 1⇥1 1⇥1
C C C C C C
5⇥5 2⇥ 5⇥5 2⇥ 5⇥5
C P C P C
Input
Image
h w 3
h0 w0
P1+1
Loss
1f 3
x1
2
h0 w0
Stage 3, level 1Stage 2, level 1Stage 1, level 1
2016/8/11 27
Input
Layer 1
Output
Layer 18
10
0
10
1
10
2
10
3
10
4
Epoch1
Stage 1
Layer 3 Layer 6 Layer 7
Stage 2
Layer 9 Layer 12 Layer 13
Stage 3
Layer 15
10
0
10
1
10
2
10
3
10
4
Epoch2
−0.5 0.0 0.5
10
0
10
1
10
2
10
3
10
4
Epoch3
−0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5
Histograms of Gradient Magnitude During Training
Supervision Supervision Supervision
Input
Image
h w 3
5⇥5
C
5⇥5
C
2⇥ 5⇥5 9⇥9 1⇥1 1⇥1
P C C C C
9⇥9
C
9⇥9 13⇥13 13⇥13 15⇥15 1⇥1 1⇥1
C C C C C C
2⇥
P
5⇥5
C
5⇥5
C
5⇥5
C
2⇥
P
2⇥
P
Input
Image
h w 3
h0 w0
P1+1 P1+1
9⇥9
C
Loss
1
f 2
Loss
1f 1
x1
1
x1
2 9⇥9 13⇥13 13⇥13 15⇥15 1⇥1 1⇥1
C C C C C C
5⇥5 2⇥ 5⇥5 2⇥ 5⇥5
C P C P C
Input
Image
h w 3
h0 w0
P1+1
Loss
1f 3
x1
2
h0 w0
Gradient (× 10−3
) With Intermediate Supervision Without Intermediate Supervision
Stage 3, level 1Stage 2, level 1Stage 1, level 1
Learning
Intermediate Supervision Addresses Vanishing Gradients
2016/8/11 28
0
0
Detectionrate%
(i) With Intermediate Supervision (IS)
(ii) Stagewise
(iii) IS + Stagewise Pretrain
(iv) Without Intermediate Supervision
0.05 0.1 0.15 0.2
Normalized distance
100
90
80
70
60
50
40
30
20
10
PCK total, LSP OC
Learning
Comparison of Learning Methods
2016/8/11 29
Qualitative Results
2016/8/11 L b30
Evaluation
Qualitative Examples on LEEDS (Person-centric)
2016/8/11 L b31
Evaluation
Qualitative Examples on MPI (Person-centric)
2016/8/11 L b32
Resolving Symmetric Confusions
LeftRight
t = 1 t =
2
t =
3
Wrists
2016/8/11 L b33
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 34
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Stage
II
Stage
I
Stage
III
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Context from the confidence map of
head is removed
Ablative Spatial Analysis
2016/8/11 35
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 36
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 37
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
StageIIStageIStageIII
Predicted
Pose
Ablative Spatial Analysis
2016/8/11 38
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 39
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 40
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 41
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 42
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 43
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 44
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 45
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 46
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 47
Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd.
Predicted
Pose
Level 1 Part Confidences
L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank.
Predicted confidences are resilient to missing
context (of one part)
Stage
II
Stage
I
Stage
III
Ablative Spatial Analysis
2016/8/11 48
0 0.05 0.1 0.15
Normalized distance
0.2 0
0
100
90
80
70
60
50
40
30
20
10
Detectionrate%
Ours 3−Stage 2−Level
Tompson et al., CVPR’15
Tompson et al., NIPS’14
Chen&Yullie, NIPS’14
Toshev et al., CVPR’14
Sapp et al., CVPR’13
Evaluation
PCK Performance Comparison on FLIC dataset
PCK wrist, FLIC
0.05 0.1 0.15
Normalized distance
0.2
PCK elbow, FLIC
2016/8/11 49
0 0.05 0.1 0.15
Normalized distance
Ours 3−Stage 2−Level
0.2 0
0
100
90
80
70
60
50
40
30
20
10
PCK total, LSP PC
Detectionrate%
Tompson et al., NIPS’14 Pishchulin et al., ICCV’13 Chen&Yuille, NIPS’14 Wang et al., CVPR’13
0.05 0.1 0.15 0.2 0
Normalized distance
0.05 0.1 0.15 0.2 0
Normalized distance
PCK wrist&elbow, LSP PC
0.05 0.1 0.15 0.2 0
Normalized distance
PCK knee, LSP PC
0.05 0.1 0.15 0.2
PCK ankle, LSP PC
Normalized distance
PCK hip, LSP PC
Evaluation
PCK Performance Comparison on LEEDS dataset (Person-centric)
2016/8/11 50

Pose Machine

  • 1.
    Pose Machines Estimating ArticulatedPose from Images Robotics Institute Carnegie Mellon University Convolutional Pose Machines. Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. Pose Machines: Articulated Pose Estimation via Inference Machines. Varun Ramakrishna, Daniel Munoz, Martial Hebert, J.A. Bagnell, Yaser Sheikh. In ECCV 2014 (Oral presentation). 2016/8/11 1
  • 2.
    Goal: Articulated PoseEstimation 2016/8/11 2
  • 3.
    Goal: Articulated PoseEstimation https://www.youtube.com/watch?v=Oi_ycvFHd64&index=6&list=PLNh5A7HtLRcpsMfvyG0DED-Dr4zW5Lpcg 2016/8/11 3
  • 4.
    Goal: Articulated PoseEstimation https://www.youtube.com/watch?v=MsZkLK0Wcmk&list=PLNh5A7HtLRcpsMfvyG0DED-Dr4zW5Lpcg&index=1 2016/8/11 4
  • 5.
    Which part correspondsto a body part? • Local evidence is weak • Part context is a strong cue • Top-down cues are helpful2016/8/11 5
  • 6.
    Using Local ImageEvidence Multi-ClassClassificationof Patches g1 Image Features 1xz ImageLocation z Input Image handsfeet Requires a high-capacity supervised predictor capable of handling multi-modal data2016/8/11 6
  • 7.
    Using Local ImageEvidence A ClassicalSlidingWindowDetectionPipeline Image Feature Extraction Classification 2016/8/11 7
  • 8.
    Local Image Evidenceis Weak • Certain parts are easier to detect than others head neck l.shoulder l.elbow l.wrist 2016/8/11 8
  • 9.
    Part Context isa Strong Cue Part detection confidences provide spatial context cues L-ShoulderL-ElbowImage Neck 2016/8/11 9
  • 10.
    Tree Structures vsLoopy Graphs Tree Structures • Fast and exact inference • Double counting Loopy Graphs • Rich context • Approximate inference 2015/9/11 10
  • 11.
    Designing Context Representations Contextfeatures encode responses of aprevious prediction stage Offset Features Patch Features Image 2016/8/11 L b11
  • 12.
    Context Features g2 g3 Stage IIStage III Confidence Maps Confidence Maps g1 Context Features Stage I Confidence Maps Stage I Confidence Image Features Head Neck L-Shoulder L-Elbow L-Wrist 2016/8/11 L b12
  • 13.
    g2g1 Context Features g3 Image Features Context Features Stage I Confidence Maps StageII Confidence Maps Stage III Confidence Maps Stage II Confidence Head Neck L-Shoulder L-Elbow L-Wrist 2016/8/11 L b13
  • 14.
    g2g1 g3 Context Features Context Features Stage I ConfidenceMaps Stage II Confidence Maps Stage III Confidence Maps Image Features Stage III Confidence Head Neck L-Shoulder L-Elbow L-Wrist 2016/8/11 L b14
  • 15.
    Level 1 partsLevel 2 poselet Level 3 full body [Bourdev et al., CVPR2009][Sun et al., CVPR2012] [Duan et al., BMVC 2012][Singh et al., ECCV2012] [Pishchulin et al., CVPR2013] etc. Top Down Cues are Helpful Larger Composite Parts can be Easier to detect 2016/8/11 15
  • 16.
    2 gT 1 gT Stage t =(T = 3) Context Features Context Context Features Image Features Features Context Features Context Features Context Features Image Features Image Features Image Features 2g1 L g1 Stage t = 1 1 g1 Level 1 Level 2 LevelL Image Features Image Features Image Features L g2 2 g2 1 g Stage t = 2 Incorporating Hierarchical Cues • Each level of the hierarchy uses a separate predictor • Context features are computed on the outputs of the previous stage • Spatial context information is passed across layers via context features L gT 2016/8/11 16
  • 17.
    1g2 1g1 Level 1 1 gT Image Features Image Features Image Features Context Features Context Features Level 2 2g1 L g1 L g2 2g2 Staget = 1 Stage t = 2 Level I Confidence Maps L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. L gT 2gT Stage t = (T = 3) Context Features Context Features Context Features Context Features LevelL Head Neck L.Sho. L.Elb. R.Hip R.Knee R.Ank. Bkgd. StageIStageIIStageIII 2016/8/11 L b17
  • 18.
    Stage t =2 Level 2 Confidence Maps StageIStageIIStageIII Head+Sho L.Arm R.Arm Torso L.Leg Bkgd.R.Leg 1g2 1g1 Level 1 1 gT Image Features Image Features Image Features Context Features Context Features Level 2 2g1 L g1 L g2 2g2 Stage t = 1 L gT 2gT Stage t = (T = 3) Context Features Context Features Context Features Context Features LevelL 2016/8/11 L b18
  • 19.
    Stage t =2 Level 3 Confidence Maps Torso Bkgd. Stage I Stage II Stage III 1g2 1g1 Level 1 1 gT Image Features Image Features Image Features Context Features Context Features Level 2 2g1 L g1 L g2 2g2 Stage t = 1 L gT 2gT Stage t = (T = 3) Context Features Context Features Context Features Context Features LevelL 2016/8/11 L b19
  • 20.
    1g2 1g1 Level 1 1gT Image Features Image Features Image Features Context Features Context Features Level 2 2g1 L g1 L g2 2g2 Staget = 1 Stage t = 2 L gT 2gT Stage t = (T = 3) Context Features Context Features Context Features Context Features LevelL Fully Connected Model 2016/8/11 L b20
  • 21.
    Pose Machines Sequential Predictionwith Spatial Context Training reduces to training multiple supervised classifiers g2g1 g3 Context Features Context Features Stage I Confidence Maps Stage II Confidence Maps Stage III Confidence MapsImage Features Image Features Image Features No structured loss function No specialized solvers No handcrafted spatial model Spatial model is learned implicitly by the classifiers in a data-driven fashion 2016/8/11 21
  • 22.
    Learning Feature Representations •Convolutional Architectures for Feature Embedding 2016/8/11 22
  • 23.
    Learning Context Representations •Large Receptive Fields as a Design Criterion 2016/8/11 23
  • 24.
    Learning Context Representations •Large Receptive Fields Improve Pose Estimation 2016/8/11 24
  • 25.
    Convolutional Pose Machines •Designing a Convolutional Architecture 2016/8/11 25
  • 26.
    Learning • Joint Trainingwith Intermediate Supervision 𝑓𝑡 = − 2 2 Loss: Euclidean distance groundtruth prediction Network without Intermediate Supervision leads vanishing gradients 2016/8/11 26
  • 27.
    Input Stage 1 Layer1 Layer 3 Layer 6 4 1 10 3 10 Epoch 10 2 1 10 0 10 Output Layer 18 Stage 2 Layer 7 Layer 9 Layer 12 Layer 13 Stage 3 Layer 15 4 2 10 3 10 Epoch 10 2 1 10 0 10 4 3 10 3 10 Epoch 10 2 1 10 0 10 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 Supervision Supervision Histograms of Gradient Magnitude During Training Supervision Learning Intermediate Supervision Addresses Vanishing Gradients Gradient Magnitude 10 Gradient (× 10−3 ) With Intermediate Supervision Without Intermediate Supervision 0 10 1 10 2 10 3 10 4 Input Image h w 3 5⇥5 C 5⇥5 C 2⇥ 5⇥5 9⇥9 1⇥1 1⇥1 P C C C C 9⇥9 C 9⇥9 13⇥13 13⇥13 15⇥15 1⇥1 1⇥1 C C C C C C 2⇥ P 5⇥5 C 5⇥5 C 5⇥5 C 2⇥ P 2⇥ P Input Image h w 3 h0 w0 P1+1 P1+1 9⇥9 C Loss 1 f 2 Loss 1f 1 x1 1 x1 2 9⇥9 13⇥13 13⇥13 15⇥15 1⇥1 1⇥1 C C C C C C 5⇥5 2⇥ 5⇥5 2⇥ 5⇥5 C P C P C Input Image h w 3 h0 w0 P1+1 Loss 1f 3 x1 2 h0 w0 Stage 3, level 1Stage 2, level 1Stage 1, level 1 2016/8/11 27
  • 28.
    Input Layer 1 Output Layer 18 10 0 10 1 10 2 10 3 10 4 Epoch1 Stage1 Layer 3 Layer 6 Layer 7 Stage 2 Layer 9 Layer 12 Layer 13 Stage 3 Layer 15 10 0 10 1 10 2 10 3 10 4 Epoch2 −0.5 0.0 0.5 10 0 10 1 10 2 10 3 10 4 Epoch3 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 −0.5 0.0 0.5 Histograms of Gradient Magnitude During Training Supervision Supervision Supervision Input Image h w 3 5⇥5 C 5⇥5 C 2⇥ 5⇥5 9⇥9 1⇥1 1⇥1 P C C C C 9⇥9 C 9⇥9 13⇥13 13⇥13 15⇥15 1⇥1 1⇥1 C C C C C C 2⇥ P 5⇥5 C 5⇥5 C 5⇥5 C 2⇥ P 2⇥ P Input Image h w 3 h0 w0 P1+1 P1+1 9⇥9 C Loss 1 f 2 Loss 1f 1 x1 1 x1 2 9⇥9 13⇥13 13⇥13 15⇥15 1⇥1 1⇥1 C C C C C C 5⇥5 2⇥ 5⇥5 2⇥ 5⇥5 C P C P C Input Image h w 3 h0 w0 P1+1 Loss 1f 3 x1 2 h0 w0 Gradient (× 10−3 ) With Intermediate Supervision Without Intermediate Supervision Stage 3, level 1Stage 2, level 1Stage 1, level 1 Learning Intermediate Supervision Addresses Vanishing Gradients 2016/8/11 28
  • 29.
    0 0 Detectionrate% (i) With IntermediateSupervision (IS) (ii) Stagewise (iii) IS + Stagewise Pretrain (iv) Without Intermediate Supervision 0.05 0.1 0.15 0.2 Normalized distance 100 90 80 70 60 50 40 30 20 10 PCK total, LSP OC Learning Comparison of Learning Methods 2016/8/11 29
  • 30.
  • 31.
    Evaluation Qualitative Examples onLEEDS (Person-centric) 2016/8/11 L b31
  • 32.
    Evaluation Qualitative Examples onMPI (Person-centric) 2016/8/11 L b32
  • 33.
    Resolving Symmetric Confusions LeftRight t= 1 t = 2 t = 3 Wrists 2016/8/11 L b33
  • 34.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 34
  • 35.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Stage II Stage I Stage III Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Context from the confidence map of head is removed Ablative Spatial Analysis 2016/8/11 35
  • 36.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 36
  • 37.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 37
  • 38.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) StageIIStageIStageIII Predicted Pose Ablative Spatial Analysis 2016/8/11 38
  • 39.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 39
  • 40.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 40
  • 41.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 41
  • 42.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 42
  • 43.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 43
  • 44.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 44
  • 45.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 45
  • 46.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 46
  • 47.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 47
  • 48.
    Head Neck L.Sho.L.Elb. R.Hip R.Knee R.Ank. Bkgd. Predicted Pose Level 1 Part Confidences L.Wri. R.Sho. R.Elb. R.Wri. L.Hip L.Knee L.Ank. Predicted confidences are resilient to missing context (of one part) Stage II Stage I Stage III Ablative Spatial Analysis 2016/8/11 48
  • 49.
    0 0.05 0.10.15 Normalized distance 0.2 0 0 100 90 80 70 60 50 40 30 20 10 Detectionrate% Ours 3−Stage 2−Level Tompson et al., CVPR’15 Tompson et al., NIPS’14 Chen&Yullie, NIPS’14 Toshev et al., CVPR’14 Sapp et al., CVPR’13 Evaluation PCK Performance Comparison on FLIC dataset PCK wrist, FLIC 0.05 0.1 0.15 Normalized distance 0.2 PCK elbow, FLIC 2016/8/11 49
  • 50.
    0 0.05 0.10.15 Normalized distance Ours 3−Stage 2−Level 0.2 0 0 100 90 80 70 60 50 40 30 20 10 PCK total, LSP PC Detectionrate% Tompson et al., NIPS’14 Pishchulin et al., ICCV’13 Chen&Yuille, NIPS’14 Wang et al., CVPR’13 0.05 0.1 0.15 0.2 0 Normalized distance 0.05 0.1 0.15 0.2 0 Normalized distance PCK wrist&elbow, LSP PC 0.05 0.1 0.15 0.2 0 Normalized distance PCK knee, LSP PC 0.05 0.1 0.15 0.2 PCK ankle, LSP PC Normalized distance PCK hip, LSP PC Evaluation PCK Performance Comparison on LEEDS dataset (Person-centric) 2016/8/11 50