Pose Machine

Pose Machines
Estimating Articulated Pose from Images
Robotics Institute
Carnegie Mellon University
Convolutional Pose Machines. Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.
Pose Machines: Articulated Pose Estimation via Inference Machines. Varun Ramakrishna, Daniel Munoz, Martial Hebert, J.A. Bagnell,
Yaser Sheikh. In ECCV 2014 (Oral presentation).
2016/8/11 1

Goal: Articulated Pose Estimation

https://www.youtube.com/watch?v=Oi_ycvFHd64&index=6&list=PLNh5A7HtLRcpsMfvyG0DED-Dr4zW5Lpcg

https://www.youtube.com/watch?v=MsZkLK0Wcmk&list=PLNh5A7HtLRcpsMfvyG0DED-Dr4zW5Lpcg&index=1

Which image patch corresponds to a body part?
• Local evidence is weak
• Part context is a strong cue
• Top-down cues are helpful

Using Local Image Evidence
Multi-Class Classification of Patches
[Figure: input image → image features x_z at image location z → predictor g1 → part labels (e.g. hands, feet)]
Requires a high-capacity supervised predictor capable of handling multi-modal data

Using Local Image Evidence
A Classical Sliding Window Detection Pipeline
Image → Feature Extraction → Classification
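The pipeline above can be sketched in a few lines. This is a minimal illustration only, assuming raw pixel patches as the "features" and a random linear softmax classifier as the predictor; the function names, patch size, and weights are hypothetical and not taken from the slides.

```python
import numpy as np

def extract_patch_features(image, y, x, half):
    """Flatten the (2*half+1)^2 patch centred at (y, x); a stand-in for real features."""
    patch = image[y - half:y + half + 1, x - half:x + half + 1]
    return patch.ravel()

def sliding_window_confidences(image, weights, bias, half=2):
    """Score every interior location with a linear multi-class classifier.

    weights: (num_classes, feature_dim), bias: (num_classes,)
    Returns confidence maps of shape (num_classes, H, W); border pixels stay 0.
    """
    h, w = image.shape
    num_classes = weights.shape[0]
    maps = np.zeros((num_classes, h, w))
    for y in range(half, h - half):
        for x in range(half, w - half):
            feats = extract_patch_features(image, y, x, half)
            scores = weights @ feats + bias
            scores -= scores.max()  # subtract the max for numerical stability
            probs = np.exp(scores) / np.exp(scores).sum()
            maps[:, y, x] = probs
    return maps

# Toy usage with random data (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
image = rng.random((12, 12))
num_classes, half = 3, 2
feature_dim = (2 * half + 1) ** 2
weights = rng.normal(size=(num_classes, feature_dim))
bias = np.zeros(num_classes)
maps = sliding_window_confidences(image, weights, bias, half)
```

Each output channel is a per-class confidence map over image locations, which is the form of output the later slides build on.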

Local Image Evidence is Weak
• Certain parts are easier to detect than others
[Figure panels: head, neck, l.shoulder, l.elbow, l.wrist]

Part Context is a Strong Cue
Part detection confidences provide spatial context cues
[Figure panels: Image, Neck, L-Shoulder, L-Elbow]

Tree Structures vs Loopy Graphs
Tree Structures
• Fast and exact inference
• Double counting
Loopy Graphs
• Rich context
• Approximate inference

Designing Context Representations
Context features encode responses of a previous prediction stage
[Figure: patch features and offset features, illustrated on an image]
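One way to read the two context feature types named on this slide is sketched below. It is an assumption-laden illustration, not the authors' implementation: patch features are taken to be a window cut from every part's confidence map around a location, and offset features are taken to be the displacement from that location to each map's peak. The function names and window size are hypothetical.

```python
import numpy as np

def context_patch_features(conf_maps, y, x, half):
    """Concatenate a (2*half+1)^2 window from every part's confidence map around (y, x)."""
    # Zero-pad spatially so windows near the border keep a fixed size.
    padded = np.pad(conf_maps, ((0, 0), (half, half), (half, half)))
    window = padded[:, y:y + 2 * half + 1, x:x + 2 * half + 1]
    return window.ravel()

def context_offset_features(conf_maps, y, x):
    """Offset (dy, dx) from (y, x) to the peak of each part's confidence map."""
    num_parts, h, w = conf_maps.shape
    flat_peaks = conf_maps.reshape(num_parts, -1).argmax(axis=1)
    peak_y, peak_x = np.unravel_index(flat_peaks, (h, w))
    return np.stack([peak_y - y, peak_x - x], axis=1).ravel().astype(float)

# Toy confidence maps for two parts, each with a single peak.
conf_maps = np.zeros((2, 5, 5))
conf_maps[0, 1, 3] = 1.0
conf_maps[1, 4, 0] = 1.0
patch_feats = context_patch_features(conf_maps, 2, 2, half=1)
offset_feats = context_offset_features(conf_maps, 2, 2)
```

Both vectors would be concatenated with the image features at the same location before being passed to the next stage's predictor.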

[Figure: three-stage pipeline. Predictor g1 (Stage I) takes image features; predictors g2 and g3 (Stages II and III) take image features plus context features from the previous stage's confidence maps. Shown: Stage I confidence maps for Head, Neck, L-Shoulder, L-Elbow, L-Wrist.]

[Figure: three-stage pipeline (g1, g2, g3) with image features and context features between stages. Shown: Stage II confidence maps.]

[Figure: three-stage pipeline (g1, g2, g3) with image features and context features between stages. Shown: Stage III confidence maps.]

Top-Down Cues are Helpful
Larger composite parts can be easier to detect
Level 1: parts; Level 2: poselets; Level 3: full body
[Bourdev et al., CVPR 2009] [Sun et al., CVPR 2012] [Duan et al., BMVC 2012] [Singh et al., ECCV 2012] [Pishchulin et al., CVPR 2013] etc.

Incorporating Hierarchical Cues
• Each level of the hierarchy uses a separate predictor
• Context features are computed on the outputs of the previous stage
• Spatial context information is passed across layers via context features
[Figure: hierarchical pose machine with levels 1 to L and stages t = 1, 2, ..., T (T = 3). Each level at each stage has its own predictor g, which takes image features and, from stage 2 onward, context features from the previous stage.]

[Figure: hierarchical pose machine (levels 1 to L, stages t = 1 to T = 3) with Level 1 confidence maps at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

[Figure: hierarchical pose machine (levels 1 to L, stages t = 1 to T = 3) with Level 2 confidence maps at Stages I, II, and III for Head+Sho, L.Arm, R.Arm, Torso, L.Leg, R.Leg, and Bkgd.]

[Figure: hierarchical pose machine (levels 1 to L, stages t = 1 to T = 3) with Level 3 confidence maps at Stages I, II, and III for Torso and Bkgd.]

Fully Connected Model
[Figure: hierarchical pose machine with levels 1 to L and stages t = 1, 2, and T = 3; image features feed every predictor and context features connect each stage to the next.]

Pose Machines
Sequential Prediction with Spatial Context
Training reduces to training multiple supervised classifiers
• No structured loss function
• No specialized solvers
• No handcrafted spatial model
Spatial model is learned implicitly by the classifiers in a data-driven fashion
[Figure: image features feed predictors g1, g2, g3; Stage I, II, and III confidence maps are passed forward as context features.]
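The sequential prediction scheme on this slide can be sketched as a plain loop over stage predictors. This is a minimal sketch under stated assumptions: the predictors are hypothetical random linear softmax classifiers, and the context function simply passes the previous confidences through, standing in for real context features.

```python
import numpy as np

def run_pose_machine(image_features, predictors, context_fn):
    """Apply stage predictors g_1..g_T in sequence.

    image_features: (num_locations, d) array, one feature row per image location.
    predictors: list of callables; g_1 sees image features only, later stages see
                image features concatenated with context features.
    context_fn: maps the previous stage's confidences (num_locations, num_classes)
                to context features (num_locations, c).
    Returns the list of per-stage confidence arrays.
    """
    confidences = [predictors[0](image_features)]
    for g_t in predictors[1:]:
        context = context_fn(confidences[-1])
        stage_input = np.concatenate([image_features, context], axis=1)
        confidences.append(g_t(stage_input))
    return confidences

def make_linear_softmax(weights):
    """Hypothetical stand-in predictor: a linear layer followed by a softmax."""
    def predict(x):
        scores = x @ weights
        scores = scores - scores.max(axis=1, keepdims=True)
        exp = np.exp(scores)
        return exp / exp.sum(axis=1, keepdims=True)
    return predict

rng = np.random.default_rng(0)
num_locations, d, num_classes, num_stages = 50, 8, 4, 3
image_features = rng.normal(size=(num_locations, d))
predictors = [make_linear_softmax(rng.normal(size=(d, num_classes)))]
predictors += [make_linear_softmax(rng.normal(size=(d + num_classes, num_classes)))
               for _ in range(num_stages - 1)]
# Simplest possible context (an assumption): the previous confidences themselves.
stage_confidences = run_pose_machine(image_features, predictors, lambda b: b)
```

Because each stage is an ordinary classifier on a fixed-length input, each can be trained with any standard supervised learner, which is the point the slide makes about avoiding structured losses and specialized solvers.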

Learning Feature Representations
• Convolutional Architectures for Feature Embedding

Learning Context Representations
• Large Receptive Fields as a Design Criterion

Learning Context Representations
• Large Receptive Fields Improve Pose Estimation
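To make the receptive-field criterion concrete, the helper below computes how many input pixels one output unit can see for a stack of convolution and pooling layers. The layer list is an illustrative assumption (9×9 and 5×5 convolutions interleaved with 2× pooling, with pooling taken to be kernel 2, stride 2); it is not claimed to be the exact network from the talk.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: sequence of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # striding spreads later taps further apart
    return rf

# Assumed layer shorthand: (kernel_size, stride).
CONV9, CONV5, CONV1, POOL2 = (9, 1), (5, 1), (1, 1), (2, 2)
example_stack = [CONV9, POOL2, CONV9, POOL2, CONV9, POOL2, CONV5, CONV9, CONV1, CONV1]
print(receptive_field(example_stack))  # prints 160
```

Large kernels placed after pooling contribute the most, since every tap there already spans several input pixels.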

Convolutional Pose Machines
• Designing a Convolutional Architecture

Learning
• Joint Training with Intermediate Supervision
$f_t = \lVert \text{prediction}_t - \text{groundtruth} \rVert_2^2$
Loss: squared Euclidean distance between the stage-$t$ prediction and the ground truth
A network without intermediate supervision suffers from vanishing gradients
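A minimal sketch of the per-stage loss follows, assuming the overall training objective is the sum of the stage losses so that every stage receives its own supervision signal. The function names and the toy maps are hypothetical.

```python
import numpy as np

def stage_loss(predicted_maps, target_maps):
    """Squared Euclidean distance between predicted and ground-truth confidence maps."""
    diff = predicted_maps - target_maps
    return float(np.sum(diff ** 2))

def total_loss(per_stage_predictions, target_maps):
    """Sum of per-stage losses (assumed form of the joint objective)."""
    return sum(stage_loss(p, target_maps) for p in per_stage_predictions)

# Toy example: one ground-truth peak, three stages of improving predictions.
target = np.zeros((2, 4, 4))
target[0, 1, 1] = 1.0
predictions = [np.zeros_like(target), 0.5 * target, target.copy()]
losses = [stage_loss(p, target) for p in predictions]
joint = total_loss(predictions, target)
```

With a loss attached to every stage, gradients reach early layers through short paths rather than only through the full depth of the network.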

Learning
Intermediate Supervision Addresses Vanishing Gradients
[Figure: histograms of gradient magnitude during training, with and without intermediate supervision, for Layers 1, 3, 6, 7, 9, 12, 13, 15, and 18 (input to output) at epochs 1, 2, and 3; horizontal axis: gradient (× 10^-3), from −0.5 to 0.5. Below: the architecture of Stages 1 to 3 (level 1), built from convolution (C) layers with kernels from 1×1 to 15×15 and 2× pooling (P) layers, taking an h × w × 3 input image, with a loss f_t and supervision at each stage.]

Learning
Intermediate Supervision Addresses Vanishing Gradients
[Figure: histograms of gradient magnitude during training, with and without intermediate supervision, from the input (Layer 1) to the output (Layer 18) at epochs 1, 2, and 3; horizontal axis: gradient (× 10^-3), from −0.5 to 0.5. Supervision is applied at Stages 1, 2, and 3 (level 1) of the network.]

Learning
Comparison of Learning Methods
[Figure: PCK total, LSP OC. Detection rate (%) vs. normalized distance (0 to 0.2) for (i) With Intermediate Supervision (IS), (ii) Stagewise, (iii) IS + Stagewise Pretrain, (iv) Without Intermediate Supervision.]

Qualitative Results

Evaluation
Qualitative Examples on LEEDS (Person-centric)

Evaluation
Qualitative Examples on MPI (Person-centric)

Resolving Symmetric Confusions
[Figure: left and right wrist confidence maps at stages t = 1, t = 2, and t = 3]

Ablative Spatial Analysis
[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III]

Predicted confidences are resilient to missing context (of one part)
Context from the confidence map of head is removed
[Figure: predicted pose and confidences at Stages I, II, and III]

[Figures: 13 further slides, each showing the predicted pose and confidences at Stages I, II, and III]

Evaluation
PCK Performance Comparison on FLIC dataset
[Figure: two panels, PCK wrist (FLIC) and PCK elbow (FLIC). Detection rate (%) vs. normalized distance (0 to 0.2). Methods: Ours 3-Stage 2-Level; Tompson et al., CVPR'15; Tompson et al., NIPS'14; Chen & Yuille, NIPS'14; Toshev et al., CVPR'14; Sapp et al., CVPR'13.]
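The PCK curves plot detection rate against a normalized distance threshold. A minimal sketch of that metric is given below, assuming a per-sample reference length (for example a torso size) is supplied for normalization; the choice of reference length and the function name are assumptions, not taken from the slides.

```python
import numpy as np

def pck(predicted, ground_truth, reference_lengths, threshold):
    """Fraction of keypoints within threshold * reference length of the ground truth.

    predicted, ground_truth: (num_samples, num_keypoints, 2) pixel coordinates.
    reference_lengths: (num_samples,) per-sample normalizing length (assumed given).
    """
    distances = np.linalg.norm(predicted - ground_truth, axis=2)
    normalized = distances / reference_lengths[:, None]
    return float(np.mean(normalized <= threshold))

# Toy data: two samples, two keypoints each, ground truth at the origin.
ground_truth = np.zeros((2, 2, 2))
predicted = np.array([[[3.0, 4.0], [0.0, 0.0]],
                      [[0.0, 30.0], [3.0, 4.0]]])
reference_lengths = np.array([100.0, 100.0])
rate_at_010 = pck(predicted, ground_truth, reference_lengths, 0.10)
rate_at_004 = pck(predicted, ground_truth, reference_lengths, 0.04)
```

Sweeping the threshold from 0 to 0.2 and plotting the resulting rates gives a curve of the kind shown in the comparison figures.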

Evaluation
PCK Performance Comparison on LEEDS dataset (Person-centric)
[Figure: five panels, PCK total, PCK wrist & elbow, PCK knee, PCK ankle, and PCK hip (all LSP PC). Detection rate (%) vs. normalized distance (0 to 0.2). Methods: Ours 3-Stage 2-Level; Tompson et al., NIPS'14; Pishchulin et al., ICCV'13; Chen & Yuille, NIPS'14; Wang et al., CVPR'13.]
