ICCV2011: Human Action Recognition by Learning bases of action attributes and parts

Human Action Recognition
by Learning Bases of Action
Attributes and Parts
Bangpeng Yao, Xiaoye Jiang, Aditya Khosla,
Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei

Stanford University

1

Action Classification in Still Images
Low level feature
Riding bike

Yao & Fei-Fei, 2010
Koniusz et al., 2010
Delaitre et al., 2010
Yao et al., 2011

2

Low level feature High-level representation
Riding bike
- Semantic concepts – Attributes
  Riding a bike
  Sitting on a bike seat
  Wearing a helmet
  Pedaling the pedals
  …

Yao & Fei-Fei, 2010
Koniusz et al., 2010
Delaitre et al., 2010
Yao et al., 2011

3

Riding bike
- Objects

Riding a bike
Yao & Fei-Fei, 2010
Yao et al., 2011
Pedaling the pedals
…

4

Riding bike
- Objects
Parts
- Human poses

Riding a bike
Yao & Fei-Fei, 2010
Yao et al., 2011
Pedaling the pedals
…

5

Riding bike
- Objects
Parts
- Human poses
- Contexts of attributes & parts
Riding

Riding a bike
Yao & Fei-Fei, 2010
Yao et al., 2011
Pedaling the pedals
…

6

Riding bike
- Semantic concepts – Attributes
- Objects (Parts)
- Human poses (Parts)
- Contexts of attributes & parts
Image labels: wearing a helmet; sitting on bike seat; pedaling the pedal; riding a bike

Yao & Fei-Fei, 2010      Farhadi et al., 2009       Gupta et al., 2009        Yang et al., 2010
Koniusz et al., 2010     Lampert et al., 2009       Yao & Fei-Fei, 2010       Maji et al., 2011
Delaitre et al., 2010    Berg et al., 2010          Torresani et al., 2010    Liu et al., 2011
Yao et al., 2011         Parikh & Grauman, 2011     Li et al., 2010

• Incorporate human knowledge;
• More understanding of image content;
• More discriminative classifier.
7

Outline
• Intuition: Action Attributes and Parts
• Algorithm: Learning Bases of Attributes
and Parts
• Experiments: PASCAL VOC & Stanford
40 Actions
• Conclusion

8

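Outline
• Intuition: Action Attributes and Parts
• Algorithm: Learning Bases of Attributes and Parts
• Experiments: PASCAL VOC & Stanford 40 Actions
• Conclusion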

9

Action Attributes and Parts
Attributes: semantic descriptions of human actions

……

10

Attributes: semantic descriptions of human actions

Discriminative classifier, e.g. SVM
……

Riding bike
Not riding bike
Lampert et al., 2009
Berg et al., 2010
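
The attribute classifier on this slide can be illustrated with a minimal sketch of a linear SVM trained by stochastic subgradient descent on the hinge loss. The toy 2-D features, the labels, the hyperparameters, and the names `train_linear_svm` and `score` are assumptions made for illustration; the talk only says "discriminative classifier, e.g. SVM".

```python
import random

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Linear SVM (hinge loss + L2 penalty) via stochastic subgradient descent.

    X: list of feature lists; y: labels in {+1, -1}. Returns (weights, bias).
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Gradient step on the L2 penalty shrinks the weights slightly.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Hinge loss is active: push the boundary away from this sample.
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def score(w, b, x):
    """Signed attribute score; positive means the attribute is predicted present."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Hypothetical image features for one attribute ("riding bike" vs. "not riding bike").
X = [[2.0, 1.5], [1.5, 2.5], [2.5, 2.0], [-2.0, -1.5], [-1.5, -2.5], [-2.5, -2.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print(score(w, b, [2.0, 2.0]) > 0, score(w, b, [-2.0, -2.0]) < 0)
```

The signed score, rather than the hard decision, is the kind of value that could serve as one attribute entry of an image representation.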

11

Attributes:

A pre-trained detector
……
Parts-Objects:

……
Parts-Poselets:

……
Object Bank, Li et al., 2010
Poselet, Bourdev & Malik, 2009

12

Attributes:
a: Image feature vector
…… Attribute classification

Parts-Objects:
Object detection
……
Parts-Poselets:
Poselet detection

……

13
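
A minimal sketch of how the vector a on this slide could be assembled, assuming each attribute classifier, object detector, and poselet detector contributes a single confidence score per image. That one-score-per-detector layout, the function name `build_representation`, and the toy scores are assumptions, not details given in the talk.

```python
def build_representation(attribute_scores, object_scores, poselet_scores):
    """Concatenate per-detector confidence scores into one vector a."""
    return list(attribute_scores) + list(object_scores) + list(poselet_scores)

# Hypothetical scores for one image; real systems would have many more entries.
attribute_scores = [0.9, -0.2, 0.4]      # attribute classification outputs
object_scores = [1.3, 0.1]               # object detection outputs
poselet_scores = [0.0, 0.7, -0.5, 0.2]   # poselet detection outputs

a = build_representation(attribute_scores, object_scores, poselet_scores)
print(len(a))  # 9
```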

Action bases Φ
Attributes:

…… Attribute classification

Parts-Objects: …
Object detection
……
Parts-Poselets:
Poselet detection

……

14


……
Parts-Objects: …

……
Parts-Poselets:

……

15




……
Parts-Objects: …

……
Parts-Poselets:
a ≈ Φw
……

Bases coefficients w
17


……
Parts-Objects: …

……
Parts-Poselets:
a ≈ Φw
……
• Sparse
• Encodes context
• Robust to initially weak detections
Bases coefficients w
18
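
To make the approximation of a by Φw concrete, here is a small sketch with made-up numbers: a 4-dimensional a, three bases, and a coefficient vector w with one exact zero. The matrix, the vectors, and the helper name `reconstruct` are purely illustrative assumptions.

```python
def reconstruct(Phi, w):
    """Return the product Phi w, with Phi given as a list of rows."""
    return [sum(p * wj for p, wj in zip(row, w)) for row in Phi]

# Hypothetical bases: each column is one action basis over four attribute/part scores.
Phi = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
w = [0.8, 0.0, 0.4]          # sparse: the second basis is not used
a = [1.0, 0.25, 0.8, 0.4]    # observed attribute/part scores

a_hat = reconstruct(Phi, w)
residual = sum((ai - hi) ** 2 for ai, hi in zip(a, a_hat)) ** 0.5
print(a_hat, residual)
```

A small residual with few nonzero coefficients is the situation the slide describes: w is a compact description of a in terms of the bases.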

Outline
• Intuition: Action Attributes and Parts
• Algorithm: Learning Bases of Attributes and Parts
• Experiments: PASCAL VOC & Stanford 40 Actions
• Conclusion

19

Bases of Atr. & Parts: Training
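a ≈ Φw
• Input: $a_1, \ldots, a_N$
• Output: $\Phi = [\Phi_1, \ldots, \Phi_M]$ and sparse $W = [w_1, \ldots, w_N]$
• Jointly estimate $\Phi$ and $W$:

$$\min_{\Phi, W} \sum_{i=1}^{N} \left( \frac{1}{2} \| a_i - \Phi w_i \|_2^2 + \lambda \| w_i \|_1 \right)$$

  First term: accurate approximation. Second term: L1 regularization, sparsity of W.

$$\text{s.t. } \forall j,\ \| \Phi_j \|_1 + \frac{\gamma}{2} \| \Phi_j \|_2^2 \le 1$$

  Elastic net, sparsity of Φ [Zou & Hastie, 2005]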

• Optimization: stochastic gradient descent.
20
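
One way to read the stochastic gradient step named on this slide, written out as a sketch. The step size $\eta$, the residual $r_i$, and the projection $\Pi_{\mathcal{C}}$ onto the set of bases satisfying the slide's constraint are notation introduced here for illustration; the talk does not spell out the update.

```latex
% Assumed notation (not from the talk): \eta is a step size, r_i a residual,
% and \Pi_{\mathcal{C}} a projection onto the bases that satisfy the constraint.
% For one sampled a_i, with its coefficients w_i held fixed:
\begin{align}
r_i &= a_i - \Phi w_i \\
\nabla_{\Phi}\, \tfrac{1}{2} \| a_i - \Phi w_i \|_2^2 &= -\, r_i w_i^{\top} \\
\Phi &\leftarrow \Pi_{\mathcal{C}}\!\left( \Phi + \eta\, r_i w_i^{\top} \right)
\end{align}
```

Alternating this step with a re-estimate of $w_i$ for the sampled $a_i$ gives a joint estimate of the bases and the coefficients.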

Bases of Atr. & Parts: Testing
a ≈ Φw
• Input: $a$, $\Phi = [\Phi_1, \ldots, \Phi_M]$
• Output: sparse $w$
• Estimate $w$:

$$\min_{w} \frac{1}{2} \| a - \Phi w \|_2^2 + \lambda \| w \|_1$$

  First term: accurate approximation. Second term: L1 regularization, sparsity of w.

• Optimization: stochastic gradient descent.

21
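
The test-time problem, estimating a sparse w for fixed bases, can be sketched as follows. Note that this sketch uses ISTA (iterative soft-thresholding), a standard solver for L1-regularized least squares, rather than the stochastic gradient descent named on the slide. The names `soft_threshold` and `sparse_code`, the regularization weight 0.2, and the toy bases are assumptions.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t (the proximal operator of t * |x|)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def sparse_code(a, Phi, lam, steps=200):
    """Approximately solve min_w 0.5 * ||a - Phi w||_2^2 + lam * ||w||_1 with ISTA.

    a: list of K floats; Phi: K x M matrix as a list of rows; returns M coefficients.
    """
    K, M = len(Phi), len(Phi[0])
    # The squared Frobenius norm upper-bounds the Lipschitz constant of the
    # gradient, so 1 / L is a safe (if conservative) step size.
    L = sum(p * p for row in Phi for p in row) or 1.0
    w = [0.0] * M
    for _ in range(steps):
        residual = [sum(Phi[k][j] * w[j] for j in range(M)) - a[k] for k in range(K)]
        grad = [sum(Phi[k][j] * residual[k] for k in range(K)) for j in range(M)]
        w = [soft_threshold(w[j] - grad[j] / L, lam / L) for j in range(M)]
    return w

# Toy check with orthonormal bases, where the solution is known in closed
# form: each coefficient is the soft-thresholded entry of a.
Phi = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a = [1.0, 0.1, -0.5]
w = sparse_code(a, Phi, lam=0.2)
print(w)
```

In this toy case the small entry of a (0.1, below the threshold 0.2) is driven exactly to zero, which is the sparsity effect of the L1 term.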

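Outline
• Intuition: Action Attributes and Parts
• Algorithm: Learning Bases of Attributes and Parts
• Experiments: PASCAL VOC & Stanford 40 Actions
• Conclusion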

22

PASCAL VOC 2010 Action Dataset
• 9 classes, 50-100 trainval / testing images per class

Figure credit: Ivan Laptev

• 14 attributes – trained from the trainval images;
27 objects – taken from Li et al, NIPS 2010;
150 poselets – taken from Bourdev & Malik, ICCV 2009.
23

VOC 2010: Classification Result
[Figure: bar chart of average precision (y-axis, 0.3 to 0.9) for the nine action classes: 1 Phoning, 2 Playing instrument, 3 Reading, 4 Riding bike, 5 Riding horse, 6 Running, 7 Taking photo, 8 Using computer, 9 Walking. Legend: SURREY_MK; UCLEAR_DOSP; Poselet, Maji et al, 2011; Our method, use “a”.]
24
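
The charts in this part of the talk report average precision per class. As a rough sketch of what that number measures, the snippet below computes a non-interpolated average precision from ranked classifier scores. The exact PASCAL VOC evaluation protocol may differ (for example in how precision is interpolated), and the function name and toy scores are assumptions.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of the precision values at each true positive.

    scores: classifier confidences; labels: 1 for positive images, 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label == 1)
    if n_pos == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` images
    return total / n_pos

# Hypothetical scores for four test images of one action class.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(round(ap, 4))  # 0.8333
```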

VOC 2010: Classification Result
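[Figure: bar chart of average precision (y-axis, 0.3 to 0.9) for the nine action classes: 1 Phoning, 2 Playing instrument, 3 Reading, 4 Riding bike, 5 Riding horse, 6 Running, 7 Taking photo, 8 Using computer, 9 Walking. Legend: SURREY_MK; UCLEAR_DOSP; Our method, use “w”.]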
25

VOC 2010: Analysis of Bases
[Figure: bar chart of average precision (y-axis, 0.3 to 0.9) for action classes 1 to 9, with legend entries SURREY_MK and UCLEAR_DOSP, next to a diagram of 400 action bases Φ defined over attributes, objects, and poselets.]
26

[Figure: bar chart of average precision (y-axis, 0.3 to 0.9) for action classes 1 to 9, with legend entries SURREY_MK and UCLEAR_DOSP, next to a diagram of 400 action bases Φ defined over attributes, objects, and poselets.]
27

[Figure: bar chart of average precision (y-axis, 0.3 to 0.9) for action classes 1 to 9, with legend entries SURREY_MK and UCLEAR_DOSP, next to a diagram of 400 action bases Φ defined over attributes, objects, and poselets.]
28

VOC 2010: Control Experiment

[Figure: bar chart of mean average precision (y-axis, 0.45 to 0.7) comparing “Use a” with “Use w” for the feature combinations A+O+P, A+O, A+P, and O+P. A: attribute; O: object; P: poselet.]
29

PASCAL VOC 2011 Result
• Our method ranks first in nine out of ten classes in comp10.

                     Others’ best   Others’ best   Our
                     in comp9       in comp10      method
Jumping              71.6           59.5           66.7
Phoning              50.7           31.3           41.1
Playing instrument   77.5           45.6           60.8
Reading              37.8           27.8           42.2
Riding bike          88.8           84.4           90.5
Riding horse         90.2           88.3           92.2
Running              87.9           77.6           86.2
Taking photo         25.7           31.0           28.8
Using computer       58.9           47.4           63.5
Walking              59.5           57.6           64.2

30

PASCAL VOC 2011 Result
• Our method achieves the best performance in five out of ten classes if we consider both comp9 and comp10.

                     Others’ best   Others’ best   Our
                     in comp9       in comp10      method
Jumping              71.6           59.5           66.7
Phoning              50.7           31.3           41.1
Playing instrument   77.5           45.6           60.8
Reading              37.8           27.8           42.2
Riding bike          88.8           84.4           90.5
Riding horse         90.2           88.3           92.2
Running              87.9           77.6           86.2
Taking photo         25.7           31.0           28.8
Using computer       58.9           47.4           63.5
Walking              59.5           57.6           64.2

31
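
The two claims made about the PASCAL VOC 2011 table (first in nine of ten classes within comp10, best in five of ten across comp9 and comp10) can be checked directly from the numbers. The snippet below is a small sanity check; the values are copied from the table and the variable names are illustrative.

```python
# (others' best in comp9, others' best in comp10, our method), copied from the table.
results = {
    "Jumping": (71.6, 59.5, 66.7),
    "Phoning": (50.7, 31.3, 41.1),
    "Playing instrument": (77.5, 45.6, 60.8),
    "Reading": (37.8, 27.8, 42.2),
    "Riding bike": (88.8, 84.4, 90.5),
    "Riding horse": (90.2, 88.3, 92.2),
    "Running": (87.9, 77.6, 86.2),
    "Taking photo": (25.7, 31.0, 28.8),
    "Using computer": (58.9, 47.4, 63.5),
    "Walking": (59.5, 57.6, 64.2),
}

# Classes where our method beats the best other comp10 entry.
beats_comp10 = [c for c, (comp9, comp10, ours) in results.items() if ours > comp10]
# Classes where our method beats the best other entry in either competition.
beats_both = [c for c, (comp9, comp10, ours) in results.items()
              if ours > max(comp9, comp10)]

print(len(beats_comp10), len(beats_both))  # 9 5
```

The single comp10 loss is Taking photo; the five overall wins are Reading, Riding bike, Riding horse, Using computer, and Walking.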

Stanford 40 Actions
• 40 action classes, 9532 real-world images from Google, Flickr, etc.
Applauding, Blowing bubbles, Brushing teeth, Calling, Cleaning floor, Climbing wall, Cooking, Cutting trees,
Cutting vegetables, Drinking, Feeding horse, Fishing, Fixing bike, Gardening, Holding umbrella, Jumping,
Playing guitar, Playing violin, Pouring liquid, Pushing cart, Reading, Repairing car, Riding bike, Riding horse,
Rowing, Running, Shooting arrow, Smoking cigarette, Taking photo, Texting message, Throwing frisbee, Using computer,
Using microscope, Using telescope, Walking dog, Washing dishes, Watching television, Waving hands, Writing on board, Writing on paper

http://vision.stanford.edu/Datasets/40actions.html
32

Stanford 40 Actions
Fixing bike
Riding bike

Stanford 40 Actions
Writing on board
Writing on paper

Stanford 40 Actions
Drinking
Gardening
Smoking cigarette

Stanford 40 Actions: Result
• We use 45 attributes, 81 objects, and 150 poselets.
• Compare our method with the Locality-constrained Linear Coding (LLC, Wang et al, CVPR 2010) baseline.
[Figure: bar chart of average precision (y-axis, 0 to 0.9) for each of the 40 action classes (x-axis). Legend: LLC; Our Method.]
36

Stanford 40 Actions: Result
[Figure: bar chart of average precision (y-axis, 0 to 0.9) for each of the 40 action classes (x-axis). Legend: LLC; Our Method.]
37

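Outline
• Intuition: Action Attributes and Parts
• Algorithm: Learning Bases of Attributes and Parts
• Experiments: PASCAL VOC & Stanford 40 Actions
• Conclusion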

38

Conclusion

……
Parts-Objects: …

……
Parts-Poselets:
a ≈ Φw
……

Bases coefficients w
39
