Action localization using super-voxel
segmentation
Iv Sjsn
Affiliation not available
June 1, 2015
1 Video Segmentation
We partition the video volume into a set $C_N$ of non-overlapping regions using GBH
segmentation. The segmentation is based on appearance and motion similarity
between local regions. Each segment $c_i \in C_N$ is an arbitrarily shaped and
sized cloud of points $x_i = \{x_i^0, x_i^1, \ldots, x_i^P\}$ in the video volume space $\mathbb{R}^3$.
The practical challenge is to represent a segment $c_i$ efficiently without
compromising on memory or accuracy, because it is difficult to fit a regular
structure such as a 3D bounding box or an ellipsoid to an arbitrarily shaped region.
Our solution is to divide the video into regular cells of size $m \times m \times m$ and to
build the representation on this structure. This reduces the memory load by a
factor of up to $m^3$. Such cells also act as building blocks from which 3D regions
of arbitrary shape and size can be constructed.
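The cell quantization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `voxels_to_cells` and the `(x, y, t)` tuple layout are assumptions made for the example, and the full $m^3$ reduction is only reached when every cell is completely occupied.

```python
def voxels_to_cells(points, m):
    """Map (x, y, t) voxel coordinates to the set of m x m x m cells they fall in.

    Hypothetical helper: integer division by m gives the cell coordinate.
    """
    return {(x // m, y // m, t // m) for (x, y, t) in points}


# Toy segment: a solid 8 x 8 x 8 block of voxels.
segment = [(x, y, t) for x in range(8) for y in range(8) for t in range(8)]
cells = voxels_to_cells(segment, m=4)

# 512 voxels collapse to 8 cells, i.e. a reduction by m**3 = 64.
print(len(segment), len(cells))
```

For sparse or thin segments the number of occupied cells shrinks by less than $m^3$, so the factor is best read as an upper bound.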
2 Vertex
A region $c_i$ constitutes a vertex $v_i$ in the video graph $G(V, E)$, so the
cardinality $|V|$ equals the number of segmented regions, $|C|$. The BoVW histogram
$h_i$ of the local features $f_j \in c_i$ represents a unary potential for vertex $v_i$.
2.1 Local feature histogram
Each node $v_i$ is associated with two histogram components, a foreground
histogram $h_i^{fg}$ and a background histogram $h_i^{bg}$:
• $h_i^{fg} = \sum_{j \in c_i} bow(f_j)$: the frequency of quantized local features $f_j$ extracted
inside the region $c_i$.
• $h_i^{bg} = \sum_{j \notin c_i} bow(f_j)$: the frequency of quantized local features $f_j$ extracted
outside the region $c_i$.
Hence, the histogram representation of node $v_i$ can be defined as:
$$h_i = h_i^{fg} + \alpha_{bg}\, h_i^{bg} \tag{1}$$
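As an illustration of Eq. (1), the sketch below builds the foreground and background bag-of-words histograms from already quantized features and combines them with the weight $\alpha_{bg}$. The helper names, the `inside` mask, and the toy vocabulary of four words are assumptions made for the example; the source does not specify how features are stored.

```python
def bow_histogram(word_ids, vocab_size):
    """Count how often each visual word occurs."""
    hist = [0.0] * vocab_size
    for w in word_ids:
        hist[w] += 1.0
    return hist


def node_histogram(word_ids, inside, vocab_size, alpha_bg):
    """h_i = h_fg + alpha_bg * h_bg, with `inside` marking features inside region c_i."""
    fg = bow_histogram([w for w, ins in zip(word_ids, inside) if ins], vocab_size)
    bg = bow_histogram([w for w, ins in zip(word_ids, inside) if not ins], vocab_size)
    return [f + alpha_bg * b for f, b in zip(fg, bg)]


# Six quantized features; the first three lie inside the region.
word_ids = [0, 2, 2, 1, 3, 3]
inside = [True, True, True, False, False, False]
h = node_histogram(word_ids, inside, vocab_size=4, alpha_bg=0.5)
print(h)  # [1.0, 0.5, 2.0, 1.0]
```

Setting `alpha_bg=0` recovers the purely foreground histogram, which corresponds to the first column of the evaluation tables.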
2.1.1 Performance Evaluation
2.1.2 UCF-Sports Dataset
mAP for different kernel functions versus the background histogram weight $\alpha_{bg}$:

| Kernel type    | $\alpha_{bg} = 0$ | $\alpha_{bg} = 0.5$ | $\alpha_{bg} = 0.75$ |
|----------------|---------|---------|---------|
| Linear         | 31.91 % | 56.69 % | 65.29 % |
| Intersection   | 35.65 % | 58.98 % | 60.89 % |
| Chi-square     | 39.03 % | 63.08 % | 65.96 % |
| Jensen-Shannon | 39.50 % | 63.53 % | 66.38 % |
Runtime (in minutes) for different kernel functions versus the background
histogram weight $\alpha_{bg}$:

| Kernel type    | $\alpha_{bg} = 0$ | $\alpha_{bg} = 0.5$ | $\alpha_{bg} = 0.75$ |
|----------------|-----|------|------|
| Linear         | 2.5 | 22.3 | 15.1 |
| Intersection   | 3.3 | 20.3 | 16.8 |
| Chi-square     | 2.4 | 14.8 | 11.2 |
| Jensen-Shannon | 3.0 | 78.7 | 58.8 |
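For reference, the four kernels compared above can be written for non-negative histograms as in the sketch below. The source does not state which variants were used, so the additive forms of the chi-square and Jensen-Shannon kernels are an assumption here, as are the function names.

```python
import math


def linear(x, y):
    return sum(a * b for a, b in zip(x, y))


def intersection(x, y):
    return sum(min(a, b) for a, b in zip(x, y))


def chi_square(x, y):
    # Additive chi-square kernel: sum of 2ab / (a + b); empty bins contribute 0.
    return sum(2.0 * a * b / (a + b) for a, b in zip(x, y) if a + b > 0)


def jensen_shannon(x, y):
    # Additive Jensen-Shannon kernel; terms with a zero bin are skipped.
    total = 0.0
    for a, b in zip(x, y):
        if a > 0:
            total += 0.5 * a * math.log2((a + b) / a)
        if b > 0:
            total += 0.5 * b * math.log2((a + b) / b)
    return total


# Two L1-normalized toy histograms that overlap in one bin.
x = [0.5, 0.5, 0.0]
y = [0.0, 0.5, 0.5]
for kernel in (linear, intersection, chi_square, jensen_shannon):
    print(kernel.__name__, kernel(x, y))
```

With L1-normalized inputs, the intersection, chi-square, and Jensen-Shannon kernels all evaluate to 1 for identical histograms and to 0 for histograms with disjoint support.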
3 Edge
The edge set $E$ governs the relationships between the segmented regions $C = \{c_0, c_1, \ldots, c_N\}$,
i.e., the vertices $V$ of the video graph $G(V, E)$. In essence, $e_{ij} \in E$ should reflect the
likelihood that vertices $v_i$ and $v_j$ belong to the same action category, i.e., $P_{ij}(l_i = l_j)$,
where $l_i, l_j \in L$.
3.1 Video Graph Data Structure
A video sequence is partitioned into a set of non-overlapping supervoxel regions
$S = \{s_0, s_1, \ldots, s_N\}$. The supervoxels are mapped into a graph: $f_{map} : S \to G(V, E)$.
Each node $v_i \in V$ has the following attributes:
• $size_i$: the total number of cells making up the region $s_i$, in $\mathbb{R}$.
• $min_i$: the minimum corner of the 3D bounding box in cell coordinates, in $\mathbb{R}^3$.
• $max_i$: the maximum corner of the 3D bounding box in cell coordinates, in $\mathbb{R}^3$.
• $mean_i$: the mean cell location, in $\mathbb{R}^3$.
• $hist_i$: the BoW histogram vector over the local descriptors $f_i \in s_i$, in $\mathbb{R}^{4k \times 5}$
(5 different descriptors are used, each with a dictionary size of 4k).
• $sparsity_i$: the number of non-zero bins in the BoW histogram $hist_i$.
• $hist_i^c$: the concatenated color histogram vector of region $s_i$ over 3 channels (RGB), in $\mathbb{R}^{10 \times 3}$.
• $hist_i^g$: the orientation histogram vector of region $s_i$ with 50 directional bins, in $\mathbb{R}^{50}$.
• $l_i$: the action label of region $s_i$, $l_i \in L$.
Each edge $e_{ij} \in E$ between nodes $u_i$ and $u_j$ has the following attribute:
• $intsize_{ij}$: the number of mutually adjacent cells between supervoxels $s_i$
and $s_j$, stored in $\mathbb{R}^{N \times N}$ (a 27-neighborhood is used to compute this attribute).
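A minimal sketch of how the geometric node attributes and the edge attribute could be computed from a cell-level labeling is given below. It assumes the labeling is a dictionary from cell coordinates to integer supervoxel ids, and it interprets "mutually adjacent cells" as the number of adjacent cell pairs that straddle two supervoxels; both choices, and the name `build_graph`, are assumptions rather than details confirmed by the source. The histogram attributes are omitted.

```python
from itertools import product


def build_graph(cell_labels):
    """cell_labels: dict mapping (x, y, t) cell coordinates to an integer supervoxel id."""
    nodes = {}
    for cell, s in cell_labels.items():
        node = nodes.setdefault(
            s, {"size": 0, "min": list(cell), "max": list(cell), "sum": [0, 0, 0]}
        )
        node["size"] += 1
        for k in range(3):
            node["min"][k] = min(node["min"][k], cell[k])
            node["max"][k] = max(node["max"][k], cell[k])
            node["sum"][k] += cell[k]
    for node in nodes.values():
        node["mean"] = [v / node["size"] for v in node.pop("sum")]

    # 27-neighborhood: the cell itself plus its 26 neighbours (self is skipped).
    offsets = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    intsize = {}
    for (x, y, t), s in cell_labels.items():
        for dx, dy, dt in offsets:
            r = cell_labels.get((x + dx, y + dy, t + dt))
            if r is not None and r > s:  # count each cross-label cell pair once
                intsize[(s, r)] = intsize.get((s, r), 0) + 1
    return nodes, intsize


# Toy labeling: a 4 x 2 x 1 slab split into two supervoxels at x = 2.
labels = {(x, y, 0): (0 if x < 2 else 1) for x in range(4) for y in range(2)}
nodes, intsize = build_graph(labels)
print(nodes[0]["size"], nodes[0]["mean"], intsize)
```

A sparse dictionary keyed by supervoxel pairs is used here instead of a dense $N \times N$ matrix, since most supervoxel pairs are not adjacent.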
4 Training
The SVM classifier is trained at the node level, where each training instance is
node $u_i$'s histogram vector $hist_i \in \mathbb{R}^{20k}$ and its corresponding label $action_i$.
A chi-square kernel is used with a one-vs-rest training strategy. The trained
classifier returns $\Pr(l \mid hist_j)$, the label likelihood of a given test node $u_j$.
5 MRF-based Energy Minimization
The localization problem is formulated as an energy minimization problem:
$$E(X) = \sum_{u_i \in X} E_1(u_i) + \lambda \sum_{(u_i, u_j) \in X} E_2(u_i, u_j) \tag{2}$$
5.1 Likelihood (E1) and Prior (E2)
The likelihood, $E_1$, is defined as:
$$E_1(u_i) = 1 - e^{\frac{1}{\sigma_d^2} \Pr(l_i \mid u_i)} \tag{3}$$
The prior energy, $E_2$, simply measures the label agreement between adjacent
nodes and is defined as:
$$E_2(u_i, u_j) = \begin{cases} 0, & \text{if } l_i = l_j \\ 1 - e^{\frac{1}{\sigma_s^2} D(u_i, u_j)}, & \text{otherwise} \end{cases} \tag{4}$$
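The energy of a given labeling can be evaluated as in the sketch below. It follows the exponents as they appear in Eqs. (3) and (4), without assuming a sign that is not printed, and treats the distance $D(u_i, u_j)$ as a precomputed input because it is not defined in this section. The function names and the data layout (lists of labels, per-node label probabilities, an edge list, and a distance dictionary) are assumptions made for the example.

```python
import math


def unary(prob, sigma_d):
    """E1 for one node, given Pr(l_i | u_i) of its assigned label."""
    return 1.0 - math.exp(prob / sigma_d ** 2)


def pairwise(label_i, label_j, dist, sigma_s):
    """E2 for one edge: zero when the labels agree, distance-dependent otherwise."""
    if label_i == label_j:
        return 0.0
    return 1.0 - math.exp(dist / sigma_s ** 2)


def total_energy(labels, probs, edges, dists, lam, sigma_d, sigma_s):
    """E(X) = sum of E1 over nodes + lam * sum of E2 over edges."""
    e1 = sum(unary(probs[i][labels[i]], sigma_d) for i in range(len(labels)))
    e2 = sum(
        pairwise(labels[i], labels[j], dists[(i, j)], sigma_s) for (i, j) in edges
    )
    return e1 + lam * e2


# Two nodes, two labels, one edge with a hypothetical distance of 1.0.
probs = [[1.0, 0.0], [0.5, 0.5]]
edges = [(0, 1)]
dists = {(0, 1): 1.0}
same = total_energy([0, 0], probs, edges, dists, lam=2.0, sigma_d=1.0, sigma_s=1.0)
diff = total_energy([0, 1], probs, edges, dists, lam=2.0, sigma_d=1.0, sigma_s=1.0)
print(same, diff)
```

This only scores a candidate labeling; finding the minimizing labeling would need an MRF inference method, which the section does not name.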
6 Preliminary results for UCF-Sports Dataset
The following results were obtained without tuning the MRF objective function
parameters $\sigma_s$ and $\sigma_d$. The evaluation measures action localization
in the video sequences.
| Method               | mAP    |
|----------------------|--------|
| Raptis et al. [2011] | 79.4 % |
| Lan et al. [2012]    | 73.1 % |
| SDPM [2011]          | 75.2 % |
| Ma et al. [2013]     | 81.7 % |
| Xu et al. [2014]     | 78.8 % |
| Ours                 | 77.5 % |