Tv parser an automatic tv video parsing method_liang_20100309

Introduction
Our solution
TVParser model
Experimental Results
Conclusion

TVParser: An Automatic TV Video Parsing Method

Chao Liang

National Laboratory of Pattern Recognition (NLPR)
Chinese Academy of Sciences, Institute of Automation (CASIA)

March 9, 2011

Chao Liang TVParser: An Automatic TV Video Parsing Method

Outline
1 Introduction
Motivation
Related work
2 Our solution
Basic ideas
Role histogram
3 TVParser model
Model formulation
Parameter estimation
State inference
4 Experimental Results
Data sets
Face naming
Scene segmentation
5 Conclusion


Introduction
Motivation
Voluminous TV videos vs. efficient management


Introduction
TV video
Story plot (scene structure)
[Scene: Monica and Rachel's, Carol and Susan
are showing off Ben to the gang.]
Phoebe: Oh my God, oh, ok, was that too much pressure for him?
Susan: Oh, is he hungry already?
Carol: I guess so. (Carol starts to breast feed Ben.)
… …

[Scene: Central Perk, the gang is all there.]
Julie: Rachel, do you have any muffins left?
Rachel: Yeah, I forget which ones.
Julie: Oh, you're busy, that's ok, I'll get it. Anybody else want one?
… …

Characters (named faces)

RACH MNCA PHBE ROSS JOY CHAN


Related work
Movie/Script alignment
Script-subtitle alignment

Script:
[Scene: Rachel is entering the living room.]
Monica: Julie.
Rachel: What?!

Subtitle:
00:10:44,210 --> 00:10:45,177
Monica: Julie.
00:10:45,444 --> 00:10:46,775
Rachel: What?!

[Figure: alignment chain from script to subtitle to movie]

Disadvantages
Syntax and wording discrepancies between the script and the subtitles
Subtitles are not always available


Related work (cont.)
Face naming
Fully supervised
Weakly supervised

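[Figure: (a) weakly supervised naming, where faces are paired with script text ("[Scene: Rachel is entering the living room.] Monica: Julie. Rachel: What?!"); (b) fully supervised naming, where faces carry manual labels (Monica, Rachel)]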

Disadvantages
Manual labels are expensive
Hard to scale to large-scale applications


Related work (cont.)
Scene segmentation
Content-based method
Script-guided method
[Figure: script-guided alignment with an HMM. Observation sequence: Shot 1 to Shot 4 at t = 1, ..., 4, with emission probabilities b_{q1,shot1}, ..., b_{q4,shot4}. Hidden state sequence: Scene q1 to Scene q4, connected by transition probabilities a_{q1,q2}, a_{q2,q3}, a_{q3,q4}, a_{q1,q3}, a_{q2,q4}, a_{q1,q4}.]
HMM: λ = {A, B, π} = {A(a_{qi,qj}), B(b_{qi,shotj}), π}
Viterbi alignment: Q = {q1, q2, q3, q4, q5, ...}

Disadvantages
Matching units are asymmetric (script scenes vs. video shots)
An HMM implicitly imposes a geometric distribution on state durations


Our solution
Basic ideas
A generative TVParser model to align video and script by
mining face-name correspondence.

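[Figure: name histogram (columns C1 to C3) and face histogram (columns S1 to S12, as labeled in the figure), one row per role]

        C1 C2 C3 | S1 S2 S3 S4 S7 S8 S9 S10 S11 S12
JOEY     3  0  1 |  2  0  2  2  0  0  1   1   2   1
MNCA     2  1  0 |  2  0  1  1  1  0  2   0   0   0
RACH     1  1  0 |  1  0  0  1  0  1  1   0   0   0
CHAN     0  0  1 |  0  0  0  1  0  0  0   0   2   0

C1: {S1, ..., S4}   C2: {S6, ..., S8}   C3: {S10, ..., S12}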

Advantages
Face names can be identified in an unsupervised way (learning)
Globally optimal scene segmentation can be inferred (inference)
Fast algorithms for both parameter learning and state inference

Role histogram
Basic idea
Bag-of-Words (BoW) representation
Role composition is a generic and semantic feature for both
video (as face histogram) and script (as name histogram)
Name clustering
Face clustering
Difficulty: varying capture conditions, e.g. pose
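The name-histogram side of the role histogram can be sketched in a few lines. This is only an illustration, assuming the script marks scenes with "[Scene: ...]" and spoken lines with "Name: text" as in the Friends excerpt earlier in the deck; the function name and the regular expressions are hypothetical, and the face histogram would additionally require face clustering.

```python
import re
from collections import Counter

SCENE_RE = re.compile(r"^\[Scene:")            # assumed scene marker
SPEAKER_RE = re.compile(r"^([A-Z][A-Za-z]*):")  # assumed "Name: text" format

def name_histograms(script_lines):
    """Return one Counter of speaker names per script scene."""
    scenes = []
    for line in script_lines:
        line = line.strip()
        if SCENE_RE.match(line):
            scenes.append(Counter())       # a new scene starts
            continue
        m = SPEAKER_RE.match(line)
        if m and scenes:
            scenes[-1][m.group(1)] += 1    # one spoken line for this role
    return scenes

script = [
    "[Scene: Monica and Rachel's, Carol and Susan are showing off Ben to the gang.]",
    "Phoebe: Oh my God, oh, ok, was that too much pressure for him?",
    "Susan: Oh, is he hungry already?",
    "Carol: I guess so.",
    "[Scene: Central Perk, the gang is all there.]",
    "Julie: Rachel, do you have any muffins left?",
    "Rachel: Yeah, I forget which ones.",
    "Julie: Oh, you're busy, that's ok, I'll get it. Anybody else want one?",
]
hists = name_histograms(script)
```

Each Counter plays the role of one bag-of-words vector over character names.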


Role histogram
Face clustering
Solution I: Semi-supervised kernel k-means clustering

Key points
Incorporate pairwise constraints (must-link and cannot-link)
Adopt manifold-manifold distance

[Figure: must-link and cannot-link constraints; manifold-manifold distance]


Role histogram
Face clustering
Solution II: Relaxed cluster number
Key points
Allow more clusters than characters, so that each character may split into several purified sub-clusters


Model formulation

Graphical TVParser model

[Figure: graphical TVParser model. Each script scene s_i (shown for i-1, i, i+1) is linked to a partition p_i = (t_i, d_i), which selects the video shot subsequence v_(i) spanning shots t_i to t_i + d_i.]

$\mathcal{S} = \{s_i \mid i = 1, \dots, r\}$ is the observed script scene sequence;
$\mathcal{V} = \{v_j \mid j = 1, \dots, u\}$ is the observed video shot sequence;
$\mathcal{P} = \{p_i = (t_i, d_i) \mid i = 1, \dots, r\}$ is the hidden video scene partition sequence, where $t_1 = 1$, $\sum_i d_i = u$ and $t_i = t_{i-1} + d_{i-1}$ ($i > 1$).
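The constraints on the partition sequence can be made concrete with a small sketch. It builds the pairs (t_i, d_i) from a list of scene durations using t_1 = 1 and t_i = t_{i-1} + d_{i-1}; the function name is hypothetical, and the requirement that every scene covers at least one shot is an assumption not stated on the slide.

```python
def partition_from_durations(durations, num_shots):
    """Build p_i = (t_i, d_i) with t_1 = 1 and t_i = t_{i-1} + d_{i-1}."""
    if any(d < 1 for d in durations):
        # Assumption: a scene cannot be empty.
        raise ValueError("every scene must cover at least one shot")
    if sum(durations) != num_shots:
        raise ValueError("durations must sum to the number of shots u")
    partition, t = [], 1
    for d in durations:
        partition.append((t, d))
        t += d                      # next scene starts where this one ends
    return partition

example = partition_from_durations([4, 3, 5], num_shots=12)
```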


Model formulation

Complete TVParser model

$$P(\mathcal{V}, \mathcal{S}, \mathcal{P}) = P(s_1)\, P(p_1 \mid s_1)\, P(v_{(1)} \mid p_1, s_1) \times \prod_{i=2}^{r} P(s_i \mid s_{i-1})\, P(p_i \mid s_i)\, P(v_{(i)} \mid p_i, s_i)$$

The generative process
(1) Enter the $i$-th script scene $s_i$ from its predecessor $s_{i-1}$;
(2) Decide the partition $p_i = (t_i, d_i)$ related to $s_i$;
(3) Generate the corresponding video shot subsequence $v_{(i)} = v_{[t_i : t_i + d_i]}$, indexed from $t_i$ to $t_i + d_i$.


Model formulation
Additional constraint
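$P(s_1) = 1 \Leftrightarrow s_1 = 1$
$P(s_i \mid s_{i-1}) = 1 \Leftrightarrow s_i = i,\ s_{i-1} = i - 1$

Simplified TVParser model
$$P(\mathcal{V}, \mathcal{S}, \mathcal{P}) = \prod_{i=1}^{r} \underbrace{P(p_i \mid s_i)}_{\text{duration}}\; \underbrace{P(v_{(i)} \mid p_i, s_i)}_{\text{observation}}$$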


Model formulation
Scene duration probability
Poisson distribution

$$P(p_i \mid s_i; \lambda_i) = \frac{\lambda_i^{d_i}\, e^{-\lambda_i}}{d_i!} = e^{-\lambda_i} \cdot \frac{\lambda_i^{d_i}}{d_i!}$$

Reasons
Poisson is a plausible model of state duration;
The model parameter $\lambda = \{\lambda_i\}$ is the expected duration of the scenes;
The parameter can be estimated by the maximum likelihood method
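A minimal sketch of the Poisson duration term, evaluated in the log domain for numerical stability. The function names and the example durations are hypothetical; the maximum likelihood estimate of a Poisson rate from fully observed durations is their sample mean, which is a standard result rather than something shown on the slide.

```python
import math

def poisson_log_pmf(d, lam):
    """log P(d; lam) = d * log(lam) - lam - log(d!)."""
    return d * math.log(lam) - lam - math.lgamma(d + 1)

def poisson_mle(durations):
    """Maximum likelihood rate: the sample mean of the observed durations."""
    return sum(durations) / len(durations)

lam_hat = poisson_mle([18, 22, 20])            # hypothetical scene lengths in shots
p20 = math.exp(poisson_log_pmf(20, lam_hat))   # probability of a 20-shot scene
```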


Model formulation
Observation probability
Gaussian distribution

$$P(v_{(i)} \mid p_i, s_i; A, \sigma_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left\{ -\frac{(s_i - A v_{(i)})^\top (s_i - A v_{(i)})}{2\sigma_i^2} \right\}$$

Meaning of parameter A
$A = [A_{ij}] \in \mathbb{R}^{M \times N}$ is the face-name relation matrix that associates $M$ names with $N$ face clusters. By constraining the entries of $A$ so that $A_{ij} \ge 0$ and $\sum_i A_{ij} = 1$, we can treat each column as the identity distribution of one face cluster.
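The observation term can be sketched as follows. The code maps a face histogram v through a column-stochastic matrix A and scores the result against the name histogram s with an isotropic Gaussian, using the normaliser 1/sqrt(2πσ²). The function name and the toy numbers are hypothetical.

```python
import math

def observation_log_likelihood(s, v, A, sigma2):
    """Log of the Gaussian observation term centred at A v.

    s: name histogram (length M); v: face histogram (length N);
    A: M x N matrix as a list of rows, each column summing to 1.
    """
    predicted = [sum(a * x for a, x in zip(row, v)) for row in A]    # A v
    sq_err = sum((si - pi) ** 2 for si, pi in zip(s, predicted))     # squared residual
    return -0.5 * math.log(2 * math.pi * sigma2) - sq_err / (2 * sigma2)

# Hypothetical toy case: 2 names, 3 face clusters.
A = [[1.0, 0.8, 0.0],
     [0.0, 0.2, 1.0]]
v = [2.0, 5.0, 1.0]
good = observation_log_likelihood([6.0, 2.0], v, A, sigma2=1.0)   # s equals A v
bad = observation_log_likelihood([2.0, 6.0], v, A, sigma2=1.0)    # names swapped
```

A name histogram that matches the mapped face histogram scores higher than a mismatched one, which is what lets the alignment and the face-name matrix inform each other.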


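Parameter estimation
Model parameters $\Psi = \{\{\lambda_i\}, \{\sigma_i^2\}, A\}$
Maximum likelihood estimation (MLE)

$$\max_{\hat{\Psi}} \sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi) \cdot \log P(\mathcal{V}, \mathcal{S}, \mathcal{P}; \hat{\Psi}) \quad \text{s.t.} \quad \mathbf{1}_M^\top A = \mathbf{1}_N^\top,\ A \ge 0$$

Optimization problem
For $\{\lambda_i\}$ and $\{\sigma_i\}$: unconstrained optimization
For $A$: constrained optimization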


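Re-estimation for $\{\lambda_i\}$

$$\lambda_i = \frac{\sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi) \cdot d_i}{\sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi)}$$

Re-estimation for $\{\sigma_i\}$

$$\sigma_i^2 = \frac{\sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi) \cdot (s_i - A v_{(i)})^\top (s_i - A v_{(i)})}{\sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi)}$$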
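Both re-estimation formulas are posterior-weighted averages over the candidate partitions of one scene. The sketch below assumes the posteriors and squared residuals have already been computed and are passed in as dictionaries keyed by (t, d); the function name, that data layout, and the toy numbers are all hypothetical.

```python
def reestimate_scene(posterior, residual_sq):
    """Posterior-weighted updates of lambda_i and sigma_i^2 for one scene i.

    posterior:   {(t, d): P(p_i = (t, d) | V, S; Psi)}
    residual_sq: {(t, d): (s_i - A v_(i))^T (s_i - A v_(i))}
    """
    total = sum(posterior.values())
    lam = sum(w * d for (t, d), w in posterior.items()) / total
    sigma2 = sum(w * residual_sq[p] for p, w in posterior.items()) / total
    return lam, sigma2

# Hypothetical posteriors for a scene that starts at shot 1.
posterior = {(1, 3): 0.2, (1, 4): 0.5, (1, 5): 0.3}
residual_sq = {(1, 3): 2.0, (1, 4): 1.0, (1, 5): 4.0}
lam, sigma2 = reestimate_scene(posterior, residual_sq)
```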


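Re-estimation for $A$

$$A_{ij} \leftarrow A_{ij}\, \frac{(W - \mathbf{1}_M \eta^\top)^{+}_{ij}}{2(AU)_{ij} + (W - \mathbf{1}_M \eta^\top)^{-}_{ij}}$$

where

$$W = \sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi) \sum_{i=1}^{r} \frac{1}{\sigma_i^2}\, s_i v_{(i)}^\top$$

$$U = \sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi) \sum_{i=1}^{r} \frac{1}{2\sigma_i^2}\, v_{(i)} v_{(i)}^\top$$

$$\eta^\top = \frac{1}{M} \cdot \left( \mathbf{1}_M^\top W - 2 \cdot \mathbf{1}_N^\top U \right)$$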
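One step of the multiplicative update can be sketched in plain Python. The sketch assumes that the superscripts + and - denote the element-wise positive and negative parts, that W (M x N) and U (N x N) are already accumulated, and that a small eps guards against division by zero; the function name, eps, and the toy matrices are hypothetical.

```python
def update_A(A, W, U, eps=1e-12):
    """One multiplicative step for the M x N face-name matrix A (list of rows)."""
    M, N = len(A), len(A[0])
    col_W = [sum(W[i][j] for i in range(M)) for j in range(N)]    # 1_M^T W
    col_U = [sum(U[k][j] for k in range(N)) for j in range(N)]    # 1_N^T U
    eta = [(col_W[j] - 2.0 * col_U[j]) / M for j in range(N)]
    new_A = []
    for i in range(M):
        row = []
        for j in range(N):
            g = W[i][j] - eta[j]                     # (W - 1_M eta^T)_ij
            pos, neg = max(g, 0.0), max(-g, 0.0)     # positive / negative parts
            AU = sum(A[i][k] * U[k][j] for k in range(N))
            row.append(A[i][j] * pos / (2.0 * AU + neg + eps))
        new_A.append(row)
    return new_A

# Hypothetical toy statistics for 2 names and 2 face clusters.
A0 = [[0.5, 0.5], [0.5, 0.5]]
W = [[3.0, 1.0], [1.0, 3.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
A1 = update_A(A0, W, U)
```

Because every factor is non-negative, the update keeps A non-negative without an explicit projection step.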


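Summation in both $W$ and $U$

$$\sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi)$$

Sum over the whole space of possible partition sequences
Typical example: $r = 15$ (scenes) and $u = 300$ (shots), then
the number of possible segmentations is $C_{299}^{15} \approx O(10^{24})$ (intractable!)

Solution: Sequence ⇒ segments
For any per-scene term $f(p_i)$,
$$\sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi) \sum_{i=1}^{r} f(p_i) = \sum_{i=1}^{r} \sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi)\, f(p_i)$$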
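The size of the partition space is easy to check numerically. Under the assumption that scenes are contiguous and non-empty, cutting u shots into r scenes means choosing r - 1 cut points out of u - 1, which gives a binomial coefficient one index below the one quoted on the slide; both are far too large to enumerate. The function name is hypothetical.

```python
import math

def count_partitions(num_shots, num_scenes):
    """Ways to cut ordered shots into contiguous, non-empty scenes."""
    return math.comb(num_shots - 1, num_scenes - 1)

n_exact = count_partitions(300, 15)   # C(299, 14) under the non-empty assumption
n_slide = math.comb(299, 15)          # the binomial coefficient quoted on the slide
```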


Posterior probability P(pi |V, S; Ψ)
Forward-backward algorithm

Forward-backward variables

$$\alpha_{p_i}(s_i) \triangleq P(s_i, p_i, v_{[1 : t_i + d_i]}; \Psi)$$
$$\beta_{p_i}(s_i) \triangleq P(v_{[t_i + d_i + 1 : u]} \mid s_i, p_i; \Psi)$$

Forward-backward recursion
Initial conditions
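The slide does not spell out the recursions, so the following is only a sketch of one way to obtain the segment posteriors with a forward-backward pass over scene boundaries. It assumes ordered, non-empty scenes, 0-based half-open shot spans [a, b), and a caller-supplied `weight(i, a, b)` that returns the product of the duration and observation terms; all names are hypothetical.

```python
def segment_posteriors(weight, r, u):
    """Posterior of scene i covering shots a..b-1, for all feasible (i, a, b)."""
    # alpha[i][b]: total score of the first i scenes covering shots 0..b-1
    alpha = [[0.0] * (u + 1) for _ in range(r + 1)]
    alpha[0][0] = 1.0
    for i in range(1, r + 1):
        for b in range(i, u + 1):
            alpha[i][b] = sum(alpha[i - 1][a] * weight(i - 1, a, b)
                              for a in range(i - 1, b))
    # beta[i][a]: total score of scenes i..r-1 covering shots a..u-1
    beta = [[0.0] * (u + 1) for _ in range(r + 1)]
    beta[r][u] = 1.0
    for i in range(r - 1, -1, -1):
        for a in range(i, u):
            beta[i][a] = sum(weight(i, a, b) * beta[i + 1][b]
                             for b in range(a + 1, u + 1))
    z = alpha[r][u]   # total score of all partitions
    return {(i, a, b): alpha[i][a] * weight(i, a, b) * beta[i + 1][b] / z
            for i in range(r) for a in range(u) for b in range(a + 1, u + 1)
            if alpha[i][a] * beta[i + 1][b] > 0.0}

# Toy case: 2 scenes, 3 shots, uniform weights.
post = segment_posteriors(lambda i, a, b: 1.0, 2, 3)
expected_d0 = sum(p * (b - a) for (i, a, b), p in post.items() if i == 0)
```

The cost is O(r·u²) instead of a sum over every partition sequence, and `expected_d0` is the posterior-weighted duration of the first scene.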


State inference
Hidden partition sequence P ∗
Viterbi Algorithm

Local optimum

$$\delta_\tau(s_i; \theta) \triangleq \max_{p_{[1:i-1]}} P(p_{[1:i-1]}, s_{[1:i-1]}, \tau \in q_i, o_{[1:\tau]}; \theta)$$

Forward recursion
Backtracking
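The forward recursion and backtracking can be sketched as a dynamic program over scene boundaries. This is an illustration under the same assumptions as a standard segmental Viterbi (ordered, non-empty scenes, 0-based half-open spans), not necessarily the exact recursion on the slide; `log_weight(i, a, b)` is a hypothetical callback returning the log duration term plus the log observation term.

```python
import math

def best_partition(log_weight, r, u):
    """Highest-scoring partition of u shots into r ordered, non-empty scenes.

    Returns a list of (start, duration) pairs with 0-based starts.
    """
    neg_inf = -math.inf
    delta = [[neg_inf] * (u + 1) for _ in range(r + 1)]   # best score so far
    back = [[0] * (u + 1) for _ in range(r + 1)]          # start of the last scene
    delta[0][0] = 0.0
    for i in range(1, r + 1):                             # forward recursion
        for b in range(i, u + 1):
            for a in range(i - 1, b):
                score = delta[i - 1][a] + log_weight(i - 1, a, b)
                if score > delta[i][b]:
                    delta[i][b], back[i][b] = score, a
    parts, b = [], u                                      # backtracking
    for i in range(r, 0, -1):
        a = back[i][b]
        parts.append((a, b - a))
        b = a
    return parts[::-1]

# Toy scores that prefer durations 2 and 3 for the two scenes.
target = [2, 3]
best = best_partition(lambda i, a, b: -abs((b - a) - target[i]), 2, 5)
```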


Data sets
Two TV series
6 episodes from American TV series “Friends”
5 episodes from Chinese TV series “I Love My Family” (Family)

Data details (average per episode)
Length: 30 min
Role number: 10
Face number: 2 × 10^5
Shot number: 300


Face naming
Baselines
Face clustering
Unconstrained kernel K-means (KK)
Constrained K-means (CK)
Completely positive factorization (CP)
Constrained spectral learning (SL)

Face Recognition
K nearest neighbor (KNN)
Support vector machine (SVM)


Face naming
Criteria
Face clustering
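$$\mathrm{NMI} = \frac{\sum_l \sum_h n_{l,h} \log\left( \frac{n \cdot n_{l,h}}{n_l n_h} \right)}{\sqrt{\left( \sum_l n_l \log \frac{n_l}{n} \right) \left( \sum_h n_h \log \frac{n_h}{n} \right)}}$$

where $n$ is the number of objects, $n_l$ is the size of the $l$-th class
in the ground truth, $n_h$ is the size of the $h$-th cluster in the result,
and $n_{l,h}$ is the size of their intersection.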
Face Recognition
$$F_w = \sum_i w_i \cdot \frac{2 \times \mathrm{precision}_i \times \mathrm{recall}_i}{\mathrm{precision}_i + \mathrm{recall}_i}$$

where wi denotes the weight of the i th role according to
his/her spoken lines in the script.
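Both criteria are short to compute. The sketch below assumes the NMI is normalised by the square root of the product of the two entropy terms, and that the role weights w_i are already normalised to sum to one; the function names and toy inputs are hypothetical.

```python
import math
from collections import Counter

def nmi(truth, pred):
    """Normalised mutual information between two labelings of the same objects."""
    n = len(truth)
    n_l, n_h = Counter(truth), Counter(pred)
    n_lh = Counter(zip(truth, pred))                      # intersection sizes
    num = sum(c * math.log(n * c / (n_l[l] * n_h[h]))
              for (l, h), c in n_lh.items())
    den_l = sum(c * math.log(c / n) for c in n_l.values())
    den_h = sum(c * math.log(c / n) for c in n_h.values())
    den = math.sqrt(den_l * den_h)
    return num / den if den > 0 else 0.0                  # single-cluster edge case

def weighted_f(weights, precisions, recalls):
    """Weighted sum of per-role F-measures."""
    total = 0.0
    for w, p, r in zip(weights, precisions, recalls):
        if p + r > 0:
            total += w * 2 * p * r / (p + r)
    return total

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # same grouping, renamed clusters
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
fw = weighted_f([0.5, 0.5], [1.0, 0.5], [1.0, 0.5])
```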

Face naming
Face clustering
Constrained vs. unconstrained clustering
Varying cluster number

[Figure: NMI score vs. cluster number (as a multiple of the character number, ×0.0 to ×5.0) for Friends and Family; curves: CK, KK, SSKK, SL, CP]


Face naming
Face recognition (naming)
Optimal recognition is achieved when the cluster number is
about twice the number of characters

[Figure: A purifying rate, precision, recall and Fw-measure vs. cluster number (×0.0 to ×5.0) for Friends and Family]


Face naming
Main character naming result
Accuracy
Robustness

[Figure: weighted F-measure of the 1st to 4th main characters (x-axis ticks ×0.0 to ×5.0) for Friends and Family]


Face naming
Comparison with supervised methods
Comparable to supervised methods
Even better when training set is limited

[Figure: weighted F-measure vs. training-test ratio (0.1 to 0.9) for Friends and Family; curves: KNN, SVM, TVParser (1st best), TVParser (2nd best), TVParser (3rd best)]


Scene segmentation
Baselines
Scene segmentation methods (algorithms)
Shot similarity graph (SSG)
Dynamic time warping (DTW)
Hidden Markov model (HMM)


Scene segmentation
Criteria
Scene segmentation
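$$\rho = \left( \sum_{i=1}^{r} \frac{d_i}{u} \sum_{j=1}^{r} \frac{d_{ij}^2}{d_i^2} \right) \cdot \left( \sum_{j=1}^{r} \frac{d_j^*}{u} \sum_{i=1}^{r} \frac{d_{ij}^2}{d_j^{*2}} \right)$$

where $d_{ij}$ is the length of the overlap between the scene segments
$p_i$ and $p_j^*$, $d_i$ is the length of the scene $p_i$, $d_j^*$ is the length
of the ground-truth scene $p_j^*$, and $u$ is the total length of all scenes.
The purity value ranges from 0 to 1; the larger the value, the closer
the segmentation is to the ground truth.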
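A short sketch of the purity criterion, assuming both segmentations are given as lists of (start, duration) pairs that cover the same shots. The function names are hypothetical, and the code allows the two segmentations to have different numbers of scenes.

```python
def overlap(p, q):
    """Length of the overlap between two (start, duration) segments."""
    return max(0, min(p[0] + p[1], q[0] + q[1]) - max(p[0], q[0]))

def purity(result, truth):
    """Purity of a result segmentation against a ground-truth segmentation."""
    u = sum(d for _, d in result)
    if u != sum(d for _, d in truth):
        raise ValueError("both segmentations must cover the same number of shots")
    first = sum(d / u * sum(overlap((t, d), q) ** 2 for q in truth) / d ** 2
                for t, d in result)
    second = sum(d / u * sum(overlap(p, (t, d)) ** 2 for p in result) / d ** 2
                 for t, d in truth)
    return first * second

perfect = purity([(1, 4), (5, 3), (8, 5)], [(1, 4), (5, 3), (8, 5)])
merged = purity([(1, 4)], [(1, 2), (3, 2)])   # result merges two true scenes
```

The first factor penalises result scenes that span several true scenes, and the second penalises true scenes that are split across several result scenes.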


Scene segmentation

Scene segmentation result

Method     Sources (video +)    Purity (Friends)   Purity (Family)
SSG        -                    0.55 ± 0.11        0.53 ± 0.07
DTW        subtitle + script    0.60 ± 0.13        -
HMM        script               0.59 ± 0.08        0.53 ± 0.05
TVParser   script               0.67 ± 0.07        0.58 ± 0.03


Scene segmentation
Scene segmentation result under various role histograms
Name histogram: first four characters are dominant
Face histogram: more clusters are generally better

[Figure: purity score as a function of name histogram dimension (2 to 10) and face histogram dimension (×0.00 to ×2.50), with two side plots of average purity against "Face histogram size" (ticks 2 to 11, and ×0.25 to ×2.25); annotated gains: ↑0.05 (≈29%) and ↑0.12 (≈71%)]


Conclusion

We propose a generative model to formulate story plot
development in TV videos, which solves face naming and
scene segmentation in a unified framework.

Key novelties
Unsupervised face naming through model parameter learning
Globally optimal scene segmentation by hidden state inference
Fast algorithms for both parameter learning and state inference

Future work
Personalized applications, e.g. TV video synthesis;
Generic cross-media analysis and association methods.


Q&A
Thanks!

