"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
Video Copy Detection Using Inclined Video Tomography and Bag-of-Visual-Words
Hyun-seok Min, Se Min Kim, Wesley De Neve, and Yong Man Ro
Image and Video Systems Lab
Korea Advanced Institute of Science and Technology (KAIST)
Daejeon, South Korea
e-mail: ymro@ee.kaist.ac.kr website: http://ivylab.kaist.ac.kr
I. INTRODUCTION
- BoVW-based approaches can be effectively used for the detection of both image and video copies
  - however, these approaches typically ignore the inherent temporal nature of video content
- Conventional video tomography extracts slices from a space-time cube that are parallel to the time axis
  - however, slices that are parallel to the time axis do not take advantage of spatial information
- This paper proposes to create a content-based video signature by means of the following two sequential steps
  1) extraction of inclined tomography images from the video content
     - the angle of inclination is dependent on the amount of motion
  2) characterization of the inclined tomography images by means of BoVW

III. VIDEO MATCHING USING HISTOGRAMS
- The dissimilarity between two video clips $V^q$ and $V^r$:

  $$D(V^q, V^r) = \min_{p} \frac{1}{N} \sum_{i=1}^{N} D_{shot}(S_i^q, S_{i+p}^r), \qquad V^q = \langle S_i^q \rangle_{i=1}^{N}, \quad V^r = \langle S_l^r \rangle_{l=1}^{L}$$

  - p: the position of the video shot in the reference video clip at which similarity measurement starts
- The dissimilarity between two video shots $S^q$ and $S^r$ is measured by making use of the cosine similarity:

  $$D_{shot}(S^q, S^r) = 1 - \frac{\sum_{j=1}^{M} a_j^q \times a_j^r}{\sqrt{\sum_{j=1}^{M} (a_j^q)^2} \, \sqrt{\sum_{j=1}^{M} (a_j^r)^2}}$$

  - M: the number of visual words in the vocabulary
  - $a_j$: the weight of the jth visual word
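The two dissimilarity measures of Section III can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the authors' implementation: the function names `shot_dissimilarity` and `clip_dissimilarity` are hypothetical, each shot is assumed to be given directly as its vector of visual-word weights, the offset p is assumed to range over every position at which the query fits entirely inside the reference clip (0-based here), and a shot with an all-zero vector is assumed to be maximally dissimilar.

```python
import math
from typing import Sequence


def shot_dissimilarity(a_q: Sequence[float], a_r: Sequence[float]) -> float:
    """D_shot = 1 - cosine similarity of two visual-word weight vectors."""
    dot = sum(x * y for x, y in zip(a_q, a_r))
    norm_q = math.sqrt(sum(x * x for x in a_q))
    norm_r = math.sqrt(sum(y * y for y in a_r))
    if norm_q == 0.0 or norm_r == 0.0:
        # Assumption: an empty histogram matches nothing.
        return 1.0
    return 1.0 - dot / (norm_q * norm_r)


def clip_dissimilarity(query: Sequence[Sequence[float]],
                       reference: Sequence[Sequence[float]]) -> float:
    """D(V^q, V^r): minimum over offsets p of the mean shot dissimilarity."""
    n, l = len(query), len(reference)
    if n == 0 or n > l:
        raise ValueError("query must be non-empty and no longer than reference")
    best = math.inf
    # Assumption: p covers every alignment where the query fits in the reference.
    for p in range(l - n + 1):
        total = sum(shot_dissimilarity(query[i], reference[i + p])
                    for i in range(n))
        best = min(best, total / n)
    return best


if __name__ == "__main__":
    reference = [[1, 1], [1, 0], [0, 1], [1, 1]]
    query = [[1, 0], [0, 1]]  # identical to reference shots at offset p = 1
    print(clip_dissimilarity(query, reference))
```

Because the measure is one minus a cosine, it depends only on the direction of the weight vectors, so scaling a histogram (for example, by the number of sub-cubes) does not change the result.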
II. CREATION OF A VIDEO SIGNATURE BY MEANS OF INCLINED VIDEO TOMOGRAPHY AND BOVW
1. Extraction of inclined tomography images
- To extract inclined tomography images from a video clip V, we first segment V into N space-time cubes such that $V = \langle S_1, S_2, \ldots, S_N \rangle$
- We subsequently segment each space-time cube into several space-time sub-cubes
  - $F_v$, $F_b$: number of frames in a space-time cube and in a space-time sub-cube
  - $W_v$, $W_b$: width of a space-time cube and of a space-time sub-cube
  - $H_v$, $H_b$: height of a space-time cube and of a space-time sub-cube
Fig. 1. Segmentation of a space-time cube into space-time sub-cubes.
- The angle of inclination θ of the tomography image extracted reflects the intensity of motion in the space-time sub-cube under consideration:

  $$\theta = \frac{\beta}{W_b \times H_b \times F_b} \sum_{(x, y, f)} \left| L(x, y, f+1) - L(x, y, f) \right|$$

  - L(x, y, t): the luminance value of a pixel (x, y) of a particular frame at time t
  - β: a weight parameter
Fig. 2. Extraction of an inclined tomography image from a space-time sub-cube.
2. BoVW applied to inclined tomography images
- each space-time cube $S_i$ can be represented as a vector $A_i$ that summarizes how the space-time sub-cubes are distributed over the vocabulary of visual words used:

  $$A_i = \langle a_{i,1}, a_{i,2}, \ldots, a_{i,M} \rangle$$

  - M: the number of visual words $v_j$ in the vocabulary used
  - $a_{i,j}$: the weight of the jth visual word
Fig. 3. Extraction of a histogram of visual words from an inclined tomography image.

IV. EXPERIMENTS
1. Experimental setup
- Use of TRECVID 2009 for creating NDVCs and reference video clips
- Use of 100 query video clips, created by applying five transformations to 20 video clips randomly selected from the reference video database
  - blurring: we blurred frames using a Gaussian kernel with a radius of 15;
  - picture-in-picture: we inserted a picture with a size that is 30% of the size of the main frame;
  - change in brightness: we increased the brightness by 40%;
  - mirroring: we reversed frames from the left to the right;
  - change in frame rate: we halved the frame rate.
2. Experimental results
[Figure 4: two plots, one of precision and one of recall, per transformation (blur, pattern insertion, change in brightness, mirroring, frame rate change, average) for three signatures: proposed video signature, BoVW using SIFT, and video tomography.]
Fig. 4. Comparison of the effectiveness of several video signatures.
Fig. 5. Example images: (a) example key frame and (b) 16 inclined tomography images extracted from the key frame shown in (a).

V. CONCLUSIONS
- This paper introduced a novel video signature that takes advantage of both inclined video tomography and BoVW
- The proposed video signature is able to capture both spatial and temporal information
  - the angle of inclination of the extracted tomography images is dependent on the amount of motion in the local volumes
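The two signature-creation steps of Section II can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names `inclination_angle` and `bovw_vector` are hypothetical, a sub-cube is assumed to be a nested list of luminance values indexed as `sub_cube[f][y][x]`, the sum over f is assumed to run over all consecutive frame pairs inside the sub-cube, and the absolute frame difference is used as the motion measure. The poster does not say which descriptors are computed on the tomography images or how weights are assigned, so the sketch assumes generic descriptor vectors, nearest-word assignment by squared Euclidean distance, and raw counts as the weights $a_{i,j}$.

```python
from typing import List, Sequence


def inclination_angle(sub_cube: Sequence[Sequence[Sequence[float]]],
                      beta: float = 1.0) -> float:
    """theta = beta / (W_b * H_b * F_b) * sum |L(x,y,f+1) - L(x,y,f)|."""
    f_b = len(sub_cube)          # number of frames in the sub-cube
    h_b = len(sub_cube[0])       # height of the sub-cube
    w_b = len(sub_cube[0][0])    # width of the sub-cube
    total = 0.0
    # Assumption: f ranges over consecutive frame pairs within the sub-cube.
    for f in range(f_b - 1):
        for y in range(h_b):
            for x in range(w_b):
                total += abs(sub_cube[f + 1][y][x] - sub_cube[f][y][x])
    return beta * total / (w_b * h_b * f_b)


def bovw_vector(descriptors: Sequence[Sequence[float]],
                vocabulary: Sequence[Sequence[float]]) -> List[int]:
    """Build A_i by counting, per visual word, the descriptors nearest to it."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        # Assumption: hard assignment to the closest visual word.
        distances = [sum((a - b) ** 2 for a, b in zip(d, word))
                     for word in vocabulary]
        counts[distances.index(min(distances))] += 1
    return counts


if __name__ == "__main__":
    # Two 2x2 frames whose luminance rises by 10 everywhere.
    moving = [[[0, 0], [0, 0]], [[10, 10], [10, 10]]]
    print(inclination_angle(moving, beta=1.0))
    print(bovw_vector([[1, 1], [9, 9], [0, 2]], [[0, 0], [10, 10]]))
```

A static sub-cube yields θ = 0, and larger frame-to-frame luminance changes yield a larger θ; the scale and unit of the angle depend entirely on the choice of β, which the poster leaves unspecified.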
IEEE International Conference on Multimedia and Expo (ICME), July 2012, Melbourne (Australia)