This document summarizes a research paper on near-duplicate video retrieval using features extracted from intermediate layers of convolutional neural networks. The researchers extract features from multiple layers of pretrained CNNs such as AlexNet, VGGNet, and GoogLeNet. They aggregate the features with two schemes: vector-based aggregation, which concatenates the per-layer features into a single descriptor, and layer-based aggregation, which averages features within each layer, producing one descriptor per layer. The aggregated representations are indexed and used to retrieve near-duplicate videos from a dataset. The approach outperforms previous methods on standard evaluation metrics, achieving a mean average precision of up to 0.81. The researchers also discuss extending the work to 3D CNNs and evaluating on larger, more challenging datasets.
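
The two aggregation schemes described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the spatial max-pooling step, the array shapes, and the function names (`pool_layer`, `vector_aggregation`, `layer_aggregation`) are assumptions for demonstration, with each CNN layer's activations represented as a NumPy array of shape (channels, height, width).

```python
import numpy as np

def pool_layer(activations):
    # Reduce one layer's activation tensor (C, H, W) to a C-dim vector
    # by max-pooling each channel's spatial map (an assumed pooling choice).
    return activations.reshape(activations.shape[0], -1).max(axis=1)

def vector_aggregation(per_layer_activations):
    # Vector-based aggregation: concatenate the pooled vectors of all
    # layers into a single global descriptor.
    return np.concatenate([pool_layer(a) for a in per_layer_activations])

def layer_aggregation(frames_per_layer):
    # Layer-based aggregation: average frame-level vectors within each
    # layer, keeping one descriptor per layer.
    return [np.mean([pool_layer(f) for f in frames], axis=0)
            for frames in frames_per_layer]
```

For example, pooled features from a 64-channel layer and a 128-channel layer yield a 192-dimensional vector under vector-based aggregation, while layer-based aggregation keeps the two layers as separate 64- and 128-dimensional descriptors.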