Presenter: Claire-Hélène Demarty, Technicolor, France
Paper: http://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_19.pdf
Video: https://youtu.be/WDlCL1bFPEA
Authors: Eloise Berson, Claire-Hélène Demarty, Ngoc Q.K. Duong
Abstract: This paper summarizes the computational models that Technicolor proposes to predict the interestingness of images and videos within the MediaEval 2017 Predicting Media Interestingness Task. Our systems are based on deep learning architectures and exploit both semantic and multimodal features. Based on the obtained results, we discuss our findings and derive some scientific perspectives for the task.
MediaEval 2017 - Interestingness Task: Multimodality and Deep Learning when predicting Media Interestingness
1. Multimodality and Deep Learning when predicting Media Interestingness
Eloise Berson – Claire-Hélène Demarty – Ngoc Q.K. Duong
Technicolor, France
MediaEval 2017 Workshop, September 13-15, 2017
2. Motivation
Build incrementally from last year's systems
Re-use similar features and DNN architectures
➢ Make use of the multimodal nature of content
➢ Model its temporal evolution
3. Motivation
Build incrementally from last year's systems
Re-use similar features and DNN architectures
➢ Make use of the multimodal nature of content
➢ Model its temporal evolution
Investigate the benefit of adding semantic & contextual information to the content
➢ Add a textual modality from IMDb movie descriptions
➢ Use Image Captioning-based features
4. Features
For images and video frames:
CNN features from the fc7 layer of the CaffeNet model
Dimension: 4096
For audio:
60 MFCC features + first & second derivatives
Dimension: 180
For image only:
Image Captioning-Based (ICB) features [1]
Dimension: 1024
For text (used for both the image & video subtasks):
From IMDb description → keyword extraction → Word2Vec (W2V) embedding
Dimension: 300
[1] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
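As an illustration of the non-CNN features above, the sketch below shows one plausible way to compute the 180-dimensional audio descriptor (60 MFCCs plus first and second derivatives) and a 300-dimensional W2V text descriptor obtained by averaging word vectors over extracted keywords. It is only a sketch under assumptions: the slides do not specify the MFCC frame parameters, the keyword extractor, or how frame-level features are pooled, so the librosa/gensim calls, the mean pooling, and the stopword-based keyword filter used here are illustrative choices, not the authors' exact pipeline.

```python
# Hedged sketch: plausible extraction of the audio (MFCC) and text (W2V) features
# described on this slide. Frame parameters, pooling, and keyword extraction are
# assumptions, not the authors' exact pipeline.
import numpy as np
import librosa                          # audio feature extraction
from gensim.models import KeyedVectors  # pretrained Word2Vec vectors


def audio_feature(wav_path):
    """60 MFCCs + first & second derivatives -> 180-dim clip descriptor."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=60)   # (60, n_frames)
    d1 = librosa.feature.delta(mfcc)                      # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)             # second derivative
    feats = np.concatenate([mfcc, d1, d2], axis=0)        # (180, n_frames)
    return feats.mean(axis=1)                             # assumed mean pooling over frames


STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "his", "her"}

def text_feature(imdb_description, w2v):
    """IMDb description -> keywords -> averaged 300-dim Word2Vec embedding."""
    # Naive keyword extraction by stopword filtering (placeholder for the
    # unspecified extractor used by the authors).
    keywords = [w for w in imdb_description.lower().split()
                if w.isalpha() and w not in STOPWORDS]
    vecs = [w2v[w] for w in keywords if w in w2v]
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.mean(vecs, axis=0)                          # (300,)
```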
5. Image subtask – different feature concatenations
Run#1: Baseline: 2016 best run
Run#2: ICB features
Run#3: CNN+W2V features
Run#4: CNN+ICB features
Run#5: CNN+ICB+W2V features
6. Image subtask – different feature concatenations
Run#1: Baseline: 2016 best run
Run#2: ICB features
Run#3: CNN+W2V features
Run#4: CNN+ICB features
Run#5: CNN+ICB+W2V features
➢ Cross-validation (80%–20% split)
➢ Re-train on the complete devset
➢ Resampling of the data
➢ Classifier: a single MLP layer, ReLU activation, dropout = 0.5
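To make the run configurations concrete, here is a minimal sketch of the kind of classifier described above: a single MLP layer with ReLU activation and dropout 0.5 applied to the concatenated features (e.g. CNN 4096 + ICB 1024 + W2V 300 = 5420 dimensions for Run#5). The hidden size, loss, optimizer, and training schedule are not given on the slides, so the choices below (PyTorch, a 512-unit hidden layer, a sigmoid output) are assumptions, not the authors' exact setup.

```python
# Hedged sketch of the image-subtask classifier: one MLP layer with ReLU and
# dropout=0.5 on top of concatenated features. Hidden size, output layer, loss
# and optimizer are assumptions (not specified on the slides).
import torch
import torch.nn as nn


class InterestingnessMLP(nn.Module):
    def __init__(self, in_dim=4096 + 1024 + 300, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),   # single MLP layer (interpreted as one hidden layer)
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(hidden, 1),        # interestingness score
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x)).squeeze(-1)


# Usage on a batch of concatenated CNN+ICB+W2V features (Run#5 configuration):
model = InterestingnessMLP()
features = torch.randn(8, 4096 + 1024 + 300)   # dummy batch of 8 images
scores = model(features)                        # predicted interestingness in [0, 1]
```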
7. Image subtask – results
Run  Features     Dev MAP@10  Dev MAP  Test MAP@10  Test MAP
1    CNN          0.27        0.31     0.1028       0.2615
2    ICB          0.33        0.36     0.1054       0.2525
-    W2V          0.23        0.28     -            -
3    CNN+W2V      0.35        0.38     0.0693       0.2244
-    ICB+W2V      0.33        0.37     -            -
4    CNN+ICB      0.29        0.32     0.0875       0.2382
5    CNN+ICB+W2V  -           -        0.0861       0.2347
2016 MAP: 0.2336
Improved MAP over 2016: is the dataset/annotation quality better?
Very low MAP@10 values!
Compared with the dev set: opposite results
Dev set: semantic information seems to bring improvement?
Test set: overfitting?
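Since the discussion hinges on the gap between MAP and MAP@10, here is a small sketch of how mean average precision at a cutoff can be computed on each video's ranked images and then averaged over videos. It is a simplified illustration, not the official MediaEval evaluation script (which relies on trec_eval); the function names and the per-video grouping are assumptions made for clarity.

```python
# Hedged sketch of MAP@k: average precision is computed on the ranked list of
# each video's images, truncated at k, then averaged over videos.
# Simplified illustration only; the official scoring uses trec_eval.
def average_precision_at_k(ranked_labels, k=10):
    """ranked_labels: 0/1 ground truth sorted by decreasing predicted score."""
    ranked = ranked_labels[:k]
    hits, precisions = 0, []
    for i, label in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / i)    # precision at each relevant rank
    total_relevant = sum(ranked_labels)    # relevant items in the full list
    if total_relevant == 0:
        return 0.0
    return sum(precisions) / min(total_relevant, k)


def map_at_k(per_video_ranked_labels, k=10):
    """Mean over videos of the per-video average precision at k."""
    aps = [average_precision_at_k(labels, k) for labels in per_video_ranked_labels]
    return sum(aps) / len(aps)


# Example: two videos, labels already sorted by the model's predicted score.
print(map_at_k([[1, 0, 0, 1, 0], [0, 0, 1, 0, 0]], k=10))   # ~0.54
```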
8. Video subtask – different levels of embedding
Run#1: Baseline: 2016 run
Run#2: embedding after temporal average, no duplication
Run#3: embedding with combined (audio+video), duplication
Run#4: embedding in parallel to audio and video, duplication
Run#5: same as Run#4 but with the order of the softmax & temporal average steps inverted
9. Video subtask – different levels of embedding
Run#1: Baseline: 2016 run
Run#2: embedding after temporal average, no duplication
Run#3: embedding with combined (audio+video), duplication
Run#4: embedding in parallel to audio and video, duplication
Run#5: same as Run#4 but with the order of the softmax & temporal average steps inverted
Multimodal processing:
• either one LSTM-ResNet layer (if temporal processing)
• or one simple MLP layer
Run#4 & Run#5: influence of the location of the decision step (softmax)
10. Video subtask – results
Run  Embedding                      Dev MAP@10  Dev MAP  Test MAP@10  Test MAP
1    2016 system (A+V)              0.28        0.30     0.0589       0.1856
2    After temporal modeling        -           0.27     0.0465       0.1768
3    After (A+V) merging            0.29        0.31     0.0563       0.1825
4    In parallel to (A+V)           0.30        0.32     0.0641       0.1878
5    Run#4 – location of decision   -           -        0.0609       0.1918
Slightly improved MAP compared to 2016
This time, similar results on both the dev and test sets
Semantic information did bring improvement
Embedding at a lower level, even with duplication, seems to work better
Keeping the decision for the very last step seems better, at least in terms of MAP