Presenter: Claire-Hélène Demarty, Technicolor, France
Paper: http://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_19.pdf
Video: https://youtu.be/WDlCL1bFPEA
Authors: Eloise Berson, Claire-Hélène Demarty, Ngoc Q.K. Duong
Abstract: This paper summarizes the computational models that Technicolor proposes to predict the interestingness of images and videos within the MediaEval 2017 Predicting Media Interestingness Task. Our systems are based on deep learning architectures and exploit both semantic and multimodal features. Based on the obtained results, we discuss our findings and derive some scientific perspectives for the task.
MediaEval 2017 - Interestingness Task: Multimodality and Deep Learning when predicting Media Interestingness
1. Multimodality and Deep Learning when predicting Media Interestingness
Eloise Berson – Claire-Hélène Demarty – Ngoc Q.K. Duong
Technicolor, France
MediaEval 2017 Workshop, September 13-15, 2017
2. Motivation
Build incrementally from last year's systems
Re-use similar features and DNN architectures
➢ Make use of the multimodal nature of content
➢ Model its temporal evolution
3. Motivation
Build incrementally from last year's systems
Re-use similar features and DNN architectures
➢ Make use of the multimodal nature of content
➢ Model its temporal evolution
Investigate the benefit of adding semantic & contextual information to the content
➢ Add a textual modality from IMDb movie descriptions
➢ Use Image Captioning-based features
4. Features
For images and video frames:
CNN features from the fc7 layer of the CaffeNet model
Dimension: 4096
For audio:
60 MFCC features + first & second derivatives
Dimension: 180
For image only:
Image Captioning-Based (ICB) features [1]
Dimension: 1024
For text (used for both the image & video subtasks):
From IMDb description → keyword extraction → Word2Vec (W2V) embedding
Dimension: 300
[1] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
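As an illustration of the non-CNN features above, the sketch below shows one plausible way to compute the 180-dimensional audio descriptor (60 MFCCs plus first and second derivatives) and a 300-dimensional W2V text descriptor obtained by averaging word vectors over extracted keywords. It is only a sketch under assumptions: the slides do not specify the MFCC frame parameters, the keyword extractor, or how frame-level features are pooled, so the librosa/gensim calls, the mean pooling, and the stopword-based keyword filter used here are illustrative choices, not the authors' exact pipeline.

```python
# Hedged sketch: plausible extraction of the audio (MFCC) and text (W2V) features
# described on this slide. Frame parameters, pooling, and keyword extraction are
# assumptions, not the authors' exact pipeline.
import numpy as np
import librosa                          # audio feature extraction
from gensim.models import KeyedVectors  # pretrained Word2Vec vectors


def audio_feature(wav_path):
    """60 MFCCs + first & second derivatives -> 180-dim clip descriptor."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=60)   # (60, n_frames)
    d1 = librosa.feature.delta(mfcc)                      # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)             # second derivative
    feats = np.concatenate([mfcc, d1, d2], axis=0)        # (180, n_frames)
    return feats.mean(axis=1)                             # assumed mean pooling over frames


STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "his", "her"}

def text_feature(imdb_description, w2v):
    """IMDb description -> keywords -> averaged 300-dim Word2Vec embedding."""
    # Naive keyword extraction by stopword filtering (placeholder for the
    # unspecified extractor used by the authors).
    keywords = [w for w in imdb_description.lower().split()
                if w.isalpha() and w not in STOPWORDS]
    vecs = [w2v[w] for w in keywords if w in w2v]
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.mean(vecs, axis=0)                          # (300,)
```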
5. Image subtask – different feature concatenations
Run#1: Baseline: 2016 best run
Run#2: ICB features
Run#3: CNN+W2V features
Run#4: CNN+ICB features
Run#5: CNN+ICB+W2V features
6. Image subtask – different feature concatenations
Run#1: Baseline: 2016 best run
Run#2: ICB features
Run#3: CNN+W2V features
Run#4: CNN+ICB features
Run#5: CNN+ICB+W2V features
➢ Cross-validation (80%–20% split)
➢ Re-train on the complete devset
➢ Resampling of the data
➢ Classifier: a single MLP layer, ReLU activation, dropout = 0.5
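To make the run configurations concrete, here is a minimal sketch of the kind of classifier described above: a single MLP layer with ReLU activation and dropout 0.5 applied to the concatenated features (e.g. CNN 4096 + ICB 1024 + W2V 300 = 5420 dimensions for Run#5). The hidden size, loss, optimizer, and training schedule are not given on the slides, so the choices below (PyTorch, a 512-unit hidden layer, a sigmoid output) are assumptions, not the authors' exact setup.

```python
# Hedged sketch of the image-subtask classifier: one MLP layer with ReLU and
# dropout=0.5 on top of concatenated features. Hidden size, output layer, loss
# and optimizer are assumptions (not specified on the slides).
import torch
import torch.nn as nn


class InterestingnessMLP(nn.Module):
    def __init__(self, in_dim=4096 + 1024 + 300, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),   # single MLP layer (interpreted as one hidden layer)
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(hidden, 1),        # interestingness score
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x)).squeeze(-1)


# Usage on a batch of concatenated CNN+ICB+W2V features (Run#5 configuration):
model = InterestingnessMLP()
features = torch.randn(8, 4096 + 1024 + 300)   # dummy batch of 8 images
scores = model(features)                        # predicted interestingness in [0, 1]
```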
7. Image subtask – results
Run  Features     Dev MAP@10  Dev MAP  Test MAP@10  Test MAP
1    CNN          0.27        0.31     0.1028       0.2615
2    ICB          0.33        0.36     0.1054       0.2525
-    W2V          0.23        0.28     -            -
3    CNN+W2V      0.35        0.38     0.0693       0.2244
-    ICB+W2V      0.33        0.37     -            -
4    CNN+ICB      0.29        0.32     0.0875       0.2382
5    CNN+ICB+W2V  -           -        0.0861       0.2347
2016 MAP: 0.2336
Improved MAP over 2016: is the dataset/annotation quality better?
Very low MAP@10 values!
Compared with the dev set: opposite results
Dev set: semantic information seems to bring improvement?
Test set: overfitting?
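Since the discussion hinges on the gap between MAP and MAP@10, here is a small sketch of how mean average precision at a cutoff can be computed on each video's ranked images and then averaged over videos. It is a simplified illustration, not the official MediaEval evaluation script (which relies on trec_eval); the function names and the per-video grouping are assumptions made for clarity.

```python
# Hedged sketch of MAP@k: average precision is computed on the ranked list of
# each video's images, truncated at k, then averaged over videos.
# Simplified illustration only; the official scoring uses trec_eval.
def average_precision_at_k(ranked_labels, k=10):
    """ranked_labels: 0/1 ground truth sorted by decreasing predicted score."""
    ranked = ranked_labels[:k]
    hits, precisions = 0, []
    for i, label in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / i)    # precision at each relevant rank
    total_relevant = sum(ranked_labels)    # relevant items in the full list
    if total_relevant == 0:
        return 0.0
    return sum(precisions) / min(total_relevant, k)


def map_at_k(per_video_ranked_labels, k=10):
    """Mean over videos of the per-video average precision at k."""
    aps = [average_precision_at_k(labels, k) for labels in per_video_ranked_labels]
    return sum(aps) / len(aps)


# Example: two videos, labels already sorted by the model's predicted score.
print(map_at_k([[1, 0, 0, 1, 0], [0, 0, 1, 0, 0]], k=10))   # ~0.54
```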
8. Video subtask – different levels of embedding
Run#1: Baseline: 2016 run
Run#2: embedding after temporal average, no duplication
Run#3: embedding with combined (audio+video), duplication
Run#4: embedding in parallel to audio and video, duplication
Run#5: same as Run#4 but with the order of the softmax & temporal average steps inverted
9. Video subtask – different levels of embedding
Run#1: Baseline: 2016 run
Run#2: embedding after temporal average, no duplication
Run#3: embedding with combined (audio+video), duplication
Run#4: embedding in parallel to audio and video, duplication
Run#5: same as Run#4 but with the order of the softmax & temporal average steps inverted
Multimodal processing:
• either one LSTM-ResNet layer (if temporal processing)
• or one simple MLP layer
Run#4 & Run#5: influence of the location of the decision step (softmax)
10. Video subtask – results
Run  Embedding                      Dev MAP@10  Dev MAP  Test MAP@10  Test MAP
1    2016 system (A+V)              0.28        0.30     0.0589       0.1856
2    After temporal modeling        -           0.27     0.0465       0.1768
3    After (A+V) merging            0.29        0.31     0.0563       0.1825
4    In parallel to (A+V)           0.30        0.32     0.0641       0.1878
5    Run#4 – location of decision   -           -        0.0609       0.1918
Slightly improved MAP compared to 2016
This time, similar results on both the dev and test sets
Semantic information did bring improvement
Embedding at a lower level, even with duplication, seems to work better
Keeping the decision for the very last step seems better, at least in terms of MAP