Paper: http://ceur-ws.org/Vol-2283/MediaEval_18_paper_42.pdf
Youtube: https://youtu.be/Nfn6ekXZOw4
Andreas Lommatzsch, Benjamin Kille, Baseline Algorithms for Predicting the Interest in News based on Multimedia Data. Proc. of MediaEval 2018, 29-31 October 2018, Sophia Antipolis, France.
Abstract: The analysis of images in the context of recommender systems is a challenging research topic. NewsREEL Multimedia enables researchers to study new algorithms with a large dataset. The dataset comprises news items and the number of impressions as a proxy for interestingness. Each news article comes with textual and image features. This paper presents data characteristics and baseline prediction models. We discuss the performance of these predictors.
Presented by Andreas Lommatzsch
MediaEval 2018: Baseline Algorithms for Predicting the Interest in News
1. Sophia Antipolis, France, October 29-31, 2018
NewsREEL Multimedia at MediaEval 2018:
Dataset Analysis and Baselines
Andreas Lommatzsch, Benjamin Kille
2. NewsREEL Multimedia runs as a pilot task in 2018
Questions
• What are appropriate approaches?
• What level of precision can we expect?
• What are the specific challenges when
implementing a recommender?
• How to improve the Challenge?
Motivation - Objectives
Analyze Different Baseline Strategies – Report Initial Results
4. Baseline-Algorithms
Random Recommender
• Assign a random number of impressions
Use a k-Nearest Neighbor-based Approach
• based on Text Terms
• based on Image Labels
Compute Features indicating that the
item is relevant
• based on Text Terms
• based on Image Labels
Algorithms
3 Baseline Approaches
5. Approach
• Assign a random number of impressions for each item
Evaluation-Scores
• Precision@10: 0.0
• Precision@10%: 0.1
• AveragePrecision@10%: 0.1
Remarks
• Scores significantly lower than those of the
random recommender indicate that the
suggestions should be sorted
in reverse order
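As a rough illustration of the random baseline and of a top-10% precision score, the following Python sketch assigns random impression counts and measures how many of the predicted top 10% of items are also in the actual top 10%. The function names, the exact metric definition, the impression range, and the toy data are assumptions made for illustration; the official evaluation of the task may define the scores differently.

```python
import random


def random_recommender(item_ids, max_impressions=1000, seed=0):
    """Assign a random number of impressions to every item (assumed range)."""
    rng = random.Random(seed)
    return {item: rng.randint(0, max_impressions) for item in item_ids}


def top_fraction(scores, fraction):
    """Return the set of items in the top `fraction` of the ranking by score."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def precision_at_fraction(predicted, actual, fraction=0.1):
    """Share of the predicted top items that are also actual top items."""
    pred_top = top_fraction(predicted, fraction)
    true_top = top_fraction(actual, fraction)
    return len(pred_top & true_top) / len(pred_top)


# Toy ground truth: item i received i impressions (hypothetical data).
actual = {f"item{i}": i for i in range(1000)}
predicted = random_recommender(actual.keys())
print(precision_at_fraction(predicted, actual))  # close to 0.1 on average

# A predictor that is consistently wrong can be fixed by reversing its order.
inverted = {item: -score for item, score in actual.items()}
reversed_scores = {item: -score for item, score in inverted.items()}
print(precision_at_fraction(inverted, actual))         # 0.0
print(precision_at_fraction(reversed_scores, actual))  # 1.0
```

The last two lines mirror the remark above: a ranking that scores far below the random baseline carries information and becomes useful once its order is reversed.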
Baseline Approaches
Random Recommender
6. Approach
• Use a k-Nearest Neighbor approach
• Define an appropriate metric for
computing the similarity (neighbors)
• Similarity
o metrics: cosine similarity / token overlap
o Features: text tokens, image labels
• Predict the number of impressions based on the average of the 10 nearest
neighbors
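The approach above can be sketched in Python as follows. This is a minimal illustration, not the authors' Java implementation: the function names, the use of Jaccard overlap as the "token overlap" measure, and the toy training data are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens (text terms or image labels)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def token_overlap(tokens_a, tokens_b):
    """Jaccard overlap of the two token sets (one possible overlap measure)."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def predict_impressions(query_tokens, training_items, k=10, similarity=cosine_similarity):
    """Average the impressions of the k most similar training items."""
    ranked = sorted(
        training_items,
        key=lambda item: similarity(query_tokens, item[0]),
        reverse=True,
    )
    neighbors = ranked[:k]
    return sum(impressions for _, impressions in neighbors) / len(neighbors)


# Hypothetical training data: (tokens, impressions).
training = [
    (["car", "engine", "test"], 900),
    (["car", "tuning"], 700),
    (["football", "match"], 100),
    (["police", "report"], 50),
]
print(predict_impressions(["car", "engine"], training, k=2))  # 800.0
```

Passing `similarity=token_overlap` switches the metric without changing the prediction step, which makes it easy to compare both similarity definitions on the same features.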
Challenges
• Which terms should be considered
• Similarity defined based on
image labels (weights)
Baseline Approaches
K-Nearest Neighbor Recommender
7. Approach
• Compute the impact of relevant features
• Features:
o Text tokens
o Image labels
Challenges:
• Weighting model for the features
• Sparsity of the features
• Combination of the scores
computed for each feature
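One simple way to realize this idea is sketched below in Python. It assumes that the "impact" of a feature is the mean number of impressions of the training items containing it, that unseen features fall back to a global mean, and that text and image scores are combined linearly; none of these choices is stated in the slides, and all names and data are hypothetical.

```python
from collections import defaultdict


def feature_impact(items):
    """Map each feature to the mean impressions of the items that contain it."""
    totals, counts = defaultdict(float), defaultdict(int)
    for features, impressions in items:
        for feature in set(features):
            totals[feature] += impressions
            counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}


def score_item(features, impact, default):
    """Average the impact of the known features; use `default` for sparse items."""
    known = [impact[f] for f in set(features) if f in impact]
    return sum(known) / len(known) if known else default


def combined_score(text_tokens, image_labels, text_impact, image_impact,
                   default, text_weight=0.5):
    """Linear combination of text-based and image-based scores (assumed weighting)."""
    text_score = score_item(text_tokens, text_impact, default)
    image_score = score_item(image_labels, image_impact, default)
    return text_weight * text_score + (1 - text_weight) * image_score


# Hypothetical training data: (features, impressions).
text_training = [(["car", "test"], 800), (["car", "recall"], 400), (["weather"], 100)]
image_training = [(["sports car"], 800), (["grille"], 400), (["umbrella"], 100)]
text_impact = feature_impact(text_training)
image_impact = feature_impact(image_training)
global_mean = sum(i for _, i in text_training) / len(text_training)

print(text_impact["car"])  # 600.0
print(combined_score(["car"], ["umbrella"], text_impact, image_impact, global_mean))  # 350.0
```

The `default` argument addresses the sparsity challenge in the simplest possible way, and `text_weight` is the single knob for combining the per-feature scores.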
Baseline Approaches
Combine the impact of different features
8. Implementation
• Straightforward implementation of the prediction approaches
• Retraceable results
• Efficient to compute on a standard computer
Remarks
• Images without Labels (~1%)
• News portals with a very low number of impressions
Baseline Approaches
Discussion
11. • In general, the similarity-based approaches perform better than the random
baseline
• Big differences between the domains: best results for domain 13554
• Best results are reached using text-features
• The image labeling configuration has an influence on the results
Evaluation Results
Results
12. • Baseline-Algorithms
• Evaluation Results
• Discussions
• Conclusion and Outlook
Objective
Analyze Different Baseline Strategies – Report Observations
13. Text vs Image labels based features
• Text-based features seem to contain more information
• Common terms (stop words) must be excluded
• Observations
• Top-terms in domain 13555: middle-class, unique, bug
• Top image labels in domain 13554: snake, roof, folding chair
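Excluding common terms can be approximated without a fixed stop-word list by dropping terms that occur in a large share of all documents. The Python sketch below illustrates this under assumed names, an assumed threshold of 50%, and hypothetical tokenized news items; it is not the filtering used for the reported results.

```python
from collections import Counter


def frequent_terms(documents, max_doc_ratio=0.5):
    """Terms that occur in more than `max_doc_ratio` of all documents."""
    doc_freq = Counter(term for doc in documents for term in set(doc))
    limit = max_doc_ratio * len(documents)
    return {term for term, count in doc_freq.items() if count > limit}


def remove_stop_words(document, stop_words):
    """Drop stop words while keeping the original token order."""
    return [term for term in document if term not in stop_words]


# Hypothetical tokenized news items.
docs = [
    ["the", "new", "car", "is", "fast"],
    ["the", "police", "report", "is", "out"],
    ["the", "match", "ended"],
]
stop_words = frequent_terms(docs)
print(sorted(stop_words))                      # ['is', 'the']
print(remove_stop_words(docs[0], stop_words))  # ['new', 'car', 'fast']
```

Computing the frequent terms per domain would be one way to obtain the domain-specific weighting mentioned in the remarks.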
Remarks
• Domain specific weighting models should be applied
• The correlation between text and images
should be investigated in detail
(example illustration / representative photo)
Discussion
Analyze Different Baseline Strategies – Findings
14. • Frequently used images
• Big differences between the news portals (domains)
• Reasons
• Items have a longer lifecycle
• Items stay relevant for a longer period of time
• Conclusions
• A fuzzy fingerprinting method
should be added as a baseline
for detecting duplicates
• Exclude frequently used images?
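The slides do not say which fingerprinting method is meant. As one possible instance, the Python sketch below uses an average hash, a simple perceptual hashing technique: each pixel of a small grayscale thumbnail becomes one bit, and two images count as near-duplicates if their bit strings differ in only a few positions. The thumbnails, the threshold, and all names are assumptions; a real pipeline would first downscale and convert the images to grayscale.

```python
def average_hash(gray_pixels):
    """Fingerprint a small grayscale image: one bit per pixel, set if above the mean."""
    flat = [value for row in gray_pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Number of positions in which two fingerprints differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def is_near_duplicate(pixels_a, pixels_b, max_distance=1):
    """Treat two images as duplicates if their fingerprints differ in few bits."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# Hypothetical 3x3 thumbnails (0 = black, 255 = white).
original = [[10, 200, 10], [200, 200, 200], [10, 200, 10]]
recompressed = [[12, 198, 9], [205, 190, 201], [14, 203, 8]]  # slightly altered copy
different = [[200, 10, 200], [10, 10, 10], [200, 10, 200]]

print(is_near_duplicate(original, recompressed))  # True
print(is_near_duplicate(original, different))     # False
```

Grouping items by fingerprint would also make it possible to count how often an image is reused, which is the information needed to decide whether frequently used images should be excluded.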
Daily Police Report
Sport
15. The labeling configuration makes a difference
Image Label Categories do not match our setting
Popular labels:
• suit, shirt
• animals
Frequently occurring semantically incorrect labels
• Cables => classified as snakes
• Stage => blue flashing lights (emergencies)
• Barometer => clock, tachometer, loupe
Very long, specific labels:
• “dragonfly, darning needle, devil's darning needle,
sewing needle, snake feeder, snake doctor, mosquito”
• “German shepherd, German shepherd dog,
German police dog, alsatian”
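Long labels of this kind are comma-separated synonym lists, so a small normalization step can make them usable as features. The Python sketch below splits such a label into its names and picks the first one as a short canonical form; treating the first synonym as canonical is an assumed convention, and the function names are hypothetical.

```python
def split_label(raw_label):
    """Split a comma-separated synonym list into its individual names."""
    return [name.strip().lower() for name in raw_label.split(",") if name.strip()]


def canonical_label(raw_label):
    """Use the first synonym as a short, canonical label (an assumed convention)."""
    names = split_label(raw_label)
    return names[0] if names else ""


raw = "German shepherd, German shepherd dog, German police dog, alsatian"
print(split_label(raw))
# ['german shepherd', 'german shepherd dog', 'german police dog', 'alsatian']
print(canonical_label(raw))  # german shepherd
```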
16. • Temporal Changes in the Dataset
• Domain 13554 (motor-talk) is much easier than the other domains
• Observations
• Items have a longer lifecycle in domain 13554
• Items stay relevant for a longer period of time
• Items are part of both training and item set
• Conclusions
• A fuzzy fingerprinting method
should be added as a baseline
• Adapt the bin size?
18. The NewsREEL Multimedia challenge is well-defined
Potential for optimization
• Larger dataset
• Improve the provided labels (labeling precision)
• Additional features (e.g., low-level visual features)
• Consider temporal aspects
• Special handling for images illustrating frequent categories / image fingerprinting
Algorithms
• Improved weighting models for the features,
machine learning and data mining
• More sophisticated models
(SVM, low-rank approximation,
neural networks, random forest)
• Combining different features
• Use of low-level features
Baseline Approaches
Evaluation Results
19. • Andreas Lommatzsch
DAI-Lab, TU Berlin
andreas@dai-lab.de
• Benjamin Kille
DAI-Lab, TU Berlin
benjamin.kille@dai-labor.de
• Additional Information
• http://www.newsreelchallenge.org/
• The code is available on request
(Java/Maven)
Contact
Further Information