This paper provides an overview of the Social Event Detection (SED) system developed at LIMSI for the 2014 campaign. Our approach is based on a hierarchical agglomerative clustering that uses textual metadata, user-based knowledge and geographical information. These different sources of knowledge, either used separately or in cascade, reach good results for the full clustering subtask, with a normalized mutual information equal to 0.95 and F1 scores greater than 0.82 for our best run.
1. LIMSI @ MediaEval SED 2014
Camille Guinaudeau, Antoine Laurent, Hervé Bredin
2. Introduction
Social Event Detection task
— Mining social events in large collections of online multimedia
The LIMSI team only participates in the "full clustering" task.
Our system
— relies on a hierarchical clustering approach,
— is based only on the metadata associated with the images.
3. Development and test sets
Development dataset divided into 3 smaller datasets:
Dev A, Dev B and Dev C
→ lower computation time
Each subset has the same number of clusters and the same distribution of images per cluster
The number of images in each subset is similar to the number of images contained in the test set (110,541)
5. User-based clustering
The time reference of a randomly chosen picture is compared with the dates of all the other pictures in the cluster
→ pick the closest one
6. User-based clustering
If the time distance is less than α hours before or after the time reference,
then the two pictures belong to the same cluster
Time reference = the mean of the two time references
7. User-based clustering
If the time distance is greater than α hours before or after the time reference,
then the two pictures define two separate clusters
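The per-user procedure described on slides 5–7 can be sketched as follows. This is one possible reading, assuming pictures are processed in time order with timestamps in hours; the slides instead start from a randomly chosen picture and pick its closest neighbour, and the function name is hypothetical:

```python
from statistics import mean

def user_time_clustering(timestamps, alpha=20.0):
    """Greedy temporal clustering of one user's pictures (timestamps in hours).

    A picture joins the current cluster when its time distance to the
    cluster's time reference is at most `alpha` hours; the reference is
    then updated to the mean of the two references, as on the slides.
    Otherwise the picture starts a new cluster.
    """
    clusters = []    # list of lists of timestamps
    references = []  # one time reference per cluster
    for t in sorted(timestamps):
        if references and abs(t - references[-1]) <= alpha:
            clusters[-1].append(t)
            references[-1] = mean([references[-1], t])
        else:
            clusters.append([t])
            references.append(t)
    return clusters
```

With `alpha = 20`, pictures taken a few hours apart end up together while pictures taken days apart are split, which matches the homogeneity scores on the next slide being highest for small α.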
8. User-based clustering
Homogeneity scores for different values of α:

α      Dev A    Dev B    Dev C
1h     0.9874   0.9872   0.9874
10h    0.9813   0.9796   0.9798
20h    0.9785   0.9766   0.9770
24h    0.9777   0.9755   0.9757
30h    0.9763   0.9743   0.9749
100h   0.9678   0.9673   0.9665

Homogeneity equals one when each cluster contains only members of a single class
9. Hierarchical clustering approach
Starts with the set of clusters defined by the user-based clustering
Based on a single-linkage clustering method
Distance matrix maintained at each iteration
d[u, v] = distance between clusters u and v
Final clustering is obtained by forming flat clusters from the hierarchical clustering
A threshold θ is used so that observations in each cluster have no intergroup distance greater than θ
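As a minimal sketch, this step corresponds to SciPy's single-linkage `linkage` followed by `fcluster` with the `distance` criterion; the distance matrix below is illustrative, not taken from the task data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Symmetric matrix of distances d[u, v] between the user-based clusters
# (toy values; the real matrices come from textual or GPS metadata).
D = np.array([[0.0, 0.1, 0.9, 0.8],
              [0.1, 0.0, 0.85, 0.95],
              [0.9, 0.85, 0.0, 0.2],
              [0.8, 0.95, 0.2, 0.0]])

# Single-linkage dendrogram over the condensed distance matrix.
Z = linkage(squareform(D), method="single")

# Flat clusters: no intergroup cophenetic distance greater than θ = 0.5.
labels = fcluster(Z, t=0.5, criterion="distance")
```

Here clusters 0 and 1 merge at distance 0.1 and clusters 2 and 3 at 0.2, but the two groups only meet at 0.8 > θ, so two flat clusters remain.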
10. Distance matrices
Textual metadata distance matrix
Each cluster is represented by a vector composed of lemmas
weighted with a BM25 score
A cosine distance is computed between two vectors
→ distance between the two corresponding clusters
Vector creation
• Words are extracted from the textual metadata (title,
description and tags)
• Words are lemmatized and only nouns, adjectives and non-modal
verbs are kept
• Each lemma is associated with a score computed using the
BM25 weighting function
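The vector creation and cosine distance can be sketched as below. The `k1` and `b` values and the exact IDF variant are assumptions (the slides only name "BM25"); documents are assumed to be already lemmatized and filtered:

```python
import math
from collections import Counter

def bm25_vectors(docs, k1=1.2, b=0.75):
    """BM25-weighted term vectors for a list of lemmatized documents.

    k1 and b are common BM25 defaults (assumed, not given on the slides).
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = {}
        for t, f in tf.items():
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            vec[t] = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        vectors.append(vec)
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity between two sparse term-weight vectors."""
    num = sum(u[t] * v[t] for t in set(u) & set(v))
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - (num / den if den else 0.0)

# Toy lemmatized metadata for three clusters.
docs = [["concert", "paris", "music"],
        ["concert", "music", "live"],
        ["football", "match"]]
vecs = bm25_vectors(docs)
```

Clusters sharing vocabulary ("concert", "music") end up closer than clusters with disjoint metadata.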
11. Distance matrices
Geographic distance matrix
For each pair of user-based clusters u and v that contain at
least one picture with GPS information:
A geographic distance is computed between u and v
→ the minimum distance between any picture from
cluster u and any picture from cluster v
→ if the associated dates of the two clusters are more
than 48h apart, the geographic distance is artificially
increased
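A sketch of this cluster-to-cluster geographic distance, assuming great-circle (haversine) distance between GPS points and an arbitrary large penalty value (the slides specify neither the distance formula nor the amount of the increase):

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # Earth radius ~6371 km

def cluster_geo_distance(coords_u, coords_v, hours_apart, penalty=1e6):
    """Minimum pairwise distance between two clusters' geotagged pictures.

    If the clusters' dates are more than 48h apart, the distance is
    artificially increased (the penalty value is an assumption).
    """
    d = min(haversine_km(p, q) for p in coords_u for q in coords_v)
    if hours_apart > 48:
        d += penalty
    return d
```

Two clusters shot in the same city days apart thus stay far apart in the geographic matrix, which keeps temporally distinct events from merging.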
12. Submitted runs
All runs are based on the preliminary user-based clustering
→ α = 20, 24 or 30 hours
Hierarchical clustering obtained using:
— textual metadata only
— geographical information only
— both sources of knowledge
→ the combination is done in cascade (hierarchical clustering
based on text is applied to the result of the geographical
clustering)
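The cascade can be sketched with two successive flat clusterings, assuming SciPy's single-linkage machinery and toy distance matrices (the real ones come from the geographic and textual metadata above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def flat_clusters(D, theta):
    """Single-linkage flat clustering of a square distance matrix."""
    Z = linkage(squareform(D), method="single")
    return fcluster(Z, t=theta, criterion="distance")

# Stage 1: geographic clustering of four user-based clusters (toy distances).
D_geo = np.array([[0.0, 0.1, 0.9, 0.9],
                  [0.1, 0.0, 0.9, 0.9],
                  [0.9, 0.9, 0.0, 0.9],
                  [0.9, 0.9, 0.9, 0.0]])
geo_labels = flat_clusters(D_geo, theta=0.5)  # merges the first two clusters

# Stage 2: text clustering applied to the stage-1 geo clusters.
n = geo_labels.max()
D_text = np.full((n, n), 0.9)
np.fill_diagonal(D_text, 0.0)
# Suppose the geo clusters holding pictures 2 and 3 share vocabulary:
a, b = geo_labels[2] - 1, geo_labels[3] - 1
D_text[a, b] = D_text[b, a] = 0.1
text_labels = flat_clusters(D_text, theta=0.5)

# Compose: each original cluster inherits the final label of its geo cluster.
final = text_labels[geo_labels - 1]
```

The text stage can thus merge events that geography kept apart (e.g. no GPS overlap) while never splitting a geographic merge, which is the point of cascading rather than averaging the two matrices.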
13. Results

         Dev A   Dev B   Dev C   Test
α        20h     20h     20h     20h     24h     30h     24h     24h
Text     ✔       ✔       ✔       ✔       ✔       ✔       ✔
Geo      ✔       ✔       ✔       ✔       ✔       ✔               ✔
F1       0.7895  0.7869  0.7912  0.8214  0.8140  0.8115  0.7563  0.7387
NMI      0.9479  0.9472  0.9483  0.9554  0.9532  0.9526  0.9423  0.9359
Div F1   0.6880  0.7258  0.7224  0.8207  0.8132  0.8107  0.7557  0.7380
17. Conclusions and future work
Our system, based only on metadata, works well,
with an F1 score of 0.82
Results obtained on every dataset are homogeneous
We could improve the method by:
— using the pictures themselves in addition to the associated metadata
— using web queries (searching pictures on Google
Images…)