Visual Summary of a Day in 40 Characters
1. Visual Summary of Egocentric Photostreams by Representative Keyframes
Marc Bolaños, Ricard Mestre, Estefanía Talavera, Xavier Giró-i-Nieto and Petia Radeva
2. Motivation
Lifelogging wearable cameras can produce 1,500 images/day, more than 500,000 images/year.
Automatic summarization methods could help in many applications. Specifically, we are working on:
● Memory aid for Mild Cognitive Impairment patients.
● Automatic nutrition diary.
3. Goal
Extract the visual summary of a whole day capturing the most representative information for describing the day.
Storytelling
● Have breakfast with the family
● Go for a walk
● Go shopping
● Take the bus
● Have a coffee with a friend
9. State of the Art
Lu, Zheng, and Kristen Grauman. "Story-driven summarization for egocentric video." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
High temporal resolution egocentric data.
1. Event segmentation.
2. Detection of salient objects and people.
3. Subset selection of video shots based on:
a. Story
b. Importance
c. Diversity
10. State of the Art
Doherty, Aiden R., et al. "Investigating keyframe selection methods in the novel domain of passively captured visual lifelogs." Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval. ACM, 2008.
Low temporal resolution egocentric data.
1. Event segmentation.
2. Selection of the keyframes, comparing several methods:
a. Middle image of each segment.
b. Image close to the average value in the segment (centroid-like).
c. Image with highest “quality”.
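The first two selection strategies listed above can be sketched in a few lines. This is only an illustration, assuming each image in a segment is described by a numeric feature vector; the function names and the toy 2-D features are hypothetical and do not come from the cited work. The "quality" criterion is left out because the slide does not define it.

```python
import math

def middle_keyframe(segment):
    """Return the index of the temporally central image of the segment."""
    return len(segment) // 2

def centroid_keyframe(segment):
    """Return the index of the image closest to the segment's mean feature vector."""
    dim = len(segment[0])
    mean = [sum(f[d] for f in segment) / len(segment) for d in range(dim)]
    return min(range(len(segment)), key=lambda i: math.dist(segment[i], mean))

# Toy segment: five images with hypothetical 2-D feature vectors.
segment = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [1.0, 1.0], [0.0, 1.0]]
print(middle_keyframe(segment))    # temporal middle of the segment
print(centroid_keyframe(segment))  # image nearest to the mean features
```

In this toy segment the two strategies pick different images, because the temporal middle happens to be an outlier in feature space.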
13. Frames Characterization
Convolutional Neural Networks (CNN) trained on ImageNet.
Jia, Yangqing, et al. "Caffe: Convolutional architecture for fast feature embedding." Proceedings of the ACM International Conference on Multimedia. ACM, 2014.
14. Events Segmentation ( I )
By applying agglomerative clustering and adapting the cut-off parameter, we can obtain a good segmentation of all the events in our day.
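To make the role of the cut-off parameter concrete, here is a minimal pure-Python sketch of agglomerative clustering that stops merging once the closest pair of clusters is farther apart than the cut-off. The average linkage, the Euclidean distance, and the toy 2-D features are assumptions of this sketch; the slides do not state which linkage or distance is used.

```python
import math

def agglomerative_clusters(features, cutoff):
    """Merge the two closest clusters until their distance exceeds `cutoff`.

    Returns one cluster label per input vector. Average linkage is an
    assumption of this sketch, not something the slides specify.
    """
    clusters = [[i] for i in range(len(features))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between members of a and b.
        total = sum(math.dist(features[i], features[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        dist, p, q = min(pairs)
        if dist > cutoff:
            break  # every remaining pair is farther apart than the cut-off
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(features)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy frame features (hypothetical): two visually distinct groups.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1]]
labels = agglomerative_clusters(features, cutoff=1.0)
print(labels)
```

A lower cut-off yields more, smaller clusters and a higher one yields fewer, larger clusters, which is why the parameter has to be adapted to the data.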
(Figure label: cut-off parameter)
15. Events Segmentation ( II )
Division - Fusion post-processing to obtain a more robust segmentation.
Figure panels: a) After Agglomerative Clustering, b) After Division, c) After Fusion.
Division: splits similar events that are separated in time and gives them different labels.
Fusion: merges very short sub-events that are not considered relevant enough.
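The two post-processing steps can be sketched on a per-frame sequence of cluster labels in temporal order. This is an illustrative reading of the slide rather than the authors' implementation: the `min_length` threshold and the choice to merge a short run into the preceding event (or the following one when it is the first run) are assumptions.

```python
from itertools import groupby

def division(labels):
    """Give every temporally contiguous run of equal labels its own event id."""
    events = []
    for event_id, (_, run) in enumerate(groupby(labels)):
        events.extend([event_id] * len(list(run)))
    return events

def fusion(events, min_length):
    """Merge runs shorter than `min_length` into a neighbouring event.

    Merging into the preceding event (or the following one for the first
    run) is an assumption of this sketch.
    """
    runs = [[event_id, len(list(run))] for event_id, run in groupby(events)]
    merged = []
    for event_id, length in runs:
        if length < min_length and merged:
            merged[-1][1] += length
        else:
            merged.append([event_id, length])
    # A short first run has no predecessor, so fold it into the next one.
    if len(merged) > 1 and merged[0][1] < min_length:
        merged[1][1] += merged[0][1]
        merged.pop(0)
    out = []
    for new_id, (_, length) in enumerate(merged):
        out.extend([new_id] * length)
    return out

# Cluster 0 appears twice, separated in time; cluster 2 lasts a single frame.
cluster_labels = [0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 1, 1, 1]
divided = division(cluster_labels)
fused = fusion(divided, min_length=2)
print(divided)
print(fused)
```

After division the two occurrences of cluster 0 become separate events, and after fusion the one-frame event is absorbed by the event before it.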
18. Evaluation ( I )
Datasets
● 5 days
● 3 users
● 4005 images
● Segmentation ground truth
Talavera, E., Dimiccoli, M., Bolaños, M., Aghaei, M., & Radeva, P. R-clustering for egocentric video segmentation. IbPRIA 2015, Santiago de Compostela, Spain. Proceedings (Vol. 9117, p. 327). Springer.
Clustering
● Jaccard Index
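The Jaccard Index of two sets is the size of their intersection divided by the size of their union. The sketch below applies it to the frame indices of one predicted event and one ground-truth event; how the scores are matched and aggregated over a whole day is not described in the slides, so that part is left out, and the example ranges are hypothetical.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of frame indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(union)

# Hypothetical events: ground truth covers frames 0-9, prediction covers 2-11.
ground_truth = range(0, 10)
predicted = range(2, 12)
score = jaccard_index(ground_truth, predicted)
print(round(score, 3))
```

A score of 1 means the predicted event covers exactly the same frames as the ground truth, and 0 means they share no frames.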
19. Evaluation ( II )
Keyframe Selection
● Blind taste test with 30 users for quality evaluation.
Lu, Zheng, and Kristen Grauman. "Story-driven summarization for egocentric video." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
Figure: brandchannel.com
Individual Keyframes Quality Evaluation
Representative images of the event #1
Do you think the image on the left can represent the event? Yes / No
Do you think the image in the center can represent the event? Yes / No
Do you think the image on the right can represent the event? Yes / No
What is the most representative image of the event? Left / Center / Right
20. Evaluation ( III )
Keyframe Selection
General Summary Quality Evaluation
Visual summaries of the day
Do you think that this set can summarize the whole day? Yes / No
Finally, which one do you think is the best visual summary of the day? Summary 1 / Summary 2 / Summary 3 / Summary 4
Some of the summaries you will see might be very similar (differing only in a few images). In that case you can choose any of them.
21. Evaluation - Individual Keyframes
What is the most representative image of the event?
Do you think that the image on the left/center/right can represent the event?
22. Evaluation - General Summary
Can this set of images represent the complete day?
Which summary is the best, in your opinion?
23. Conclusions
● New keyframe selection methodology taking into account visual and temporal information.
● Keyframe selection using CNN-based global information and graph analysis.
● 88-86% user acceptance of our summaries.
● 58% of users chose our summaries as the best option.
Future Work
● Use semantic information (e.g. objects, people, actions).
● Clinical application on Mild Cognitive Impairment patients.