Visual Summary of Egocentric Photostreams by Representative Keyframes
Marc Bolaños, Ricard Mestre, Estefanía Talavera, Xavier Giró-i-Nieto and Petia Radeva
Motivation
Lifelogging wearable cameras can produce 1,500 images/day, more than 500,000 images/year.
Automatic summarization methods could help in many applications. Specifically, we are working on:
● Memory aid for Mild Cognitive Impairment patients.
● Automatic nutrition diary.
Goal
Extract the visual summary of a whole day, capturing the most representative information for describing the day.
Storytelling
● Have breakfast with the family
● Go for a walk
● Go shopping
● Take the bus
● Have a coffee with a friend
State of the Art
Lu, Zheng, and Kristen Grauman. "Story-driven summarization for egocentric video." Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013.
High temporal resolution egocentric data.
1. Event segmentation.
2. Detection of salient objects and people.
3. Subset selection of video shots based on:
a. Story
b. Importance
c. Diversity
State of the Art
Doherty, Aiden R., et al. "Investigating keyframe selection methods in the novel domain of passively captured visual lifelogs." Proceedings of the 2008 international conference on Content-based image and video retrieval. ACM, 2008.
Low temporal resolution egocentric data.
1. Event segmentation.
2. Selection of the keyframes, comparing several methods:
a. Middle image of each segment.
b. Image close to the average value in the segment (centroid-like).
c. Image with highest “quality”.
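As a rough illustration of the first two selection methods listed above, here is a minimal Python sketch. The function names, the feature representation (one numeric tuple per frame), and the squared Euclidean distance to the mean are assumptions made for illustration, not details taken from Doherty et al.

```python
def middle_keyframe(segment):
    """Return the frame index in the middle of a segment (list of frame indices)."""
    return segment[len(segment) // 2]


def centroid_keyframe(segment, features):
    """Return the frame whose feature vector is closest to the segment mean.

    `features[i]` is assumed to be a tuple of numbers describing frame i.
    """
    dims = len(features[segment[0]])
    mean = [sum(features[i][d] for i in segment) / len(segment) for d in range(dims)]

    def squared_distance_to_mean(i):
        return sum((features[i][d] - mean[d]) ** 2 for d in range(dims))

    return min(segment, key=squared_distance_to_mean)


# Toy example: five frames with one-dimensional features.
toy_segment = [0, 1, 2, 3, 4]
toy_features = [(0.0,), (1.0,), (10.0,), (2.0,), (2.0,)]
middle = middle_keyframe(toy_segment)
centroid_like = centroid_keyframe(toy_segment, toy_features)
```

In the toy example the middle frame is an outlier, while the centroid-like choice lands on a frame near the segment average, which is the kind of difference such a comparison is meant to expose.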
Methodology ( I )
Methodology ( II )
Frames Characterization
Convolutional Neural Networks (CNN) trained on ImageNet.
Jia, Yangqing, et al. "Caffe: Convolutional architecture for fast feature embedding." Proceedings of the ACM International Conference on Multimedia. ACM, 2014.
Events Segmentation ( I )
Applying agglomerative clustering and adapting the cut-off parameter, we can obtain a good segmentation of all the events in our day.
Figure label: cut-off parameter
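A minimal sketch of how a cut-off parameter controls agglomerative clustering, written in plain Python. The average linkage, the Euclidean distance, and the function name are assumptions for illustration; the slides do not state which linkage or metric was used.

```python
import math


def agglomerative_clustering(features, cutoff):
    """Average-linkage agglomerative clustering (assumed linkage).

    Clusters are merged bottom-up until the closest pair of clusters is
    farther apart than `cutoff`. Returns one cluster label per frame,
    numbered in order of first appearance.
    """
    clusters = [[i] for i in range(len(features))]

    def linkage(c1, c2):
        total = sum(math.dist(features[i], features[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > cutoff:
            break  # remaining clusters are too dissimilar to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    clusters.sort(key=min)  # label clusters by their earliest frame
    labels = [0] * len(features)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy example: frames 0, 1 and 4 look alike, frames 2 and 3 look alike.
toy_features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
toy_labels = agglomerative_clustering(toy_features, cutoff=1.0)
```

A larger cut-off merges more aggressively (fewer, longer events), while a smaller one leaves more, shorter events. Note that frames 0, 1 and 4 share a label even though they are not contiguous in time.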
Events Segmentation ( II )
Division - Fusion post-processing to obtain a more robust segmentation.
a) After Agglomerative Clustering
b) After Division
c) After Fusion
Division: splits similar events that are separated in time and gives them different labels.
Fusion: merges very short sub-events that are not considered relevant enough.
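The two post-processing steps can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' implementation: the `min_length` threshold and the rule of folding a short sub-event into the preceding event (or the following one at the start of the day) are assumptions.

```python
def division(labels):
    """Give every temporally contiguous run of equal cluster labels its own event id."""
    events = []
    current = -1
    previous = object()  # sentinel that never equals a real label
    for label in labels:
        if label != previous:
            current += 1
            previous = label
        events.append(current)
    return events


def fusion(events, min_length):
    """Merge runs shorter than `min_length` into a neighbouring event.

    Assumed rule: a short run joins the preceding event; a short run at the
    very beginning joins the following one.
    """
    runs = []  # [event_id, length] for each contiguous run
    for e in events:
        if runs and runs[-1][0] == e:
            runs[-1][1] += 1
        else:
            runs.append([e, 1])

    merged = []
    for event_id, length in runs:
        if length < min_length and merged:
            merged[-1][1] += length
        else:
            merged.append([event_id, length])
    if len(merged) > 1 and merged[0][1] < min_length:
        merged[1][1] += merged[0][1]
        merged.pop(0)

    result = []
    for new_id, (_, length) in enumerate(merged):
        result.extend([new_id] * length)
    return result


# Cluster label 0 appears twice in the day, separated by a one-frame event.
cluster_labels = [0, 0, 0, 1, 0, 0, 2, 2, 2, 2]
divided = division(cluster_labels)
fused = fusion(divided, min_length=2)
```

After division, the two occurrences of cluster 0 become separate events; after fusion, the single-frame event is absorbed by its predecessor.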
Keyframe Selection
Visual similarity-based keyframe selection criteria.
Distances Matrix
Random Walk
Minimum Distance
Similarity-based probabilities
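One possible reading of the two criteria named above, sketched in Python. The slides only give the labels, so everything specific here is an assumption: Euclidean distances between frame features, "Minimum Distance" as the frame with the smallest total distance to the others, similarity defined as 1 / (1 + distance), and "Random Walk" as the frame with the highest probability after iterating the similarity-based transition probabilities.

```python
import math


def distance_matrix(features):
    """Pairwise Euclidean distances between the frames of one event."""
    return [[math.dist(a, b) for b in features] for a in features]


def minimum_distance_keyframe(dist):
    """Index of the frame with the smallest summed distance to all others."""
    return min(range(len(dist)), key=lambda i: sum(dist[i]))


def random_walk_keyframe(dist, steps=100):
    """Index of the frame a similarity-driven random walk visits most often."""
    n = len(dist)
    if n == 1:
        return 0
    # Assumed similarity: 1 / (1 + distance), with no self-transitions.
    sim = [[0.0 if i == j else 1.0 / (1.0 + dist[i][j]) for j in range(n)]
           for i in range(n)]
    # Row-normalise similarities into transition probabilities.
    trans = [[s / sum(row) for s in row] for row in sim]
    prob = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(steps):
        prob = [sum(prob[i] * trans[i][j] for i in range(n)) for j in range(n)]
    return max(range(n), key=prob.__getitem__)


# Toy event: frame 4 is an outlier, frames 1-3 are near-duplicates.
toy_features = [(0.0,), (1.0,), (1.2,), (0.9,), (5.0,)]
toy_dist = distance_matrix(toy_features)
by_min_distance = minimum_distance_keyframe(toy_dist)
by_random_walk = random_walk_keyframe(toy_dist)
```

Both criteria favour frames that resemble many others in the event, so on this toy data they agree; on real events they can differ because they weight near and far frames differently.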
Summary Results
Evaluation ( I )
Datasets
● 5 days
● 3 users
● 4005 images
● Segmentation ground truth
Talavera, E., Dimiccoli, M., Bolaños, M., Aghaei, M., & Radeva, P. R-clustering for egocentric video segmentation. IbPRIA 2015, Santiago de Compostela, Spain. Proceedings (Vol. 9117, p. 327). Springer.
Clustering
● Jaccard Index
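For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below shows one way it could be applied to a segmentation; the matching scheme (each ground-truth event is scored against its best-overlapping predicted event, then averaged) and the function names are assumptions, since the slides do not describe the exact protocol.

```python
def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B| for two collections of frame indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def segments(labels):
    """Group frame indices by their event label."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index)
    return list(groups.values())


def mean_best_jaccard(ground_truth, predicted):
    """Average, over ground-truth events, of the best Jaccard match.

    Assumes both label sequences are non-empty and have the same length.
    """
    pred_segments = segments(predicted)
    scores = [max(jaccard_index(g, p) for p in pred_segments)
              for g in segments(ground_truth)]
    return sum(scores) / len(scores)


# Toy example: the predicted boundary is one frame too early.
truth = [0, 0, 0, 1, 1, 1]
guess = [0, 0, 1, 1, 1, 1]
score = mean_best_jaccard(truth, guess)
```

A perfect segmentation scores 1.0 under this scheme, and boundary errors lower the score in proportion to how many frames end up in the wrong event.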
Evaluation ( II )
Keyframe Selection
● Blind taste test with 30 users for quality evaluation
Figure: brandchannel.com
Lu, Zheng, and Kristen Grauman. "Story-driven summarization for egocentric video." Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013.
Individual Keyframes Quality Evaluation
Representative images of the event #1
Do you think the image on the left can represent the event? Yes / No
Do you think the image in the center can represent the event? Yes / No
Do you think the image on the right can represent the event? Yes / No
What is the most representative image of the event? Left / Center / Right
Evaluation ( III )
Keyframe Selection
General Summary Quality Evaluation
Visual summaries of the day
Summary 1
Do you think that this set can summarize the whole day? Yes / No
Finally, which one do you think is the best visual summary of the day? Summary 1 / Summary 2 / Summary 3 / Summary 4
Some of the summaries you will see might be very similar (differing only in some images). In that case you can choose any of them.
Evaluation - Individual Keyframes
What is the most representative image of the event?
Do you think that the image on the left/center/right can represent the event?
Evaluation - General Summary
Can this set of images represent the complete day?
Which summary is the best, in your opinion?
Conclusions
● New keyframe selection methodology taking into account visual and temporal information.
● Keyframe selection using CNN-based global information and graph analysis.
● 88-86% user acceptance of our summaries.
● 58% of users chose our summaries as the best option.
Future Work
● Use semantic information (e.g. objects, people, actions).
● Clinical application to Mild Cognitive Impairment patients.
