Multimedia Information Retrieval and User Behavior


  1. Multimedia Information Retrieval and User Behavior in Social Media
     Eleonora Ciceri, ciceri@elet.polimi.it
     Date: 22/10/2012
  2. Outline
     ✤ Multimedia Information Retrieval on large data sets
       ✤ The “giants” of photo uploads
       ✤ Image search
       ✤ Descriptors
       ✤ Bag of Visual Words
     ✤ Analyzing User Motivations in Video Blogging
       ✤ What is a video blog?
       ✤ Non-verbal communication
       ✤ Automatic processing pipeline
       ✤ Cue extraction & results
       ✤ Cues vs. social attention
  3. Multimedia Information Retrieval on large data sets
  4. The “giants” of photo uploads
     ✤ Flickr uploads (source: http://www.flickr.com/):
       ✤ 1.54 million photos per day on average
       ✤ 51 million users
       ✤ 6 billion images
     ✤ Facebook uploads (source: http://thenextweb.com/):
       ✤ 250 million photos per day on average
       ✤ 845 million users in February 2012
       ✤ 90+ billion photos in August 2011
     ✤ “Flickr hits 6 billion total photos, Facebook does that every two months”
  5. Image search
     ✤ Query by example: look for a particular object / scene / location in a collection of images
  6. Image search
     ✤ Copy detection
     ✤ Annotation / classification / detection (e.g., labeling example images as “dog” or “child”)
  7. Descriptors
     ✤ How can we look for similar images?
       ✤ Compute a descriptor: a mathematical representation of the image
       ✤ Find similar descriptors
     ✤ Problem: occlusions, changes in rotation, scale and lighting
  8. Descriptors
     ✤ How can we look for similar images?
       ✤ Compute a descriptor: a mathematical representation of the image
       ✤ Find similar descriptors
     ✤ Solution: descriptors that are invariant to scale, rotation, etc.
  9. Global descriptors
     ✤ Global descriptors: one descriptor per image (highly scalable)
     ✤ Color histogram: a representation of the distribution of colors
       ✤ Pros: high invariance to many transformations
       ✤ Cons: high invariance to TOO many transformations (limited discriminative power)
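     A minimal sketch (assuming OpenCV is available; the helper name `color_histogram` is made up for this example) of how such a global color-histogram descriptor could be computed:

```python
import cv2

def color_histogram(image_path, bins=(8, 8, 8)):
    """Global descriptor: one normalized 3D color histogram per image."""
    img = cv2.imread(image_path)                       # BGR image
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)         # HSV is less sensitive to lighting
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])      # hue in OpenCV spans 0..179
    cv2.normalize(hist, hist)                          # make histograms comparable
    return hist.flatten()                              # one compact vector per image

# Two images can then be compared with a histogram distance, e.g.
# cv2.compareHist(h1, h2, cv2.HISTCMP_BHATTACHARYYA)
```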
  10. Local descriptors
     ✤ Local descriptors: find regions of interest that will be exploited for image comparison
     ✤ SIFT: Scale Invariant Feature Transform
       ✤ Extract key-points (maxima and minima in the Difference of Gaussians image)
       ✤ Assign an orientation to each key-point (result: rotation invariance)
       ✤ Generate the feature vector for each key-point
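     A minimal sketch of extracting SIFT key-points and descriptors, assuming an OpenCV build (4.4+) where SIFT is exposed as `cv2.SIFT_create`:

```python
import cv2

def sift_descriptors(image_path):
    """Local descriptors: SIFT key-points and their 128-dimensional feature vectors."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    # Key-points are extrema of the Difference-of-Gaussians pyramid; each is
    # assigned an orientation (rotation invariance) and a 128-D descriptor.
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors                      # descriptors: (m, 128) array
```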
  11. Direct matching query
     ✤ Assumptions:
       ✤ m = 1000 descriptors per image
       ✤ Each descriptor has d = 128 dimensions
       ✤ N > 1,000,000 images in the data set
     ✤ Search: a query is submitted; results are retrieved
       ✤ Each descriptor of the query image is tested against each descriptor of every image in the data set
       ✤ Complexity: m²Nd elementary operations; required space: ??? (see the estimate below)
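     A back-of-the-envelope estimate with the numbers above (assuming raw 4-byte floats for descriptor storage) illustrates why exhaustive matching does not scale:

```latex
\underbrace{m^2 N d}_{\text{operations per query}} = 1000^2 \cdot 10^6 \cdot 128 \approx 1.3 \times 10^{14},
\qquad
\underbrace{N m d \cdot 4\,\text{bytes}}_{\text{descriptor storage}} = 10^6 \cdot 1000 \cdot 128 \cdot 4\,\text{bytes} \approx 512\,\text{GB}.
```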
  12. Bag of Visual Words
     ✤ Objective: “put the images into words” (visual words)
     ✤ What is a visual word? “A small part of the image that carries some kind of information related to the features” [Wikipedia]
     ✤ Text/image analogy:
       ✤ Visual word: a small patch of the image
       ✤ Visual term: a cluster of patches that carry the same information
       ✤ Bag of visual words: the collection of words that conveys the meaning of the image as a whole
  13. Bag of Visual Words
     ✤ How to build a visual dictionary?
       ✤ Local descriptors are clustered; the cluster means µ_w form the visual dictionary Ω
       ✤ A local descriptor x is assigned to its nearest neighbor: q(x) = argmin_{w ∈ Ω} ‖x − µ_w‖², where µ_w is the mean of cluster w
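     A minimal sketch of dictionary construction and descriptor quantization, assuming scikit-learn for k-means (function names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(descriptors, k=1000):
    """Cluster local descriptors; the k cluster means mu_w are the visual words."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)
    return kmeans.cluster_centers_                     # (k, d) visual dictionary

def quantize(descriptors, dictionary):
    """q(x) = argmin_w ||x - mu_w||^2: map each descriptor to its nearest visual word."""
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)                           # one visual-word index per descriptor
```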
  14. Why visual words?
     ✤ Pros:
       ✤ Much more compact representation
       ✤ We can take advantage of text-retrieval techniques and apply them to image retrieval: represent each image as a tf-idf-weighted vector in R^d, then find similar vectors to obtain the results
     ✤ tf-idf weighting: tfidf(t, d, D) = [f(t, d) / max{f(w, d) : w ∈ d}] · log(|D| / |{d ∈ D : t ∈ d}|)
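     A minimal sketch of turning quantized descriptors into tf-idf-weighted bag-of-visual-words vectors (a direct transcription of the formula above; names are illustrative):

```python
import numpy as np

def bow_counts(word_indices_per_image, k):
    """f(t, d): visual-word occurrence counts, one row per image."""
    return np.stack([np.bincount(w, minlength=k) for w in word_indices_per_image])

def tf_idf(counts):
    """tfidf(t, d, D) = [f(t,d) / max_w f(w,d)] * log(|D| / |{d : t in d}|)."""
    tf = counts / np.maximum(counts.max(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)                      # in how many images each word appears
    idf = np.log(counts.shape[0] / np.maximum(df, 1))  # rare visual words weigh more
    return tf * idf                                    # compare rows with cosine similarity
```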
  15. Analyzing User Motivations in Video Blogging
  16. What is a video blog?
     ✤ Video blog (vlog): conversational videos in which people (usually a single person) talk facing the camera and addressing the audience in a Skype-style fashion
     ✤ Examples: video testimonials (companies pay for testing products), video advice (e.g., how to get dressed for a party), discussions
  17. Why vlogs are used
     ✤ Corporate communication, marketing, e-learning, life documentary, entertainment
     ✤ Community: high participation, daily interaction, discussion, critique, comments, ratings
  18. Why vlogs are studied
     ✤ Why are vlogs relevant?
       ✤ Automatic analysis of personal websites, blogs and social networks (to understand users’ motivations) is limited to text
       ✤ Vlogs are a new type of social media (40% of the most viewed videos on YouTube): how can they be analyzed automatically?
       ✤ They let us study a real-life communication scenario
       ✤ Human judgements are based on first impressions: can we predict them?
  19. Real communication vs. vlogs
     ✤ Real communication: synchronous; two (or more) people interact
     ✤ Vlog: asynchronous; a monologue, plus metadata
  20. Non-verbal communication
     ✤ The process of communicating through sending and receiving wordless/visual cues between people
       ✤ Body: gestures, touch, body language, posture, facial expression, eye contact
       ✤ Speech: voice quality, rate, pitch, volume, rhythm, intonation
     ✤ Why? To express aspects of identity (age, occupation, culture, personality)
  21. An example: dominance
     ✤ Power: the capacity or right to control others
     ✤ Dominance: a way of exerting power that involves the motive to control others
       ✤ Behavior: talk louder, talk longer, speak first, interrupt more, add gestures, receive more visual attention
  22. Automatic processing pipeline
     ✤ Shot boundary detection: based on color-histogram differences between consecutive frames
     ✤ Face detection: Viola-Jones algorithm, applied to each shot
     ✤ Conversational shot selection: discard shots without faces and short shots (not talking)
     ✤ Audio and visual cue extraction: cues computed for each shot
     ✤ Aggregation: shot-level cues are combined into aggregated cues at the video level
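     A minimal sketch of the first step, shot boundary detection by thresholding the distance between color histograms of consecutive frames (assuming OpenCV; the threshold value here is an assumption, not the tuned one from the paper):

```python
import cv2

def shot_boundaries(video_path, threshold=0.5):
    """Flag a shot cut wherever consecutive frames' color histograms differ too much."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if dist > threshold:
                cuts.append(idx)                        # frame index where a new shot starts
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts
```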
  23. Visual cue extraction
     ✤ Weighted Motion Energy Images (wMEI): wMEI = Σ_{f ∈ V} D_f, where D_f is the binary image containing the moving pixels in frame f
       ✤ It indicates the visual activity of each pixel (motion accumulated over the whole video)
       ✤ Brighter pixels: regions with higher motion
     (Figure in the paper: wMEI images for two vlogs)
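     A minimal sketch of accumulating a wMEI, where the binary motion image D_f is approximated by a thresholded difference between consecutive gray frames (this approximation is an assumption of the sketch, not necessarily the paper's exact procedure):

```python
import cv2
import numpy as np

def weighted_mei(video_path, diff_threshold=25):
    """wMEI = sum over frames f of D_f, the binary image of moving pixels in frame f."""
    cap = cv2.VideoCapture(video_path)
    prev, wmei = None, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            moving = (cv2.absdiff(gray, prev) > diff_threshold).astype(np.float32)
            wmei = moving if wmei is None else wmei + moving
        prev = gray
    cap.release()
    return wmei / max(float(wmei.max()), 1.0)           # brighter pixels = more accumulated motion
```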
  24. Visual cue extraction
     ✤ It is difficult to estimate the actual direction of the eyes
     ✤ Simplification: we are interested in frontal face detection. If the face is in a frontal position, the vlogger is most likely looking at the camera
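     A minimal sketch of frontal face detection with the Viola-Jones cascade bundled with OpenCV (a detected frontal face is treated here as “looking at the camera”):

```python
import cv2

# Haar cascade for frontal faces shipped with OpenCV (Viola-Jones detector).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def frontal_faces(frame):
    """Return frontal-face bounding boxes (x, y, w, h) for one video frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                         minNeighbors=5, minSize=(20, 20))
```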
  25. Visual cue extraction
     ✤ Looking time: looking activity (how much the vlogger looks at the camera)
     ✤ Looking segment length: persistence of the vlogger’s gaze
     ✤ Looking turns: looking activity
     ✤ Proximity to camera: the choice of addressing the camera from close-ups
     ✤ Vertical framing: how much the vlogger shows the upper body
     ✤ Vertical head motion
  26. Visual cue extraction
     ✤ Looking time = Σ_{L ∈ V} t_L / t_V (t_L: looking segment duration; t_V: video duration)
     ✤ Looking segment length = Σ_{L ∈ V} t_L / N_L (N_L: number of looking segments)
     ✤ Looking turns = N_L / t_V
     ✤ Proximity to camera = Σ_{f ∈ V} A_face(f) / (N_f · A(f)) (A_face(f): face area in frame f; A(f): frame area; N_f: number of frames containing a face)
     ✤ Vertical framing = Σ_{f ∈ V} (c_face(f) − c(f)) / (N_f · f_h) (c_face(f): face center; c(f): frame center; f_h: frame height)
     ✤ Vertical head motion = µ(c_face(f) − c(f)) and σ(c_face(f) − c(f))
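     A minimal sketch mapping the formulas above onto per-frame face detections (the `face_boxes` representation, one optional (x, y, w, h) box per frame, is an assumption of this sketch):

```python
import numpy as np

def visual_cues(face_boxes, frame_size, fps):
    """face_boxes[i] is an (x, y, w, h) box if a frontal face was found in frame i, else None."""
    W, H = frame_size
    looking = np.array([b is not None for b in face_boxes])
    t_video = len(looking) / fps
    # A looking segment is a maximal run of consecutive frames with a detected face.
    n_segments = int(looking[0]) + int(np.sum(looking[1:] & ~looking[:-1]))
    faces = [b for b in face_boxes if b is not None]
    return {
        "looking_time": looking.mean(),                               # sum t_L / t_V
        "looking_segment_length": (looking.sum() / fps) / max(n_segments, 1),
        "looking_turns": n_segments / t_video,                        # N_L / t_V
        "proximity_to_camera": np.mean([w * h / (W * H) for _, _, w, h in faces]),
        "vertical_framing": np.mean([((y + h / 2) - H / 2) / H for _, y, _, h in faces]),
    }
```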
  27. Audio cue extraction
     ✤ Speaking time: speaking activity (how much the vlogger talks)
     ✤ Speech segment average length: fluency
     ✤ Speaking turns: fluency (duration and number of silent pauses)
     ✤ Voicing rate: fluency (number of phonemes per unit of time, i.e., how fast the vlogger speaks)
     ✤ Speaking energy: emotional stability (how well the vlogger controls loudness)
     ✤ Pitch variation: emotional state (how well the vlogger controls tone)
  28. Audio cue extraction
     ✤ Speaking time = Σ_{S ∈ V} t_S / t_V (t_S: speech segment duration; t_V: video duration)
     ✤ Speech segment average length = Σ_{S ∈ V} t_S / N_S (N_S: number of speech segments)
     ✤ Speaking turns = N_S / t_V
     ✤ Voicing rate: number of phonemes per unit of time
     ✤ Speaking energy = σ(S_energy) / µ(S_energy)
     ✤ Pitch variation = σ(pitch) / µ(pitch)
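     A minimal sketch mapping the audio formulas onto a precomputed speech/non-speech segmentation and per-frame pitch/energy tracks (both are assumed to come from external tools, e.g. a voice activity detector and a pitch tracker):

```python
import numpy as np

def audio_cues(speech_segments, video_duration, pitch, energy):
    """speech_segments: list of (start, end) times in seconds for speech segments.
    pitch, energy: arrays of per-frame pitch and speaking-energy values over speech frames."""
    durations = np.array([end - start for start, end in speech_segments])
    n_segments = max(len(durations), 1)
    return {
        "speaking_time": durations.sum() / video_duration,       # sum t_S / t_V
        "speech_segment_avg_length": durations.sum() / n_segments,
        "speaking_turns": len(durations) / video_duration,       # N_S / t_V
        # Voicing rate (phonemes per unit of time) would need a phoneme-level
        # segmentation and is omitted from this sketch.
        "speaking_energy": np.std(energy) / np.mean(energy),     # sigma / mu of energy
        "pitch_variation": np.std(pitch) / np.mean(pitch),       # sigma / mu of pitch
    }
```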
  29. Combining audio and visual cues
     ✤ Combining “looking at the camera” with “speaking” yields four modalities
     ✤ These measures are used to determine dominance in dyadic conversations
       ✤ Looking-while-speaking is characteristic of dominant people
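     A minimal sketch of combining the two per-frame binary streams into the four joint modalities (the frame-level “looking” and “speaking” flags are assumed to come from the previous steps):

```python
import numpy as np

def multimodal_cues(looking, speaking):
    """Fraction of the video spent in each of the four looking/speaking combinations."""
    looking = np.asarray(looking, dtype=bool)
    speaking = np.asarray(speaking, dtype=bool)
    return {
        "looking_while_speaking": float(np.mean(looking & speaking)),
        "looking_while_silent": float(np.mean(looking & ~speaking)),
        "speaking_without_looking": float(np.mean(~looking & speaking)),
        "neither": float(np.mean(~looking & ~speaking)),
    }
```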
  30. Analysis: video editing elements
     ✤ Elements manually coded as a support to the core conversational part: snippets (opening, ending, intermediate), background music, objects brought toward the camera
     ✤ Results:
       ✤ Snippets: 45% of vlogs (16%-20% with opening/ending snippets, 32% with intermediate snippets)
       ✤ Videos without snippets are monologues
       ✤ Snippets tend to be a small fraction of the video content (~10%)
       ✤ Audio: 25% use a soundtrack on snippets, 12% use music over the entire video
       ✤ Objects: 26% of vloggers bring an object toward the camera
  31. Analysis: non-verbal behavior
     ✤ Vloggers are mainly talking: 85% of people talk for more than half of the time
     ✤ Speaking segments tend to be short (hesitations and low fluency)
     (Histograms of time speaking, average speech segment length, and number of turns omitted)
  32. Analysis: non-verbal behavior
     ✤ 50% of vloggers look at the camera over 90% of the time, at a “standard” distance from the camera (not too close, not too far), showing the upper body
     ✤ Vloggers look at the camera more frequently when they speak than when they are silent
       ✤ This is a behavior of dominant people
     (Distributions of selected nonverbal cues for conversational shots in YouTube vlogs: four audio cues, three visual cues, one multimodal; figure omitted)
  33. Social attention
     ✤ Social attention on YouTube is measured by the number of views received by a video
     ✤ Popularity: borrowed from the Latin popularis in 1490, originally meaning “common”
       ✤ The view count reflects the number of times the item has been accessed (resembling other measures of popularity used in traditional mainstream media), BUT not all the people who access the video like it!
  34. Social attention
     ✤ Audio cues: vloggers talking longer, faster and using fewer pauses receive more views from the audience
  35. Social attention
     ✤ Visual cues:
       ✤ The time spent looking at the camera and the average duration of looking turns are positively correlated with attention
       ✤ Vloggers who are too close to the camera are penalized: the audience cannot perceive body-language cues
  36. Future work (...or not?)
     ✤ Background analysis: does the background tell us something about the speaker?
  37. Bibliography
  38. Bibliography
     ✤ Joan-Isaac Biel, Daniel Gatica-Perez, “VlogSense: Conversational Behavior and Social Attention in YouTube”, ACM Transactions on Multimedia Computing, Communications and Applications, 2010
     ✤ Joan-Isaac Biel, Oya Aran, Daniel Gatica-Perez, “You Are Known by How You Vlog: Personality Impressions and Nonverbal Behavior in YouTube”, AAAI, 2011
     ✤ Joan-Isaac Biel, Daniel Gatica-Perez, “Voices of Vlogging”, AAAI, 2010
     ✤ Joan-Isaac Biel, Daniel Gatica-Perez, “Vlogcast Yourself: Nonverbal Behavior and Attention in Social Media”, ICMI-MLMI, 2010
  39. Bibliography
     ✤ Joan-Isaac Biel, Daniel Gatica-Perez, “The Good, the Bad and the Angry: Analyzing Crowdsourced Impressions of Vloggers”, AAAI, 2012
     ✤ Hervé Jégou, “Very Large Scale Image/Video Search”, SSMS’12, Santorini
     ✤ Utkarsh, “SIFT: Scale Invariant Feature Transform”, http://www.aishack.in/2010/05/sift-scale-invariant-feature-transform/
     ✤ Wikipedia, “Bag of Words” and “Visual Word”
     ✤ Wikipedia, “tf-idf”
     ✤ Wikipedia, “k-means clustering”
  40. Bibliography
     ✤ Rong Yan, “Data mining and machine learning for large-scale social media”, SSMS’12, Santorini
