CrowdSearch2012 - Discovering User Perceptions of Semantic Similarity in Near-duplicate Multimedia Files

CrowdSearch 2012 workshop: http://crowdsearch.como.polimi.it/

Paper abstract:
We address the problem of discovering new notions of user-perceived similarity between near-duplicate multimedia files. We focus on file-sharing, since in this setting, users have a well-developed understanding of the available content, but what constitutes a near-duplicate is nonetheless nontrivial. We elicited judgments of semantic similarity by implementing triadic elicitation as a crowdsourcing task and ran it on Amazon Mechanical Turk. We categorized the judgments and arrived at 44 different dimensions of semantic similarity perceived by users. These discovered dimensions can be used for clustering items in search result lists. The challenge in performing elicitations in this way is to ensure that workers are encouraged to answer seriously and remain engaged.

  1. Discovering User Perceptions of Semantic Similarity in Near-duplicate Multimedia Files. Raynor Vliegendhart (speaker), Martha Larson, Johan Pouwelse. WWW 2012 Workshop on Crowdsourcing Web Search (CrowdSearch 2012), Lyon, France, April 17, 2012.
  2. Outline: • Introduction • Crowdsourcing Task • Results • Conclusions and Future Work
  3. Question: Are these the same? Why (not)? "Chrono Cross - Dream of the Shore Near Another World Violin/Piano Cover" and "Chrono Cross Dream of the Shore Near Another World Violin and Piano". Sources: YouTube, IQYNEj51EUI (left), Iuh3YrJtK3M (right)
  4. Question: Are these the same? Why (not)? "Chrono Cross - Dream of the Shore Near Another World Violin/Piano Cover" and "Chrono Cross Dream of the Shore Near Another World Violin and Piano". Answer: Yes, it's the same song. Sources: YouTube, IQYNEj51EUI (left), Iuh3YrJtK3M (right)
  5. Question: Are these the same? Why (not)? "Chrono Cross - Dream of the Shore Near Another World Violin/Piano Cover" and "Chrono Cross Dream of the Shore Near Another World Violin and Piano". Answer: No, these are different performances by different performers. Sources: YouTube, IQYNEj51EUI (left), Iuh3YrJtK3M (right)
  6. Problem: What constitutes a near duplicate? Functional near-duplicate multimedia items are items that fulfill the same purpose for the user. Once the user has one of these items, there is no additional need for another.
  7. Problem: What constitutes a near duplicate? Our work: discovering new notions of user-perceived similarity between multimedia files, in a file-sharing setting, through a crowdsourcing task.
  8. Motivation: Clustering items in search results (screenshot from Tribler, tribler.org)
  9. Motivation: Clustering items in search results (screenshot from Tribler, tribler.org)
  10. Outline: • Introduction • Crowdsourcing Task • Results • Conclusions and Future Work
  11. Crowdsourcing Task: Point the odd one out. • Three multimedia files displayed as search results • Worker points the odd one out and justifies why • Challenge: eliciting serious judgments
  12. Crowdsourcing Task: Eliciting serious judgments (1). "Imagine that you downloaded the three items in the list and that you view them." Example triad: Harry Potter and the Sorcerers Stone AudioBook (478 MB); Harry Potter and the Sorcerer s Stone(2001)(ENG GER NL) 2Lions- (4.36 GB); Harry Potter.And.The.Sorcerer.Stone.DVDR.NTSC.SKJACK.Universal.S (4.46 GB)
  13. Crowdsourcing Task: Eliciting serious judgments (2). • Don't force workers to make a contrast • Explain the definition of functional similarity via the answer options: (a) The items are comparable. They are for all practical purposes the same. Someone would never really need all three of these. (b) Each item can be considered unique. I can imagine that someone might really want to download all three of these items. (c) One item is not like the other two. (Please mark that item in the list.) The other two items are comparable.
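A minimal sketch of how such an odd-one-out HIT and a worker's response could be represented in Python; all class and field names (Triad, Judgment, odd_index) are illustrative assumptions, not taken from the authors' implementation. The three answer options from the slide map onto a small enum, and a response is valid only if the odd item is marked when the third option is chosen.

```python
# Minimal sketch of an odd-one-out HIT and a worker response.
# All names are illustrative, not from the paper's implementation.
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class Answer(Enum):
    ALL_COMPARABLE = "comparable"   # "for all practical purposes the same"
    ALL_UNIQUE = "unique"           # "someone might really want all three"
    ODD_ONE_OUT = "odd_one_out"     # "one item is not like the other two"

@dataclass
class Triad:
    """Three filenames shown to the worker as search results."""
    filenames: Tuple[str, str, str]

@dataclass
class Judgment:
    """One worker's response to one triad."""
    answer: Answer
    odd_index: Optional[int] = None  # required only for ODD_ONE_OUT
    justification: str = ""          # free-text explanation ("why (not)?")

    def validate(self) -> None:
        if self.answer is Answer.ODD_ONE_OUT and self.odd_index not in (0, 1, 2):
            raise ValueError("Odd-one-out answers must mark one of the three items.")
```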
  14. Final HIT Design
  15. Outline: • Introduction • Crowdsourcing Task • Results • Conclusions and Future Work
  16. Dataset: top-100 content → 75 queries → 75 result lists / 32,773 filenames → 1000 random triads (test set) and 28 manually selected triads (validation set)
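As a rough illustration of this dataset step, the sketch below samples random triads of filenames, each drawn from the result list of a single query. The sampling strategy and all names are assumptions made for illustration; the paper does not publish its sampling code.

```python
# Illustrative sketch of sampling random triads from per-query result lists.
# Names and the exact sampling strategy are assumptions, not the paper's code.
import random
from typing import Dict, List, Tuple

def sample_triads(result_lists: Dict[str, List[str]],
                  n_triads: int,
                  seed: int = 0) -> List[Tuple[str, str, str]]:
    """Draw n_triads triads, each made of three distinct filenames
    taken from the result list of one randomly chosen query."""
    rng = random.Random(seed)
    eligible = [files for files in result_lists.values() if len(files) >= 3]
    triads = []
    for _ in range(n_triads):
        files = rng.choice(eligible)                 # pick one query's result list
        triads.append(tuple(rng.sample(files, 3)))   # three distinct filenames
    return triads

# Example call matching the slide's numbers (result_lists would hold 75 lists):
# test_triads = sample_triads(result_lists, n_triads=1000)
```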
  17. Results: two HITs run concurrently. Recruitment HIT: 3 validation triads. Main HIT: 1000 test triads + 28 validation triads mixed in (3 workers per test triad).
  18. Results: the Recruitment HIT yielded 14 qualified workers; the Main HIT collected free-text judgments for 308 test triads in under 36 hours.
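The validation triads mixed into the main HIT act as gold-standard checks on worker quality. The sketch below shows that idea in its simplest form: compare a worker's answers on the validation triads against the expected answers and keep only workers above an agreement threshold. The function name and threshold value are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of gold-standard quality control using validation triads.
# Threshold and names are illustrative assumptions.
from typing import Dict, Hashable

def qualifies(worker_answers: Dict[Hashable, str],
              gold_answers: Dict[Hashable, str],
              min_agreement: float = 0.8) -> bool:
    """Return True if the worker agrees with the expected answers on
    enough of the validation triads he or she judged."""
    judged = [t for t in gold_answers if t in worker_answers]
    if not judged:
        return False
    correct = sum(worker_answers[t] == gold_answers[t] for t in judged)
    return correct / len(judged) >= min_agreement
```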
  19. Card Sort: • Print judgments on small pieces of paper • Group similar judgments into piles • Merge piles iteratively • Label each pile
  20. Card Sort example: "different language". • "The third item is a Hindi language version of the movie" • "This is a Spanish version of the movie represented by the other two" • ...
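One way to capture the outcome of the manual card sort digitally is a mapping from pile label to the judgments placed on that pile, with a helper for merging piles. This is purely an illustrative record of the process described on the slides; the sort itself was done by hand with paper cards.

```python
# Illustrative record of the card-sort outcome: pile label -> judgments.
# The sort itself was manual; this only shows one way to capture its result.
from collections import defaultdict
from typing import Dict, List

piles: Dict[str, List[str]] = defaultdict(list)

# Judgments quoted on the slide, placed on the "different language" pile.
piles["different language"].extend([
    "The third item is a Hindi language version of the movie",
    "This is a Spanish version of the movie represented by the other two",
])

def merge_piles(piles: Dict[str, List[str]], a: str, b: str, new_label: str) -> None:
    """Merge piles a and b when they turn out to express the same dimension."""
    piles[new_label] = piles.pop(a) + piles.pop(b)
```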
  21. User-perceived Similarity Dimensions (44 in total): Different movie vs. TV show; Different movie; Normal cut vs. extended cut; Movie vs. trailer; Cartoon vs. movie; Comic vs. movie; Movie vs. book; Audiobook vs. movie; Game vs. corresponding movie; Sequels (movies); Commentary document vs. movie; Soundtrack vs. corresponding movie; Movie/TV show vs. unrelated audio album; Movie vs. wallpaper; Different episode; Complete season vs. individual episodes; Episodes from different season; Graphic novel vs. TV episode; Multiple episodes vs. full season; Different realization of same legend/story; Different songs; Different albums; Song vs. album; Collection vs. album; Album vs. remix; Event capture vs. song; Explicit version; Bonus track included; Song vs. collection of songs+videos; Event capture vs. unrelated movie; Language of subtitles; Different language; Mobile vs. normal version; Quality and/or source; Different codec/container (MP4 audio vs. MP3); Different game; Crack vs. game; Software versions; Different game, same series; Different application; Addon vs. main application; Documentation (pdf) vs. software; List (text document) vs. unrelated item; Safe vs. X-Rated
  22. Outline: • Introduction • Crowdsourcing Task • Results • Conclusions and Future Work
  23. Conclusions: • A wealth of user-perceived dimensions of similarity discovered, some of which we could not have anticipated ourselves • Quick results thanks to an engaging crowdsourcing task designed to encourage serious answers
  24. Future Work: • Expand the experiments with a larger worker volume • Other multimedia search settings • Crowdsourcing the card-sorting process • Use the findings to guide the design of clustering algorithms (done: a first version is deployed in Tribler)
  25. Questions?
