Groundhog Day: Near-Duplicate Detection on Twitter

Presentation given by Ke Tao at the 22nd International World Wide Web Conference (WWW 2013) in Rio de Janeiro, Brazil, titled "Groundhog Day: Near-Duplicate Detection on Twitter", in the Social Web Engineering track.



  1. Groundhog Day: Near-Duplicate Detection on Twitter
     #www2013, Rio de Janeiro, Brazil, May 15th, 2013
     Ke Tao¹, Fabian Abel¹·², Claudia Hauff¹, Geert-Jan Houben¹, Ujwal Gadiraju¹
     ¹ Web Information Systems, TU Delft, the Netherlands
     ² XING AG, Germany
  2. Outline
     • Search & Retrieval on Twitter
     • Duplicate Content on Twitter
     • Near-Duplicates in Twitter Search
     • Our Solution to Twitter Search: the Twinder Framework
     • Analysis & Evaluation
     • Conclusion
  3. Search & Retrieval on Twitter
     How do people use Twitter as a source of information?
     • Twitter is more like a news medium.
     • How do people search on Twitter? [Teevan et al.]
       • Repeated queries & monitoring for new content
     • Problems:
       • Short tweets → lots of similar information
       • Few people produce content → many retweets and much copied content
     J. Teevan, D. Ramage, and M. R. Morris. #TwitterSearch: A Comparison of Microblog Search and Web Search. In Proceedings of the 4th International Conference on Web Search and Web Data Mining (WSDM), 2011.
  4. Duplicate Content on Twitter (1/3)
     Classification of near-duplicates in 5 levels
     • Exact copy: completely identical in terms of characters.
     • Nearly exact copy: completely identical except for #hashtags, URLs, or @mentions.
     t1: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://newzfor.me/?cuye
     t2: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://newzfor.me/?cuye
     t3: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://bit.ly/ibUoJs
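The "nearly exact copy" level suggests a simple check: strip away #hashtags, URLs, and @mentions and compare what remains. A minimal sketch of that idea, not from the talk itself; the regexes and function names are illustrative:

```python
import re

def normalize(tweet):
    """Strip URLs, #hashtags, and @mentions, then collapse whitespace."""
    tweet = re.sub(r"https?://\S+", " ", tweet)   # remove URLs
    tweet = re.sub(r"[#@]\w+", " ", tweet)        # remove hashtags and mentions
    return " ".join(tweet.lower().split())

def nearly_exact_copy(t_a, t_b):
    """True if the tweets are identical once decorations are removed."""
    return normalize(t_a) == normalize(t_b)

# t2 and t3 from the slide differ only in their shortened URL:
t2 = "Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://newzfor.me/?cuye"
t3 = "Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://bit.ly/ibUoJs"
print(nearly_exact_copy(t2, t3))  # True
```

This only covers the second duplicate level; the weaker levels need the feature-based classifier described later in the deck.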
  5. Duplicate Content on Twitter (2/3)
     Classification of near-duplicates in 5 levels
     • Strong near-duplicate: same core message, but one tweet contains more information.
     • Weak near-duplicate: same core message, but one tweet adds personal views, or the two tweets convey semantically the same message with differing information nuggets.
     t4: Toyota recalls 1.7 million vehicles for fuel leaks: Toyota's latest recalls are mostly in Japan, but they also... http://bit.ly/dH0Pmw
     t5: Toyota Recalls 1.7 Million Vehicles For Fuel Leaks http://bit.ly/flWFWU
     t6: The White Stripes broke up. Oh well.
     t7: The White Stripes broke up. That's a bummer for me.
  6. Duplicate Content on Twitter (3/3)
     Classification of near-duplicates in 5 levels
     • Low overlap: semantically contain the same core message, but have only a few words in common.
     t8: Federal Judge rules Obamacare is unconsitutional...
     t9: Our man of the hour: Judge Vinson gave Obamacare its second unconstitutional ruling. http://fb.me/zQsChak9
  7. Near-Duplicates in Twitter Search (1/2)
     Analysis of the Tweets2011 corpus (TREC Microblog Track)
     • For the 49 topics (queries), 2,825 topic-tweet pairs are relevant.
     • We manually labeled 55,362 tweet pairs.
     • We found 2,745 pairs of duplicates at different levels.
     [Pie chart of duplicate levels: exact copy 1.89%, nearly exact copy 9.51%, strong near-duplicate 21.09%, weak near-duplicate 48.71%, low overlap 18.80%]
  8. Near-Duplicates in Twitter Search (2/2)
     Analysis of the Tweets2011 corpus (TREC Microblog Track)
     • Number of duplicate tweet pairs among the top 10, 20, and 50 search results and the whole result range (all tweets judged as relevant in the corpus):

       Range       | Top 10 | Top 20 | Top 50 | All
       Duplicate % | 19.4%  | 22.2%  | 22.5%  | 22.3%

     • On average, we found around 20% duplicates in the search results.
  9. Twinder Framework
     Our Search Infrastructure
     [Architecture diagram: messages from Social Web streams reach a feature-extraction task broker backed by a cloud computing infrastructure; syntactical, semantic, and contextual features (plus further enrichment) feed keyword-based and semantic-based relevance estimation over an index; the Twinder search engine applies duplicate detection and diversification and serves queries, results, and feedback through the search user interface.]
  10. Building a Classifier … (1/5)
      Overview of our syntactic features

      Feature                  | Description
      Levenshtein distance     | Number of character edits (substitutions, insertions, deletions) required to change one tweet into the other
      Overlap in terms         | Jaccard similarity between the two tweets' sets of words
      Overlap in #hashtags     | Jaccard similarity between the two tweets' sets of #hashtags
      Overlap in URLs          | Jaccard similarity between the two tweets' sets of URLs
      Overlap in expanded URLs | "Overlap in URLs" recomputed after expanding the shortened URLs in both tweets
      Length difference        | The difference in length between the two tweets
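Most of these syntactic features reduce to two primitives: Jaccard similarity between token sets and Levenshtein edit distance. A minimal sketch of how such a feature vector might be assembled; the function names and token regexes are illustrative, and URL expansion is omitted:

```python
import re

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0 when both sets are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def levenshtein(s, t):
    """Character edit distance via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def syntactic_features(t_a, t_b):
    """Feature dict for a tweet pair, mirroring the table above."""
    words = lambda t: t.lower().split()
    tags = lambda t: re.findall(r"#\w+", t.lower())
    urls = lambda t: re.findall(r"https?://\S+", t)
    return {
        "levenshtein": levenshtein(t_a, t_b),
        "term_overlap": jaccard(words(t_a), words(t_b)),
        "hashtag_overlap": jaccard(tags(t_a), tags(t_b)),
        "url_overlap": jaccard(urls(t_a), urls(t_b)),
        "length_difference": abs(len(t_a) - len(t_b)),
    }

print(syntactic_features("Toyota recall #cars http://bit.ly/x",
                         "Toyota recall announced #cars"))
```

The semantic features on the next slides follow the same Jaccard pattern, only over sets of entities, topics, and WordNet concepts instead of raw tokens.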
  11. Building a Classifier … (2/5)
      Extract semantics from tweets
      [Example tweet annotated with DBpedia entities dbp:Tim_Berners-Lee, dbp:World_Wide_Web, dbp:Rio_de_Janeiro, dbp:International_World_Wide_Web_Conference and the topic Internet_Technology]
  12. Building a Classifier … (3/5)
      Overview of our semantic features

      Feature                            | Description
      Overlap in entities                | Jaccard similarity between the two tweets' sets of extracted entities
      Overlap in entity types            | Jaccard similarity between the two tweets' sets of entity types
      Overlap in topics                  | Jaccard similarity between the two tweets' sets of detected topics
      Overlap in WordNet concepts        | Jaccard similarity between the two tweets' sets of WordNet nouns
      Overlap in WordNet synset concepts | "Overlap in WordNet concepts" recomputed after combining interlinked concepts in synsets
      WordNet similarity                 | Similarity calculated from the semantic relatedness between the two tweets' concepts
  13. Building a Classifier … (4/5)
      Enriched semantic features
      • We integrate content from external resources and construct the same set of semantic features.
      t3: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://bit.ly/ibUoJs
  14. Building a Classifier … (5/5)
      Overview of contextual features

      Feature                  | Description
      Temporal difference      | The difference between the posting times of the two tweets
      Difference in #followees | The difference in the number of followees of the tweets' authors
      Difference in #followers | The difference in the number of followers of the tweets' authors
      Same client              | Whether the two tweets were posted from the same client application
  15. Summary of Features
      • What feature categories do we have?
        • Syntactical features (6)
        • Semantic features (6)
        • Enriched semantic features (6)
        • Contextual features (4)
      • Classification strategies → different feature combinations
        • Dependent on available resources and time constraints
  16. Classification Strategies
      Using different sets of features for near-duplicate detection on Twitter

      Strategy | Description
      Sy       | Only syntactical features
      SySe     | Sy plus semantics from the tweets
      SyCo     | Without semantics (syntactical + contextual)
      SySeCo   | Without enriched semantics
      SySeEn   | Without contextual features
      SySeEnCo | All features included
  17. Analysis and Evaluation
      • Research questions:
        1. How accurately can the different duplicate detection strategies identify duplicates?
        2. What kinds of features are of particular importance for duplicate detection?
        3. How does the accuracy vary across the different levels of duplicates?
        4. How do the duplicate detection strategies impact search effectiveness on Twitter?
      • Experimental setup:
        • Consider the problem as a classification task
        • Logistic regression
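In this setup, each labeled tweet pair becomes one training example: its feature vector is the input, duplicate-or-not the label. A minimal pure-Python logistic-regression sketch of that formulation; the toy feature vectors, learning rate, and epoch count are illustrative stand-ins, not the paper's actual training configuration:

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Per-example gradient descent on the logistic loss; returns (weights, bias)."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1 / (1 + math.exp(-z))       # predicted duplicate probability
            err = p - y                      # gradient factor of the log loss
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """1 if the duplicate probability exceeds 0.5, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

# Toy feature vectors per tweet pair: (term overlap, URL overlap);
# label 1 = duplicate pair, 0 = unrelated pair.
xs = [(0.9, 1.0), (0.8, 0.0), (0.1, 0.0), (0.2, 0.0)]
ys = [1, 1, 0, 0]
w, b = train_logistic(xs, ys)
print([predict(w, b, x) for x in xs])  # [1, 1, 0, 0]
```

One advantage of logistic regression here is that the learned weights are directly interpretable, which is exactly what the feature-weight slides later in the deck exploit for RQ2.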
  18. Data Set: Tweets2011
      TREC 2011 Microblog Track
      • Twitter corpus
        • 16 million tweets (Jan. 24th – Feb. 8th, 2011)
        • 4,766,901 tweets classified as English
        • 6.2 million entity extractions (140k distinct entities)
      • Relevance judgments
        • 49 topics
        • 40,855 (topic, tweet) pairs, 2,825 judged as relevant
        • 57.65 relevant tweets per topic (on average)
      • Duplicate level labeling
        • 55,362 tweet pairs labeled
        • 2,745 labeled as duplicates (in 5 levels)
      • Publicly available at http://wis.ewi.tudelft.nl/duptweet/
  19. Classification Accuracy
      Duplicate or not? → RQ1

      Features | Precision | Recall | F-measure
      Baseline | 0.5068    | 0.1913 | 0.2777
      Sy       | 0.5982    | 0.2918 | 0.3923
      SyCo     | 0.5127    | 0.3370 | 0.4067
      SySe     | 0.5333    | 0.3679 | 0.4354
      SySeEn   | 0.5297    | 0.3767 | 0.4403
      SySeCo   | 0.4816    | 0.4200 | 0.4487
      SySeEnCo | 0.4868    | 0.4299 | 0.4566

      Overall, we can achieve a precision and recall of about 49% and 43% respectively by applying all possible features.
  20. Feature Weights (1/2)
      Which features matter the most? → RQ2
      [Bar charts of the learned weights for the syntactical and semantic features]
  21. Feature Weights (2/2)
      Which features matter the most? → RQ2
      [Bar charts of the learned weights for the contextual and enriched semantic features]
  22. Results for Predicting Duplicate Levels (1/2)
      Exact copy, weak near-duplicate, … or low overlap? → RQ3

      Features | Precision | Recall | F-measure
      Baseline | 0.5553    | 0.5208 | 0.5375
      Sy       | 0.6599    | 0.5809 | 0.6179
      SyCo     | 0.6747    | 0.5889 | 0.6289
      SySe     | 0.6708    | 0.6151 | 0.6417
      SySeEn   | 0.6694    | 0.6241 | 0.6460
      SySeCo   | 0.6852    | 0.6198 | 0.6508
      SySeEnCo | 0.6739    | 0.6308 | 0.6516

      Overall, we achieve a precision and recall of about 67% and 63% respectively by applying all features.
  23. Results for Predicting Duplicate Levels (2/2)
      Exact copy, weak near-duplicate, … or low overlap? → RQ3
  24. Search Result Diversification
      How much redundancy can we detect and remove? → RQ4
      • A core application of near-duplicate detection strategies is the diversification of search results: we simply remove the duplicates identified by our method.
      • Near-duplicates remaining after filtering:

      Range           | Top 10 | Top 20 | Top 50 | All
      Baseline        | 19.4%  | 22.2%  | 22.5%  | 22.3%
      After filtering | 9.1%   | 10.5%  | 12.0%  | 12.1%
      Improvement     | +53.1% | +52.0% | +46.7% | +45.7%
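Such a filtering pass can be sketched as a greedy scan over the ranked list: keep a tweet only if no already-kept tweet is its near-duplicate. The sketch below uses a crude word-overlap threshold as a hypothetical stand-in for the trained pairwise classifier; the names and the 0.6 threshold are illustrative:

```python
def diversify(ranked_tweets, is_duplicate):
    """Greedily drop any tweet flagged as a near-duplicate of a kept one."""
    kept = []
    for tweet in ranked_tweets:
        if not any(is_duplicate(tweet, seen) for seen in kept):
            kept.append(tweet)
    return kept

def is_duplicate(a, b):
    """Stand-in pairwise classifier: high word overlap means duplicate."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) > 0.6

results = [
    "Toyota recalls 1.7 million vehicles for fuel leaks",
    "Toyota recalls 1.7 million vehicles for fuel leaks today",
    "The White Stripes broke up",
]
print(diversify(results, is_duplicate))  # drops the second, redundant tweet
```

Keeping the first of each duplicate group preserves the original relevance ranking; the pass is O(n²) pairwise checks in the worst case, which is acceptable for result lists of tens of tweets.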
  25. Conclusions
      1. We conduct an analysis of duplicate content in Twitter search results and infer a model for categorizing different levels of duplicity.
      2. We develop a near-duplicate detection framework for microposts that provides functionality for analyzing 4 categories of features.
      3. Given our duplicate detection framework, we perform extensive evaluations and analyses of different duplicate detection strategies on a large, standardized Twitter corpus to investigate the quality of (i) detecting duplicates and (ii) categorizing the duplicity level of two tweets.
      4. Our approach enables search result diversification, and we analyze the impact of the diversification on search quality.
      • The progress on Twinder can be found at: http://wis.ewi.tudelft.nl/twinder/
  26. THANK YOU! QUESTIONS?
      May 15th, 2013
      Slides: http://goo.gl/gffBm
      k.tao@tudelft.nl | http://ktao.nl/
