SlideShare a Scribd company logo
1 of 26
1
Groundhog Day:
Near-Duplicate Detection on Twitter
#www2013 Rio de Janeiro, Brazil
May 15th, 2013
Ke Tao1, Fabian Abel1,2, Claudia Hauff1, Geert-Jan Houben1, Ujwal Gadiraju1
1 Web Information Systems, TU Delft, the Netherlands
2 XING AG, Germany
2Groundhog Day: Near-Duplicate Detection on Twitter
Outline
• Search & Retrieval on Twitter
• Duplicate Content on Twitter
• Near-duplicates in Twitter Search
• Our solution to Twitter Search: the Twinder Framework
• Analysis & Evaluation
• Conclusion
3Groundhog Day: Near-Duplicate Detection on Twitter
Search & Retrieval on Twitter
• Twitter is more like a news media.
• How do people search on Twitter? [Teevan et al.]
• Repeated queries & monitoring for new content
• Problems:
• Short tweets  lots of similar information
• Few people produce contents  many retweets, copied content
How do people use Twitter as a source of information?
J. Teevan, D. Ramage, and M. R. Morris. #TwitterSearch: A Comparison of Microblog Search and Web
Search. In Proceedings of the 4th International Conference on Web Search and Web Data Mining
(WSDM), 2011.
4Groundhog Day: Near-Duplicate Detection on Twitter
Duplicate Content on Twitter (1/3)
• Exact copy
• Completely identical in terms of characters.
• Nearly exact copy
• Completely identical except for #hashtags, URLs or
@mentions
Classification of near-duplicates in 5 levels
t1: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans
- http://newzfor.me/?cuye
t2: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans
- http://newzfor.me/?cuye
t3: Huge New Toyota Recall Includes 245,000 Lexus GS,
IS Sedans - http://bit.ly/ibUoJs
5Groundhog Day: Near-Duplicate Detection on Twitter
Duplicate Content on Twitter (2/3)
• Strong near-duplicate
• Same core message, one tweet contains more information.
• Weak near-duplicate
• Same core message, one tweet contains personal views.
• Convey semantically the same message with differing information
nuggets.
Classification of near-duplicates in 5 levels
t4: Toyota recalls 1.7 million vehicles for fuel leaks: Toyota’s latest
recalls are mostly in Japan, but they also... http://bit.ly/dH0Pmw
t5: Toyota Recalls 1.7 Million Vehicles For Fuel Leaks
http://bit.ly/flWFWU
t6: The White Stripes broke up. Oh well.
t7: The White Stripes broke up. That’s a bummer for me.
6Groundhog Day: Near-Duplicate Detection on Twitter
Duplicate Content on Twitter (3/3)
• Low overlap
• Semantically contain the same core message, but only have a few
words in common
Classification of near-duplicates in 5 levels
t8: Federal Judge rules Obamacare is unconsitutional...
t9: Our man of the hour: Judge Vinson gave Obamacare its second
unconstitutional ruling. http://fb.me/zQsChak9
7Groundhog Day: Near-Duplicate Detection on Twitter
Near-Duplicates in Twitter Search (1/2)
Analysis of the Tweets2011 corpus (TREC microblog track)
1.89%&
9.51%&
21.09%&
48.71%&
18.80%&
Exact&copy&
Nearly&exact&
copy&
Strong&near;
duplicate&
Weak&near;
duplicate&
Low&overlapping&
• For the 49 topics (queries),
2,825 topic-tweet pairs are
relevant.
• We manually labeled 55,362
tweet pairs
• We found 2,745 pairs of
duplicates in different levels.
8Groundhog Day: Near-Duplicate Detection on Twitter
Near-Duplicates in Twitter Search (2/2)
Analysis of the Tweets2011 corpus (TREC microblog track)
• Number of duplicate tweet pairs among top 10, 20, 50
and the whole range of the search results (All tweets
judged as relevant in the corpus)
• On average, we found around 20% duplicates in the
search results.
Range Top 10 Top 20 Top 50 All
Duplicate
%
19.4% 22.2% 22.5% 22.3%
9Groundhog Day: Near-Duplicate Detection on Twitter
Twinder Framework
Our Search Infrastructure
Feature Extrac on
Relevance Es ma on
Social Web Streams
FeatureExtraconTask
Broker
Cloud
Computing
Infrastructure
Index
Keyword-based
Relevance
messages
Twinder
Search
Engine
feature
extraction
tasks
Search User Interface
query
results
feedback
users
Duplicate Detec on and Diversifica on
Seman c-based
Relevance
Seman c FeaturesSyntac cal Features
Contextual Features Further Enrichment
10Groundhog Day: Near-Duplicate Detection on Twitter
Building a Classifier … (1/5)
Features Description
Levenshtein
distance
Number of characters required to change (substitution,
insertion, deletion) one tweet to the other
Overlap in
terms
Jaccard similarity between two sets of words in tweets.
Overlap in
#hashtags
Jaccard similarity between two sets of #hashtags in
tweets.
Overlap in URL Jaccard similarity between two sets of URLs in tweets.
Overlap in
expanded URL
Recomputed “Overlap in URL” after expanding shortened
URLs in both tweets.
Length
difference
The difference in length between two tweets.
Overview of our syntactic features
11Groundhog Day: Near-Duplicate Detection on Twitter
Building a Classifier … (2/5)
Extract semantics from tweets
dbp:Tim_Berners-Lee dbp:World_Wide_Web
dbp:Rio_de_Janeiro
dbp:International_World_Wide_Web_Conference
Topic:Internet_Technology
12Groundhog Day: Near-Duplicate Detection on Twitter
Building a Classifier … (3/5)
Features Description
Overlap in Entities
Jaccard similarity between two sets of entities
extracted from two tweets
Overlap in Entities
Types
Jaccard similarity between two sets of types of entities
from two tweets
Overlap in Topics
Jaccard similarity between two sets of detected topics
in two tweets
Overlap in
WordNet Concepts
Jaccard similarity between two sets of WordNet Nouns
in tweets
Overlap in
WordNet Synset
Concepts
Recomputed Overlap in WordNet Concepts after
Combining interlinked Concepts in Synsets
WordNet similarity
The similarity calculated based on semantic
relatedness* between concepts from two tweets
Overview of our semantic features
13Groundhog Day: Near-Duplicate Detection on Twitter
Building a Classifier … (4/5)
Enriched semantic features
• We integrate content from external resources and construct
the same set of semantic features
t3: Huge New Toyota Recall Includes 245,000 Lexus GS,
IS Sedans - http://bit.ly/ibUoJs
14Groundhog Day: Near-Duplicate Detection on Twitter
Building a Classifier … (5/5)
Features Description
Temporal
difference
The difference of posting time of two tweets
Difference in
#followees
The difference in number of followees of the author
of the tweets
Difference in
#followers
The difference in number of followers of the author
of the tweets
Same client
Indicator of whether the two tweets are posted from
the same client application
Overview of contextual features
15Groundhog Day: Near-Duplicate Detection on Twitter
Summary of Features
• What feature categories do we have?
• Syntactical features (6)
• Semantic features (6)
• Enriched semantic features (6)
• Contextual features (4)
• Classification strategies  different feature
combinations
• Dependent on available resources and time constraints
16Groundhog Day: Near-Duplicate Detection on Twitter
Classification Strategies
Using different sets of features for near-duplicate detection on Twitter
Strategy Description
Sy Only Syntactical features
SySe Add semantics from tweets
SyCo Without Semantics
SySeCo Without Enriched Semantics
SySeEn Without Contextual features
SySeEnC
o
All Feature included
17Groundhog Day: Near-Duplicate Detection on Twitter
Analysis and Evaluation
• Research Questions:
1. How accurately can the different duplicate detection
strategies identify duplicates?
2. What kind of features are of particular importance for
duplicate detection?
3. How does the accuracy vary for the different levels of
duplicates?
4. How do the duplicate detection strategies impact search
effectiveness on Twitter?
• Experimental setup
• Consider the problem as a classification task
• Logistic Regression
18Groundhog Day: Near-Duplicate Detection on Twitter
Data set: Tweets2011
• Twitter corpus
• 16 million tweets (Jan. 24th, 2011 – Feb. 8th)
• 4,766,901 tweets classified as English
• 6.2 million entity-extractions (140k distinct entities)
• Relevance judgments
• 49 topics
• 40,855 (topic, tweet) pairs, 2,825 judged as relevant
• 57.65 relevant tweets per topic (on average)
• Duplicate level labeling
• 55,362 tweet pairs labeled
• 2,745 labeled as duplicates (in 5 levels)
• Publicly available at http://wis.ewi.tudelft.nl/duptweet/
TREC 2011 Microblog Track
19Groundhog Day: Near-Duplicate Detection on Twitter
Classification Accuracy
Duplicate or not?  RQ1
Features Precision Recall F-measure
Baseline 0.5068 0.1913 0.2777
Sy 0.5982 0.2918 0.3923
SyCo 0.5127 0.3370 0.4067
SySe 0.5333 0.3679 0.4354
SySeEn 0.5297 0.3767 0.4403
Overall, we can achieve a precision and
recall of about 49% and 43% respectively
by applying all possible features.
SySeCo 0.4816 0.4200 0.4487
SySeEnCo 0.4868 0.4299 0.4566
20Groundhog Day: Near-Duplicate Detection on Twitter
Feature Weights (1/2)
Which features matter the most?  RQ2
-4
-3
-2
-1
0
1
2
3
Syntactical
-3
-2
-1
0
1
2
3
4
5
Semantic
21Groundhog Day: Near-Duplicate Detection on Twitter
Feature Weights (2/2)
Which features matter the most?  RQ2
-14
-12
-10
-8
-6
-4
-2
0
2
Contextual
-3
-2
-1
0
1
2
3
Enriched Semantics
22Groundhog Day: Near-Duplicate Detection on Twitter
Results for Predicting Duplicate Levels (1/2)
Exact copy, weak near-duplicate, … or low overlap?  RQ3
Features Precision Recall F-measure
Baseline 0.5553 0.5208 0.5375
Sy 0.6599 0.5809 0.6179
SyCo 0.6747 0.5889 0.6289
SySe 0.6708 0.6151 0.6417
SySeEn 0.6694 0.6241 0.6460
Overall, we achieve a precision and recall
of about 67% and 63% respectively by
applying all features.
SySeCo 0.6852 0.6198 0.6508
SySeEnCo 0.6739 0.6308 0.6516
23Groundhog Day: Near-Duplicate Detection on Twitter
Results for Predicting Duplicate Levels (2/2)
Exact copy, weak near-duplicate, … or low overlap?  RQ3
24Groundhog Day: Near-Duplicate Detection on Twitter
Search Result Diversification
How much redundancy can we detect and remove?  RQ4
Range Top10 Top20 Top50 All
Baseline 19.4% 22.2% 22.5% 22.3%
After Filtering 9.1% 10.5% 12.0% 12.1%
Improvement +53.1% +52.0% +46.7% +45.7%
• A core application of near-duplicate detection strategies is
the diversification of search results. We simply remove the
duplicates that are identified by our method.
• Near-duplicates after filtering:
25Groundhog Day: Near-Duplicate Detection on Twitter
Conclusions
1. We conduct an analysis of duplicate content in Twitter search results
and infer a model for categorizing different levels of duplicity.
2. We develop a near-duplicate detection framework for microposts
that provides functionality for analyzing 4 categories of features.
3. Given our duplicate detection framework, we perform extensive
evaluations and analyses of different duplicate detection
strategies on a large, standardized Twitter corpus to investigate the
quality of (i) detecting duplicates and (ii) categorizing the duplicity level of
two tweets.
4. Our approach enables search result diversification and analyzes the
impact of the diversification on the search quality.
• The progress on Twinder on be found at:
http://wis.ewi.tudelft.nl/twinder/
26Groundhog Day: Near-Duplicate Detection on Twitter
THANK YOU!
May 15th, 2013 Slides : http://goo.gl/gffBm
k.tao@tudelft.nl http://ktao.nl/
QUESTIONS?

More Related Content

What's hot

Anticipating Discussion Activity on Community Forums
Anticipating Discussion Activity on Community ForumsAnticipating Discussion Activity on Community Forums
Anticipating Discussion Activity on Community ForumsMatthew Rowe
 
Twitter Users Guide 2010 Upload
Twitter   Users Guide 2010   UploadTwitter   Users Guide 2010   Upload
Twitter Users Guide 2010 UploadStephen Dann
 
Using Chaos to Disentangle an ISIS-Related Twitter Network
Using Chaos to Disentangle an ISIS-Related Twitter NetworkUsing Chaos to Disentangle an ISIS-Related Twitter Network
Using Chaos to Disentangle an ISIS-Related Twitter NetworkSteve Kramer
 
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS,...
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS,...Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS,...
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS,...Steve Kramer
 
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...Steve Kramer
 
Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014
Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014
Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014Lora Aroyo
 
ipoque p2p Survey 2006
ipoque p2p Survey 2006ipoque p2p Survey 2006
ipoque p2p Survey 2006ipoque
 
Data Analytics Capstone
Data Analytics CapstoneData Analytics Capstone
Data Analytics CapstoneMacemann
 
Fake News Detector
Fake News DetectorFake News Detector
Fake News DetectorIrisYoon5
 
Crowdsourcing ambiguity aware ground truth - collective intelligence 2017
Crowdsourcing ambiguity aware ground truth - collective intelligence 2017Crowdsourcing ambiguity aware ground truth - collective intelligence 2017
Crowdsourcing ambiguity aware ground truth - collective intelligence 2017Lora Aroyo
 
Twitter Analysis: Fake News
Twitter Analysis: Fake  NewsTwitter Analysis: Fake  News
Twitter Analysis: Fake NewsErika Siregar
 
Tumblr 2014 - statistical overview and comparison with popular social services
Tumblr 2014 - statistical overview and comparison with popular social servicesTumblr 2014 - statistical overview and comparison with popular social services
Tumblr 2014 - statistical overview and comparison with popular social servicesStephan Tschierschwitz
 

What's hot (16)

Red Blue Presentation
Red Blue PresentationRed Blue Presentation
Red Blue Presentation
 
Broker Bots: Analyzing automated activity during High Impact Events on Twitter
Broker Bots: Analyzing automated activity during High Impact Events on TwitterBroker Bots: Analyzing automated activity during High Impact Events on Twitter
Broker Bots: Analyzing automated activity during High Impact Events on Twitter
 
Anticipating Discussion Activity on Community Forums
Anticipating Discussion Activity on Community ForumsAnticipating Discussion Activity on Community Forums
Anticipating Discussion Activity on Community Forums
 
Twitter Users Guide 2010 Upload
Twitter   Users Guide 2010   UploadTwitter   Users Guide 2010   Upload
Twitter Users Guide 2010 Upload
 
Using Chaos to Disentangle an ISIS-Related Twitter Network
Using Chaos to Disentangle an ISIS-Related Twitter NetworkUsing Chaos to Disentangle an ISIS-Related Twitter Network
Using Chaos to Disentangle an ISIS-Related Twitter Network
 
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS,...
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS,...Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS,...
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS,...
 
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...
Finding Key Influencers and Viral Topics in Twitter Networks Related to ISIS ...
 
Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014
Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014
Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014
 
ipoque p2p Survey 2006
ipoque p2p Survey 2006ipoque p2p Survey 2006
ipoque p2p Survey 2006
 
BotNetBenchmark - A Benchmark for Social Network
BotNetBenchmark - A Benchmark for Social NetworkBotNetBenchmark - A Benchmark for Social Network
BotNetBenchmark - A Benchmark for Social Network
 
Data Analytics Capstone
Data Analytics CapstoneData Analytics Capstone
Data Analytics Capstone
 
Fake News Detector
Fake News DetectorFake News Detector
Fake News Detector
 
Crowdsourcing ambiguity aware ground truth - collective intelligence 2017
Crowdsourcing ambiguity aware ground truth - collective intelligence 2017Crowdsourcing ambiguity aware ground truth - collective intelligence 2017
Crowdsourcing ambiguity aware ground truth - collective intelligence 2017
 
Twitter Analysis: Fake News
Twitter Analysis: Fake  NewsTwitter Analysis: Fake  News
Twitter Analysis: Fake News
 
Tumblr 2014 - statistical overview and comparison with popular social services
Tumblr 2014 - statistical overview and comparison with popular social servicesTumblr 2014 - statistical overview and comparison with popular social services
Tumblr 2014 - statistical overview and comparison with popular social services
 
Web Science 2010 slides
Web Science 2010 slidesWeb Science 2010 slides
Web Science 2010 slides
 

Viewers also liked

Twitter: Tweeting, Following, Finding
Twitter: Tweeting, Following, FindingTwitter: Tweeting, Following, Finding
Twitter: Tweeting, Following, FindingMonica Hamburg
 
Presentationcasestudy eventdetectionintwitter-140912160454-phpapp01
Presentationcasestudy eventdetectionintwitter-140912160454-phpapp01Presentationcasestudy eventdetectionintwitter-140912160454-phpapp01
Presentationcasestudy eventdetectionintwitter-140912160454-phpapp01Falitokiniaina Rabearison
 
Hashtags & Retweets: Using Twitter to aid Community, Communication and Casual...
Hashtags & Retweets: Using Twitter to aid Community, Communication and Casual...Hashtags & Retweets: Using Twitter to aid Community, Communication and Casual...
Hashtags & Retweets: Using Twitter to aid Community, Communication and Casual...Peter Reed
 
Topic & Sentiment Detection on Twitter and Facebook
Topic & Sentiment Detection on Twitter and FacebookTopic & Sentiment Detection on Twitter and Facebook
Topic & Sentiment Detection on Twitter and FacebookSylvia van Schie
 
Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Real-Time Top-R Topic Detection on Twitter with Topic Hijack FilteringReal-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Real-Time Top-R Topic Detection on Twitter with Topic Hijack FilteringKohei Hayashi
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSymeon Papadopoulos
 
Community Detection in Social Media
Community Detection in Social MediaCommunity Detection in Social Media
Community Detection in Social MediaSymeon Papadopoulos
 

Viewers also liked (8)

Twitter: Tweeting, Following, Finding
Twitter: Tweeting, Following, FindingTwitter: Tweeting, Following, Finding
Twitter: Tweeting, Following, Finding
 
Presentationcasestudy eventdetectionintwitter-140912160454-phpapp01
Presentationcasestudy eventdetectionintwitter-140912160454-phpapp01Presentationcasestudy eventdetectionintwitter-140912160454-phpapp01
Presentationcasestudy eventdetectionintwitter-140912160454-phpapp01
 
Hashtags & Retweets: Using Twitter to aid Community, Communication and Casual...
Hashtags & Retweets: Using Twitter to aid Community, Communication and Casual...Hashtags & Retweets: Using Twitter to aid Community, Communication and Casual...
Hashtags & Retweets: Using Twitter to aid Community, Communication and Casual...
 
Topic & Sentiment Detection on Twitter and Facebook
Topic & Sentiment Detection on Twitter and FacebookTopic & Sentiment Detection on Twitter and Facebook
Topic & Sentiment Detection on Twitter and Facebook
 
Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Real-Time Top-R Topic Detection on Twitter with Topic Hijack FilteringReal-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
 
Basics of Twitter
Basics of TwitterBasics of Twitter
Basics of Twitter
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
Community Detection in Social Media
Community Detection in Social MediaCommunity Detection in Social Media
Community Detection in Social Media
 

Similar to Groundhog Day: Near-Duplicate Detection on Twitter

Groundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterGroundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterDan Nguyen
 
Building a Microblog Corpus for Search Result Diversification
Building a Microblog Corpus for Search Result DiversificationBuilding a Microblog Corpus for Search Result Diversification
Building a Microblog Corpus for Search Result DiversificationKe Tao
 
Python report on twitter sentiment analysis
Python report on twitter sentiment analysisPython report on twitter sentiment analysis
Python report on twitter sentiment analysisAntaraBhattacharya12
 
Questions about questions
Questions about questionsQuestions about questions
Questions about questionsmoresmile
 
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHI
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHIBig Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHI
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHIRuchika Sharma
 
GeospatialDataAnalysis
GeospatialDataAnalysisGeospatialDataAnalysis
GeospatialDataAnalysisTaylor Graham
 
Twitter: A Hands On Learning Session for Researchers
Twitter: A Hands On Learning Session for ResearchersTwitter: A Hands On Learning Session for Researchers
Twitter: A Hands On Learning Session for ResearchersKMb Unit, York University
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Serge Beckers
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Serge Beckers
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalCarsten Eickhoff
 
Five steps to search and store tweets by keywords
Five steps to search and store tweets by keywordsFive steps to search and store tweets by keywords
Five steps to search and store tweets by keywordsWeiai Wayne Xu
 
How not to be all a flutter about Twitter
How not to be all a flutter about TwitterHow not to be all a flutter about Twitter
How not to be all a flutter about TwitterMargaret Hazel
 
Twitter: A Hands-On Learning Session for Researcher
Twitter: A Hands-On Learning Session for ResearcherTwitter: A Hands-On Learning Session for Researcher
Twitter: A Hands-On Learning Session for ResearcherKMb Unit, York University
 
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013Digital Methods Initiative
 
Computational Verification Challenges in Social Media
Computational Verification Challenges in Social MediaComputational Verification Challenges in Social Media
Computational Verification Challenges in Social MediaSymeon Papadopoulos
 
Twitter as a personalizable information service ii
Twitter as a personalizable information service iiTwitter as a personalizable information service ii
Twitter as a personalizable information service iiKan-Han (John) Lu
 
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...IRJET Journal
 
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionDetection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionIJERA Editor
 

Similar to Groundhog Day: Near-Duplicate Detection on Twitter (20)

Groundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterGroundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitter
 
Building a Microblog Corpus for Search Result Diversification
Building a Microblog Corpus for Search Result DiversificationBuilding a Microblog Corpus for Search Result Diversification
Building a Microblog Corpus for Search Result Diversification
 
Python report on twitter sentiment analysis
Python report on twitter sentiment analysisPython report on twitter sentiment analysis
Python report on twitter sentiment analysis
 
Questions about questions
Questions about questionsQuestions about questions
Questions about questions
 
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHI
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHIBig Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHI
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHI
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
GeospatialDataAnalysis
GeospatialDataAnalysisGeospatialDataAnalysis
GeospatialDataAnalysis
 
Twitter: A Hands On Learning Session for Researchers
Twitter: A Hands On Learning Session for ResearchersTwitter: A Hands On Learning Session for Researchers
Twitter: A Hands On Learning Session for Researchers
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Five steps to search and store tweets by keywords
Five steps to search and store tweets by keywordsFive steps to search and store tweets by keywords
Five steps to search and store tweets by keywords
 
How not to be all a flutter about Twitter
How not to be all a flutter about TwitterHow not to be all a flutter about Twitter
How not to be all a flutter about Twitter
 
Twitter: A Hands-On Learning Session for Researcher
Twitter: A Hands-On Learning Session for ResearcherTwitter: A Hands-On Learning Session for Researcher
Twitter: A Hands-On Learning Session for Researcher
 
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
 
tweet segmentation
tweet segmentation tweet segmentation
tweet segmentation
 
Computational Verification Challenges in Social Media
Computational Verification Challenges in Social MediaComputational Verification Challenges in Social Media
Computational Verification Challenges in Social Media
 
Twitter as a personalizable information service ii
Twitter as a personalizable information service iiTwitter as a personalizable information service ii
Twitter as a personalizable information service ii
 
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
 
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionDetection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

Groundhog Day: Near-Duplicate Detection on Twitter

  • 1. 1 Groundhog Day: Near-Duplicate Detection on Twitter #www2013 Rio de Janeiro, Brazil May 15th, 2013 Ke Tao1, Fabian Abel1,2, Claudia Hauff1, Geert-Jan Houben1, Ujwal Gadiraju1 1 Web Information Systems, TU Delft, the Netherlands 2 XING AG, Germany
  • 2. 2Groundhog Day: Near-Duplicate Detection on Twitter Outline • Search & Retrieval on Twitter • Duplicate Content on Twitter • Near-duplicates in Twitter Search • Our solution to Twitter Search: the Twinder Framework • Analysis & Evaluation • Conclusion
  • 3. 3Groundhog Day: Near-Duplicate Detection on Twitter Search & Retrieval on Twitter • Twitter is more like a news media. • How do people search on Twitter? [Teevan et al.] • Repeated queries & monitoring for new content • Problems: • Short tweets  lots of similar information • Few people produce contents  many retweets, copied content How do people use Twitter as a source of information? J. Teevan, D. Ramage, and M. R. Morris. #TwitterSearch: A Comparison of Microblog Search and Web Search. In Proceedings of the 4th International Conference on Web Search and Web Data Mining (WSDM), 2011.
  • 4. 4Groundhog Day: Near-Duplicate Detection on Twitter Duplicate Content on Twitter (1/3) • Exact copy • Completely identical in terms of characters. • Nearly exact copy • Completely identical except for #hashtags, URLs or @mentions Classification of near-duplicates in 5 levels t1: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://newzfor.me/?cuye t2: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://newzfor.me/?cuye t3: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://bit.ly/ibUoJs
  • 5. 5Groundhog Day: Near-Duplicate Detection on Twitter Duplicate Content on Twitter (2/3) • Strong near-duplicate • Same core message, one tweet contains more information. • Weak near-duplicate • Same core message, one tweet contains personal views. • Convey semantically the same message with differing information nuggets. Classification of near-duplicates in 5 levels t4: Toyota recalls 1.7 million vehicles for fuel leaks: Toyota’s latest recalls are mostly in Japan, but they also... http://bit.ly/dH0Pmw t5: Toyota Recalls 1.7 Million Vehicles For Fuel Leaks http://bit.ly/flWFWU t6: The White Stripes broke up. Oh well. t7: The White Stripes broke up. That’s a bummer for me.
  • 6. 6Groundhog Day: Near-Duplicate Detection on Twitter Duplicate Content on Twitter (3/3) • Low overlap • Semantically contain the same core message, but only have a few words in common Classification of near-duplicates in 5 levels t8: Federal Judge rules Obamacare is unconsitutional... t9: Our man of the hour: Judge Vinson gave Obamacare its second unconstitutional ruling. http://fb.me/zQsChak9
  • 7. 7Groundhog Day: Near-Duplicate Detection on Twitter Near-Duplicates in Twitter Search (1/2) Analysis of the Tweets2011 corpus (TREC microblog track) 1.89%& 9.51%& 21.09%& 48.71%& 18.80%& Exact&copy& Nearly&exact& copy& Strong&near; duplicate& Weak&near; duplicate& Low&overlapping& • For the 49 topics (queries), 2,825 topic-tweet pairs are relevant. • We manually labeled 55,362 tweet pairs • We found 2,745 pairs of duplicates in different levels.
  • 8. 8Groundhog Day: Near-Duplicate Detection on Twitter Near-Duplicates in Twitter Search (2/2) Analysis of the Tweets2011 corpus (TREC microblog track) • Number of duplicate tweet pairs among top 10, 20, 50 and the whole range of the search results (All tweets judged as relevant in the corpus) • On average, we found around 20% duplicates in the search results. Range Top 10 Top 20 Top 50 All Duplicate % 19.4% 22.2% 22.5% 22.3%
  • 9. 9Groundhog Day: Near-Duplicate Detection on Twitter Twinder Framework Our Search Infrastructure Feature Extrac on Relevance Es ma on Social Web Streams FeatureExtraconTask Broker Cloud Computing Infrastructure Index Keyword-based Relevance messages Twinder Search Engine feature extraction tasks Search User Interface query results feedback users Duplicate Detec on and Diversifica on Seman c-based Relevance Seman c FeaturesSyntac cal Features Contextual Features Further Enrichment
  • 10. 10Groundhog Day: Near-Duplicate Detection on Twitter Building a Classifier … (1/5) Features Description Levenshtein distance Number of characters required to change (substitution, insertion, deletion) one tweet to the other Overlap in terms Jaccard similarity between two sets of words in tweets. Overlap in #hashtags Jaccard similarity between two sets of #hashtags in tweets. Overlap in URL Jaccard similarity between two sets of URLs in tweets. Overlap in expanded URL Recomputed “Overlap in URL” after expanding shortened URLs in both tweets. Length difference The difference in length between two tweets. Overview of our syntactic features
  • 11. 11Groundhog Day: Near-Duplicate Detection on Twitter Building a Classifier … (2/5) Extract semantics from tweets dbp:Tim_Berners-Lee dbp:World_Wide_Web dbp:Rio_de_Janeiro dbp:International_World_Wide_Web_Conference Topic:Internet_Technology
  • 12. 12Groundhog Day: Near-Duplicate Detection on Twitter Building a Classifier … (3/5) Features Description Overlap in Entities Jaccard similarity between two sets of entities extracted from two tweets Overlap in Entities Types Jaccard similarity between two sets of types of entities from two tweets Overlap in Topics Jaccard similarity between two sets of detected topics in two tweets Overlap in WordNet Concepts Jaccard similarity between two sets of WordNet Nouns in tweets Overlap in WordNet Synset Concepts Recomputed Overlap in WordNet Concepts after Combining interlinked Concepts in Synsets WordNet similarity The similarity calculated based on semantic relatedness* between concepts from two tweets Overview of our semantic features
  • 13. 13Groundhog Day: Near-Duplicate Detection on Twitter Building a Classifier … (4/5) Enriched semantic features • We integrate content from external resources and construct the same set of semantic features t3: Huge New Toyota Recall Includes 245,000 Lexus GS, IS Sedans - http://bit.ly/ibUoJs
  • 14. 14Groundhog Day: Near-Duplicate Detection on Twitter Building a Classifier … (5/5) Features Description Temporal difference The difference of posting time of two tweets Difference in #followees The difference in number of followees of the author of the tweets Difference in #followers The difference in number of followers of the author of the tweets Same client Indicator of whether the two tweets are posted from the same client application Overview of contextual features
  • 15. 15Groundhog Day: Near-Duplicate Detection on Twitter Summary of Features • What feature categories do we have? • Syntactical features (6) • Semantic features (6) • Enriched semantic features (6) • Contextual features (4) • Classification strategies  different feature combinations • Dependent on available resources and time constraints
  • 16. 16Groundhog Day: Near-Duplicate Detection on Twitter Classification Strategies Using different sets of features for near-duplicate detection on Twitter Strategy Description Sy Only Syntactical features SySe Add semantics from tweets SyCo Without Semantics SySeCo Without Enriched Semantics SySeEn Without Contextual features SySeEnC o All Feature included
  • 17. 17Groundhog Day: Near-Duplicate Detection on Twitter Analysis and Evaluation • Research Questions: 1. How accurately can the different duplicate detection strategies identify duplicates? 2. What kind of features are of particular importance for duplicate detection? 3. How does the accuracy vary for the different levels of duplicates? 4. How do the duplicate detection strategies impact search effectiveness on Twitter? • Experimental setup • Consider the problem as a classification task • Logistic Regression
  • 18. 18Groundhog Day: Near-Duplicate Detection on Twitter Data set: Tweets2011 • Twitter corpus • 16 million tweets (Jan. 24th, 2011 – Feb. 8th) • 4,766,901 tweets classified as English • 6.2 million entity-extractions (140k distinct entities) • Relevance judgments • 49 topics • 40,855 (topic, tweet) pairs, 2,825 judged as relevant • 57.65 relevant tweets per topic (on average) • Duplicate level labeling • 55,362 tweet pairs labeled • 2,745 labeled as duplicates (in 5 levels) • Publicly available at http://wis.ewi.tudelft.nl/duptweet/ TREC 2011 Microblog Track
  • 19. 19Groundhog Day: Near-Duplicate Detection on Twitter Classification Accuracy Duplicate or not?  RQ1 Features Precision Recall F-measure Baseline 0.5068 0.1913 0.2777 Sy 0.5982 0.2918 0.3923 SyCo 0.5127 0.3370 0.4067 SySe 0.5333 0.3679 0.4354 SySeEn 0.5297 0.3767 0.4403 Overall, we can achieve a precision and recall of about 49% and 43% respectively by applying all possible features. SySeCo 0.4816 0.4200 0.4487 SySeEnCo 0.4868 0.4299 0.4566
  • 20. 20Groundhog Day: Near-Duplicate Detection on Twitter Feature Weights (1/2) Which features matter the most?  RQ2 -4 -3 -2 -1 0 1 2 3 Syntactical -3 -2 -1 0 1 2 3 4 5 Semantic
  • 21. 21Groundhog Day: Near-Duplicate Detection on Twitter Feature Weights (2/2) Which features matter the most?  RQ2 -14 -12 -10 -8 -6 -4 -2 0 2 Contextual -3 -2 -1 0 1 2 3 Enriched Semantics
  • 22. 22Groundhog Day: Near-Duplicate Detection on Twitter Results for Predicting Duplicate Levels (1/2) Exact copy, weak near-duplicate, … or low overlap?  RQ3 Features Precision Recall F-measure Baseline 0.5553 0.5208 0.5375 Sy 0.6599 0.5809 0.6179 SyCo 0.6747 0.5889 0.6289 SySe 0.6708 0.6151 0.6417 SySeEn 0.6694 0.6241 0.6460 Overall, we achieve a precision and recall of about 67% and 63% respectively by applying all features. SySeCo 0.6852 0.6198 0.6508 SySeEnCo 0.6739 0.6308 0.6516
  • 23. 23Groundhog Day: Near-Duplicate Detection on Twitter Results for Predicting Duplicate Levels (2/2) Exact copy, weak near-duplicate, … or low overlap?  RQ3
  • 24. 24Groundhog Day: Near-Duplicate Detection on Twitter Search Result Diversification How much redundancy can we detect and remove?  RQ4 Range Top10 Top20 Top50 All Baseline 19.4% 22.2% 22.5% 22.3% After Filtering 9.1% 10.5% 12.0% 12.1% Improvement +53.1% +52.0% +46.7% +45.7% • A core application of near-duplicate detection strategies is the diversification of search results. We simply remove the duplicates that are identified by our method. • Near-duplicates after filtering:
  • 25. 25Groundhog Day: Near-Duplicate Detection on Twitter Conclusions 1. We conduct an analysis of duplicate content in Twitter search results and infer a model for categorizing different levels of duplicity. 2. We develop a near-duplicate detection framework for microposts that provides functionality for analyzing 4 categories of features. 3. Given our duplicate detection framework, we perform extensive evaluations and analyses of different duplicate detection strategies on a large, standardized Twitter corpus to investigate the quality of (i) detecting duplicates and (ii) categorizing the duplicity level of two tweets. 4. Our approach enables search result diversification and analyzes the impact of the diversification on the search quality. • The progress on Twinder on be found at: http://wis.ewi.tudelft.nl/twinder/
  • 26. 26Groundhog Day: Near-Duplicate Detection on Twitter THANK YOU! May 15th, 2013 Slides : http://goo.gl/gffBm k.tao@tudelft.nl http://ktao.nl/ QUESTIONS?