Twinder: A Search Engine for Twitter Streams                                   #ICWE2012 Berlin, Germany                  ...
Get information from TwitterHow do people use Twitter as a source of information?•  Twitter is more like a news media.•  H...
Research QuestionsWhat are the challenges we are facing? 1. Given a topic, can we identify the relevant tweets based on th...
Search on TwitterWhat you can do on current Twitter?• Twitter search interface• Ordered by time• Keyword-based match      ...
Twinder = TWeet +fINDEROur solution - Architecture                                                                        ...
Core Components of TwinderFeature Extraction •  Receive Twitter messages from Social Web Streams •  Features of two catego...
Core Components of TwinderFeature Extraction Task Broker •  Twinder makes use of MapReduce and cloud computing    infrastr...
Core Components of TwinderRelevance Estimation •  Accepting search queries from front-end, passing them to    Feature Extr...
Efficiency of IndexingHow good does Twinder make use of cloud-computinginfrastructure?Corpus size        Mainstream Server...
Features of Microposts Hypothesis H1: The greater the keyword- based relevance score, the more relevant and interesting th...
Semantic-based relevanceExpand the queries to match more tweets                               dbp:United_States  Reformula...
Semantic-based relatednessIs there a semantic overlap between the query and the tweet?                                dbp:...
Overview of featuresWhat do we have now?   Topic sensitive             Topic insensitive    Keyword-based                 ...
Syntactical feature : HashtagIs a tweet more relevant if it contains a #hashtag?  Hypothesis 4: tweets that contain hashta...
Syntactical feature : hasURLIs a tweet that contains a URL more relevant?  Hypothesis 5: tweets that contain a URL are  mo...
Syntactical feature : isReplyIs a tweet which is a reply to @somebody more relevant?  Hypothesis 6: tweets that are formul...
Syntactical feature : lengthDoes the length of a tweet influence its relevance for a topic?  Hypothesis 7: the longer a tw...
Overview of features Short summary    Topic sensitive             Topic insensitive     Keyword-based              Syntact...
Semantic featuresFind semantics in a tweet to estimate the relevance         dbp:Tim_Berners-Lee          dbp:World_Wide_W...
Semantic features : #entityIs a tweet with more entities more interesting?• 5 entities extracted.  Hypothesis 8: the more ...
Semantic features : diversityHow many types are there in the entities?• 4 types of entities  Hypothesis 9: the greater the...
Semantic features : sentimentWas the author of the tweet happy or not?• Sentiment : Neutral  Hypothesis 10: the likelihood...
Overview of featuresBy now, we have 4 types of features.    Topic sensitive              Topic insensitive     Keyword-bas...
Contextual featuresDoes the number of followers influence the relatedness?    Hypothesis 11: The higher the number of    f...
Contextual featuresOr the number of followers that the author appears in? Hypothesis 12: The higher the number of lists in...
Contextual featuresHow long has been the author on Twitter? Hypothesis 13: The older the Twitter account of a user, the mo...
Summary of FeaturesThe features    Topic sensitive             Topic insensitive     Keyword-based                 Syntact...
Analysis• Research Questions:  1.  Which features are more influential on predicting the      relatedness of a tweet to a ...
DatasetFrom TREC 2011 Microblog Track•  Twitter corpus   •  16 million tweets (Jan. 24th, 2011 – Feb. 8th)   •  4,766,901 ...
ResultsWhich type of features matters?Features             Precision       Recall            F-measurekeyword relevance   ...
Weights of features                              Syntactical     Which feature matters?         2                         ...
Topics of different categoriesThe impact on the performance and models•  49 topics categorized into 2 parts w.r.t. 3 dimen...
ConclusionsWhat are our contributions?1.  Twinder search engine proposed: analyzing various    features to determine the r...
ConclusionsThe lessons learned1.  The learned models which take advantage of semantics    and topic-sensitive features out...
QUESTIONS?July 25th, 2012k.tao@tudelft.nl       http://ktao.nl/THANK YOU!            Twinder: A search engine for Twitter ...
Upcoming SlideShare
Loading in …5
×

Twinder: A Search Engine for Twitter Streams

1,559 views
1,417 views

Published on

Published in: Technology, Design
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,559
On SlideShare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
26
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Twinder: A Search Engine for Twitter Streams

  1. 1. Twinder: A Search Engine for Twitter Streams #ICWE2012 Berlin, Germany July 25th, 2012 Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben Web Information Systems, TU Delft 1
  2. 2. Get information from TwitterHow do people use Twitter as a source of information?•  Twitter is more like a news media.•  How do people search Twitter? •  Search on Twitter (Teevan et al.) Web Search Twitter Search Query length (chars) 18.80 12.00 Query length (words) 3.08 1.64 Is a celebrity name 3.11% 15.22% Twinder: A search engine for Twitter streams 2
  3. 3. Research QuestionsWhat are the challenges we are facing? 1. Given a topic, can we identify the relevant tweets based on the characteristics of the tweets? 2. Are semantics meaningful for determining the tweets’ relevance for a topic? 3. How can we design an architecture that is scalable? Twinder: A search engine for Twitter streams 3
  4. 4. Search on TwitterWhat you can do on current Twitter?• Twitter search interface• Ordered by time• Keyword-based match Twinder: A search engine for Twitter streams 4
  5. 5. Twinder = TWeet +fINDEROur solution - Architecture $03.0& B%"&@ 3""?:#*< &"1%5$1 ."#&*/01"&2-$"&3#*" 784*%3.&& 93/.2:&& 4"5"6#-*"(1+7#+,- ;*+4*3& !"#$%&"()$&#*+,- !"#$%&"()$&#*+,- 1#(4254*03*04)63&& 1#(42503*04)63&& ;#1<=&,<"& .@-$#*+*#5!"#$%&"1 C"@D,&?>:#1"? 4"5"6#-*" 3"#$%&" A,-$")$%#5!"#$%&"1 2-?") ")$&#*+,- ."7#-+*1>:#1"?!"#$%&!#($)*+& $#1<1 ."7#-+*!"#$%&"1 4"5"6#-*" ,*-./01.$21$.3& 7"11#E"1 .,*8#59":.$&"#71 Twinder: A search engine for Twitter streams 5
  6. 6. Core Components of TwinderFeature Extraction •  Receive Twitter messages from Social Web Streams •  Features of two categories: •  (1) Topic-sensitive features •  (2) Topic-insensitive features •  Different extracting strategies designed for different features Twinder: A search engine for Twitter streams 6
  7. 7. Core Components of TwinderFeature Extraction Task Broker •  Twinder makes use of MapReduce and cloud computing infrastructures to allow for high scalability and frequent updates of its multifaceted index. •  Feature Extration Task Broker dispatches features extration tasks and indexing tasks to cloud computing infrastructure. Twinder: A search engine for Twitter streams 7
  8. 8. Core Components of TwinderRelevance Estimation •  Accepting search queries from front-end, passing them to Feature Extration Component. •  Tweets are classified into the relevant and the non-relevant by Relevance Estimation component, and are further delivered to front-end for rendering. •  Twinder can learn the classification model, initially from training dataset, then from usage data. Twinder: A search engine for Twitter streams 8
  9. 9. Efficiency of IndexingHow good does Twinder make use of cloud-computinginfrastructure?Corpus size Mainstream Server EMR(10 instances)100k (13MBytes) 0.4 min 5 min1m (122MBytes) 5 min 8 min10m (1.3GBytes) 48 min 19 min32m (3.9GBytes) 283 min 47 min Twinder: A search engine for Twitter streams 9
  10. 10. Features of Microposts Hypothesis H1: The greater the keyword- based relevance score, the more relevant and interesting the tweet is to the topic. We already have keyword-based relevance, and…? Topic sensitive Topic insensitive Keyword-based relevance ? Twinder: A search engine for Twitter streams 10
  11. 11. Semantic-based relevanceExpand the queries to match more tweets dbp:United_States Reformulated query is expected to get a dbp:Hu_Jintao more accurate retrieval score. Hypothesis H2 : The greater the semantic- based relevance score, the more relevant and interesting the tweet is. Twinder: A search engine for Twitter streams 11
  12. 12. Semantic-based relatednessIs there a semantic overlap between the query and the tweet? dbp:the_United_Statesdbp:Hu_JintaoHypothesis H3 : If a tweet is considered tobe semantically related to the query then itis also relevant and interesting for the user. Twinder: A search engine for Twitter streams 12
  13. 13. Overview of featuresWhat do we have now? Topic sensitive Topic insensitive Keyword-based ? Semantic-based Twinder: A search engine for Twitter streams 13
  14. 14. Syntactical feature : HashtagIs a tweet more relevant if it contains a #hashtag? Hypothesis 4: tweets that contain hashtags are more likely to be relevant than tweets that do not contain hashtags. Twinder: A search engine for Twitter streams 14
  15. 15. Syntactical feature : hasURLIs a tweet that contains a URL more relevant? Hypothesis 5: tweets that contain a URL are more likely to be relevant than tweets that do not contain a URL. Twinder: A search engine for Twitter streams 15
  16. 16. Syntactical feature : isReplyIs a tweet which is a reply to @somebody more relevant? Hypothesis 6: tweets that are formulated as a reply to another tweet are less likely to be relevant than other tweets. Twinder: A search engine for Twitter streams 16
  17. 17. Syntactical feature : lengthDoes the length of a tweet influence its relevance for a topic? Hypothesis 7: the longer a tweet, the more likely it is to be relevant and interesting. Twinder: A search engine for Twitter streams 17
  18. 18. Overview of features Short summary Topic sensitive Topic insensitive Keyword-based Syntactical features Semantic-based ? Are there further features thatallow for estimating the relevance? Twinder: A search engine for Twitter streams 18
  19. 19. Semantic featuresFind semantics in a tweet to estimate the relevance dbp:Tim_Berners-Lee dbp:World_Wide_Web dbp:France dbp:Lyon dbp:International_World_Wide_Web_Conference Twinder: A search engine for Twitter streams 19
  20. 20. Semantic features : #entityIs a tweet with more entities more interesting?• 5 entities extracted. Hypothesis 8: the more entities a tweet mentions, the more likely it is to be relevant and interesting. Twinder: A search engine for Twitter streams 20
  21. 21. Semantic features : diversityHow many types are there in the entities?• 4 types of entities Hypothesis 9: the greater the diversity of concepts mentioned in a tweet, the more likely it is to be interesting and relevant. Twinder: A search engine for Twitter streams 21
  22. 22. Semantic features : sentimentWas the author of the tweet happy or not?• Sentiment : Neutral Hypothesis 10: the likelihood of a tweet’s relevance is influenced by its sentiment polarity. Twinder: A search engine for Twitter streams 22
  23. 23. Overview of featuresBy now, we have 4 types of features. Topic sensitive Topic insensitive Keyword-based Syntactical Semantic-based Semantics Can we utilize the contextual information of tweets? Twinder: A search engine for Twitter streams 23
  24. 24. Contextual featuresDoes the number of followers influence the relatedness? Hypothesis 11: The higher the number of followers a creator of a message has, the more likely it is that her tweets are relevant. Twinder: A search engine for Twitter streams 24
  25. 25. Contextual featuresOr the number of followers that the author appears in? Hypothesis 12: The higher the number of lists in which the creator of a message appears, the more likely it is that her tweets are relevant. Twinder: A search engine for Twitter streams 25
  26. 26. Contextual featuresHow long has been the author on Twitter? Hypothesis 13: The older the Twitter account of a user, the more likely it is that her tweets are relevant. Signed up Post July 2008 June 2012 Twinder: A search engine for Twitter streams 26
  27. 27. Summary of FeaturesThe features Topic sensitive Topic insensitive Keyword-based Syntactical Semantic-based Semantics Contextual Twinder: A search engine for Twitter streams 27
  28. 28. Analysis• Research Questions: 1.  Which features are more influential on predicting the relatedness of a tweet to a certain topic? 2.  Which types of features are more important? Are semantics meaningful? 3.  What’s the performance that we can achieve by utilizing these features?• Twinder Setup •  Consider the search problem as a classification task •  Classification algorithm = Logistic Regression Twinder: A search engine for Twitter streams 28
  29. 29. DatasetFrom TREC 2011 Microblog Track•  Twitter corpus •  16 million tweets (Jan. 24th, 2011 – Feb. 8th) •  4,766,901 tweets classified as English •  6.2 million entity-extractions (140k distinct entities)•  Relevance judgments •  49 topics •  40,855 (topic, tweet) pairs •  60.31 relevant tweets per topic (on average) Twinder: A search engine for Twitter streams 29
  30. 30. ResultsWhich type of features matters?Features Precision Recall F-measurekeyword relevance 0.3036 0.2851 0.2940semantic relevance 0.3050 0.3294 0.3167topic-sensitive 0.3135 0.3252 0.3192topic-insensitive 0.1956 0.0064 0.0123without semantics 0.3363 0.4618 0.3965without sentiment 0.3701 0.3923 0.4048without context 0.3827 0.4714 0.4225all features 0.3674 0.4736 0.4138 Overall, we can achieve the precision and recall of over 35% and 45% respectively by applying all the features. Twinder: A search engine for Twitter streams 30
  31. 31. Weights of features Syntactical Which feature matters? 2 1 Keyword-based 0 hasHashtag hasURL isReply length2 -110 Semantics-1 Keyword-based relevance 2 1 Semantic-based 0 2 #entities diversity sentiment -1 1 Contextual 0 2 -1 Relevance Relatedness 1 0 #followers #lists Age -1 Twinder: A search engine for Twitter streams 31
  32. 32. Topics of different categoriesThe impact on the performance and models•  49 topics categorized into 2 parts w.r.t. 3 dimensions: •  Popularity •  Gobal vs. Local •  Temporal persistence•  Popularity •  Higher recall for popular topics •  Less impact from sentiment features on unpopular topics•  Temporal persistence •  Higher performance on shorter-term topics •  Less impact from sentiment features on persistent topic Twinder: A search engine for Twitter streams 32
  33. 33. ConclusionsWhat are our contributions?1.  Twinder search engine proposed: analyzing various features to determine the relevance and interestingness of Twitter messages for a given topic.2.  Scalability demonstrated for the Twinder search engine.3.  Extensive analysis on 13 features along two-dimensions: topic-sensitive features and topic-insensitive features. Twinder: A search engine for Twitter streams 33
  34. 34. ConclusionsThe lessons learned1.  The learned models which take advantage of semantics and topic-sensitive features outperform those which do not take the semantics and topic-sensitive features into account.2.  Contextual features that characterize the users who are posting the messages have little impact on the relevance estimation.3.  The importance of a feature differs depending on the topic characteristics; for example, the sentiment-based features are more important for popular than for unpopular topics. Twinder: A search engine for Twitter streams 34
  35. 35. QUESTIONS?July 25th, 2012k.tao@tudelft.nl http://ktao.nl/THANK YOU! Twinder: A search engine for Twitter streams 35

×