
Learning from Twitter Hashtags: Leveraging Proximate Tags to Enhance Graph-based Keyphrase Extraction

Paper "Learning from Twitter Hashtags: Leveraging Proximate Tags to Enhance Graph-based Keyphrase Extraction"


  1. Learning from Twitter Hashtags: Leveraging Proximate Tags to Enhance Graph-based Keyphrase Extraction
     Abdelghani Bellaachia & Mohammed Al-Dhelaan (Bell@gwu.edu, mdhelaan@gwu.edu)
     Computer Science Department, George Washington University, Washington, DC, USA
  2. Overview
     • Twitter Introduction
     • Why Extracting Keyphrases in Twitter?
     • Learning from Twitter Hashtags
     • Twitter Lexical Graph Expansion
     • Proposed Approach for Graph Expansion
     • How to Choose Hashtags
       • Frequency Approach
       • Hybrid Approach
     • How to Build Lexical Graph
     • Topic Modeling
     • Graph-based Ranking Scheme
     • Experiments
     • Experimental Results
     • Conclusion
  3. Twitter Introduction
     • Twitter is a micro-blogging social network site.
     • It enables users to blog or broadcast their thoughts and messages.
     • It gained a lot of popularity due to the speed at which news is broadcast through it.
     • The main idea is that a user can follow the accounts of people or organizations that seem interesting to that user.
     • Once a user follows an account, all the news and tweets issued by that account are shown in that user's timeline.
  4. Tweets
     • Tweets are the posts or messages broadcast by users.
     • A tweet can include at most 140 characters.
     • By nature, a tweet is meant to be broadcast to all of a user's followers. However, it can be directed to a specific user with the mention (“@”) feature.
     • Tweets are generally public and anyone can view them, unless the user makes them private so that only his/her followers can see them (rarely used!).
     • Tweets can include text, hashtags, mentions, or any combination of them.
  5. Tweets
     • Example of a tweet containing a hashtag, text, and a link [figure]
  6. Hashtags
     • Hashtags started as a user convention.
     • They are used to index and organize tweets.
     • They support trend discovery.
     • Every hashtag is generally about a specific topic: including a hashtag in a tweet directs that tweet to that topic, which has a specific audience.
     • A tweet may contain multiple hashtags.
     • A hashtag is a hyperlink to all tweets containing that hashtag.
  7. Why Extracting Keyphrases in Twitter?
     • By 2011, Twitter had attracted over 200 million users, who publish at least a billion tweets each week [2].
     • With such a massive amount of user-generated text, summarizing the topics in tweets becomes important.
     • However, tweets are short text documents, so standard summarization techniques are not applicable.
     • Instead, extracting short keyphrases that represent the topics in tweets can be an insightful approach.
  8. Definitions
     • Topical Tweets: the collection of tweets from which we extract keyphrases. Also called the target set.
     • Auxiliary Hashtag Tweets: the collection of tweets gathered for a hashtag selected from the topical tweets.
     • In this research, we investigate whether the lexical graph for the topical tweets can be expanded with auxiliary hashtag tweets, and whether doing so improves the ranking of keyphrases extracted from the target tweets.
  9. Learning from Twitter Hashtags
     • Tweets are short text documents.
     • The scarcity of text in tweets can be an obstacle when trying to learn from it.
     • However, tweets can contain an abundant number of links in the form of hashtags.
     • Can we improve the ranking using an auxiliary set of hashtag tweets (external tweets)?
     • How can we choose the best hashtags to fit the topic? Some hashtags are general! Some are very specific!
     • Can we expand the graph to include auxiliary hashtag tweets? How does that affect the ranking?
  10. Twitter Lexical Graph Expansion
     [Diagram: Target Tweets Set (tweets t) → Lexical Graph; Hashtags (H) → Auxiliary Tweets Set (tweets t) → Expanded Lexical Graph]
  11. Proposed Approach
     • From a random collection of tweets:
       • Identify topics
       • Cluster tweets based on the topics found
       • For every cluster (topic):
         • Build a lexical graph to calculate word weights
         • Expand the graph with auxiliary hashtag tweets similar to the topic
         • Generate keyphrases using the top keywords
         • Rank the keyphrases
         • Show the top 10 keyphrases
  12. Proposed Approach for Graph Expansion
     [figure]
  13. How to Choose Hashtags?
     • Hashtags are user generated and vary in scope.
     • Expanding the graph with the wrong hashtags (irrelevant or overly general ones) can deteriorate the ranking.
     • Two approaches to choosing hashtags for expanding the graph:
       • Frequency Approach: choose the most frequent hashtag in each topical cluster of tweets (target tweets).
       • Hybrid Approach: measure the similarity between the keywords of the top-10 frequent hashtags' tweets and the keywords of the target tweets.
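The Frequency Approach can be sketched as a simple hashtag count over one topical cluster. This is a minimal illustration, not the authors' code: the function name `top_hashtags`, the regular expression, the choice to count a hashtag once per tweet, and the sample tweets are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed hashtag pattern: "#" followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")

def top_hashtags(tweets, k=10):
    """Return the k most frequent hashtags (lowercased) in a list of tweet strings."""
    counts = Counter()
    for tweet in tweets:
        # Count each hashtag once per tweet so a repeated tag cannot dominate.
        counts.update({tag.lower() for tag in HASHTAG_RE.findall(tweet)})
    return [tag for tag, _ in counts.most_common(k)]

# Made-up target tweets for one topic cluster.
cluster = [
    "Kickoff in an hour #superbowl #giants",
    "What a catch! #SuperBowl",
    "Halftime show was great #superbowl #music",
    "Defense looks strong tonight #giants",
]
print(top_hashtags(cluster, k=2))
```

With `k=1` this yields the single hashtag the Frequency Approach would pick; with `k=10` it yields the candidate list that the Hybrid Approach starts from.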
  14. Frequency Approach
     • The frequency approach does not always select the correct hashtag.
     • Example: topic “Sandusky” [figure]
  15. Hybrid Approach
     [Diagram: the keywords k1 … kn of the Target Tweets are compared by cosine similarity with the keywords k1 … kn of Hashtag1 Tweets, Hashtag2 Tweets, …, Hashtag 10 Tweets]
     • K: keywords extracted from all tweets in the set
     • Select the most similar hashtag to expand the lexical graph
  16. Hybrid Approach
     • Let Target Tweets be a set of tweets {t1, t2, …, tn}.
     • From all tweets in the set, we build a vector of words TT_terms = {k1, k2, …, kn}.
       [Table: Target Tweets t1 … tn alongside TT_terms k1 … kn]
     • The Target Tweets set also contains a set of hashtags occurring across its tweets. We call it HashtagTitles = {h1, h2, …, hn}.
  17. Hybrid Approach
     • For each hashtag in the HashtagTitles set = {h1, h2, …, hn}, we search Twitter for all tweets containing it that do not occur in the Target Tweets set.
     • The search results for each hashtag are grouped in a vector of tweets called HT (Hashtag Tweets):
       h1 = Ht1, Ht2, …, Htn
       h2 = Ht1, Ht2, …, Htn
       …
       hn = Ht1, Ht2, …, Htn
  18. Hybrid Approach
     • For each HT, we build a vector of words, called HT_terms, that represents that hashtag separately.
     • We compute the cosine similarity between the two vectors TT_terms and HT_terms.
     • Finally, we choose the most similar hashtag to expand the graph with.
  19. Hybrid Approach
     • Measures the similarity between the content of the top frequent hashtags' tweets and the content of the target tweets using cosine similarity.
     • The top-10 frequent hashtags are used since we assume that the most relevant hashtag is frequent.
     • Selecting the most similar hashtag by cosine similarity among the top-10 frequent hashtags combines both approaches, which is intended to improve the accuracy of the selection.
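The hashtag selection step of the Hybrid Approach can be sketched as follows. The slides do not say how the term vectors are weighted, so this sketch assumes raw term frequencies; `term_vector`, `cosine`, `select_hashtag`, and the sample data are hypothetical names and values introduced only for illustration.

```python
import math
from collections import Counter

def term_vector(tweets):
    """Term-frequency vector over a list of tokenized tweets (assumed weighting)."""
    return Counter(token for tweet in tweets for token in tweet)

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_hashtag(target_tweets, hashtag_tweets):
    """Pick the candidate hashtag whose tweets (HT) are most similar to the target set.

    hashtag_tweets maps each candidate hashtag to its list of tokenized tweets.
    """
    tt_terms = term_vector(target_tweets)
    scores = {h: cosine(tt_terms, term_vector(ht)) for h, ht in hashtag_tweets.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Made-up tokenized tweets.
target = [["game", "team", "win"], ["team", "score"]]
candidates = {
    "sports": [["team", "win", "game"]],
    "news": [["weather", "today"]],
}
best, scores = select_hashtag(target, candidates)
print(best, scores)
```

In the approach described on the slides, the candidates would be the top-10 frequent hashtags of the cluster, each with tweets gathered from a Twitter search.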
  20. Hybrid Approach
     • After selecting an auxiliary hashtag tweet set, classify each of the hashtag's tweets as either relevant or irrelevant
     • by measuring the word overlap between the auxiliary tweet's terms and the top-10 tf-idf terms of the target tweets.
     • If an auxiliary tweet contains at least two words from the top-10, we classify it as relevant.
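The relevance rule can be sketched as a set-overlap check. The sketch assumes the top-10 tf-idf terms of the target tweets have already been computed elsewhere and are passed in as a list; the function names and sample data are hypothetical.

```python
def is_relevant(aux_tweet, top_terms, min_overlap=2):
    """True if the tokenized auxiliary tweet shares at least min_overlap words with top_terms."""
    return len(set(aux_tweet) & set(top_terms)) >= min_overlap

def filter_relevant(aux_tweets, top_terms, min_overlap=2):
    """Keep only the auxiliary tweets classified as relevant."""
    return [tweet for tweet in aux_tweets if is_relevant(tweet, top_terms, min_overlap)]

# Made-up top tf-idf terms and auxiliary tweets.
top_terms = ["team", "game", "score", "win"]
aux_tweets = [
    ["team", "win", "tonight"],    # two overlapping words: relevant
    ["game", "night", "friends"],  # one overlapping word: irrelevant
]
print(filter_relevant(aux_tweets, top_terms))
```

Using a set means a word repeated inside one tweet counts only once toward the overlap, which is one reading of "at least two words".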
  21. How to Build Lexical Graph
     • Let G = (V, E) be a weighted graph that represents the text.
     • Vertices V denote words.
     • We build an edge in E between every two words that co-occur within a specific window size.
     • The weight of an edge between terms in the target tweets is the frequency of their co-occurrence.
     • The frequency of co-occurrence shows how strong the relationship between two nodes is:
       Edge_weight(Vi, Vj) = |co-occurrence|
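A minimal sketch of the co-occurrence graph for the target tweets, assuming an undirected graph stored as a dictionary from word pairs to counts. The window semantics (a window of 2 links adjacent tokens) and the name `build_cooccurrence_graph` are assumptions for this example.

```python
from collections import defaultdict

def build_cooccurrence_graph(tweets, window=2):
    """Return {(u, v): count} for unordered word pairs co-occurring within `window` tokens.

    tweets is a list of token lists; each pair key is stored in sorted order.
    """
    weights = defaultdict(int)
    for tokens in tweets:
        for i, u in enumerate(tokens):
            # Tokens at positions i+1 .. i+window-1 fall in the same window as u.
            for v in tokens[i + 1 : i + window]:
                if u != v:
                    weights[tuple(sorted((u, v)))] += 1
    return dict(weights)

# Made-up tokenized tweets.
tweets = [["twitter", "hashtag", "graph"], ["hashtag", "graph", "rank"]]
print(build_cooccurrence_graph(tweets, window=2))
```

Each value is the Edge_weight(Vi, Vj) of the slide: the number of times the two words co-occur within the window.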
  22. How to Build Lexical Graph
     [slide content not captured in the transcript]
  23. How to Build Lexical Graph
     [slide content not captured in the transcript]
  24. Topic Modeling
     • Latent Dirichlet Allocation (LDA) (D. M. Blei, A. Y. Ng, and M. I. Jordan)
       • An unsupervised model that identifies topics in a collection of documents.
       • A statistical model that uses the “bag of words” assumption for each document.
       • Documents are represented as probability distributions over topics.
       • Topics are represented as probability distributions over a collection of words.
  25. Topic Modeling
     • Latent Dirichlet Allocation (LDA)
     • Dirichlet priors α and β
     • Multinomial distribution over topics θ
     • Multinomial distribution over words φ
     [Plate diagram with labels θ, Z, w, J, D, α, β, φ]
  26. Graph-based Ranking Scheme
     • PageRank (Brin and Page, 1998)
       • Voting idea!
       • When a vertex links to another, it casts a vote for the other vertex.
       • The algorithm has a recursive nature! The importance of the vertex casting the vote determines the importance of the vote.
       • Node ranks are updated iteratively until convergence.
  27. Graph-based Ranking Scheme
     [slide content not captured in the transcript]
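For reference, the voting idea behind PageRank can be sketched with the commonly cited update PR(v) = (1 − d) + d · Σ PR(u) / |Out(u)| over the vertices u that link to v. This is the standard textbook form with damping factor d = 0.85, not a formula recovered from the slides; the function name and the toy graph are assumptions.

```python
def pagerank(out_links, d=0.85, tol=1e-8, max_iter=200):
    """Iterate PR(v) = (1 - d) + d * sum(PR(u) / |Out(u)|) until the scores stabilize.

    out_links maps each vertex to the list of vertices it links to.
    Vertices without outgoing links simply cast no votes in this sketch.
    """
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    rank = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        new_rank = {v: 1.0 - d for v in nodes}
        for u, targets in out_links.items():
            if not targets:
                continue
            # u splits its current rank evenly among the vertices it votes for.
            share = d * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        delta = max(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# Toy graph: a links to b and c, b links to c, c links back to a.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```

In the toy graph, c collects votes from both a and b and ends up ranked highest, while b, which receives only half of a's vote, ranks lowest.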
  28. Graph-based Ranking Scheme
     • TextRank (Mihalcea & Tarau, 2004)
       • Create a graph for the text
       • Words are represented as nodes (nouns and adjectives only)
       • Edges represent the co-occurrence of words within a window
       • The frequency of co-occurrence is represented by the edge weights
       • TextRank uses the edge weights to influence the rank
  29. Graph-based Ranking Scheme
     [slide content not captured in the transcript]
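A sketch of weighted TextRank on an undirected co-occurrence graph, using the weighted update commonly given for TextRank: WS(v) = (1 − d) + d · Σ (w_uv / Σ_k w_uk) · WS(u) over the neighbors u of v. The input format (a dictionary of unordered word pairs to counts), the function name, and the toy graph are assumptions for this example.

```python
from collections import defaultdict

def textrank(edge_weights, d=0.85, tol=1e-8, max_iter=200):
    """Weighted TextRank scores for an undirected graph given as {(u, v): weight}."""
    neighbors = defaultdict(dict)
    for (u, v), w in edge_weights.items():
        neighbors[u][v] = w
        neighbors[v][u] = w
    # Total edge weight attached to each node, used to normalize its votes.
    strength = {v: sum(nbrs.values()) for v, nbrs in neighbors.items()}
    score = {v: 1.0 for v in neighbors}
    for _ in range(max_iter):
        new_score = {}
        for v, nbrs in neighbors.items():
            incoming = sum(w / strength[u] * score[u] for u, w in nbrs.items())
            new_score[v] = (1 - d) + d * incoming
        delta = max(abs(new_score[v] - score[v]) for v in neighbors)
        score = new_score
        if delta < tol:
            break
    return score

# Toy path graph: twitter - hashtag - graph - rank with weights 1, 2, 1.
edges = {("hashtag", "twitter"): 1, ("graph", "hashtag"): 2, ("graph", "rank"): 1}
print(textrank(edges))
```

In the toy graph the two middle words, joined by the heaviest edge, score higher than the two end words.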
  30. Graph-based Ranking Scheme
     • NE-Rank (Node-Edge Rank) (Bellaachia & Al-Dhelaan)
       • Incorporates the node's weight into the formula
       • Instead of using only node weights or only edge weights, we try to use both features.
       • In text, node weights are best represented by tf-idf, which reflects the content of documents.
       • PageRank focuses only on the relations between objects, without the content.
       • TextRank uses only the co-occurrence relation to identify important words.
       • NE-Rank takes the content into consideration in the form of tf-idf.
  31. Graph-based Ranking Scheme
     [slide content not captured in the transcript]
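The idea of combining node weights with edge weights can be illustrated with the sketch below. The exact NE-Rank formula is not present in this transcript, so the update used here, which scales the weighted TextRank update of each node by its normalized node weight, is only an assumed form for illustration and should not be read as the authors' definition. The function name, the normalization by the largest node weight, and the sample weights are likewise assumptions.

```python
from collections import defaultdict

def ne_rank(edge_weights, node_weights, d=0.85, tol=1e-8, max_iter=200):
    """Illustrative node-and-edge weighted ranking (assumed form, not the published formula).

    edge_weights: {(u, v): co-occurrence count} for an undirected graph.
    node_weights: {word: positive weight}, e.g. a tf-idf value per word.
    """
    neighbors = defaultdict(dict)
    for (u, v), w in edge_weights.items():
        neighbors[u][v] = w
        neighbors[v][u] = w
    strength = {v: sum(nbrs.values()) for v, nbrs in neighbors.items()}
    # Normalize node weights into (0, 1] so the iteration stays a contraction.
    top = max(node_weights[v] for v in neighbors)
    weight = {v: node_weights[v] / top for v in neighbors}
    score = {v: 1.0 for v in neighbors}
    for _ in range(max_iter):
        new_score = {}
        for v, nbrs in neighbors.items():
            incoming = sum(w / strength[u] * score[u] for u, w in nbrs.items())
            # Assumed update: the node weight scales the edge-weighted vote total.
            new_score[v] = weight[v] * ((1 - d) + d * incoming)
        delta = max(abs(new_score[v] - score[v]) for v in neighbors)
        score = new_score
        if delta < tol:
            break
    return score

edges = {("hashtag", "twitter"): 1, ("graph", "hashtag"): 2, ("graph", "rank"): 1}
tfidf = {"twitter": 0.2, "hashtag": 0.9, "graph": 0.5, "rank": 0.2}  # made-up values
print(ne_rank(edges, tfidf))
```

On this toy graph, "hashtag" and "graph" occupy symmetric positions, so edge weights alone cannot separate them; the higher node weight is what places "hashtag" first.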
  32. Experiment
     • Crawled Twitter from 1/19/2012 to 2/6/2012.
     • The dataset has 31,227 tweets.
     • 244,139 tokens
     • 40,674 hashtags in tweets (4,079 unique hashtags).
     • Hashtags were segmented into word tokens in the tokenization step.
     • We extracted 30 topics from the tweets.
     • Let C be the collection of tweets and 1..k the topics.
     • Aggregate the tweets for each topic k, yielding Ck.
     • Build a graph and extract keyphrases from every Ck.
     • C = C1 ∪ C2 ∪ … ∪ Ck
  33. Experiment
     • Preprocessing:
       • Removed non-English tweets
       • Removed URL links
       • Normalized tweets from conversational style to standard English; for example, “luv” became “love”
       • Part-of-speech tagging to extract nouns and adjectives only
       • Stemming and stopword removal
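Part of this preprocessing can be sketched with the standard library alone. The sketch covers URL removal, dictionary-based normalization, and stopword removal; language filtering, part-of-speech tagging, and stemming are left out because they need external tools. The regular expressions, the tiny `NORMALIZE` dictionary, and the `STOPWORDS` set are toy assumptions, not the resources used in the experiment.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z']+")

# Toy lookup tables; a real system would use much larger resources.
NORMALIZE = {"luv": "love", "u": "you"}
STOPWORDS = {"i", "the", "a", "you", "this", "is"}

def preprocess(tweet):
    """Lowercase, drop URLs, tokenize, normalize slang, and remove stopwords."""
    text = URL_RE.sub(" ", tweet.lower())
    tokens = TOKEN_RE.findall(text)
    tokens = [NORMALIZE.get(token, token) for token in tokens]
    return [token for token in tokens if token not in STOPWORDS]

print(preprocess("I luv this song http://t.co/abc #music"))
```

Normalization runs before stopword removal so that a slang form such as "u" is first mapped to "you" and then filtered like any other stopword.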
  34. Experiment
     • Since NE-Rank showed better results than other ranking methods in our previous research [8], we used it to compare the rankings of three approaches:
       • Single Approach: no graph expansion
       • Expanded with hashtags: Frequency Approach
       • Expanded with hashtags: Hybrid Approach
     • We validated our results using an empirical evaluation approach, described in the next slides.
  35. Experiment
     • Since there are no gold labels to compare against, we empirically designed an evaluation approach that uses a search engine to generate labels.
     • To generate the labels, we searched Google using the top-5 LDA terms for each topic.
     • We focused on only two fields of the search result snippets: title and description.
     • If a keyphrase occurs in the search results, we consider it correct.
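The labeling rule can be sketched as a precision computation over the top-k keyphrases. The sketch assumes the title and description snippets have already been collected as plain strings and that "occurs" means a case-insensitive substring match; both are assumptions, as are the function name and the sample data.

```python
def precision_at_k(keyphrases, snippets, k=10):
    """Fraction of the top-k keyphrases that occur in the search snippets.

    keyphrases: ranked list of phrases, best first.
    snippets: list of title/description strings returned by the search engine.
    """
    text = " ".join(snippets).lower()
    top = keyphrases[:k]
    if not top:
        return 0.0
    correct = sum(1 for phrase in top if phrase.lower() in text)
    return correct / len(top)

# Made-up ranked keyphrases and search snippets.
ranked = ["graph ranking", "hashtag", "weather"]
snippets = ["Graph ranking methods for text", "Using a hashtag on Twitter"]
print(precision_at_k(ranked, snippets, k=10))
```

Here two of the three keyphrases occur in the snippets, so the precision is 2/3.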
  36. Experimental Results
     Automatic approach using a search engine, top-10 keyphrases:

     Approach                                      Precision   Bpref
     Single NE-Rank                                0.40        0.67
     Expanded with Hashtags – Frequency Approach   0.45        0.52
     Expanded with Hashtags – Hybrid Approach      0.55        0.73
  37. Conclusion
     • Twitter Introduction
     • Why Extracting Keyphrases in Twitter?
     • Learning from Twitter Hashtags
     • Twitter Lexical Graph Expansion
     • Proposed Approach for Graph Expansion
     • How to Choose Hashtags
       • Frequency Approach
       • Hybrid Approach
     • How to Build Lexical Graph
     • Topic Modeling
     • Graph-based Ranking Scheme
     • Experiments
     • Experimental Results
     • Conclusion
  38. References
     • [1] Liu, et al., 2010. “Automatic Keyphrase Extraction via Topic Decomposition”
     • [2] Lin, Snow, & Morgan. “Smoothing Techniques for Adaptive Online Language Models: Topic Tracking in Tweet Streams”
     • [3] Liu, et al., 2011. “Why is “SXSW” Trending? Exploring Multiple Text Sources For Twitter Topic Summarization”
     • [4] X. Wan and J. Xiao. “Single document keyphrase extraction using neighborhood knowledge”
     • [5] Weng, et al., 2010. “TwitterRank: Finding Topic-sensitive Influential Twitterers”
     • [6] Zhao, et al., 2011. “Topical Keyphrase Extraction from Twitter”
     • [7] Mihalcea & Tarau. “TextRank: Bringing order into texts”
     • [8] Bellaachia & Al-Dhelaan. “NE-Rank: A Novel Graph-based Keyphrase Extraction in Twitter”, in press
  39. The End. Thank You!
