Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

How to automatically distinguish between high-quality and low-quality content in Twitter?

Twitter is a rapidly growing microblogging platform that provides content in great volume and diversity, and of varying quality. In order to provide higher-quality content (e.g. posts mentioning news, events, useful facts or well-formed opinions) when a user searches for tweets on Twitter, we propose a new method to filter and rank tweets according to their quality. To model the quality of tweets, we devise a new set of link-based features in addition to content-based features. We examine the implicit links between tweets, URLs, hashtags and users, and propose novel metrics to reflect the popularity as well as the quality-based reputation of websites, hashtags and users. We then evaluate both the content-based and link-based features in terms of classification effectiveness and identify an optimal feature subset that achieves the best classification accuracy.

Presentation given at the DASFAA 2012 conference (15-18 April 2012, Busan, South Korea).
Authors: Jan Vosecky, Kenneth Wai-Ting Leung, and Wilfred Ng
Full paper: http://www.cse.ust.hk/~wilfred/paper/dasfaa12.pdf


Presentation Transcript

  • Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links. Jan Vosecky, Kenneth Wai-Ting Leung, and Wilfred Ng. Department of Computer Science and Engineering, HKUST, Hong Kong. DASFAA'12
  • Agenda: Introduction; Proposed method; Quality features of tweets; Experiments; Conclusions
  • Introduction
  • Microblogs. Twitter is both a social network and social media: links between users (follow, mention, re-tweet), and users post updates (tweets). [Figure: two example tweets annotated with user, timestamp, mentioned user, hashtag and URL link]
  • Searching for "ipad" on Twitter: around 50 tweets mentioning "iPad" posted within a 1-minute period.
  • Research challenge. Twitter is user-generated content: short messages, often comments or opinions; high volume; varying quality. "Most tweets are not of general interest (57%)" (Alonso et al. '10). The result is information overload. Research questions: How to distinguish content worth reading from useless or less important messages? How to promote 'high quality' content?
  • Defining 'quality'. A general (global) definition for assessing tweet quality, based on 3 criteria. Well-formedness: + well-written, grammatically correct, understandable; - heavy slang, misspellings, excessive punctuation. Factuality: + news, events, announcements; - unclear message, private conversations, generic personal feelings. Navigational quality (URL links): + reputable external resources (e.g. news articles).
  • Quality-based tweet filtering. [Figure: example tweet stream with tweets marked + or - by quality]
  • Quality-based tweet ranking. [Figure: example tweets ranked by quality scores 5, 4, 3, 1, 1]
  • Research goals. Quality-based tweet filtering: filtering out low-quality tweets, both in Twitter feeds and in search results. Quality-based tweet ranking: re-ranking Twitter search results for a given time period.
  • Proposed Method
  • Representation of tweets. The vector-space model is not sufficient: tweets are short, terms are often malformed, and it ignores Twitter's special features. Instead, a feature-vector representation: extract features from each tweet, both traditional features (e.g. length, spelling) and Twitter-specific features exploiting hashtags, URL links and mentioned usernames.
  • Quality Features of Tweets
  • Feature categories, in five groups:
    1. Punctuation and spelling: number of exclamation marks; number of question marks; max. no. of repeated letters; % of correctly spelled words; no. of capitalized words; max. no. of consecutive capitalized words
    2. Syntactic and semantic complexity: max. & avg. word length; length of tweet; percentage of stopwords; contains numbers; contains a measure; contains emoticons; uniqueness score
    3. Grammaticality: has first-person part-of-speech; formality score; number of proper names; max. no. of consecutive proper names; number of named entities
    4. Link-based: contains link; is reply-tweet; is re-tweet; no. of mentions of users; number of hashtags; URL domain reputation score; RT source reputation score; hashtag reputation score
    5. Timestamp
  • 1. Punctuation and spelling. Excessive punctuation: number of exclamation marks, number of question marks, max. number of consecutive dots. Capitalization: presence of all-capitalized words, largest number of consecutive words in capital letters. Spellchecking: number of correctly spelled words, percentage of words found in a dictionary. Example: "RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!?? lls. Hes only the greatest guy next to jesus lmao"
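These features are cheap to compute from the raw text. A minimal sketch in Python, assuming a simple whitespace tokenizer and a set-based dictionary (both illustrative choices, not the paper's implementation):

```python
import re

def punctuation_spelling_features(tweet, dictionary):
    """Illustrative sketch of the punctuation and spelling features."""
    words = tweet.split()
    # longest run of consecutive all-capitalized words
    run = max_caps_run = 0
    for w in words:
        run = run + 1 if w.isupper() and len(w) > 1 else 0
        max_caps_run = max(max_caps_run, run)
    known = [w for w in words if w.lower().strip(".,!?:;") in dictionary]
    return {
        "num_exclamations": tweet.count("!"),
        "num_questions": tweet.count("?"),
        "max_consecutive_dots": max((len(m) for m in re.findall(r"\.+", tweet)), default=0),
        "has_all_caps_word": any(w.isupper() and len(w) > 1 for w in words),
        "max_consecutive_caps_words": max_caps_run,
        "num_correctly_spelled": len(known),
        "pct_in_dictionary": len(known) / len(words) if words else 0.0,
    }

print(punctuation_spelling_features(
    "RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!??", {"who", "is", "chuck"}))
```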
  • 2. Syntactic and semantic complexity. Syntactic complexity: tweet length, max. & avg. word length, percentage of stopwords, presence of emoticons and other sentiment indicators, presence of measure symbols ($, %), number of digits. Tweet uniqueness: the uniqueness of the tweet relative to other tweets by the same author.
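The slide's uniqueness formula did not survive the transcript extraction, so the following is a hypothetical reading, assuming uniqueness is one minus the tweet's average word-set overlap with the author's other tweets:

```python
def uniqueness_score(tweet, authors_other_tweets):
    """Hypothetical reconstruction: the slide's formula image is missing, so
    this assumes uniqueness = 1 - mean Jaccard overlap with the author's
    other tweets (higher = more unique)."""
    words = set(tweet.lower().split())
    if not authors_other_tweets or not words:
        return 1.0
    overlaps = []
    for other in authors_other_tweets:
        other_words = set(other.lower().split())
        union = words | other_words
        overlaps.append(len(words & other_words) / len(union) if union else 0.0)
    return 1.0 - sum(overlaps) / len(overlaps)
```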
  • 3. Grammaticality. Parts-of-speech labelling: presence of first-person parts-of-speech. Formality score [Heylighen '02]: F = (noun freq. + adjective freq. + preposition freq. + article freq. − pronoun freq. − verb freq. − adverb freq. − interjection freq. + 100) / 2. Names: number of 'proper names' (words with a single initial capital letter), number of consecutive 'proper names', number of named entities. Reference: F. Heylighen and J.-M. Dewaele. Variation in the contextuality of language: An empirical measure. Context in Context, special issue of Foundations of Science, 7(3):293-340, 2002.
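The formality score maps directly onto the formula above. A small sketch, assuming the part-of-speech frequencies are expressed as percentages of all tokens (as in Heylighen & Dewaele's definition):

```python
def formality_score(pos_counts):
    """F-score from the slide's formula; pos_counts maps a coarse POS tag
    (noun, adjective, preposition, article, pronoun, verb, adverb,
    interjection) to its token count in the tweet."""
    total = sum(pos_counts.values()) or 1
    freq = lambda tag: 100.0 * pos_counts.get(tag, 0) / total
    return (freq("noun") + freq("adjective") + freq("preposition") + freq("article")
            - freq("pronoun") - freq("verb") - freq("adverb") - freq("interjection")
            + 100) / 2

# e.g. a noun-heavy, pronoun-free tweet scores above the 50-point midpoint
print(formality_score({"noun": 4, "verb": 1, "article": 1}))
```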
  • 4. Link-based features. Links to other items: re-tweet (RT), reply tweet, mention of other users; presence of a URL link; number of hashtags, as indicated by the "#" sign. Link target's quality reputation: metrics to reflect the quality of tweets that relate to a URL domain, a hashtag, or a user.
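The structural markers can be read off the raw tweet text with a few regular expressions. A minimal sketch (the exact parsing rules are assumptions, not the paper's code):

```python
import re

def link_features(tweet):
    """Presence/count features for re-tweets, replies, URLs, mentions, hashtags."""
    return {
        "is_retweet": tweet.startswith("RT @"),
        "is_reply": tweet.startswith("@"),
        "contains_link": bool(re.search(r"https?://\S+", tweet)),
        "num_user_mentions": len(re.findall(r"@\w+", tweet)),
        "num_hashtags": len(re.findall(r"#\w+", tweet)),
    }

print(link_features("RT @user: great #DASFAA talk, slides at http://example.com"))
```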
  • URL domain reputation. Observation: tweets that link to news articles are usually of better quality than tweets that link to photo-sharing websites. [Figure: tweets with quality labels Q=1 to Q=5 linking to tweetpic.com and nytimes.com] Questions: What does the quality of the tweets linking to a website say about that website's quality? Can we predict the quality of future tweets linking to that website?
  • URL domain reputation. Step 1: URL translation, from short link to original link, e.g. bit.ly/e2jt9F → http://www.reuters.com/4151120. Step 2: summarize the tweets linking to each URL domain, accumulating a "quality reputation" over time.
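Step 1 can be implemented by following the shortener's HTTP redirect chain; a sketch using the third-party requests library (the slides do not specify the actual resolution mechanism):

```python
from urllib.parse import urlparse

import requests  # third-party: pip install requests

def resolve_short_url(short_url, timeout=5):
    """Follow redirects from a shortened link (e.g. bit.ly) to the final URL."""
    try:
        resp = requests.head(short_url, allow_redirects=True, timeout=timeout)
        return resp.url
    except requests.RequestException:
        return short_url  # on failure, fall back to the original link

def url_domain(short_url):
    """The domain that the reputation score is accumulated for."""
    return urlparse(resolve_short_url(short_url)).netloc

print(url_domain("http://bit.ly/e2jt9F"))  # ideally resolves to www.reuters.com
```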
  • URL domain reputation. Average URL domain quality: AvgQ(d) = (1 / |Td|) · Σ_{t ∈ Td} q_t, where Td is the set of tweets linking to domain d and q_t is the quality label of tweet t. Weakness: the score does not reflect the number of inlink tweets, so it favours domains with few inlink tweets.
  • URL domain reputation. Domain reputation score: DRS(d) = AvgQ(d) · log10(|Td|), where AvgQ(d) lies in [-1, +1]. "Collecting evidence" behaviour: the score grows as more good-quality inlink tweets accumulate. [Plot: DRS (-4.00 to 4.00) against |Td| (log scale, 1 to 1000), one curve per fixed AvgQ in {-1, -0.5, 0, 0.5, 1}]
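The DRS formula image is not in the transcript, but DRS(d) = AvgQ(d) · log10(|Td|) reproduces the example scores on the following slide (e.g. gallup.com: 0.96 × log10(99) ≈ 1.92), so the sketch below assumes that form, with quality labels mapped to {-1, +1}:

```python
import math

def domain_reputation_score(quality_labels):
    """Assumed reconstruction: DRS(d) = AvgQ(d) * log10(|Td|), where
    quality_labels holds one value in {-1, +1} per tweet linking to d."""
    if not quality_labels:
        return 0.0
    avg_q = sum(quality_labels) / len(quality_labels)  # AvgQ(d) in [-1, +1]
    return avg_q * math.log10(len(quality_labels))

# ~97 high-quality vs 2 low-quality inlink tweets: AvgQ ~ 0.96, DRS ~ 1.92,
# matching gallup.com in the table on the next slide
print(domain_reputation_score([+1] * 97 + [-1] * 2))
```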
  • URL domain reputation: example scores.
    Domains with a high DRS (mainly news-oriented sites):
      Domain         AvgQ   Inlinks   DRS
      gallup.com     0.96   99        1.92
      mashable.com   0.79   97        1.58
      hrw.org        0.86   57        1.51
      foxnews.com    0.68   38        1.08
      good.is        0.68   31        1.01
      intuit.com     0.57   60        1.01
      forbes.com     0.68   19        0.87
      reuters.com    1.00   6         0.78
      cnn.com        0.36   85        0.70
    Domains with a low DRS (mainly image and location sharing sites):
      Domain           AvgQ    Inlinks   DRS
      tweetphoto.com   -0.77   106       -1.57
      twitpic.com      -0.75   113       -1.54
      twitlonger.com   -0.85   66        -1.54
      myloc.me         -0.85   54        -1.48
      instagr.am       -0.62   52        -1.06
      formspring.me    -0.78   18        -0.98
      yfrog.com        -0.55   53        -0.94
      lockerz.com      -0.63   16        -0.75
      qik.com          -0.75   8         -0.68
  • Reputation of hashtag & user. The same idea applies to hashtags and re-tweet source users: hashtag reputation (#DASFAA vs. #justforfun) and re-tweet source user reputation (@barackobama vs. @wysz22212). [Figure: tweets with quality labels Q=1 to Q=5 linking to the hashtags #justforfun and #DASFAA]
  • Experiments
  • Dataset: 10,000 tweets (100 users, 100 recent tweets per user). Users: 50 random users and 50 influential users, the latter selected from listorious.com across 5 categories (technology, business, politics, celebrities, activism), 10 users per category.
  • Labelling: crowdsourcing via Amazon Mechanical Turk; 3 labels per tweet, each from a different reviewer. Possible labels: 1 to 5 (1 = low quality, 5 = high quality); tweets presented in random order.
  • Labelling results. [Figure: distribution of tweet quality scores]
  • Feature analysis. 29 features in total. Top 5 features by Information Gain: domain reputation (0.374), contains link (0.287), formality score (0.130), num. proper names (0.127), max. proper names (0.113).
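Information-gain rankings of this kind can be reproduced with off-the-shelf tooling; a sketch using scikit-learn's mutual-information scorer on synthetic stand-in data (the real 29-feature matrix is not reproduced here):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((1000, 29))                                  # stand-in for the 29 tweet features
y = (X[:, 0] + 0.3 * rng.random(1000) > 0.65).astype(int)   # synthetic quality labels

scores = mutual_info_classif(X, y, random_state=0)
for i in np.argsort(scores)[::-1][:5]:
    print(f"feature {i}: information gain ~ {scores[i]:.3f}")
```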
  • Feature selection. Greedy attribute selection yields 15 selected features: domain reputation, RT source reputation, formality, tweet uniqueness, no. named entities, % correctly spelled words, max. no. repeated letters, no. hashtags, contains numbers, no. capitalized words, is reply-tweet, is re-tweet, avg. word length, contains first-person, no. exclamation marks.
  • Classification and ranking method. Classification: SVM, binary classification (high-quality vs. low-quality), 50/50 training/testing split. Ranking: learning-to-rank (RankSVM), 30 queries from 5 topic categories. Process: 1. retrieve tweets matching a query; 2. extract features from the tweets; 3. form "query-tweet vector" pairs together with the quality scores of the tweets.
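A minimal end-to-end sketch of the classification setup with scikit-learn, again on synthetic stand-in features (the slides do not give the SVM parameters used in the paper):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.random((2000, 15))                 # stand-in for the 15 selected features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # synthetic high/low-quality labels

# 50/50 training/testing split, as on the slide
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)
clf = SVC(probability=True, random_state=42).fit(X_tr, y_tr)
print("AUC:", round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 3))
```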
  • Classification results:
      Features                    #attrs   HQ-Prec   HQ-Rec   LQ-Prec   LQ-Rec   AUC
      Link only                   1        0.798     0.702    0.894     0.934    0.818
      TF-IDF                      3322     0.862     0.665    0.885     0.960    0.813
      Subset.Reputation           3        0.812     0.746    0.909     0.936    0.841
      Subset.SVM ("greedy")       15       0.715     0.758    0.912     0.936    0.847
      All quality features        29       0.815     0.660    0.882     0.944    0.802
      All quality ftrs + TF-IDF   3351     0.739     0.775    0.915     0.899    0.837
    (Precision and recall are reported separately for the High-Quality (HQ) and Low-Quality (LQ) classes.)
    The optimal feature set (15 attrs.) outperforms TF-IDF (3322 attrs.); the link-based "reputation" features (3 attrs.) achieve the 2nd best result; combining the quality features with TF-IDF does not improve the result.
  • Classification results (efficiency):
      Features                    #attrs   AUC
      Link only                   1        0.818
      TF-IDF                      3322     0.813
      Subset.Reputation           3        0.841
      Subset.SVM ("greedy")       15       0.847
      All quality features        29       0.802
      All quality ftrs + TF-IDF   3351     0.837
    The optimal feature set also achieves reduced training time and storage cost. [Charts: training time and storage cost per feature set]
  • Ranking results:
      Features                #attrs   NDCG@1   NDCG@2   NDCG@5   NDCG@10   MAP
      Link only               1        0.067    0.111    0.220    0.324     0.398
      Subset.Reputation       3        0.822    0.777    0.777    0.764     0.661
      Subset.SVM ("greedy")   15       0.867    0.767    0.778    0.769     0.653
      All quality features    29       0.733    0.733    0.763    0.753     0.637
    The optimal feature set (15 attrs.) achieves the best result; the link-based "reputation" features (3 attrs.) achieve the 2nd best result.
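For reference, NDCG@N under the usual graded-gain definition (the slide's own formula image did not survive extraction, so the standard form is assumed here):

```python
import math

def ndcg_at_n(relevances, n):
    """NDCG@N with gain (2^rel - 1) / log2(position + 1); relevances are the
    graded quality labels of the ranked tweets, in ranked order."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:n])
    return dcg(relevances[:n]) / ideal if ideal > 0 else 0.0

# a ranking that demotes the quality-5 tweet to second place scores ~0.78
print(ndcg_at_n([3, 5, 4, 1, 1], n=5))
```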
  • Conclusions
  • Summary. A method for quality-based classification and ranking of tweets. Proposed and evaluated a set of tweet features that capture a tweet's quality. Link-based features lead to the best performance.
  • Future work. Consider different types of queries in Twitter (e.g. searching for hot topics, movie reviews, facts, opinions); different features may be important in different scenarios. Incorporating recent hot topics. Personalized re-ranking.
  • Q/A
  • Thank You
  • Related work. Spam detection: bag-of-words and keyword-based methods, feature-based approaches, combinations of the two, and social-network signals. Finding quality answers in Q&A systems (e.g. Yahoo! Answers): feature-based. Web search: quality-based ranking of web documents; feature-based quality score (WSDM'11).
  • ROC Curve. The area under the ROC curve is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
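This probabilistic definition can be computed directly by comparing all positive-negative score pairs; a small sketch:

```python
def auc_rank_probability(pos_scores, neg_scores):
    """AUC = P(random positive ranked above random negative); ties count half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc_rank_probability([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))  # 8/9 ~ 0.889
```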