Searching for Quality Microblog Posts:
Filtering and Ranking based on Content
Analysis and Implicit Links


Jan Vosecky, Kenneth Wai-Ting Leung, and Wilfred Ng
Department of Computer Science and Engineering
HKUST
Hong Kong

DASFAA '12
      Agenda


         Introduction
         Proposed method
         Quality features of tweets
         Experiments
         Conclusions
   Introduction


      Microblogs


          [Figure: two example tweets annotated with the user, a mentioned user,
           the timestamp, a hashtag, and a URL link]

         Both social network and social media
           Links between users (follow, mention, re-tweet)
           Users post updates (tweets)
      Searching for “ipad” on Twitter




          [Screenshot: around 50 tweets mentioning “iPad” posted within a 1-minute period]
      Research challenge


          Twitter: user-generated content
            Short messages, often comments or opinions
            High volume
            Varying quality
               "Most tweets are not of general interest (57%)" (Alonso et al. '10)
               Information overload
          Research questions:
            How to distinguish content worth reading from useless or less important messages?
            How to promote "high quality" content?
      Defining "quality"


          General (global) definition for assessing tweet quality
          3 criteria:
               Well-formedness
                + Well-written, grammatically correct, understandable
                - Heavy slang, misspellings, excessive punctuation
               Factuality
                + News, events, announcements
                - Unclear message, private conversations, generic personal feelings
               Navigational quality (URL links)
                + Reputable external resources (e.g. news articles)
      Quality-based tweet filtering

          [Figure: a list of tweets, each marked "+" (keep) or "-" (filter out)]
      Quality-based tweet ranking

          [Figure: the same tweets re-ordered by quality score, from 5 down to 1]
      Research goals


          Quality-based tweet filtering
            Filtering out low-quality tweets
                 In Twitter feeds
                 In search results

          Quality-based tweet ranking
            Re-ranking Twitter search results
                 For a given time period
   Proposed Method


      Representation of tweets


         Vector-space model: not sufficient
           Short tweet length, terms often malformed
           Ignores special features in Twitter

         Feature-vector representation
           Extract features from tweet
           Traditional features: e.g. length, spelling

            Twitter-specific features:
                 Exploiting hashtags, URL links, mentioned usernames
   Quality Features of Tweets


      Feature categories

           1. Punctuation and Spelling: number of exclamation marks, number of question
              marks, max. no. of repeated letters, % of correctly spelled words, no. of
              capitalized words, max. no. of consecutive capitalized words
           2. Syntactic and semantic complexity: max. & avg. word length, length of tweet,
              percentage of stopwords, contains numbers, contains a measure, contains
              emoticons, uniqueness score
           3. Grammaticality: has first-person part-of-speech, formality score, number of
              proper names, max. no. of consecutive proper names, number of named entities
           4. Link-based: contains link, is reply-tweet, is re-tweet, no. of mentions of
              users, number of hashtags, URL domain reputation score, RT source reputation
              score, hashtag reputation score
           5. Timestamp
      1. Punctuation and spelling


         Excessive punctuation
              Number of exclamation marks
              Number of question marks
              Max. number of consecutive dots
         Capitalization
              Presence of all-capitalized words
              Largest number of consecutive words in capital letters
         Spellchecking
              Number of correctly spelled words
              Percentage of words found in a dictionary
                    RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!??
                    lls. He's only the greatest guy next to jesus lmao
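
          Below is a minimal Python sketch (not the authors' code) of how these punctuation
          and spelling features could be computed; the tiny word set standing in for a real
          dictionary and the simple tokenizer are illustrative assumptions.

```python
import re

# Tiny stand-in word list; a real spellchecker or dictionary would be used in practice.
DICTIONARY = {"who", "is", "chuck", "norris", "he's", "only", "the",
              "greatest", "guy", "next", "to", "jesus"}

def punctuation_spelling_features(tweet: str) -> dict:
    words = re.findall(r"[A-Za-z']+", tweet)
    # Longest run of consecutive all-capitalized words.
    max_consec_caps, run = 0, 0
    for w in words:
        run = run + 1 if w.isupper() and len(w) > 1 else 0
        max_consec_caps = max(max_consec_caps, run)
    return {
        "num_exclamation_marks": tweet.count("!"),
        "num_question_marks": tweet.count("?"),
        "max_consecutive_dots": max((len(m) for m in re.findall(r"\.+", tweet)), default=0),
        "has_all_caps_word": any(w.isupper() and len(w) > 1 for w in words),
        "max_consecutive_capitalized": max_consec_caps,
        "pct_words_in_dictionary": (sum(w.lower() in DICTIONARY for w in words) / len(words)
                                    if words else 0.0),
    }

print(punctuation_spelling_features(
    "RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!?? lls. "
    "He's only the greatest guy next to jesus lmao"))
```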
      2. Syntactic and semantic complexity
         Syntactic complexity
              Tweet length
              Max. & avg. word length
              Percentage of stopwords
              Presence of emoticons and other sentiment indicators
              Presence of measure symbols ($, %)
              Numbers – number of digits
          Tweet uniqueness
               Uniqueness of the tweet relative to other tweets by the author
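
          For illustration, the syntactic complexity features might be extracted as in the
          sketch below; the stopword list and emoticon pattern are small stand-ins, and the
          uniqueness score (which compares a tweet to the author's other tweets) is not
          included here.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "for"}
EMOTICON = re.compile(r"[:;=8][-^']?[)(DPpO3/\\|]")  # crude emoticon pattern

def syntactic_features(tweet: str) -> dict:
    words = re.findall(r"[A-Za-z']+", tweet)
    lengths = [len(w) for w in words] or [0]
    return {
        "tweet_length": len(tweet),
        "max_word_length": max(lengths),
        "avg_word_length": sum(lengths) / len(lengths),
        "pct_stopwords": sum(w.lower() in STOPWORDS for w in words) / (len(words) or 1),
        "contains_emoticon": bool(EMOTICON.search(tweet)),
        "contains_measure": any(sym in tweet for sym in ("$", "%")),
        "num_digits": sum(ch.isdigit() for ch in tweet),
    }
```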
      3. Grammaticality


          Parts-of-speech labelling
               Presence of first-person parts-of-speech
               Formality score [Heylighen '02]
                   F = (noun freq. + adjective freq. + preposition freq. + article freq.
                        − pronoun freq. − verb freq. − adverb freq. − interjection freq. + 100) / 2
          Names
               Number of 'proper names', i.e. words with a single initial capital letter
               Number of consecutive 'proper names'
               Number of named entities




           F. Heylighen and J.-M. Dewaele. Variation in the contextuality of language: An empirical measure.
           Context in Context. Special issue Foundations of Science, 7(3):293–340, 2002.
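
           As an illustration, the formality score could be computed from part-of-speech
           counts roughly as follows. This is a sketch using NLTK's universal tagset;
           mapping DET to articles and X to interjections is our coarse approximation, not
           a detail taken from the paper.

```python
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
#           nltk.download("universal_tagset")
import nltk
from collections import Counter

def formality_score(text: str) -> float:
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text), tagset="universal")]
    counts, total = Counter(tags), max(len(tags), 1)
    freq = lambda tag: 100.0 * counts[tag] / total  # frequency as a percentage of all words
    # F = (noun + adjective + preposition + article
    #      - pronoun - verb - adverb - interjection + 100) / 2
    # ADP stands in for prepositions, DET for articles, X for interjections.
    return (freq("NOUN") + freq("ADJ") + freq("ADP") + freq("DET")
            - freq("PRON") - freq("VERB") - freq("ADV") - freq("X") + 100) / 2
```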
      4. Link-based features


          Links to other items
            Re-tweet (RT), reply tweet, mention of other users
            Presence of a URL link
            Number of hashtags, as indicated by the "#" sign

          Link target's quality reputation
            Metrics to reflect the quality of tweets which relate to a
                 URL domain
                 Hashtag
                 User
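
          A hedged sketch of extracting the structural link features with simple regular
          expressions; real Twitter entity parsing (and the reputation scores described
          next) would need more care.

```python
import re

def link_features(tweet: str) -> dict:
    # Simplified patterns; Twitter's own entity extraction rules are more involved.
    urls = re.findall(r"https?://\S+|\b\w+\.\w{2,}/\S+", tweet)
    hashtags = re.findall(r"#\w+", tweet)
    mentions = re.findall(r"@\w+", tweet)
    return {
        "contains_link": bool(urls),
        "is_retweet": tweet.startswith("RT @") or " RT @" in tweet,
        "is_reply": tweet.startswith("@"),
        "num_mentions": len(mentions),
        "num_hashtags": len(hashtags),
    }
```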
      URL domain reputation


          Observation:
               Tweets which link to news articles are usually of better quality than
                tweets which link to photo-sharing websites

          [Figure: tweets with quality labels Q=1, Q=2 and Q=3 linking to Tweetpic.com,
           vs. tweets with Q=4 and Q=5 linking to NYtimes.com]

         Questions:
               What does the quality of tweets linking to a website say about its
                quality?
               Can we predict quality of future tweets linking to that website?
      URL domain reputation


          Step 1: URL translation
               Short link to original link
                  bit.ly/e2jt9F → http://www.reuters.com/4151120

          Step 2: summarize tweets linking to a URL domain
               Accumulate "quality reputation" over time
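
          One way Step 1 could look in practice, sketched with the `requests` library; the
          error handling and the stripping of "www." are our simplifications.

```python
import requests
from urllib.parse import urlparse

def resolve_domain(short_url: str) -> str:
    """Resolve a (possibly shortened) link and return only its domain."""
    if not short_url.startswith("http"):
        short_url = "http://" + short_url
    try:
        # Follow redirects without downloading the page body.
        resp = requests.head(short_url, allow_redirects=True, timeout=5)
        final_url = resp.url
    except requests.RequestException:
        final_url = short_url  # fall back to the original link
    netloc = urlparse(final_url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

# Per the slide's example, resolve_domain("bit.ly/e2jt9F") would yield "reuters.com".
```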
      URL domain reputation


          Average URL domain quality

               AvgQ(d) = (1 / |Td|) · Σ_{t ∈ Td} qt

               Td = set of tweets linking to domain d
               qt = quality label of tweet t

               Weakness:
                   Does not reflect the number of inlink tweets in the score
                   Favours domains with few inlink tweets
      URL domain reputation


          Domain reputation score

               DRS(d) = AvgQ(d) · log10(|Td|),  where AvgQ(d) is in [-1, +1]

               "Collecting evidence" behaviour:
                   The score gets higher as more good-quality inlink tweets accumulate

               [Plot: DRS vs. |Td| (1 to 1000, log scale) for AvgQ values from -1 to +1,
                with DRS ranging from about -4.00 to +4.00]
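
          Taking the formula above at face value, both domain statistics can be computed
          from (domain, quality) pairs as in this sketch; quality labels are assumed to be
          already rescaled to the [-1, +1] range.

```python
import math
from collections import defaultdict

def domain_reputation(tweets):
    """tweets: iterable of (domain, quality) pairs, quality already scaled to [-1, +1]."""
    by_domain = defaultdict(list)
    for domain, quality in tweets:
        by_domain[domain].append(quality)
    scores = {}
    for domain, qualities in by_domain.items():
        avg_q = sum(qualities) / len(qualities)       # AvgQ(d)
        drs = avg_q * math.log10(len(qualities))      # DRS(d) = AvgQ(d) * log10(|Td|)
        scores[domain] = (avg_q, drs)
    return scores

# e.g. 99 inlink tweets averaging +0.96 give DRS ≈ 1.92 (cf. gallup.com in the table below).
```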
      URL domain reputation




      10 domains with a high DRS:                 10 domains with a low DRS:
      Domain            AvgQ   Inlinks    DRS     Domain            AvgQ   Inlinks    DRS
      gallup.com         0.96      99     1.92    tweetphoto.com   -0.77      106    -1.57
      mashable.com       0.79      97     1.58    twitpic.com      -0.75      113    -1.54
      hrw.org            0.86      57     1.51    twitlonger.com   -0.85       66    -1.54
      foxnews.com        0.68      38     1.08    myloc.me         -0.85       54    -1.48
      good.is            0.68      31     1.01    instagr.am       -0.62       52    -1.06
      intuit.com         0.57      60     1.01    formspring.me    -0.78       18    -0.98
      forbes.com         0.68      19     0.87    yfrog.com        -0.55       53    -0.94
      reuters.com        1.00       6     0.78    lockerz.com      -0.63       16    -0.75
      cnn.com            0.36      85     0.70    qik.com          -0.75        8    -0.68

      Mainly news-oriented sites                  Mainly image- and location-sharing sites
      Reputation of hashtag & user




          [Figure: tweets with quality labels Q=1, Q=2 and Q=3 tagged #justforfun,
           vs. tweets with Q=4 and Q=5 tagged #DASFAA]

          Hashtag reputation:                 #DASFAA vs. #justforfun
          Re-tweet source user reputation:    @barackobama vs. @wysz22212
   Experiments


      Dataset


         10,000 tweets
           100    users, 100 recent tweets per user
         Users:
           50 random users
           50 influential users
                Selected  from listorious.com
                5 categories: technology, business, politics,
                 celebrities, activism
                10 users per category
      Labelling


         Crowdsourcing
              Amazon Mechanical Turk
         3 labels per tweet from different reviewers
         Possible labels: 1 to 5
              1 = low quality, 5 = high quality
         Random order of tweets
      Labelling results


          Tweet quality distribution
               [Chart: distribution of crowdsourced quality scores (1 to 5) over the dataset]
      Feature analysis


         Total 29 features
         Top 5 features based on Information Gain:

                   0.374   Domain reputation
                   0.287   Contains link
                   0.130   Formality score
                   0.127   Num. proper names
                   0.113   Max. proper names
      Feature selection


          Greedy attribute selection
            15 selected features:

                Domain reputation                RT source reputation
                Formality                        Tweet uniqueness
                No. named entities               % correctly spelled words
                Max. no. repeated letters        No. hashtags
                Contains numbers                 No. capitalized words
                Is reply-tweet                   Is re-tweet
                Avg. word length                 Contains first-person
                No. exclamation marks
      Classification and Ranking Method
          Classification:
            SVM, binary classification (high-quality vs. low-quality); see the sketch below
            50/50 split for training/testing

          Ranking:
            Learning-to-rank (RankSVM)
            30 queries from 5 topic categories

            Process:
                1.   Retrieve tweets matching a query
                2.   Extract features from the tweets
                3.   "Query-tweet vector" pairs + quality scores of the tweets form the training data
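
          A minimal sketch of the classification setup with scikit-learn; the RBF kernel
          and the threshold used to binarise the 1-5 crowdsourced scores are our
          assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def train_quality_classifier(X: np.ndarray, scores: np.ndarray):
    """X: feature matrix (one row per tweet); scores: crowdsourced quality labels 1-5."""
    y = (scores >= 4).astype(int)  # assumed binarisation of the 1-5 labels
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    clf = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return clf, auc
```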
      Classification results


       Features                     #attributes   High-Quality      Low-Quality        Overall
                                                  P       R         P        R         AUC
       Link only                         1        0.798   0.702     0.894    0.934     0.818
       TF-IDF                         3322        0.862   0.665     0.885    0.960     0.813
       Subset.Reputation                 3        0.812   0.746     0.909    0.936     0.841
       Subset.SVM ("greedy")            15        0.715   0.758     0.912    0.936     0.847
       All quality features             29        0.815   0.660     0.882    0.944     0.802
       All quality ftrs + TF-IDF      3351        0.739   0.775     0.915    0.899     0.837


      Optimal feature set (15 attrs.) outperforms TF-IDF (3322 attrs.)
      Link-based “reputation” features (3 attrs.) achieve the 2nd best result
      Combining quality features + TF-IDF does not improve result
      Classification results



   Features                     #attributes    AUC
   Link only                         1         0.818
   TF-IDF                         3322         0.813
   Subset.Reputation                 3         0.841
   Subset.SVM ("greedy")            15         0.847
   All quality features             29         0.802
   All quality ftrs + TF-IDF      3351         0.837

    Optimal feature set achieves reduced training time and storage cost
        [Charts: storage cost and training time for each feature set]
      Ranking results



                                                  NDCG@N
       Features                  #attributes      1       2       5       10       MAP
       Link only                      1           0.067   0.111   0.220   0.324    0.398
       Subset.Reputation              3           0.822   0.777   0.777   0.764    0.661
       Subset.SVM ("greedy")         15           0.867   0.767   0.778   0.769    0.653
       All quality features          29           0.733   0.733   0.763   0.753    0.637



      Optimal feature set (15 attrs.) achieves the best result
      Link-based “reputation” features (3 attrs.) achieve the 2nd best result
   Conclusions


      Summary


         Method for quality-based classification and
          ranking of tweets
          Proposed and evaluated a set of tweet
           features that capture tweet quality
         Link-based features lead to the best
          performance
      Future work


         Consider different types of queries in Twitter
           E.g. searching for hot topics, movie reviews,
            facts, opinions, etc.
           Different features may be important in different
            scenarios
         Incorporating recent hot topics
         Personalized re-ranking
      Q/A


      Thank You
Related work


        Spam detection
              Bag-of-words, keyword-based
              Feature-based approaches
              Combinations

        Social networks
            Finding quality answers in Q-A systems
              E.g. Yahoo Answers
              Feature-based

        Web search
            Quality-based ranking of web documents
                 Feature-based quality score (WSDM '11)
ROC Curve




        Area under the ROC curve: probability that a classifier
         will rank a randomly chosen positive instance higher
         than a randomly chosen negative one
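
         The pairwise-probability reading of AUC can be checked directly, e.g. against
         scikit-learn; the toy scores below are illustrative only.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

y_true  = [1, 1, 0, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.3, 0.1]

# Fraction of (positive, negative) pairs ranked correctly (ties count as 0.5).
pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]
pairs = list(product(pos, neg))
pairwise = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(pairwise, roc_auc_score(y_true, y_score))  # both print 0.8333...
```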
