3. Why social media?
Mining and analyzing data from blogs, posts, and tweets can:
• Support marketing and customer service activities
• Help decision making
• Enhance the products and services
• Improve the competitive advantage of companies
Twitter is one of the most important social media
resources. It supports different types of data: text,
pictures, and videos.
7/7/2016
4. Sarcasm poses problems for
algorithms in U.S. election 2016
In the race for the White House in 2016, election
campaigns rely on social media analysis to help
them tailor advertising and other outreach to
particular groups of voters.
Average follower growth, Jan 26 – Feb 26:
1. @realDonaldTrump 20,900
2. @BernieSanders 10,400
3. @HillaryClinton 10,300
4. @MarcoRubio 5,320
5. @TedCruz 3,950
6. @RealBenCarson 1,870
7. @JohnKasich 1,440
5. Stay Classy
A predictive analytics firm examined tweets
containing the word “classy” and found that 72
percent of them used it in a positive way.
But when the word appeared near the name of
Republican presidential candidate Donald Trump,
around three quarters of the tweets citing
"classy" were negative.
6. What is Sarcasm on Twitter
A sarcastic tweet: the speaker is clearly not
welcoming allergy season back.
Here, lexical clues alone could provide enough
information to detect the sarcasm.
7. What is Sarcasm on Twitter
Another sarcastic tweet: the speaker actually
supports the Democrats.
Detecting whether this one is sarcastic requires
contextual information surrounding the posting.
8. Sarcasm Detection on Twitter
The state-of-the-art method combines lexical and contextual
information to achieve robust classification performance.
In this project, I re-implement a recent method for automatic
sarcasm detection by Bamman and Smith (2015).
I use multiple approaches to extract a large amount of data and
apply machine learning models to classify tweets as sarcastic or
non-sarcastic.
9. DATA
Bamman dataset: 19,534 tweets, around half
sarcastic and the other half non-sarcastic.
Bamman shares the IDs of those tweets.
Tweets disappear over time, because users may
quit Twitter, protect their accounts from public
viewing, or delete tweets. After crawling,
I finally collected 17,926 tweets.
10. DATA
The labels of tweets are inferred from self-
declaration of sarcasm: a tweet is marked as
sarcastic if it contains the hashtag #sarcasm or
#sarcastic, and non-sarcastic otherwise.
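This labeling rule can be sketched as a small function; the regular expression is my own illustrative formulation of the hashtag check:

```python
import re

# A tweet self-declares sarcasm via the hashtag #sarcasm or #sarcastic.
SARCASM_TAGS = re.compile(r"#sarcas(m|tic)\b", re.IGNORECASE)

def infer_label(tweet_text):
    """Label a tweet from its self-declared sarcasm hashtags."""
    return "sarcastic" if SARCASM_TAGS.search(tweet_text) else "non-sarcastic"

infer_label("Oh great, allergy season again #sarcasm")  # "sarcastic"
infer_label("Love my new work #Work")                   # "non-sarcastic"
```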
12. DATA
Audience: the user who responded to the target
tweet, or was mentioned in the target tweet.
Original tweet: the tweet to which the target tweet
responded.
14. DATA EXTRACTION
Static web crawling: scrapes static web pages,
e.g., user profiles, and extracts text from the
HTML markup.
15. DATA EXTRACTION
Dynamic web crawling: focuses on the data sent from the
Twitter server when I interact with the website, e.g., scrolling
down the page to view more tweets from a user.
16. DATA EXTRACTION
Twitter Streaming API: makes it efficient to collect
public tweets. Twitter provides an interface to
developers through its API.
Limit: 1% of public tweets.
17. DATA PROCESSING
Remove tweets that are:
• Not in English
• Shorter than 3 words
• Retweets
Replace URLs and user mentions.
Remove the hashtags #sarcastic and #sarcasm from the sarcastic
tweets.
Normalize profile data, e.g.,
time-zone data are mapped to geographic areas using the Google
geocoder package.
Numbers on Twitter are displayed as strings, like ‘22K’ or ‘2
Million’; they are converted to numeric type.
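The count conversion can be sketched as follows; the suffix table is an assumption covering the common abbreviations:

```python
import re

# Assumed suffix multipliers for Twitter's abbreviated counts.
MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "million": 1_000_000,
               "b": 1_000_000_000}

def parse_count(text):
    """Convert strings like '22K' or '2 Million' to integers."""
    text = text.strip().lower().replace(",", "")
    match = re.match(r"([\d.]+)\s*([a-z]*)", text)
    value, suffix = float(match.group(1)), match.group(2)
    return int(value * MULTIPLIERS.get(suffix, 1))

parse_count("22K")       # 22000
parse_count("2 Million") # 2000000
```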
18. FEATURE ENGINEERING
In machine learning and pattern recognition, a feature is an
individual measurable property of a phenomenon being observed.
A similar concept is the explanatory variable used in statistical
techniques such as linear regression.
19. FEATURE ENGINEERING
Tweet features: represent the lexical and grammatical
information of the target tweet, using only the text of the
target tweet.
Author features: capture information about the author of
the target tweet, using the author's historical tweets and
profile information.
Audience features: encode information about the addressee
of the tweet, using the audience's historical tweets and
profile information, and the communication between the
audience and the author.
Response features: consider the interaction between the
target tweet and the tweet that it is responding to, using
the text of the original tweet.
20. TWEET FEATURES
Bag of Words: In this model, a text (such as a sentence or a
document) is represented as the bag (multiset) of its words,
disregarding grammar and even word order but keeping
multiplicity.
Vocabulary (stop words removed): get, am, work, not, love, new
“Get in am at work (not) #Work” → 1 1 1 1 0 0
“Love my new work #Work”        → 0 0 1 0 1 1
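A minimal sketch of this vectorization over the two example tweets, assuming a small illustrative stop-word list and skipping hashtag tokens (which do not appear in the vectors above):

```python
import re
from collections import Counter

# Illustrative stop-word subset; a real system would use a fuller list.
STOP_WORDS = {"in", "at", "my"}

def tokenize(text):
    # Lowercase, pick out hashtags and plain words, then drop hashtags
    # and stop words, matching the example vectors above.
    tokens = re.findall(r"#\w+|[a-z]+", text.lower())
    return [t for t in tokens if not t.startswith("#") and t not in STOP_WORDS]

def bow_vector(text, vocabulary):
    """Count occurrences of each vocabulary word (order is discarded)."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]

vocab = ["get", "am", "work", "not", "love", "new"]
bow_vector("Get in am at work (not) #Work", vocab)  # [1, 1, 1, 1, 0, 0]
bow_vector("Love my new work #Work", vocab)         # [0, 0, 1, 0, 1, 1]
```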
Pronunciation features: Twitter users have specific writing styles,
e.g., RT (Retweet), CHK (Check) and IIRC (If I recall correctly).
I count the number of words that contain only alphabetic characters
but no vowels, and the number of words with more than three syllables.
Wow! wtf man? RT @latimes: Gov. Brown signs bills to
raise smoking age to 21, restrict e-cigarettes
→ 2 no-vowel words (wtf, RT), 0 words with more than three syllables
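A sketch of these two counts, using a naive vowel-group heuristic as a stand-in for a real syllable counter (an assumption on my part):

```python
import re

VOWELS = set("aeiou")

def syllables(word):
    # Naive heuristic: one syllable per group of consecutive vowels.
    return len(re.findall(r"[aeiou]+", word))

def pronunciation_features(tweet):
    """Return (no-vowel word count, count of words with > 3 syllables)."""
    words = re.findall(r"[a-z]+", tweet.lower())
    no_vowel = sum(1 for w in words if not set(w) & VOWELS)
    long_words = sum(1 for w in words if syllables(w) > 3)
    return no_vowel, long_words

pronunciation_features("Wow! wtf man? RT @latimes: Gov. Brown signs bills")
# (2, 0) — "wtf" and "RT" have no vowels
```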
21. AUTHOR FEATURES
Author historical topics: historical topic features are inferred
with LDA using 100 topics over all historical tweets.
LDA, short for Latent Dirichlet Allocation, is a generative
statistical model that allows sets of observations to be explained
by unobserved groups that explain why some parts of the data are
similar (Blei, Ng, and Jordan 2003).
Author 1 (tweet01, tweet11, …, tweetX1) → Topic 1: 0.3232, Topic 2: 0.932, …, Topic 100: 0.1522
Author 2 (tweet02, tweet12, …, tweetX2) → Topic 1: 0.4232, Topic 2: 0.3322, …, Topic 100: 0.5522
Each topic is defined by multiple words, e.g.,
Topic 1 : basketball, StephCurry, Stadium, fans, awesome,
champion…
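Assuming per-tweet topic distributions have already been inferred by an LDA implementation (e.g., gensim), one simple way to sketch the author-level feature is to average them over the author's historical tweets; both the averaging step and the toy numbers are my illustrative choices:

```python
# Sketch: author-level topic features as the mean of per-tweet topic
# distributions (LDA inference itself is assumed done elsewhere).
def author_topic_features(tweet_topic_dists):
    """Average a list of per-tweet topic distributions into one vector."""
    n_topics = len(tweet_topic_dists[0])
    totals = [0.0] * n_topics
    for dist in tweet_topic_dists:
        for k, p in enumerate(dist):
            totals[k] += p
    return [t / len(tweet_topic_dists) for t in totals]

# Toy example with 3 topics instead of 100:
author1 = author_topic_features([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
# author1 ≈ [0.4, 0.4, 0.2]
```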
22. AUDIENCE FEATURES
Author/audience interactional topics: this feature measures the
similarity between the historical topics of the audience and the author.
I take the element-wise product of the author's and audience's
historical topic distributions; topics that are prominent for both
receive higher values.
Author historical topics:   Topic 1: 0.1,  Topic 2: 0.9,  …, Topic 100: 0.1
Audience historical topics: Topic 1: 0.5,  Topic 2: 0.9,  …, Topic 100: 0.1
Element-wise product:       Topic 1: 0.05, Topic 2: 0.81, …, Topic 100: 0.01
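The interactional feature is then a straightforward element-wise product of the two distributions:

```python
def interactional_topics(author_dist, audience_dist):
    """Element-wise product of author and audience topic distributions;
    topics prominent for both users get the largest values."""
    return [a * b for a, b in zip(author_dist, audience_dist)]

# Toy example with 3 topics instead of 100:
interactional_topics([0.1, 0.9, 0.1], [0.5, 0.9, 0.1])
# ≈ [0.05, 0.81, 0.01]
```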
23. RESPONSE FEATURES
Bag of Words: here we use the BoW features of the original tweet
(the tweet to which the target tweet is responding).
24. EXPERIMENTAL SETTING
Data → meaningful features
Machine learning model: logistic regression
Tune set → tune the LR model → optimized parameters
Train set → fit the LR model
Test set → evaluate
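This setup can be sketched with scikit-learn, using synthetic features in place of the real ones; the parameter grid over the regularization strength C is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 300 examples, 5 features, labels derived
# from a linear rule so the model has signal to learn.
rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out a test set; tune C by cross-validation on the rest,
# refit on the full training portion, then evaluate once.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
search = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)             # tune + fit
accuracy = search.score(X_test, y_test)  # evaluate
```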
26. Discussion
• Combining lexical information from the text with contextual
information yields the best accuracy in detecting sarcasm.
• Collecting historical tweets is very expensive in both time
and computation. Not very practical!
• I suggest using less contextual information about the author,
especially favoring data that can be collected easily and quickly.
E.g., the author's profile information and the response
features are relatively effective and cheap to collect.
27. Discussion
• Extract only the historical tweets around the target tweet.
Intuitively, surrounding tweets posted closer in time are
more likely to focus on the same subjects.
• Random sampling from the historical tweets can also
generate the topic distribution while reducing cost.