Scattertext: A Tool for Visualizing Differences in Language
1. 1
Jason S. Kessler | Data Day Texas, January 14, 2017
@jasonkessler
2. 2
Word frequency
• Women and men tend to use different terms on Facebook.
• As do introverts and extroverts.
• Hillary Clinton and Donald Trump used different terms in the presidential debate.
• Word frequencies reveal differences in
  • content,
  • perceived strengths and weaknesses,
  • communication style.
• These are often obvious after being surfaced.
3. 3
Outline
• Previous work
  • Ways of visualizing word association
• Scattertext
  • Open-source Python/D3 framework for visualizing these differences
  • Inspecting LDA, word2vec, sparse classification models
• How CDK Global is using this to help dealerships better sell cars.
• We’re hiring senior data scientists + devs in Austin and Seattle.
4. 4
OKCupid: an online dating site
Which words and phrases statistically distinguish ethnic groups and genders?
hobos, almond butter, 100 Years of Solitude, Bikram yoga
Christian Rudder: http://blog.okcupid.com/index.php/page/7/
5. 5
Source: Christian Rudder. Dataclysm. 2014.
Ranking with everyone else
High distance: white men ignore k-pop
Low distance: white men disproportionately mention Phish
The smaller the distance from the top left, the higher the association with white men.
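The distance idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Rudder's actual method: it assumes the x-axis is a term's frequency percentile among everyone else, the y-axis is its percentile within the target group, and the top-left corner is therefore (0, 1). The function names and the toy word counts are hypothetical.

```python
import math
from collections import Counter


def dense_percentiles(counts, vocab):
    """Scale each term's frequency rank to [0, 1]; tied counts share a rank."""
    distinct = sorted({counts.get(term, 0) for term in vocab})
    if len(distinct) == 1:
        return {term: 0.5 for term in vocab}
    scaled = {c: i / (len(distinct) - 1) for i, c in enumerate(distinct)}
    return {term: scaled[counts.get(term, 0)] for term in vocab}


def distance_from_top_left(target_counts, other_counts):
    """Euclidean distance of each term from the assumed top-left corner (0, 1)."""
    vocab = set(target_counts) | set(other_counts)
    x = dense_percentiles(other_counts, vocab)   # assumed x: everyone else
    y = dense_percentiles(target_counts, vocab)  # assumed y: target group
    return {term: math.hypot(x[term], 1.0 - y[term]) for term in vocab}


# Toy, made-up counts purely for illustration.
target = Counter("phish phish phish the the hiking".split())
others = Counter("kpop kpop kpop the the the hiking".split())

distances = distance_from_top_left(target, others)
ranked = sorted(distances, key=distances.get)  # most associated terms first
print(ranked)
```

With these toy counts, the term used only by the target group lands exactly on the corner, and the term used only by everyone else lands farthest from it.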
8. 8
Word Use Reflecting Gender and Personality in Facebook Statuses
• Objective:
  • Find words, phrases, and topics that correlate with
    • gender, and
    • Big 5 personality type
• Data source:
  • My Personality App
  • 75k voluntary participants in a Facebook-based survey, >300 million words
  • Participants agreed to give researchers access to their statuses.
• Scoring algorithm
  • Linear regression weights, 2000 LDA topics.
Lyle Ungar, 2013 AAAI Tutorial
Schwartz et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE. 2013.
9. 9
Lyle Ungar, 2013 AAAI Tutorial
The good:
• Word clouds force you to hunt for the most impactful terms
• You end up examining the long tail in the process
• Compactly represent a lot of phrases and topics
10. 10
Lyle Ungar, 2013 AAAI Tutorial
The bad:
• “Mullets of the Internet” --Jeffrey Zeldman, 2005
• Longer phrases are more prominent.
• Ranking is unclear
  • Does size indicate higher frequency?
13. 13
Mike Bostock et al., http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
NYT: 2012 Political Convention Word Use by Party
15. 15
Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16.4 (2008): 372-403.
Difference in z-scores of log-odds w/ prior
Log-odds for word w, categories A, B:
\[
\log \frac{\#(w, A)}{|A| - \#(w, A)} - \log \frac{\#(w, B)}{|B| - \#(w, B)}
\]
Log-odds w/ Dirichlet prior, given background corpus C:
\[
\log \frac{\#(w, A) + \#(w, C)}{|A| + |C| - \#(w, A) - \#(w, C)} - \cdots
\]
• Difference in z-scores accounts for variation in word frequencies.
• Words whose z-score difference is below 1.96 in absolute value are greyed out.
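A minimal sketch of the scoring above, assuming raw background counts are added directly to each category's counts as in the slide's formula. The variance approximation, 1/(smoothed count in A) + 1/(smoothed count in B), follows Monroe et al. and is not shown on the slide. The function name and the toy counts are hypothetical.

```python
import math
from collections import Counter


def log_odds_z_scores(counts_a, counts_b, counts_bg):
    """z-score of the log-odds difference (A vs. B), with background counts as prior."""
    total_a = sum(counts_a.values()) + sum(counts_bg.values())  # |A| + |C|
    total_b = sum(counts_b.values()) + sum(counts_bg.values())  # |B| + |C|
    z_scores = {}
    for word in set(counts_a) | set(counts_b):
        prior = counts_bg.get(word, 0)
        y_a = counts_a.get(word, 0) + prior  # #(w, A) + #(w, C)
        y_b = counts_b.get(word, 0) + prior  # #(w, B) + #(w, C)
        # Skip words whose odds are undefined (zero or all of the mass);
        # this is where a small background corpus causes trouble.
        if y_a in (0, total_a) or y_b in (0, total_b):
            continue
        delta = math.log(y_a / (total_a - y_a)) - math.log(y_b / (total_b - y_b))
        variance = 1.0 / y_a + 1.0 / y_b  # approximation from Monroe et al.
        z_scores[word] = delta / math.sqrt(variance)
    return z_scores


# Toy, made-up counts purely for illustration.
counts_a = Counter({"jobs": 30, "tax": 10, "the": 100})
counts_b = Counter({"jobs": 10, "tax": 30, "the": 100})
counts_bg = Counter({"jobs": 5, "tax": 5, "the": 500})

z = log_odds_z_scores(counts_a, counts_b, counts_bg)
for word, score in sorted(z.items(), key=lambda item: -item[1]):
    marker = "" if abs(score) >= 1.96 else " (greyed out)"
    print(f"{word}: {score:+.2f}{marker}")
```

In this toy data "jobs" and "tax" clear the 1.96 threshold in opposite directions, while "the" scores zero because it is used equally by both categories.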
16. 16
Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16.4 (2008): 372-403.
Difference in z-scores of log-odds w/ prior
• Pros:
  • Popular among major CL researchers (3rd edition of J+M)
  • Favors words which appear less frequently in the background corpus.
  • Natural linear word listing
• Cons:
  • You have to pick a large, representative background corpus.
  • If the corpus is small, you can run into division by zero.
  • Probably only practical for unigrams
  • Inefficient use of space on chart
17. 17
Hands-on Tutorial
Repo: https://github.com/JasonKessler/scattertext
$ pip install scattertext
Why the plots look the way they do: http://bit.ly/scattertextdevelopment
Topic models, word vectors, and The Lasso: http://bit.ly/scattertext2016debates
Movie revenue and practical use: http://bit.ly/scattertextrevenuemovie
18. 18
CDK Global: Finding Words that Sell Cars
Example Review Appearing on a 3rd Party Automotive Site
Car Reviewed: Chevy Cruze
Rating: 4.4/5 Stars
Text: …I was very skeptical giving up my truck and buying an "Economy Car." I'm 6' 215lbs, but my new career has me driving a personal vehicle to make sales calls. I am overly impressed with my Cruze…
# of users who read review: 20
# who went on to visit a Chevy dealer’s website: 15
Conversion rate of everyone who read review: 15/20 = 75%
Median conversion rate: 22%
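The conversion-rate arithmetic on this slide can be written out as a short Python sketch. The function name is hypothetical, and the ratio to the median is just one illustrative way to compare a review against the typical case, not something the slide prescribes.

```python
def conversion_rate(readers, converters):
    """Fraction of a review's readers who went on to visit a dealer's website."""
    if readers <= 0:
        raise ValueError("a review needs at least one reader")
    return converters / readers


# Figures from the example review on the slide.
rate = conversion_rate(readers=20, converters=15)
median_rate = 0.22  # median conversion rate quoted on the slide

print(f"review conversion rate: {rate:.0%}")
print(f"ratio to median: {rate / median_rate:.1f}x")
```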
19. 19
CDK Global: Finding Words that Sell Cars
5 star review words
Love
Comfortable
Features
Solid
Amazing
<3 star review words
Transmission
Problem
Issue
Dealership
Times
20. 20
CDK Global: Finding Words that Sell Cars
5 star review words  | High conversion words
Love                 | Comfortable
Comfortable          | Front [Seats]
Features             | Acceleration
Solid                | Free [Car Wash, Oil Change]
Amazing              | Quiet
<3 star review words
Transmission
Problem
Issue
Dealership
Times
21. 21
CDK Global: Finding Words that Sell Cars
5 star review words  | High conversion words
Love                 | Comfortable
Comfortable          | Front [Seats]
Features             | Acceleration
Solid                | Free [Car Wash, Oil Change]
Amazing              | Quiet
<3 star review words | Low conversion words
Transmission         | Money [spend my, save]
Problem              | Features
Issue                | Dealership
Dealership           | Amazing
Times                | Build Quality [typically positive]
22. 22
CDK Global: Finding Words that Sell Cars (SUV Specific)
5 star review words  | High conversion words
Love                 | Comfortable
Comfortable          | Front [Seats]
Features             | Acceleration
Solid                | Free [Car Wash, Oil Change]
Amazing              | Quiet
<3 star review words | Low conversion words
Transmission         | Money [spend my, save]
Problem              | Features
Issue                | Dealership
Dealership           | Amazing
Times                | Build Quality [typically positive]
The worst thing you can say about an SUV may be:
I saved money and got all these amazing features!
23. 23
Thank you.
[first].[last]@gmail.com
Please see https://github.com/JasonKessler/scattertext
for more info on this project.
We are hiring data scientists and developers in Seattle and
Austin! Please contact me if you’d like to know more.
https://jobs.cdkglobal.com/
Editor's Notes
You’re left to your own devices to try and make sense of the differences.
Similar scatter plot to Rudder's. Less efficient use of space. Natural listing of words on the left-hand side.