
Scattertext: A Tool for Visualizing Differences in Language

Past work on lexicon mining and a discussion of a new Python library for visualizing differences in language.



  1. Jason S. Kessler | Data Day Texas, January 14, 2017 | @jasonkessler | Scattertext: A Tool for Visualizing Differences in Language
  2. Word frequency • Women and men tend to use different terms on Facebook. • So do introverts and extroverts. • Hillary Clinton and Donald Trump used different terms in the presidential debate. • These differences reveal content, perceived strengths and weaknesses, and communication style. • They are often obvious once surfaced.
  3. Outline • Previous work • Ways of visualizing word association • Scattertext • Open-source Python/D3 framework for visualizing these differences • Inspecting LDA, word2vec, sparse classification models • How CDK Global is using this to help dealerships better sell cars. • We’re hiring senior data scientists and developers in Austin and Seattle.
  4. OKCupid: an online dating site. Which words and phrases statistically distinguish ethnic groups and genders? Terms shown: hobos, almond butter, 100 Years of Solitude, Bikram yoga. Source: Christian Rudder, http://blog.okcupid.com/index.php/page/7/
  5. Source: Christian Rudder. Dataclysm. 2014. Axis label: ranking with everyone else. High distance: white men ignore k-pop. Low distance: white men disproportionately mention Phish. The smaller the distance from the top left, the higher the association with white men.
  6. Source: Christian Rudder. Dataclysm. 2014. Phrase shown: “my blue eyes”
  7. Scattertext: Democrats vs. Republicans, 2012 Convention Speeches
  8. Word Use Reflecting Gender and Personality in Facebook Statuses • Objective: find words, phrases, and topics that correlate with gender and Big 5 personality type. • Data source: My Personality App; 75k voluntary participants in a Facebook-based survey, >300 million words; participants agreed to give researchers access to their statuses. • Scoring algorithm: linear regression weights, 2000 LDA topics. Sources: Lyle Ungar, 2013 AAAI Tutorial; Schwartz et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE. 2013.
  9. Lyle Ungar 2013 AAAI Tutorial. The good: • Word clouds force you to hunt for the most impactful terms • You end up examining the long tail in the process • Compactly represent a lot of phrases and topics
  10. Lyle Ungar 2013 AAAI Tutorial. The bad: • “Mullets of the Internet” (Jeffrey Zeldman, 2005) • Longer phrases are more prominent. • Ranking is unclear. • Does size indicate higher frequency?
  11. Lyle Ungar 2013 AAAI Tutorial
  12. Lyle Ungar 2013 AAAI Tutorial
  13. NYT: 2012 Political Convention Word Use by Party. Mike Bostock et al., http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
  14. Source: http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
  15. Difference in z-scores of log-odds w/ prior. Source: Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16.4 (2008): 372-403. Log-odds for word $w$, categories $A$, $B$: $\log \frac{\#(w,A)}{|A| - \#(w,A)} - \log \frac{\#(w,B)}{|B| - \#(w,B)}$. Log-odds w/ Dirichlet prior, given background corpus $C$: $\log \frac{\#(w,A) + \#(w,C)}{|A| + |C| - \#(w,A) - \#(w,C)} - \cdots$ • Difference in z-scores accounts for variation in word frequencies. • Words with differences < 1.96 are greyed out.
  16. Difference in z-scores of log-odds w/ prior. Source: Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict." Political Analysis 16.4 (2008): 372-403. • Pros: • Popular among major CL researchers (3rd edition of J+M) • Favors words which appear less frequently in the background corpus. • Natural linear word listing • Cons: • You have to pick a representative, large background corpus. • If the corpus is small, you can hit a divide-by-zero issue. • Probably only practical for unigrams • Inefficient use of space on chart
  17. Hands-on Tutorial. Repo: https://github.com/JasonKessler/scattertext. Install: $ pip install scattertext. Why the plots look the way they do: http://bit.ly/scattertextdevelopment. Topic models, word vectors, and The Lasso: http://bit.ly/scattertext2016debates. Movie revenue and practical use: http://bit.ly/scattertextrevenuemovie
  18. CDK Global: Finding Words that Sell Cars. Example review appearing on a 3rd-party automotive site. Car reviewed: Chevy Cruze. Rating: 4.4/5 stars. Text: …I was very skeptical giving up my truck and buying an "Economy Car." I'm 6' 215lbs, but my new career has me driving a personal vehicle to make sales calls. I am overly impressed with my Cruze… # of users who read review: 20. # who went on to visit a Chevy dealer’s website: 15. Conversion rate of everyone who read review: 15/20 = 75%. Median conversion rate: 22%.
  19. CDK Global: Finding Words that Sell Cars. 5 star review words: Love, Comfortable, Features, Solid, Amazing. <3 star review words: Transmission, Problem, Issue, Dealership, Times.
  20. CDK Global: Finding Words that Sell Cars. 5 star review words: Love, Comfortable, Features, Solid, Amazing. High conversion words: Comfortable, Front [Seats], Acceleration, Free [Car Wash, Oil Change], Quiet. <3 star review words: Transmission, Problem, Issue, Dealership, Times.
  21. CDK Global: Finding Words that Sell Cars. 5 star review words: Love, Comfortable, Features, Solid, Amazing. High conversion words: Comfortable, Front [Seats], Acceleration, Free [Car Wash, Oil Change], Quiet. <3 star review words: Transmission, Problem, Issue, Dealership, Times. Low conversion words: Money [spend my, save], Features, Dealership, Amazing, Build Quality [typically positive].
  22. CDK Global: Finding Words that Sell Cars (SUV Specific). 5 star review words: Love, Comfortable, Features, Solid, Amazing. High conversion words: Comfortable, Front [Seats], Acceleration, Free [Car Wash, Oil Change], Quiet. <3 star review words: Transmission, Problem, Issue, Dealership, Times. Low conversion words: Money [spend my, save], Features, Dealership, Amazing, Build Quality [typically positive]. The worst thing you can say about an SUV may be: I saved money and got all these amazing features!
  23. Thank you. [first].[last]@gmail.com. Please see https://github.com/JasonKessler/scattertext for more info on this project. We are hiring data scientists and developers in Seattle and Austin! Please contact me if you’d like to know more. https://jobs.cdkglobal.com/
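
The log-odds scoring on slides 15 and 16 can be sketched in a few lines of plain Python. This is a minimal illustration, not Scattertext's own implementation. It uses the background-corpus counts #(w, C) as the Dirichlet prior, as on the slide, and divides by the approximate standard deviation from Monroe et al. (the square root of the sum of the reciprocal smoothed counts), which the slide does not show. The function name and the toy counts are hypothetical.

```python
import math
from collections import Counter


def log_odds_z_scores(counts_a, counts_b, counts_bg):
    """Z-scored log-odds ratio of each word in category A vs. category B.

    Background counts act as the Dirichlet prior (an assumption matching
    the slide's "given background corpus C" formulation).
    """
    n_a = sum(counts_a.values())    # |A|
    n_b = sum(counts_b.values())    # |B|
    n_bg = sum(counts_bg.values())  # |C|
    scores = {}
    for word in set(counts_a) | set(counts_b):
        prior = counts_bg.get(word, 0)
        if prior == 0:
            # A word missing from the background gets no smoothing and can
            # trigger the divide-by-zero issue, so this sketch skips it.
            continue
        y_a = counts_a.get(word, 0) + prior  # #(w, A) + #(w, C)
        y_b = counts_b.get(word, 0) + prior  # #(w, B) + #(w, C)
        rest_a = n_a + n_bg - y_a            # smoothed count of all other words in A
        rest_b = n_b + n_bg - y_b
        if rest_a <= 0 or rest_b <= 0:
            continue
        delta = math.log(y_a / rest_a) - math.log(y_b / rest_b)
        variance = 1.0 / y_a + 1.0 / y_b     # approximation from Monroe et al.
        scores[word] = delta / math.sqrt(variance)
    return scores


# Toy counts, made up purely for illustration.
toy_a = Counter({"jobs": 30, "health": 50, "the": 200})
toy_b = Counter({"jobs": 70, "health": 10, "the": 200})
toy_bg = Counter({"jobs": 100, "health": 100, "the": 5000})

toy_scores = log_odds_z_scores(toy_a, toy_b, toy_bg)
for word, z in sorted(toy_scores.items(), key=lambda kv: kv[1], reverse=True):
    flag = "" if abs(z) >= 1.96 else " (greyed out)"
    print(f"{word}: {z:+.2f}{flag}")
```

Skipping words that are absent from the background is one simple way around the small-corpus problem listed under the cons; adding a small constant to every prior would be another.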
