Scattertext: A Tool for Visualizing Differences in Language

Jason Kessler
Lead Data Scientist @ CDK
1
Jason S. Kessler | Data Day Texas, January 14, 2017
@jasonkessler
Scattertext: A Tool for
Visualizing Differences in
Language
2
Word frequency
• Women and men tend to use different terms on Facebook.
• As do introverts and extroverts.
• Hillary Clinton and Donald Trump used different terms in the presidential debate.
• Reveal differences in
  • content,
  • perceived strengths and weaknesses
  • communication style
• These are often obvious after being surfaced
3
Outline
• Previous work
• Ways of visualizing word association
• Scattertext
• Open-source Python/D3 framework for visualizing these differences
• Inspecting LDA, word2vec, sparse classification models
• How CDK Global is using this to help dealerships better sell cars.
• We’re hiring senior data scientists + devs in Austin and Seattle.
4
OKCupid: an online dating site
Which words and phrases statistically distinguish ethnic groups and genders?
Terms shown: hobos, almond butter, 100 Years of Solitude, Bikram yoga
Source: Christian Rudder, http://blog.okcupid.com/index.php/page/7/
5
Source: Christian Rudder. Dataclysm. 2014.
Ranking with everyone else
High distance: white men ignore k-pop
Low distance: white men disproportionately mention Phish
The smaller the distance from the top left, the higher the association with white men.
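The distance-from-the-corner idea above can be sketched in a few lines of Python. This is an illustration only, not Rudder's actual procedure: the percentile-rank axes, the whitespace tokenizer, and the names `rank_percentiles` and `corner_distances` are assumptions made for the example.

```python
from collections import Counter
from math import hypot


def rank_percentiles(counts, vocab):
    """Map each term to a rank percentile in [0, 1]; 1.0 is the most frequent."""
    # Sort by (count, term) so ties are broken deterministically.
    ordered = sorted(vocab, key=lambda t: (counts[t], t))
    denom = max(len(ordered) - 1, 1)
    return {t: i / denom for i, t in enumerate(ordered)}


def corner_distances(group_docs, other_docs):
    """Distance of each term from the top-left corner of the rank plot."""
    group = Counter(w for d in group_docs for w in d.lower().split())
    other = Counter(w for d in other_docs for w in d.lower().split())
    vocab = set(group) | set(other)
    x = rank_percentiles(other, vocab)  # assumed x-axis: rank with everyone else
    y = rank_percentiles(group, vocab)  # assumed y-axis: rank within the group
    # Top-left corner is (x=0, y=1): rare elsewhere, common in the group.
    return {t: hypot(x[t], 1.0 - y[t]) for t in vocab}


# Hypothetical toy documents, not data from the talk.
distances = corner_distances(
    ["phish phish phish concert", "phish tour"],
    ["kpop kpop kpop concert", "kpop tour"],
)
for term in sorted(distances, key=distances.get):
    print(f"{term:8s} {distances[term]:.3f}")
```

Terms printed first sit closest to the corner and are the ones most associated with the group; terms printed last are the ones the group ignores.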
6
Source: Christian Rudder. Dataclysm. 2014.
my blue eyes
7
Scattertext: Democrats vs Republicans: 2012 Convention Speeches
8
Word Use Reflecting Gender and Personality in Facebook Statuses
• Objective:
  • Find words, phrases, and topics that correlate with
    • gender, and
    • Big 5 personality type
• Data source:
  • My Personality App
  • 75k voluntary participants in a Facebook-based survey, >300 million words
  • Participants agreed to give researchers access to their statuses.
• Scoring algorithm
  • Linear regression weights, 2000 LDA topics.
Lyle Ungar 2013 AAAI Tutorial
Schwartz et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE. 2013.
9
Lyle Ungar
2013 AAAI
Tutorial
The good:
• Word clouds force you to hunt for the most impactful terms
• You end up examining the long tail in the process
• Compactly represent a lot of phrases and topics
10
Lyle Ungar
2013 AAAI
Tutorial
The bad:
• “Mullets of the Internet” --Jeffrey Zeldman, 2005
• Longer phrases are more prominent.
• Ranking is unclear
• Does size indicate higher frequency?
11
Lyle Ungar
2013 AAAI
Tutorial
12
Lyle Ungar
2013 AAAI
Tutorial
13
Mike Bostock et al., http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
NYT: 2012 Political Convention Word Use by Party
14
Source: http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
15
Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict." Political Analysis 16.4 (2008): 372-403.
Difference in z-scores of log-odds w/ prior

Log-odds for word w, categories A, B:
$$\log\frac{\#(w,A)}{|A| - \#(w,A)} - \log\frac{\#(w,B)}{|B| - \#(w,B)}$$

Log-odds w/ Dirichlet prior, given background corpus C:
$$\log\frac{\#(w,A) + \#(w,C)}{|A| + |C| - \#(w,A) - \#(w,C)} - \cdots$$
• Difference in z-score accounts for variation in word frequencies.
• Words with z-score differences below 1.96 are greyed out.
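A minimal Python sketch of the scoring on this slide, assuming raw background-corpus counts serve directly as the Dirichlet prior (as in the formula above) and using the variance approximation from Monroe et al., 1/(#(w,A) + #(w,C)) + 1/(#(w,B) + #(w,C)). The function name `log_odds_z_scores` and the toy counts are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def log_odds_z_scores(counts_a, counts_b, counts_bg):
    """z-score of the prior-smoothed log-odds difference for each word."""
    n_a = sum(counts_a.values())    # |A|
    n_b = sum(counts_b.values())    # |B|
    n_bg = sum(counts_bg.values())  # |C|
    z = {}
    for w in set(counts_a) | set(counts_b):
        prior = counts_bg.get(w, 0)  # #(w, C)
        if prior == 0:
            continue  # word missing from the background corpus: no prior, skip it
        a = counts_a.get(w, 0) + prior  # #(w, A) + #(w, C)
        b = counts_b.get(w, 0) + prior  # #(w, B) + #(w, C)
        delta = log(a / (n_a + n_bg - a)) - log(b / (n_b + n_bg - b))
        variance = 1.0 / a + 1.0 / b  # approximate variance of delta
        z[w] = delta / sqrt(variance)
    return z


# Hypothetical toy counts, not data from the talk.
counts_a = Counter({"jobs": 30, "the": 100, "tax": 5})
counts_b = Counter({"jobs": 5, "the": 100, "tax": 30})
background = Counter({"jobs": 10, "the": 500, "tax": 10, "cat": 3})

scores = log_odds_z_scores(counts_a, counts_b, background)
for word in sorted(scores, key=scores.get, reverse=True):
    flag = "" if abs(scores[word]) >= 1.96 else "  (greyed out)"
    print(f"{word:6s} {scores[word]:+.2f}{flag}")
```

Skipping words absent from the background corpus is one way to sidestep the zero-prior case; a small constant prior would be another option.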
16
Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. "Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict." Political Analysis 16.4 (2008): 372-403.
Difference in z-scores of log-odds w/ prior
• Pros:
  • Popular among major CL researchers (3rd edition of J+M)
  • Favors words which appear less frequently in the background corpus.
  • Natural linear word listing
• Cons:
  • You have to pick a large, representative background corpus.
  • If the corpus is small, division by zero becomes an issue
  • Probably only practical for unigrams
  • Inefficient use of space on chart
17
Hands-on Tutorial
Repo: https://github.com/JasonKessler/scattertext
$ pip install scattertext
Why the plots look the way they do: http://bit.ly/scattertextdevelopment
Topic models, word vectors, and The Lasso: http://bit.ly/scattertext2016debates
Movie revenue and practical use: http://bit.ly/scattertextrevenuemovie
18
CDK Global: Finding Words that Sell Cars
Example Review Appearing on a 3rd Party Automotive Site
Car Reviewed: Chevy Cruze
Rating: 4.4/5 Stars
Text: …I was very skeptical giving up my truck and buying an "Economy Car." I'm 6' 215lbs, but my new career has me driving a personal vehicle to make sales calls. I am overly impressed with my Cruze…
# of users who read review: 20
# who went on to visit a Chevy dealer’s website: 15
Conversion rate of everyone who read review: 15/20 = 75%
Median conversion rate: 22%
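The conversion-rate arithmetic on this slide, plus one naive way to relate words to conversion, can be sketched as follows. The slides do not say how CDK scores words, so averaging the conversion rate of every review containing a word is an assumption made for illustration; the records and the name `mean_rate_by_word` are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (review text, readers, readers who visited a dealer site).
reviews = [
    ("comfortable and quiet ride", 20, 15),
    ("saved money amazing features", 40, 4),
    ("comfortable seats", 10, 3),
]


def conversion_rate(readers, visitors):
    """Share of a review's readers who went on to visit a dealer's website."""
    return visitors / readers


def mean_rate_by_word(records):
    """Average conversion rate of the reviews that contain each word."""
    rates = defaultdict(list)
    for text, readers, visitors in records:
        for word in set(text.lower().split()):  # count each word once per review
            rates[word].append(conversion_rate(readers, visitors))
    return {word: sum(r) / len(r) for word, r in rates.items()}


print(conversion_rate(20, 15))  # the slide's example: 15/20
baseline = median(conversion_rate(r, v) for _, r, v in reviews)
by_word = mean_rate_by_word(reviews)
for word in sorted(by_word, key=by_word.get, reverse=True):
    side = "above" if by_word[word] > baseline else "at or below"
    print(f"{word:12s} {by_word[word]:.2f}  ({side} median)")
```

A real analysis would need many more reviews per word and some control for frequency, for example the log-odds scoring shown earlier in the deck.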
19
CDK Global: Finding Words that Sell Cars
5 star review words
Love
Comfortable
Features
Solid
Amazing
<3 star review words
Transmission
Problem
Issue
Dealership
Times
20
CDK Global: Finding Words that Sell Cars
5 star review words  | High conversion words
Love                 | Comfortable
Comfortable          | Front [Seats]
Features             | Acceleration
Solid                | Free [Car Wash, Oil Change]
Amazing              | Quiet

<3 star review words
Transmission
Problem
Issue
Dealership
Times
21
CDK Global: Finding Words that Sell Cars
5 star review words  | High conversion words
Love                 | Comfortable
Comfortable          | Front [Seats]
Features             | Acceleration
Solid                | Free [Car Wash, Oil Change]
Amazing              | Quiet

<3 star review words | Low conversion words
Transmission         | Money [spend my, save]
Problem              | Features
Issue                | Dealership
Dealership           | Amazing
Times                | Build Quality [typically positive]
22
CDK Global: Finding Words that Sell Cars (SUV Specific)
5 star review words  | High conversion words
Love                 | Comfortable
Comfortable          | Front [Seats]
Features             | Acceleration
Solid                | Free [Car Wash, Oil Change]
Amazing              | Quiet

<3 star review words | Low conversion words
Transmission         | Money [spend my, save]
Problem              | Features
Issue                | Dealership
Dealership           | Amazing
Times                | Build Quality [typically positive]

The worst thing you can say about an SUV may be:
I saved money and got all these amazing features!
23
Thank you.
[first].[last]@gmail.com
Please see https://github.com/JasonKessler/scattertext for more info on this project.
We are hiring data scientists and developers in Seattle and Austin! Please contact me if you’d like to know more.
https://jobs.cdkglobal.com/


Editor's Notes

  8. You’re left to your own devices to try and make sense of the differences.
  9. Similar scatter plot to Rudder's. Less efficient use of space. Natural listing of words on left-hand side.
  10. Similar scatter plot to Rudder's. Less efficient use of space. Natural listing of words on left-hand side.