NYAAPOR 2019 - An Intro to Topic Models for Text Analysis
Patrick van Kessel, Senior Data Scientist
Pew Research Center
Part 1
Text as Data
Agenda
● The role of text in social research
● Data selection and preparation
● Processing: Turning text into data
● Analysis: Finding patterns in text data
● Visualization: Turning patterns into insights
The role of text in social research
Why text?
● Free of the assumptions baked into closed-format questions
● Potential for richer insights relative to closed-format responses
● If organic, then data collection costs are often negligible
The role of text in social research
Where do I find it?
● Open-ended surveys / focus groups / transcripts / interviews
● Social media data (tweets, FB posts, etc.)
● Long-form content (articles, notes, logs, etc.)
The role of text in social research
What makes it challenging?
● Messy
○ “Data spaghetti” with little or no structure
● Sparse
○ Low information-to-data ratio (lots of hay, few needles)
● Often organic (rather than designed)
○ Can be naturally generated by people and processes
○ Often without a research use in mind
Data selection and preparation
Data selection and preparation
● Know your objective and subject matter (if needed, find a subject matter expert)
● Get familiar with the data
● Don’t make assumptions - know your data, quirks and all
Data selection and preparation
Text Acquisition and Preparation
Select relevant data (text corpus)
● Content
● Metadata
Prepare the input file
● Determine unit of analysis
● Process text to get one document per unit of analysis
Processing
Turning text into data
Turning text into data
● How do we sift through text and produce insight?
● Might first try searching for keywords
● How many times is “analysis” mentioned?
Original
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
We missed “analyzing”
And “analytics” too
Turning text into data
● Variations of words can have the same meaning but look completely different
to a computer
Original
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Regular Expressions
● A more sophisticated solution: regular expressions
● Syntax for defining string (text) patterns
Original
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Regular Expressions
● Can use to search text or extract
specific chunks
● Example use cases:
○ Extracting dates
○ Finding URLs
○ Identifying names/entities
● https://regex101.com/
● http://www.regexlib.com/
Turning text into data
Regular Expressions
\banaly[a-z]+\b
Original
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
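As a sketch in Python’s re module (the pattern and example sentences are the ones on the slide):

```python
import re

docs = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
]

# \b marks a word boundary; [a-z]+ matches any run of letters after "analy"
pattern = re.compile(r"\banaly[a-z]+\b", re.IGNORECASE)

matches = [pattern.findall(doc) for doc in docs]
print(matches)  # [['analysis'], ['analyzing'], ['analytics']]
```

All three variants now match, including the two that a plain search for “analysis” missed.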
Turning text into data
Regular Expressions
Regular expressions can be extremely powerful…
...and terrifyingly complex:
URLS: ((https?:\/\/(www\.)?)?[-a-zA-Z0-9@:%._\+~#=]{2,4096}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&\/\/=]*))
DOMAINS: (?:http[s]?:\/\/)?(?:www(?:s?)\.)?([\w\.\-]+)(?:[\/](?:.+))?
MONEY: \$([0-9]{1,3}(?:(?:\,[0-9]{3})+)?(?:\.[0-9]{1,2})?)\s
Turning text into data
Pre-processing
● Great, but we can’t write patterns for everything
● Words are messy and have a lot of variation
● We need to collapse semantically equivalent forms
● In other words, we need to pre-process the text
Original
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Pre-processing
● Common first steps:
○ Spell check / correct
○ Remove punctuation / expand contractions
Original Processed
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Pre-processing
● Now to collapse words with the same meaning
● We do this with stemming or lemmatization
● Break words down to their roots
Original Processed
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Pre-processing
● Stemming is more conservative
● There are many different stemmers
● Here’s the Porter stemmer (1980)
Original Processed
1 Text analysis is fun Text analysi is fun
2 I enjoy analyzing text data I enjoy analyz text data
3 Data science often involves text analytics Data scienc often involv text analyt
Turning text into data
Pre-processing
● The Lancaster stemmer (1990) is newer and more aggressive
● Truncates words a LOT
Original Processed
1 Text analysis is fun text analys is fun
2 I enjoy analyzing text data I enjoy analys text dat
3 Data science often involves text analytics dat sci oft involv text analys
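To illustrate the suffix-stripping idea, here is a toy stemmer with a made-up rule list; the real Porter and Lancaster algorithms apply much larger, carefully ordered rule sets:

```python
# Toy suffix-stripping stemmer -- illustrative only, not the real Porter/Lancaster rules
SUFFIXES = ["tics", "zing", "sis", "ing", "s"]  # ordered: most specific first

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # only strip if a reasonable-length stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["analysis", "analyzing", "analytics"]:
    print(w, "->", toy_stem(w))  # all three collapse to "analy"
```

The point is the mechanism: once variants share a stem, a computer can treat them as the same term.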
Turning text into data
Pre-processing
● Lemmatization uses linguistic relationships and parts of speech to collapse words down to their root form - so you get actual words (“lemmas”), not stems
● WordNet Lemmatizer
Original Processed
1 Text analysis is fun text analysis is fun
2 I enjoy analyzing text data I enjoy analyze text data
3 Data science often involves text analytics data science often involve text analytics
Turning text into data
Pre-processing
● Picking the right method depends on how much you want to preserve nuance
or collapse meaning
● We’ll stick with Lancaster
Original Processed
1 Text analysis is fun text analys is fun
2 I enjoy analyzing text data I enjoy analys text dat
3 Data science often involves text analytics dat sci oft involv text analys
Turning text into data
Pre-processing
● Finally, we need to remove words that don’t hold meaning themselves
● These are called “stopwords”
● Can expand standard stopword lists with custom words
Original Processed
1 Text analysis is fun text analys fun
2 I enjoy analyzing text data enjoy analys text dat
3 Data science often involves text analytics dat sci oft involv text analys
Turning text into data
Tokenization
● Now we need to tokenize
● Break words apart according to certain rules
● Usually breaks on whitespace and punctuation
● What’s left are called “tokens”
● Single tokens (“unigrams”) or sequences of two or more tokens are collectively called “ngrams”
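A minimal tokenizer and ngram generator might look like this (the tiny stopword list is illustrative only):

```python
import re

STOPWORDS = {"is", "i", "often"}  # tiny illustrative stopword list

def tokenize(text):
    # lowercase, break on anything that isn't a letter, drop stopwords
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n):
    # sliding window of n adjacent tokens
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Text analysis is fun")
print(tokens)             # ['text', 'analysis', 'fun']
print(ngrams(tokens, 2))  # ['text analysis', 'analysis fun']
```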
Turning text into data
Tokenization
● Unigrams
        text  analys  fun  enjoy  dat  sci  oft  involv
Doc 1     1     1      1
Doc 2     1     1            1     1
Doc 3     1     1                  1    1    1     1
Turning text into data
Tokenization
● Bigrams
text analys    2
analys fun     1
enjoy analys   1
analys text    1
text dat       1
dat sci        1
sci oft        1
oft involv     1
involv text    1
Turning text into data
Tokenization
● If we want to characterize the whole corpus, we can just look at the most
frequent words
        text  analys  fun  enjoy  dat  sci  oft  involv
Doc 1     1     1      1
Doc 2     1     1            1     1
Doc 3     1     1                  1    1    1     1
Total     3     3      1     1     2    1    1     1
Turning text into data
The Term Frequency Matrix
● But what if we want to distinguish documents from each other?
● We know these documents are about text analysis
● What makes them unique?
Turning text into data
The Term Frequency Matrix
● Divide the word frequencies by the number of documents they appear in
● Downweight words that are common
● Now we’re emphasizing what makes each document unique
        text  analys  fun  enjoy  dat  sci  oft  involv
Doc 1    .33   .33     1
Doc 2    .33   .33           1    .5
Doc 3    .33   .33                .5    1    1     1
Total     1     1      1     1     1    1    1     1
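The slide’s simple downweighting scheme (divide each term count by the number of documents containing the term) can be sketched as:

```python
from collections import Counter

docs = [
    ["text", "analys", "fun"],
    ["enjoy", "analys", "text", "dat"],
    ["dat", "sci", "oft", "involv", "text", "analys"],
]

# document frequency: in how many documents does each term appear?
df = Counter()
for doc in docs:
    df.update(set(doc))

# weight each term count by 1 / document frequency
weights = [
    {term: count / df[term] for term, count in Counter(doc).items()}
    for doc in docs
]
print(weights[0])  # text and analys drop to ~.33, fun stays at 1.0
```

Full TF-IDF implementations typically use a log-scaled inverse document frequency, but the intuition is the same: common terms get downweighted, distinctive terms stand out.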
Turning words into numbers
Part-of-Speech Tagging
● TF-IDF is often all you need
● But sometimes you care about how a word is used
● Can use pre-trained part-of-speech (POS) taggers
● Can also help with things like negation
○ “Happy” vs. “NOT happy”
Turning words into numbers
Named Entity Extraction
● Might also be interested in people, places, organizations, etc.
● Like POS taggers, named entity extractors use trained models
Turning words into numbers
Word2Vec
● Other methods quantify words not by frequency, but by their relationships
● Word2Vec uses a sliding window to read words and learn their relationships; each word gets a vector in N-dimensional space
● https://code.google.com/archive/p/word2vec/
Analysis
Finding patterns in text data
Finding patterns in text data
Two types of approaches:
● Supervised NLP: requires training data, learns to predict labels and classes
● Unsupervised NLP: automated, extracts structure from the data
Finding patterns in text data
Supervised methods
● Often we want to categorize documents
● Classification models can take labeled data and learn to make predictions
● If we read and label some documents ourselves, ML can do the rest
Finding patterns in text data
Supervised methods
Steps:
● Label a sample of documents
● Break your sample into two sets: a training sample and a test sample
● Train a model on the training sample
● Evaluate it on the test sample
● Apply it to the full set of documents to make predictions
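Assuming toy labeled data (real labels would come from human coders following a codebook), the steps above can be sketched with scikit-learn:

```python
# Minimal sketch of the label -> split -> train -> evaluate pipeline
# (toy data; real labels would come from human coders)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "i love to travel abroad", "flights and hotels are fun",
    "my kids make me happy", "family and children matter",
] * 10
labels = ["travel", "travel", "family", "family"] * 10

# hold out 20% of the labeled documents for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=0
)

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

Once evaluated, the same fitted model can be applied to the full, unlabeled corpus with `model.predict(...)`.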
Finding patterns in text data
Supervised methods
● First you need to develop a codebook
● Codebook: set of rules for labeling and categorizing documents
● The best codebooks have clear rules for hard cases, and lots of examples
● Categories should be MECE: mutually exclusive and collectively exhaustive
Finding patterns in text data
Supervised methods
Finding patterns in text data
Supervised methods
● Need to validate the codebook by measuring interrater reliability
● Makes sure your measures are consistent, objective, and reproducible
● Multiple people code the same document
Finding patterns in text data
Supervised methods
● Various metrics to test whether their agreement is high enough
○ Krippendorff’s alpha
○ Cohen’s kappa
● Can also compare coders against a gold standard, if available
Finding patterns in text data
Supervised methods
● Mechanical Turk can be a great way to code a lot of documents
● Have 5+ Turkers code a large sample of documents
● Collapse them together with a rule
● Code a subset in-house, and compute reliability
Finding patterns in text data
Supervised methods
● Sometimes you’re trying to measure something that’s really rare
● Consider oversampling
● Example: keyword oversampling
● Disproportionately select documents that are more likely to contain your
outcomes
Finding patterns in text data
Supervised methods
● After coding, split your sample into two sets (~80/20)
○ One for training, one for testing
● We do this to check for (and avoid) overfitting
Finding patterns in text data
Supervised methods
● Next step is called feature extraction or feature selection
● Need to extract “features” from the text
○ TF-IDF
○ Word2Vec vectors
● Can also utilize metadata, if potentially useful
Finding patterns in text data
Supervised methods
● Select a classification algorithm
● A common choice for text data is the support vector machine (SVM)
● Similar to regression, SVMs find the line
that best separates two or more groups
● Can also use non-linear “kernels” for
better fits (radial basis function, etc.)
● XGBoost is a newer and very promising
algorithm
Finding patterns in text data
Supervised methods
● Word of caution: if you oversampled, your weights are probability
(survey) weights
● Most machine learning implementations assume your weights are
frequency weights
● Not an issue for predictions - but can produce biased variance estimates
if you analyze your training data itself
Finding patterns in text data
Supervised methods
● Time to evaluate performance
● Lots of different metrics, depending on
what you care about
● Often we care about precision/recall
○ Precision: did you pick out mostly needles or
mostly hay?
○ Recall: how many needles did you miss?
● Other metrics:
○ Matthews correlation coefficient
○ Brier score
○ Overall accuracy
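Precision and recall can be computed directly from labeled predictions; the example counts below are made up:

```python
def precision_recall(y_true, y_pred):
    # positive class (1) = the "needles" we are trying to find
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)  # of everything flagged, how much was needle?
    recall = tp / (tp + fn)     # of all the needles, how many did we find?
    return precision, recall

# 3 true needles; the model flags 4 documents, 2 of them correctly
p, r = precision_recall([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 1, 0])
print(p, r)  # precision 0.5, recall ~0.67
```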
Finding patterns in text data
Supervised methods
● Doing just one split leaves a lot up to chance
● To bootstrap a better estimate of the model’s performance, it’s best to use K-fold cross-validation
● Splits your data into train/test sets multiple times and averages the
performance metrics
● Ensures that you didn’t just get lucky (or unlucky)
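A minimal sketch of generating K-fold train/test splits (round-robin assignment for illustration; libraries like scikit-learn provide shuffled versions):

```python
def k_fold_indices(n, k):
    # deterministic round-robin assignment of n items to k folds
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# each of the 10 items lands in exactly one test fold
for train, test in k_fold_indices(10, 5):
    print(len(train), len(test))  # 8 2 on every fold
```

Averaging the evaluation metric across the k test folds gives a more stable performance estimate than a single split.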
Finding patterns in text data
Supervised methods
● Model not working well?
● You probably need to tune your parameters
● You can use a grid search to test out different combinations of model
parameters and feature extraction methods
● Many software packages can automatically help you pick the best
combination to maximize your model’s performance
Finding patterns in text data
Supervised methods
● Suggested design:
○ Large training sample, coded by Turkers
○ Small evaluation sample, coded by Turkers and in-house experts
○ Compute IRR between Turk and experts
○ Train model on training sample, use 5-fold cross-validation
○ Apply model to evaluation sample, compare results against in-house coders and Turkers
Finding patterns in text data
Supervised methods
● Some (but not all) models produce probabilities along with their
classifications
● Ideally you fit the model using your preferred scoring metric/function
● But you can also use post-hoc probability thresholds to adjust your model’s
predictions
That’s a lot of work
Finding patterns in text data
Unsupervised methods
● Can we extract meaningful insights more quickly and efficiently?
● Or, what if our documents are already categorized?
● There are a number of methods you can use to identify key themes and/or
make comparisons without needing to manually label documents yourself
Finding patterns in text data
Unsupervised methods
● First up, we might look at words that commonly co-occur
● Called “collocations” or “co-occurrences”
● Great way to get a quick look at common themes
Finding patterns in text data
Unsupervised methods
● Might want to compare documents (or words) to one another
● Possible applications
○ Spelling correction
○ Document deduplication
○ Measure similarity of language
■ Politicians’ speeches
■ Movie reviews
■ Product descriptions
Finding patterns in text data
Unsupervised methods
Levenshtein distance
● Compute the number of single-character edits needed to turn one word/document into another
● Can express as a ratio (percent of the word/document that needs to change) to measure similarity
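A standard dynamic-programming implementation of Levenshtein distance, with the similarity ratio described above:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance (insert/delete/substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete from a
                curr[j - 1] + 1,           # insert into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def similarity(a, b):
    # share of the longer string that does NOT need to change
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))  # 3
```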
Finding patterns in text data
Unsupervised methods
Cosine similarity
● Compute the “angle” between two word vectors
● TF-IDF: axes are the weighted frequencies for
each word
● Word2Vec: axes are the learned dimensions
from the model
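Cosine similarity reduces to a dot product over vector norms, whatever the axes represent (TF-IDF weights or Word2Vec dimensions):

```python
import math

def cosine_similarity(u, v):
    # cosine of the angle between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# same direction -> ~1.0; no shared dimensions -> 0.0
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```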
Finding patterns in text data
Unsupervised methods
Clustering
● Algorithms that use word vectors (TF-IDF,
Word2Vec, etc.) to identify structural
groupings between observations (words,
documents)
● K-Means is a very commonly used one
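A toy version of K-Means (Lloyd’s algorithm) on 2-D points; initialization here is deterministic (the first k points) so the example is reproducible, whereas real implementations randomize it:

```python
# Tiny k-means (Lloyd's algorithm) on 2-D points
def kmeans(points, k, iters=10):
    centers = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            i = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                          + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # move each center to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

# two obvious blobs: near (0, 0) and near (10, 10)
points = [(0, 0), (10, 10), (0, 1), (1, 0), (1, 1), (9, 10), (10, 9), (9, 9)]
centers, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # [4, 4]
```

With document vectors in place of 2-D points, the same loop groups documents by similarity.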
Finding patterns in text data
Unsupervised methods
Hierarchical/agglomerative clustering
● Start with each observation in its own cluster, merge the closest pair, and repeat until there’s only one group left
Finding patterns in text data
Unsupervised methods
Network analysis
● Can also get creative
● After all, we’re just working
with columns and numbers
● Example: link words together
by their strongest correlations
Finding patterns in text data
Unsupervised methods
Pointwise mutual information
● Based on information theory
● Compares conditional and joint
probabilities to measure the
likelihood of a word occurring
with a category/outcome, beyond
random chance
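With hypothetical counts, PMI compares the observed co-occurrence probability to what independence would predict:

```python
import math

# Hypothetical counts for one word and one category (numbers are made up)
n_docs = 1000   # total documents
n_word = 100    # documents containing the word
n_cat  = 200    # documents in the category
n_both = 80     # documents with both

p_word = n_word / n_docs
p_cat  = n_cat / n_docs
p_both = n_both / n_docs

# PMI > 0: the word and category co-occur more than chance would predict
pmi = math.log2(p_both / (p_word * p_cat))
print(pmi)  # ~2.0: co-occurrence is 4x what independence predicts
```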
Finding patterns in text data
Unsupervised methods
Topic modeling
● Algorithms that characterize
documents in terms of topics
(groups of words)
● Find topics that best fit the
data
Part 2
Topic Modeling
Topic modeling
● Out of all of these methods, people are often most interested in topic
modeling
○ Documents can have multiple topics
○ Continuous measure
○ Feels like it fully maps the “space” of themes
○ Results often impress - it feels like magic!
Motivation
● Topic models are increasingly popular across the social sciences
● Researchers use them to estimate conceptual prevalence and to classify
documents
● But how well do topic models capture the concepts we care about?
● To determine what topic models can and can’t tell us, we need to compare
them against human-coded benchmarks
How Topic Models Work
● Scans a set of documents, examines how words/phrases co-occur
● Learns clusters of words that best characterize the documents
● Many different algorithms: LDA, NMF, STM, CorEx
● All you need to provide them is the number of topics
How Topic Models Work
● Input
How Topic Models Work
● Output #1: Term-Topic Matrix
How Topic Models Work
● Difficult to read, so we often look at the words with the highest weights
How Topic Models Work
● Output #2: Document-Topic Matrix
How Topic Models Work
● Can convert to binary labels using a threshold
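Thresholding a document-topic matrix into binary labels (the weights and cutoff below are hypothetical):

```python
# Hypothetical document-topic weights for 3 documents and 3 topics
doc_topic = [
    [0.70, 0.05, 0.25],
    [0.10, 0.80, 0.10],
    [0.40, 0.35, 0.25],
]

THRESHOLD = 0.3  # illustrative cutoff

# a document "has" a topic if its weight clears the threshold
labels = [[1 if w >= THRESHOLD else 0 for w in row] for row in doc_topic]
print(labels)  # [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
```

Note that documents can clear the threshold for more than one topic (like the third row), which is part of the appeal of topic models.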
Asking a Big Question
● 4,492 open-ended responses (92% completion rate)
● Average response: 41 words
● Longest response: 939 words
○ ...not counting the respondent that emailed us a 14-page essay
● IQR: 12 - 53 words
Let’s try topic models!
● 1-3 grams, lemmatization -> 2,796 ngrams
● 3 models: LDA, NMF, CorEx
● 50 topics each
● Matched topics up using cosine similarity
LDA vs. NMF vs. CorEx (matched topics)

Topic 1
  LDA: travel, able, time, spend, able travel, term, family, spend time, time family, long term
  NMF: travel, able travel, ability, ability travel, freedom, food, music, reading, travel friend, travel enjoy
  CorEx: able, able travel, quality time, travel, able work, able provide, family able, quality, able enjoy, health able

Topic 2
  LDA: kid, grandkids, married, retirement, health, spouse, fortunate, happily, great, kid grandkids
  NMF: kid, grandkids, kid grandkids, school, grow, house, getting, kid grow, husband kid, family kid
  CorEx: happy, kid, make happy, grandkids, kid grandkids, happy family, dog, family happy, wife kid, family kid

Topic 3
  LDA: happy, nice, house, make, home, doing, don, love, job, time
  NMF: home, paid, nice, comfortable, beautiful, car, comfortable home, family home, nice home, country
  CorEx: home, house, paid, car, nice, afford, vacation, able afford, beautiful, family home

Topic 4
  LDA: new, learning, supportive, work, learn, school, college, experience, field, learning new
  NMF: new, learning, experience, learning new, place, learn, learn new, trying, house, recently
  CorEx: learning, new, class, environment, knowledge, garden, active, nature, local, learning new

Topic 5
  LDA: career, hobby, family, pet, ability, work, home, creative, retirement, family career
  NMF: career, successful, family career, goal, school, college, hobby, degree, support, partner
  CorEx: career, balance, rewarding, family career, work balance, colleague, partner, professional, career family, owning home

Topic 6
  LDA: loving, jesus, christ, jesus christ, lord, loving family, god, family, blessed, savior
  NMF: jesus, christ, jesus christ, lord, savior, lord jesus, lord jesus christ, knowing, faith jesus, serve
  CorEx: jesus, christ, jesus christ, lord, savior, lord jesus, need met, grandchild great, grandchild great grandchild, thank god
Topic models are magic!

TRAVEL!
  LDA: travel, able, time, spend, able travel, term, family, spend time, time family, long term
  NMF: travel, able travel, ability, ability travel, freedom, food, music, reading, travel friend, travel enjoy
  CorEx: able, able travel, quality time, travel, able work, able provide, family able, quality, able enjoy, health able

CHILDREN!
  LDA: kid, grandkids, married, retirement, health, spouse, fortunate, happily, great, kid grandkids
  NMF: kid, grandkids, kid grandkids, school, grow, house, getting, kid grow, husband kid, family kid
  CorEx: happy, kid, make happy, grandkids, kid grandkids, happy family, dog, family happy, wife kid, family kid

HOME LIFE!
  LDA: happy, nice, house, make, home, doing, don, love, job, time
  NMF: home, paid, nice, comfortable, beautiful, car, comfortable home, family home, nice home, country
  CorEx: home, house, paid, car, nice, afford, vacation, able afford, beautiful, family home

LEARNING!
  LDA: new, learning, supportive, work, learn, school, college, experience, field, learning new
  NMF: new, learning, experience, learning new, place, learn, learn new, trying, house, recently
  CorEx: learning, new, class, environment, knowledge, garden, active, nature, local, learning new

CAREER!
  LDA: career, hobby, family, pet, ability, work, home, creative, retirement, family career
  NMF: career, successful, family career, goal, school, college, hobby, degree, support, partner
  CorEx: career, balance, rewarding, family career, work balance, colleague, partner, professional, career family, owning home

CHRISTIANITY!
  LDA: loving, jesus, christ, jesus christ, lord, loving family, god, family, blessed, savior
  NMF: jesus, christ, jesus christ, lord, savior, lord jesus, lord jesus christ, knowing, faith jesus, serve
  CorEx: jesus, christ, jesus christ, lord, savior, lord jesus, need met, grandchild great, grandchild great grandchild, thank god
Time for analysis?
Wait a second - I’m a skeptic
Let’s take a closer look at our topics...
grandchild
grand
child grandchild
child grand
grandchild great
great grandchild
grand child
child
florida
Conceptually spurious words
● Topic models pick up on word co-occurrences but can’t guarantee that they’re related conceptually
● Topics can often contain words that don’t align with how we want to interpret the topic
● This can inflate estimated document proportions
Conceptually spurious words
Topic words: career, balance, rewarding, family career, work balance, colleague, partner, professional, career family, owning home
Target concept: Career | Topic proportion: 4.9%

Topic words: healthy, family healthy, extended, healthy family, happy healthy, extended family, healthy happy, supportive, immediate, healthy child
Target concept: Health | Topic proportion: 6.5%

Topic words: child, member, family member, husband child, child family, family child, health child, child doing, child husband, helping child
Target concept: Children | Topic proportion: 17.6%
Conceptually spurious words
Topic words: career, balance, rewarding, family career, work balance, colleague, partner, professional, career family, owning home
Target concept: Career | Topic proportion: 4.9% (corrected: 2.7%) | Over-estimation: 81%

Topic words: healthy, family healthy, extended, healthy family, happy healthy, extended family, healthy happy, supportive, immediate, healthy child
Target concept: Health | Topic proportion: 6.5% (corrected: 4.8%) | Over-estimation: 34%

Topic words: child, member, family member, husband child, child family, family child, health child, child doing, child husband, helping child
Target concept: Children | Topic proportion: 17.6% (corrected: 16.8%) | Over-estimation: 5%
Conceptually spurious words
● How sensitive are models to this problem? How much do document proportions change, on average?
● Computing document proportions
○ CorEx provides natively dichotomous document predictions
○ For LDA and NMF, we used a 10% topic probability threshold, which yields similar proportions to CorEx
● Iterated over each topic’s top 10 words and held each word out from the vocabulary
● Averaged the impact on document proportions for each topic
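The held-out-word check can be sketched in a few lines. This toy version uses hypothetical documents and simple any-word matching in place of the actual LDA/NMF/CorEx probability thresholds, but the logic is the same: score a topic's prevalence, drop each top word in turn, and measure the relative change.

```python
# Toy sketch (hypothetical data, not the talk's LDA/NMF/CorEx pipeline) of the
# held-out-word sensitivity check described above.

def topic_prevalence(docs, topic_words):
    """Share of documents containing at least one of the topic's words."""
    hits = sum(1 for doc in docs if topic_words & set(doc.split()))
    return hits / len(docs)

def holdout_impacts(docs, topic_words):
    """Relative change in prevalence when each word is held out in turn."""
    base = topic_prevalence(docs, topic_words)
    return {
        word: (topic_prevalence(docs, topic_words - {word}) - base) / base
        for word in topic_words
    }

docs = [
    "my career is rewarding",
    "work life balance matters",
    "time with my family",
    "a rewarding job and a loving family",
]
impacts = holdout_impacts(docs, {"career", "balance", "rewarding"})
```

Here removing "balance" or "rewarding" each cuts the toy topic's prevalence by a third, while "career" turns out to be redundant — the same kind of asymmetry the table below shows for the real models.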
Conceptually spurious words
● Some models/topics are more sensitive than
others
● Average impact of removing one of the top 10
words from a topic was a 9 to 13% decrease in
the topic’s prevalence in the corpus
● For some topics, a single word was
responsible for nearly a quarter of the topic’s
prevalence
● A single out-of-place word can substantially
inflate model estimates
Model | Smallest topic effect | Average topic effect | Largest topic effect
LDA | -2% | -9% | -17%
NMF | -8% | -13% | -21%
CorEx | -5% | -11% | -24%
Conceptually spurious words
● And your topics may have more than one...

Topics:
career, balance, rewarding, family career, work balance, colleague, partner, professional, career family, owning home
healthy, family healthy, extended, healthy family, happy healthy, extended family, healthy happy, supportive, immediate, healthy child
child, member, family member, husband child, child family, family child, health child, child doing, child husband, helping child
Undercooked topics
● Conceptually spurious words are sometimes part of a more systematic
problem: “undercooked” topics
● These topics appear to contain multiple distinct concepts
● These concepts co-occur but we may want to measure them separately
Undercooked topics
Regardless of the algorithm and
number of topics used, you’ll
often get results like this:
job, health, money, family health, health
family, family job, job enjoy, job family,
great job, paying
Undercooked topics
Or this:
grandchild, child grandchild, wife, wife
child, great grandchild, wife family, wife
son, family wife, family grandchild,
wonderful wife
Undercooked topics
Or this:
learning, new, class, environment,
knowledge, garden, active, nature, local,
learning new
Undercooked topics
● We could increase the number of topics to encourage the model to split these
“undercooked” topics up
Overcooked topics
But in this very same
model, we also have topics
like this:
marriage, happy marriage, great
marriage, marriage child, job
marriage, marriage family,
marriage relationship, marriage
kid, marriage strong, marriage
wonderful
Overcooked topics
In fact, “wife” is elsewhere in the model
Overcooked topics
Do we reduce the number of topics and hope
they merge?
Overcooked topics
● Depending on your data, they might not
● Respondents don’t tend to mention “wife” and “marriage” together
● They’re effectively synonyms, just like:
○ “job” and “career”
○ “health” and “healthy”
○ “child” and “kid”
Overcooked topics
● Like conceptually spurious words, missing words in overcooked topics can affect document estimates substantially

Topic words: career, balance, rewarding, family career, work balance, colleague, partner, professional, career family, owning home
Target concept: Career | Topic proportion: 4.9%

Topic words: healthy, family healthy, extended, healthy family, happy healthy, extended family, healthy happy, supportive, immediate, healthy child
Target concept: Health | Topic proportion: 6.5%

Topic words: child, member, family member, husband child, child family, family child, health child, child doing, child husband, helping child
Target concept: Children | Topic proportion: 17.6%
Overcooked topics
● Like conceptually spurious words, missing words in overcooked topics can affect document estimates substantially

Topic words: career, family career, work balance, colleague, professional, career family
Target concept: Career | Topic proportion: 2.7% (with missing word: 18.1%) | Missing word: job | Under-estimation: 575%

Topic words: healthy, family healthy, healthy family, happy healthy, healthy happy, healthy child
Target concept: Health | Topic proportion: 4.8% (with missing word: 19.8%) | Missing word: health | Under-estimation: 309%

Topic words: child, husband child, child family, family child, health child, child doing, child husband, helping child
Target concept: Children | Topic proportion: 16.8% (with missing word: 24.0%) | Missing word: kid | Under-estimation: 42%
Overcooked topics
● Overcooked topics are less common in models with only unigrams
● But some crucial words are still missing

Topic words: career, successful, goal, school, college, degree, support, hobby, education, partner
Target concept: Career | Missing word: job

Topic words: healthy, beautiful, body, mentally, importantly, peaceful, committed, bike, fair, honest
Target concept: Health | Missing word: health

Topic words: child, grandchild, wonderful, taken, thank, health, blessed, able, independent, enjoyed
Target concept: Children | Missing word: kid
Overcooked topics
● Overcooked topics are less common in models with only unigrams
● But some crucial words are still missing
● And undercooked topics / conceptually spurious words are more common
Topic models have problems
● We could aggregate multiple overcooked topics together
● But that still doesn’t deal with undercooked topics or spurious words
● These problems may be less common for larger corpora, but in our experience
they’re always present to some degree
But how big are these problems?
● Can we still use topic models to classify documents?
● Ideally, we should be able to influence or modify the topic model:
○ Split apart topics
○ Combine topics
○ Add missing words
○ Remove bad words
The solution: anchored topic models?
● Leveraged CorEx because it allows you to nudge the model with “anchor
words”
○ CorEx (Correlation Explanation) topic model:
https://github.com/gregversteeg/corex_topic
○ Anchored Correlation Explanation: Topic Modeling with Minimal
Domain Knowledge, Gallagher et al., TACL 2017.
● Semi-supervised approach
The solution: anchored topic models?
● Built a little interface around CorEx
● Researchers can view the results, add/change anchor words, and re-run the
model
The solution: anchored topic models?
● Iteratively developed two models
○ Smaller model with 55 topics
○ Larger model with 110 topics
● Confirmed good words as anchors
● Used domain knowledge to add additional anchor words
● Used “junk” topics to anchor bad words and pull them away
Our topics look much better now!
Before anchoring: health, family health, health family, friend health, health issue, family friend health, health job, job health, issue, health child
After anchoring: health, healthy, family health, health family, family healthy, healthy family, friend health, health job, job health, healthy child

Before anchoring: money, want, term, long term, worry, want want, worry money, need want, time money, making money
After anchoring: money, financial, financially, pay, income, debt, paid, afford, worry, paying

Before anchoring: job, family job, job family, time job, friend job, paying job, kid job, paying, job time, doing job
After anchoring: work, job, working, career, business, family job, company, professional, family work, employed

Before anchoring: jesus, christ, church, jesus christ, lord, community, church family, christian, lord jesus, family church
After anchoring: jesus, christ, jesus christ, christian, bible, lord jesus, faith jesus, relationship jesus, lord jesus christ, christian faith
Interpretation and operationalization
● Some topics looked better in one model over the other
● Identified 92 potentially interesting and interpretable topics
● Assigned each topic a short description and developed a codebook
Topic: health, healthy, family health, health family, family healthy, healthy family, friend health, health job, job health, healthy child
Interpretation: Being in good health (themselves or others)

Topic: money, financial, financially, pay, income, debt, paid, afford, worry, paying
Interpretation: Money and finances (any references to money/wealth at all)

Topic: work, job, working, career, business, family job, company, professional, family work, employed
Interpretation: Their career, work, job, or business (good or bad)

Topic: jesus, christ, jesus christ, christian, bible, lord jesus, faith jesus, relationship jesus, lord jesus christ, christian faith
Interpretation: Specific reference to Christianity
Exploratory coding
● Drew exploratory samples of 60 documents for each topic
● Oversampled on topics’ topic words
● Used Cohen’s Kappa to get a rough sense of IRR
● Abandoned 61 topics:
○ 22 were vague or conceptually uninteresting
○ 39 had particularly bad initial agreement
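Cohen's Kappa is simple enough to compute from first principles: it compares observed agreement between two coders against the agreement expected by chance from their marginal label rates. The labels below are made up for illustration.

```python
# Cohen's kappa from first principles, for two coders' binary topic labels.
# Example labels are hypothetical.

def cohens_kappa(coder_a, coder_b):
    n = len(coder_a)
    # Observed agreement: share of documents where the coders match
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal rates, summed over labels
    labels = set(coder_a) | set(coder_b)
    expected = sum(
        (coder_a.count(lab) / n) * (coder_b.count(lab) / n) for lab in labels
    )
    return (observed - expected) / (1 - expected)

a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
b = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(a, b)  # 80% raw agreement, but kappa is only ~0.58
```

Note how 80% raw agreement collapses to a Kappa of roughly 0.58 once chance agreement on the common "not this topic" label is accounted for — which is why raw percent agreement is a poor IRR measure for rare topics.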
Codebook validation
● 31 topics remained
● Drew random samples of 100 documents for each topic
● Two researchers coded each
● Tried to achieve 95% confidence that Kappas were > 0.7
● 17 succeeded
● 7 failed (bad reliability)
● 7 were too rare (samples were too small)
○ But we were able to salvage them by coding new samples that oversampled on anchor words
Assessing model performance
● 24 final topics with acceptable human IRR
● Resolved disagreements to produce gold standards for each topic
● Compared coders to:
○ The raw CorEx topics
○ Regular expressions of each topic’s anchor words
○ An XGBoost classifier using Word2Vec and CorEx features
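The anchor-word regular expressions can be sketched like this: a document expresses a topic if any anchor word matches on a word boundary. The anchor lists below are illustrative placeholders, not the study's actual anchors.

```python
import re

# Hedged sketch of an anchor-word RegEx classifier. Anchor lists are
# illustrative, not the study's actual anchor words.

ANCHORS = {
    "career": ["job", "jobs", "career", "work", "working"],
    "christianity": ["jesus", "christ", "christian", "bible", "savior"],
}

# One case-insensitive pattern per topic, with word boundaries so "work"
# doesn't match "network"
PATTERNS = {
    topic: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE)
    for topic, words in ANCHORS.items()
}

def classify(text):
    """Return a dict of topic -> whether any anchor word matched."""
    return {topic: bool(pattern.search(text)) for topic, pattern in PATTERNS.items()}

labels = classify("I love my job and my faith in Jesus Christ")
```

Despite its simplicity, this is the kind of baseline that matched or beat the machine learning approaches in the results below.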
Assessing model performance
● At least one method worked for 23 of the 24 final topics
● CorEx: 18 good topics
● XGBoost: 21 good topics
● Anchor RegEx: 23 good topics
Kappa | Classifier | CorEx | RegEx
Average | .88 | .89 | .93
Min | .56 | .55 | .69
25% | .83 | .82 | .91
50% | .93 | .94 | .94
75% | .97 | .96 | .99
Max | 1.0 | 1.0 | 1.0
Assessing model performance
● Anchor RegExes were only outperformed by ML twice
● And never by more than 2%
● On topics that were already near perfect agreement
Topic | Classifier | CorEx | Dictionary
Family | .96 | .96 | .94
Travel | 1.0 | .99 | .99
Assessing model performance
● Anchor RegExes outperformed CorEx for 9 topics:
○ Average CorEx Kappa: .79
○ Average Anchor RegEx Kappa: .91
● Anchor RegExes outperformed XGBoost for 12 topics:
○ Average XGBoost Kappa: .81
○ Average Anchor RegEx Kappa: .9
Key findings
1. Unsupervised topic models are riddled with problems
Conceptually spurious words, polysemous words, and over/undercooked topics are common and can result in frequent false positives/negatives
Key findings
2. Topics often cannot be interpreted reliably
Topics that looked coherent and meaningful actually weren’t. We abandoned 61 of 92 topics because they were too vague or difficult to interpret.
This only became clear after extensive exploratory coding.
Key findings
3. Your interpretations may not be as clear to others as they are to you
Even after filtering down to 31 topics that seemed coherent and interpretable, we still had to abandon 7 of them. Our definitions (which seemed reasonable on paper) were still too poorly specified.
This only became clear after extensive manual validation and inter-rater
reliability testing.
Key findings
4. Manual validation is the only way to ensure that topics are
measuring what you claim they are
We still had to throw out 6 of our 24 final topics because the topic models’
output didn’t align with our own interpretation of the topics.
Key findings
5. Topic models perform identically to or worse than less
sophisticated approaches
Manually-curated lists of words allowed us to salvage 5 of the 6 bad topics,
because word searches outperformed the machine learning approaches. Even
when both the RegExes and machine learning methods worked, the
automated methods provided absolutely no discernible benefit.
Thank you!
Questions and comments encouraged!
pvankessel@pewresearch.org
ahughes@pewresearch.org
Exploratory coding
● Bad agreement often resulted from polysemous (multi-context) words
● Example: “opinions on politics, society, and the state of the world”
● The problem word: “world”
● Including it alongside anchors like “politics” and “government” produced too
many false positives (“traveling the world”; “making the world a better place”)
● Excluding it produced too many false negatives (“I feel like the world is going
crazy right now”)
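The "world" dilemma is easy to see with two regular expressions: a bare word-boundary match fires on travel talk, while requiring a nearby political term trades those false positives for false negatives. This is purely illustrative; the patterns and examples are not from the study.

```python
import re

# Illustrative only: why "world" is hard to anchor. A bare match over-fires;
# requiring a political term within ~60 characters after it under-fires.

bare = re.compile(r"\bworld\b", re.IGNORECASE)
political = re.compile(
    r"\bworld\b(?=.{0,60}\b(?:politic\w*|government)\b)", re.IGNORECASE
)

texts = [
    "traveling the world with my family",        # false positive for bare match
    "the state of the world and our politics",   # true positive
    "I feel like the world is going crazy",      # false negative for context rule
]
bare_hits = [bool(bare.search(t)) for t in texts]
political_hits = [bool(political.search(t)) for t in texts]
```

Neither pattern gets all three right, which is the point: some polysemous words cannot be rescued by keyword rules alone.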
Keyword oversampling
● For the 7 with good Kappas but low confidence, we drew new samples
● Oversampled using the anchor words: 50% with, 50% without
● Sample sizes were determined based on the expected Kappa (from the
random samples)
● Computed Kappas (using sampling weights) and salvaged all 7
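The weighting step can be sketched as follows: agreement and chance agreement become weight-averaged rather than simple proportions, where each weight would be the inverse of a document's selection probability under the 50/50 oversampling design. Labels and weights below are hypothetical.

```python
# Sketch of Cohen's kappa with sampling weights: observed and expected
# agreement are weight-averaged. Labels and weights are hypothetical.

def weighted_kappa(coder_a, coder_b, weights):
    total = sum(weights)
    # Weighted observed agreement
    observed = sum(w for a, b, w in zip(coder_a, coder_b, weights) if a == b) / total
    # Weighted chance agreement from each coder's weighted marginal rates
    expected = 0.0
    for lab in set(coder_a) | set(coder_b):
        p_a = sum(w for a, w in zip(coder_a, weights) if a == lab) / total
        p_b = sum(w for b, w in zip(coder_b, weights) if b == lab) / total
        expected += p_a * p_b
    return (observed - expected) / (1 - expected)

a = [1, 0, 1, 0]
b = [1, 0, 0, 0]
w = [1, 2, 1, 2]  # e.g. inverse selection probabilities
kappa = weighted_kappa(a, b, w)
```

With uniform weights this reduces to ordinary Cohen's Kappa, so the same statistic covers both the random and the oversampled designs.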
Tools and Resources
Open-source tools
Python
NLTK, scikit-learn, pandas, numpy, scipy, gensim, spacy, etc
R
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Java
Stanford CoreNLP + many other useful libraries
https://nlp.stanford.edu/software/
Commercial tools: Cloud-based NLP

Services compared: Amazon Comprehend, Google Cloud Natural Language, IBM Watson NLU
Features compared: keyphrase/snippet extraction, keywords, entities, entity sentiment, document sentiment, sentence sentiment, emotions, categories, topic models, concepts, language detection, syntax/semantic roles
[Slide shows a feature-availability matrix across the three services]
Amazon Comprehend
● Provides topic modeling capabilities

Google Cloud Natural Language
● Leverages Google’s deep learning models
● Very extensive semantic-role and document-parsing capabilities

IBM Watson NLU
● Ability to create domain-specific models through Watson Knowledge Studio
Commercial tools: SPSS Text Modeler
● Advanced control over entity extraction engine
● Rule based categorization
● Co-occurrence visualizations
● Textlink analysis visualizations
● Ability to export model to be used in SPSS stream
SPSS Text Modeler
Commercial tools: Provalis
Arguably the most comprehensive mixed-methods text analytics software
Provalis
Commercial tools: Quid
Allows interactive adjustments of the underlying topic/clustering model through its graph-based UI
Commercial tools: Thematically
Questions?
(Or time for a demo?)
NYAAPOR 2019 - An Intro to Topic Models for Text Analysis

  • 1. An Intro to Topic Models for Text Analysis Patrick van Kessel, Senior Data Scientist Pew Research Center
  • 3. Agenda ● The role of text in social research ● Data selection and preparation ● Processing: Turning text into data ● Analysis: Finding patterns in text data ● Visualization: Turning patterns into insights
  • 4. The role of text in social research
  • 5. ● Free of assumptions ● Potential for richer insights relative to closed-format responses ● If organic, then data collection costs are often negligible The role of text in social research Why text?
  • 6. The role of text in social research Where do I find it? ● Open-ended surveys / focus groups / transcripts / interviews ● Social media data (tweets, FB posts, etc.) ● Long-form content (articles, notes, logs, etc.)
  • 7. The role of text in social research What makes it challenging? ● Messy ○ “Data spaghetti” with little or no structure ● Sparse ○ Low information-to-data ratio (lots of hay, few needles) ● Often organic (rather than designed) ○ Can be naturally generated by people and processes ○ Often without a research use in mind
  • 8. Data selection and preparation
  • 9. Data selection and preparation ● Know your objective and subject matter (if needed find subject matter expert) ● Get familiar with the data ● Don’t make assumptions - know your data, quirks and all
  • 10. Data selection and preparation Text Acquisition and Preprepation Select relevant data (text corpus) ● Content ● Metadata Prepare the input file ● Determine unit of analysis ● Process text to get one document per unit of analysis
  • 13. Turning text into data ● How do we sift through text and produce insight? ● Might first try searching for keywords ● How many times is “analysis” mentioned? Original 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
  • 14. Turning text into data ● How do we sift through text and produce insight? ● Might first try searching for keywords ● How many times is “analysis” mentioned? Original 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics We missed this one And this one too
  • 15. Turning text into data ● Variations of words can have the same meaning but look completely different to a computer Original 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
  • 16. Turning text into data Regular Expressions ● A more sophisticated solution: regular expressions ● Syntax for defining string (text) patterns Original 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
  • 17. Turning text into data Regular Expressions ● Can use to search text or extract specific chunks ● Example use cases: ○ Extracting dates ○ Finding URLs ○ Identifying names/entities ● https://regex101.com/ ● http://www.regexlib.com/
  • 18. Turning text into data Regular Expressions \banaly[a-z]+\b Original 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
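The pattern on this slide can be tried directly in Python. A minimal sketch using only the standard library `re` module; the three toy documents are the ones from the slides:

```python
import re

docs = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
]

# \banaly[a-z]+\b matches any whole word that starts with "analy",
# so "analysis", "analyzing", and "analytics" all hit the same pattern.
pattern = re.compile(r"\banaly[a-z]+\b", re.IGNORECASE)

matches = [pattern.findall(doc) for doc in docs]
```

Unlike a plain search for “analysis”, every document now registers a match.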
  • 19. Turning text into data Regular Expressions Regular expressions can be extremely powerful… ...and terrifyingly complex: URLS: ((https?://(www\.)?)?[-a-zA-Z0-9@:%._+~#=]{2,4096}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_+.~#?&//=]*)) DOMAINS: (?:http[s]?://)?(?:www(?:s?)\.)?([\w.-]+)(?:[/](?:.+))? MONEY: \$([0-9]{1,3}(?:(?:,[0-9]{3})+)?(?:\.[0-9]{1,2})?)\s
  • 20. Turning text into data Pre-processing ● Great, but we can’t write patterns for everything ● Words are messy and have a lot of variation ● We need to collapse semantically similar variants ● We need to pre-process the text Original 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
  • 21. Turning text into data Pre-processing ● Common first steps: ○ Spell check / correct ○ Remove punctuation / expand contractions Original Processed 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
  • 22. Turning text into data Pre-processing ● Now to collapse words with the same meaning ● We do this with stemming or lemmatization ● Break words down to their roots Original Processed 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
  • 23. Turning text into data Pre-processing ● Stemming is more conservative ● There are many different stemmers ● Here’s the Porter stemmer (1980) Original Processed 1 Text analysis is fun Text analysi is fun 2 I enjoy analyzing text data I enjoy analyz text data 3 Data science often involves text analytics Data scienc often involv text analyt
  • 25. Turning text into data Pre-processing ● The Lancaster stemmer (1990) is newer and more aggressive ● Truncates words a LOT Original Processed 1 Text analysis is fun text analys is fun 2 I enjoy analyzing text data I enjoy analys text dat 3 Data science often involves text analytics dat sci oft involv text analys
  • 26. Turning text into data Pre-processing ● Lemmatization uses linguistic relationships and parts of speech to collapse words down to their root form - so you get actual words (“lemma”), not stems ● WordNet Lemmatizer Original Processed 1 Text analysis is fun text analysis is fun 2 I enjoy analyzing text data I enjoy analyze text data 3 Data science often involves text analytics data science often involve text analytics
  • 27. Turning text into data Pre-processing ● Picking the right method depends on how much you want to preserve nuance or collapse meaning ● We’ll stick with Lancaster Original Processed 1 Text analysis is fun text analys is fun 2 I enjoy analyzing text data I enjoy analys text dat 3 Data science often involves text analytics dat sci oft involv text analys
  • 28. Turning text into data Pre-processing ● Finally, we need to remove words that don’t hold meaning themselves ● These are called “stopwords” ● Can expand standard stopword lists with custom words Original Processed 1 Text analysis is fun text analys fun 2 I enjoy analyzing text data enjoy analys text dat 3 Data science often involves text analytics dat sci oft involv text analys
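The pre-processing steps sketched on the last few slides (lowercasing, punctuation removal, stopword removal) can be illustrated in a few lines. The stopword list here is a toy stand-in; real lists such as NLTK’s contain a few hundred entries:

```python
import string

# Toy stopword list for illustration; real lists are much longer.
STOPWORDS = {"is", "i", "a", "the", "of", "and"}

def preprocess(text):
    # Lowercase, strip punctuation, drop stopwords.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

tokens = preprocess("Text analysis is fun")
# -> ['text', 'analysis', 'fun']
```

Stemming or lemmatization (e.g. NLTK’s PorterStemmer or WordNetLemmatizer) would slot in as one more step after the stopword filter.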
  • 29. Turning text into data Tokenization ● Now we need to tokenize ● Break words apart according to certain rules ● Usually breaks on whitespace and punctuation ● What’s left are called “tokens” ● Single tokens or pairs of two or more tokens are called “ngrams”
  • 30. Turning text into data Tokenization ● Unigrams
           text  analys  fun  enjoy  dat  sci  oft  involv
    Doc 1   1     1       1
    Doc 2   1     1            1      1
    Doc 3   1     1                   1    1    1    1
  • 31. Turning text into data Tokenization ● Bigrams
    text analys   2
    analys fun    1
    enjoy analys  1
    analys text   1
    text dat      1
    dat sci       1
    sci oft       1
    oft involv    1
    involv text   1
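A quick sketch of unigram and bigram counting over the processed toy documents. The `ngrams` helper is a generic sliding window, not a particular library’s API:

```python
from collections import Counter

docs = [
    ["text", "analys", "fun"],
    ["enjoy", "analys", "text", "dat"],
    ["dat", "sci", "oft", "involv", "text", "analys"],
]

def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = Counter(tok for doc in docs for tok in ngrams(doc, 1))
bigrams = Counter(bg for doc in docs for bg in ngrams(doc, 2))
```

“text analys” shows up twice because it appears in documents 1 and 3, matching the bigram table above.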
  • 32. Turning text into data Tokenization ● If we want to characterize the whole corpus, we can just look at the most frequent words
           text  analys  fun  enjoy  dat  sci  oft  involv
    Doc 1   1     1       1
    Doc 2   1     1            1      1
    Doc 3   1     1                   1    1    1    1
    Total   3     3       1    1      2    1    1    1
  • 33. Turning text into data The Term Frequency Matrix ● But what if we want to distinguish documents from each other? ● We know these documents are about text analysis ● What makes them unique?
  • 34. Turning text into data The Term Frequency Matrix ● Divide the word frequencies by the number of documents they appear in ● Downweight words that are common ● Now we’re emphasizing what makes each document unique
           text  analys  fun  enjoy  dat  sci  oft  involv
    Doc 1   .33   .33     1
    Doc 2   .33   .33          1     .5
    Doc 3   .33   .33                .5    1    1    1
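The downweighting on this slide (raw count divided by document frequency; full TF-IDF uses a log-scaled inverse document frequency instead) might look like this in plain Python:

```python
from collections import Counter

docs = [
    ["text", "analys", "fun"],
    ["enjoy", "analys", "text", "dat"],
    ["dat", "sci", "oft", "involv", "text", "analys"],
]

# Document frequency: in how many documents does each word appear?
df = Counter()
for doc in docs:
    df.update(set(doc))

# Weight each word's count in a document by 1/df, as on the slide.
weighted = [
    {w: round(c / df[w], 2) for w, c in Counter(doc).items()}
    for doc in docs
]
```

Common words like “text” and “analys” shrink to .33 while document-specific words like “fun” keep a full weight of 1.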
  • 35. Turning words into numbers Part-of-Speech Tagging ● TF-IDF is often all you need ● But sometimes you care about how a word is used ● Can use pre-trained part-of-speech (POS) taggers ● Can also help with things like negation ○ “Happy” vs. “NOT happy”
  • 36. Turning words into numbers Named Entity Extraction ● Might also be interested in people, places, organizations, etc. ● Like POS taggers, named entity extractors use trained models
  • 37. Turning words into numbers Word2Vec ● Other methods can quantify words not by frequency, but by their relation ● Word2vec uses a sliding window to read words and learn their relationships; each word gets a vector in N-dimensional space ● https://code.google.com/archive/p/word2vec/
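Training word2vec itself takes a library such as gensim, but the sliding-window idea is easy to sketch: each word gets paired with its neighbors, and those (target, context) pairs are what the model learns from. This helper is purely illustrative:

```python
def context_pairs(tokens, window=2):
    # For each position, pair the target word with every word
    # within `window` positions on either side.
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

pairs = context_pairs(["i", "enjoy", "analyzing", "text", "data"], window=1)
```

Word2vec then learns vectors such that words sharing many context pairs end up close together in the embedding space.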
  • 39. Finding patterns in text data Two types of approaches: ● Supervised NLP: requires training data, learns to predict labels and classes ● Unsupervised NLP: automated, extracts structure from the data
  • 40. Finding patterns in text data Supervised methods ● Often we want to categorize documents ● Classification models can take labeled data and learn to make predictions ● If we read and label some documents ourselves, ML can do the rest
  • 41. Finding patterns in text data Supervised methods Steps: ● Label a sample of documents ● Break your sample into two sets: a training sample and a test sample ● Train a model on the training sample ● Evaluate it on the test sample ● Apply it to the full set of documents to make predictions
  • 42. Finding patterns in text data Supervised methods ● First you need to develop a codebook ● Codebook: set of rules for labeling and categorizing documents ● The best codebooks have clear rules for hard cases, and lots of examples ● Categories should be MECE: mutually exclusive and collectively exhaustive
  • 44. Finding patterns in text data Supervised methods
  • 45. Finding patterns in text data Supervised methods ● Need to validate the codebook by measuring interrater reliability ● Makes sure your measures are consistent, objective, and reproducible ● Multiple people code the same document
  • 46. Finding patterns in text data Supervised methods ● Various metrics to test whether their agreement is high enough ○ Krippendorf’s alpha ○ Cohen’s kappa ● Can also compare coders against a gold standard, if available
  • 47. Finding patterns in text data Supervised methods ● Mechanical Turk can be a great way to code a lot of documents ● Have 5+ Turkers code a large sample of documents ● Collapse them together with a rule ● Code a subset in-house, and compute reliability
  • 48. Finding patterns in text data Supervised methods ● Sometimes you’re trying to measure something that’s really rare ● Consider oversampling ● Example: keyword oversampling ● Disproportionately select documents that are more likely to contain your outcomes
  • 49. Finding patterns in text data Supervised methods ● After coding, split your sample into two sets (~80/20) ○ One for training, one for testing ● We do this to check for (and avoid) overfitting
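The ~80/20 split can be done with a shuffle and a slice. A toy sketch with integers standing in for 100 labeled documents:

```python
import random

random.seed(0)  # fixed seed so the split is reproducible
labeled = list(range(100))  # stand-ins for 100 labeled documents
random.shuffle(labeled)

split = int(len(labeled) * 0.8)
train, test = labeled[:split], labeled[split:]
```

The model only ever sees `train`; `test` is held out so overfitting shows up as a gap between training and test performance.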
  • 50. Finding patterns in text data Supervised methods ● Next step is called feature extraction or feature selection ● Need to extract “features” from the text ○ TF-IDF ○ Word2Vec vectors ● Can also utilize metadata, if potentially useful
  • 51. Finding patterns in text data Supervised methods ● Select a classification algorithm ● A common choice for text data is the support vector machine (SVM) ● Similar to regression, SVMs find the line that best separates two or more groups ● Can also use non-linear “kernels” for better fits (radial basis function, etc.) ● XGBoost is a newer and very promising algorithm
  • 52. Finding patterns in text data Supervised methods ● Word of caution: if you oversampled, your weights are probability (survey) weights ● Most machine learning implementations assume your weights are frequency weights ● Not an issue for predictions - but can produce biased variance estimates if you analyze your training data itself
  • 53. Finding patterns in text data Supervised methods ● Time to evaluate performance ● Lots of different metrics, depending on what you care about ● Often we care about precision/recall ○ Precision: did you pick out mostly needles or mostly hay? ○ Recall: how many needles did you miss? ● Other metrics: ○ Matthews correlation coefficient ○ Brier score ○ Overall accuracy
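Precision and recall reduce to counting true positives, false positives, and false negatives. A self-contained sketch:

```python
def precision_recall(y_true, y_pred):
    # Precision: of everything we flagged, how much was a real needle?
    # Recall: of all real needles, how many did we find?
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

p, r = precision_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
# tp=2, fp=1, fn=1 -> precision 2/3, recall 2/3
```

Which metric to optimize depends on whether missing a needle or grabbing hay is the costlier mistake.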
  • 54. Finding patterns in text data Supervised methods ● Doing just one split leaves a lot up to chance ● To get a more robust estimate of the model’s performance, it’s best to use K-fold cross-validation ● Splits your data into train/test sets multiple times and averages the performance metrics ● Ensures that you didn’t just get lucky (or unlucky)
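Libraries like scikit-learn provide K-fold splitters, but the index bookkeeping is simple enough to sketch by hand:

```python
def kfold_indices(n, k):
    # Yield (train, test) index lists: each of the k folds serves
    # once as the held-out test set.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

splits = list(kfold_indices(10, 5))
```

Training and evaluating once per split, then averaging the metrics, gives the cross-validated performance estimate described on the slide.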
  • 55. Finding patterns in text data Supervised methods ● Model not working well? ● You probably need to tune your parameters ● You can use a grid search to test out different combinations of model parameters and feature extraction methods ● Many software packages can automatically help you pick the best combination to maximize your model’s performance
  • 56. Finding patterns in text data Supervised methods ● Suggested design: ○ Large training sample, coded by Turkers ○ Small evaluation sample, coded by Turkers and in-house experts ○ Compute IRR between Turk and experts ○ Train model on training sample, use 5-fold cross-validation ○ Apply model to evaluation sample, compare results against in-house coders and Turkers
  • 57. Finding patterns in text data Supervised methods ● Some (but not all) models produce probabilities along with their classifications ● Ideally you fit the model using your preferred scoring metric/function ● But you can also use post-hoc probability thresholds to adjust your model’s predictions
  • 58. That’s a lot of work
  • 59. Finding patterns in text data Unsupervised methods ● Can we extract meaningful insights more quickly and efficiently? ● Or, what if our documents are already categorized? ● There are a number of methods you can use to identify key themes and/or make comparisons without needing to manually label documents yourself
  • 60. Finding patterns in text data Unsupervised methods ● First up, we might look at words that commonly co-occur ● Called “collocations” or “co-occurrences” ● Great way to get a quick look at common themes
  • 61. Finding patterns in text data Unsupervised methods ● Might want to compare documents (or words) to one another ● Possible applications ○ Spelling correction ○ Document deduplication ○ Measure similarity of language ■ Politicians’ speeches ■ Movie reviews ■ Product descriptions
  • 62. Finding patterns in text data Unsupervised methods Levenshtein distance ● Compute number of steps needed to turn a word/document into another ● Can express as a ratio (percent of word/document that needs to change) to measure similarity
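A compact dynamic-programming implementation of Levenshtein distance, plus the ratio form mentioned on the slide:

```python
def levenshtein(a, b):
    # Classic edit distance: minimum insertions, deletions,
    # and substitutions needed to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity_ratio(a, b):
    # Express the distance as a similarity between 0 and 1.
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

The same function works on whole documents if you pass token lists instead of strings.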
  • 63. Finding patterns in text data Unsupervised methods Cosine similarity ● Compute the “angle” between two word vectors ● TF-IDF: axes are the weighted frequencies for each word ● Word2Vec: axes are the learned dimensions from the model
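Cosine similarity itself is a few lines over any pair of word vectors, whether the axes are TF-IDF weights or word2vec dimensions:

```python
import math

def cosine_similarity(u, v):
    # Angle-based similarity between two equal-length vectors:
    # 1.0 = same direction, 0.0 = orthogonal (no shared terms).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Because it ignores vector length, a short and a long document about the same topic can still score close to 1.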
  • 64. Finding patterns in text data Unsupervised methods Clustering ● Algorithms that use word vectors (TF-IDF, Word2Vec, etc.) to identify structural groupings between observations (words, documents) ● K-Means is a very commonly used one
  • 65. Finding patterns in text data Unsupervised methods Hierarchical/agglomerative clustering ● Start with all observations, and use a rule to pair them up, and repeat until there’s only one group left
  • 66. Finding patterns in text data Unsupervised methods Network analysis ● Can also get creative ● After all, we’re just working with columns and numbers ● Example: link words together by their strongest correlations
  • 67. Finding patterns in text data Unsupervised methods Pointwise mutual information ● Based on information theory ● Compares conditional and joint probabilities to measure the likelihood of a word occurring with a category/outcome, beyond random chance
  • 68. Finding patterns in text data Unsupervised methods Topic modeling ● Algorithms that characterize documents in terms of topics (groups of words) ● Find topics that best fit the data
  • 70. Topic modeling ● Out of all of these methods, people are often most interested in topic modeling ○ Documents can have multiple topics ○ Continuous measure ○ Feels like it fully maps the “space” of themes ○ Results often impress - it feels like magic!
  • 71. Motivation ● Topic models are increasingly popular across the social sciences ● Researchers use them to estimate conceptual prevalence and to classify documents ● But how well do topic models capture the concepts we care about? ● To determine what topic models can and can’t tell us, we need to compare them against human-coded benchmarks
  • 72. How Topic Models Work ● Scans a set of documents, examines how words/phrases co-occur ● Learns clusters of words that best characterize the documents ● Many different algorithms: LDA, NMF, STM, CorEx ● All you need to provide is the number of topics
  • 73. How Topic Models Work ● Input
  • 74. How Topic Models Work ● Output #1: Term-Topic Matrix
  • 75. How Topic Models Work ● Difficult to read, so we often look at the words with the highest weights
  • 76. How Topic Models Work ● Output #2: Document-Topic Matrix
  • 77. How Topic Models Work ● Can convert to binary labels using a threshold
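Converting a document-topic matrix to binary labels is an elementwise threshold. The matrix and the 0.10 cutoff below are illustrative (the deck later uses a 10% probability threshold for LDA and NMF):

```python
# Toy document-topic matrix: rows are documents, columns are topic weights.
doc_topics = [
    [0.70, 0.05, 0.25],
    [0.02, 0.90, 0.08],
    [0.15, 0.40, 0.45],
]

THRESHOLD = 0.10  # label a document with every topic at or above this weight

labels = [[int(w >= THRESHOLD) for w in row] for row in doc_topics]
# -> [[1, 0, 1], [0, 1, 0], [1, 1, 1]]
```

Note that documents can receive multiple labels, which is exactly what makes topic models appealing for open-ended text.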
  • 78. Asking a Big Question ● 4,492 open-ended responses (92% completion rate) ● Average response: 41 words ● Longest response: 939 words ○ ...not counting the respondent who emailed us a 14-page essay ● IQR: 12 - 53 words
  • 81. Let’s try topic models! ● 1-3 grams, lemmatization -> 2,796 ngrams ● 3 models: LDA, NMF, CorEx ● 50 topics each ● Matched topics up using cosine similarity
  • 82. Matched topics across models (top terms per topic):
    LDA: travel, able, time, spend, able travel, term, family, spend time, time family, long term
    NMF: travel, able travel, ability, ability travel, freedom, food, music, reading, travel friend, travel enjoy
    CorEx: able, able travel, quality time, travel, able work, able provide, family able, quality, able enjoy, health able
    ---
    LDA: kid, grandkids, married, retirement, health, spouse, fortunate, happily, great, kid grandkids
    NMF: kid, grandkids, kid grandkids, school, grow, house, getting, kid grow, husband kid, family kid
    CorEx: happy, kid, make happy, grandkids, kid grandkids, happy family, dog, family happy, wife kid, family kid
    ---
    LDA: happy, nice, house, make, home, doing, don, love, job, time
    NMF: home, paid, nice, comfortable, beautiful, car, comfortable home, family home, nice home, country home
    CorEx: house, paid, car, nice, afford, vacation, able afford, beautiful, family home
    ---
    LDA: new, learning, supportive, work, learn, school, college, experience, field, learning new
    NMF: new, learning, experience, learning new, place, learn, learn new, trying, house, recently
    CorEx: learning, new, class, environment, knowledge, garden, active, nature, local, learning new
    ---
    LDA: career, hobby, family, pet, ability, work, home, creative, retirement, family career
    NMF: career, successful, family career, goal, school, college, hobby, degree, support, partner
    CorEx: career, balance, rewarding, family career, work balance, colleague, partner, professional, career family, owning home
    ---
    LDA: loving, jesus, christ, jesus christ, lord, loving family, god, family, blessed, savior
    NMF: jesus, christ, jesus christ, lord, savior, lord jesus, lord jesus christ, knowing, faith jesus, serve
    CorEx: jesus, christ, jesus christ, lord, savior, lord jesus, need met, grandchild great, grandchild great grandchild, thank god
  • 84. Topic models are magic! The topics map onto recognizable themes: TRAVEL! CHILDREN! HOME LIFE! LEARNING! CAREER! CHRISTIANITY!
  • 92. Wait a second - I’m a skeptic
  • 94. Let’s take a closer look at our topics...
  • 95. Let’s take a closer look at our topics... grandchild grand child grandchild child grand grandchild great great grandchild grand child child florida
  • 107. Conceptually spurious words ● Topic models pick up on word co-occurrences but can’t guarantee that they’re related conceptually ● Topics can often contain words that don’t align with how we want to interpret the topic ● This can overestimate document proportions
  • 108. Conceptually spurious words
    ● Career (topic proportion: 4.9%): career, balance, rewarding, family career, work balance, colleague, partner, professional, career family, owning home
    ● Health (6.5%): healthy, family healthy, extended, healthy family, happy healthy, extended family, healthy happy, supportive, immediate, healthy child
    ● Children (17.6%): child, member, family member, husband child, child family, family child, health child, child doing, child husband, helping child
  • 111. Conceptually spurious words Topic Target concept Topic proportion Over-estimation career, balance, rewarding, family career, work balance, colleague, partner, professional, career family, owning home Career 4.9% 2.7% 81% over healthy, family healthy, extended, healthy family, happy healthy, extended family, healthy happy, supportive, immediate, healthy child Health 6.5% 4.8% 34% over child, member, family member, husband child, child family, family child, health child, child doing, child husband, helping child Children 17.6% 16.8% 5% over
  • 112. Conceptually spurious words ● How sensitive are models to this problem? How much do document proportions change, on average? ● Computing document proportions ○ CorEx provides natively dichotomous document predictions ○ For LDA and NMF, we used a 10% topic probability threshold, which yields similar proportions to CorEx ● Iterated over each topic’s top 10 words and held each word out from the vocabulary ● Averaged the impact in document proportions for each topic
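The held-out-word sensitivity check described above can be sketched in a few lines of Python. This is a toy stand-in for real model output: here a document counts toward a topic if it contains any of the topic's words (real estimates would come from dichotomous CorEx labels, or LDA/NMF probabilities thresholded at 10%), and the documents and topic words below are invented for illustration.

```python
def topic_proportion(docs, topic_words):
    """Share of documents assigned to the topic (toy assignment rule:
    a document matches if it contains any of the topic's words)."""
    hits = sum(1 for doc in docs if topic_words & set(doc.split()))
    return hits / len(docs)

def word_sensitivity(docs, topic_words):
    """Hold each top word out of the topic's vocabulary in turn and
    average the resulting change in the topic's corpus prevalence."""
    base = topic_proportion(docs, topic_words)
    deltas = []
    for word in topic_words:
        held_out = topic_words - {word}
        deltas.append(topic_proportion(docs, held_out) - base)
    return base, sum(deltas) / len(deltas)

# Invented mini-corpus for illustration
docs = [
    "my family and my health",
    "my career and my job",
    "spending time in florida",
    "my grandchildren in florida",
    "my job pays well",
]
base, avg_delta = word_sensitivity(docs, {"grandchildren", "florida"})
print(base, avg_delta)
```

On this toy corpus, holding out "florida" alone halves the topic's prevalence, which is exactly the kind of single-word inflation the analysis above is measuring.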
  • 113. Conceptually spurious words ● Some models/topics are more sensitive than others ● Average impact of removing one of the top 10 words from a topic was a 9 to 13% decrease in the topic’s prevalence in the corpus ● For some topics, a single word was responsible for nearly a quarter of the topic’s prevalence ● A single out-of-place word can substantially inflate model estimates
    Model | Smallest Topic Effect | Average Topic Effect | Largest Topic Effect
    LDA | -2% | -9% | -17%
    NMF | -8% | -13% | -21%
    CorEx | -5% | -11% | -24%
  • 114. Conceptually spurious words ● And your topics may have more than one...
    Topic
    career, balance, rewarding, family career, work balance, colleague, partner, professional, career family, owning home
    healthy, family healthy, extended, healthy family, happy healthy, extended family, healthy happy, supportive, immediate, healthy child
    child, member, family member, husband child, child family, family child, health child, child doing, child husband, helping child
    Model | Smallest Topic Effect | Average Topic Effect | Largest Topic Effect
    LDA | -2% | -9% | -17%
    NMF | -8% | -13% | -21%
    CorEx | -5% | -11% | -24%
  • 115. Undercooked topics ● Conceptually spurious words are sometimes part of a more systematic problem: “undercooked” topics ● These topics appear to contain multiple distinct concepts ● These concepts co-occur but we may want to measure them separately
  • 116. Undercooked topics Regardless of the algorithm and number of topics used, you’ll often get results like this: job, health, money, family health, health family, family job, job enjoy, job family, great job, paying
  • 117. Undercooked topics Or this: grandchild, child grandchild, wife, wife child, great grandchild, wife family, wife son, family wife, family grandchild, wonderful wife
  • 118. Undercooked topics Or this: learning, new, class, environment, knowledge, garden, active, nature, local, learning new
  • 119. Undercooked topics ● We could increase the number of topics to encourage the model to split these “undercooked” topics up
  • 120. Overcooked topics But in this very same model, we also have topics like this: marriage, happy marriage, great marriage, marriage child, job marriage, marriage family, marriage relationship, marriage kid, marriage strong, marriage wonderful
  • 122. Overcooked topics In fact, “wife” is elsewhere in the model
  • 123. Overcooked topics Do we reduce the number of topics and hope they merge?
  • 124. Overcooked topics ● Depending on your data, they might not ● Respondents don’t tend to mention “wife” and “marriage” together ● They’re effectively synonyms, just like: ○ “job” and “career” ○ “health” and “healthy” ○ “child” and “kid”
  • 125. Overcooked topics ● Like conceptually spurious words, missing words in overcooked topics can affect document estimates substantially
    Topic | Target concept | Topic proportion
    career, balance, rewarding, family career, work balance, colleague, partner, professional, career family, owning home | Career | 4.9%
    healthy, family healthy, extended, healthy family, happy healthy, extended family, healthy happy, supportive, immediate, healthy child | Health | 6.5%
    child, member, family member, husband child, child family, family child, health child, child doing, child husband, helping child | Children | 17.6%
  • 129. Overcooked topics ● Like conceptually spurious words, missing words in overcooked topics can affect document estimates substantially
    Topic | Target concept | Topic proportion | Missing word | Under-estimation
    career, family career, work balance, colleague, professional, career family | Career | 2.7% vs. 18.1% | job | 575% under
    healthy, family healthy, healthy family, happy healthy, healthy happy, healthy child | Health | 4.8% vs. 19.8% | health | 309% under
    child, husband child, child family, family child, health child, child doing, child husband, helping child | Children | 16.8% vs. 24.0% | kid | 42% under
  • 130-131. Overcooked topics ● Overcooked topics are less common in models with only unigrams ● But some crucial words are still missing ● And undercooked topics / conceptually spurious words are more common
    Topic | Target concept | Missing word
    career, successful, goal, school, college, degree, support, hobby, education, partner | Career | job
    healthy, beautiful, body, mentally, importantly, peaceful, committed, bike, fair, honest | Health | health
    child, grandchild, wonderful, taken, thank, health, blessed, able, independent, enjoyed | Children | kid
  • 132. Topic models have problems ● We could aggregate multiple overcooked topics together ● But that still doesn’t deal with undercooked topics or spurious words ● These problems may be less common for larger corpora, but in our experience they’re always present to some degree
  • 133. But how big are these problems? ● Can we still use topic models to classify documents? ● Ideally, we should be able to influence or modify the topic model: ○ Split apart topics ○ Combine topics ○ Add missing words ○ Remove bad words
  • 134. The solution: anchored topic models? ● Leveraged CorEx because it allows you to nudge the model with “anchor words” ○ CorEx (Correlation Explanation) topic model: https://github.com/gregversteeg/corex_topic ○ Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge, Gallagher et al., TACL 2017. ● Semi-supervised approach
  • 135. The solution: anchored topic models? ● Built a little interface around CorEx ● Researchers can view the results, add/change anchor words, and re-run the model
  • 136. The solution: anchored topic models? ● Iteratively developed two models ○ Smaller model with 55 topics ○ Larger model with 110 topics ● Confirmed good words as anchors ● Used domain knowledge to add additional anchor words ● Used “junk” topics to anchor bad words and pull them away
  • 137. Our topics look much better now!
    Before Anchoring: health, family health, health family, friend health, health issue, family friend health, health job, job health, issue, health child
    After Anchoring: health, healthy, family health, health family, family healthy, healthy family, friend health, health job, job health, healthy child
    Before Anchoring: money, want, term, long term, worry, want want, worry money, need want, time money, making money
    After Anchoring: money, financial, financially, pay, income, debt, paid, afford, worry, paying
    Before Anchoring: job, family job, job family, time job, friend job, paying job, kid job, paying, job time, doing job
    After Anchoring: work, job, working, career, business, family job, company, professional, family work, employed
    Before Anchoring: jesus, christ, church, jesus christ, lord, community, church family, christian, lord jesus, family church
    After Anchoring: jesus, christ, jesus christ, christian, bible, lord jesus, faith jesus, relationship jesus, lord jesus christ, christian faith
  • 138. Interpretation and operationalization ● Some topics looked better in one model than the other ● Identified 92 potentially interesting and interpretable topics ● Assigned each topic a short description and developed a codebook
    Topic: health, healthy, family health, health family, family healthy, healthy family, friend health, health job, job health, healthy child | Interpretation: Being in good health (themselves or others)
    Topic: money, financial, financially, pay, income, debt, paid, afford, worry, paying | Interpretation: Money and finances (any references to money/wealth at all)
    Topic: work, job, working, career, business, family job, company, professional, family work, employed | Interpretation: Their career, work, job, or business (good or bad)
    Topic: jesus, christ, jesus christ, christian, bible, lord jesus, faith jesus, relationship jesus, lord jesus christ, christian faith | Interpretation: Specific reference to Christianity
  • 139. Exploratory coding ● Drew exploratory samples of 60 documents for each topic ● Oversampled on topics’ topic words ● Used Cohen’s Kappa to get a rough sense of IRR ● Abandoned 61 topics: ○ 22 were vague or conceptually uninteresting ○ 39 had particularly bad initial agreement
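Cohen's Kappa, the chance-corrected agreement statistic used for the reliability checks above, is straightforward to compute. A minimal version for two coders, with invented labels for illustration:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' categorical labels:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement assumes independent coding with each coder's
    # own marginal label distribution.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy binary codes (1 = document expresses the topic)
a = [1, 1, 0, 0, 1, 0, 1, 0]
b = [1, 1, 0, 0, 1, 0, 0, 1]
print(cohens_kappa(a, b))
```

Note that kappa can be much lower than raw percent agreement when a topic is rare, which is one reason the rare topics below needed oversampled validation samples.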
  • 140. Codebook validation ● 31 topics remained ● Drew random samples of 100 documents for each topic ● Two researchers coded each ● Tried to achieve 95% confidence that Kappas were > 0.7 ● 17 succeeded ● 7 failed (bad reliability) ● 7 were too rare (samples were too small) ○ But we were able to salvage them by coding new samples that oversampled on anchor words
  • 141. Assessing model performance ● 24 final topics with acceptable human IRR ● Resolved disagreements to produce gold standards for each topic ● Compared coders to: ○ The raw CorEx topics ○ Regular expressions of each topic’s anchor words ○ An XGBoost classifier using Word2Vec and CorEx features
  • 142. Assessing model performance ● At least one method worked for 23 of the 24 final topics ● CorEx: 18 good topics ● XGBoost: 21 good topics ● Anchor RegEx: 23 good topics
    Kappa | Classifier | CorEx | RegEx
    Average | .88 | .89 | .93
    Min | .56 | .55 | .69
    25% | .83 | .82 | .91
    50% | .93 | .94 | .94
    75% | .97 | .96 | .99
    Max | 1.0 | 1.0 | 1.0
  • 143. Assessing model performance ● Anchor RegExes were only outperformed by ML twice ● And never by more than 2% ● On topics that were already at near-perfect agreement
    Topic | Classifier | CorEx | Dictionary
    Family | .96 | .96 | .94
    Travel | 1.0 | .99 | .99
  • 144. Assessing model performance ● Anchor RegExes outperformed CorEx for 9 topics: ○ Average CorEx Kappa: .79 ○ Average Anchor RegEx Kappa: .91 ● Anchor RegExes outperformed XGBoost for 12 topics: ○ Average XGBoost Kappa: .81 ○ Average Anchor RegEx Kappa: .9
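An anchor-word RegEx classifier like the one benchmarked above takes only a few lines with Python's re module. The anchor lists here are hypothetical stand-ins; the real per-topic lists came from the curated codebook.

```python
import re

# Hypothetical anchor words per topic (illustrative only)
anchors = {
    "health": ["health", "healthy"],
    "career": ["job", "career", "work"],
}

def build_patterns(anchors):
    """Compile one whole-word, case-insensitive pattern per topic."""
    return {
        topic: re.compile(
            r"\b(" + "|".join(map(re.escape, words)) + r")\b",
            re.IGNORECASE,
        )
        for topic, words in anchors.items()
    }

def classify(doc, patterns):
    """Return the set of topics whose anchor pattern matches the doc."""
    return {topic for topic, pat in patterns.items() if pat.search(doc)}

patterns = build_patterns(anchors)
print(classify("My job and my health matter most", patterns))
```

The \b word boundaries keep "health" from matching inside "healthcare"; in practice you would also want to apply the same stemming or lemmatization used when preprocessing the corpus.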
  • 145. Key findings 1. Unsupervised topic models are riddled with problems Conceptually spurious words, polysemous words, and over/undercooked topics are common and can result in frequent false positives/negatives
  • 146. Key findings 2. Topics often cannot be interpreted reliably Topics that looked coherent and meaningful actually weren’t. We abandoned 61 of 92 topics because they were too vague or difficult to interpret. This only became clear after extensive exploratory coding.
  • 147. Key findings 3. Your interpretations may not be as clear to others as they are to you Even after filtering down to 31 topics that seemed coherent and interpretable - we still had to abandon 7 of them. Our definitions (which seemed reasonable on paper) were still too poorly specified. This only became clear after extensive manual validation and inter-rater reliability testing.
  • 148. Key findings 4. Manual validation is the only way to ensure that topics are measuring what you claim they are We still had to throw out 6 of our 24 final topics because the topic models’ output didn’t align with our own interpretation of the topics.
  • 149. Key findings 5. Topic models performed no better (and sometimes worse) than less sophisticated approaches Manually-curated lists of words allowed us to salvage 5 of the 6 bad topics, because word searches outperformed the machine learning approaches. Even when both the RegExes and machine learning methods worked, the automated methods provided no discernible benefit.
  • 150. Thank you! Questions and comments encouraged! pvankessel@pewresearch.org ahughes@pewresearch.org
  • 152. Exploratory coding ● Bad agreement often resulted from polysemous (multi-context) words ● Example: “opinions on politics, society, and the state of the world” ● The problem word: “world” ● Including it alongside anchors like “politics” and “government” produced too many false positives (“traveling the world”; “making the world a better place”) ● Excluding it produced too many false negatives (“I feel like the world is going crazy right now”)
  • 153. Keyword oversampling ● For the 7 with good Kappas but low confidence, we drew new samples ● Oversampled using the anchor words: 50% with, 50% without ● Sample sizes were determined based on the expected Kappa (from the random samples) ● Computed Kappas (using sampling weights) and salvaged all 7
  • 157. Open-source tools
    Python: NLTK, scikit-learn, pandas, numpy, scipy, gensim, spaCy, etc.
    R: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
    Java: Stanford CoreNLP + many other useful libraries (https://nlp.stanford.edu/software/)
  • 158. Commercial tools (cloud-based NLP services)
    [Feature comparison shown for: keyphrase/snippet, keywords, entities, entity sentiment, document sentiment, sentence sentiment, emotions, categories, topic models, concepts, language detection, syntax/semantic roles]
    Amazon Comprehend ● Provides topic modeling capabilities
    Google Cloud Natural Language ● Leverages Google’s deep learning models ● Very extensive semantic role and document parsing capabilities
    IBM Watson NLU ● Ability to create domain-specific models through Watson Knowledge Studio
  • 159. Commercial tools SPSS Text Modeler ● Advanced control over entity extraction engine ● Rule based categorization ● Co-occurrence visualizations ● Textlink analysis visualizations ● Ability to export model to be used in SPSS stream
  • 161. Commercial tools Provalis ● Arguably the most comprehensive mixed-methods text analytics software
  • 163. Commercial tools Quid ● Allows interactive adjustment of the underlying topic/clustering model through its graph-based UI

Editor's Notes

  1. And it looks like it’s working pretty well! The first time you run a topic model, it’s exciting. You threw some text at an algorithm, and it comes back with something meaningful! It feels… like MAGIC
  2. So, we’re good! Now it’s time for analysis, right? [next slide]
  3. Are topic models really magic? Or is this all just a big illusion?
  4. If we want to measure people talking about their grandchildren with this topic - do we really want to include the word Florida? Sure, we can imagine several reasons why it might find its way into this topic - but it doesn’t really belong, right?
  5. We really don’t want a topic that conflates the WHO and the WHAT
  6. We see a modest reduction in prevalence, but is this really a problem?
  7. We wanted to get a more comprehensive sense of how much our corpus estimates might be sensitive to out-of-place words
  8. Sometimes, when you have more than one conceptually spurious word, it’s actually hard to tell what the base concept of a topic is. We’ve come to call these topics “undercooked” - it’s not that there’s an out-of-place word or two, it’s that there’s two or more distinct concepts being measured in the same topic.
  9. But that’s going to be a problem, because in the very same model, we have topics like...
  10. And as we saw, “wife” is actually a concept that’s elsewhere in the model [next slide]
  11. Maybe instead, we can reduce the number of topics and hope they merge, and maybe “grandchildren” will get crowded out
  12. Now, obviously this is less of a problem when you train models just on unigrams - but even when we do that, those words are still missing
  13. Fortunately, CorEx is a new type of semi-supervised algorithm, that allows you to do just that!
  14. Between the two models, we wound up with 92 topics that looked both interesting and potentially interpretable. So, we came up with some interpretations and developed a codebook for each one.
  15. So, what did we learn here? [next slide]