World Bank 2019 - Text Analytics Workshop

An Intro to NLP and Text Analysis
Patrick van Kessel, Senior Data Scientist
Pew Research Center
The role of text in social research
Why text?
● Free of assumptions
● Potential for richer insights relative to closed-format responses
● If organic, then data collection costs are often negligible
The role of text in social research
Where do I find it?
● Open-ended surveys / focus groups / transcripts / interviews
● Social media data (tweets, FB posts, etc.)
● Long-form content (articles, notes, logs, etc.)
The role of text in social research
What makes it challenging?
● Messy
○ “Data spaghetti” with little or no structure
● Sparse
○ Low information-to-data ratio (lots of hay, few needles)
● Often organic (rather than designed)
○ Can be naturally generated by people and processes
○ Often without a research use in mind
Data selection and preparation
Data selection and preparation
● Know your objective and subject matter (if needed, find a subject matter expert)
● Get familiar with the data
● Don’t make assumptions - know your data, quirks and all
Data selection and preparation
Text Acquisition and Preparation
Select relevant data (text corpus)
● Content
● Metadata
Prepare the input file
● Determine unit of analysis
● Process text to get one document per unit of analysis
(Pre-)Processing
Turning text into data
Turning text into data
● How do we sift through text and produce insight?
● Might first try searching for keywords
● How many times is “analysis” mentioned?
Original
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
● Searching for the keyword “analysis” only finds document 1
● It misses “analyzing” in document 2
● And “analytics” in document 3
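The keyword search described above can be sketched in a few lines of Python. This is only an illustration on the three example documents; the variable names are assumptions and not part of the original slides.

```python
# Naive keyword search over the three example documents.
docs = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
]

keyword = "analysis"

# Case-insensitive substring match: which documents mention the keyword?
hits = [number for number, doc in enumerate(docs, start=1) if keyword in doc.lower()]
print(hits)
```

Only the first document is found, even though all three are about the same thing.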
Turning text into data
● Variations of words can have the same meaning but look completely different to a computer
Original
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Regular Expressions
● A more sophisticated solution: regular expressions
● Syntax for defining string (text) patterns
Original
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Regular Expressions
● Can be used to search text or extract specific chunks
● Example use cases:
○ Extracting dates
○ Finding URLs
○ Identifying names/entities
● https://regex101.com/
● http://www.regexlib.com/
Turning text into data
Regular Expressions
\banaly[a-z]+\b
Original
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
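A pattern like this can be tried out with Python's built-in re module. The sketch below assumes the pattern is meant to match any word that starts with "analy" (a word boundary, "analy", one or more lowercase letters, another word boundary).

```python
import re

docs = [
    "Text analysis is fun",
    "I enjoy analyzing text data",
    "Data science often involves text analytics",
]

# \b marks a word boundary; [a-z]+ matches the rest of the word.
pattern = re.compile(r"\banaly[a-z]+\b", flags=re.IGNORECASE)

# findall returns every matching word in each document.
matches = [pattern.findall(doc) for doc in docs]
print(matches)
```

Unlike the plain keyword search, this finds one match in each of the three documents.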
Turning text into data
Regular Expressions
Regular expressions can be extremely powerful…
...and terrifyingly complex:
URLS: ((https?://(www\.)?)?[-a-zA-Z0-9@:%._+~#=]{2,4096}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_+.~#?&//=]*))
DOMAINS: (?:http[s]?://)?(?:www(?:s?)\.)?([\w.-]+)(?:[/](?:.+))?
MONEY: \$([0-9]{1,3}(?:(?:,[0-9]{3})+)?(?:\.[0-9]{1,2})?)\s
Turning text into data
Pre-processing
● Great, but we can’t write patterns for everything
● Words are messy and have a lot of variation
● We need to collapse words that share a meaning into a single form
● We need to clean / pre-process
Original
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Pre-processing
● Common first steps:
○ Spell check / correct
○ Remove punctuation / expand contractions
Original Processed
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
can’t -> cannot
they’re -> they_are
doesn’t -> does_not
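These first cleaning steps might look something like the sketch below. The contraction table only covers the three examples on the slide, and the lowercasing step and the function name are assumptions added for illustration.

```python
import re

# Small illustrative lookup table; a real project would use a fuller list.
CONTRACTIONS = {
    "can't": "cannot",
    "they're": "they_are",
    "doesn't": "does_not",
}

def preprocess(text):
    # Lowercase and normalize curly apostrophes so one lookup form is enough.
    text = text.lower().replace("\u2019", "'")
    # Expand contractions before stripping punctuation, otherwise the
    # apostrophes needed to recognize them would already be gone.
    for short, expanded in CONTRACTIONS.items():
        text = text.replace(short, expanded)
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the extra whitespace left behind.
    return " ".join(text.split())

cleaned = preprocess("They\u2019re sure it doesn\u2019t work; we can't tell!")
print(cleaned)
```

The order matters: punctuation removal has to come after contraction expansion.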
Turning text into data
Pre-processing
● Now to collapse words with the same meaning
● We do this with stemming or lemmatization
● Break words down to their roots
Original Processed
1 Text analysis is fun
2 I enjoy analyzing text data
3 Data science often involves text analytics
Turning text into data
Pre-processing
● Stemming is more conservative
● There are many different stemmers
● Here’s the Porter stemmer (1979)
Original -> Processed
1 Text analysis is fun -> Text analysi is fun
2 I enjoy analyzing text data -> I enjoy analyz text data
3 Data science often involves text analytics -> Data scienc often involv text analyt
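To make the idea of stemming concrete, here is a deliberately simplified suffix-stripping sketch. It is not the Porter algorithm, which applies several ordered phases of rules; the suffix list and the minimum stem length are assumptions chosen only for this illustration.

```python
# Toy suffix stripper: NOT the Porter stemmer, just the general idea.
SUFFIXES = ("ing", "ics", "es", "is", "s")  # checked in this order (assumed list)
MIN_STEM_LENGTH = 3  # avoid reducing short words like "is" to nothing

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        # Strip the first matching suffix, as long as enough of the word remains.
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word

stems = [toy_stem(w) for w in ("analysis", "analyzing", "analytics", "is")]
print(stems)
```

Even in this toy version the three variants end up with different stems, which is why more aggressive stemmers exist.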
Turning text into data
Pre-processing
● The Lancaster stemmer (1990) is newer and more aggressive
● Truncates words a LOT
Original -> Processed
1 Text analysis is fun -> text analys is fun
2 I enjoy analyzing text data -> I enjoy analys text dat
3 Data science often involves text analytics -> dat sci oft involv text analys
Turning text into data
Pre-processing
● Lemmatization uses linguistic relationships and parts of speech to collapse words down to their root form - so you get actual words (“lemmas”), not stems
● WordNet Lemmatizer
Original -> Processed
1 Text analysis is fun -> text analysis is fun
2 I enjoy analyzing text data -> I enjoy analyze text data
3 Data science often involves text analytics -> data science often involve text analytics
Turning text into data
Pre-processing
● Picking the right method depends on how much you want to preserve nuance or collapse meaning
● We’ll stick with Lancaster
Original -> Processed
1 Text analysis is fun -> text analys is fun
2 I enjoy analyzing text data -> I enjoy analys text dat
3 Data science often involves text analytics -> dat sci oft involv text analys
Turning text into data
Pre-processing
● Finally, we need to remove words that don’t hold meaning themselves
● These are called “stopwords”
● Can expand standard stopword lists with custom words
Original -> Processed
1 Text analysis is fun -> text analys fun
2 I enjoy analyzing text data -> enjoy analys text dat
3 Data science often involves text analytics -> dat sci oft involv text analys
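Stopword removal is a simple filter. The sketch below uses a tiny made-up stopword set (real lists are much longer) and a hypothetical `extra` argument to show how a standard list can be expanded with custom words.

```python
# Tiny illustrative stopword list; standard lists contain far more words.
STOPWORDS = {"i", "is", "a", "an", "the", "of", "and", "to"}

def remove_stopwords(text, extra=()):
    # Combine the standard list with any custom, project-specific words.
    stop = STOPWORDS | set(extra)
    return " ".join(tok for tok in text.split() if tok.lower() not in stop)

stemmed_docs = [
    "text analys is fun",
    "I enjoy analys text dat",
    "dat sci oft involv text analys",
]

filtered = [remove_stopwords(doc) for doc in stemmed_docs]
print(filtered)
```

Passing something like `extra={"fun"}` would drop that word as well.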
Turning text into data
Pre-processing
● A word of caution: there aren’t any universal rules for making pre-processing decisions
● Do what makes sense for your data - but be cautious of the researcher degrees of freedom involved
● See:
○ Denny and Spirling, 2016. “Assessing the Consequences of Text Pre-processing Decisions”
○ Denny and Spirling, 2018. “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do About It”
Turning text into data
Tokenization
● Now we need to tokenize
● Break text apart into pieces according to certain rules
● Usually breaks on whitespace and punctuation
● What’s left are called “tokens”
● Single tokens, or sequences of two or more adjacent tokens, are called “ngrams”
Turning text into data
Tokenization
● Unigrams
        text  analys  fun  enjoy  dat  sci  oft  involv
Doc 1   1     1       1    0      0    0    0    0
Doc 2   1     1       0    1      1    0    0    0
Doc 3   1     1       0    0      1    1    1    1
Turning text into data
Tokenization
● Bigrams
Text analys 2
Analys fun 1
Enjoy analys 1
Analys text 1
Text dat 1
Dat sci 1
Sci oft 1
Oft involv 1
Involv text 1
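The unigram and bigram counts can be reproduced with a short sketch. It assumes the text has already been cleaned, so splitting on whitespace is enough; the `ngrams` helper is a name introduced here for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

processed_docs = [
    "text analys fun",
    "enjoy analys text dat",
    "dat sci oft involv text analys",
]

# Tokenize on whitespace (punctuation was removed during pre-processing).
tokenized = [doc.split() for doc in processed_docs]

unigram_counts = Counter(tok for toks in tokenized for tok in toks)
bigram_counts = Counter(bg for toks in tokenized for bg in ngrams(toks, 2))

print(unigram_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Bigrams never cross document boundaries here, because each document is windowed separately.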
Turning text into data
Tokenization
● If we want to characterize the whole corpus, we can just look at the most frequent words
        text  analys  fun  enjoy  dat  sci  oft  involv
Doc 1   1     1       1    0      0    0    0    0
Doc 2   1     1       0    1      1    0    0    0
Doc 3   1     1       0    0      1    1    1    1
Total   3     3       1    1      2    1    1    1
Turning text into data
The Term Frequency Matrix
● But what if we want to distinguish documents from each other?
● We know these documents are about text analysis
● What makes them unique?
Turning text into data
The Term Frequency Matrix
● Divide each word’s frequency by the number of documents it appears in (the basic idea behind TF-IDF: term frequency-inverse document frequency)
● Downweight words that are common
● Now we’re emphasizing what makes each document unique
        text  analys  fun  enjoy  dat  sci  oft  involv
Doc 1   .33   .33     1    0      0    0    0    0
Doc 2   .33   .33     0    1      .5   0    0    0
Doc 3   .33   .33     0    0      .5   1    1    1
Total   1     1       1    1      1    1    1    1
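The weighting on this slide can be sketched directly: each count is divided by the number of documents the word appears in. Note that this is the simplified version used in the example; common TF-IDF implementations use a log-scaled inverse document frequency instead, so their numbers will differ.

```python
from collections import Counter

processed_docs = [
    "text analys fun",
    "enjoy analys text dat",
    "dat sci oft involv text analys",
]
tokenized = [doc.split() for doc in processed_docs]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for toks in tokenized for word in set(toks))

# Simplified weighting: term count divided by document frequency.
weighted = [
    {word: count / doc_freq[word] for word, count in Counter(toks).items()}
    for toks in tokenized
]

for row in weighted:
    print({word: round(weight, 2) for word, weight in row.items()})
```

Words that appear everywhere ("text", "analys") are pushed down, while words unique to one document keep a weight of 1.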
Turning words into numbers
Part-of-Speech Tagging
● TF-IDF is often all you need
● But sometimes you care about how a word is used
● Can use pre-trained part-of-speech (POS) taggers
● Can also help with things like negation
○ “Happy” vs. “NOT happy”
Turning words into numbers
Named Entity Extraction
● Might also be interested in people, places, organizations, etc.
● Like POS taggers, named entity extractors use trained models
Turning words into numbers
Word2Vec
● Other methods can quantify words not by frequency, but by their relationships to other words
● Word2vec uses a sliding window to read words and learn their relationships; each word gets a vector in N-dimensional space
● https://code.google.com/archive/p/word2vec/
Analysis
Finding patterns in text data
Finding patterns in text data
Two types of approaches:
● Unsupervised NLP: automated, extracts structure from the data
○ Clustering
○ Topic modeling
○ Mutual information
● Supervised NLP: requires training data, learns to predict labels and classes
○ Classification
○ Regression
Finding patterns in text data
Unsupervised methods
● First up, we might look at words that commonly co-occur
● Called “collocations” or “co-occurrences”
● Great way to get a quick look at common themes
Finding patterns in text data
Unsupervised methods
● Might want to compare documents (or words) to one another
● Possible applications
○ Spelling correction
○ Document deduplication
○ Measure similarity of language
■ Politicians’ speeches
■ Movie reviews
■ Product descriptions
Finding patterns in text data
Unsupervised methods
Levenshtein distance
● Compute the number of single-character edits (insertions, deletions, substitutions) needed to turn one word/document into another
● Can express as a ratio (percent of word/document that needs to change) to measure similarity
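A minimal dynamic-programming sketch of Levenshtein distance follows. The `similarity_ratio` helper normalizes by the length of the longer string, which is one common convention and an assumption here; other definitions of the ratio exist.

```python
def levenshtein(a, b):
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]

def similarity_ratio(a, b):
    # 1.0 means identical, 0.0 means everything has to change.
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest

print(levenshtein("analysis", "analytics"))
print(round(similarity_ratio("analysis", "analytics"), 2))
```

The same function works on lists of tokens, which is one way to compare whole documents rather than single words.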
Finding patterns in text data
Unsupervised methods
Cosine similarity
● Compute the “angle” between two word vectors
● TF-IDF: axes are the weighted frequencies for each word
● Word2Vec: axes are the learned dimensions from the model
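Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies it to raw count vectors for the three example documents (using counts rather than TF-IDF weights is an assumption made to keep the example short).

```python
import math

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)

# Columns: text, analys, fun, enjoy, dat, sci, oft, involv
doc1 = [1, 1, 1, 0, 0, 0, 0, 0]
doc2 = [1, 1, 0, 1, 1, 0, 0, 0]
doc3 = [1, 1, 0, 0, 1, 1, 1, 1]

sim_12 = cosine_similarity(doc1, doc2)
sim_23 = cosine_similarity(doc2, doc3)
print(round(sim_12, 3), round(sim_23, 3))
```

A value near 1 means the vectors point in almost the same direction; 0 means they share no words at all.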
Finding patterns in text data
Unsupervised methods
Clustering
● Algorithms that use word vectors (TF-IDF, Word2Vec, etc.) to identify structural groupings between observations (words, documents)
● K-Means is a very commonly used one
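The core loop of K-Means is short enough to sketch in plain Python. The points below are made-up 2D vectors standing in for TF-IDF or Word2Vec vectors, and the starting centroids are fixed by hand so the result is deterministic; library implementations normally choose starting points for you.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and moving centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2D vectors: two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

You have to pick the number of clusters (K) up front, which is one of the main judgment calls with this method.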
Finding patterns in text data
Unsupervised methods
Hierarchical/agglomerative clustering
● Start with every observation in its own group, use a rule to merge the closest pair, and repeat until there’s only one group left
Finding patterns in text data
Unsupervised methods
Network analysis
● Can also get creative
● After all, we’re just working with columns and numbers
● Example: link words together by their strongest correlations
Finding patterns in text data
Unsupervised methods
Pointwise mutual information
● Based on information theory
● Compares conditional and joint probabilities to measure the likelihood of a word occurring with a category/outcome, beyond random chance
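One standard way to write pointwise mutual information is PMI(word, category) = log2( p(word, category) / (p(word) * p(category)) ). The sketch below estimates those probabilities from document counts; the four labeled documents are toy data invented for the example.

```python
import math

def pmi(docs, word, label):
    n = len(docs)
    with_word = sum(1 for tokens, _ in docs if word in tokens)
    with_label = sum(1 for _, doc_label in docs if doc_label == label)
    with_both = sum(
        1 for tokens, doc_label in docs if word in tokens and doc_label == label
    )
    if with_both == 0:
        return float("-inf")  # the word never occurs with this label
    # Joint probability divided by what independence would predict.
    return math.log2((with_both / n) / ((with_word / n) * (with_label / n)))

# Toy labeled documents (hypothetical): a set of tokens plus a category.
docs = [
    ({"great", "movie"}, "pos"),
    ({"great", "acting"}, "pos"),
    ({"boring", "movie"}, "neg"),
    ({"boring", "plot"}, "neg"),
]

pmi_great_pos = pmi(docs, "great", "pos")
pmi_movie_pos = pmi(docs, "movie", "pos")
print(pmi_great_pos, pmi_movie_pos)
```

A positive score means the word shows up with the category more often than chance would predict; a score of zero means no association.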
Finding patterns in text data
Unsupervised methods
Topic modeling
● Algorithms that characterize documents in terms of topics (groups of words)
● Find topics that best fit the data
Finding patterns in text data
Supervised methods
● Often we want to categorize documents
● Unsupervised methods can help
● But often we need to read and label them ourselves
● Classification models can take labeled data and learn to make predictions
Finding patterns in text data
Supervised methods
Steps:
● Label a sample of documents
● Break your sample into two sets: a training sample and a test sample
● Train a model on the training sample
● Evaluate it on the test sample
● Apply it to the full set of documents to make predictions
Finding patterns in text data
Supervised methods
● First you need to develop a codebook
● Codebook: set of rules for labeling and categorizing documents
● The best codebooks have clear rules for hard cases, and lots of examples
● Categories should be MECE: mutually exclusive and collectively exhaustive
Finding patterns in text data
Supervised methods
● Need to validate the codebook by measuring interrater reliability
● Makes sure your measures are consistent, objective, and reproducible
● Multiple people code the same document
Finding patterns in text data
Supervised methods
● Various metrics to test whether their agreement is high enough
○ Krippendorff’s alpha
○ Cohen’s kappa
● Can also compare coders against a gold standard, if available
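Cohen's kappa compares the agreement two coders actually reached with the agreement expected by chance: kappa = (observed - expected) / (1 - expected). The sketch below computes it for two coders; the label lists are made up for the example.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    n = len(coder_a)
    # Observed agreement: share of documents where both coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement: chance of agreeing if each coder labeled at
    # random according to their own label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codes from two people for the same ten documents.
coder_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
coder_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]

kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

Here the coders agree on 8 of 10 documents, but kappa is noticeably lower than 0.8 because some of that agreement would happen by chance.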
Finding patterns in text data
Supervised methods
● Mechanical Turk can be a great way to code a lot of documents
● Have 5+ Turkers code a large sample of documents
● Collapse them together with a rule
● Code a subset in-house, and compute reliability
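One possible collapsing rule is a simple majority vote. That choice is an assumption for this sketch (the slides do not say which rule to use); the function returns None on a tie so those documents can be reviewed by hand.

```python
from collections import Counter

def majority_label(labels):
    # Count the votes and look at the two most common labels.
    ranked = Counter(labels).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # tie: no clear majority
    return top_label

# Hypothetical codes from five Turkers for two documents.
doc_a_votes = [1, 1, 0, 1, 0]
doc_b_votes = [0, 1, 1, 0]

print(majority_label(doc_a_votes), majority_label(doc_b_votes))
```

Stricter rules, such as requiring four out of five coders to agree, follow the same pattern.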
Finding patterns in text data
Supervised methods
● Sometimes you’re trying to measure something that’s really rare
● Consider oversampling
● Example: keyword oversampling
● Disproportionately select documents that are more likely to contain your outcomes
Finding patterns in text data
Supervised methods
● After coding, split your sample into two sets (~80/20)
○ One for training, one for testing
● We do this to check for (and avoid) overfitting
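An 80/20 split can be sketched with the standard library alone. The function name, the fixed seed, and the placeholder documents are assumptions for illustration; most machine learning libraries ship their own helper for this.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=42):
    # Shuffle a copy so the original order is left untouched.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test items become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled sample: (document text, label) pairs.
labeled = [(f"doc {i}", i % 2) for i in range(10)]

train, test = train_test_split(labeled)
print(len(train), len(test))
```

The test set has to stay untouched during training; otherwise it can't reveal overfitting.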
Finding patterns in text data
Supervised methods
● Next step is called feature extraction or feature selection
● Need to extract “features” from the text
○ TF-IDF
○ Word2Vec vectors
● Can also utilize metadata, if potentially useful
Finding patterns in text data
Supervised methods
● Select a classification algorithm
● A common choice for text data is the support vector machine (SVM)
● Similar to regression, SVMs find the line that best separates two or more groups
● Can also use non-linear “kernels” for better fits (radial basis function, etc.)
● XGBoost is a newer and very promising algorithm
Finding patterns in text data
Supervised methods
● Word of caution: if you oversampled, your weights are probability (survey) weights
● Most machine learning implementations assume your weights are frequency weights
● Not an issue for predictions - but can produce biased variance estimates if you analyze your training data itself
Finding patterns in text data
Supervised methods
● Time to evaluate performance
● Lots of different metrics, depending on what you care about
● Often we care about precision/recall
○ Precision: did you pick out mostly needles or mostly hay?
○ Recall: how many needles did you miss?
● Other metrics:
○ Matthews correlation coefficient
○ Brier score
○ Overall accuracy
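Precision and recall both come from counting true positives, false positives, and false negatives. The sketch below uses made-up labels where 1 is a "needle" and 0 is "hay".

```python
def precision_recall(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of everything flagged, how much was really a needle?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of all the real needles, how many were found?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical test-set labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(round(precision, 2), round(recall, 2))
```

Overall accuracy for these predictions is 6 out of 8, which looks better than either number; with rare outcomes, accuracy alone can be misleading.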
Finding patterns in text data
Supervised methods
● Doing just one split leaves a lot up to chance
● To get a more stable estimate of the model’s performance, it’s best to use K-fold cross-validation
● Splits your data into train/test sets multiple times and averages the performance metrics
● Ensures that you didn’t just get lucky (or unlucky)
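The splitting logic behind K-fold cross-validation can be sketched as follows. The `evaluate` argument is a placeholder for "train a model on these rows and score it on those rows"; the stand-in used below just returns the test share so the example runs without a model. In practice the data should be shuffled first.

```python
def k_fold_indices(n_items, k=5):
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(n_items, evaluate, k=5):
    # Score every fold, then average the results.
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_items, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, k=5))

# Placeholder scorer (hypothetical): a real one would fit and score a model.
mean_score = cross_validate(10, lambda train, test: len(test) / 10, k=5)
print(len(folds), mean_score)
```

Every document ends up in the test set exactly once, so no single lucky split drives the result.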
Finding patterns in text data
Supervised methods
● Model not working well?
● You probably need to tune your parameters
● You can use a grid search to test out different combinations of model parameters and feature extraction methods
● Many software packages can automatically help you pick the best combination to maximize your model’s performance
Finding patterns in text data
Supervised methods
● Suggested design:
○ Large training sample, coded by Turkers
○ Small evaluation sample, coded by Turkers and in-house experts
○ Compute IRR between Turk and experts
○ Train model on training sample, use 5-fold cross-validation
○ Apply model to evaluation sample, compare results against in-house coders and Turkers
Finding patterns in text data
Supervised methods
● Some (but not all) models produce probabilities along with their classifications
● Ideally you fit the model using your preferred scoring metric/function
● But you can also use post-hoc probability thresholds to adjust your model’s predictions
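Applying a post-hoc threshold is a one-liner once you have probabilities. The probabilities and cutoffs below are invented for illustration.

```python
def apply_threshold(probabilities, threshold=0.5):
    # Label a document positive only if its probability clears the cutoff.
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities for five documents.
probs = [0.92, 0.61, 0.48, 0.30, 0.05]

default_preds = apply_threshold(probs)        # standard 0.5 cutoff
strict_preds = apply_threshold(probs, 0.8)    # fewer positives, higher precision
lenient_preds = apply_threshold(probs, 0.25)  # more positives, higher recall

print(default_preds, strict_preds, lenient_preds)
```

Raising the threshold usually trades recall for precision, and lowering it does the opposite.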
Tools and Resources
Open-source tools
● Python
○ NLTK, scikit-learn, pandas, numpy, scipy, gensim, spaCy, etc.
● R
○ https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
● Java
○ Stanford CoreNLP + many other useful libraries: https://nlp.stanford.edu/software/
Commercial tools
● Cloud-based NLP
○ Amazon Comprehend
○ Google Cloud Natural Language
○ IBM Watson NLU
● Software
○ SPSS Text Modeler
○ Provalis WordStat
Time for a demo!
Demo notebook:
https://colab.research.google.com/github/patrickvankessel/WorldBank-Text-Analysis-2019/blob/master/Tutorial.ipynb
Feel free to reach out:
pvankessel@pewresearch.org
Special thanks to Michael Jugovich, who helped me put together a lot of this material for a previous workshop we did at NYAAPOR!
World Bank 2019 - Text Analytics Workshop

  • 1. An Intro to NLP and Text Analysis Patrick van Kessel, Senior Data Scientist Pew Research Center
  • 2. The role of text in social research
  • 3. ● Free of assumptions ● Potential for richer insights relative to closed-format responses ● If organic, then data collection costs are often negligible The role of text in social research Why text?
  • 4. The role of text in social research Where do I find it? ● Open-ended surveys / focus groups / transcripts / interviews ● Social media data (tweets, FB posts, etc.) ● Long-form content (articles, notes, logs, etc.)
  • 5. The role of text in social research What makes it challenging? ● Messy ○ “Data spaghetti” with little or no structure ● Sparse ○ Low information-to-data ratio (lots of hay, few needles) ● Often organic (rather than designed) ○ Can be naturally generated by people and processes ○ Often without a research use in mind
  • 6. Data selection and preparation
  • 7. Data selection and preparation ● Know your objective and subject matter (if needed find subject matter expert) ● Get familiar with the data ● Don’t make assumptions - know your data, quirks and all
  • 8. Data selection and preparation Text Acquisition and Preprepation Select relevant data (text corpus) ● Content ● Metadata Prepare the input file ● Determine unit of analysis ● Process text to get one document per unit of analysis
  • 11. Turning text into data ● How do we sift through text and produce insight? ● Might first try searching for keywords ● How many times is “analysis” mentioned? Original 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
  • 12. Turning text into data ● How do we sift through text and produce insight? ● Might first try searching for keywords ● How many times is “analysis” mentioned? Original 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics We missed this one And this one too
  • 13. Turning text into data ● Variations of words can have the same meaning but look completely different to a computer Original 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
  • 14. Turning text into data Regular Expressions ● A more sophisticated solution: regular expressions ● Syntax for defining string (text) patterns Original 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
  • 15. Turning text into data Regular Expressions ● Can use to search text or extract specific chunks ● Example use cases: ○ Extracting dates ○ Finding URLs ○ Identifying names/entities ● https://regex101.com/ ● http://www.regexlib.com/
  • 16. Turning text into data Regular Expressions banaly[a-z]+b Original 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
  • 17. Turning text into data Regular Expressions Regular expressions can be extremely powerful… ...and terrifyingly complex: URLS: ((https?://(www.)?)?[-a-zA-Z0-9@:%._+~#=]{2,4096}.[a-z]{2,6}b([-a-zA-Z0-9@:%_+.~#?&//=]*)) DOMAINS: (?:http[s]?://)?(?:www(?:s?).)?([w.-]+)(?:[/](?:.+))? MONEY: $([0-9]{1,3}(?:(?:,[0-9]{3})+)?(?:.[0-9]{1,2})?)s
  • 18. Turning text into data Pre-processing ● Great, but we can’t write patterns for everything ● Words are messy and have a lot of variation ● We need to collapse semantically ● We need to clean / pre-process Original 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
  • 19. Turning text into data Pre-processing ● Common first steps: ○ Spell check / correct ○ Remove punctuation / expand contractions Original Processed 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics can’t -> cannot they’re -> they_are doesn’t -> does_not
  • 20. Turning text into data Pre-processing ● Now to collapse words with the same meaning ● We do this with stemming or lemmatization ● Break words down to their roots Original Processed 1 Text analysis is fun 2 I enjoy analyzing text data 3 Data science often involves text analytics
  • 21. Turning text into data Pre-processing ● Stemming is more conservative ● There are many different stemmers ● Here’s the Porter stemmer (1979)
      #  Original                                    Processed
      1  Text analysis is fun                        Text analysi is fun
      2  I enjoy analyzing text data                 I enjoy analyz text data
      3  Data science often involves text analytics  Data scienc often involv text analyt
  • 23. Turning text into data Pre-processing ● The Lancaster stemmer (1990) is newer and more aggressive ● Truncates words a LOT
      #  Original                                    Processed
      1  Text analysis is fun                        text analys is fun
      2  I enjoy analyzing text data                 I enjoy analys text dat
      3  Data science often involves text analytics  dat sci oft involv text analys
  • 24. Turning text into data Pre-processing ● Lemmatization uses linguistic relationships and parts of speech to collapse words down to their root form, so you get actual words (“lemmas”), not stems ● WordNet Lemmatizer
      #  Original                                    Processed
      1  Text analysis is fun                        text analysis is fun
      2  I enjoy analyzing text data                 I enjoy analyze text data
      3  Data science often involves text analytics  data science often involve text analytics
  • 25. Turning text into data Pre-processing ● Picking the right method depends on how much you want to preserve nuance or collapse meaning ● We’ll stick with Lancaster
      #  Original                                    Processed
      1  Text analysis is fun                        text analys is fun
      2  I enjoy analyzing text data                 I enjoy analys text dat
      3  Data science often involves text analytics  dat sci oft involv text analys
  • 26. Turning text into data Pre-processing ● Finally, we need to remove words that don’t hold meaning themselves ● These are called “stopwords” ● Can expand standard stopword lists with custom words
      #  Original                                    Processed
      1  Text analysis is fun                        text analys fun
      2  I enjoy analyzing text data                 enjoy analys text dat
      3  Data science often involves text analytics  dat sci oft involv text analys
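Stopword removal can be sketched in a few lines. The stopword list here is a hypothetical, deliberately tiny one chosen for illustration; standard lists shipped with NLP libraries are much longer. The input is the Lancaster-stemmed text from the earlier slides.

```python
# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"is", "i", "the", "a", "an", "of", "and"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords from whitespace-separated text (case-insensitive)."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)

# Stemmed documents as shown on the Lancaster slide.
processed = [
    "text analys is fun",
    "I enjoy analys text dat",
    "dat sci oft involv text analys",
]
cleaned = [remove_stopwords(doc) for doc in processed]
print(cleaned)
```

Custom words can be added by passing a larger set as the `stopwords` argument.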
  • 27. Turning text into data Pre-processing ● A word of caution: there aren’t any universal rules for making pre-processing decisions ● Do what makes sense for your data - but be cautious of the researcher degrees of freedom involved ● See: ○ Denny and Spirling, 2016. “Assessing the Consequences of Text Pre-processing Decisions” ○ Denny and Spirling, 2018. “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do About It”
  • 28. Turning text into data Tokenization ● Now we need to tokenize ● Break text apart into words according to certain rules ● Usually breaks on whitespace and punctuation ● What’s left are called “tokens” ● Single tokens, or sequences of two or more consecutive tokens, are called “n-grams”
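A rough sketch of tokenization and n-gram extraction using only the standard library. The tokenizing rule (keep runs of letters, digits, and apostrophes) is one simple assumption among many possible rules; the function names are illustrative.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Text analysis is fun!")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```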
  • 29. Turning text into data Tokenization ● Unigrams
      Doc  text  analys  fun  enjoy  dat  sci  oft  involv
      1    1     1       1    0      0    0    0    0
      2    1     1       0    1      1    0    0    0
      3    1     1       0    0      1    1    1    1
  • 30. Turning text into data Tokenization ● Bigrams: text analys (2), analys fun (1), enjoy analys (1), analys text (1), text dat (1), dat sci (1), sci oft (1), oft involv (1), involv text (1)
  • 31. Turning text into data Tokenization ● If we want to characterize the whole corpus, we can just look at the most frequent words
      Doc    text  analys  fun  enjoy  dat  sci  oft  involv
      1      1     1       1    0      0    0    0    0
      2      1     1       0    1      1    0    0    0
      3      1     1       0    0      1    1    1    1
      Total  3     3       1    1      2    1    1    1
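The term frequency matrix and its column totals can be built with `collections.Counter`. This is a minimal sketch over the three processed example documents; variable names are illustrative.

```python
from collections import Counter

docs = [
    "text analys fun",
    "enjoy analys text dat",
    "dat sci oft involv text analys",
]

# One Counter of token counts per document.
doc_counts = [Counter(doc.split()) for doc in docs]

# Vocabulary in order of first appearance across the corpus.
vocab = []
for doc in docs:
    for token in doc.split():
        if token not in vocab:
            vocab.append(token)

# Rows are documents, columns are vocabulary terms.
matrix = [[counts[term] for term in vocab] for counts in doc_counts]

# Column sums give the corpus-wide frequency of each term.
totals = {term: sum(row[j] for row in matrix) for j, term in enumerate(vocab)}
print(vocab)
print(matrix)
print(totals)
```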
  • 32. Turning text into data The Term Frequency Matrix ● But what if we want to distinguish documents from each other? ● We know these documents are about text analysis ● What makes them unique?
  • 33. Turning text into data The Term Frequency Matrix ● Divide the word frequencies by the number of documents they appear in ● This downweights words that are common across documents ● Now we’re emphasizing what makes each document unique ● This kind of weighting is known as TF-IDF (term frequency-inverse document frequency)
      Doc  text  analys  fun  enjoy  dat  sci  oft  involv
      1    .33   .33     1    0      0    0    0    0
      2    .33   .33     0    1      .5   0    0    0
      3    .33   .33     0    0      .5   1    1    1
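A sketch of the weighting exactly as the slide describes it: each term count divided by the number of documents containing that term. Note that this is a simplification; common TF-IDF implementations usually apply a log-scaled inverse document frequency, so their numbers will differ from the ones here.

```python
from collections import Counter

docs = [
    "text analys fun",
    "enjoy analys text dat",
    "dat sci oft involv text analys",
]
tokenized = [doc.split() for doc in docs]

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def weight(tokens):
    """Divide each term's count by its document frequency."""
    counts = Counter(tokens)
    return {term: count / doc_freq[term] for term, count in counts.items()}

weighted = [weight(tokens) for tokens in tokenized]
for number, weights in enumerate(weighted, start=1):
    print(number, {term: round(value, 2) for term, value in weights.items()})
```

Terms shared by all three documents ("text", "analys") drop to about .33, while terms unique to one document keep a weight of 1.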
  • 34. Turning words into numbers Part-of-Speech Tagging ● TF-IDF is often all you need ● But sometimes you care about how a word is used ● Can use pre-trained part-of-speech (POS) taggers ● Can also help with things like negation ○ “Happy” vs. “NOT happy”
  • 35. Turning words into numbers Named Entity Extraction ● Might also be interested in people, places, organizations, etc. ● Like POS taggers, named entity extractors use trained models
  • 36. Turning words into numbers Word2Vec ● Other methods can quantify words not by frequency, but by their relation ● Word2vec uses a sliding window to read words and learn their relationships; each word gets a vector in N-dimensional space ● https://code.google.com/archive/p/word2vec/
  • 38. Finding patterns in text data Two types of approaches: ● Unsupervised NLP: automated, extracts structure from the data ○ Clustering ○ Topic modeling ○ Mutual information ● Supervised NLP: requires training data, learns to predict labels and classes ○ Classification ○ Regression
  • 39. Finding patterns in text data Unsupervised methods ● First up, we might look at words that commonly co-occur ● Called “collocations” or “co-occurrences” ● Great way to get a quick look at common themes
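One simple way to count co-occurrences is to tally every pair of distinct words that appear in the same document. This is a sketch under that assumption (document-level co-occurrence); collocation finders in NLP libraries typically look at adjacent or nearby words and apply statistical scoring instead.

```python
from collections import Counter
from itertools import combinations

docs = [
    "text analys fun",
    "enjoy analys text dat",
    "dat sci oft involv text analys",
]

pair_counts = Counter()
for doc in docs:
    # Sort the unique tokens so each pair has a single canonical order.
    unique_tokens = sorted(set(doc.split()))
    pair_counts.update(combinations(unique_tokens, 2))

top_pairs = pair_counts.most_common(3)
print(top_pairs)
```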
  • 40. Finding patterns in text data Unsupervised methods ● Might want to compare documents (or words) to one another ● Possible applications ○ Spelling correction ○ Document deduplication ○ Measure similarity of language ■ Politicians’ speeches ■ Movie reviews ■ Product descriptions
  • 41. Finding patterns in text data Unsupervised methods Levenshtein distance ● Compute number of steps needed to turn a word/document into another ● Can express as a ratio (percent of word/document that needs to change) to measure similarity
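A compact sketch of Levenshtein distance using the standard dynamic-programming approach, plus a similarity ratio. Normalizing by the longer string's length is one common convention, assumed here; other ratio definitions exist.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """1.0 for identical strings, 0.0 when every character has to change."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))
print(similarity("analysis", "analytics"))
```

The same functions work on lists of tokens, which gives a word-level distance between documents.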
  • 42. Finding patterns in text data Unsupervised methods Cosine similarity ● Compute the “angle” between two word vectors ● TF-IDF: axes are the weighted frequencies for each word ● Word2Vec: axes are the learned dimensions from the model
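Cosine similarity can be sketched over sparse vectors stored as dictionaries. Raw term counts are used here for simplicity; the same function would accept TF-IDF weights or any other term-to-number mapping.

```python
import math
from collections import Counter

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * vec_b.get(term, 0) for term, value in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction to compare
    return dot / (norm_a * norm_b)

doc1 = Counter("text analys fun".split())
doc2 = Counter("enjoy analys text dat".split())
doc3 = Counter("dat sci oft involv text analys".split())

sim_12 = cosine_similarity(doc1, doc2)
sim_13 = cosine_similarity(doc1, doc3)
print(round(sim_12, 3), round(sim_13, 3))
```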
  • 43. Finding patterns in text data Unsupervised methods Clustering ● Algorithms that use word vectors (TF-IDF, Word2Vec, etc.) to identify structural groupings between observations (words, documents) ● K-Means is a very commonly used one
  • 44. Finding patterns in text data Unsupervised methods Hierarchical/agglomerative clustering ● Start with all observations, and use a rule to pair them up, and repeat until there’s only one group left
  • 45. Finding patterns in text data Unsupervised methods Network analysis ● Can also get creative ● After all, we’re just working with columns and numbers ● Example: link words together by their strongest correlations
  • 46. Finding patterns in text data Unsupervised methods Pointwise mutual information ● Based on information theory ● Compares conditional and joint probabilities to measure the likelihood of a word occurring with a category/outcome, beyond random chance
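As a sketch, pointwise mutual information can be computed from four counts: documents containing the word, documents in the category, documents with both, and the total. The formula used is the standard log2(P(x, y) / (P(x) P(y))); the counts in the example are hypothetical.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information in bits: log2(P(x, y) / (P(x) * P(y)))."""
    if n_xy == 0:
        return float("-inf")  # the word and category never occur together
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: 100 documents, 50 contain the word, 50 are in the category.
print(pmi(25, 50, 50, 100))  # co-occurrence at exactly the chance rate
print(pmi(40, 50, 50, 100))  # co-occurrence above the chance rate
```

A value of 0 means the word and category co-occur exactly as often as chance predicts; positive values mean more often, negative values less often.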
  • 47. Finding patterns in text data Unsupervised methods Topic modeling ● Algorithms that characterize documents in terms of topics (groups of words) ● Find topics that best fit the data
  • 48. Finding patterns in text data Supervised methods ● Often we want to categorize documents ● Unsupervised methods can help ● But often we need to read and label them ourselves ● Classification models can take labeled data and learn to make predictions
  • 49. Finding patterns in text data Supervised methods Steps: ● Label a sample of documents ● Break your sample into two sets: a training sample and a test sample ● Train a model on the training sample ● Evaluate it on the test sample ● Apply it to the full set of documents to make predictions
  • 50. Finding patterns in text data Supervised methods ● First you need to develop a codebook ● Codebook: set of rules for labeling and categorizing documents ● The best codebooks have clear rules for hard cases, and lots of examples ● Categories should be MECE: mutually exclusive and collectively exhaustive
  • 52. Finding patterns in text data Supervised methods
  • 53. Finding patterns in text data Supervised methods ● Need to validate the codebook by measuring interrater reliability ● Makes sure your measures are consistent, objective, and reproducible ● Multiple people code the same document
  • 54. Finding patterns in text data Supervised methods ● Various metrics to test whether their agreement is high enough ○ Krippendorff’s alpha ○ Cohen’s kappa ● Can also compare coders against a gold standard, if available
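A minimal sketch of Cohen's kappa for two coders, which corrects raw agreement for the agreement expected by chance. The function name and example labels are illustrative; Krippendorff's alpha is more involved and is not shown.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same documents."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of documents")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Agreement expected by chance, from each coder's label proportions.
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    if expected == 1:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codes from two coders for the same eight documents.
coder_1 = [1, 1, 0, 0, 1, 0, 1, 0]
coder_2 = [1, 1, 0, 0, 1, 0, 0, 0]
print(cohens_kappa(coder_1, coder_2))
```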
  • 55. Finding patterns in text data Supervised methods ● Mechanical Turk can be a great way to code a lot of documents ● Have 5+ Turkers code a large sample of documents ● Collapse them together with a rule ● Code a subset in-house, and compute reliability
  • 56. Finding patterns in text data Supervised methods ● Sometimes you’re trying to measure something that’s really rare ● Consider oversampling ● Example: keyword oversampling ● Disproportionately select documents that are more likely to contain your outcomes
  • 57. Finding patterns in text data Supervised methods ● After coding, split your sample into two sets (~80/20) ○ One for training, one for testing ● We do this to check for (and avoid) overfitting
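The 80/20 split can be sketched with a seeded shuffle so the result is reproducible. The function name, the seed, and the toy labeled data are assumptions for illustration.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled sample: (document text, label) pairs.
labeled = [(f"document {i}", i % 2) for i in range(100)]
train, test = train_test_split(labeled)
print(len(train), len(test))
```

The model never sees the test documents during training, so its score on them indicates how well it generalizes rather than how well it memorized.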
  • 58. Finding patterns in text data Supervised methods ● Next step is called feature extraction or feature selection ● Need to extract “features” from the text ○ TF-IDF ○ Word2Vec vectors ● Can also utilize metadata, if potentially useful
  • 59. Finding patterns in text data Supervised methods ● Select a classification algorithm ● A common choice for text data is support vector machines (SVMs) ● Similar to regression, SVMs find the line that best separates two or more groups ● Can also use non-linear “kernels” for better fits (radial basis function, etc.) ● XGBoost is a newer and very promising algorithm
  • 60. Finding patterns in text data Supervised methods ● Word of caution: if you oversampled, your weights are probability (survey) weights ● Most machine learning implementations assume your weights are frequency weights ● Not an issue for predictions - but can produce biased variance estimates if you analyze your training data itself
  • 61. Finding patterns in text data Supervised methods ● Time to evaluate performance ● Lots of different metrics, depending on what you care about ● Often we care about precision/recall ○ Precision: did you pick out mostly needles or mostly hay? ○ Recall: how many needles did you miss? ● Other metrics: ○ Matthews correlation coefficient ○ Brier score ○ Overall accuracy
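Precision and recall for a binary outcome can be sketched directly from the counts of true positives, false positives, and false negatives. The labels below are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of picks that were needles
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of needles that were picked
    return precision, recall

# Hypothetical true labels and model predictions for eight documents.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```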
  • 62. Finding patterns in text data Supervised methods ● Doing just one split leaves a lot up to chance ● To get a more stable estimate of the model’s performance, it’s best to use K-fold cross-validation ● Splits your data into train/test sets multiple times and averages the performance metrics ● Ensures that you didn’t just get lucky (or unlucky)
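A sketch of how K-fold cross-validation partitions the data: every document lands in the test set exactly once, and the scores from the K rounds are averaged. `score_fn` is a hypothetical callback standing in for "train a model on these indices, evaluate on those".

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

def cross_validate(n_items, score_fn, k=5):
    """Average score_fn(train_indices, test_indices) over k folds."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n_items, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, k=5))
for train, test in folds:
    print(sorted(test), "held out;", len(train), "used for training")
```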
  • 63. Finding patterns in text data Supervised methods ● Model not working well? ● You probably need to tune your parameters ● You can use a grid search to test out different combinations of model parameters and feature extraction methods ● Many software packages can automatically help you pick the best combination to maximize your model’s performance
  • 64. Finding patterns in text data Supervised methods ● Suggested design: ○ Large training sample, coded by Turkers ○ Small evaluation sample, coded by Turkers and in-house experts ○ Compute IRR between Turk and experts ○ Train model on training sample, use 5-fold cross-validation ○ Apply model to evaluation sample, compare results against in-house coders and Turkers
  • 65. Finding patterns in text data Supervised methods ● Some (but not all) models produce probabilities along with their classifications ● Ideally you fit the model using your preferred scoring metric/function ● But you can also use post-hoc probability thresholds to adjust your model’s predictions
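Post-hoc thresholding can be sketched as a one-line rule applied to the model's predicted probabilities. The probabilities and cutoffs below are hypothetical; in practice the cutoff would be chosen by checking precision and recall on held-out data.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Label a document positive (1) when its probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities for five documents.
probabilities = [0.95, 0.70, 0.45, 0.30, 0.05]

default_labels = apply_threshold(probabilities)        # cutoff of 0.5
lenient_labels = apply_threshold(probabilities, 0.25)  # more positives: favors recall
strict_labels = apply_threshold(probabilities, 0.90)   # fewer positives: favors precision
print(default_labels, lenient_labels, strict_labels)
```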
  • 67. Open-source tools ● Python ○ NLTK, scikit-learn, pandas, numpy, scipy, gensim, spacy, etc ● R ○ https://cran.r-project.org/web/views/NaturalLanguageProcessing.html ● Java ○ Stanford Core NLP + many other useful libraries https://nlp.stanford.edu/software/
  • 68. Commercial tools ● Cloud-based NLP ○ Amazon Comprehend ○ Google Cloud Natural Language ○ IBM Watson NLU ● Software ○ SPSS Text Modeler ○ Provalis WordStat
  • 69. Time for a demo! Demo notebook: https://colab.research.google.com/github/patrickvankessel/WorldBank-Text-Analysis-2019/blob/master/Tutorial.ipynb Feel free to reach out: pvankessel@pewresearch.org Special thanks to Michael Jugovich, who helped me put together a lot of this material for a previous workshop we did at NYAAPOR!