Text Analytics: From Colored Pens and Crumbly Papers to Custom Machine Classifiers for Twitter
1. Text Analytics:
From Colored Pens and Crumbly Papers to
Custom Machine Classifiers for Twitter
Dr. Stuart W. Shulman
Founder & CEO, Texifter
@stuartwshulman
“…a wealth of information creates a poverty of attention.”
- Herbert Simon, 1971
2. Presentation Outline
1. Moving from pen and paper to machine-learning
2. Overview of the spectrum of methods
3. Portfolio identification using the five pillars
4. How Twitter data is relevant to evaluation of the WBG
6. Relations between Classes
Rates andTerms for Credit
Farm Profitability
Cost of Living
Soil Fertility
Education
Exploration
Speculation
Coding
Validation
7. Qualitative Methods: Genes, Taste, or Tactic?
• Qualitative by birth or choice?
• Some look to words as an alternative to number crunching
• Others rooted in rich and meaningful interpretive traditions
• Another group is fluent in both qual & quant
• Mixed methods open up rather than limit fields of knowledge
• One central goal is valid inferences about phenomena
• Replicable and transparent methods
• Attention to error and corrective measures
• Internal and external validation of results
• Using computers for qualitative data analysis helps, but…
• Rigor still originates with the research design, not the technology
• Software makes better organization and efficiency possible
• Coders enable the researcher to step back while scaling up
8. Purist Pluralist Positivist
A Spectrum of Approaches to Working with Qualitative Data
Different types of knowledge claims depending on where you sit
deep immersion
closeness to data
antipathy to numbers
credible interpretation
in-depth analysis
contextual
subjective
experimental
mixed method
adaptive hybrid
flexible approach
interdisciplinary
quantitative
focus on error
measurement critical
validity and reliability
replication & objectivity
generalization
hypotheses
These choices can be philosophical, ideological, and ethical in nature
9. Stuart W. Shulman. 2003. "An Experiment in Digital Government at the
United States National Organic Program," Agriculture and Human Values
20(3), 253-265.
16. Over 13,000 hours of video and audio were recorded of the public spaces in an LTC facility’s dementia unit in
suburban Pittsburgh, PA. A codebook of 80+ codes was developed to categorize the behavior of the consenting
residents and staff (only in relation to patients). 22 coders spent more than 4,400 hours over a period of 22
months coding the video data. The data were coded using the Informedia Digital Video Library (IDVL), an
interface designed by computer scientists at Carnegie Mellon University.
19. Grimmer & Stewart
“Text as Data” Political Analysis (2013)
Volume is a problem for scholars
Coders are expensive
Groups struggle to accurately label text at scale
Validation of both humans and machines is essential
Some models are easier to validate than others
All models are wrong
Automated models enhance/amplify, but don’t replace humans
There is no one right way to do this
“Validate, validate, validate”
“What should be avoided then, is the blind use of
any method without a validation step.”
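A validation step of the kind urged above can be sketched in a few lines. This is a minimal illustration, not a method taken from Grimmer & Stewart or from any Texifter tool: it computes Cohen's kappa between two human coders, then precision and recall of a machine classifier against one coder treated as the reference. The labels, the coder names, and the choice of these particular metrics are all assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Share of items on which the two coders gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


def precision_recall(gold, predicted, positive):
    """Precision and recall for one class, with `gold` as the reference."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for eight items ("rel" = relevant, "irr" = irrelevant).
coder_1 = ["rel", "rel", "irr", "irr", "rel", "irr", "rel", "irr"]
coder_2 = ["rel", "irr", "irr", "irr", "rel", "irr", "rel", "rel"]
machine = ["rel", "rel", "irr", "rel", "rel", "irr", "irr", "irr"]

kappa = cohen_kappa(coder_1, coder_2)
precision, recall = precision_recall(coder_1, machine, "rel")
print(f"human-human kappa: {kappa:.2f}")
print(f"machine precision: {precision:.2f}, recall: {recall:.2f}")
```

Checking the humans against each other before checking the machine against the humans reflects the point that both need validation.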
26. CoderRank is our key innovation
Patent issued in 2016
Service Mark issued 2017
27. CoderRank for Enhanced Machine-Learning
CoderRank is to text analytics what PageRank was to search.
Just as Google said not all web pages are created equal,
Texifter argues that not all humans are created equal. When
training machines, it is best to rely most on the humans most
likely to create a valid observation. We proposed a unique way
to rank humans on trust and knowledge vectors.
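The general idea of relying most on the most trustworthy coders can be illustrated with trust-weighted voting. This is not the patented CoderRank method, whose details are not given on the slide; it is a generic sketch that assumes trust is simply each coder's accuracy on a small gold-standard set. The coder names, items, and labels are hypothetical.

```python
from collections import defaultdict


def coder_trust(coder_labels, gold):
    """Fraction of gold-standard items each coder labeled correctly."""
    trust = {}
    for coder, labels in coder_labels.items():
        scored = [item for item in gold if item in labels]
        correct = sum(labels[item] == gold[item] for item in scored)
        trust[coder] = correct / len(scored) if scored else 0.0
    return trust


def weighted_label(item, coder_labels, trust):
    """Pick the label whose supporters carry the most total trust."""
    votes = defaultdict(float)
    for coder, labels in coder_labels.items():
        if item in labels:
            votes[labels[item]] += trust[coder]
    return max(votes, key=votes.get) if votes else None


# Hypothetical gold-standard labels and coder decisions.
gold = {"t1": "pos", "t2": "neg", "t3": "pos", "t4": "neg"}
coder_labels = {
    "alice": {"t1": "pos", "t2": "neg", "t3": "pos", "t4": "neg", "t5": "pos"},
    "bob": {"t1": "neg", "t2": "neg", "t3": "neg", "t4": "pos", "t5": "neg"},
    "carol": {"t1": "pos", "t2": "pos", "t3": "neg", "t4": "pos", "t5": "neg"},
}

trust = coder_trust(coder_labels, gold)
print(trust)
print(weighted_label("t5", coder_labels, trust))
```

On the unlabeled item t5, a simple majority vote would follow the two low-accuracy coders, while the weighted vote follows the one coder who matched the gold standard every time.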
55. Our ActiveLearning engine and coding tools combine…
what humans do best… with what computers do best
Humans and machines learning together
Keep humans “in-the-loop” for more accurate results and better insights
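One common way to keep humans in the loop is uncertainty sampling: the machine scores every unlabeled item, and the items it is least sure about are sent to human coders first. The sketch below illustrates that general technique only; it is not a description of how Texifter's engine works. The tweets and the model probabilities are hypothetical.

```python
def uncertainty(prob_positive):
    """0.0 when the model is certain, 1.0 when it is split 50/50."""
    return 1.0 - abs(prob_positive - 0.5) * 2


def pick_for_human_review(scored_items, batch_size):
    """Return the texts the model is least certain about, most uncertain first."""
    ranked = sorted(scored_items, key=lambda pair: uncertainty(pair[1]), reverse=True)
    return [text for text, _ in ranked[:batch_size]]


# Hypothetical (text, probability of "relevant") pairs from some classifier.
scored = [
    ("tweet A", 0.97),
    ("tweet B", 0.52),
    ("tweet C", 0.08),
    ("tweet D", 0.41),
    ("tweet E", 0.75),
]

print(pick_for_human_review(scored, batch_size=2))
```

Human labels on the selected items would then be added to the training set and the classifier retrained, repeating until validation scores are acceptable.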
56. Boolean Operators Cannot Solve Every Problem
There are language problems well-suited to machine-learning
We are all training classifiers in daily life
Spam filtering gave way to Amazon & Netflix
Humans and machines are constantly learning together
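The limit of Boolean search can be made concrete with an ambiguous word. The sketch below compares a hypothetical Boolean query (bank AND NOT river) with a tiny naive Bayes classifier trained on six hand-labeled sentences. Both the query and the training sentences are invented for illustration, and naive Bayes is just one simple example of a learned classifier, not one named in the slides.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def boolean_match(text):
    """Hypothetical Boolean query: bank AND NOT river."""
    words = set(tokenize(text))
    return "bank" in words and "river" not in words


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.doc_counts[label] += 1
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.vocab.update(tokens)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical human-coded training examples.
model = NaiveBayes()
for sentence in ["world bank approves loan for farmers",
                 "bank raises credit rates",
                 "new loan program from the bank"]:
    model.train(sentence, "finance")
for sentence in ["fishing on the river bank",
                 "the muddy bank of the stream",
                 "picnic on the grassy bank"]:
    model.train(sentence, "other")

test = "the bank of the stream flooded"
print(boolean_match(test))   # the query matches, although this is not about finance
print(model.classify(test))  # the classifier uses the surrounding words
```

The Boolean query could be patched with more exclusions, but each patch handles one case, while the classifier improves as humans code more examples.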
69. Twitter Data Should Be Human Coded
Using the Twitter Display
The rush to CSV is a mistake; data is degraded
77. Contents
Network
Time Series
Description
Author Description
Overall Metrics
Top Influencers
Top URLs
Top Domains
Top Hashtags
Top Words
Top Word Pairs
Top Replied-To
Top Mentioned
Top Tweeters
Network
sdonnan
WorldBank
CraigHammerd
bijancbayne
YouTube
TweetsAnup
realDonaldTrump
Nik_6996
jeremyhillman
alanBStardmp
Created with NodeXL
(http://nodexl.codeplex.com)
from the Social Media Research Foundation
(http://www.smrfoundation.org)