Text Analytics: From Colored Pens and Crumbly Papers to Custom Machine Classifiers for Twitter

Dr. Stuart W. Shulman
Founder & CEO, Texifter
@stuartwshulman
“…a wealth of information creates a poverty of attention.”
- Herbert Simon, 1971
Presentation Outline
1. Moving from pen and paper to machine-learning
2. Overview of the spectrum of methods
3. Portfolio identification using the five pillars
4. How Twitter data is relevant to evaluation of the WBG
Our Core Philosophy
Emergent properties in very well read texts such as
the archetypal “extremist agent of the law”
Agenda-Setting in the Progressive Era Print Press
Relations between Classes
Rates and Terms for Credit
Farm Profitability
Cost of Living
Soil Fertility
Education
Exploration
Speculation
Coding
Validation
Qualitative Methods: Genes, Taste, or Tactic?
• Qualitative by birth or choice?
• Some look to words as an alternative to number crunching
• Others rooted in rich and meaningful interpretive traditions
• Another group is fluent in both qual & quant
• Mixed methods open up rather than limit fields of knowledge
• One central goal is valid inferences about phenomena
• Replicable and transparent methods
• Attention to error and corrective measures
• Internal and external validation of results
• Using computers for qualitative data analysis helps, but…
• Rigor still originates with the research design, not the technology
• Software makes better organization and efficiency possible
• Coders enable the researcher to step back while scaling up
A Spectrum of Approaches to Working with Qualitative Data
Different types of knowledge claims depending on where you sit
• Purist: deep immersion; closeness to data; antipathy to numbers; credible interpretation; in-depth analysis; contextual; subjective
• Pluralist: experimental; mixed method; adaptive hybrid; flexible approach; interdisciplinary
• Positivist: quantitative; focus on error; measurement critical; validity and reliability; replication & objectivity; generalization; hypotheses
These choices can be philosophical, ideological, and ethical in nature
Stuart W. Shulman. 2003. "An Experiment in Digital Government at the
United States National Organic Program," Agriculture and Human Values
20(3), 253-265.
Coding Web Sites and Focus Groups to Study Agenda-Setting
Annotation to Improve Optical Character Recognition
Over 13,000 hours of video and audio were recorded of the public spaces in an LTC facility’s dementia unit in
suburban Pittsburgh, PA. A codebook of 80+ codes was developed to categorize the behavior of the consenting
residents and staff (only in relation to patients). Twenty-two coders spent more than 4,400 hours over a period of 22
months coding the video data. The data were coded using the Informedia Digital Video Library (IDVL), an
interface designed by computer scientists at Carnegie Mellon University.
An Incredibly Important Book
Grimmer & Stewart
“Text as Data” Political Analysis (2013)
Volume is a problem for scholars
Coders are expensive
Groups struggle to accurately label text at scale
Validation of both humans and machines is essential
Some models are easier to validate than others
All models are wrong
Automated models enhance/amplify, but don’t replace humans
There is no one right way to do this
“Validate, validate, validate”
“What should be avoided then, is the blind use of
any method without a validation step.”
Free, Open-Source, Web-based Text Analytics Toolkit
Original Software Kernel: Tools for Measurement
Text Classification
A 2,500-year-old problem
Plato argued it would be frustrating; it still is
Software cannot remove the problem
Computer Science and National Science Foundation
Influences in a Nutshell: Measure Everything!
Fast?
Reliable?
Accurate?
Valid?
Interrater Reliability: A Critical Measurement
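One common way to quantify interrater reliability between two coders is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The slides do not say which statistic is used, so the sketch below is only an illustration; the function name and the six sample labels are made up.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same items in the same order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two coders for six tweets
coder_a = ["relevant", "relevant", "spam", "spam", "relevant", "spam"]
coder_b = ["relevant", "spam", "spam", "spam", "relevant", "relevant"]
kappa = cohens_kappa(coder_a, coder_b)
```

Here the coders agree on four of six items, but half of that agreement would be expected by chance, so kappa is well below the raw agreement rate.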
Adjudication: Creating a Gold Standard
CoderRank is our key innovation
Patent issued in 2016
Service Mark issued 2017
CoderRank for Enhanced Machine-Learning
CoderRank is to text analytics what PageRank was to search.
Just as Google said not all web pages are created equal,
Texifter argues that not all humans are created equal. When
training machines, it is best to rely most on the humans most
likely to create a valid observation. We proposed a unique way
to rank humans on trust and knowledge vectors.
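The slides do not describe how CoderRank computes its rankings, so the following is only a generic illustration of the underlying idea: weight each coder's vote by how often that coder agreed with an adjudicated gold standard. The names `coder_accuracy` and `weighted_label` and all of the data are hypothetical; this is not Texifter's patented method.

```python
from collections import defaultdict


def coder_accuracy(codings, gold):
    """Fraction of gold-standard items each coder labeled correctly."""
    scores = {}
    for coder, labels in codings.items():
        scored = [item for item in labels if item in gold]
        correct = sum(labels[item] == gold[item] for item in scored)
        scores[coder] = correct / len(scored) if scored else 0.0
    return scores


def weighted_label(votes, weights):
    """Pick the label with the largest total coder weight."""
    totals = defaultdict(float)
    for coder, label in votes.items():
        totals[label] += weights.get(coder, 0.0)
    return max(totals, key=totals.get)


# Hypothetical gold standard and past codings by three coders
gold = {"t1": "pos", "t2": "neg", "t3": "pos", "t4": "neg"}
codings = {
    "ann": {"t1": "pos", "t2": "neg", "t3": "pos", "t4": "neg"},
    "bob": {"t1": "neg", "t2": "neg", "t3": "neg", "t4": "pos"},
    "cy": {"t1": "pos", "t2": "pos", "t3": "neg", "t4": "pos"},
}
weights = coder_accuracy(codings, gold)

# On a new item, one reliable coder outweighs two unreliable ones.
label = weighted_label({"ann": "pos", "bob": "neg", "cy": "neg"}, weights)
```

A simple majority vote would have chosen "neg" for the new item; weighting by past accuracy is what lets the most trusted coder prevail.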
Pronounced “tech-sifter”: the metaphor is of a sifter
Avoid Tennis Elbow
Items load to the screen and the coder hits the keystroke
Keystroke Human Coding
Human coding distributed to individuals, groups & crowds
Data
Codes
The Five Pillars of Text Analytics
Search
Metadata Filtering
De-duplication and Clustering
Human Coding
Machine-Learning
Pillar One: Search
Pillar One: Defined Multi-term Search
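A defined multi-term search can be pictured as matching every document against a fixed list of terms and keeping the ones that hit. The sketch below assumes case-insensitive whole-word matching; the function name and sample texts are illustrative and do not reflect how DiscoverText implements search.

```python
import re


def multi_term_search(documents, terms, require_all=False):
    """Return (index, matched_terms) for each document that matches the term list."""
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    hits = []
    for index, text in enumerate(documents):
        matched = [term for term, pattern in patterns.items() if pattern.search(text)]
        # Keep the document on any match, or only on a full match if require_all.
        if matched and (not require_all or len(matched) == len(terms)):
            hits.append((index, matched))
    return hits


docs = [
    "The World Bank approved a new education project.",
    "Bank holiday traffic was heavy.",
    "Education and health spending rose.",
]
any_hits = multi_term_search(docs, ["world bank", "education"])
all_hits = multi_term_search(docs, ["world bank", "education"], require_all=True)
```

The second document mentions "Bank" but not the phrase "world bank", so it is excluded from both result sets.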
Pillar Two: Metadata Filters
Slicing big piles of text into smaller, more focused sets is key
All Text Analytics are Filtering Techniques
Users Drill Into Interactive Displays
Use metadata to examine sub-sets of responses and create reports
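Metadata filtering amounts to slicing a collection on fields other than the text itself. A minimal sketch follows; the field names (`lang`, `followers`) and the records are invented for illustration and are not taken from any particular data source.

```python
def filter_by_metadata(items, **criteria):
    """Keep items whose metadata satisfies every criterion.

    A criterion may be a plain value (equality test) or a callable that
    receives the field value and returns True or False.
    """
    def matches(item):
        for field, wanted in criteria.items():
            value = item.get(field)
            if callable(wanted):
                ok = value is not None and wanted(value)
            else:
                ok = value == wanted
            if not ok:
                return False
        return True

    return [item for item in items if matches(item)]


# Hypothetical tweet records with metadata fields
tweets = [
    {"text": "Report out today", "lang": "en", "followers": 12000},
    {"text": "Nuevo informe", "lang": "es", "followers": 300},
    {"text": "Thoughts on the report", "lang": "en", "followers": 45},
]
english = filter_by_metadata(tweets, lang="en")
subset = filter_by_metadata(tweets, lang="en", followers=lambda n: n >= 1000)
```

Stacking criteria narrows a large pile into a smaller, more focused set, which is the point the slides make about slicing.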
Pillar Three: Duplicate Detection & Clustering
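Near-duplicate detection is often done by comparing sets of overlapping word sequences (shingles) with Jaccard similarity. The slides do not name the algorithm in use, so this sketch shows one common approach; the shingle size, the 0.5 threshold, and the sample texts are arbitrary assumptions.

```python
def shingles(text, size=3):
    """Set of overlapping word n-grams for a text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_near_duplicates(texts, threshold=0.5):
    """Greedy single pass: join the first cluster whose seed text is similar enough."""
    clusters = []  # list of (seed_shingles, member_indices)
    for index, text in enumerate(texts):
        current = shingles(text)
        for seed, members in clusters:
            if jaccard(current, seed) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((current, [index]))
    return [members for _, members in clusters]


texts = [
    "please support the new organic standards rule today",
    "please support the new organic standards rule now",
    "the bank announced a new infrastructure loan",
]
groups = cluster_near_duplicates(texts)
```

The first two texts differ only in their last word, so they share five of seven distinct shingles and land in the same cluster.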
Latent Dirichlet Allocation (LDA) Topic Models
Topic Composition (Sample Terms and Phrases)
• Topic 1: development; projects; support; financials
• Topic 2: education; training; standards; teaching
• Topic 3: health; coverage; ministry; information
• Topic 4: government; investment; social; policy
• Topic 5: administrative; coordination; market; procedures
• Topic 6: institutional; technical; strengthening; programs
• Topic 7: infrastructure; rehabilitation; maintenance; upgrading
• Topic 8: utility; company; privatization; restructuring; supply
Pillar Four: Human Coding (Labeling or Tagging)
Human Coding Converted into Machine Classifiers
Accumulated human coding becomes
training data via machine-learning
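To make the step from human codes to a machine classifier concrete, here is a small sketch that trains on a handful of coded texts. The slides do not specify a learning algorithm; multinomial naive Bayes is used here purely as a familiar stand-in, and the class name, labels, and example texts are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one smoothing, trained on human-coded text."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_proba(self, text):
        total_docs = sum(self.label_counts.values())
        log_scores = {}
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            log_scores[label] = score
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(exp.values())
        return {label: value / norm for label, value in exp.items()}

    def predict(self, text):
        probs = self.predict_proba(text)
        return max(probs, key=probs.get)


# Hypothetical human-coded items: (text, code)
coded = [
    ("world bank funds rural schools", "on_topic"),
    ("bank loan supports health clinics", "on_topic"),
    ("river bank flooded after storm", "off_topic"),
    ("fishing from the river bank", "off_topic"),
]
texts, labels = zip(*coded)
model = NaiveBayesTextClassifier().fit(texts, labels)
prediction = model.predict("storm floods the river")
```

With only four training items this is a toy; in practice the accumulated codes from many coders would supply the training set, and the result would still need validation against held-out human codes.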
Simplified Coding Management
Crowdsourcing accelerates the insight generation process
Synchronous & Asynchronous Collaboration
Pillar Five: Machine-Learning
Our ActiveLearning engine and coding tools combine…
what humans do best… with what computers do best
Humans and machines learning together
Keep humans “in-the-loop” for more accurate results and better insights
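One widely used way to keep humans in the loop is uncertainty sampling: the machine asks people to code the items it is least sure about. The slides do not detail how the ActiveLearning engine selects items, so this is a generic sketch; the function name and the probability scores are hypothetical.

```python
def least_confident(probabilities, batch_size=2):
    """Return ids of the items whose top predicted probability is lowest.

    `probabilities` maps item id -> {label: probability}.
    """
    confidence = {item: max(dist.values()) for item, dist in probabilities.items()}
    # Sort by confidence, breaking ties by item id for a stable order.
    ranked = sorted(confidence, key=lambda item: (confidence[item], str(item)))
    return ranked[:batch_size]


# Hypothetical classifier output for four uncoded tweets
scores = {
    "t1": {"on_topic": 0.97, "off_topic": 0.03},
    "t2": {"on_topic": 0.55, "off_topic": 0.45},
    "t3": {"on_topic": 0.10, "off_topic": 0.90},
    "t4": {"on_topic": 0.48, "off_topic": 0.52},
}
to_review = least_confident(scores)
```

The two near coin-flip items go to human coders first; their codes can then be added to the training data before the classifier is retrained.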
Boolean Operators Cannot Solve Every Problem
There are language problems well-suited to machine-learning
We are all training classifiers in daily life
Spam filtering gave way to Amazon & Netflix
Humans and machines are constantly learning together
Interested in Money Banks?
Researching a Politician?
Doing Smoking Research?
Brands?
Studying a Sports Team?
Super Bowl History Versus Political History?
Twitter Can Feel Overwhelming
Full Historical Twitter Access with Free Estimates
PowerTrack Operators for Precise Queries
Create and Test Rules Self Serve
Three Estimates Per Day Sent Via Email
# of Tweets
Cost
Twitter Data Should Be Human Coded
Using the Twitter Display
The rush to CSV is a mistake; data is degraded
Contents
Network
Time Series
Description
Author Description
Overall Metrics
Top Influencers
Top URLs
Top Domains
Top Hashtags
Top Words
Top Word Pairs
Top Replied-To
Top Mentioned
Top Tweeters
Network
sdonnan
WorldBank
CraigHammerd
bijancbayne
YouTube
TweetsAnup
realDonaldTrump
Nik_6996
jeremyhillman
alanBStardmp
Created with NodeXL
(http://nodexl.codeplex.com)
from the Social Media Research Foundation
(http://www.smrfoundation.org)
Special Enterprise License Keys
Everyone can have one
Request via an email: info@texifter.com
For more information
discovertext.com
@discovertext
Thank you for listening!
Dr. Stuart Shulman
@stuartwshulman
Text Analytics: From Colored Pens and Crumbly Papers to Custom Machine Classifiers for Twitter

  • 1. Text Analytics: From Colored Pens and Crumbly Papers to Custom Machine Classifiers forTwitter Dr. StuartW. Shulman Founder & CEO,Texifter @stuartwshulman “…a wealth of information creates a poverty of attention.” - Herbert Simon, 1971
  • 2. Presentation Outline 1. Moving from pen and paper to machine-learning 2. Overview of the spectrum methods 3. Portfolio identification using the five pillars 4. HowTwitter data is relevant to evaluation of theWBG
  • 4. Emergent properties in very well read texts such as the archetypal “extremist agent of the law”
  • 5. Agenda-Setting in the Progressive Era Print Press
  • 6. Relations between Classes Rates andTerms for Credit Farm Profitability Cost of Living Soil Fertility Education Exploration Speculation Coding Validation
  • 7. Qualitative Methods: Genes,Taste, orTactic? • Qualitative by birth or choice? • Some look to words as an alternative to number crunching • Others rooted in rich and meaningful interpretive traditions • Another group is fluent in both qual & quant • Mixed methods open up rather than limits fields of knowledge • One central goal is valid inferences about phenomena • Replicable and transparent methods • Attention to error and corrective measures • Internal and external validation of results • Using computers for qualitative data analysis helps, but… • Rigor still originates with the research design, not the technology • Software makes better organization and efficiency possible • Coders enable the researcher to step back while scaling up
  • 8. Purist, Pluralist, Positivist: A Spectrum of Approaches to Working with Qualitative Data. Different types of knowledge claims depending where you sit: deep immersion closeness to data antipathy to numbers credible interpretation in-depth analysis contextual subjective experimental mixed method adaptive hybrid flexible approach interdisciplinary quantitative focus on error measurement critical validity and reliability replication & objectivity generalization hypotheses. These choices can be philosophical, ideological, and ethical in nature
  • 9. Stuart W. Shulman. 2003. "An Experiment in Digital Government at the United States National Organic Program," Agriculture and Human Values 20(3), 253-265.
  • 14. Coding Web Sites and Focus Groups to Study Agenda-Setting
  • 15. Annotation to Improve Optical Character Recognition
  • 16. Over 13,000 hours of video and audio were recorded of the public spaces in an LTC facility’s dementia unit in suburban Pittsburgh, PA. A codebook of 80+ codes was developed to categorize the behavior of the consenting residents and staff (only in relation to patients). 22 coders spent more than 4,400 hours over a period of 22 months coding the video data. The data were coded using the Informedia Digital Video Library (IDVL), an interface designed by computer scientists at Carnegie Mellon University.
  • 19. Grimmer & Stewart, “Text as Data,” Political Analysis (2013) • Volume is a problem for scholars • Coders are expensive • Groups struggle to accurately label text at scale • Validation of both humans and machines is essential • Some models are easier to validate than others • All models are wrong • Automated models enhance/amplify, but don’t replace humans • There is no one right way to do this • “Validate, validate, validate” • “What should be avoided then, is the blind use of any method without a validation step.”
  • 22. Text Classification • A 2,500-year-old problem • Plato argued it would be frustrating; it still is • Software cannot remove the problem
  • 23. Computer Science and National Science Foundation Influences in a Nutshell: Measure Everything! Fast? Reliable? Accurate? Valid?
  • 24. Interrater Reliability: A Critical Measurement
  • 25. Adjudication: Creating a Gold Standard
  • 26. CoderRank is our key innovation • Patent issued in 2016 • Service Mark issued in 2017
  • 27. CoderRank for Enhanced Machine-Learning: CoderRank is to text analytics what PageRank was to search. Just as Google said not all web pages are created equal, Texifter argues that not all humans are created equal. When training machines, it is best to rely most on the humans most likely to create a valid observation. We proposed a unique way to rank humans on trust and knowledge vectors.
  • 28. Pronounced “tech-sifter”: the metaphor is of a sifter
  • 31. Avoid Tennis Elbow: Items load to the screen and the coder hits the keystroke
  • 32. Keystroke Human Coding: Human coding distributed to individuals, groups & crowds (Data, Codes)
  • 33. The Five Pillars of Text Analytics 1. Search 2. Metadata Filtering 3. De-duplication and Clustering 4. Human Coding 5. Machine-Learning
  • 35. Pillar One: Defined Multi-term Search
  • 41. Slicing big piles of text into smaller, more focused sets is key • All Text Analytics are Filtering Techniques
  • 43. Users Drill Into Interactive Displays Use metadata to examine sub-sets of responses and create reports
  • 46. Latent Dirichlet Allocation (LDA) Topic Models
  • 47. Topic Composition (Sample Terms and Phrases) • Topic 1: development; projects; support; financials • Topic 2: education; training; standards; teaching • Topic 3: health; coverage; ministry; information • Topic 4: government; investment; social; policy • Topic 5: administrative; coordination; market; procedures • Topic 6: institutional; technical; strengthening; programs • Topic 7: infrastructure; rehabilitation; maintenance; upgrading • Topic 8: utility; company; privatization; restructuring; supply
  • 48. Pillar Four: Human Coding (Labeling or Tagging)
  • 49. Human Coding Converted into Machine Classifiers: Accumulated human coding becomes training data via machine-learning
  • 51. Crowdsourcing accelerates the insight generation process • Synchronous & Asynchronous Collaboration
  • 55. Our ActiveLearning engine and coding tools combine… what humans do best… with what computers do best • Humans and machines learning together • Keep humans “in-the-loop” for more accurate results and better insights
  • 56. Boolean Operators Cannot Solve Every Problem • There are language problems well-suited to machine-learning • We are all training classifiers in daily life • Spam filtering gave way to Amazon & Netflix • Humans and machines are constantly learning together
  • 62. Super Bowl History Versus Political History?
  • 63. Twitter Can Feel Overwhelming
  • 65. Full Historical Twitter Access with Free Estimates
  • 66. PowerTrack Operators for Precise Queries
  • 67. Create and Test Rules Self Serve
  • 68. Three Estimates Per Day Sent Via Email: # of Tweets, Cost
  • 69. Twitter Data Should Be Human Coded Using the Twitter Display • The rush to CSV is a mistake; data is degraded
  • 77. Contents: Network; Time Series; Description; Author Description; Overall Metrics; Top Influencers; Top URLs; Top Domains; Top Hashtags; Top Words; Top Word Pairs; Top Replied-To; Top Mentioned; Top Tweeters. Network: sdonnan, WorldBank, CraigHammerd, bijancbayne, YouTube, TweetsAnup, realDonaldTrump, Nik_6996, jeremyhillman, alanBStardmp. Created with NodeXL (http://nodexl.codeplex.com) from the Social Media Research Foundation (http://www.smrfoundation.org)
  • 79. Special Enterprise License Keys • Everyone can have one • Request via an email: info@texifter.com
  • 80. For more information: discovertext.com, @discovertext. Thank you for listening! Dr. Stuart Shulman @stuartwshulman
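
Slide 24 calls interrater reliability a critical measurement, but the transcript does not say which statistic is used. As a minimal sketch, assuming two coders have labeled the same items, the example below computes Cohen's kappa, one common chance-corrected agreement measure. The function name `cohens_kappa`, the label names, and the ten sample labels are hypothetical and chosen only for illustration; they are not taken from the deck or from any Texifter product.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same items in the same order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(coder_a)

    # Observed agreement: share of items that received identical labels.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement, estimated from each coder's own label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both coders used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten tweets (illustrative data only).
coder_a = ["relevant", "relevant", "spam", "relevant", "spam",
           "spam", "relevant", "relevant", "spam", "relevant"]
coder_b = ["relevant", "spam", "spam", "relevant", "spam",
           "relevant", "relevant", "relevant", "spam", "relevant"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.3f}")  # prints "kappa = 0.583"
```

Here the coders agree on 8 of 10 items (0.80 observed agreement), while their label frequencies imply 0.52 agreement by chance, so kappa is (0.80 - 0.52) / (1 - 0.52), roughly 0.58. Items where coders disagree are the natural candidates for the adjudication step named on slide 25.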