Small Data Classification for
Natural Language Processing
Michael Thorne
Head of Data Science, CaliberMind
2 | ©2016 CaliberMind
Goals
• Intro
• What Makes NLP Different
• Solutions
• Questions
Obligatory Speaker Bio
Michael Thorne
Head of Data Science, CaliberMind
MS Data Science Program, GalvanizeU
B.S. Physics, Fordham University
NSA Analytic Lead
US Navy Digital Network Intelligence Analyst / Cryptolinguist
CaliberMind
• B2B marketing SaaS
• Persona modeling and personality insights
• Content matching across the buyer journey for high-value, complex purchase decisions
• Our core competency is natural language processing
What’s So Special About NLP?
• Not random (Zipf’s Law)
• Huge feature space
• Subjective Criteria
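The "not random" bullet can be made concrete with a quick word count. This is a minimal sketch on a made-up sentence (not CaliberMind data), and the helper name `rank_frequencies` is our own. Under Zipf's law a word's frequency falls off roughly as 1/rank, so a handful of terms dominate while most of the vocabulary appears once or twice.

```python
from collections import Counter


def rank_frequencies(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


# Toy corpus, invented for illustration only.
sample = (
    "the buyer reads the report and the buyer shares the report "
    "with the team and the team reviews a new proposal"
)

for rank, word, count in rank_frequencies(sample):
    print(rank, word, count)
```

Even in this tiny sample one word accounts for more than a quarter of all tokens, and most of the other words occur only once. That long tail is also what makes the feature space so large.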
Small Data NLP
Persona Status Quo
• Assumptive Personas
• Qualitative Criteria
• Subjective Labels
• Static Output
Starting Point
Demographics
Psychographics
Firmographics
Let’s Validate the Status Quo
CaliberMind’s Data Challenge
• We match the right message to the right person at the right time
• We operate at the upper limit of human-scale problems (hundreds to tens of thousands of documents)
• Our results weren’t as accurate as we expected
Our Friend: The Central Limit Theorem
• Averages of many independent samples are approximately normally distributed, so we can treat our data as well behaved, provided we have enough of it
• Let’s look at a classic example, coin tosses
Coin Flip Distribution
1 Trial
100 Trials
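The coin-toss slides can be reproduced with a short simulation. This is a sketch only: the function names, the seed, and the choice of 2,000 repeated experiments are assumptions for illustration, not values from the talk. It measures how much the observed fraction of heads varies when each experiment is a single flip versus 100 flips.

```python
import random
import statistics


def heads_fraction(n_flips, rng):
    """Fraction of heads in n_flips tosses of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n_flips)) / n_flips


def spread_of_fractions(n_flips, n_experiments=2000, seed=0):
    """Standard deviation of the heads fraction across repeated experiments."""
    rng = random.Random(seed)
    fractions = [heads_fraction(n_flips, rng) for _ in range(n_experiments)]
    return statistics.pstdev(fractions)


spread_1 = spread_of_fractions(1)      # each experiment is a single flip: 0 or 1
spread_100 = spread_of_fractions(100)  # each experiment averages 100 flips

print(f"spread with 1 flip per experiment:    {spread_1:.3f}")
print(f"spread with 100 flips per experiment: {spread_100:.3f}")
```

With one flip the result is all-or-nothing. With 100 flips the fractions bunch tightly around 0.5. Small data sets sit much closer to the first case than the second.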
Example: K-Means
• K-means is a workhorse algorithm when doing unsupervised learning
• What are the assumptions we make when we use k-means?
  • Spherical clusters
  • Equal variance across clusters
  • Equal prior probability (clusters of similar size)
• Turns out NLP data is none of these things
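To see why those assumptions matter, here is a minimal one-dimensional sketch of Lloyd's algorithm (the standard k-means iteration) in plain Python. The data is hypothetical: one large, spread-out group and one small, tight group, which violates both the equal-variance and the equal-prior assumptions. The function name `kmeans_1d` and the choice of starting centroids are ours.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm on 1-D data; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new_centroids.append(sum(members) / len(members) if members else centroids[k])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical data: a big, wide group and a small, tight group.
big_group = [float(x) for x in range(0, 101)]  # 101 points spanning 0..100
small_group = [130.0, 131.0, 132.0]            # 3 points, far to the right
points = big_group + small_group

centroids, labels = kmeans_1d(points, centroids=[min(points), max(points)])
label_of = dict(zip(points, labels))
print("centroids:", centroids)
print("0 and 100 in same cluster:  ", label_of[0.0] == label_of[100.0])
print("100 and 130 in same cluster:", label_of[100.0] == label_of[130.0])
```

In this run k-means cuts the large group roughly in half and folds the small group into one of the halves, even though the starting centroids sat at the two extremes. Minimizing within-cluster variance rewards splitting whatever is biggest and widest, not whatever is actually distinct.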
Happy K-Means
NLP K-Means
But Wait, It Gets Better
• Our documents tend to be of vastly different sizes within the same corpus
• Unbalanced Classes
• Qualitative Criteria
• Unlabeled data
• Human-labeling is time intensive
Our Solution
Dimensionality Reduction
• Dimensionality was the first thing we tackled
• Manual dictionaries to collapse similar terms
• mark = ['growth hacker', 'marketer', 'demand gen']
• LSA to remove low-information terms
• Automating the process using word2vec, DBpedia, and skip-gram similarities
• As we aggregate more data, we’re able to do this process more effectively
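The manual-dictionary step might look something like the sketch below. Only the `mark` entry comes from the slide. The `sec` entry, the regex approach, and the function name `collapse_terms` are assumptions made for illustration.

```python
import re

# Hypothetical dictionary: canonical token -> surface forms it absorbs.
# The "mark" entry mirrors the slide; "sec" is invented for illustration.
COLLAPSE = {
    "mark": ["growth hacker", "marketer", "demand gen"],
    "sec": ["sysops", "security engineer"],
}

# One regex per canonical token, longest phrases first so that
# multi-word terms win over any shorter overlapping ones.
PATTERNS = {
    canonical: re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True)) + r")\b"
    )
    for canonical, terms in COLLAPSE.items()
}


def collapse_terms(text):
    """Lower-case the text and replace each known variant with its canonical token."""
    text = text.lower()
    for canonical, pattern in PATTERNS.items():
        text = pattern.sub(canonical, text)
    return text


print(collapse_terms("Growth Hacker and Demand Gen lead"))
print(collapse_terms("Senior Marketer, former SysOps"))
```

Every variant that collapses into one token removes a column from the feature space, which matters most when there are only a few hundred documents to learn from.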
Spiky Data
Metrics Over Raw Scores
• Especially important when comparing data of different sizes
• A z-score (how many standard deviations a score sits from the mean) works better than a raw similarity score
• Pick the best similarity score (with NLP, it’s not cosine)
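A small sketch of the standard-deviation idea, using made-up similarity scores for two groups of documents. The grouping by document length and all the numbers are assumptions for illustration. The point is only that a raw score and a score measured relative to its own group can rank items very differently.

```python
import statistics


def z_scores(scores):
    """Convert raw scores to standard deviations from the group mean."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]


# Hypothetical similarity scores. Suppose long documents score high
# across the board and short documents score low.
long_docs = [0.80, 0.82, 0.84, 0.86, 0.88]
short_docs = [0.10, 0.12, 0.14, 0.16, 0.38]

# By raw score every long document beats the best short one (0.38 < 0.80),
# but relative to its own group the 0.38 stands out more.
print("best long doc z-score: ", round(z_scores(long_docs)[-1], 2))
print("best short doc z-score:", round(z_scores(short_docs)[-1], 2))
```

Standardizing within a comparable group before ranking keeps document size from masquerading as relevance.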
Pretend We Have Labeled Data
• Rules-based scoring algorithm for a first pass
• Take a small subset of high-scoring people as exemplars
• Use a latent semantic analysis of these exemplars to make a template
• Compare remaining data rows against each exemplar cluster
• Assign the highest-scoring row to its exemplar cluster, broadening that cluster’s definition
• Continue until all data rows are assigned
• Any row whose similarity stays below a threshold we set is labeled ‘Unknown’, which indicates additional, underlying personas
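The loop described above can be sketched in a few lines. To keep the example dependency-free, this sketch swaps in Jaccard token overlap where the talk uses latent semantic analysis, so it shows the control flow rather than the actual scoring. The names and persona labels echo the slides, but the token sets, the threshold of 0.2, and the function names are invented for illustration.

```python
def jaccard(a, b):
    """Token-set overlap: |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand_exemplars(rows, seeds, threshold=0.2):
    """Grow persona clusters outward from seed exemplars.

    rows:  {name: set of tokens}
    seeds: {name: persona} for rows labeled by the rules-based first pass
    """
    labels = dict(seeds)
    # Each cluster's template starts as the union of its exemplars' tokens.
    templates = {}
    for name, persona in seeds.items():
        templates.setdefault(persona, set()).update(rows[name])

    unassigned = {name for name in rows if name not in labels}
    while unassigned:
        # Score every unassigned row against every template; keep the best pair.
        score, name, persona = max(
            (jaccard(rows[n], tmpl), n, p)
            for n in unassigned
            for p, tmpl in templates.items()
        )
        if score < threshold:
            break  # nothing left resembles any known persona
        labels[name] = persona
        templates[persona] |= rows[name]  # broaden the cluster definition
        unassigned.remove(name)

    for name in unassigned:
        labels[name] = "Unknown"
    return labels


# Hypothetical profile tokens, made up for this sketch.
rows = {
    "Luke J": {"vp", "marketing", "roi", "budget", "pipeline"},
    "Fiona F": {"sysops", "security", "firewall", "compliance", "network"},
    "Claude S": {"growth", "marketing", "roi", "pipeline"},
    "Bec G": {"security", "network", "compliance", "cloud"},
    "Art L": {"data", "cloud", "compliance", "analytics"},
    "Lucas M": {"ninja", "memes", "viral"},
}
seeds = {"Luke J": "Value", "Fiona F": "Security"}

labels = expand_exemplars(rows, seeds)
for name in rows:
    print(name, "->", labels[name])
```

In this toy run Art L starts below the threshold against the Security seed and only clears it after Bec G has joined and widened the template. Lucas M never resembles either cluster and ends up as Unknown.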
Round 1 (Rules)
Name      Title          Similarity Score  Persona
Luke J    VP Marketing   1.0               Value
Randy P   Founder        -                 -
Lucas M   Growth Ninja   -                 -
Bec G     Tech Guru      -                 -
Fiona F   Sysops         1.0               Security
Claude S  Growth Hacker  -                 -
Art L     Data Analyst   -                 -
Round 2 (LSA)
Name      Title          Similarity Score  Persona
Luke J    VP Marketing   1.0               Value
Randy P   Founder        0.45              -
Lucas M   Growth Ninja   0.11              -
Bec G     Tech Guru      0.71              Security
Fiona F   Sysops         1.0               Security
Claude S  Growth Hacker  0.87              Value
Art L     Data Analyst   0.41              -
Round 3 (LSA)
Name      Title          Similarity Score  Persona
Luke J    VP Marketing   1.0               Value
Randy P   Founder        0.68              Security
Lucas M   Growth Ninja   0.18              -
Bec G     Tech Guru      0.86              Security
Fiona F   Sysops         1.0               Security
Claude S  Growth Hacker  0.89              Value
Art L     Data Analyst   0.72              Security
Round 3 (LSA)
Name      Title          Similarity Score  Persona
Luke J    VP Marketing   1.0               Value
Randy P   Founder        0.71              Security
Lucas M   Growth Ninja   0.16              Unknown
Bec G     Tech Guru      0.88              Security
Fiona F   Sysops         1.0               Security
Claude S  Growth Hacker  0.91              Value
Art L     Data Analyst   0.78              Security
Example
Takeaways
• Human-generated data is never really random
• Small data models are hyper-sensitive
• Validate assumptions
Questions?
Michael Thorne
mike@calibermind.com
