Small Data Classification for
Natural Language Processing
Michael Thorne
Head of Data Science, CaliberMind
2 | ©2016 CaliberMind
Goals
• Intro
• What Makes NLP Different
• Solutions
• Questions
Obligatory Speaker Bio
Michael Thorne
Head of Data Science, CaliberMind
MS Data Science Program, GalvanizeU
B.S. Physics, Fordham University
NSA Analytic Lead
US Navy Digital Network Intelligence Analyst / Cryptolinguist
CaliberMind
• B2B marketing SaaS
• Persona modeling and personality insights
• Content matching across the buyer journey for high-value, complex purchase decisions
• Our core competency is natural language processing
What’s So Special About NLP?
• Not random (Zipf’s Law)
• Huge feature space
• Subjective Criteria
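Zipf's Law says a word's frequency is roughly inversely proportional to its frequency rank, so a few terms dominate and most of the vocabulary is rare. A minimal sketch for checking this on a corpus follows; the helper name `rank_frequencies` and the sample text are invented for illustration, and a text this small will not show the law cleanly.

```python
from collections import Counter
import re


def rank_frequencies(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


# Toy corpus, invented for illustration only.
sample = (
    "the buyer reads the content and the buyer shares the content "
    "with the team and the team reads it"
)

table = rank_frequencies(sample)
for rank, word, count in table[:5]:
    # On large corpora, rank * count stays roughly constant under Zipf's Law.
    print(rank, word, count, rank * count)
```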
Small Data NLP
Persona Status Quo
• Assumptive Personas
• Qualitative Criteria
• Subjective Labels
• Static Output
Starting Point
Demographics
Psychographics
Firmographics
Let’s Validate the Status Quo
CaliberMind’s Data Challenge
• We match the right message to the right person at the right time
• We operate at the upper limit of human-scale problems (hundreds to tens of thousands of documents)
• Our results were less accurate than we expected
Our Friend: The Central Limit Theorem
• The theorem says that averages of many independent samples are approximately normally distributed, which lets us treat our data as well behaved, provided we have enough of it
• Let’s look at a classic example: coin tosses
Coin Flip Distribution
1 Trial
100 Trials
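The coin-toss illustration can be reproduced with a short simulation. This is one reading of the slides, not the speaker's code: the function names, the 2,000 repeated experiments, and the seed are all assumptions. It compares how widely the fraction of heads varies when each experiment is a single toss versus 100 tosses.

```python
import random
import statistics


def heads_fraction(n_flips, rng):
    """Fraction of heads in n_flips tosses of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n_flips)) / n_flips


def spread_of_means(n_flips, n_experiments=2000, seed=0):
    """Standard deviation of the heads fraction across repeated experiments."""
    rng = random.Random(seed)
    fractions = [heads_fraction(n_flips, rng) for _ in range(n_experiments)]
    return statistics.pstdev(fractions)


spread_1 = spread_of_means(1)      # each experiment is a single toss: 0 or 1
spread_100 = spread_of_means(100)  # each experiment averages 100 tosses

print(f"spread with 1 toss per experiment:     {spread_1:.3f}")
print(f"spread with 100 tosses per experiment: {spread_100:.3f}")
```

With only a few tosses the estimate swings widely, which is the small-data problem the rest of the talk addresses.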
Example: K-Means
• K-means is a workhorse algorithm for unsupervised learning
• What assumptions do we make when we use k-means?
  • Clusters are roughly spherical
  • Clusters have similar variance
  • Clusters have similar prior probability (similar sizes)
• It turns out NLP data satisfies none of these
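To see why the variance and prior-probability assumptions matter, here is a minimal sketch of plain Lloyd's k-means on one-dimensional data. The function `kmeans_1d` and the data are invented for illustration: one large, wide group and one small, tight group.

```python
def kmeans_1d(points, centers, iterations=50):
    """Plain Lloyd's algorithm on 1-D data; returns (centers, labels)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: abs(p - centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its points.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Invented data: a large, wide group and a small, tight group.
wide = [i * 10 / 89 for i in range(90)]   # 90 points spread over [0, 10]
tight = [11 + j / 9 for j in range(10)]   # 10 points packed into [11, 12]
points = wide + tight

centers, labels = kmeans_1d(points, centers=[min(points), max(points)])

tight_label = labels[-1]
# Points from the wide group that k-means lumps in with the tight group.
misassigned = sum(1 for lab in labels[:len(wide)] if lab == tight_label)
print(f"centers: {centers}")
print(f"wide-group points assigned to the tight group's cluster: {misassigned}")
```

Because k-means places the boundary midway between the two centers, it splits the wide group instead of isolating the small one.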
Happy K-Means
NLP K-Means
But Wait, It Gets Better
• Our documents tend to be of vastly different sizes within the same corpus
• Unbalanced Classes
• Qualitative Criteria
• Unlabeled data
• Human-labeling is time intensive
Our Solution
Dimensionality Reduction
• Dimensionality was the first thing we tackled
• Manual dictionaries to collapse similar terms
• mark = ['growth hacker', 'marketer', 'demand gen']
• LSA to remove low-information terms
• Automating the process with word2vec, DBpedia, and skip-gram similarities
• As we aggregate more data, we can do this more effectively
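A manual dictionary of this kind can be applied with a few lines of code. The sketch below is an assumption about how such a mapping might be used, not CaliberMind's implementation: the "mark" entry mirrors the slide, while the "sec" entry, the names `CANONICAL_TERMS` and `collapse_terms`, and the naive substring matching are invented for illustration.

```python
# Hypothetical dictionary: canonical token -> phrases that collapse into it.
# The "mark" entry mirrors the slide; "sec" is invented for illustration.
CANONICAL_TERMS = {
    "mark": ["growth hacker", "marketer", "demand gen"],
    "sec": ["sysops", "security engineer"],
}

# Invert to (phrase, token) pairs, longest phrases first, so a long phrase
# is replaced before any shorter phrase it might contain.
PHRASE_TO_TOKEN = sorted(
    ((phrase, token)
     for token, phrases in CANONICAL_TERMS.items()
     for phrase in phrases),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def collapse_terms(text):
    """Lowercase the text and replace each known phrase with its canonical token."""
    text = text.lower()
    for phrase, token in PHRASE_TO_TOKEN:
        # Naive substring replacement; a real pipeline would match on token boundaries.
        text = text.replace(phrase, token)
    return text


print(collapse_terms("Growth Hacker and Demand Gen lead"))
```

Collapsing synonyms this way shrinks the feature space before any statistical method is applied.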
Spiky Data
Metrics Over Raw Scores
• Especially important when comparing data of different sizes
• The number of standard deviations from the mean (a z-score) works better than a raw similarity score
• Pick the best similarity score (with NLP, it’s not cosine)
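As a minimal sketch of the standard-deviation idea, the snippet below converts raw similarity scores into z-scores within each document's own set of scores. The function name `z_scores` and the numbers are invented for illustration.

```python
import statistics


def z_scores(scores):
    """Convert raw scores to standard deviations from the mean of their own group."""
    mean = statistics.mean(scores)
    stdev = statistics.pstdev(scores)
    if stdev == 0:
        # All scores identical: nothing stands out.
        return [0.0 for _ in scores]
    return [(s - mean) / stdev for s in scores]


# Invented raw similarity scores for two documents of very different length.
# The long document scores high against everything; the short one scores low.
long_doc_scores = [0.80, 0.82, 0.81, 0.90]
short_doc_scores = [0.10, 0.12, 0.11, 0.30]

print([round(z, 2) for z in z_scores(long_doc_scores)])
print([round(z, 2) for z in z_scores(short_doc_scores)])
```

The raw values 0.90 and 0.30 are not comparable, but after standardizing, the last score in each list is the clear standout for its document.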
Pretend We Have Labeled Data
• Use a rules-based scoring algorithm for a first pass
• Take a small subset of high-scoring people as exemplars
• Use latent semantic analysis (LSA) of these exemplars to make a template
• Compare the remaining data rows against each exemplar cluster
• Assign the highest-scoring rows to their best-matching exemplar cluster, broadening its definition
• Continue until all data rows are assigned
• Any row whose similarity falls below a threshold we set is labeled ‘Unknown’, which indicates additional, underlying personas
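The steps above can be sketched as a small label-propagation loop. This is one interpretation, not the method from the talk: it swaps in Jaccard token overlap for the LSA similarity, and the function names, token sets, and threshold of 0.2 are all invented for illustration.

```python
def jaccard(a, b):
    """Token-set overlap; a stand-in for the LSA similarity used in the talk."""
    return len(a & b) / len(a | b) if a | b else 0.0


def propagate_labels(rows, seeds, threshold=0.2):
    """rows: {name: set of tokens}; seeds: {name: persona} from the rules pass."""
    labels = dict(seeds)
    # Each persona's template starts as the union of its exemplars' tokens.
    templates = {}
    for name, persona in seeds.items():
        templates.setdefault(persona, set()).update(rows[name])

    unassigned = [name for name in rows if name not in labels]
    while unassigned and templates:
        # Score every remaining row against every template; take the best pair.
        score, name, persona = max(
            (jaccard(rows[n], tokens), n, p)
            for n in unassigned
            for p, tokens in templates.items()
        )
        if score < threshold:
            break  # nothing left resembles any persona closely enough
        labels[name] = persona
        templates[persona] |= rows[name]  # broaden the persona's definition
        unassigned.remove(name)

    for name in unassigned:
        labels[name] = "Unknown"
    return labels


# Names follow the slides; the token sets are invented.
rows = {
    "Luke J": {"vp", "marketing", "brand", "campaigns"},
    "Fiona F": {"sysops", "servers", "security", "uptime"},
    "Claude S": {"growth", "marketing", "campaigns", "funnel"},
    "Bec G": {"servers", "security", "cloud", "tech"},
    "Art L": {"data", "cloud", "tech", "analytics"},
    "Lucas M": {"ninja", "hustle"},
}
seeds = {"Luke J": "Value", "Fiona F": "Security"}

result = propagate_labels(rows, seeds)
for name, persona in result.items():
    print(name, persona)
```

In this toy data, Art L shares nothing with the original Security exemplar and only matches after Bec G has broadened the template, which is the effect the rounds below illustrate.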
Round 1 (Rules)
Name     | Title         | Similarity Score | Persona
Luke J   | VP Marketing  | 1.0              | Value
Randy P  | Founder       |                  |
Lucas M  | Growth Ninja  |                  |
Bec G    | Tech Guru     |                  |
Fiona F  | Sysops        | 1.0              | Security
Claude S | Growth Hacker |                  |
Art L    | Data Analyst  |                  |
Round 2 (LSA)
Name     | Title         | Similarity Score | Persona
Luke J   | VP Marketing  | 1.0              | Value
Randy P  | Founder       | 0.45             |
Lucas M  | Growth Ninja  | 0.11             |
Bec G    | Tech Guru     | 0.71             | Security
Fiona F  | Sysops        | 1.0              | Security
Claude S | Growth Hacker | 0.87             | Value
Art L    | Data Analyst  | 0.41             |
Round 3 (LSA)
Name     | Title         | Similarity Score | Persona
Luke J   | VP Marketing  | 1.0              | Value
Randy P  | Founder       | 0.68             | Security
Lucas M  | Growth Ninja  | 0.18             |
Bec G    | Tech Guru     | 0.86             | Security
Fiona F  | Sysops        | 1.0              | Security
Claude S | Growth Hacker | 0.89             | Value
Art L    | Data Analyst  | 0.72             | Security
Round 3 (LSA)
Name     | Title         | Similarity Score | Persona
Luke J   | VP Marketing  | 1.0              | Value
Randy P  | Founder       | 0.71             | Security
Lucas M  | Growth Ninja  | 0.16             | Unknown
Bec G    | Tech Guru     | 0.88             | Security
Fiona F  | Sysops        | 1.0              | Security
Claude S | Growth Hacker | 0.91             | Value
Art L    | Data Analyst  | 0.78             | Security
Example
Takeaways
• Human-generated data is never really random
• Small data models are hyper-sensitive
• Validate assumptions
Questions?
Michael Thorne
mike@calibermind.com
