Small Data Classification for NLP
2. 2 | ©2016 CaliberMind
Goals
• Intro
• What Makes NLP Different
• Solutions
• Questions
3.
Michael Thorne
Head of Data Science, Caliber Mind
MS Data Science Program, GalvanizeU
B.S. Physics, Fordham University
NSA Analytic Lead
US Navy Digital Network Intelligence Analyst / Cryptolinguist
Obligatory Speaker Bio
4.
CaliberMind
• B2B marketing SaaS
• Persona modeling and personality insights
• Content matching across buyer journey for high-value,
complex purchase decisions
• Our core competency is natural language processing
5.
What’s So Special About NLP?
• Not random (Zipf’s Law)
• Huge feature space
• Subjective Criteria
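Zipf's Law, mentioned above, says the r-th most frequent word in a corpus appears with frequency roughly proportional to 1/r, so word counts are anything but random. A minimal sketch with a hypothetical toy corpus (the text is illustrative only):

```python
from collections import Counter

# Zipf's Law: the r-th most frequent word appears with frequency
# roughly proportional to 1/r, so a handful of words dominate any corpus.
# Hypothetical toy corpus for illustration.
corpus = ("the quick fox and the lazy dog and the cat "
          "the dog saw the fox and the cat ran").split()

counts = Counter(corpus)
ranked = counts.most_common()
for rank, (word, freq) in enumerate(ranked[:3], start=1):
    print(rank, word, freq)
```

Even in this tiny sample, one word ("the") accounts for close to a third of all tokens, a skew that only sharpens in real corpora.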
7.
Persona Status Quo
• Assumptive Personas
• Qualitative Criteria
• Subjective Labels
• Static Output
8.
Starting Point
Demographics
Psychographics
Firmographics
9.
Let’s Validate the Status Quo
10.
CaliberMind’s Data Challenge
• We match the right message to the right person at the right time
• We operate at the upper limit of human-scale problems (100s to 10,000s of documents)
• We weren't getting results as accurate as we expected
11.
Our Friend: The Central Limit Theorem
• This is the theorem that lets us assume our data is well behaved, provided we have enough of it
• Let's look at a classic example: coin tosses
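The coin-toss example can be sketched in a few lines: the proportion of heads in n tosses is a sample mean, and the CLT says those means pile up around 0.5, with spread shrinking like 1/sqrt(n), even though a single toss is as non-normal as a distribution gets.

```python
import random

random.seed(0)

def head_fraction(n_tosses):
    # Proportion of heads in n tosses of a fair coin: a sample mean.
    return sum(random.random() < 0.5 for _ in range(n_tosses)) / n_tosses

# Repeat the experiment many times at two sample sizes.
means_small = [head_fraction(10) for _ in range(1000)]
means_large = [head_fraction(1000) for _ in range(1000)]

def spread(xs):
    # Standard deviation of the sample means.
    mu = sum(xs) / len(xs)
    return (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5

# The spread of the sample means shrinks roughly as 1/sqrt(n).
print(spread(means_small), spread(means_large))
```

The catch for small-data NLP is the "provided we have enough of it" clause: with hundreds of documents rather than millions of tokens, the well-behaved regime never quite arrives.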
15.
Example: K-Means
• K-means is a workhorse algorithm for unsupervised learning
• What assumptions do we make when we use k-means?
  • Spherical clusters
  • Equal variance
  • Equal prior probability
• It turns out NLP data is none of these things
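To make those assumptions concrete, here is a minimal Lloyd's-algorithm k-means in plain NumPy (a sketch, not the talk's implementation). The nearest-center assignment step is exactly where the spherical, equal-variance, equal-prior assumptions get baked in: when they hold, as in the two synthetic blobs below, it works; NLP term vectors satisfy none of them.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's algorithm. The Euclidean nearest-center rule implicitly
    assumes spherical clusters of equal variance and equal prior probability."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center (Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # Move each center to the mean of its assigned points.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated spherical blobs of equal size and variance:
# the assumptions hold, so k-means recovers them cleanly.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
labels, centers = kmeans(X, k=2)
```

Swap the blobs for sparse, Zipf-skewed, wildly unequal-size term vectors and the same assignment rule happily carves one dominant cluster into arbitrary pieces.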
18.
But Wait, It Gets Better
• Our documents tend to be of vastly different sizes within the same corpus
• Unbalanced Classes
• Qualitative Criteria
• Unlabeled data
• Human labeling is time-intensive
20.
Dimensionality Reduction
• Dimensionality was the first thing we tackled
• Manual dictionaries to collapse similar terms
• mark = [‘growth hacker’, ‘marketer’, ‘demand gen’]
• LSA to remove low-information terms
• Automating the process using word2vec, DBpedia, and skip-gram similarities
• As we aggregate more data, we’re able to do this process more
effectively
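The two manual steps above can be sketched as follows: collapse near-synonymous terms with a hand-built dictionary (the `mark` example from the slide), then apply LSA, i.e. a truncated SVD of the term-document matrix, to keep only the top latent dimensions. Names here (`collapse`, `canonicalize`) are illustrative, not from the talk.

```python
import numpy as np

# Manual dictionary (the slide's 'mark' example): collapse near-synonymous
# job-title terms into one canonical token before vectorizing.
collapse = {'growth hacker': 'mark', 'marketer': 'mark', 'demand gen': 'mark'}

def canonicalize(doc):
    for term, canon in collapse.items():
        doc = doc.replace(term, canon)
    return doc

docs = ['growth hacker at a startup', 'seasoned marketer', 'demand gen lead']
docs = [canonicalize(d) for d in docs]

# LSA sketch: truncated SVD of the term-document count matrix keeps the
# top singular directions and discards low-information terms.
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_lsa = U[:, :k] * s[:k]  # documents projected into the reduced latent space
```

The dictionary step alone shrinks the vocabulary before the SVD ever runs, which matters most when the corpus is too small for the SVD to discover those synonym relationships on its own.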
22.
Metrics Over Raw Scores
• Especially important when comparing data of different sizes
• How many standard deviations off the mean works better than a
simple similarity score
• Pick the best similarity score (with NLP, it’s not cosine)
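The "standard deviations off the mean" idea is just a z-score over the corpus of similarity scores; a hedged sketch with hypothetical numbers:

```python
import numpy as np

# Raw similarity scores are hard to compare across documents of different
# sizes; expressing each score as standard deviations off the corpus mean
# (a z-score) puts them on a common footing. Scores below are hypothetical.
scores = np.array([0.12, 0.15, 0.11, 0.14, 0.45])
z = (scores - scores.mean()) / scores.std()
# The last document stands out by nearly two standard deviations,
# even though its raw score of 0.45 looks unremarkable on its own.
```

The same normalization also makes thresholds transferable between corpora whose raw score ranges differ.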
23.
Pretend We Have Labeled Data
• Rules-based scoring algorithm for a first pass
• Take a small subset of high-scoring people as exemplars
• Use a latent semantic analysis of these exemplars to make a template
• Compare the remaining data rows against each exemplar cluster
• Assign each row to its highest-scoring exemplar cluster, broadening that cluster's definition
• Continue until all data rows are assigned
• Any row with a similarity below a set threshold is labeled 'Unknown', indicating additional underlying personas
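The bootstrapping loop above can be sketched as follows. This is an assumption-laden toy, not the talk's code: the threshold value, the cosine stand-in for "best similarity score", and the function names are all hypothetical.

```python
import numpy as np

THRESHOLD = 0.2  # hypothetical cutoff; rows scoring below it stay unlabeled

def similarity(a, b):
    # Stand-in similarity between bag-of-words vectors (cosine, for brevity).
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def bootstrap_label(rows, exemplars, n_rounds=3):
    """Seed clusters come from the rules-based pass; each round assigns the
    best-matching unlabeled rows and folds them into their exemplar cluster,
    broadening each persona's definition for the next round."""
    labels = {}
    for _ in range(n_rounds):
        for i, row in enumerate(rows):
            if i in labels:
                continue
            scores = {p: max(similarity(row, e) for e in ex)
                      for p, ex in exemplars.items()}
            persona, best = max(scores.items(), key=lambda kv: kv[1])
            if best >= THRESHOLD:
                labels[i] = persona
                exemplars[persona].append(row)  # broaden the cluster
    # Rows still unmatched suggest an additional, underlying persona.
    for i in range(len(rows)):
        labels.setdefault(i, 'Unknown')
    return labels

# Toy data: two seeded personas plus one row orthogonal to both.
rows = [np.array([0.9, 0.1, 0.0]),
        np.array([0.1, 0.9, 0.0]),
        np.array([0.0, 0.0, 1.0])]
exemplars = {'Value': [np.array([1.0, 0.0, 0.0])],
             'Security': [np.array([0.0, 1.0, 0.0])]}
labels = bootstrap_label(rows, exemplars)
```

The rounds in the tables that follow show the same dynamic on the talk's data: each pass widens the exemplar clusters, pulls in previously sub-threshold rows, and leaves the genuinely unmatched ones as 'Unknown'.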
24.
Round 1 (Rules)
Name     | Title         | Similarity Score | Persona
Luke J   | VP Marketing  | 1.0              | Value
Randy P  | Founder       |                  |
Lucas M  | Growth Ninja  |                  |
Bec G    | Tech Guru     |                  |
Fiona F  | Sysops        | 1.0              | Security
Claude S | Growth Hacker |                  |
Art L    | Data Analyst  |                  |
25.
Round 2 (LSA)
Name     | Title         | Similarity Score | Persona
Luke J   | VP Marketing  | 1.0              | Value
Randy P  | Founder       | 0.45             |
Lucas M  | Growth Ninja  | 0.11             |
Bec G    | Tech Guru     | 0.71             | Security
Fiona F  | Sysops        | 1.0              | Security
Claude S | Growth Hacker | 0.87             | Value
Art L    | Data Analyst  | 0.41             |
26.
Round 3 (LSA)
Name     | Title         | Similarity Score | Persona
Luke J   | VP Marketing  | 1.0              | Value
Randy P  | Founder       | 0.68             | Security
Lucas M  | Growth Ninja  | 0.18             |
Bec G    | Tech Guru     | 0.86             | Security
Fiona F  | Sysops        | 1.0              | Security
Claude S | Growth Hacker | 0.89             | Value
Art L    | Data Analyst  | 0.72             | Security
27.
Round 4 (LSA)
Name     | Title         | Similarity Score | Persona
Luke J   | VP Marketing  | 1.0              | Value
Randy P  | Founder       | 0.71             | Security
Lucas M  | Growth Ninja  | 0.16             | Unknown
Bec G    | Tech Guru     | 0.88             | Security
Fiona F  | Sysops        | 1.0              | Security
Claude S | Growth Hacker | 0.91             | Value
Art L    | Data Analyst  | 0.78             | Security
31.
Takeaways
• Human-generated data is never really random
• Small data models are hyper-sensitive
• Validate assumptions