Small Data Classification for NLP

We have plenty of tools to help us with big data. What happens when our data is tiny? This talk discusses a novel way of classifying NLP data from a very small training set and explains some of the ways we deal with strongly unbalanced classes when working with only hundreds of rows of data.

Small Data Classification for
Natural Language Processing
Michael Thorne
Head of Data Science, CaliberMind

2 | ©2016 CaliberMind
Goals
• Intro
• What Makes NLP Different
• Solutions
• Questions

Obligatory Speaker Bio
Michael Thorne
Head of Data Science, CaliberMind
MS Data Science Program, GalvanizeU
B.S. Physics, Fordham University
NSA Analytic Lead
US Navy Digital Network Intelligence Analyst / Cryptolinguist

CaliberMind
• B2B marketing SaaS
• Persona modeling and personality insights
• Content matching across the buyer journey for high-value, complex purchase decisions
• Our core competency is natural language processing

What’s So Special About NLP?
• Not random (Zipf’s Law)
• Huge feature space
• Subjective Criteria
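
The first two bullets can be seen by simply counting words. The sketch below uses a tiny corpus invented for illustration (it is not data from the talk): one word dominates the counts, while most of the vocabulary appears exactly once, so the feature space is already wide after three short documents.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the vp of marketing owns the demand gen budget",
    "the founder signs off on the security budget",
    "the growth hacker runs the marketing experiments",
]

tokens = [word for doc in docs for word in doc.split()]
counts = Counter(tokens)

# Rank words by frequency: a handful of words account for most tokens.
ranked = counts.most_common()
for rank, (word, freq) in enumerate(ranked[:5], start=1):
    print(rank, word, freq)

# Every distinct word is a feature, so the feature space grows quickly
# even though each document is only a few words long.
vocabulary_size = len(counts)
singletons = sum(1 for freq in counts.values() if freq == 1)
print(vocabulary_size, singletons)
```

On a real corpus the same count, plotted as frequency against rank, gives the long-tailed curve that Zipf's Law describes.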

Persona Status Quo
• Assumptive Personas
• Qualitative Criteria
• Subjective Labels
• Static Output

Starting Point
Demographics
Psychographics
Firmographics

Let’s Validate the Status Quo

CaliberMind’s Data Challenge
• We match the right message to the right person at the right time
• We operate at the upper limits of human-scale problems (hundreds to tens of thousands of documents)
• Our results weren’t as accurate as we expected

Our Friend: The Central Limit Theorem
• The theorem that lets us treat our data as well behaved, provided we have enough of it: averages of many independent samples tend toward a normal distribution
• Let’s look at a classic example: coin tosses

Coin Flip Distribution
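
The coin-toss example can be reproduced with a short simulation. This is a minimal sketch, not the code behind the original chart: the batch sizes (10 and 1000 flips) and the seed are arbitrary choices. It shows that the fraction of heads clusters around 0.5 in both cases, but the spread is far wider when each batch is small.

```python
import random
import statistics

def sample_means(flips_per_sample, n_samples, rng):
    """Return the fraction of heads in each of n_samples batches of fair flips."""
    return [
        sum(rng.random() < 0.5 for _ in range(flips_per_sample)) / flips_per_sample
        for _ in range(n_samples)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
small = sample_means(flips_per_sample=10, n_samples=2000, rng=rng)
large = sample_means(flips_per_sample=1000, n_samples=2000, rng=rng)

# Both sets of means centre on 0.5, but the spread shrinks roughly as 1/sqrt(n).
print(statistics.mean(small), statistics.stdev(small))
print(statistics.mean(large), statistics.stdev(large))
```

With only a handful of observations per estimate, the "well behaved" guarantee is weak, which is the situation small-data NLP is in.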

Example: K-Means
• K-means is a workhorse algorithm for unsupervised learning
• What assumptions do we make when we use k-means?
  • Spherical data
  • Same variance in every cluster
  • Same prior probability (clusters of similar size)
• It turns out NLP data is none of these things
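
To make the failure concrete, here is a minimal one-dimensional k-means sketch on invented data: a large, widely spread class (95 points) next to a small, tight class (5 points). The data, the starting centroids and the helper name `kmeans_1d` are all assumptions for illustration. Because k-means only minimises distance to the nearest centroid, it splits the large class in two rather than isolating the small one.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm on 1-D points; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new_centroids.append(sum(members) / len(members) if members else centroids[k])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Unbalanced classes with different variances (invented data).
big = [i * 10 / 94 for i in range(95)]       # 95 points spread over 0..10
small = [11.9, 11.95, 12.0, 12.05, 12.1]     # 5 tightly packed points
points = big + small

labels, centroids = kmeans_1d(points, [min(points), max(points)])

small_label = labels[-1]
big_in_small_cluster = sum(1 for lab in labels[:len(big)] if lab == small_label)
print(centroids)
print(big_in_small_cluster, "points of the big class share a cluster with the small class")
```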

But Wait, It Gets Better
• Our documents tend to be of vastly different sizes within the same corpus
• Unbalanced Classes
• Qualitative Criteria
• Unlabeled data
• Human labeling is time-intensive

Our Solution

Dimensionality Reduction
• Dimensionality was the first thing we tackled
• Manual dictionaries to collapse similar terms
  • mark = ['growth hacker', 'marketer', 'demand gen']
• LSA to remove low-information terms
• Automating the process using word2vec, DBpedia, and skip-gram similarities
• As we aggregate more data, we’re able to do this process more effectively
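
The manual-dictionary step can be sketched as a simple substitution pass. Only the `mark` entry and its three terms come from the slide; the `sec` entry, the function names and the use of regular expressions are assumptions made for illustration, not the production approach.

```python
import re

# Hand-built dictionary: canonical token -> surface forms it replaces.
# 'mark' and its terms come from the slide; 'sec' is a hypothetical extra entry.
COLLAPSE = {
    "mark": ["growth hacker", "marketer", "demand gen"],
    "sec": ["sysops", "security engineer"],
}

def build_patterns(collapse):
    """Compile one whole-word pattern per canonical token."""
    patterns = []
    for canonical, variants in collapse.items():
        # Longest variants first so multi-word phrases win over shorter ones.
        ordered = sorted(variants, key=len, reverse=True)
        alternation = "|".join(re.escape(v) for v in ordered)
        patterns.append((re.compile(rf"\b(?:{alternation})\b"), canonical))
    return patterns

PATTERNS = build_patterns(COLLAPSE)

def collapse_terms(text):
    """Lower-case the text and replace every known variant with its canonical token."""
    text = text.lower()
    for pattern, canonical in PATTERNS:
        text = pattern.sub(canonical, text)
    return text

print(collapse_terms("Growth Hacker and demand gen lead"))  # mark and mark lead
```

Each collapsed group removes several columns from the term matrix, which matters a great deal when there are only a few hundred rows to learn from.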

Metrics Over Raw Scores
• Especially important when comparing data of different sizes
• Distance from the mean, measured in standard deviations, works better than a raw similarity score
• Pick the best similarity score (with NLP, it’s not cosine)
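
A minimal sketch of the standard-deviation idea, using invented similarity scores for two groups of documents of different sizes. The numbers and the helper name `z_scores` are assumptions. A raw score of 0.30 looks weak next to 0.80, yet each is equally far above its own group's mean.

```python
import statistics

def z_scores(raw_scores):
    """Express each raw score as standard deviations from the group mean."""
    mean = statistics.mean(raw_scores)
    spread = statistics.pstdev(raw_scores)
    if spread == 0:
        return [0.0 for _ in raw_scores]
    return [(score - mean) / spread for score in raw_scores]

# Hypothetical raw similarity scores for two groups of documents.
short_docs = [0.10, 0.12, 0.11, 0.30]
long_docs = [0.60, 0.62, 0.61, 0.80]

z_short = z_scores(short_docs)
z_long = z_scores(long_docs)

# The last document in each group stands out by the same amount,
# even though the raw scores (0.30 vs 0.80) are not comparable.
print([round(z, 2) for z in z_short])
print([round(z, 2) for z in z_long])
```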

Pretend We Have Labeled Data
• Rules-based scoring algorithm for a first pass
• Take a small subset of high-scoring people as exemplars
• Use a latent semantic analysis of these exemplars to make a template
• Compare remaining data rows against each exemplar cluster
• Assign the highest-scoring row to its exemplar cluster, broadening that cluster’s definition
• Continue until all data rows are assigned
• Any row whose similarity falls below a threshold we set is labeled ‘Unknown’, which indicates additional, underlying personas
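
The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the production algorithm: plain token overlap stands in for the LSA similarity, the profile text attached to each name is invented, and the keyword rules and the 0.3 threshold are hypothetical.

```python
def tokens(text):
    return set(text.lower().split())

def score(row_tokens, template):
    """Share of the row's tokens already present in the cluster template."""
    return len(row_tokens & template) / len(row_tokens)

def classify(rows, rules, threshold):
    labels = {}
    templates = {persona: set() for persona in rules}
    # Round 1: keyword rules pick the exemplars and seed each template.
    for name, text in rows.items():
        for persona, keywords in rules.items():
            if tokens(text) & keywords:
                labels[name] = persona
                templates[persona] |= tokens(text)
                break
    # Later rounds: assign the single best-scoring row, then broaden the template.
    unassigned = {name for name in rows if name not in labels}
    while unassigned:
        best_score, name, persona = max(
            (score(tokens(rows[n]), templates[p]), n, p)
            for n in sorted(unassigned)
            for p in sorted(templates)
        )
        if best_score < threshold:
            # Nothing left resembles any cluster closely enough.
            for n in unassigned:
                labels[n] = "Unknown"
            break
        labels[name] = persona
        templates[persona] |= tokens(rows[name])
        unassigned.remove(name)
    return labels

# Invented profile text; only the names and titles echo the slides.
rows = {
    "Luke J": "vp marketing campaigns roi pipeline",
    "Fiona F": "sysops firewall compliance uptime",
    "Claude S": "growth hacker campaigns roi funnel",
    "Bec G": "tech guru firewall uptime servers",
    "Art L": "data analyst servers compliance dashboards",
    "Lucas M": "growth ninja memes vibes",
}
rules = {"Value": {"marketing"}, "Security": {"sysops"}}

labels = classify(rows, rules, threshold=0.3)
for name in rows:
    print(name, labels[name])
```

In this toy run, "Art L" starts below the threshold and only qualifies for Security after "Bec G" has been absorbed and widened the template, which is the "broadening the definition" effect.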

Round 1 (Rules)
Name       Title           Similarity Score   Persona
Luke J     VP Marketing    1.0                Value
Randy P    Founder
Lucas M    Growth Ninja
Bec G      Tech Guru
Fiona F    Sysops          1.0                Security
Claude S   Growth Hacker
Art L      Data Analyst

Round 2 (LSA)
Name       Title           Similarity Score   Persona
Luke J     VP Marketing    1.0                Value
Randy P    Founder         0.45
Lucas M    Growth Ninja    0.11
Bec G      Tech Guru       0.71               Security
Fiona F    Sysops          1.0                Security
Claude S   Growth Hacker   0.87               Value
Art L      Data Analyst    0.41

Round 3 (LSA)
Name       Title           Similarity Score   Persona
Luke J     VP Marketing    1.0                Value
Randy P    Founder         0.68               Security
Lucas M    Growth Ninja    0.18
Bec G      Tech Guru       0.86               Security
Fiona F    Sysops          1.0                Security
Claude S   Growth Hacker   0.89               Value
Art L      Data Analyst    0.72               Security

Round 4 (LSA)
Name       Title           Similarity Score   Persona
Luke J     VP Marketing    1.0                Value
Randy P    Founder         0.71               Security
Lucas M    Growth Ninja    0.16               Unknown
Bec G      Tech Guru       0.88               Security
Fiona F    Sysops          1.0                Security
Claude S   Growth Hacker   0.91               Value
Art L      Data Analyst    0.78               Security

Takeaways
• Human-generated data is never really random
• Small data models are hyper-sensitive
• Validate assumptions

Questions?
Michael Thorne
mike@calibermind.com
