Domain Specific NLP Pipelines
Rajesh Muppalla
Senior Director of Engineering
@codingnirvana
2
About Me
● Senior Director of Engineering, Avalara
● Previously
○ Co-Founder @ Indix (acquired by Avalara in Feb 2019)
○ Tech Lead, Go-CD @ Thoughtworks - Open Source CI/CD Tool
● Scale By The Bay
○ 2016 - Data Pipelines Panelist
○ 2017 - Continuous Delivery For Machine Learning
● Based in Chennai, India
3
Talk Focus
● Working on NLP problems since early 2012
● Two domains now
○ Indix - E-Commerce
■ Evolution across 7 years
○ Avalara - Tax Compliance
■ Learnings from e-commerce
■ Newer techniques from last couple of years
● Share what we learnt
4
Indix - “Google Map” of Products
5
NLP Stack - E-Commerce Domain @ Indix
● Use Cases
○ Product Classification
○ Attribute Extraction
○ Search
○ Matching
● Technology
○ Algorithms - Classification, NER, Document Similarity, Auto Complete, Query Expansion
○ Featurization / NLU - Embeddings (Character/Word), Language Models (n-gram based), Knowledge Graph
○ Pre-Processing - Tokenizers, Lemmatizers, POS Tagging, Language Detection
○ Extraction - Parsers (HTML)
○ Data - Training Data (Labeled), Raw Data (UnLabeled)
6
Embeddings
Embeddings
Dense vector representations of words, learned from a large unlabelled corpus
● Word Embeddings
○ Implementations - Word2Vec, FastText, GloVe
○ Word2Vec training models - CBOW, Skip-Gram
● Why?
○ Machines do not understand text and need a numerical representation
● Part of feature engineering
● Learned Embeddings
○ Capture a notion of similarity
○ Popularized by Word2Vec (Mikolov et al., 2013)
○ GloVe and FastText are other implementations
8
Embeddings - Useful Properties
● Word Embeddings capture certain relations between words
Source - https://www.tensorflow.org/tutorials/text/word_embeddings
9
Embeddings - Training (Word2Vec - Skip Gram Model)
Source - http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
10
Embeddings - Hidden layer is Embedding Matrix
Source - http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
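A minimal sketch of why the hidden layer acts as the embedding matrix, assuming a toy 4-word vocabulary and invented weights (the names `vocab`, `W`, `one_hot` and `hidden_layer` are illustrative and not from the talk). Multiplying a one-hot input vector by the hidden-layer weight matrix selects exactly one row of that matrix, and that row is the word's embedding.

```python
# Toy vocabulary and hidden-layer weights (values are invented for illustration).
vocab = ["burts", "bees", "lip", "balm"]
W = [
    [0.1, 0.3, -0.2],  # row for "burts"
    [0.0, 0.5, 0.4],   # row for "bees"
    [0.7, -0.1, 0.2],  # row for "lip"
    [0.6, 0.0, 0.3],   # row for "balm"
]


def one_hot(word):
    """Return the one-hot input vector for a word in the vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocab]


def hidden_layer(x):
    """Multiply the input row vector x (1 x V) by the weight matrix W (V x D)."""
    dims = len(W[0])
    return [sum(x[i] * W[i][d] for i in range(len(vocab))) for d in range(dims)]


# The hidden activation for a one-hot input equals the matching row of W,
# so after training the rows of W can be used directly as word embeddings.
embedding_of_lip = hidden_layer(one_hot("lip"))
```

In practice the multiplication is skipped and the row is looked up by index, which is why the layer is often called an embedding lookup.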
11
Word Embeddings for Document Similarity
Are these two products the same?
● Product titles compared
○ Burt’s Bees Lip Balm
○ LipBalm by Burts Bees
● Steps
○ Look up the word vector for each word in a title
○ Average the word vectors to get one vector per title
○ Compute the Cosine Similarity between the two title vectors
○ Similarity > Threshold - Yes, same product
○ Similarity < Threshold - No, different products
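The matching idea on this slide can be sketched as follows. This is an illustration only: the 3-dimensional vectors in `EMBEDDINGS`, the tokenizer, and the `threshold` value of 0.9 are all assumptions made up for the example, whereas real vectors would come from a trained Word2Vec, FastText or GloVe model.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for this example.
EMBEDDINGS = {
    "burts": [0.9, 0.1, 0.0],
    "bees": [0.8, 0.2, 0.1],
    "lip": [0.1, 0.9, 0.2],
    "balm": [0.2, 0.8, 0.3],
    "lipbalm": [0.15, 0.85, 0.25],
    "by": [0.3, 0.3, 0.3],
    "running": [0.0, 0.1, 0.9],
    "shoes": [0.1, 0.0, 0.8],
}


def tokenize(title):
    """Lowercase, drop apostrophes, split on whitespace, skip unknown words."""
    cleaned = title.lower().replace("'", "").replace("\u2019", "")
    return [tok for tok in cleaned.split() if tok in EMBEDDINGS]


def document_vector(title):
    """Average the word vectors of a title into a single vector."""
    vectors = [EMBEDDINGS[tok] for tok in tokenize(title)]
    if not vectors:
        raise ValueError("no known words in title: " + title)
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_product(title_a, title_b, threshold=0.9):
    """Treat two titles as the same product when similarity exceeds the threshold."""
    score = cosine_similarity(document_vector(title_a), document_vector(title_b))
    return score > threshold
```

With these toy vectors, `same_product("Burt's Bees Lip Balm", "LipBalm by Burts Bees")` is true and `same_product("Burt's Bees Lip Balm", "Running Shoes")` is false. The threshold would normally be tuned on labelled matching pairs.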
12
Product Classification - E-Commerce Domain
Product Classification - Using FastText
[Figure: the words of the title (Burt’s, Bees, Lip, Balm) are mapped to word vectors, which are averaged and passed through a Hidden Linear Layer and a Softmax Output Layer]
● Output probabilities
○ lip care (0.91)
○ shoes (0.03)
○ mobiles (0.03)
○ apparel (0.03)
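The forward pass in this figure can be sketched in a few lines. This is a simplified illustration of a FastText-style classifier (average the word vectors, apply one linear layer, take a softmax), not the FastText library itself. The 2-dimensional `WORD_VECTORS` and the `OUTPUT_WEIGHTS` are invented, so the resulting probabilities will not match the numbers on the slide.

```python
import math

LABELS = ["lip care", "shoes", "mobiles", "apparel"]

# Hypothetical 2-dimensional word vectors (a trained model would learn these).
WORD_VECTORS = {
    "burts": [0.9, 0.1],
    "bees": [0.8, 0.2],
    "lip": [1.0, 0.0],
    "balm": [0.9, 0.0],
}

# One hypothetical weight row per label, in the same order as LABELS.
OUTPUT_WEIGHTS = [
    [4.0, 0.0],  # lip care
    [0.0, 1.0],  # shoes
    [0.5, 0.5],  # mobiles
    [0.0, 2.0],  # apparel
]


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("no known words to average")
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens):
    """Average word vectors, apply the linear output layer, then softmax."""
    hidden = average([WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS])
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in OUTPUT_WEIGHTS]
    probs = softmax(scores)
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], probs


label, probs = classify(["burts", "bees", "lip", "balm"])
```

Because the title is reduced to an average, word order is ignored. The real FastText model compensates for this with word n-gram and character n-gram features.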
14
Avalara - Part of “every” transaction
[Diagram labels: ERP, Ecommerce, Retail; Transactions; Tax (Tax Rates, Tax Boundaries, Taxability Rules); Returns; Exemption Statutes; Certificate Templates; Customers; Certificates]
15
Taxes for specific type of Clothing
Tax Rules in New York
Source - https://www.avalara.com/us/en/learn/whitepapers/the-trials-and-tribulations-of-sales-tax-in-the-united-states.html
16
Product Classification - Tax Domain
Product (Transaction): BB Lip Balm - 4pc
● Tax Code: Skin Care (SK0001) -> Tax Rule

Tax Code   US State   Rate (%)
SK0001     Alabama    3
SK0001     New York   8
SK0001     Texas      0
SK0001     Oklahoma   5

● HS Code: Beauty / Makeup (33041000) -> Tax Rule

HS Code    Country of Origin   Customs Rate (%)
33041000   Canada              5
33041000   India               2
33041000   Singapore           7
33041000   Australia           5
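One way to picture how a classified product feeds the tax rules is a pair of lookups, sketched below. The rates are copied from this slide. The `CLASSIFIED_PRODUCTS` dictionary is a hypothetical stand-in for the output of the product classifiers, and the function names are made up for the example.

```python
# Hypothetical classifier output: product description -> predicted codes.
CLASSIFIED_PRODUCTS = {
    "BB Lip Balm - 4pc": {"tax_code": "SK0001", "hs_code": "33041000"},
}

# Tax rules keyed by (tax code, US state), with rates in percent.
SALES_TAX_RATES = {
    ("SK0001", "Alabama"): 3,
    ("SK0001", "New York"): 8,
    ("SK0001", "Texas"): 0,
    ("SK0001", "Oklahoma"): 5,
}

# Customs rules keyed by (HS code, country of origin), with rates in percent.
CUSTOMS_RATES = {
    ("33041000", "Canada"): 5,
    ("33041000", "India"): 2,
    ("33041000", "Singapore"): 7,
    ("33041000", "Australia"): 5,
}


def sales_tax_rate(product, state):
    """Look up the sales tax rate for a classified product in a US state."""
    tax_code = CLASSIFIED_PRODUCTS[product]["tax_code"]
    return SALES_TAX_RATES[(tax_code, state)]


def customs_rate(product, country_of_origin):
    """Look up the customs rate for a classified product by country of origin."""
    hs_code = CLASSIFIED_PRODUCTS[product]["hs_code"]
    return CUSTOMS_RATES[(hs_code, country_of_origin)]
```

The lookup itself is trivial. The hard part, and the subject of the following slides, is predicting the right code from a short and noisy product description.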
17
Product Classification - Tax Domain
● Should be easy?
○ “Similar” to Product Classification in E-Commerce Domain
● However, there are challenges
○ Classification Taxonomy is different - 2000 vs 5000 leaf nodes
○ Lack of labelled data for training (small data or low data problem)
○ Vocabulary is different
■ E.g. Abbreviations used in product transaction data
○ Data is noisy
● Decent Baseline
○ Re-use the Embeddings layer from e-Commerce and Re-train the Tax classifier model
● Can we do better?
18
Yes, we can
● Transfer Learning
● Weak Supervision
● Data Augmentation & Synthesis
19
Transfer Learning
Source - https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a
20
Transfer Learning
Transfer Learning - History
● Computer Vision
○ 2012 - ImageNet Competition
○ AlexNet beat the next-best entry by 41%; ImageNet pre-trained models then became the standard starting point for Transfer Learning
● 2018 - Watershed moment in NLP
○ Transfer Learning using Pre-Trained Language Models
○ ELMo, ULMFiT, BERT, GPT, GPT-2
“It only seems to be a question of time until pretrained word embeddings will be dethroned and replaced by pretrained language models in the toolbox of every NLP practitioner.”
Sebastian Ruder (Researcher, DeepMind) - https://thegradient.pub/nlp-imagenet/
Transfer Learning - Using ULMFiT
Step 1 - Pre-Training on Source Dataset
1. Dataset (Huge)
○ ECommerce Product Data
2. Objective Function
○ Predict the next word
● Model - AWD-LSTM
○ Embedding Layer
○ Hidden Layer (3 Stacked BiLSTM)
○ Softmax Layer
● Output - Pre-Trained Language Model
23
Pre-Trained Language Models
Model - Architecture
● ELMo - Two-layer BiLSTM
● ULMFiT - AWD-LSTM
● GPT, BERT, GPT-2 - Transformer
Transfer Learning - Using ULMFiT
Step 2 - Fine-Tuning on Domain Dataset
1. Dataset (Large)
○ Tax Domain Product Data (Unlabelled)
2. Objective Function
○ Predict the next word
● Starts from the Pre-Trained Language Model
○ Embedding Matrix
○ Hidden Layer Weight Matrix
Transfer Learning - Using ULMFiT
Step 2 - Fine-Tuning Tricks - Freezing + Gradual Unfreezing
● Freeze
○ For the first few epochs of training, freeze the Bi-LSTM weights to prevent catastrophic forgetting of what the model learned from the source dataset
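A freezing and gradual unfreezing schedule could look like the sketch below. The layer group names, the `frozen_epochs` parameter and the "one more group per epoch, top down" rule are assumptions for illustration. A deep learning framework would apply the returned list by toggling which parameters receive gradient updates.

```python
# Layer groups ordered from input to output, following the Step 1 diagram.
# The names are hypothetical labels, not identifiers from any library.
LAYER_GROUPS = ["embedding", "lstm_1", "lstm_2", "lstm_3", "softmax"]


def trainable_groups(epoch, frozen_epochs=2):
    """Return the layer groups to update in a given 0-based epoch.

    For the first `frozen_epochs` epochs only the top group is trained and
    everything below it stays frozen. After that, one more group is unfrozen
    per epoch, working from the top of the network down to the embeddings.
    """
    if epoch < frozen_epochs:
        n_unfrozen = 1
    else:
        n_unfrozen = min(len(LAYER_GROUPS), 2 + epoch - frozen_epochs)
    return LAYER_GROUPS[-n_unfrozen:]


# Print the schedule for the first six epochs.
for epoch in range(6):
    print(epoch, trainable_groups(epoch))
```

Lower layers hold the most general knowledge from the source dataset, so they are unfrozen last to limit catastrophic forgetting.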
Transfer Learning - Using ULMFiT
Step 2 - Fine-Tuning Tricks - Slanted Learning Rates
● A short, steep increase in the learning rate (LR) to quickly converge to a suitable region of the parameter space
● A long decay period to precisely fine-tune the weights
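The schedule described here is the slanted triangular learning rate from the ULMFiT paper (Howard and Ruder, 2018). A minimal sketch of that formula follows, using the paper's default values for `cut_frac`, `ratio` and `lr_max`. The function name and the choice of 1000 total steps are assumptions for the example.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at training step t.

    cut_frac is the fraction of steps spent increasing the rate, and
    ratio is how much smaller the lowest rate is than lr_max.
    """
    cut = math.floor(total_steps * cut_frac)
    if cut == 0:
        raise ValueError("total_steps is too small for this cut_frac")
    if t < cut:
        # Short, steep linear increase.
        p = t / cut
    else:
        # Long, slow linear decay.
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio


# Print a few points of the schedule for a 1000-step run.
for step in (0, 50, 100, 500, 1000):
    print(step, slanted_triangular_lr(step, total_steps=1000))
```

With these defaults the rate starts at `lr_max / 32`, peaks at `lr_max` after 10% of the steps, and then decays back towards `lr_max / 32` over the remaining 90%.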
Transfer Learning - Using ULMFiT
Step 3 - Domain Task Classifier
1. Dataset (Small)
○ Tax Domain Labelled Data
2. Objective Function
○ Cross Entropy
● Classifier layers (ReLU, Softmax) added on top of the Fine-Tuned Language Model
28
Tax Domain - Tax Rate Changes
● Techniques
○ Semantic Role Labeling (Slot Filling)
○ CoReference Resolution
○ ELMo w/o Fine Tuning
● Extracted slots

Key              Value
Effective Date   May 1, 2019
Jurisdiction     GeorgeTown County
New Rate         7%
Tax Type         General
29
Transfer Learning - Why does it work?
● Many NLP tasks share common knowledge about language
○ Linguistic Representations, Structural Similarities
● Tasks can inform each other
○ Syntax, Semantics
● Labeled data is rare but unlabeled data is abundant in every domain
30
Pre-Trained Language Models vs Word Embeddings
● Entire Network vs Single Layer
● Pre-Trained Language Models are able to capture context
○ Example
■ Apple iPhone 64 GB Black
■ Apple Cider Vinegar
31
Weak Supervision
● AKA Data Programming
○ Libraries - Snorkel, Snuba
● Steps
○ Create “domain heuristics” or “labeling functions” based on a small labelled dataset
■ Snorkel needs humans to write these; Snuba can generate them automatically for you
○ Learn a generative model that denoises these heuristics and can emit probabilistic labels
○ Run this model on the entire unlabelled dataset to get probabilistic labels
○ You now have a “good enough” large training data set
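The steps above can be sketched without any library. This is an illustration and not the Snorkel or Snuba API: the three labeling functions and the label names are invented, and a simple vote fraction stands in for the generative model that Snorkel would learn to weight and denoise the heuristics.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


# Hypothetical labeling functions: each encodes one domain heuristic.
def lf_lip_keyword(text):
    return "skin care" if "lip" in text.lower() else ABSTAIN


def lf_balm_keyword(text):
    return "skin care" if "balm" in text.lower() else ABSTAIN


def lf_shoe_keyword(text):
    return "footwear" if "shoe" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_lip_keyword, lf_balm_keyword, lf_shoe_keyword]


def probabilistic_label(text):
    """Combine labeling-function votes into label probabilities.

    Vote fractions are a simple stand-in for a learned generative model.
    Returns an empty dict when every labeling function abstains.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return {}
    counts = Counter(votes)
    return {label: n / len(votes) for label, n in counts.items()}


unlabelled = ["BB Lip Balm - 4pc", "Mens Running Shoe", "USB Cable 2m"]
training_set = [(text, probabilistic_label(text)) for text in unlabelled]
```

Rows where every function abstains get no label and would be dropped from the training set. Unlike this equal-weight vote, a learned label model also estimates how accurate and how correlated the heuristics are.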
32
Data Augmentation & Synthesis
● Back Translation
○ Source language -> Pivot Language -> Source Language
● Synonym Replacement
● Make sure that
○ The semantic structure and the meaning are preserved
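Synonym replacement can be sketched as below. The `SYNONYMS` table is a tiny hand-written assumption (a real pipeline might draw on a thesaurus or on embedding neighbours), and `prob` controls how often a known word is swapped. Back translation is not shown because it needs a translation model.

```python
import random

# Hypothetical, hand-curated synonym table for the product domain.
SYNONYMS = {
    "balm": ["salve", "ointment"],
    "pack": ["set", "bundle"],
}


def synonym_replace(text, rng, prob=0.5):
    """Return a copy of text with some known words swapped for a synonym.

    Words without an entry in SYNONYMS are left untouched, which helps keep
    brand names and quantities (and so the meaning) intact.
    """
    augmented = []
    for token in text.split():
        options = SYNONYMS.get(token.lower())
        if options and rng.random() < prob:
            augmented.append(rng.choice(options))
        else:
            augmented.append(token)
    return " ".join(augmented)


rng = random.Random(0)  # seeded so that runs are repeatable
variants = [synonym_replace("burts bees lip balm pack", rng) for _ in range(3)]
```

Each variant keeps the original label, so a small labelled set can be multiplied cheaply. Reviewing a sample of the generated titles is a simple check that the meaning is still preserved.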
33
NLP Stack - Tax Domain @ Avalara
● Use Cases
○ Product Classification
○ Rates/Rules Extraction
○ <Future Use Case>
● Technology
○ Algorithms - Classification, SRL, CoReference Resolution
○ Featurization / NLU - Embeddings (Character/Word), Pre-Trained Language Models (using Transfer Learning)
○ Pre-Processing - Tokenizers, Lemmatizers, POS Tagging, Language Detection
○ Extraction - Parsers (HTML), Parsers (PDF), OCR
○ Data - Training Data (Labeled), Raw Data (UnLabeled)
● <TBD>
34
Conclusion
● NLP Pipelines need to be domain specific
○ Libraries, Infrastructure and Techniques can be re-used across domains
○ Having good-quality, labelled, domain-specific data is of utmost importance
● Domain Specific Data
○ Large labelled data
■ Most techniques will work out of the box
○ Use unlabelled data from your domain to your advantage
○ Small/Low labelled data
■ Transfer Learning using Pre-Trained language models gives you a strong baseline
■ Techniques like Weak Supervision and Data Augmentation will help too
35
Thank You
