Domain Specific
NLP Pipelines
Rajesh Muppalla
Senior Director of Engineering
@codingnirvana
2
About Me
● Senior Director of Engineering, Avalara
● Previously
○ Co-Founder @ Indix (acquired by Avalara in Feb 2019)
○ Tech Lead, GoCD @ Thoughtworks - Open Source CI/CD Tool
● Focus Areas
○ Continuous Delivery
○ Microservices
○ Data Platforms
○ Machine Learning
3
● Founded in 2004, IPO in 2018
● Over 25,000 active customers
● More than 700 pre-built integrations
● 9.5 billion transactions processed in 2018
Avalara is a tax compliance automation company catering to businesses of all sizes.
Our mission is to help businesses manage complicated tax compliance obligations
imposed by their state, local or other tax authorities throughout the world.
4
The Avalara Solution
Completely Automated, End-to-End Compliance
Seamless integration with ERP, POS, E-commerce and third-party apps
[Diagram: ERP, Ecommerce, Retail, Transactions, Tax, Tax Rates, Tax Boundaries, Taxability Rules, Returns, Exemption Statutes, Certificate Templates, Customers, Certificates]
5
Indix - “Google Map” of Products
6
Data Pipeline @ Indix
[Diagram]
● Crawling Pipeline: Parse, Crawl Data, Crawl, Seed, Brand & Retailer Websites
● Data Pipeline (ML): Aggregate, Match, Standardize, Extract Attributes, Classify, Dedupe
● Feeds Pipeline: Transform, Clean, Connect, Feed Data, Brand & Retailer Feeds
● Indexing Pipeline: Real Time, Index, Analyze, Derive, Join
● Product Catalog, Customizable Feeds, Search & Analytics Index, API (Bulk & Synchronous), Product Data Transformation Service
7
Natural Language Processing
● Making sense of natural language text
● Examples
○ Autocomplete on Search
○ Spellcheck in Docs
○ Siri/Alexa/Google Assistant
○ Spam Filters
○ Machine Translation
8
NLP Examples - Product Classification
9
Product Classification - Tax Domain
Skin Care (SK0001)
Tax Code    US State      Rate (%)
SK0001      Alabama       3
SK0001      California    2
SK0001      Texas         0
SK0001      Oklahoma      5
HS Code (33041000)
HS Code     Country of Origin    Customs Rate (%)
33041000    Canada               5
33041000    India                2
33041000    Singapore            7
33041000    Australia            5
10
NLP Example - Attribute Extraction
11
Tax Domain - Tax Rate Changes
Key               Value
Effective Date    May 1, 2019
Jurisdiction      Georgetown County
New Rate          7%
Tax Type          General
12
Word Embeddings
13
Simplified NLP Pipeline
● Raw Text
● Tokenization
● Vector Representation
● Choose Algorithm
● Train Model (using training data)
● Test Model (using test data)
● Preprocessing
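The stages above can be sketched end to end in a few lines. This is a minimal illustration only: the product titles and labels are made up, the vector representation is a plain bag-of-words count, and a nearest-centroid classifier stands in for whatever algorithm a real pipeline would choose.

```python
import re
from collections import Counter

def tokenize(text):
    """Preprocessing + tokenization: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def vectorize(tokens, vocab):
    """Vector representation: bag-of-words counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def train(examples, vocab):
    """Train: average the vectors of each label (nearest-centroid, an assumption)."""
    sums, n = {}, Counter()
    for text, label in examples:
        vec = vectorize(tokenize(text), vocab)
        acc = sums.setdefault(label, [0.0] * len(vocab))
        for i, v in enumerate(vec):
            acc[i] += v
        n[label] += 1
    return {label: [v / n[label] for v in acc] for label, acc in sums.items()}

def predict(text, centroids, vocab):
    """Pick the label whose centroid is closest in squared Euclidean distance."""
    vec = vectorize(tokenize(text), vocab)
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hypothetical training and test data
train_data = [
    ("moisturizing face cream", "Skin Care"),
    ("face lotion with sunscreen", "Skin Care"),
    ("wireless bluetooth headphones", "Electronics"),
    ("usb charging cable", "Electronics"),
]
test_data = [
    ("night cream for face", "Skin Care"),
    ("bluetooth usb adapter", "Electronics"),
]

vocab = sorted({tok for text, _ in train_data for tok in tokenize(text)})
centroids = train(train_data, vocab)

# Test: fraction of held-out examples classified correctly
accuracy = sum(predict(t, centroids, vocab) == y for t, y in test_data) / len(test_data)
```

Words that never appear in the training data ("night", "adapter") are simply ignored by this representation.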
14
Vector Representation
● Why?
○ Machines do not understand text, need a numerical representation
○ Easy for Images - use RGB Model
● Part of feature engineering
● Can be learnt too
○ Get rid of hand crafted features
15
Vector Representation - One Hot Encoding
● Map each word to a unique ID
● Vocab size can be 10k to 250k
● Drawbacks
○ Memory/Size of the representation is inefficient
○ No notion of similarity
■ all distinct words are equidistant from each other
● Word similarity is useful in semantic tasks
● Question - Can we preserve word similarity?
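A small sketch of one-hot encoding makes the "no notion of similarity" drawback concrete. The toy sentence and helper names are assumptions for illustration.

```python
import math

def build_vocab(tokens):
    """Assign each distinct word a unique integer ID, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Return a vector of length len(vocab) with a single 1 at the word's ID."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

tokens = "skin care cream face cream lotion".split()
vocab = build_vocab(tokens)

cream = one_hot("cream", vocab)
lotion = one_hot("lotion", vocab)
skin = one_hot("skin", vocab)

# "cream" is exactly as far from "lotion" as it is from "skin":
# every pair of distinct one-hot vectors is sqrt(2) apart.
d_cream_lotion = euclidean(cream, lotion)
d_cream_skin = euclidean(cream, skin)
```

The vector length equals the vocabulary size, so with a 250k-word vocabulary each word would need a 250k-dimensional vector that is almost entirely zeros.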
16
Vector Representation - Word Embeddings
● Main Innovation
○ Obtain dense representation on a large unlabelled corpus
● Popularized by Word2Vec (Mikolov et al., 2013)
○ GloVe and FastText are other implementations
● Two variants: CBOW and Skip-Gram
17
Vector Representation - Word Embeddings
● Word Embeddings capture certain relations between words
18
Word2Vec - Training - Skip Gram Model
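In the skip-gram variant, the model is trained to predict the surrounding context words from a center word. The sketch below only generates the (center, context) training pairs, not the network itself; the window size and example sentence are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

Each pair becomes one training example: the center word is the input and the context word is the prediction target.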
19
Word2Vec - Lookup Table
20
Word Embeddings - Use cases & Drawbacks
● Use Cases
○ Embedding layer in various NLP Tasks
■ Classification
■ Attribute Extraction
○ Product Similarity
● Drawbacks
○ Not context-aware
■ Same representation for "bank" in the following two sentences
● I went to the bank to deposit money
● There is a boat next to the river bank
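The drawback follows directly from how a static embedding is used: it is a lookup table from word to vector, so the sentence around the word plays no part. A minimal sketch, with made-up three-dimensional vectors standing in for trained embeddings:

```python
import math

# Toy lookup table: one fixed vector per word (values are invented for illustration)
EMBEDDINGS = {
    "bank": [0.8, 0.1, 0.3],
    "river": [0.1, 0.9, 0.2],
    "money": [0.9, 0.0, 0.4],
}
UNKNOWN = [0.0, 0.0, 0.0]

def embed(sentence):
    """Look up each token independently; context is never consulted."""
    return [EMBEDDINGS.get(tok, UNKNOWN) for tok in sentence.lower().split()]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

s1 = "I went to the bank to deposit money"
s2 = "There is a boat next to the river bank"

bank_in_s1 = embed(s1)[s1.lower().split().index("bank")]
bank_in_s2 = embed(s2)[s2.lower().split().index("bank")]

# Identical vectors, even though the two senses of "bank" differ
same_vector = bank_in_s1 == bank_in_s2

# The single "bank" vector sits closer to "money" than to "river" in both sentences
sim_money = cosine(EMBEDDINGS["bank"], EMBEDDINGS["money"])
sim_river = cosine(EMBEDDINGS["bank"], EMBEDDINGS["river"])
```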
21
Language Models
22
Language Models
● Compute the probability distribution of the next word given a sequence of previous words
● Trained on a large unlabelled corpus
● Evolution
○ N-Gram Language Models
○ Neural Language Models
○ Pretrained Language Models
■ Transfer Learning
23
n-Gram Language Models
● Probabilistic Model
● Goal - Assign a probability to a given sentence
○ Machine Translation
■ P(high winds tonite) > P(large winds tonite)
○ Spell Check
■ The office is about fifteen minuets from my house
● P(about fifteen minutes from) > P(about fifteen minuets from)
○ Autocomplete, OCR, Summarization, Question Answering, etc.
● 3-Gram or 4-Gram mostly used in practice
● Compute the joint probability using the chain rule; estimate each conditional probability from counts
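A bigram model is the smallest case of the chain-rule-plus-counts idea above: P(w1 ... wn) is approximated by the product of P(wi | wi-1), and each factor is count(wi-1, wi) / count(wi-1). The sketch below uses a tiny made-up corpus and no smoothing, so any unseen bigram drives the probability to zero.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams; "<s>" marks the start of each sentence."""
    history_counts, bigram_counts = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.lower().split()
        history_counts.update(toks[:-1])          # every token that has a successor
        bigram_counts.update(zip(toks, toks[1:]))
    return history_counts, bigram_counts

def phrase_prob(phrase, history_counts, bigram_counts):
    """Chain rule with a one-word history: product of count(prev, cur) / count(prev)."""
    toks = ["<s>"] + phrase.lower().split()
    p = 1.0
    for prev, cur in zip(toks, toks[1:]):
        if history_counts[prev] == 0:
            return 0.0
        p *= bigram_counts[(prev, cur)] / history_counts[prev]
    return p

# Hypothetical training corpus
corpus = [
    "about fifteen minutes from my house",
    "about fifteen minutes from the office",
    "fifteen minuets were danced",
]
hist, bi = train_bigram(corpus)

p_minutes = phrase_prob("about fifteen minutes from", hist, bi)  # 2/3 * 1 * 2/3 * 1
p_minuets = phrase_prob("about fifteen minuets from", hist, bi)  # "minuets from" never seen
```

Here `p_minutes` is greater than `p_minuets`, which is the comparison a spell checker would use to prefer "minutes". Practical models add smoothing so that unseen n-grams get a small non-zero probability.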
24
Neural Language Models
● Drawbacks of statistical n-gram models
○ Large Context Sizes - Inefficient
● Enter LSTMs for Neural Language Modeling
○ Hidden layer much smaller than vocab size
25
Language Models - Use Cases & Drawbacks
● Use Cases
○ Search Autocomplete
■ For spellcheck - need Error Model too
○ Word Segmentation
■ Appleiphone64gbblack -> Apple iPhone 64GB Black
● Drawbacks
○ Traditional Language Models
■ Do not work well with OOV (Out of Vocabulary) words
● Question
■ Can we get the best of both worlds by combining Word Embeddings and Language Models?
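The word segmentation use case above can be sketched with a unigram language model and dynamic programming: try every split point and keep the segmentation whose words have the highest total log-probability. The counts below are invented, the unknown-word penalty is an assumption, and the output is lowercase (restoring casing such as "iPhone" would be a separate step).

```python
import math
from functools import lru_cache

# Toy unigram counts; a real system would estimate these from a large corpus
COUNTS = {"apple": 50, "iphone": 40, "64gb": 10, "black": 30,
          "app": 20, "le": 1, "i": 5, "phone": 25}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 10

def log_prob(word):
    """Log unigram probability; unseen words get a penalty that grows with length."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return -math.log(TOTAL) - len(word) * math.log(10)

def segment(text):
    """Return the most probable split of `text` into words under the unigram model."""
    text = text.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Best (score, words) for the suffix text[start:]
        if start == len(text):
            return 0.0, ()
        options = []
        for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1):
            word = text[start:end]
            tail_score, tail_words = best(end)
            options.append((log_prob(word) + tail_score, (word,) + tail_words))
        return max(options)

    return list(best(0)[1])

words = segment("Appleiphone64gbblack")
# ['apple', 'iphone', '64gb', 'black']
```

The model prefers "iphone" over "i" + "phone" because one frequent word scores higher than two words whose probabilities are multiplied together.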
26
Pre-Trained Language Models - Transfer Learning
● Step 1: Semi-Supervised Training on Large Data Corpus
○ Dataset (Large), e.g. Wikipedia, Common Crawl
○ Produces a Language Model (BERT, ELMo, GPT-2, XLNet)
● Step 2: Supervised Training on Domain Specific Data
○ Starts from the Pre-Trained Language Model (BERT, GPT-2, ELMo, XLNet)
○ Dataset (Small), Domain Specific, e.g. E-Comm, Tax Compliance, Healthcare
○ Produces a Domain Specific Model
○ Uses, e.g. Classifier, NER
27
Knowledge Graphs
28
Knowledge Graphs
29
More Examples...
30
What is a Knowledge Graph?
● Structured and Formal Representation of Knowledge as a Graph
● Entities = Nodes
■ Barack Obama, United States
■ Product Domain - Apple, iPhone
○ Entities can also have properties via Ontology
■ Height, Age, Price, YearOfFounding
● Relationships = Edges
■ countryOfBirth, countryOfResidence, presidentOf, isProductLineOf
■ Edges can be bidirectional, e.g. human relationships
■ Edges can have weight - a measure of confidence
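The definition above maps onto a small data structure: entities as nodes with properties, and relationships as labelled, optionally weighted edges. The class and method names below are hypothetical, the weights are invented, and "Alice"/"Bob" are placeholder entities for a bidirectional human relationship.

```python
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        self.edges = defaultdict(list)       # subject -> [(relation, object, weight)]
        self.properties = defaultdict(dict)  # entity -> {property name: value}

    def set_property(self, entity, name, value):
        self.properties[entity][name] = value

    def add_edge(self, subj, relation, obj, weight=1.0, bidirectional=False):
        """Add a labelled edge; weight is a confidence score between 0 and 1."""
        self.edges[subj].append((relation, obj, weight))
        if bidirectional:
            self.edges[obj].append((relation, subj, weight))

    def neighbors(self, subj, relation=None, min_weight=0.0):
        """Objects reachable from `subj`, optionally filtered by relation and confidence."""
        return [
            obj
            for rel, obj, w in self.edges.get(subj, [])
            if (relation is None or rel == relation) and w >= min_weight
        ]

kg = KnowledgeGraph()
kg.set_property("Apple", "YearOfFounding", 1976)
kg.add_edge("Barack Obama", "countryOfBirth", "United States")
kg.add_edge("iPhone", "isProductLineOf", "Apple", weight=0.95)
kg.add_edge("Alice", "siblingOf", "Bob", bidirectional=True)

brand = kg.neighbors("iPhone", relation="isProductLineOf")  # ['Apple']
```

Filtering on `min_weight` lets a consumer such as search ignore low-confidence edges without removing them from the graph.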
31
Knowledge Graph for Product Domain
32
Knowledge Graph - Use Cases
● Search
○ Intent Recognition
○ Query Segmentation
○ Query Expansion/Rewriting
● Demo
33
Thank You

Devday @ Sahaj - Domain Specific NLP Pipelines
