2. About Me
● Senior Director of Engineering, Avalara
● Previously
○ Co-Founder @ Indix (acquired by Avalara in Feb 2019)
○ Tech Lead, GoCD @ ThoughtWorks - an open source CI/CD tool
● Focus Areas
○ Continuous Delivery
○ Microservices
○ Data Platforms
○ Machine Learning
3. Avalara
● Founded in 2004, IPO in 2018
● Over 25,000 active customers
● More than 700 pre-built integrations
● 9.5 billion transactions processed in 2018
Avalara is a tax compliance automation company serving businesses of all sizes.
Our mission is to help businesses manage the complicated tax compliance obligations
imposed by state, local, and other tax authorities throughout the world.
6. Data Pipeline @ Indix
[Diagram: the Indix data pipeline. A Crawling Pipeline (Crawl Seed → Crawl → Parse) ingests crawl data from brand & retailer websites, and a Feeds Pipeline (Transform → Clean → Connect) ingests feed data from brand & retailer feeds. An ML Data Pipeline (Extract Attributes → Classify → Standardize → Match → Dedupe → Aggregate) builds the Product Catalog. An Indexing Pipeline (Join → Derive → Analyze → Index) produces the Search & Analytics Index and a Real Time Index, which power Customizable Feeds, the API (Bulk & Synchronous), and the Product Data Transformation Service.]
7. Natural Language Processing
● Making sense of natural language text
● Examples
○ Autocomplete on Search
○ Spellcheck in Docs
○ Siri/Alexa/Google Assistant
○ Spam Filters
○ Machine Translation
13. Simplified NLP Pipeline
Raw Text → Tokenization → Preprocessing → Vector Representation → Choose Algorithm → Train Model (using training data) → Test Model (using test data)
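As an illustration (not from the deck), here is a minimal sketch of this pipeline using scikit-learn; the tiny dataset, labels, and model choice are placeholder assumptions:

```python
# Minimal NLP pipeline sketch: tokenize -> vectorize -> train -> test.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["great phone", "terrible battery", "love the screen", "awful support"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (illustrative only)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels)

# CountVectorizer handles tokenization + vector representation;
# LogisticRegression is the chosen algorithm.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(X_train, y_train)         # train on training data
print(model.score(X_test, y_test))  # test on held-out data
```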
14. Vector Representation
● Why?
○ Machines do not understand text; they need a numerical representation
○ Easy for images - use the RGB model
● Part of feature engineering
● Can be learned too
○ Eliminates hand-crafted features
15. Vector Representation - One-Hot Encoding
● Map each word to a unique ID
● Vocab size can be 10k to 250k
● Drawbacks
○ The representation is memory-inefficient (vector length equals vocab size)
○ No notion of similarity
■ all words are equidistant from each other
● Word similarity is useful in semantic tasks
● Question - Can we preserve word similarity?
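A quick sketch of one-hot encoding (illustrative, with a toy vocabulary) makes both drawbacks visible:

```python
import numpy as np

# Toy vocabulary; real vocabularies run 10k-250k words, so vectors get huge.
vocab = ["apple", "iphone", "bank", "river", "money"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector of vocab size with a single 1 at the word's ID."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

# Every pair of distinct words has the same distance: sqrt(2).
print(np.linalg.norm(one_hot("apple") - one_hot("iphone")))  # 1.414...
print(np.linalg.norm(one_hot("bank") - one_hot("river")))    # 1.414...
```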
16. Vector Representation - Word Embeddings
● Main Innovation
○ Learn dense representations from a large unlabelled corpus
● Popularized by Word2Vec (Mikolov et al., 2013)
○ GloVe and FastText are other implementations
● Two variants - CBOW and Skip-gram (see the sketch below)
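For illustration (not part of the original deck), a minimal Word2Vec training sketch with gensim; the corpus here is a stand-in assumption:

```python
from gensim.models import Word2Vec

# Tiny stand-in corpus: in practice this would be a large unlabelled corpus.
sentences = [
    ["apple", "iphone", "64gb", "black"],
    ["samsung", "galaxy", "128gb", "blue"],
    ["apple", "ipad", "silver"],
]

# sg=1 selects the Skip-gram variant; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["apple"].shape)  # dense 50-dim vector instead of one-hot
print(model.wv.similarity("apple", "iphone"))
```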
20. Word Embeddings - Use Cases & Drawbacks
● Use Cases
○ Embedding layer in various NLP Tasks
■ Classification
■ Attribute Extraction
○ Product Similarity
● Drawbacks
○ Not context-aware
■ Same representation for "bank" in the following two sentences (see the sketch below)
● I went to the bank to deposit money
● There is a boat next to the river bank
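A short illustrative sketch (reusing the gensim setup from the previous snippet) of why this is a problem: a word's vector is fixed, regardless of the sentence it appears in.

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [
    ["i", "went", "to", "the", "bank", "to", "deposit", "money"],
    ["there", "is", "a", "boat", "next", "to", "the", "river", "bank"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

# Word2Vec stores exactly one vector per word type, so "bank" gets the
# same representation in both sentences - context is ignored at lookup time.
vec_financial = model.wv["bank"]  # from the "deposit money" sentence
vec_river = model.wv["bank"]      # from the "river bank" sentence
print(np.array_equal(vec_financial, vec_river))  # True
```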
22. Language Models
● Compute the probability distribution of the next word given a sequence of previous words
● Trained on a large unlabelled corpus
● Evolution
○ N-Gram Language Models
○ Neural Language Models
○ Pretrained Language Models
■ Transfer Learning
23. n-Gram Language Models
● Probabilistic Model
● Goal - Assign a probability to a given sentence
○ Machine Translation
■ P(high winds tonite) > P(large winds tonite)
○ Spell Check
■ The office is about fifteen minuets from my house
● P(about fifteen minutes from) > P(about fifteen minuets from)
○ Autocomplete, OCR, Summarization, Question Answering, etc.
● 3-grams or 4-grams are mostly used in practice
● Compute the joint probability using the chain rule; estimate each conditional probability from counts (see the sketch below)
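As a hedged illustration (the toy corpus and the lack of smoothing are my own simplifications), here is a count-based bigram model:

```python
from collections import Counter

# Toy corpus; real n-gram models are trained on far larger text.
corpus = "high winds tonite . large trucks tonite . high winds today".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(prev, word):
    """P(word | prev) estimated from raw counts (no smoothing, for clarity)."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(words):
    """Chain rule with the bigram (Markov) assumption."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_sentence(["high", "winds", "tonite"]))   # 0.5
print(p_sentence(["large", "winds", "tonite"]))  # 0.0: unseen bigram
```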
24. Neural Language Models
● Drawbacks of statistical n-gram models
○ Large context sizes are inefficient - count tables grow exponentially and become sparse
● Enter LSTMs for Neural Language Modeling
○ Hidden layer is much smaller than the vocab size (see the sketch below)
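An illustrative minimal LSTM language model in PyTorch (the sizes and names here are assumptions, not from the talk):

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Predicts a distribution over the next word at each position."""
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256):
        super().__init__()
        # hidden_dim (256) is far smaller than vocab_size (10k+)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):            # (batch, seq_len)
        x = self.embed(token_ids)             # (batch, seq_len, embed_dim)
        out, _ = self.lstm(x)                 # (batch, seq_len, hidden_dim)
        return self.proj(out)                 # logits over the next word

model = LSTMLanguageModel()
tokens = torch.randint(0, 10_000, (2, 8))    # dummy batch of token IDs
print(model(tokens).shape)                    # torch.Size([2, 8, 10000])
```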
25. Language Models - Use Cases & Drawbacks
● Use Cases
○ Search Autocomplete
■ For spellcheck - need Error Model too
○ Word Segmentation
■ Appleiphone64gbblack -> Apple iPhone 64GB Black (see the sketch below)
● Drawbacks
○ Traditional Language Models
■ Do not work well with OOV (Out of Vocabulary) words
● Question
○ Can we get the best of both worlds by combining word embeddings and language models?
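For illustration (the vocabulary and scoring are simplified assumptions), word segmentation can be framed as picking the split that a language model scores highest; here a toy unigram model stands in for the LM:

```python
# Toy unigram "language model": higher probability = more plausible word.
# A real system would use probabilities trained on a large corpus.
WORD_PROBS = {"apple": 0.04, "iphone": 0.03, "64gb": 0.02, "black": 0.05}

def segment(text, memo=None):
    """Return (score, words): the best-scoring segmentation of `text`."""
    if memo is None:
        memo = {}
    if text == "":
        return 1.0, []
    if text in memo:
        return memo[text]
    best = (0.0, [text])  # fallback: an unsegmentable remainder scores 0
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        if head in WORD_PROBS:
            tail_score, tail_words = segment(tail, memo)
            score = WORD_PROBS[head] * tail_score
            if score > best[0]:
                best = (score, [head] + tail_words)
    memo[text] = best
    return best

print(segment("appleiphone64gbblack")[1])  # ['apple', 'iphone', '64gb', 'black']
```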
26. Pre-Trained Language Models - Transfer Learning
● Step 1: Semi-supervised training on a large dataset (e.g. Wikipedia, Common Crawl)
○ Produces a pre-trained language model - BERT, GPT-2, ELMo, XLNet
● Step 2: Supervised training on a small, domain-specific dataset (e.g. E-Comm, Tax Compliance, Healthcare)
○ Produces a domain-specific model for downstream uses (e.g. Classifier, NER)
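As an illustration of Step 2 (my own sketch using the Hugging Face transformers library, which the deck does not mention; the data and hyperparameters are assumptions), fine-tuning a pre-trained BERT for classification:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Step 1 is already done for us: load a model pre-trained on a large corpus.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Step 2: supervised fine-tuning on (tiny, illustrative) domain-specific data.
texts = ["sales tax applies to this item", "this product is tax exempt"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # returns loss + logits
outputs.loss.backward()                  # a single gradient step, for brevity
optimizer.step()
print(float(outputs.loss))
```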
30. What is a Knowledge Graph?
● Structured and Formal Representation of Knowledge as a Graph
● Entities = Nodes
○ e.g. Barack Obama, United States
○ Product domain - Apple, iPhone
○ Entities can also have properties via an Ontology
■ Height, Age, Price, YearOfFounding
● Relationships = Edges
○ e.g. countryOfBirth, countryOfResidence, presidentOf, isProductLineOf
○ Edges can be bidirectional - e.g. human relationships
○ Edges can have weights - a measure of confidence
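A minimal sketch of such a graph (using networkx; the entities, properties, and confidence weights are illustrative assumptions):

```python
import networkx as nx

# MultiDiGraph allows several labeled relations between the same two entities.
kg = nx.MultiDiGraph()

# Entities = nodes, with properties attached via an ontology.
kg.add_node("Barack Obama", height_m=1.85)
kg.add_node("United States", year_of_founding=1776)
kg.add_node("iPhone", price_usd=999)
kg.add_node("Apple")

# Relationships = edges; each carries a label and a confidence weight.
kg.add_edge("Barack Obama", "United States", relation="countryOfBirth", weight=0.99)
kg.add_edge("Barack Obama", "United States", relation="presidentOf", weight=0.98)
kg.add_edge("iPhone", "Apple", relation="isProductLineOf", weight=0.97)

for head, tail, data in kg.edges(data=True):
    print(head, f"--{data['relation']}-->", tail, f"(conf={data['weight']})")
```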