About Me
● Senior Director of Engineering, Avalara
● Previously
○ Co-Founder @ Indix (acquired by Avalara in Feb 2019)
○ Tech Lead, GoCD @ Thoughtworks - an open-source CI/CD tool
● Scale By The Bay
○ 2016 - Data Pipelines Panelist
○ 2017 - Continuous Delivery For Machine Learning
● Based in Chennai, India
Talk Focus
● Working on NLP problems since early 2012
● Two domains now
○ Indix - E-Commerce
■ Evolution across 7 years
○ Avalara - Tax Compliance
■ Learnings from e-commerce
■ Newer techniques from last couple of years
● Share what we learnt
Embeddings
Word Embeddings
Word2Vec (CBOW / Skip-Gram), GloVe, FastText
Dense vector representations of words, learned from a large unlabelled corpus
● Why?
○ Machines do not understand text, need a numerical representation
● Part of feature engineering
● Learned Embeddings
○ Capture notion of similarity
○ Popularized by Word2Vec (Mikolov et al., 2013)
○ GloVe and FastText are alternative implementations
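The talk itself shows no code here; as a minimal sketch of how such embeddings are learned, using gensim (the toy product titles are invented, and the gensim 4.x API is assumed):

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized product titles (invented examples).
corpus = [
    ["apple", "iphone", "64gb", "black"],
    ["apple", "cider", "vinegar", "500ml"],
    ["samsung", "galaxy", "128gb", "silver"],
    ["organic", "apple", "juice", "1l"],
]

# Skip-Gram (sg=1) with 100-dimensional vectors; CBOW would be sg=0.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["apple"]             # dense representation, shape (100,)
print(model.wv.most_similar("apple"))  # nearest neighbours in vector space
```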
Embeddings - Useful Properties
● Word embeddings capture certain relations between words, e.g. king - man + woman ≈ queen
Source - https://www.tensorflow.org/tutorials/text/word_embeddings
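A quick way to reproduce such a relation (not from the talk) is with pre-trained GloVe vectors via gensim's downloader:

```python
import gensim.downloader as api

# Pre-trained 50-dimensional GloVe vectors (downloaded on first use).
glove = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ queen
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```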
Taxes for Specific Types of Clothing - Tax Rules in New York
[Slide graphic: New York sales-tax rules for specific types of clothing]
Source - https://www.avalara.com/us/en/learn/whitepapers/the-trials-and-tribulations-of-sales-tax-in-the-united-states.html
Product Classification - Tax Domain
Each product in a transaction (e.g. “BB Lip Balm - 4pc”) is mapped to a Tax Code for US sales tax and an HS Code for customs, and each code determines the applicable tax rules.

Tax Code: Skin Care (SK0001)

Tax Code   US State   Rate (%)
SK0001     Alabama    3
SK0001     New York   8
SK0001     Texas      0
SK0001     Oklahoma   5

HS Code: Beauty / Makeup (33041000)

HS Code    Country of Origin   Customs Rate (%)
33041000   Canada              5
33041000   India               2
33041000   Singapore           7
33041000   Australia           5
Product Classification - Tax Domain
● Should be easy?
○ “Similar” to Product Classification in E-Commerce Domain
● However, there are challenges
○ Classification Taxonomy is different - 2000 vs 5000 leaf nodes
○ Lack of labelled data for training (small data or low data problem)
○ Vocabulary is different
■ E.g. abbreviations used in product transaction data
○ Data is noisy
● Decent baseline
○ Re-use the embedding layer from e-commerce and re-train the tax classifier on top (a sketch follows below)
● Can we do better?
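One way to implement that baseline, sketched in Keras; the vocabulary size, layer widths and file path are assumptions rather than Avalara's actual setup:

```python
import numpy as np
from tensorflow import keras

VOCAB_SIZE, EMBED_DIM, NUM_TAX_CODES = 50_000, 100, 2_000  # assumed sizes

# Embedding weights previously learned on the e-commerce corpus
# ("ecommerce_embeddings.npy" is a placeholder path).
ecommerce_embeddings = np.load("ecommerce_embeddings.npy")  # (VOCAB_SIZE, EMBED_DIM)

model = keras.Sequential([
    # Re-used e-commerce embedding layer; trainable=False keeps it frozen.
    keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM,
                           weights=[ecommerce_embeddings], trainable=False),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(256, activation="relu"),
    # Re-trained classification head over the ~2,000 tax-code leaf nodes.
    keras.layers.Dense(NUM_TAX_CODES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```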
Yes, we can
● Transfer Learning
● Weak Supervision
● Data Augmentation & Synthesis
Transfer Learning - History
● Computer Vision
○ 2012 - ImageNet competition
○ AlexNet cut the runner-up's error rate by roughly 41% (relative); fine-tuning its pre-trained weights soon became the standard transfer-learning recipe in vision
● 2018 - Watershed moment in NLP
○ Transfer learning using pre-trained language models
○ ELMo, ULMFiT, BERT, GPT, GPT-2
“It only seems to be a question of time until pretrained word embeddings will be dethroned and replaced by pretrained language models in the toolbox of every NLP practitioner.”
- Sebastian Ruder (Researcher, DeepMind) - https://thegradient.pub/nlp-imagenet/
Transfer Learning - Using ULMFiT
Step 1 - Pre-Training on the Source Dataset
1. Dataset (Huge) - e-commerce product data
2. Objective function - predict the next word
Architecture: AWD-LSTM - an embedding layer, hidden layers (3 stacked LSTMs) and a softmax layer
Output: a pre-trained language model
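A rough sketch of this step using fastai v1, the library that implements ULMFiT; the file layout, column names and hyperparameters are assumptions:

```python
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

# Huge unlabelled e-commerce corpus ("products.csv" is a placeholder).
data_lm = TextLMDataBunch.from_csv("data/", "products.csv", text_cols="title")

# AWD-LSTM language model; pretrained=False since we pre-train from scratch.
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False, drop_mult=0.3)
learn.fit_one_cycle(10, 1e-2)            # objective: predict the next word
learn.save_encoder("ecommerce_encoder")  # reusable pre-trained encoder
```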
Transfer Learning - Using ULMFiT
Step 2 - Fine-Tuning on the Domain Dataset
1. Dataset (Large) - tax-domain product data (unlabelled)
2. Objective function - predict the next word
Starting from the pre-trained language model, all layers (embedding matrix, hidden-layer weight matrices) are fine-tuned.
Transfer Learning - Using ULMFiT
Step 2 - Fine-Tuning Tricks - Freezing + Gradual Unfreezing
For the first few epochs of training, freeze the LSTM weights to prevent catastrophic forgetting of what was learned from the source dataset; then unfreeze layer groups gradually.
Transfer Learning - Using ULMFiT
Step 2 - Fine-Tuning Tricks - Slanted Triangular Learning Rates
A short, steep increase in the learning rate to quickly converge to a suitable region of the parameter space, followed by a long decay period to precisely fine-tune the weights (see the sketch below).
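Continuing the fastai sketch, both tricks map onto fastai's freeze/unfreeze and one-cycle APIs (fit_one_cycle closely resembles slanted triangular learning rates; vocabulary alignment between the two corpora is assumed):

```python
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

# Unlabelled tax-domain product data (placeholder file and column names).
data_lm = TextLMDataBunch.from_csv("data/", "tax_products.csv",
                                   text_cols="description")

learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False, drop_mult=0.3)
learn.load_encoder("ecommerce_encoder")  # start from the Step 1 encoder

# Freezing + gradual unfreezing to avoid catastrophic forgetting.
learn.freeze()                # train only the last layer group first
learn.fit_one_cycle(1, 1e-2)  # one-cycle ≈ slanted triangular learning rates
learn.freeze_to(-2)           # unfreeze one more layer group
learn.fit_one_cycle(1, 5e-3)
learn.unfreeze()              # finally fine-tune all layers
learn.fit_one_cycle(2, 1e-3)
learn.save_encoder("tax_encoder")
```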
Transfer Learning - Using ULMFiT
Step 3 - Domain Task Classifier
1. Dataset (Small) - labelled tax-domain data
2. Objective function - cross entropy
A classifier head (ReLU block + softmax) is stacked on top of the fine-tuned language model.
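Continuing the same sketch, the classifier stage might look like this (file and column names remain placeholders):

```python
from fastai.text import TextClasDataBunch, text_classifier_learner, AWD_LSTM

# Small labelled dataset: product descriptions with tax-code labels;
# the LM vocab from the Step 2 sketch is shared with the classifier.
data_clas = TextClasDataBunch.from_csv(
    "data/", "tax_labelled.csv",
    text_cols="description", label_cols="tax_code", vocab=data_lm.vocab)

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder("tax_encoder")  # fine-tuned language model from Step 2

# The classifier head (linear -> ReLU -> linear -> softmax) is trained with
# cross-entropy; gradual unfreezing applies here as well.
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()
learn.fit_one_cycle(2, 1e-3)
```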
Tax Domain - Tax Rate Changes
Extracting structured fields from a rate-change notice:

Key              Value
Effective Date   May 1, 2019
Jurisdiction     Georgetown County
New Rate         7%
Tax Type         General

Techniques: Semantic Role Labeling (slot filling), coreference resolution, and ELMo without fine-tuning.
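The talk shows no code for this step; purely as an illustration, AllenNLP ships pre-trained SRL predictors whose output can be mapped onto these slots (the model path below is a placeholder, not Avalara's system):

```python
from allennlp.predictors.predictor import Predictor

# Placeholder path: substitute a pre-trained AllenNLP SRL model archive.
srl = Predictor.from_path("<path-or-url-to-srl-model.tar.gz>")

notice = ("Effective May 1, 2019, the general sales tax rate in "
          "Georgetown County increases to 7%.")
result = srl.predict(sentence=notice)

# Each verb yields a frame whose tagged arguments (ARG0, ARG1, ARGM-TMP, ...)
# can be mapped onto slots such as Effective Date, Jurisdiction and New Rate.
for frame in result["verbs"]:
    print(frame["description"])
```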
Transfer Learning - Why does it work?
● Many NLP tasks share common knowledge about language
○ Linguistic Representations, Structural Similarities
● Tasks can inform each other
○ Syntax, Semantics
● Labeled data is rare but unlabeled data is abundant in every domain
Pre-Trained Language Models vs Word Embeddings
● Entire network vs a single layer
● Pre-trained language models are able to capture context
○ Example - the word “Apple” in:
■ Apple iPhone 64 GB Black
■ Apple Cider Vinegar
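The models named in the talk are ELMo-era, but the effect is easy to demonstrate with any pre-trained contextual model, e.g. BERT via Hugging Face transformers (illustrative sketch only):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Contextual embedding of `word` (assumed to be a single BERT token)."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]  # (seq_len, 768)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

a = word_vector("apple iphone 64 gb black", "apple")
b = word_vector("apple cider vinegar", "apple")
# A static embedding gives one vector per word; contextual vectors differ.
print(torch.cosine_similarity(a, b, dim=0).item())
```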
Weak Supervision
● AKA Data Programming
○ Libraries - Snorkel, Snuba
● Steps
○ Create “domain heuristics” or “labeling functions”, guided by a small labelled dataset
■ Snorkel needs humans to write these; Snuba can generate them automatically
○ Learn a generative model that denoises these heuristics and can emit probabilistic labels
○ Run this model on the entire unlabelled dataset to get probabilistic labels
○ You now have a “good enough” large training dataset (see the sketch below)
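A minimal Snorkel sketch of this workflow; the labeling functions, keywords and labels below are invented for illustration:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

SKINCARE, OTHER, ABSTAIN = 1, 0, -1  # hypothetical tax-code labels

# Labeling functions encode domain heuristics; these keywords are invented.
@labeling_function()
def lf_keyword_lip(x):
    return SKINCARE if "lip balm" in x.title.lower() else ABSTAIN

@labeling_function()
def lf_keyword_moisturizer(x):
    return SKINCARE if "moisturizer" in x.title.lower() else ABSTAIN

@labeling_function()
def lf_keyword_cable(x):
    return OTHER if "cable" in x.title.lower() else ABSTAIN

df = pd.DataFrame({"title": ["BB Lip Balm - 4pc", "HDMI Cable 2m",
                             "Night Moisturizer"]})
L = PandasLFApplier([lf_keyword_lip, lf_keyword_moisturizer,
                     lf_keyword_cable]).apply(df)

# The generative label model denoises the overlapping, conflicting heuristics
# and emits probabilistic labels for the whole unlabelled dataset.
label_model = LabelModel(cardinality=2)
label_model.fit(L)
probs = label_model.predict_proba(L)
```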
Data Augmentation & Synthesis
● Back Translation
○ Source language -> pivot language -> source language (e.g. English -> French -> English)
● Synonym Replacement
● Make sure the augmentations preserve the semantic structure and the meaning of the original (a sketch follows below)
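Back translation needs a translation system, so here is just a minimal synonym-replacement sketch using WordNet via NLTK; it is naive and does not filter replacements that change the meaning, which is exactly why the caveat above matters:

```python
import random
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

def synonym_replace(sentence: str, n: int = 1) -> str:
    """Replace up to n words with a random WordNet synonym."""
    words = sentence.split()
    candidates = list(range(len(words)))
    random.shuffle(candidates)
    replaced = 0
    for i in candidates:
        synonyms = {lemma.name().replace("_", " ")
                    for synset in wordnet.synsets(words[i])
                    for lemma in synset.lemmas()} - {words[i]}
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
            replaced += 1
        if replaced >= n:
            break
    return " ".join(words)

print(synonym_replace("organic apple juice"))
```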
NLP Stack - Tax Domain @ Avalara
● Use Cases: Product Classification | Rates/Rules Extraction | <Future Use Case>
● Technology
○ Algorithms: Classification | SRL | CoReference Resolution
○ Featurization / NLU: Embeddings (Character/Word) | Pre-Trained Language Models (using Transfer Learning)
○ Pre-Processing: Tokenizers | Lemmatizers | POS Tagging | Language Detection
○ Extraction: Parsers (HTML) | Parsers (PDF) | OCR | <TBD>
○ Data: Training Data (Labeled) | Raw Data (UnLabeled)
Conclusion
● NLP Pipelines need to be domain-specific
○ Libraries, Infrastructure and Techniques can be re-used across domains
○ Having good-quality, labelled, domain-specific data is of utmost importance
● Domain-Specific Data
○ Large labelled data
■ Most techniques will work out of the box
○ Use unlabelled data from your domain to your advantage
○ Small/low labelled data
■ Transfer Learning using pre-trained language models gives you a strong baseline
■ Techniques like Weak Supervision and Data Augmentation will help too