2. About Me
● Senior Director of Engineering, Avalara
● Previously
○ Co-Founder @ Indix (acquired by Avalara in Feb 2019)
○ Tech Lead, GoCD @ ThoughtWorks - an open source CI/CD tool
● Focus Areas
○ Continuous Delivery
○ Microservices
○ Data Platforms
○ Machine Learning
3. Avalara
● Founded in 2004, IPO in 2018
● Over 25,000 active customers
● More than 700 pre-built integrations
● 9.5 billion transactions processed in 2018
Avalara is a tax compliance automation company serving businesses of all sizes.
Our mission is to help businesses manage the complicated tax compliance obligations
imposed by state, local, and other tax authorities throughout the world.
6. Data Pipeline @ Indix
[Diagram: the Indix data pipeline. A Crawling Pipeline (Crawl Seed → Crawl → Parse) ingests crawl data from brand & retailer websites, and a Feeds Pipeline (Transform → Clean → Connect) ingests feed data from brand & retailer feeds. An ML Data Pipeline (Extract Attributes → Classify → Standardize → Match → Dedupe → Aggregate) builds the Product Catalog. An Indexing Pipeline (Join → Derive → Analyze → Index) produces the Search & Analytics Index and a Real Time Index, which power Customizable Feeds, the API (Bulk & Synchronous), and the Product Data Transformation Service.]
7. Natural Language Processing
● Making sense of natural language text
● Examples
○ Autocomplete on Search
○ Spellcheck in Docs
○ Siri/Alexa/Google Assistant
○ Spam Filters
○ Machine Translation
13. Simplified NLP Pipeline
Raw Text → Tokenization → Preprocessing → Vector Representation → Choose Algorithm → Train Model (using training data) → Test Model (using test data)
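As an illustration (not from the deck), here is a minimal sketch of this pipeline using scikit-learn; the tiny dataset, labels, and model choice are placeholder assumptions:

```python
# Minimal NLP pipeline sketch: tokenize -> vectorize -> train -> test.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["great phone", "terrible battery", "love the screen", "awful support"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (illustrative only)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels)

# CountVectorizer handles tokenization + vector representation;
# LogisticRegression is the chosen algorithm.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(X_train, y_train)         # train on training data
print(model.score(X_test, y_test))  # test on held-out data
```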
14. Vector Representation
● Why?
○ Machines do not understand text; they need a numerical representation
○ Easy for images - use the RGB model
● Part of feature engineering
● Can be learned too
○ Eliminates hand-crafted features
15. Vector Representation - One-Hot Encoding
● Map each word to a unique ID
● Vocab size can be 10k to 250k
● Drawbacks
○ The representation is memory-inefficient (vector length equals vocab size)
○ No notion of similarity
■ all words are equidistant from each other
● Word similarity is useful in semantic tasks
● Question - Can we preserve word similarity?
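A quick sketch of one-hot encoding (illustrative, with a toy vocabulary) makes both drawbacks visible:

```python
import numpy as np

# Toy vocabulary; real vocabularies run 10k-250k words, so vectors get huge.
vocab = ["apple", "iphone", "bank", "river", "money"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector of vocab size with a single 1 at the word's ID."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

# Every pair of distinct words has the same distance: sqrt(2).
print(np.linalg.norm(one_hot("apple") - one_hot("iphone")))  # 1.414...
print(np.linalg.norm(one_hot("bank") - one_hot("river")))    # 1.414...
```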
16. Vector Representation - Word Embeddings
● Main Innovation
○ Learn dense representations from a large unlabelled corpus
● Popularized by Word2Vec (Mikolov et al., 2013)
○ GloVe and FastText are other implementations
● Two variants - CBOW and Skip-gram (see the sketch below)
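For illustration (not part of the original deck), a minimal Word2Vec training sketch with gensim; the corpus here is a stand-in assumption:

```python
from gensim.models import Word2Vec

# Tiny stand-in corpus: in practice this would be a large unlabelled corpus.
sentences = [
    ["apple", "iphone", "64gb", "black"],
    ["samsung", "galaxy", "128gb", "blue"],
    ["apple", "ipad", "silver"],
]

# sg=1 selects the Skip-gram variant; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["apple"].shape)  # dense 50-dim vector instead of one-hot
print(model.wv.similarity("apple", "iphone"))
```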
20. Word Embeddings - Use Cases & Drawbacks
● Use Cases
○ Embedding layer in various NLP Tasks
■ Classification
■ Attribute Extraction
○ Product Similarity
● Drawbacks
○ Not context-aware
■ Same representation for "bank" in the following two sentences (see the sketch below)
● I went to the bank to deposit money
● There is a boat next to the river bank
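A short illustrative sketch (reusing the gensim setup from the previous snippet) of why this is a problem: a word's vector is fixed, regardless of the sentence it appears in.

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [
    ["i", "went", "to", "the", "bank", "to", "deposit", "money"],
    ["there", "is", "a", "boat", "next", "to", "the", "river", "bank"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

# Word2Vec stores exactly one vector per word type, so "bank" gets the
# same representation in both sentences - context is ignored at lookup time.
vec_financial = model.wv["bank"]  # from the "deposit money" sentence
vec_river = model.wv["bank"]      # from the "river bank" sentence
print(np.array_equal(vec_financial, vec_river))  # True
```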
22. Language Models
● Compute the probability distribution of the next word given a sequence of previous words
● Trained on a large unlabelled corpus
● Evolution
○ N-Gram Language Models
○ Neural Language Models
○ Pretrained Language Models
■ Transfer Learning
23. n-Gram Language Models
● Probabilistic Model
● Goal - Assign a probability to a given sentence
○ Machine Translation
■ P(high winds tonite) > P(large winds tonite)
○ Spell Check
■ The office is about fifteen minuets from my house
● P(about fifteen minutes from) > P(about fifteen minuets from)
○ Autocomplete, OCR, Summarization, Question Answering, etc.
● 3-grams or 4-grams are mostly used in practice
● Compute the joint probability using the chain rule; estimate each conditional probability from counts (see the sketch below)
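As a hedged illustration (the toy corpus and the lack of smoothing are my own simplifications), here is a count-based bigram model:

```python
from collections import Counter

# Toy corpus; real n-gram models are trained on far larger text.
corpus = "high winds tonite . large trucks tonite . high winds today".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(prev, word):
    """P(word | prev) estimated from raw counts (no smoothing, for clarity)."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(words):
    """Chain rule with the bigram (Markov) assumption."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_sentence(["high", "winds", "tonite"]))   # 0.5
print(p_sentence(["large", "winds", "tonite"]))  # 0.0: unseen bigram
```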
24. Neural Language Models
● Drawbacks of statistical n-gram models
○ Large context sizes are inefficient - count tables grow exponentially and become sparse
● Enter LSTMs for Neural Language Modeling
○ Hidden layer is much smaller than the vocab size (see the sketch below)
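An illustrative minimal LSTM language model in PyTorch (the sizes and names here are assumptions, not from the talk):

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Predicts a distribution over the next word at each position."""
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256):
        super().__init__()
        # hidden_dim (256) is far smaller than vocab_size (10k+)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):            # (batch, seq_len)
        x = self.embed(token_ids)             # (batch, seq_len, embed_dim)
        out, _ = self.lstm(x)                 # (batch, seq_len, hidden_dim)
        return self.proj(out)                 # logits over the next word

model = LSTMLanguageModel()
tokens = torch.randint(0, 10_000, (2, 8))    # dummy batch of token IDs
print(model(tokens).shape)                    # torch.Size([2, 8, 10000])
```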
25. Language Models - Use Cases & Drawbacks
● Use Cases
○ Search Autocomplete
■ For spellcheck - need Error Model too
○ Word Segmentation
■ Appleiphone64gbblack -> Apple iPhone 64GB Black (see the sketch below)
● Drawbacks
○ Traditional Language Models
■ Do not work well with OOV (Out of Vocabulary) words
● Question
○ Can we get the best of both worlds by combining word embeddings and language models?
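For illustration (the vocabulary and scoring are simplified assumptions), word segmentation can be framed as picking the split that a language model scores highest; here a toy unigram model stands in for the LM:

```python
# Toy unigram "language model": higher probability = more plausible word.
# A real system would use probabilities trained on a large corpus.
WORD_PROBS = {"apple": 0.04, "iphone": 0.03, "64gb": 0.02, "black": 0.05}

def segment(text, memo=None):
    """Return (score, words): the best-scoring segmentation of `text`."""
    if memo is None:
        memo = {}
    if text == "":
        return 1.0, []
    if text in memo:
        return memo[text]
    best = (0.0, [text])  # fallback: an unsegmentable remainder scores 0
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        if head in WORD_PROBS:
            tail_score, tail_words = segment(tail, memo)
            score = WORD_PROBS[head] * tail_score
            if score > best[0]:
                best = (score, [head] + tail_words)
    memo[text] = best
    return best

print(segment("appleiphone64gbblack")[1])  # ['apple', 'iphone', '64gb', 'black']
```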
26. Pre-Trained Language Models - Transfer Learning
● Step 1: Semi-supervised training on a large dataset (e.g. Wikipedia, Common Crawl)
○ Produces a pre-trained language model - BERT, GPT-2, ELMo, XLNet
● Step 2: Supervised training on a small, domain-specific dataset (e.g. E-Comm, Tax Compliance, Healthcare)
○ Produces a domain-specific model for downstream uses (e.g. Classifier, NER)
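As an illustration of Step 2 (my own sketch using the Hugging Face transformers library, which the deck does not mention; the data and hyperparameters are assumptions), fine-tuning a pre-trained BERT for classification:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Step 1 is already done for us: load a model pre-trained on a large corpus.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Step 2: supervised fine-tuning on (tiny, illustrative) domain-specific data.
texts = ["sales tax applies to this item", "this product is tax exempt"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # returns loss + logits
outputs.loss.backward()                  # a single gradient step, for brevity
optimizer.step()
print(float(outputs.loss))
```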
30. What is a Knowledge Graph?
● Structured and Formal Representation of Knowledge as a Graph
● Entities = Nodes
○ e.g. Barack Obama, United States
○ Product domain - Apple, iPhone
○ Entities can also have properties via an Ontology
■ Height, Age, Price, YearOfFounding
● Relationships = Edges
○ e.g. countryOfBirth, countryOfResidence, presidentOf, isProductLineOf
○ Edges can be bidirectional - e.g. human relationships
○ Edges can have weights - a measure of confidence
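A minimal sketch of such a graph (using networkx; the entities, properties, and confidence weights are illustrative assumptions):

```python
import networkx as nx

# MultiDiGraph allows several labeled relations between the same two entities.
kg = nx.MultiDiGraph()

# Entities = nodes, with properties attached via an ontology.
kg.add_node("Barack Obama", height_m=1.85)
kg.add_node("United States", year_of_founding=1776)
kg.add_node("iPhone", price_usd=999)
kg.add_node("Apple")

# Relationships = edges; each carries a label and a confidence weight.
kg.add_edge("Barack Obama", "United States", relation="countryOfBirth", weight=0.99)
kg.add_edge("Barack Obama", "United States", relation="presidentOf", weight=0.98)
kg.add_edge("iPhone", "Apple", relation="isProductLineOf", weight=0.97)

for head, tail, data in kg.edges(data=True):
    print(head, f"--{data['relation']}-->", tail, f"(conf={data['weight']})")
```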