NLP and ML for Non-Experts: An Introduction

© 2019 Chegg, Inc. / All Rights Reserved
Natural Language Processing and Machine Learning for Non-Experts
May 7
Sanghamitra Deb
@sangha_deb,
sdeb@chegg.com

Collaborators: Zoha Zargham, Sakshi Bhargava, Priya Venkat

Natural Language Processing
• Giving structure to
unstructured data.
• Learn properties of the data that make decision making simple.
• Provide concise information to
drive intelligence of different
systems.

Natural Language Processing
All industries generate content.
Healthcare
Finance
Retail
Legal
Education
Marketing
Real Estate
Social Media
Academics

Natural Language Processing: Why?
• Unstructured data cannot be
consumed directly.
• Automate simple and complex
functionalities.
• Users can query text data and
generate BU reports.
• Understand customers better
and take necessary actions for
better experience.

Natural Language Processing: How?
Build a Machine Learning
Pipeline to infer properties of
the text that will solve a
particular problem.

What is
Chegg?
The Chegg logo is a registered trademark of Chegg, Inc. All other trademarks are owned by
their respective owners.
• Chegg is a student-first learning platform.
• Multiple services: question answering,
online tutoring, flashcards, writing,
math solver, internships, etc.
• Content drives product.

Chegg Study

NLP Goal: Create a Knowledge Base
What is a knowledgebase?
All Content
Algebra Physics Statistics Mechanical Eng Accounting ….
Several Tens of Subjects

NLP Goal: Create a Knowledge Base
What is a knowledgebase?
[Figure: topic tree for the subject Statistics. Top-level topics: Probability, Testing, Regression. Lower-level nodes shown: Discrete PDs, Continuous PDs, Sampling, Estimation, Hypothesis Testing, Regression, Binomial, Normal.]

Building a Machine Learning Pipeline for NLP
• Classification using TF-IDF
• Weak Supervision
• Transfer learning techniques
• Thresholding
• Active Learning

Machine learning needs a huge number of examples
This is expensive
1. Collecting data
2. Gathering labelled data
3. Feature engineering
4. Fitting a model
Deep learning replaces feature engineering!
However, DL requires huge amounts of data.

Weak Supervision

What is Weak Supervision?
Using noisy sources of truth to generate training data.
• Rules/heuristics
• Constraints
• Invariances
• Existing knowledge
• Cheaper sources of Labels
• Pre-trained model
• Get labels from the
customer

Snorkel: packaging weak supervision tools
https://hazyresearch.github.io/snorkel/
A System for Fast Training Data Creation
• Weak supervision augments manual generation of labelled data.
• Fast adoption and user-friendly interface.

Weak Supervision Pipeline using Snorkel
• Write multiple functions that can label data
• The functions encode one of the weak supervision techniques
Labeling Functions:
o 𝜆1: Does the word ‘wife’ occur between names?
o 𝜆2: Is this a name in the dictionary of couples?
o 𝜆3: Does the word ‘spouse’ occur in the sentence?
Sources for labeling functions:
• External Knowledge Sources
• Patterns and Dictionaries
• Domain heuristics
• SMEs (Subject Matter Experts) providing valuable inputs

Input Data
1. David and his wife Mel boarded the flight to Adelaide.
2. Former US President Barack Obama and the former first lady Michelle Obama waved to the crowds in Ohio.
Document Parsing: Sentence → Phrases/n-grams
Labeling Functions: applied to the parsed text, drawing on external knowledge sources, patterns and dictionaries, domain heuristics, and SME input.
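
The three labeling-function questions ('wife' between names, name in a dictionary of couples, 'spouse' in the sentence) can be sketched in plain Python on the two example sentences. This sketch deliberately does not use the Snorkel API; the function names, the `KNOWN_COUPLES` dictionary, and the convention that 0 means "no vote" are assumptions for illustration.

```python
import re

SPOUSE, ABSTAIN = 1, 0  # 0 = no vote, mirroring the deck's "unlabelled" value

# Hypothetical dictionary of known couples (an assumption for this sketch).
KNOWN_COUPLES = {frozenset(("Barack Obama", "Michelle Obama"))}

def lf_wife_between(sentence, name1, name2):
    """Does the word 'wife' occur between the two names?"""
    i, j = sentence.find(name1), sentence.find(name2)
    if i == -1 or j == -1:
        return ABSTAIN
    start, end = sorted((i, j))
    return SPOUSE if re.search(r"\bwife\b", sentence[start:end]) else ABSTAIN

def lf_known_couple(sentence, name1, name2):
    """Is this pair of names in the dictionary of couples?"""
    return SPOUSE if frozenset((name1, name2)) in KNOWN_COUPLES else ABSTAIN

def lf_spouse_in_sentence(sentence, name1, name2):
    """Does the word 'spouse' occur anywhere in the sentence?"""
    return SPOUSE if re.search(r"\bspouse\b", sentence.lower()) else ABSTAIN

LABELING_FUNCTIONS = [lf_wife_between, lf_known_couple, lf_spouse_in_sentence]

def label(sentence, name1, name2):
    """Collect one vote per labeling function for a candidate pair."""
    return [lf(sentence, name1, name2) for lf in LABELING_FUNCTIONS]

examples = [
    ("David and his wife Mel boarded the flight to Adelaide.", "David", "Mel"),
    ("Former US President Barack Obama and the former first lady Michelle "
     "Obama waved to the crowds in Ohio.", "Barack Obama", "Michelle Obama"),
]
votes = [label(*example) for example in examples]
```

Each sentence is picked up by a different function: the first by the 'wife' pattern, the second only by the dictionary lookup. This is why several noisy functions are written rather than one.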

• Labeling Functions have different latent accuracies
• We want to learn these accuracies without using
labeled data.
• Essentially, compare agreements and disagreements.
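
As a rough illustration of comparing agreements and disagreements, the sketch below scores each labeling function by how often it agrees with the majority vote. This is a much simpler stand-in than the generative model Snorkel fits, and the label matrix is made up for illustration.

```python
def majority_vote(votes):
    """Return 1, -1, or 0 (tie or no votes) for one row of labeling-function votes."""
    total = sum(votes)
    return (total > 0) - (total < 0)

def agreement_rates(label_matrix):
    """Estimate each function's accuracy as its agreement rate with the majority.

    Rows are documents, columns are labeling functions, and 0 means the
    function abstained. No hand-labeled data is used.
    """
    n_lfs = len(label_matrix[0])
    agreed = [0] * n_lfs
    voted = [0] * n_lfs
    for votes in label_matrix:
        consensus = majority_vote(votes)
        if consensus == 0:
            continue  # a tie carries no information about who is right
        for j, vote in enumerate(votes):
            if vote != 0:
                voted[j] += 1
                agreed[j] += int(vote == consensus)
    return [a / n if n else None for a, n in zip(agreed, voted)]

# Hypothetical votes from three labeling functions on five documents.
label_matrix = [
    [1, 1, 0],
    [1, 1, -1],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, -1, 1],
]
rates = agreement_rates(label_matrix)
```

Here the third function disagrees with the other two more often than not, so it would be given less weight when the votes are combined into probabilistic labels.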

• Build supervised models using probabilistic training labels
• Increase coverage
• Improved number of features
Ratner et al., 2016

Weak Supervision Steps
Example: "Prove that a triangle is equilateral"
• Question word preceding a keyword
• Keyword is contained in the known Algebra: triangles database
• Concept: Algebra: triangles ✓
Broad Stroke Filtering Rules
Language Pattern Recognition: using part-of-speech tags to extract keywords.
‘symmetric matrices’ matches ‘JJ NNS’; ‘real number’ matches ‘JJ NN’
JJ = adjective
NN = noun, singular
NNS = noun, plural
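
A minimal sketch of keyword extraction with a part-of-speech pattern follows. The tags are assigned by hand here so the example stays self-contained; in practice a tagger would supply them. The function name and the example sentence are assumptions.

```python
def extract_phrases(tagged_tokens, pattern):
    """Return every run of words whose tags match the pattern exactly."""
    n = len(pattern)
    phrases = []
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        if [tag for _, tag in window] == list(pattern):
            phrases.append(" ".join(word for word, _ in window))
    return phrases

# Hand-tagged example sentence (Penn Treebank style tags, assigned manually).
tagged = [
    ("symmetric", "JJ"), ("matrices", "NNS"), ("have", "VBP"),
    ("real", "JJ"), ("eigenvalues", "NNS"),
]
keywords = extract_phrases(tagged, ("JJ", "NNS"))
```

The adjective followed by plural noun pattern pulls out both candidate keywords and skips the verb.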

How do the rules work?
[Figure: bar chart of the training set, with document counts (axis 0 to 200) for labels 1, 0 and -1]
1: Class 1, document belongs to Algebra: triangles
0: unlabelled data
-1: Class 2, document does not belong to Algebra: triangles.
Rule 1: All documents with words/phrases contained in
Algebra: triangles database preceded by question
words or phrases (ex: prove, calculate, etc) fall under the
category of Algebra: triangles
Rule 2: If words or phrases have appeared only once in
your corpus it is not Algebra: triangles.

How do the rules work?
[Figure: bar chart of the training set, with document counts (axis 0 to 200) for labels 1, 0 and -1]
Rule 3: All documents with words/phrases not in the Algebra: triangles database do not fall under the category of Algebra: triangles.
Rule 4: All documents containing more than 4 keywords/phrases from the Algebra: triangles database belong to Algebra: triangles.
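
Rules 1, 3 and 4 can be sketched as functions that each return 1, 0 or -1 for a document. The keyword set standing in for the Algebra: triangles database is hypothetical, "preceded by" is read loosely as "appears anywhere earlier in the document", and Rule 2 is left out because it needs corpus-level counts.

```python
import re

QUESTION_WORDS = {"prove", "calculate"}  # the two examples named on the slide
# Hypothetical stand-in for the Algebra: triangles keyword database.
TRIANGLE_TERMS = {"triangle", "equilateral", "isosceles", "hypotenuse",
                  "angle", "side"}

def tokens(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def rule_1(doc):
    """Database keyword preceded by a question word: label 1."""
    seen_question_word = False
    for token in tokens(doc):
        if token in QUESTION_WORDS:
            seen_question_word = True
        elif token in TRIANGLE_TERMS and seen_question_word:
            return 1
    return 0

def rule_3(doc):
    """No database keyword at all: label -1."""
    return -1 if not any(t in TRIANGLE_TERMS for t in tokens(doc)) else 0

def rule_4(doc):
    """More than 4 distinct database keywords: label 1."""
    return 1 if len(set(tokens(doc)) & TRIANGLE_TERMS) > 4 else 0

def apply_rules(doc):
    """Return the votes of rules 1, 3 and 4, in that order."""
    return [rule_1(doc), rule_3(doc), rule_4(doc)]

votes_triangle = apply_rules("Prove that a triangle is equilateral")
votes_other = apply_rules("Compute the mean of the sample")
```

Documents on which every rule returns 0 stay unlabelled, which is why the training-set chart shows three groups.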

Transfer Learning

Some Deep Learning Terms
• Embeddings/vectors
• LSTM: Long Short-Term Memory, a recurrent neural network well suited to processing sequences of data such as text.
• CNN: Convolutional Neural Networks are regularized versions of multilayer perceptrons.
• Transformer: an attention-based neural network architecture, commonly used for sequence-to-sequence (Seq2Seq) tasks, which transform a given sequence of elements, such as the sequence of words in a sentence, into another sequence.
• Softmax: often used in neural networks to map the non-normalized output of a network to a probability distribution over predicted output classes.
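
The softmax term above can be made concrete with a few lines of Python. The input scores are made up; subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the result.

```python
import math

def softmax(scores):
    """Map raw scores to probabilities that are positive and sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw network outputs for three classes.
probs = softmax([2.0, 1.0, 0.1])
```

The largest raw score receives the largest probability, and the ordering of the classes is preserved.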

What is transfer learning?
Picture from O'Reilly
Transfer learning is the ability to learn a new task from fewer examples using knowledge from a related task that has already been learned.
Humans are very good at transfer learning.
• Know how to code in C++. Learn to code in Python.
• Get a PhD in physics. Learn to do machine learning.
• Know how to play classical piano. Learn to play jazz piano.
Language itself has patterns and coherence. Language models (LMs) learn from them to create embeddings/vectors.

Word2vec
The cat sat on the mat
Proposed in 2013 as an
approximation to language
modeling
vec(king) - vec(man) + vec(woman) ≈ vec(queen)
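
The analogy arithmetic can be tried on toy vectors. The two-dimensional vectors below are hand-made for illustration (one axis loosely "royalty", the other "gender"); they are not trained word2vec embeddings, and real embeddings only satisfy the relation approximately.

```python
import math

# Hand-made toy vectors, not trained embeddings.
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(target, exclude):
    """Return the vocabulary word most similar to target, skipping the inputs."""
    candidates = (word for word in VECTORS if word not in exclude)
    return max(candidates, key=lambda word: cosine(VECTORS[word], target))

# king - man + woman, computed component by component.
target = tuple(
    k - m + w
    for k, m, w in zip(VECTORS["king"], VECTORS["man"], VECTORS["woman"])
)
answer = nearest(target, exclude={"king", "man", "woman"})
```

Excluding the three query words before searching is the usual convention for analogy lookups, since the closest vector is otherwise often one of the inputs.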


Sentence Embeddings
Language Model: given a sentence, predict the next word
Embed → LSTM → Softmax → Predict

Character Embeddings:
Given a word, predicting the next word
Example: The broadway play premiered yesterday
Concatenation of character embeddings
→ Convolution layer with multiple filters
→ Max-over-time pooling layer
→ Softmax
→ Cross entropy between next word and prediction

Domain Specific Optimizations
• Task 1: Given a question, predict the answer.
• Task 2: Given the front of a flash card, predict the back of the flash card.
• Task 3: Predict the concept/topic associated with the content.
• Task 4: Predict the course of a piece of content.
• Task 5: Predict the subject of a piece of content.

Open Source: Puppets on Sesame Street
Contextualized word embeddings.
The tree was burned to a crisp. The morning air is crisp.

ELMo: Embeddings from Language Models: bi-directional LSTM
• Code publicly available: AllenNLP (PyTorch, TensorFlow, Keras)
• Best paper at NAACL in June 2018

BERT: Bidirectional Transformer.
• Code publicly available: Google (TensorFlow, PyTorch, Keras)
• Released towards the end of 2018. Several releases, latest ~ March

GPT-2: Left-to-right transformer models.
• Trained model publicly available. Released code helps you predict the next word given a sentence.

Others: ULMFiT, fastText, etc.

Open Source vs In-house Embeddings

Open Source
Download → Concatenate with existing features → Classify
• Low barrier to starting a project.
• Does not require deep learning knowledge.
• Works well for generic tasks such as sentiment analysis, news category detection, etc.

In-house
Build embeddings (deep learning language model, domain-specific optimization) → Concatenate with existing features → Classify
• Barrier to starting a project is higher.
• Requires deep learning expertise to build the embeddings.
• The embeddings would be optimized for domains such as education, healthcare, etc.

Thresholding
[Figure: probability axis from 0 to 1, marking the default cut at 0.5 and thresholds at 0.3 and 0.7]
• You want your model to be correct.
• It is possible to improve the percentage of correct results at the cost of coverage.
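
The trade-off between correctness and coverage can be shown with a small sketch. Predictions at or above 0.7 are accepted as class 1, those at or below 0.3 as class -1, and everything in between is left undecided. The probabilities and true labels are invented for illustration.

```python
def apply_thresholds(probabilities, low=0.3, high=0.7):
    """Return 1, -1, or None (no decision) for each predicted probability."""
    decisions = []
    for p in probabilities:
        if p >= high:
            decisions.append(1)
        elif p <= low:
            decisions.append(-1)
        else:
            decisions.append(None)
    return decisions

def evaluate(decisions, truth):
    """Return (coverage, accuracy on the accepted predictions)."""
    accepted = [(d, t) for d, t in zip(decisions, truth) if d is not None]
    coverage = len(accepted) / len(truth)
    accuracy = (
        sum(d == t for d, t in accepted) / len(accepted) if accepted else 0.0
    )
    return coverage, accuracy

# Hypothetical model outputs and true labels.
probs = [0.95, 0.80, 0.60, 0.45, 0.20, 0.05]
truth = [1, 1, -1, 1, -1, -1]

baseline = evaluate(apply_thresholds(probs, low=0.5, high=0.5), truth)
thresholded = evaluate(apply_thresholds(probs), truth)
```

With a single cut at 0.5 every document gets a prediction but two of the six are wrong. With the 0.3 and 0.7 thresholds the two uncertain documents are left undecided, and all four accepted predictions are correct.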

Active Learning
1. Train a classifier and
predict on unseen data.
2. Evaluate points close to
the decision boundary
3. Collect SME annotations
on these points and add
them to the training set
Advantages:
• Requires less training data.
• Strategizing can lead to higher coverage of the model space.
Disadvantages:
• Uncertainty sampling is noise seeking: this can lead to fitting the noise in the data.
• Outliers: getting labels on outliers does not improve the model (there are techniques to avoid this issue).
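
Step 2 of the procedure above, finding points close to the decision boundary, can be sketched as uncertainty sampling for a binary classifier. The function name and the probabilities are assumptions; a boundary at probability 0.5 is assumed as well.

```python
def select_for_annotation(probabilities, k):
    """Return the indices of the k predictions closest to the 0.5 boundary."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]

# Hypothetical predicted probabilities on unseen documents.
probs = [0.97, 0.52, 0.10, 0.45, 0.71]
to_annotate = select_for_annotation(probs, k=2)
```

The selected documents would be sent to SMEs for labels and then added to the training set before retraining. As the disadvantages above note, the points picked this way can include noisy examples and outliers.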

Machine Learning Pipeline
Goal: Concept classification
1. Produce training data with Weak Supervision
2. Feature Generation with Transfer Learning
3. Supervised Learning
4. Prob > threshold?
   Yes: prediction accepted; populate database with inferred tags from model
   No: Active Learning: gather more training data

Collaboration
Product/Business
Content team
Iterate
• define Filters
• define rules
• Explain model
performance
What is the
use case?
Product Integration

Applications
• Route appropriate topics to relevant experts for answering questions
• Recommend appropriate topics to a student for practicing before an exam
• Determine the topics in highest demand among students and improve or create content for these topics
• Connect different products based on topic similarity

In conclusion …
There is an infinite amount of content.
Language itself has logic, coherence and meaning.
Human-curated examples for training models are expensive.
Be smart about collecting examples and working with small numbers of examples.
Tackling new use cases becomes less expensive.
Facilitate automation.

Questions
Sanghamitra Deb
@sangha_deb,
sdeb@chegg.com

References
Bayesian Nonparametric Crowdsourcing, Moreno et al. 2014
Weakly supervised classification of rare aortic valve malformations using unlabeled cardiac MRI sequences, Fries et al. 2018
Weak Supervision: The New Programming Paradigm for Machine Learning, Ratner et al. 2017
Character-Aware Neural Language Models, Kim et al. 2015
Deep contextualized word representations, Peters et al. 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al. 2018
Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al. 2013
