Using weak supervision and transfer learning techniques to build knowledge graph to improve student experiences at Chegg by Sanghamitra Deb, Staff Data Scientist @Chegg
With 2.2 million subscribers and two hundred million content views, Chegg is a centralized hub where students come to get help with writing, science, math, and other educational needs. In order to impact a student’s learning capabilities, we present personalized content to students. Student needs are unique based on their learning style, studying environment, and many other factors. Most students will engage with a subset of the products and content available at Chegg.
With respect to the diagram above, if we take the word “cat”, it will be one of the words in the 10,000-word vocabulary, so we can represent it as a 10,000-length one-hot vector. We then connect this input vector to a 300-node hidden layer (if you need to brush up on neural networks, see this tutorial). The weights connecting this layer will be our new word vectors – more on this soon. The activations of the nodes in this hidden layer are simply linear summations of the weighted inputs (i.e. no non-linear activation, like a sigmoid or tanh, is applied). These nodes are then fed into a softmax output layer. During training, we want to change the weights of this neural network so that words surrounding “cat” have a higher probability in the softmax output layer. So, for instance, if our text data set has a lot of Dr. Seuss books, we would want our network to assign large probabilities to words like “the”, “sat” and “on” (given lots of sentences like “the cat sat on the mat”).
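As a rough sketch of the architecture just described (not the speaker’s actual code), the forward pass for a single centre word can be written in a few lines of numpy. VOCAB_SIZE, EMBED_DIM, W_in and W_out are illustrative names, and in practice the two weight matrices are learned during training rather than left random:

```python
import numpy as np

VOCAB_SIZE = 10_000   # vocabulary size from the example above
EMBED_DIM = 300       # hidden-layer width / embedding dimension

# Illustrative weights; in practice these are learned during training.
W_in = np.random.randn(VOCAB_SIZE, EMBED_DIM) * 0.01   # input -> hidden
W_out = np.random.randn(EMBED_DIM, VOCAB_SIZE) * 0.01  # hidden -> softmax

def forward(word_index):
    """Forward pass for one centre word, e.g. the index of 'cat'."""
    one_hot = np.zeros(VOCAB_SIZE)
    one_hot[word_index] = 1.0

    hidden = one_hot @ W_in            # linear summation, no non-linearity
    logits = hidden @ W_out            # a score for every word in the vocab
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()               # softmax over the 10,000 words
    return probs                       # probability of each context word
```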
By training this network, we would be creating a 10,000 x 300 weight matrix connecting the 10,000-length input with the 300-node hidden layer. Each row in this matrix corresponds to a word in our 10,000-word vocabulary – so we have effectively reduced the 10,000-length one-hot vector representations of our words to 300-length vectors. The weight matrix essentially becomes a look-up or encoding table for our words. Not only that, but these weight values contain context information due to the way we’ve trained our network. Once we’ve trained the network, we abandon the softmax layer and just use the 10,000 x 300 weight matrix as our word embedding lookup table.
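A minimal lookup helper under these assumptions (W_in from the sketch above, plus a hypothetical word_to_index dict mapping each vocabulary word to its row) could look like this; because the input is one-hot, multiplying by W_in simply selects a row, so the lookup is a plain row index:

```python
def embed(word, word_to_index, W_in):
    """Return the 300-dimensional embedding for `word` by selecting its
    row in the trained input weight matrix (the softmax layer is discarded)."""
    return W_in[word_to_index[word]]

# e.g. cat_vector = embed("cat", word_to_index, W_in)
```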
For two classes the predicted probability lies in [0, 1]: a probability close to 0 implies one class, and a probability close to 1 implies the other. What if, for a particular piece of content, the probability is 0.55? This implies that the content does not have a pattern that helps the model distinguish between the two classes. Here we do not accept results that lie between thresholds (say 0.35 and 0.65). This gives higher accuracy at the cost of lower coverage.
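A small sketch of this abstention rule, assuming the illustrative 0.35 / 0.65 thresholds mentioned above and using -1 to mark content the model declines to label:

```python
import numpy as np

LOW, HIGH = 0.35, 0.65   # example thresholds from the text above

def confident_labels(probs):
    """Assign a class only when the probability falls outside the
    ambiguous band; return -1 (abstain) for everything in between."""
    probs = np.asarray(probs)
    labels = np.full(probs.shape, -1, dtype=int)   # -1 = no label / abstain
    labels[probs <= LOW] = 0    # confidently class 0
    labels[probs >= HIGH] = 1   # confidently class 1
    return labels

# e.g. confident_labels([0.05, 0.55, 0.9]) -> array([ 0, -1,  1])
# Coverage is the fraction of examples not labelled -1; accuracy is then
# measured only on the covered examples.
```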