Confidential Material – Chegg Inc. © 2018. All Rights Reserved.
Democratizing NLP content modeling with
transfer learning using GPUs
Sanghamitra Deb
Staff Data Scientist
Email: sdeb@chegg.com
Twitter: @sangha_deb
What is Chegg?
Chegg
Chegg Tutors
Flashtools
Metaphase: The chromosomes line up at the equator; centriole fibers attach to the centromeres (where the chromatids are joined to each other).
Front	
Back
Flashcards
Chegg Study
• Democratizing NLP: what does it mean?
• Transfer Learning
• Word2vec
• Sentence and character embeddings
• Word embeddings in context
• Applications
• Knowledgebase creation
• Unique concepts …
Overview
Democratizing NLP with transfer learning
• Giving structure to unstructured data.
• Data analysts should be able to query language data and get insights.
• Machine learning practitioners with no knowledge of language should be able to use features from the NLP data.

Why transfer learning?
• It converts text into vectors, thereby giving structure.
• Transfer learning can be used to solve problems such as text summarization, entity recognition, tagging, keyword extraction, and other downstream classification tasks. This structured data can be queried for further insights.
• The result of transfer learning is typically a featurization of text data at the document, word, or sentence level.
Traditional NLP Pipeline
1. Collecting data → 2. Gathering labelled data → 3. Feature engineering → 4. Fit a model
Traditional Machine Learning Pipeline
1. Collecting data → 2. Gathering labelled data → 3. Feature engineering → 4. Fit a model

Deep learning replaces feature engineering! However, DL requires huge amounts of data.
What is transfer learning?
Traditional machine learning:
Task 1, domain 1 → Model 1 → Results
Task 2, domain 2 → Model 2 → Results
All models are task/domain specific.
What is transfer learning?
Source task/domain → Model 1 → Knowledge
Task 2, domain 2 → Model 2 (reuses the knowledge)
The transferred knowledge: sentence vectors, word vectors.
Classic Example: Computer Vision
Pretrained ImageNet models have been used to achieve state-of-the-art results in tasks such as object detection, semantic segmentation, human pose estimation, and video recognition. At the same time, they have enabled the application of CV to domains where the number of training examples is small and annotation is expensive.
Transfer Learning in NLP: Word2vec
Proposed in 2013 by Mikolov et al. as an approximation to language modeling.

The cat sat on the mat

Related embeddings: GloVe, FastText.
Transfer Learning in NLP: Word2vec

The cat sat on the mat

Take a large corpus of text with, say, a 10,000-word vocabulary:
• Each word is a 10,000-dim one-hot vector.
• Interface it with a 300-node hidden layer (the weights connecting this layer are the word vectors).
• Activations: linear summations of weighted inputs.
• The hidden nodes are fed into a softmax.
• During training the weights are changed such that words surrounding "cat" have a higher probability.

Download Google's pre-trained vectors. Or …
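The training loop the bullets describe can be sketched in a few lines of NumPy. This is a toy stand-in, not the real word2vec implementation: the corpus is the example sentence, the embedding dimension is 8 instead of 300, and there is no negative sampling; the one-hot input, linear hidden layer (whose rows are the word vectors), and softmax output match the slide's description.

```python
import numpy as np

rng = np.random.default_rng(0)

corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))
widx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                 # vocab size; the slide uses 300 hidden nodes

W_in = rng.normal(0, 0.1, (V, D))    # hidden-layer weights = the word vectors
W_out = rng.normal(0, 0.1, (D, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# skip-gram training: for each (center, context) pair within a window of 1,
# raise the probability of the context word under the softmax output
lr = 0.1
for _ in range(300):
    for i, w in enumerate(corpus):
        for j in (i - 1, i + 1):
            if 0 <= j < len(corpus):
                h = W_in[widx[w]]               # linear hidden activation
                p = softmax(h @ W_out)          # distribution over vocab
                grad = p.copy()
                grad[widx[corpus[j]]] -= 1.0    # cross-entropy gradient
                dW_out = np.outer(h, grad)
                dh = W_out @ grad
                W_out -= lr * dW_out
                W_in[widx[w]] -= lr * dh

# words that appear next to "cat" now get higher probability than words that don't
p_cat = softmax(W_in[widx["cat"]] @ W_out)
print(p_cat[widx["sat"]] > p_cat[widx["mat"]])
```

In practice you would not train this yourself at toy scale; you would load the pre-trained vectors the slide mentions.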
Character Embedding
The broadway play premiered yesterday
https://arxiv.org/abs/1508.06615

Architecture (bottom to top):
• Concatenation of character embeddings
• Convolution layer with multiple filters
• Max-over-time pooling layer
• Softmax
Loss: cross-entropy between the next word and the prediction
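The convolution-plus-max-over-time-pooling step can be sketched as follows. This is a minimal NumPy forward pass only, with assumed illustrative sizes (4-dim char embeddings, filter widths 2 and 3, three filters each) and randomly initialized, unlearned filters; in the real model from the paper the embeddings and filters are trained end-to-end.

```python
import numpy as np

rng = np.random.default_rng(1)

text = "the broadway play premiered yesterday"
chars = sorted(set(text))
c2i = {c: i for i, c in enumerate(chars)}
d = 4                                    # char-embedding dimension (illustrative)
E = rng.normal(0, 0.1, (len(chars), d))  # character embedding table

widths, n_filters = (2, 3), 3            # filter widths and count (illustrative)
filters = {w: rng.normal(0, 0.1, (n_filters, w * d)) for w in widths}

def word_vector(word):
    """Concatenate char embeddings, convolve with multiple filter widths,
    then take the max over time for each filter."""
    X = E[[c2i[c] for c in word]]                     # (len(word), d)
    feats = []
    for w, F in filters.items():
        windows = np.stack([X[i:i + w].ravel()        # sliding windows of width w
                            for i in range(len(word) - w + 1)])
        feats.append((windows @ F.T).max(axis=0))     # max-over-time pooling
    return np.concatenate(feats)                      # fixed-size word vector

v = word_vector("broadway")
print(v.shape)
```

Note that every word, whatever its length or spelling, maps to the same fixed-size vector, which is what makes the representation robust to typos.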
Why are character embeddings important?
• The input layer is simplified: instead of a 10,000-sized one-hot vector, we have an input of size < 100.
• At Chegg we have a lot of free-form raw text input by students:
  • Spelling mistakes, student language, etc.
  • The vocabulary can have a lot of variation.
  • Student linguistics evolve fast over time.
  • We have a wide range of subjects with different symbols (for example, Greek letters are common in physics, math, and other STEM subjects).
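The "input of size < 100" point is easy to check: however noisy the text, the character inventory stays small and bounded, while a word vocabulary keeps growing with every new typo. A tiny demonstration on a made-up student-style string (the string itself is invented for illustration):

```python
# Noisy student text: typos, abbreviations, and Greek symbols all appear,
# yet the set of distinct characters stays far below 100.
text = ("pls help!! solve for θ: sin θ = 0.5 thx "
        "Δx = v0*t + 0.5*a*t^2 wat is teh answr")
word_vocab = set(text.split())
char_vocab = set(text)
print(len(word_vocab), len(char_vocab))
```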
Example Slide
Chegg Inc. © 2018. All Rights Reserved.22
Sentence Embeddings
• Autoencoders
• Language models
• Skip-thought vectors
Sentence Embeddings: AutoEncoders/Language Models
An encoder reads the sentence word by word and a softmax over the vocabulary predicts the next word. After learning, the final hidden state (128-dimensional here) represents the sentence vector. Auto-encoders reconstruct the input sentence; language models predict the next word.
Word Embeddings in Context
https://www.gocomics.com/frazz/

In a context-free embedding, "crisp" in the sentences "The morning air is getting crisp" and "getting burned to a crisp" would have the same vector: f(crisp).
In a context-aware model the embedding would be augmented by the context in which the word appears: f(crisp, context).
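The f(crisp) vs f(crisp, context) distinction can be made concrete with a toy stand-in: context-free lookup always returns the same vector, while even a crude context-aware function (here, the word vector shifted by the mean of its neighbors; real models use biLMs) returns different vectors in the two sentences. The random 4-dim vectors are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = {w: rng.normal(size=4) for w in
         "the morning air is getting crisp burned to a".split()}

def context_free(word):
    """f(word): one vector per word, regardless of the sentence."""
    return vocab[word]

def context_aware(word, sent):
    """f(word, context): word vector shifted by the mean of its context."""
    ctx = [w for w in sent if w != word]
    return vocab[word] + np.mean([vocab[w] for w in ctx], axis=0)

s1 = "the morning air is getting crisp".split()
s2 = "getting burned to a crisp".split()

same = np.allclose(context_free("crisp"), context_free("crisp"))
diff = not np.allclose(context_aware("crisp", s1), context_aware("crisp", s2))
print(same, diff)
```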
Word Embeddings in Context: ELMo

Best paper at NAACL 2018. Code publicly available:
• AllenNLP (PyTorch)
• TensorFlow
• Keras
• Chainer
https://allennlp.org/elmo
How does it work?

The character-embedding language model described earlier (https://arxiv.org/abs/1508.06615), with character embeddings, a convolution layer with multiple filters, max-over-time pooling, and a softmax, is the input to ELMo.
ELMo: How does it work?

A bi-directional language model (biLM). ELMo is a task-specific combination of the intermediate layer representations in the biLM.
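The "task-specific combination" is a softmax-weighted sum of the biLM layers, scaled by a scalar: ELMo = γ · Σⱼ sⱼ hⱼ with s = softmax(w). A minimal NumPy sketch, where the layer count (3), token count (5), and 1024-dim size are illustrative and the weights w would be learned for the downstream task:

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def elmo_combine(layers, w, gamma=1.0):
    """Task-specific combination of biLM layer representations:
    ELMo = gamma * sum_j s_j * h_j, with s = softmax(w)."""
    s = softmax(np.asarray(w, dtype=float))
    return gamma * np.tensordot(s, np.asarray(layers), axes=1)

rng = np.random.default_rng(3)
layers = rng.normal(size=(3, 5, 1024))   # 3 biLM layers, 5 tokens, 1024 dims
out = elmo_combine(layers, w=[0.1, 0.5, 0.2])
print(out.shape)
```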
What is the output?
Transfer Learning Pipeline: Then
Transfer Learning Pipeline: Now
Finetuning: use a neural architecture to fine-tune the vectors from each layer; this is common practice in computer vision.
Aggregating: use any technique to aggregate all the layers.
Transfer Learning at a glance
Training
• Classifier: the embeddings contain direct signals from labels. The labels should be related to the task at hand for maximum data efficiency.
• Language model: the embeddings are learned from any text, but with no signals from labels.
• Machine translation: the embeddings are learned from surprisingly copious translation data. The idea is that if the embeddings can translate to a foreign language, they can translate to a classifier task as well.

Neural Architectures
• RNN: models infinite context, but the computation can't be parallelized.
• CNN: models local context and is highly parallelizable; the context can be increased by stacking layers deeply and adding dilated or separable convolutions.
• Transformer: uses self-attention and positional encoding. Good for small documents.

Aggregation
• Mean/Max pool: a simple averaging/maxing of all the context vectors will give you reasonable results.
• Last vector: if the model aggregates information into the last vector, you can simply pop off the last vector as the document embedding.
• Attention: dynamically allocates the importance of the context embeddings before averaging into a document embedding.
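The aggregation strategies are one-liners over a matrix of context vectors. A NumPy sketch, with a random 7×16 matrix standing in for per-token embeddings and a random query vector standing in for the learned attention parameters:

```python
import numpy as np

def mean_pool(H):
    return H.mean(axis=0)            # average of all context vectors

def max_pool(H):
    return H.max(axis=0)             # per-dimension max over context vectors

def last_vector(H):
    return H[-1]                     # pop off the last vector

def attention_pool(H, q):
    scores = H @ q                   # importance score per context vector
    a = np.exp(scores - scores.max())
    a /= a.sum()                     # softmax over positions
    return a @ H                     # weighted average

rng = np.random.default_rng(4)
H = rng.normal(size=(7, 16))         # 7 context vectors, 16 dims (illustrative)
q = rng.normal(size=16)              # attention query (learned in a real model)

docs = [mean_pool(H), max_pool(H), last_vector(H), attention_pool(H, q)]
print([d.shape for d in docs])
```

Each strategy collapses the variable-length sequence into one fixed-size document vector.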
Performance
https://indico.io/blog/more-effective-transfer-learning-for-nlp/
Democratization: Phase 1
• Use the advantages of deep learning even in regimes of small data.
• Feature engineering is automated, increasing the productivity of data scientists.
• These features can be used by a team to productionize an ML model.
Applications
• Creating a knowledgebase
• Finding unique concepts
• Word Sense Disambiguation
• Equation extraction
• Text summarization
• Creating flash cards
• Providing condensed information to tutors/experts to efficiently answer student needs
Creating a knowledgebase
All content → split into several tens of subjects: Algebra, Physics, Statistics, Mechanical Eng, Accounting, ….
What is a knowledgebase?
Statistics
  Probability: Bayes' Theorem, Discrete PDs (e.g. Binomial), Continuous PDs (e.g. Normal), Sampling
  Testing: Estimation, Hypothesis Testing
  Regression
Classification Task
Model building
• Get sentence embeddings for each of the sentences.
• Concatenate ELMo embeddings to the sentence embeddings.
• Do a TF-IDF weighting for the sentences.
• Use a simple classifier such as logistic regression or SVM (to ensure scalability in the production pipeline) for the classification.

Fine-tuning results
• Look at precision, recall, and F1-score.
• For specific product needs, use high-precision results to avoid false positives.

Example (raw student input):
completely factor the following expressions. 1. t^2 + 4tv + 4v^2 2. z^2 + 15z - 54 3. 4x^2 - 8x - 12 + 6x 4. 144 - 9p^2 5. 5c^2 - 24cd - 5d^2 6. w^2 - 17w + 42 7. 256z^2 - 4 - 192z^2 + 3 8. 2a^2c^3 - 14bc^3 + 32c^3d^2 9. 35g^2 + 6g - 9 10. 3j^3 - 51j^2 + 210j use factoring and the zero-product property to solve the following problems. 1. z(z - 1)(z + 3) = 0 2. x^2 - x - 10 = 2 3. 4a^2 - 11a + 6 = 0 4. 9r^2 - 30r + 21 = -4.
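The TF-IDF weighting and simple-classifier steps above can be sketched end to end. This is a toy stand-in: the four documents and their labels are invented, random vectors replace the real sentence/ELMo embeddings, and a hand-rolled gradient-descent logistic regression replaces a production classifier; the pipeline shape (TF-IDF-weighted averaging into a sentence vector, then a linear classifier) follows the slide.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(5)

docs = [
    "factor the quadratic expression completely",
    "solve the equation using the zero product property",
    "compute the sample mean and standard deviation",
    "test the hypothesis at the five percent level",
]
labels = np.array([0, 0, 1, 1])          # 0 = algebra node, 1 = statistics node

vocab = sorted({w for d in docs for w in d.split()})
widx = {w: i for i, w in enumerate(vocab)}
W = rng.normal(size=(len(vocab), 8))     # stand-in word embeddings

df = Counter(w for d in docs for w in set(d.split()))
idf = {w: np.log(len(docs) / df[w]) for w in vocab}

def sent_vec(doc):
    """TF-IDF weighted average of word vectors."""
    tf = Counter(doc.split())
    wts = np.array([tf[w] * idf[w] for w in doc.split()])
    vecs = W[[widx[w] for w in doc.split()]]
    return (wts[:, None] * vecs).sum(0) / (wts.sum() + 1e-9)

X = np.stack([sent_vec(d) for d in docs])

# simple logistic-regression classifier, trained by gradient descent
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    g = p - labels
    w -= 0.5 * X.T @ g / len(docs)
    b -= 0.5 * g.mean()

acc = ((p > 0.5) == labels).mean()       # training accuracy on the toy set
print(acc)
```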
Machine Learning methods used
• Classification using TF-IDF
• Transfer learning techniques
• Weak supervision
• Thresholding
• Active learning

On average we achieved > 80% accuracy across all nodes in the knowledge tree.
FlashCards
Identifying unique concepts
• Clustering feature vectors (context words, sentences) to identify unique topics. Caveat: it is typically only possible to extract a subset of the unique concepts using this method.
• Finding similarities between flash cards using the feature vectors.

Example: "cardiac muscle", for which ~1000 flash cards exist. Dominating topics:
1) Striated muscle of the heart
2) Propels blood
3) …
4) …
5) Mixed bag of topics

Word Sense Disambiguation
• Pleated sheet: cloth, curtain
• Pleated sheet: regular secondary structure in proteins (a topic in biochemistry)
• Circular: mathematical concept
• Circular: a store advertisement
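Finding similarities between flash cards from their feature vectors usually comes down to cosine similarity. A minimal sketch with random stand-ins for the card vectors: a near-duplicate card (the anchor plus small noise) scores higher against the anchor than an unrelated card does.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(6)
anchor = rng.normal(size=32)                 # e.g. a "cardiac muscle" card vector
near = anchor + 0.1 * rng.normal(size=32)    # paraphrase of the same concept
far = rng.normal(size=32)                    # unrelated card

print(cosine(anchor, near) > cosine(anchor, far))
```

Clustering these similarities (e.g. thresholding, or any off-the-shelf clustering on the vectors) is what surfaces the dominating topics.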
Example Slide
Chegg Inc. © 2018. All Rights Reserved.44
Equation Extraction using character embeddings
Math Data: Equation Extraction
…given by P(x) = -x^2 + 150x + 50, where x is…
…given by F(x) = 0.02x^2 + 1.56x + 9.8 where x…
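The slide's approach learns character embeddings for this task; as a crude baseline for comparison (not the method on the slide), a regular expression over the character stream can already pick out simple equation spans like the two above. The pattern below is a hand-written heuristic and would miss many real-world equation formats.

```python
import re

# heuristic: a function name like P(x), an equals sign, then a run of
# digits, letters, operators, carets, dots, and spaces
EQ = re.compile(r"[A-Za-z]\(x\)\s*=\s*[-0-9A-Za-z^+.\s*]+")

text = ("the profit is given by P(x) = -x^2 + 150x + 50, where x is the "
        "number of units, and the cost by F(x) = 0.02x^2 + 1.56x + 9.8")

matches = EQ.findall(text)
print(matches)
```

A character-embedding classifier generalizes past such hand-written patterns, which is the point of the slide.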
• Tagging content with nodes of the knowledge tree provides queryable structure to unstructured text.
• It is possible to perform analytics on volume, student engagement, and the dominance of different concepts at different times of the year from this data.
• Finally, the tagged content forms the basis of recommendation systems in different parts of Chegg.
Democratizing NLP: Phase 2
Conclusion
In real life, small-data problems are more common than big-data problems.

With transfer learning we are able to use the advantages of deep learning, i.e. replace feature engineering, in the small-data regime.

Transfer learning produces structured, dense features from unstructured text; these features can be combined with structured features (e.g. views, clicks, conversions) to produce more robust models.
Questions
Email: sdeb@chegg.com
Twitter: @sangha_deb
Appendix
References
Character embeddings:
https://machinelearningmastery.com/develop-character-based-neural-language-model-keras/
https://medium.com/@surmenok/character-level-convolutional-networks-for-text-classification-d582c0c36ace
https://medium.com/@zhuixiyou/character-level-cnn-with-keras-50391c3adf33

Contextual word embeddings:
https://towardsdatascience.com/elmo-helps-to-further-improve-your-word-embeddings-c6ed2c9df95f

Attention model:
http://jalammar.github.io/illustrated-transformer/

Sentence embedding:
https://blog.myyellowroad.com/unsupervised-sentence-representation-with-deep-learning-104b90079a93

ULMFiT:
http://nlp.fast.ai/category/classification.html
Sentence Embeddings: Skip-grams
Given Sentence_Current (20 dim), predict Sentence_Before (20 dim) and Sentence_After (20 dim).
Sentence Embeddings: Skip-grams
Sentence_prev, Sentence_Current, and Sentence_next (20 dim each) are mapped through word-embedding matrices (150 dim) for the previous, current, and next sentences, producing Emb_prev, Emb_curr, and Emb_next.
Sentence Embeddings: Skip-grams, Architecture
Sentence_prev, Sentence_Current, and Sentence_next are each 20 dim. Emb_curr feeds an Encoder producing a 128-dimensional state; after learning, this 128-dimensional state represents the sentence vector. The state drives Decoder_prev and Decoder_next (128 × 20), each followed by a softmax, yielding Decoder_Final_Output of size 20 × vocab_size.
Sentence Embeddings: Skip-grams, Learning
Sentence_prev and Sentence_next (20 dim each) are compared with Decoder_Final_Output_prev and Decoder_Final_Output_next (each 20 × vocab_size, i.e. 20 × 2000 here). Loss: sparse categorical cross-entropy on each side.
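The sparse categorical cross-entropy used on both decoder outputs is just the mean negative log-probability of the true word IDs. A NumPy sketch with the slide's shapes (20 timesteps, a 2000-word vocabulary) and random stand-in decoder outputs:

```python
import numpy as np

def sparse_categorical_crossentropy(probs, targets):
    """Mean negative log-probability of the true word IDs.
    probs: (timesteps, vocab_size) softmax outputs from a decoder;
    targets: (timesteps,) integer word IDs of the true sentence."""
    return -np.log(probs[np.arange(len(targets)), targets]).mean()

T, V = 20, 2000                              # 20 words, vocab of 2000
rng = np.random.default_rng(7)
logits = rng.normal(size=(T, V))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
targets = rng.integers(0, V, size=T)         # dummy true word IDs

loss = sparse_categorical_crossentropy(probs, targets)
print(loss)
```

"Sparse" refers to the targets being integer IDs rather than one-hot vectors, which avoids materializing a 20 × 2000 label matrix.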
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 

Recently uploaded (20)

ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptxENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Science lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lessonScience lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lesson
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 

Democratizing NLP content modeling with transfer learning using GPUs

  • 1. Chegg Inc. © 2018. All Rights Reserved. 1 Democratizing NLP content modeling with transfer learning using GPUs Sanghamitra Deb, Staff Data Scientist. Email: sdeb@chegg.com Twitter: @sangha_deb
  • 3. Example Slide Chegg Inc. © 2018. All Rights Reserved.3 Chegg
  • 4. Example Slide Chegg Inc. © 2018. All Rights Reserved.4 Chegg Tutors?
  • 5. Example Slide Chegg Inc. © 2018. All Rights Reserved.5 Flashtools Metaphase: The chromosomes line up at the equator; centriole fibers attach to centromeres (where the chromatids are joined to each other) Front Back Flashcards
  • 6. Example Slide Chegg Inc. © 2018. All Rights Reserved.6 Chegg Study
  • 7. Example Slide Chegg Inc. © 2018. All Rights Reserved.7 Chegg Study
  • 8. Example Slide Chegg Inc. © 2018. All Rights Reserved.8 Chegg Study
  • 9. Example Slide Chegg Inc. © 2018. All Rights Reserved.9 Chegg Study
  • 10. Example Slide Chegg Inc. © 2018. All Rights Reserved. • Democratizing NLP : what does it mean? • Transfer Learning • Word2vec • Sentence and character embeddings • Word embeddings in context • Applications • Knowledgebase creation • Unique concepts … 10 Overview
  • 11. Example Slide Chegg Inc. © 2018. All Rights Reserved.11 Democratizing NLP with transfer learning • Giving structure to unstructured data • Data analysts should be able to query language data and get insights. • Machine Learning practitioners with no knowledge of language should be able to use features from the NLP data. Why Transfer Learning? • Converts text into vectors, thereby giving structure. • Transfer learning can be used to solve problems such as text summarization, entity recognition, tagging, keyword extraction, and other downstream classification tasks. This structured data can be queried for further insights. • The result of transfer learning is typically featurization of text data at the document, word, or sentence level.
  • 12. Example Slide Chegg Inc. © 2018. All Rights Reserved.12 Traditional NLP Pipeline 1. Collecting Data 2. Gathering labelled data 3. Feature Engineering 4. Fit a model
  • 13. Example Slide Chegg Inc. © 2018. All Rights Reserved.13 Traditional Machine Learning Pipeline 1. Collecting Data 2. Gathering labelled data 3. Feature Engineering 4. Fit a model Deep learning replaces feature engineering !! However, DL requires huge amounts of data.
  • 14. Example Slide Chegg Inc. © 2018. All Rights Reserved.14 What is transfer learning? Traditional Machine Learning Task 1, domain 1 Model 1 Results Task 2, domain 2 Model 2 Results All models are task/domain specific
  • 15. Example Slide Chegg Inc. © 2018. All Rights Reserved.15 What is transfer learning? Traditional Machine Learning Source Task/domain Model 1 Knowledge Task 2, domain 2 Model 2 All models are task/domain specific Sentence vectors, word vectors
  • 16. Example Slide Chegg Inc. © 2018. All Rights Reserved.16 Classic Example: Computer Vision Pretrained ImageNet models have been used to achieve state-of-the-art results in tasks such as object detection, semantic segmentation, human pose estimation, and video recognition. At the same time, they have enabled the application of CV to domains where the number of training examples is small and annotation is expensive.
  • 17. Example Slide Chegg Inc. © 2018. All Rights Reserved.17 Transfer Learning in NLP: Word2vec Proposed in 2013 by Mikolov et al. as an approximation to language modeling The cat sat on the mat GloVe, FastText
  • 18. Example Slide Chegg Inc. © 2018. All Rights Reserved.18 Transfer Learning in NLP: Word2vec Proposed in 2013 as an approximation to language modeling The cat sat on the mat Large corpus of text ~ say 10,000 words • 10,000-dim one-hot vector • Interface it with a 300-node hidden layer (the weights connecting this layer are the word vectors) • Activations: linear summations of weighted inputs • Nodes are fed into a softmax • During training the weights are changed such that words surrounding "cat" have a higher probability. GloVe, FastText. Or download Google's pre-trained vectors.
  • 19. Example Slide Chegg Inc. © 2018. All Rights Reserved.19 Transfer Learning in NLP: Word2vec Proposed in 2013 as an approximation to language modeling The cat sat on the mat GloVe, FastText
  • 20. Example Slide Chegg Inc. © 2018. All Rights Reserved.20 Character Embedding The broadway play premiered yesterday https://arxiv.org/abs/1508.06615 The broadway play premiered yesterday Softmax Concatenation of character Embeddings ] Convolution Layer with Multiple Filters ] Max over time pooling layer Cross Entropy between next word and prediction
  • 21. Example Slide Chegg Inc. © 2018. All Rights Reserved.21 Why are character embeddings important? • The input layer is simplified: instead of a 10,000-sized one-hot vector, we have an input of size < 100. • At Chegg we have a lot of free-form raw text input by students. • Spelling mistakes, student language, etc. • The vocabulary can have a lot of variation • Student linguistics evolve fast with time • We have a wide range of subjects with different symbols (for example, Greek letters are common in physics, math, and other STEM subjects)
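The input-layer simplification above can be sketched in a few lines: each character maps to an index in a small alphabet rather than each word mapping to a slot in a 10,000-word one-hot vocabulary. The alphabet and padding length below are illustrative assumptions, not Chegg's production setup.

```python
import string

# A small character alphabet (< 100 symbols) replaces a large word vocabulary.
ALPHABET = string.ascii_lowercase + string.digits + " .,?!'-"
char2id = {c: i + 1 for i, c in enumerate(ALPHABET)}  # 0 reserved for padding/unknown

def encode(text, max_len=40):
    """Map text to a fixed-length character-id sequence; unknown symbols
    (typos, Greek letters, etc.) fall back to id 0 instead of growing the vocab."""
    ids = [char2id.get(c, 0) for c in text.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))

x = encode("What is mitosis?")
```

Because unseen characters map to a reserved id rather than failing a vocabulary lookup, spelling mistakes and unusual symbols degrade gracefully, which is the property the slide highlights.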
  • 22. Example Slide Chegg Inc. © 2018. All Rights Reserved.22 Sentence Embeddings Autoencoders Language Models Skip Thought Vectors
  • 23. Example Slide Chegg Inc. © 2018. All Rights Reserved.23 Sentence Embeddings: AutoEncoders/Language Models …............. 150 20 20 Softmax …............. Next word After learning, this 128-dimensional state represents the sentence vector Auto-encoders Language-Models
  • 24. Example Slide Chegg Inc. © 2018. All Rights Reserved.24 Word Embeddings in Context https://www.gocomics.com/frazz/ In a context-free embedding, "crisp" in the sentences "The morning air is getting crisp" and "getting burned to a crisp" would have the same vector: f(crisp) In a context-aware model the embedding would be augmented by the context in which it appears: f(crisp, context)
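The f(crisp) vs. f(crisp, context) distinction can be made concrete with a toy sketch: augment a static word vector with a summary of its sentence, so the same word gets different representations in different sentences. The random vectors and the mean-of-context summary below are stand-ins for illustration only; this is not how ELMo actually computes its representations.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy static (context-free) word vectors -- random placeholders, not real embeddings.
vocab = ["the", "morning", "air", "is", "getting", "crisp", "burned", "to", "a"]
static = {w: rng.normal(size=4) for w in vocab}

def contextual(word, sentence):
    """f(word, context): a crude stand-in for a contextual model --
    concatenate the static vector with the mean of the sentence's vectors."""
    ctx = np.mean([static[w] for w in sentence], axis=0)
    return np.concatenate([static[word], ctx])

s1 = "the morning air is getting crisp".split()
s2 = "getting burned to a crisp".split()

v1 = contextual("crisp", s1)  # "crisp" as in crisp morning air
v2 = contextual("crisp", s2)  # "crisp" as in burned to a crisp
```

The static halves of v1 and v2 are identical (that is f(crisp)), while the context halves differ, which is the behavior a context-aware embedding provides.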
  • 25. Example Slide Chegg Inc. © 2018. All Rights Reserved.25 Word Embeddings in Context : Elmo Best paper at NAACL 2018 Code publicly available: • Allennlp – pytorch • Tensorflow • Keras • Chainer https://allennlp.org/elmo
  • 26. Example Slide Chegg Inc. © 2018. All Rights Reserved.26 How does it work? The broadway play premiered yesterday https://arxiv.org/abs/1508.06615 The broadway play premiered yesterday Softmax Concatenation of character Embeddings ] Convolution Layer with Multiple Filters ] Max over time pooling layer Cross Entropy between next word and prediction Input to Elmo
  • 27. Example Slide Chegg Inc. © 2018. All Rights Reserved.27 Elmo: How does it work? Bi-directional Language Model ELMo is a task-specific combination of the intermediate layer representations in the biLM.
  • 28. Example Slide Chegg Inc. © 2018. All Rights Reserved.28 What is the output?
  • 29. Example Slide Chegg Inc. © 2018. All Rights Reserved.29 What is the output?
  • 30. Example Slide Chegg Inc. © 2018. All Rights Reserved.30 Transfer Learning Pipeline: Then
  • 31. Example Slide Chegg Inc. © 2018. All Rights Reserved.31 Transfer Learning Pipeline: Now Finetuning Aggregating Use a neural architecture to fine-tune the vectors from each layer; this is common practice in computer vision Use any technique to aggregate all the layers
  • 32. Example Slide Chegg Inc. © 2018. All Rights Reserved.32 Transfer Learning at a glance Training Classifier: The embeddings contain direct signals from labels. The labels should be related to the task at hand for maximum data efficiency. Language model: The embeddings are learned from any text, but with no signals from labels. Machine Translation: The embeddings are learned from surprisingly copious translation data. The idea is that if the embeddings can translate to a foreign language, they can transfer to a classifier task as well. Neural Architectures RNN: Model with infinite context, but the computation can't be parallelized. CNN: Model with local context that is highly parallelizable; the context can be increased by stacking layers deeply and adding either dilated or separable convolutions. Transformer: uses self-attention and positional encoding. Good for small documents. Mean/Max Pool: A simple averaging/maxing of all the context vectors will give you reasonable results. Last Vector: If the model aggregates information to the last vector, you can simply pop off the last vector as the document embedding. Attention: Attention dynamically allocates the importance of the context embeddings before averaging to a document embedding. Aggregation
  • 33. Example Slide Chegg Inc. © 2018. All Rights Reserved.33 Transfer Learning at a glance Training Classifier: The embeddings contain direct signals from labels. The labels should be related to the task at hand for maximum data efficiency. Language model: The embeddings are learned from any text, but with no signals from labels. Machine Translation: The embeddings are learned from surprisingly copious translation data. The idea is that if the embeddings can translate to a foreign language, they can transfer to a classifier task as well. Mean/Max Pool: A simple averaging/maxing of all the context vectors will give you reasonable results. Last Vector: If the model aggregates information to the last vector, you can simply pop off the last vector as the document embedding. Attention: Attention dynamically allocates the importance of the context embeddings before averaging to a document embedding. Aggregation
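The three aggregation strategies on this slide can be sketched in NumPy over a matrix of per-token context vectors. The matrix and the attention query vector below are random placeholders (in a real model the query would be a learned parameter); the sketch only shows the mechanics of each aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Context embeddings for a 5-token document, 16 dims each (toy values).
H = rng.normal(size=(5, 16))

# Mean / max pooling: simple averaging or maxing of all context vectors.
doc_mean = H.mean(axis=0)
doc_max = H.max(axis=0)

# Last vector: pop off the final state as the document embedding.
doc_last = H[-1]

# Attention: dynamically weight the context vectors before averaging.
# `query` stands in for a learned parameter vector.
query = rng.normal(size=16)
scores = H @ query
weights = np.exp(scores - scores.max())
weights /= weights.sum()          # softmax over tokens
doc_attn = weights @ H            # weighted average of context vectors
```

All four choices produce a single fixed-size document embedding from a variable-length sequence, which is what a downstream classifier needs.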
  • 34. Example Slide Chegg Inc. © 2018. All Rights Reserved.34 Performance https://indico.io/blog/more-effective-transfer-learning-for-nlp/
  • 35. Example Slide Chegg Inc. © 2018. All Rights Reserved.35 Democratization: Phase 1 • Use the advantages of deep learning even in regimes of small data • Feature engineering is automated, increasing the productivity of data scientists • These features can be used by a team to productionize an ML model
  • 36. WIP: DRAFT SLIDES Confidential Material – Chegg Inc. © 2005 – 2016. All Rights Reserved. Applications 36c
  • 37. Example Slide Chegg Inc. © 2018. All Rights Reserved. • Creating a knowledgebase • Finding unique concepts. • Word Sense Disambiguation • Equation extraction • Text summarization • Creating flash cards • Providing condensed information to tutors/experts to efficiently answer student needs. 37 Applications
  • 38. Example Slide Chegg Inc. © 2018. All Rights Reserved.38 Creating a knowledgebase All Content Algebra Physics Statistics Mechanical Eng Accounting …. Several Tens of Subjects
  • 39. Example Slide Chegg Inc. © 2018. All Rights Reserved.39 Creating a knowledgebase All Content Algebra Physics Statistics Mechanical Eng Accounting …. Several Tens of Subjects
  • 40. Example Slide Chegg Inc. © 2018. All Rights Reserved.40 What is a knowledgebase? Statistics Probability Testing Regression Probability Discrete PDs Continuous PD’s Sampling Estimation Hypothesis Testing Regression Binomial NormalBayes Theorem
  • 41. Example Slide Chegg Inc. © 2018. All Rights Reserved.41 Classification Task • Get sentence embeddings for each of the sentences. • Concatenate ELMo embeddings to the sentence embeddings. • Apply a tf-idf weighting to the sentences. • Use a simple classifier such as logistic regression or SVM (to ensure scalability in the production pipeline) for the classification Model Building Fine Tuning results • Look at precision, recall and f1-score • For specific product needs use high-precision results to avoid false positives Example: completely factor the following expressions. 1. t^2 + 4tv + 4v^2 2. z^2 + 15z - 54 3. 4x^2 - 8x - 12 + 6x 4. 144 - 9p^2 5. 5c^2 - 24cd - 5d^2 6. w^2 - 17w + 42 7. 256z^2 - 4 - 192z^2 + 3 8. 2a^2c^3 - 14bc^3 + 32c^3d^2 9. 35g^2 + 6g - 9 10. 3j^3 - 51j^2 + 210j Use factoring and the zero-product property to solve the following problems. 1. z(z - 1)(z + 3) = 0 2. x^2 - x - 10 = 2 3. 4a^2 - 11a + 6 = 0 4. 9r^2 - 30r + 21 = -4.
  • 42. Example Slide Chegg Inc. © 2018. All Rights Reserved. • Classification using tf-idf • Transfer learning techniques • Weak Supervision • Thresholding • Active Learning 42 Machine Learning methods used On average we achieved > 80% accuracy for all nodes in the knowledge tree
  • 43. Example Slide Chegg Inc. © 2018. All Rights Reserved.43 FlashCards Identifying unique concepts Word Sense Disambiguation • Clustering feature vectors (context words, sentences) to identify unique topics. Caveat: it is typically only possible to extract a subset of unique concepts using this method. • Finding similarities between flash cards using the feature vectors Example: Cardiac muscle ~1000 flash cards exist Dominating topics: 1) Striated muscle of the heart 2) Propels blood 3) … 4) … 5) Mixed bag of topics Pleated sheet: cloth, curtain Pleated sheet: regular secondary structure in proteins --- a topic in biochemistry Circular: mathematical concept Circular: a store advertisement
  • 44. Example Slide Chegg Inc. © 2018. All Rights Reserved.44 Equation Extraction using character embeddings Math Data: Equation Extraction …given by P(x) = -x^2 + 150x + 50, where x is… …given by F(x) = 0.02x^2 + 1.56x + 9.8, where x…
  • 45. Example Slide Chegg Inc. © 2018. All Rights Reserved. • Tagging content with nodes of the knowledge tree provides queryable structure to unstructured text • It is possible to perform analytics on this data: volume, student engagement, and the dominance of different concepts at different times of the year • Finally, the tagged content forms the basis of recommendation systems in different parts of Chegg 45 Democratizing NLP: Phase 2
  • 46. Example Slide Chegg Inc. © 2018. All Rights Reserved.46 Conclusion In real life, small-data problems are more common than big-data problems. With transfer learning we are able to use the advantages of deep learning, i.e., replace feature engineering, in the small-data regime Transfer learning produces structured dense features from unstructured text; these features can be combined with structured features (e.g., views, clicks, conversion) to produce more robust models
  • 47. WIP: DRAFT SLIDES Confidential Material – Chegg Inc. © 2005 – 2016. All Rights Reserved. Questions 47c Email: sdeb@chegg.com Twitter: @sangha_deb
  • 48. WIP: DRAFT SLIDES Confidential Material – Chegg Inc. © 2005 – 2016. All Rights Reserved. Appendix 48c
  • 49. Example Slide Chegg Inc. © 2018. All Rights Reserved.49 References Character Embeddings: https://machinelearningmastery.com/develop-character-based-neural-language-model-keras/ https://medium.com/@surmenok/character-level-convolutional-networks-for-text-classification-d582c0c36ace https://medium.com/@zhuixiyou/character-level-cnn-with-keras-50391c3adf33 Contextual Word Embeddings: https://towardsdatascience.com/elmo-helps-to-further-improve-your-word-embeddings-c6ed2c9df95f Attention model http://jalammar.github.io/illustrated-transformer/ Sentence Embedding https://blog.myyellowroad.com/unsupervised-sentence-representation-with-deep-learning-104b90079a93 Ulmfit http://nlp.fast.ai/category/classification.html
  • 50. Example Slide Chegg Inc. © 2018. All Rights Reserved.50 Sentence Embeddings: Skip-grams Sentence_Before …............. 20 dim Sentence_Current …............. 20 dim Sentence_After …............. 20 dim Given: Predict Predict
  • 51. Example Slide Chegg Inc. © 2018. All Rights Reserved.51 Sentence Embeddings: Skip-grams Sentence_prev Sentence_Current Sentence_next …............. …............. …............. 20 dim 20 dim 20 dim Word Embedding matrix for previous sentence. 150 150 Word Embedding matrix for current sentence. Emb_prev Emb_curr Emb_next Word Embedding matrix for next sentence.
  • 52. Example Slide Chegg Inc. © 2018. All Rights Reserved.52 Sentence Embeddings: Skip-grams, Architecture Sentence_prev Sentence_Current : 20 dim Sentence_next …............. …............. …............. 20 dim 20 dim Emb_curr Encoder Emb_prev 128 Emb_next Softmax Softmax Decoder_Final_Output: 20 * vocab_size Decoder_prev After learning, this 128-dimensional state represents the sentence vector Decoder_next 128*20 128*20
  • 53. Example Slide Chegg Inc. © 2018. All Rights Reserved. Sentence Embeddings: Skip-grams, Learning Sentence_prev …............. 20 dim Sentence_next …............. 20 dim 20*2000 Decoder_Final_Output_prev: 20 * vocab_size Decoder_Final_Output_next: 20 * vocab_size 20*2000 Loss: sparse categorical crossentropy Loss: sparse categorical crossentropy