Introduced a state-of-the-art text classifier that addresses language semantics and polysemy in Natural Language Processing tasks. Used contextual word representations to achieve a ~5% improvement in metrics over existing models.
2. Overview
• Natural language refers to the way we humans communicate with each other.
• Natural Language Processing has numerous real-life applications: automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.
• Deep learning can make sense of data using multiple layers of abstraction.
4. Neural Language Modeling: The ML Way
• Two main techniques for understanding natural language:
▪ Syntactic Analysis (Syntax): Analyzing natural language conforming to
the rules of a formal grammar.
▪ Semantic Analysis: Understanding the meaning and interpretation of
words, signs, and sentence structure.
5. Pre-Processing Data
• It is necessary to highlight the required attributes from the dataset.
• Steps for cleaning the data (a minimal sketch follows the list):
▪ Tokenization
▪ Remove Punctuation
▪ Remove Stop words
▪ Stemming
▪ Lemmatizing
▪ Regex
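A minimal sketch of this cleaning pipeline using NLTK; the library choice and the regex pattern are illustrative assumptions, not details from the original deck:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

def preprocess(text):
    # Regex: keep only letters and whitespace (also strips punctuation).
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Tokenization.
    tokens = nltk.word_tokenize(text.lower())
    # Remove stop words.
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]
    # Stemming and lemmatizing (shown together here; in practice pick one).
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]

print(preprocess("The cats were eating near the river bank!"))
```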
6. Modeling Challenges
• We were wrestling here with the following challenges:
▪ Using as much relevant evidence as possible.
▪ Pooling evidence between words.
▪ Modeling polysemy, the coexistence of many possible meanings for a word or phrase.
7. Representing Words
• Word Embeddings: represented data with one-hot or two-hot vectors, TF-IDF scaling, or a co-occurrence matrix, e.g.,
– dog = (0,0,0,0,1,0,0,0,0,....)
– cat = (0,0,0,0,0,0,0,1,0,....)
– eat = (0,1,0,0,0,0,0,0,0,....)
• That’s a large vector!
• Remedies
– Limit the vocabulary to, say, the 20,000 most frequent words and map the rest to OTHER (see the sketch below).
– Place words in sqrt(n) classes, apply dimensionality reduction, and more.
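As a concrete illustration of these sparse representations and the vocabulary cut-off, here is a small hedged sketch; the toy vocabulary and the OTHER fallback index are illustrative, not the exact setup used in the project:

```python
import numpy as np

# Toy vocabulary; a real system would keep, say, the 20,000 most frequent
# words and map everything else to a single OTHER index.
vocab = ["OTHER", "dog", "cat", "eat", "bank", "river"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # Unknown words collapse to the OTHER slot (index 0).
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[word_to_index.get(word, 0)] = 1.0
    return vec

print(one_hot("dog"))      # [0. 1. 0. 0. 0. 0.]
print(one_hot("unicorn"))  # falls back to OTHER -> [1. 0. 0. 0. 0. 0.]
```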
8. Representing Words
Beauty of Word Embeddings:
Capture some sort of relationship between words, be it meaning,
morphology, context, or some other kind of relationship.
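One way to see such relationships is to compare dense embedding vectors with cosine similarity; the vectors below are made up purely for illustration:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: 1.0 means identical direction, 0.0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Tiny made-up dense embeddings (real ones have hundreds of dimensions).
dog = np.array([0.8, 0.1, 0.3])
cat = np.array([0.7, 0.2, 0.4])
eat = np.array([0.1, 0.9, 0.2])

print(cosine(dog, cat))  # relatively high: related animals
print(cosine(dog, eat))  # lower: less related
```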
9. Representing Words
ELMo: Deep Contextualized Word Representations
10. What is ELMo?
Deep contextualized word representations
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner,
Christopher Clark, Kenton Lee, Luke Zettlemoyer.
Best Paper at NAACL 2018
11. ELMo (Embeddings from Language MOdels)
• Deep contextualized word representations that model:
▪ Complex characteristics of word use
▪ How these uses vary across linguistic contexts (polysemy)
I must make a deposit at the bank.
Let’s have lunch beside a river bank.
• The word vectors are learned functions of the internal states of a deep bidirectional language model (biLM).
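The paper combines the biLM's internal layer states for a token into a single task-specific vector: a softmax-weighted sum over the layers, scaled by a learned scalar. A minimal numpy sketch of that combination (shapes and values are illustrative):

```python
import numpy as np

def elmo_combine(layer_states, s, gamma):
    """Collapse the biLM layer states for one token into a single vector.

    layer_states: array of shape (L+1, dim) -- the token-embedding layer plus
                  the hidden states of each biLSTM layer.
    s:            unnormalized layer weights (learned per task).
    gamma:        learned task-specific scalar.
    """
    weights = np.exp(s) / np.exp(s).sum()          # softmax over layers
    return gamma * np.tensordot(weights, layer_states, axes=1)

# Illustrative values: 3 layers (token layer + 2 biLSTM layers), dim 4.
states = np.random.randn(3, 4)
print(elmo_combine(states, s=np.array([0.1, 0.5, -0.2]), gamma=1.0))
```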
14. 2-layer bidirectional LSTM backbone
• The red box represents the forward recurrent unit.
• The blue box represents the backward recurrent unit.
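A simplified Keras-style sketch of such a backbone, with the forward and backward recurrent units wrapped in a Bidirectional layer. This is only an approximation written against the current tf.keras API (not the exact biLM architecture or the TF 1.8 / Keras 2.0 setup used in the project), and the layer sizes are illustrative:

```python
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM
from tensorflow.keras.models import Model

tokens = Input(shape=(None,), dtype="int32")             # token ids
x = Embedding(input_dim=20000, output_dim=128)(tokens)   # stand-in for the character-based token representation
# Two stacked bidirectional LSTM layers; each contains a forward and a
# backward recurrent unit over the sequence.
h1 = Bidirectional(LSTM(256, return_sequences=True))(x)
h2 = Bidirectional(LSTM(256, return_sequences=True))(h1)
backbone = Model(tokens, h2)
backbone.summary()
```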
15. Add Residual Connection
• A residual connection is added between the LSTM layers.
• The input to the first layer is added to its output before being passed on as the input to the second layer.
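Extending the backbone sketch above, the residual connection can be expressed by adding the first layer's input to its output before feeding the second layer. Again a hedged illustration with made-up sizes; the input is given the same width as the biLSTM output so the element-wise addition is well defined:

```python
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Add
from tensorflow.keras.models import Model

tokens = Input(shape=(None,), dtype="int32")
# Width matches the biLSTM output (2 * 256) so the residual add is valid.
x = Embedding(input_dim=20000, output_dim=512)(tokens)
h1 = Bidirectional(LSTM(256, return_sequences=True))(x)
# Residual connection: the first layer's input is added to its output
# before being passed on as the input to the second layer.
h2_in = Add()([x, h1])
h2 = Bidirectional(LSTM(256, return_sequences=True))(h2_in)
model = Model(tokens, h2)
```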
16. Transformation
Transformations applied to each token before it is provided as input to the first LSTM layer:
• Convert each token to an appropriate representation using character embeddings.
• Max pooling, a sample-based discretization process, reduces the convolved character features to a fixed-size token representation.
• Highway networks use learned gating mechanisms to regulate information flow, inspired by Long Short-Term Memory (LSTM) recurrent neural networks.
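A hedged sketch of this token transformation (character embeddings, a convolution, max pooling over character positions, then a single highway layer). It uses the current tf.keras API and illustrative dimensions and filter sizes, not the exact values from the paper or the project:

```python
from tensorflow.keras.layers import (Input, Embedding, Conv1D,
                                     GlobalMaxPooling1D, Dense, Layer)
from tensorflow.keras.models import Model

class Highway(Layer):
    """Highway layer: a learned gate mixes the transformed and untransformed input."""
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.transform = Dense(dim, activation="relu")
        self.gate = Dense(dim, activation="sigmoid")
    def call(self, x):
        t = self.gate(x)                       # how much to transform
        return t * self.transform(x) + (1 - t) * x

chars = Input(shape=(50,), dtype="int32")             # character ids of one token
e = Embedding(input_dim=262, output_dim=16)(chars)    # character embeddings (illustrative vocab size)
c = Conv1D(filters=128, kernel_size=3, activation="relu")(e)
p = GlobalMaxPooling1D()(c)                           # max pooling over character positions
token_repr = Highway()(p)                             # highway gating
char_encoder = Model(chars, token_repr)
char_encoder.summary()
```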
19. NLP Task Specific Model
• Built models using ELMo on the two tasks below:
• Sentiment Analysis
• Email Spam Classification
• Used TensorFlow v1.8 and the Keras 2.0 API.
• CUDA and cuDNN provided GPU acceleration on an Nvidia GeForce GTX 1070.
• Custom implementation of a confusion matrix computed every epoch (a hedged sketch follows below).
• Calculated precision, recall, and F1-score in addition to accuracy, so the models could also be evaluated reliably on imbalanced data.
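A hedged sketch of how such per-epoch evaluation could be done with a Keras callback and scikit-learn; the callback name, the binary threshold, and the validation-data handling are assumptions, not the authors' exact implementation:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from tensorflow.keras.callbacks import Callback

class EpochMetrics(Callback):
    """After every epoch, print the confusion matrix, precision, recall, and F1."""
    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val, self.y_val = x_val, y_val

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.x_val, verbose=0)
        preds = (probs.ravel() > 0.5).astype(int)   # binary classification
        print(confusion_matrix(self.y_val, preds))
        p, r, f1, _ = precision_recall_fscore_support(
            self.y_val, preds, average="binary")
        print(f"epoch {epoch}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Usage: model.fit(x_train, y_train, callbacks=[EpochMetrics(x_val, y_val)])
```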
20. Result and Comparison
[Bar chart comparing F1-Score and Accuracy for Sentiment Analysis and Email Spam Classification]

Task                              Previous SOTA   ELMo Result
Sentiment Analysis (F1-Score)     0.53            0.547
Email Classification (Accuracy)   0.954           0.99
23. Final Thoughts
• The experimental results really speak to the power of the ELMo concept.
• ELMo representations were integrated into existing NLP tasks: Sentiment Analysis and Email Spam Classification.
• In both cases, the ELMo models achieved state-of-the-art performance!
• ELMo follows an interesting vein of deep learning research related to transfer
learning.
• ELMo is an important paper because it takes the first steps in demonstrating that language-model transfer learning may be the ImageNet equivalent for natural language processing.