Sujit Pal - Applying the four-step "Embed, Encode, Attend, Predict" framework to predict document similarity

Presented at PyData Seattle 2017

Embed, Encode, Attend, Predict – applying the 4-step NLP recipe for text classification and similarity
Presented by Sujit Pal, Elsevier Labs
July 6, 2017

INSPIRATION

AGENDA
• NLP pipelines before Deep Learning
• Deconstructing the “Embed, Encode, Attend, Predict” pipeline
• Example #1: Document Classification
• Example #2: Document Similarity
• Example #3: Sentence Similarity

NLP PIPELINES BEFORE DEEP LEARNING
• Document-collection centric
• Based on Information Retrieval
• Document collection to matrix
• Densify using feature reduction
• Feed into SVM for classification, etc.

NLP PIPELINES BEFORE DEEP LEARNING
• Idea borrowed from Machine Learning (ML)
• Represent categorical variables (words) as 1-hot vectors
• Represent sentences as matrices of 1-hot word vectors
• No distributional semantics

WORD EMBEDDINGS
• Word2Vec – predict a word from its context (CBOW) or the context from a word (skip-gram, shown here).
• Trained on large corpora and publicly available.
• Other embeddings – GloVe, FastText.

STEP #1: EMBED
• Replace 1-hot vectors with 3rd-party embeddings.
• Embeddings encode distributional semantics.
• The sentence is represented as a sequence of dense word vectors.
• Converts from word ID to word vector.

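A minimal tf.keras sketch of the Embed step. The sizes, variable names, and the random `embedding_matrix` stand-in are illustrative assumptions, not taken from the talk's repository.

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding
from tensorflow.keras.initializers import Constant
from tensorflow.keras.models import Model

vocab_size, embed_dim, max_len = 10000, 300, 60          # hypothetical sizes
# Stand-in for a pre-trained Word2Vec/GloVe matrix, one row per word ID.
embedding_matrix = np.random.rand(vocab_size, embed_dim)

word_ids = Input(shape=(max_len,), dtype="int32")         # sentence as a sequence of word IDs
word_vectors = Embedding(vocab_size, embed_dim,
                         embeddings_initializer=Constant(embedding_matrix),
                         trainable=False)(word_ids)       # keep the 3rd-party embeddings fixed
embed_model = Model(word_ids, word_vectors)               # (max_len,) IDs -> (max_len, embed_dim) vectors
```
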
STEP #2: ENCODE
• Bag of words – concatenate word vectors together.
• The Encode step computes a representation of the sentence as a matrix.
• Each row of the sentence matrix encodes the meaning of one word in the context of the sentence.
• Use either an LSTM (Long Short-Term Memory) or a GRU (Gated Recurrent Unit).
• A bidirectional encoder processes words left-to-right and right-to-left and concatenates the two results.

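A matching tf.keras sketch of the Encode step; the 128-unit size is an arbitrary choice and `word_vectors` stands for the output of the Embed sketch above.

```python
from tensorflow.keras.layers import Input, Bidirectional, LSTM

max_len, embed_dim = 60, 300                              # same hypothetical sizes as above
word_vectors = Input(shape=(max_len, embed_dim))          # output of the Embed step
# return_sequences=True keeps one output row per timestep, so the encoder
# produces a sentence *matrix* of context-aware word representations rather
# than a single vector; forward and backward states are concatenated.
sentence_matrix = Bidirectional(LSTM(128, return_sequences=True))(word_vectors)
# shape: (batch, max_len, 256)
```
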
STEP #3: ATTEND
• A reduction operation – could be a Sum or Global Average/Max Pooling instead.
• Attention takes an auxiliary context vector as input.
• Attention tells the reduction what to keep, to minimize information loss.
• Different kinds – matrix, matrix + vector (learned), matrix + vector (provided), matrix + matrix.

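For reference, the attention-free reductions mentioned above look like this in tf.keras (shapes are illustrative):

```python
from tensorflow.keras.layers import Input, GlobalAveragePooling1D, GlobalMaxPooling1D

sentence_matrix = Input(shape=(60, 256))                  # (timesteps, features) from the Encode step
# Attention-free reductions: collapse the timestep axis to a single vector,
# treating every timestep as equally important.
avg_vector = GlobalAveragePooling1D()(sentence_matrix)    # mean over timesteps
max_vector = GlobalMaxPooling1D()(sentence_matrix)        # max over timesteps
```
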
ATTENTION: MATRIX
• Proposed by Raffel, et al.
• Intuition: select the most important element from each timestep.
• Learnable weights W and b, depending on the target.
• Code on Github

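A NumPy sketch of matrix-only attention in the spirit of Raffel & Ellis; the exact scoring function in the repo's Keras layer may differ, and W and b here play the role of the learned weights the slide mentions.

```python
import numpy as np
from scipy.special import softmax

def matrix_attention(H, W, b):
    """H: (timesteps, dims) sentence matrix; W: (dims,) and b: scalar are learned.
    Scores one importance weight per timestep, then returns the weighted sum
    of the rows of H as the reduced sentence vector."""
    scores = np.tanh(H @ W + b)        # (timesteps,) unnormalized importance
    alphas = softmax(scores)           # attention weights, sum to 1
    return alphas @ H                  # (dims,) reduced vector

# tiny usage example with random data
H = np.random.rand(5, 8)
vec = matrix_attention(H, np.random.rand(8), 0.1)   # -> shape (8,)
```
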
ATTENTION: MATRIX + VECTOR (LEARNED)
• Proposed by Lin, et al.
• Intuition: select the most important element from each timestep and weight it with another learned vector u.
• Code on Github

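The same sketch extended with a learned context vector u, in the spirit of Lin et al.; the projection size p is an arbitrary assumption.

```python
import numpy as np
from scipy.special import softmax

def matrix_vector_attention(H, W, u):
    """H: (timesteps, dims); W: (dims, p) and u: (p,) are both learned.
    Projected timesteps are scored against the learned vector u before the
    softmax, then used to weight the rows of H."""
    scores = np.tanh(H @ W) @ u        # (timesteps,)
    alphas = softmax(scores)
    return alphas @ H                  # (dims,) reduced vector
```
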
ATTENTION: MATRIX + VECTOR (PROVIDED)
• Proposed by Cho, et al.
• Intuition: select the most important element from each timestep and weight it with a learned multiple of a provided context vector.
• Code on Github

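A sketch of one common way to mix a provided context into the score; here the context vector c comes from outside (e.g. the encoding of the other text in a pair) and W, U, v are the learned weights. This is an assumed formulation, not copied from the talk's code.

```python
import numpy as np
from scipy.special import softmax

def matrix_context_attention(H, c, W, U, v):
    """H: (timesteps, dims); c: (ctx_dims,) is *provided*, not learned.
    W: (dims, p), U: (ctx_dims, p) and v: (p,) are learned and mix each
    timestep with the context before scoring."""
    scores = np.tanh(H @ W + c @ U) @ v   # (timesteps,)
    alphas = softmax(scores)
    return alphas @ H                      # (dims,) reduced vector
```
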
ATTENTION: MATRIX + MATRIX
• Proposed by Parikh, et al.
• Intuition: build an alignment (similarity) matrix by multiplying learned projections of the two matrices, compute context vectors from the alignment matrix, and mix them with the original signal.
• Code on Github

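A NumPy sketch of the matrix + matrix idea; the learned feed-forward projection from Parikh et al. is omitted for brevity, so the alignment is just the raw dot product between the two sentence matrices.

```python
import numpy as np
from scipy.special import softmax

def matrix_matrix_attention(A, B):
    """A: (len_a, dims) and B: (len_b, dims) are the two sentence matrices.
    Build an alignment matrix, soft-align each sentence against the other,
    and mix the aligned context back with the original rows."""
    E = A @ B.T                               # (len_a, len_b) alignment scores
    attn_for_A = softmax(E, axis=1) @ B       # each row of A attends over B
    attn_for_B = softmax(E.T, axis=1) @ A     # each row of B attends over A
    A_mixed = np.concatenate([A, attn_for_A], axis=1)   # (len_a, 2*dims)
    B_mixed = np.concatenate([B, attn_for_B], axis=1)   # (len_b, 2*dims)
    return A_mixed, B_mixed
```
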
STEP #4: PREDICT
• Convert the reduced vector to a label.
• Generally uses a shallow fully connected network such as the one shown.
• Can also be modified to have a regression head (return the probabilities from the softmax activation).

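A tf.keras sketch of the Predict step; the layer sizes and `num_classes` are placeholders.

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

num_classes = 5                                    # hypothetical number of labels
reduced_vector = Input(shape=(256,))               # output of the Attend step
# Shallow fully connected head: one hidden layer, then softmax over labels.
hidden = Dense(64, activation="relu")(reduced_vector)
probabilities = Dense(num_classes, activation="softmax")(hidden)
predict_model = Model(reduced_vector, probabilities)
```
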
DOCUMENT CLASSIFICATION EXAMPLE – ITERATION #1
• Embed, Predict
• Bag of Words idea
• Sentence = bag of words
• Document = bag of sentences
• Code on Github

DOCUMENT CLASSIFICATION EXAMPLE – ITERATION #2
• Embed, Encode, Predict
• Hierarchical encoding
• Sentence encoder: converts a sequence of word vectors to a sentence vector.
• Document encoder: converts a sequence of sentence vectors to a document vector.
• The sentence encoder network is embedded inside the document network.
• Code on Github

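A sketch of the hierarchical (sentence-encoder-inside-document-network) wiring in tf.keras, using TimeDistributed to apply the sentence encoder to every sentence; all sizes are illustrative and the models in the repo differ in detail.

```python
from tensorflow.keras.layers import (Input, Embedding, Bidirectional, LSTM,
                                     TimeDistributed, Dense)
from tensorflow.keras.models import Model

vocab_size, embed_dim = 10000, 100                 # hypothetical sizes
sent_len, num_sents, num_classes = 40, 30, 5

# Sentence encoder: sequence of word IDs -> single sentence vector.
sent_input = Input(shape=(sent_len,), dtype="int32")
x = Embedding(vocab_size, embed_dim)(sent_input)
sent_vector = Bidirectional(LSTM(64))(x)           # no return_sequences: one vector per sentence
sentence_encoder = Model(sent_input, sent_vector)

# Document encoder: apply the sentence encoder to every sentence, then
# encode the resulting sequence of sentence vectors into a document vector.
doc_input = Input(shape=(num_sents, sent_len), dtype="int32")
sent_vectors = TimeDistributed(sentence_encoder)(doc_input)
doc_vector = Bidirectional(LSTM(64))(sent_vectors)
probabilities = Dense(num_classes, activation="softmax")(doc_vector)
doc_model = Model(doc_input, probabilities)
```
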
DOCUMENT CLASSIFICATION EXAMPLE – ITERATION #3 (a, b, c)
• Embed, Encode, Attend, Predict
• The Encode step returns a matrix, with one vector per timestep.
• Attend reduces the matrix to a vector.
• 3 types of attention (all except Matrix + Matrix) applied to different versions of the model.
• Code on Github – (a), (b), (c)

DOCUMENT CLASSIFICATION EXAMPLE – RESULTS

DOCUMENT SIMILARITY EXAMPLE
• Hierarchical model (word to sentence and sentence to document).
• Tried without attention, with attention for sentence encoding, and with attention for both sentence encoding and the document comparison.
• Code on Github – (a), (b), (c)

SENTENCE SIMILARITY EXAMPLE
• Hierarchical model (word to sentence and sentence to document).
• Used Matrix + Matrix attention for the comparison.
• Code on Github – without attention, with attention

SUMMARY
• The 4-step recipe is a principled approach to NLP with Deep Learning.
• The Embed step leverages the availability of many pre-trained embeddings.
• The Encode step generally uses a bidirectional LSTM to create position-sensitive features; it is also possible to use a CNN here.
• Attention comes in 3 main types – matrix to vector (with or without an implicit context), matrix and vector to vector, and matrix and matrix to vector. It computes a summary with respect to the input, or to the context if one is provided.
• The Predict step converts the vector to a probability distribution via softmax, usually with a fully connected (Dense) network.
• Interesting pipelines can be composed using complete or partial subsequences of the 4-step recipe.

REFERENCES
• Honnibal, M. (2016, November 10). Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models.
• Liao, R. (2016, December 26). Text Classification, Part 3 – Hierarchical attention network.
• Leonardblier, P. (2016, January 20). Attention Mechanism.
• Raffel, C., & Ellis, D. P. (2015). Feed-forward networks with attention can solve some long-term memory problems. arXiv preprint arXiv:1512.08756.
• Yang, Z., et al. (2016). Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT (pp. 1480-1489).
• Cho, K., et al. (2015). Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11), 1875-1886.
• Parikh, A. P., et al. (2016). A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

THANK YOU
• Code: https://github.com/sujitpal/eeap-examples
• Slides: https://www.slideshare.net/sujitpal/presentation-slides-77511261
• Email: sujit.pal@elsevier.com
