
Attention Mechanisms with TensorFlow

A review of the attention mechanism and its variants, implemented in TensorFlow.


  1. 1. Attention Mechanisms with TensorFlow Keon Kim DeepCoding 2016. 08. 26
  2. 2. Today, We Will Study... 1. Attention Mechanism and Its Implementation - Attention Mechanism (Short) Review - Attention Mechanism Code Review
  3. 3. Today, We Will Study... 1. Attention Mechanism and Its Implementation - Attention Mechanism (Short) Review - Attention Mechanism Code Review 2. Attention Mechanism Variant (Pointer Networks) and Its Implementation - Pointer Networks (Short) Review - Pointer Networks Code Review
  4. 4. Attention Mechanism Review
  5. 5. Global Attention Attention Mechanism Review
  6. 6. The encoder compresses the input sequence into one vector; the decoder uses this vector to generate the output
  7. 7. The encoder compresses the input sequence into one vector; the decoder uses this vector to generate the output. The attention mechanism instead predicts the output yt with a weighted-average context vector ct, not just the last state.
  8. 8. The attention mechanism predicts the output yt with a weighted-average context vector ct, not just the last state.
  9. 9. The attention mechanism predicts the output yt with a weighted-average context vector ct, not just the last state.
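For reference while reading the figures, one common formulation of this computation in LaTeX (the symbol names are assumptions, since the slides only show images: h_i are the encoder states, s_{t-1} is the previous decoder state):

    \[
    e_{t,i} = \mathrm{score}(s_{t-1}, h_i), \qquad
    \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_k \exp(e_{t,k})}, \qquad
    c_t = \sum_i \alpha_{t,i}\, h_i
    \]

The decoder then predicts y_t from its own state together with c_t, rather than from the last encoder state alone.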
  10. 10. Attention Mechanism (figure): the scores et1, et2, … are passed through a softmax to give the weights of the encoder states h, and their weighted sum is the context vector
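A minimal NumPy sketch of the same computation with made-up toy values (not code from the slides, just the mechanics):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())      # subtract the max for numerical stability
        return e / e.sum()

    # toy encoder states h_i (3 time steps, hidden size 4) and one decoder state s
    H = np.random.randn(3, 4)
    s = np.random.randn(4)

    scores = H @ s                   # one (dot-product) score e_i per encoder state
    alpha = softmax(scores)          # attention weights, non-negative and summing to 1
    context = alpha @ H              # weighted average of the encoder states
    print(alpha, context)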
  11. 11. Character-based Neural Machine Translation [Ling+2015] http://arxiv.org/pdf/1511.04586v1.pdf
  12. 12. Character-based Neural Machine Translation [Ling+2015] http://arxiv.org/pdf/1511.04586v1.pdf String to Vector String Generation
  13. 13. Attention Mechanism Code Review
  14. 14. Attention Mechanism Code Review translate.py example in tensorflow https://github.com/tensorflow/tensorflow/tree/master/ tensorflow/models/rnn/translate
  15. 15. Implementation of Grammar as a Foreign Language: given an input string, it outputs a syntax tree
  16. 16. Implementation of Grammar as a Foreign Language: given an input string, it outputs a syntax tree. [Go.] is put in reversed order. translate.py uses the same model, but for the en-fr translation task
  17. 17. Basic Flow of the Implementation
  18. 18. Preprocess Create Model Train Test
  19. 19. Preprocess Create Model Train Test I go. Je vais. Original English Original French Use of Buckets Bucketing is a method to efficiently handle sentences of different lengths: an English sentence with length L1, a French sentence with length L2 + 1 (prefixed with the GO symbol). English sentence -> encoder_inputs, French sentence -> decoder_inputs. In principle we should create a seq2seq model for every pair (L1, L2+1) of lengths of an English and a French sentence. This would result in an enormous graph consisting of many very similar subgraphs. On the other hand, we could just pad every sentence with a special PAD symbol. Then we'd need only one seq2seq model, for the padded lengths. But on shorter sentences our model would be inefficient, encoding and decoding many PAD symbols that are useless.
  20. 20. Preprocess Create Model Train Test I go. Je vais. Original English Original French As a compromise between constructing a graph for every pair of lengths and padding to a single length, we use a number of buckets and pad each sentence to the length of the bucket above it. In translate.py we use the following default buckets: buckets = [(5, 10), (10, 15), (20, 25), (40, 50)] This means that if the input is an English sentence with 3 tokens and the corresponding output is a French sentence with 6 tokens, then they will be put in the first bucket and padded to length 5 for encoder inputs and length 10 for decoder inputs. Use of Buckets
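A small sketch of the bucket-selection rule described on this slide (the helper name pick_bucket is made up for illustration; translate.py does this inline while reading the data):

    buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

    def pick_bucket(source_len, target_len):
        # choose the smallest bucket that fits both the English and the French sentence
        for bucket_id, (source_size, target_size) in enumerate(buckets):
            if source_len < source_size and target_len < target_size:
                return bucket_id
        return None   # the pair is too long for every bucket

    # "I go ." (3 tokens) with "Je vais ." (3 tokens + EOS) lands in bucket 0, i.e. (5, 10)
    print(pick_bucket(3, 4))   # -> 0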
  21. 21. Preprocess Create Model Train Test Original English: I go. Original French: Je vais. Encoder and Decoder Inputs in Bucket (5, 10)
  22. 22. Preprocess Create Model Train Test Original English: I go. Original French: Je vais. Tokenization: ["I", "go", "."] and ["Je", "vais", "."] Encoder and Decoder Inputs in Bucket (5, 10)
  23. 23. Preprocess Create Model Train Test Original English: I go. Original French: Je vais. Tokenization: ["I", "go", "."] and ["Je", "vais", "."] Encoder Input: ["PAD", "PAD", ".", "go", "I"] Decoder Input: ["GO", "Je", "vais", ".", "EOS", "PAD", "PAD", "PAD", "PAD", "PAD"] Encoder and Decoder Inputs in Bucket (5, 10)
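A hedged sketch of how the reversed, padded encoder input and the GO/EOS-framed decoder input shown above can be built (token strings are used instead of vocabulary ids to match the slide; translate.py itself works on integer ids):

    PAD, GO, EOS = "PAD", "GO", "EOS"

    def make_inputs(src_tokens, tgt_tokens, bucket):
        enc_size, dec_size = bucket
        # encoder input: pad up to the bucket size, then reverse the whole sequence
        enc_pad = [PAD] * (enc_size - len(src_tokens))
        encoder_input = list(reversed(src_tokens + enc_pad))
        # decoder input: GO symbol, target tokens, EOS, then padding up to the bucket size
        dec_pad = [PAD] * (dec_size - len(tgt_tokens) - 2)
        decoder_input = [GO] + tgt_tokens + [EOS] + dec_pad
        return encoder_input, decoder_input

    enc, dec = make_inputs(["I", "go", "."], ["Je", "vais", "."], (5, 10))
    print(enc)   # ['PAD', 'PAD', '.', 'go', 'I']
    print(dec)   # ['GO', 'Je', 'vais', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD']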
  24. 24. Preprocess Create Model Train Test Encoder Inputs: [ ["PAD", "PAD", ".", "go", "I"], … ] Decoder Inputs: [ ["GO", "Je", "vais", ".", "EOS", "PAD", "PAD", "PAD", "PAD", "PAD"], … ] Creating the Seq2Seq Attention Model: model = embedding_attention_seq2seq(encoder_inputs, decoder_inputs, …, feed_previous=False)
  25. 25. Preprocess Create Model Train Test Encoder Inputs: [ ["PAD", "PAD", ".", "go", "I"], … ] Decoder Inputs: [ ["GO", "Je", "vais", ".", "EOS", "PAD", "PAD", "PAD", "PAD", "PAD"], … ] model = embedding_attention_seq2seq(encoder_inputs, decoder_inputs, …, feed_previous=False) embedding_attention_seq2seq() is made of an encoder + embedding_attention_decoder()
  26. 26. Preprocess Create Model Train Test Encoder Inputs: [ ["PAD", "PAD", ".", "go", "I"], … ] Decoder Inputs: [ ["GO", "Je", "vais", ".", "EOS", "PAD", "PAD", "PAD", "PAD", "PAD"], … ] model = embedding_attention_seq2seq(encoder_inputs, decoder_inputs, …, feed_previous=False) embedding_attention_seq2seq() is made of an encoder + embedding_attention_decoder(); embedding_attention_decoder() is made of an embedding + attention_decoder()
  27. 27. Preprocess Create Model Train Test Encoder Inputs: [ ["PAD", "PAD", ".", "go", "I"], … ] Decoder Inputs: [ ["GO", "Je", "vais", ".", "EOS", "PAD", "PAD", "PAD", "PAD", "PAD"], … ] model = embedding_attention_seq2seq(encoder_inputs, decoder_inputs, …, feed_previous=False) “feed_previous = False” means that the decoder will use the decoder_inputs tensors as provided
  28. 28. Preprocess Create Model Train Test Encoder Inputs: [ ["PAD", "PAD", ".", "go", "I"], … ] Decoder Inputs: [ ["GO", "Je", "vais", ".", "EOS", "PAD", "PAD", "PAD", "PAD", "PAD"], … ] model = embedding_attention_seq2seq(encoder_inputs, decoder_inputs, …, feed_previous=False) outputs, losses = model_with_buckets(encoder_inputs, decoder_inputs, model, … ) (just a wrapper function that helps with using buckets)
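For reference, a simplified sketch of how these calls fit together, loosely modeled on seq2seq_model.py from the tutorial (it uses the legacy TF 0.x tf.nn.seq2seq API, the argument lists are abbreviated, and the vocabulary sizes and cell size are example values, so treat it as a sketch rather than the exact translate.py code):

    import tensorflow as tf

    buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
    cell = tf.nn.rnn_cell.GRUCell(512)

    # placeholder lists, one int32 tensor of token ids per time step (sized for the largest bucket)
    encoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(buckets[-1][0])]
    decoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(buckets[-1][1] + 1)]
    target_weights = [tf.placeholder(tf.float32, [None]) for _ in range(buckets[-1][1] + 1)]
    targets = decoder_inputs[1:]   # targets are the decoder inputs shifted by one step

    def seq2seq_f(enc, dec, do_decode):
        return tf.nn.seq2seq.embedding_attention_seq2seq(
            enc, dec, cell,
            num_encoder_symbols=40000,   # English vocabulary size (example value)
            num_decoder_symbols=40000,   # French vocabulary size (example value)
            embedding_size=512,
            feed_previous=do_decode)     # False while training, True while decoding

    # model_with_buckets builds one sub-graph per bucket and returns
    # per-bucket output logits and losses
    outputs, losses = tf.nn.seq2seq.model_with_buckets(
        encoder_inputs, decoder_inputs, targets, target_weights,
        buckets, lambda x, y: seq2seq_f(x, y, False))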
  29. 29. Preprocess Create Model Train Test Training: session.run() with: - GradientDescentOptimizer - Gradient Clipping by Global Norm
  30. 30. Preprocess Create Model Train Test Training: session.run() with: - GradientDescentOptimizer - Gradient Clipping by Global Norm
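A hedged sketch of the training-op construction this slide refers to (TF 0.x style; the learning rate and clipping norm are example values, and loss stands for one of the per-bucket losses returned by model_with_buckets above):

    params = tf.trainable_variables()
    opt = tf.train.GradientDescentOptimizer(learning_rate=0.5)

    gradients = tf.gradients(loss, params)
    # clip by global norm so that one large gradient cannot blow up the update
    clipped, global_norm = tf.clip_by_global_norm(gradients, clip_norm=5.0)
    train_op = opt.apply_gradients(zip(clipped, params))

    # one training step inside the loop:
    # sess.run([train_op, loss], feed_dict=...)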
  31. 31. Preprocess Create Model Train Test Testing: decode() with: - feed_previous = True “feed_previous = True” means that the decoder uses only the first element of decoder_inputs (the GO symbol) and, from the second step on, feeds its own previous output back in.
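At decode time translate.py then greedily picks the highest-scoring symbol at every step and cuts the output at EOS; a condensed sketch (output_logits stands for the per-step logits returned by the model, EOS_ID for the id of the EOS symbol, and rev_fr_vocab for the id-to-word list):

    import numpy as np

    # greedy decoding: take the argmax of the logits at each time step
    outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]

    # truncate at the first EOS symbol, if any
    if EOS_ID in outputs:
        outputs = outputs[:outputs.index(EOS_ID)]

    # map ids back to French words and print the translation
    print(" ".join(rev_fr_vocab[i] for i in outputs))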
  32. 32. python translate.py --decode --data_dir [your_data_directory] --train_dir [checkpoints_directory] Reading model parameters from /tmp/translate.ckpt-340000 > Who is the president of the United States? Qui est le président des États-Unis ? Result
  33. 33. Let’s go to TensorFlow import tensorflow as tf
  34. 34. Attention Mechanism and Its Variants - Global attention - Local attention - Pointer networks - Attention for image (image caption generation) …
  35. 35. Attention Mechanism and Its Variants - Global attention - Local attention - Pointer networks ⇠ this one for today - Attention for image (image caption generation) …
  36. 36. Pointer Networks Review Convex Hull
  37. 37. Pointer Networks ‘Point’ at Input Elements! In a Ptr-Net we do not blend the encoder states to propagate extra information to the decoder, as the standard attention mechanism does, but instead ... (figure: the standard attention mechanism, with scores et1, et2)
  38. 38. Pointer Networks ‘Point’ at Input Elements! We use the attention weights as pointers to the input elements. The Ptr-Net approach specifically targets problems whose outputs are discrete and correspond to positions in the input. (figure: pointer network)
  39. 39. Another model for each number of points?
  40. 40. distribution of u
  41. 41. Distribution of the Attention is the Answer! distribution of u
  42. 42. Attention Mechanism vs Pointer Networks In a Ptr-Net, the softmax normalizes the score vector u^i to be an output distribution over the dictionary of inputs (figure: the attention mechanism and Ptr-Net equations)
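The equations on this slide are images; reconstructed in LaTeX from the Pointer Networks paper (e_j are encoder states, d_i decoder states, and v, W_1, W_2 learned parameters):

    Standard attention mechanism:
    \[
    u^i_j = v^\top \tanh(W_1 e_j + W_2 d_i), \qquad
    a^i_j = \mathrm{softmax}(u^i_j), \qquad
    d'_i = \sum_{j=1}^{n} a^i_j e_j
    \]
    Pointer network (the softmax itself is the output distribution over input positions):
    \[
    u^i_j = v^\top \tanh(W_1 e_j + W_2 d_i), \qquad
    p(C_i \mid C_1, \dots, C_{i-1}, \mathcal{P}) = \mathrm{softmax}(u^i)
    \]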
  43. 43. Pointer Networks Code Review
  44. 44. Pointer Networks Code Review Sorting Implementation https://github.com/ikostrikov/TensorFlow-Pointer-Networks
  45. 45. Characteristics of the Implementation This implementation: - The model code is a slightly modified version of attention_decoder() from the TensorFlow seq2seq model - A simple implementation, but sparsely commented
  46. 46. Characteristics of the Implementation This implementation: - The model code is a slightly modified version of attention_decoder() from the TensorFlow seq2seq model - A simple implementation, but sparsely commented Focus on: - The general structure of the code (so you can refer to it while creating your own) - How the original implementation of attention_decoder() is modified
  47. 47. Task: "Order Matters", Section 4.4 - Learn how to sort N numbers between 0 and 1 - e.g. 0.2 0.5 0.6 0.23 0.1 0.9 => 0.1 0.2 0.23 0.5 0.6 0.9
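Since a pointer network outputs positions in its input, the training target for this task is the list of indices that sorts the sequence; a minimal sketch of generating one example (the dataset.py in the repo may differ in its details):

    import numpy as np

    def make_example(n=6):
        x = np.random.rand(n)        # N numbers between 0 and 1
        target = np.argsort(x)       # indices into x, in ascending order of value
        return x, target

    x, target = make_example()
    print(x)           # e.g. [0.2  0.5  0.6  0.23 0.1  0.9 ]
    print(target)      # e.g. [4 0 3 1 2 5]
    print(x[target])   # the sorted sequence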
  48. 48. Structure of the Implementation
  49. 49. How the Implementation is Structured main.py (Execution) PointerNetwork() pointer.py (Decoder Model) pointer_decoder() attention() dataset.py (Dataset) next_batch() create_feed_dict() step() __init__() session
  50. 50. How the Implementation is Structured main.py PointerNetwork() pointer.py pointer_decoder() attention() dataset.py next_batch() create_feed_dict() step() __init__() session The encoder and decoder are instantiated in __init__()
  51. 51. How the Implementation is Structured main.py PointerNetwork() pointer.py pointer_decoder() attention() dataset.py next_batch() create_feed_dict() step() __init__() session 1. The Dataset is instantiated and asked for a new batch at every step() 2. create_feed_dict() is then used to feed that batch into the session
  52. 52. How the Implementation is Structured main.py PointerNetwork() pointer.py pointer_decoder() attention() dataset.py next_batch() create_feed_dict() step() __init__() session Finally, step() uses the model created in __init__() to run the session
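A skeletal Python sketch of that structure (class and method names follow the diagram; the bodies are placeholders, since only the call structure is being illustrated here, not the repo's actual code):

    class Dataset:
        def next_batch(self, batch_size):
            """Return one batch of (encoder_inputs, decoder_inputs, targets)."""
            raise NotImplementedError

    class PointerNetwork:
        def __init__(self):
            # the encoder RNN, the pointer_decoder() and the training op are built here
            self.train_op = None

        def create_feed_dict(self, encoder_inputs, decoder_inputs, targets):
            """Map one batch onto the model's input placeholders."""
            return {}

        def step(self, session, dataset):
            # one training step: fetch a batch, build the feed dict, run the graph
            batch = dataset.next_batch(batch_size=128)
            feed = self.create_feed_dict(*batch)
            return session.run(self.train_op, feed_dict=feed)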
  53. 53. Brief Explanations of The Model Part
  54. 54. main.py PointerNetwork() pointer.py pointer_decoder() attention() dataset.py next_batch() create_feed_dict() step() __init__() session pointer_decoder()
  55. 55. pointer_decoder(): A Simple Modification of TensorFlow's attention_decoder()
  56. 56. main.py PointerNetwork() pointer.py pointer_decoder() attention() dataset.py next_batch() create_feed_dict() step() __init__() session pointer_decoder(): attention()
  57. 57. pointer_decoder(): attention() query == states
  58. 58. Standard Attention vs Pointer Attention in Code: the attention function from attention_decoder() vs the attention function from pointer_decoder()
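The two functions on this slide are shown as screenshots; below is a hedged, framework-light sketch of the contrast (not the repo's TensorFlow code: hidden stands for the stacked encoder states, query for the current decoder state, and v, W1, W2 for the learned attention parameters):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def scores(query, hidden, v, W1, W2):
        # hidden: [n, d] encoder states, query: [d] decoder state -> [n] scores u_j
        return np.tanh(hidden @ W1 + query @ W2) @ v

    def standard_attention(query, hidden, v, W1, W2):
        a = softmax(scores(query, hidden, v, W1, W2))
        # blend the encoder states into a context vector that is fed to the decoder
        return a @ hidden

    def pointer_attention(query, hidden, v, W1, W2):
        # no blending: the softmax over input positions IS the output distribution
        return softmax(scores(query, hidden, v, W1, W2))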
  59. 59. main.py PointerNetwork() pointer.py pointer_decoder() attention() dataset.py next_batch() create_feed_dict() step() __init__() session The encoder and decoder are instantiated in __init__()
  60. 60. __init__(): The Whole Encoder and Decoder Model Is Built Here
  61. 61. Let’s go to TensorFlow import tensorflow as tf
  62. 62. Result: 82% Step: 0 Train: 1.07584562302 Test: 1.07516384125 Correct order / All order: 0.000000 Step: 100 Train: 8.91889034099 Test: 8.91508702453 Correct order / All order: 0.000000 …. Step: 9800 Train: 0.437000320964 Test: 0.459392405155 Correct order / All order: 0.841875 Step: 9900 Train: 0.424404183739 Test: 0.636979421763 Correct order / All order: 0.825000
  63. 63. Original Theano Implementation (Part) Ptr-Net
  64. 64. Thank you :D
  65. 65. References - Many slides from: http://www.slideshare.net/yutakikuchi927/deep-learning-nlp-attention - Character-Based Neural Machine Translation: http://arxiv.org/abs/1511.04586 - Grammar as a Foreign Language: https://arxiv.org/abs/1412.7449 - TensorFlow Official Tutorial on Seq2Seq Models: https://www.tensorflow.org/versions/r0.10/tutorials/seq2seq - Pointer Networks: https://arxiv.org/abs/1506.03134 - Order Matters: http://arxiv.org/abs/1511.06391 - Pointer Networks Implementation: https://github.com/ikostrikov/TensorFlow-Pointer-Networks
