Sequence Modelling
with Deep Learning
ODSC London 2019 Tutorial
Natasha Latysheva
Overview
I. Introduction to sequence modelling
II. Quick neural network review
• Feed-forward networks
III. Recurrent neural networks
• From feed-forward networks to recurrence
• RNNs with gating mechanisms
IV. Practical: Building a language model for Game of Thrones
V. Components of state-of-the-art RNN models
• Encoder-decoder models
• Bidirectionality
• Attention
VI. Transformers and self-attention
Speaker Intro
• Welocalize
• We provide language services
• Fairly large: 8th largest globally by revenue,
4th largest in the US; 1,500+ employees
• Lots of localisation (translation)
• International marketing, site optimisation
• NLP engineering team
• 14 people remote across US, Ireland, UK,
Germany, China
• Various NLP things: machine translation,
text-to-speech, NER, sentiment, topics,
classification, etc.
I. Introduction to Sequence Modelling
Other sequence problems
Less conventional sequence data
• Activity on a website:
• [click_button, move_cursor, wait,
wait, click_subscribe, close_tab]
• Customer history:
• [inactive -> mildly_active ->
payment_made -> complaint_filed
-> inactive -> account_closed]
• Code (constrained language) is
sequential data – can learn the
structure
II. Quick Neural Network Review
Feed-forward networks
Simplifying the notation
• Single neurons
• Weight matrices, bias vectors
• Fully-connected layer
III. Recurrent Neural Networks
Why do we need fancy methods to
model sequences?
• Say we are training a translation
model, English->French
• “The cat is black” to “Le chat est
noir”
• Could in theory use a feed-
forward network to translate
word-by-word
Why do we need fancy methods?
• A feed-forward network treats
time steps as completely
independent
• Even in this simple 1-to-1
correspondence example, things
are broken
• How you translate “black” depends
on noun gender (“noir” vs. “noire”)
• How you translate “The” also
depends on gender (“Le” vs. “La”)
• More generally, getting the
translation right requires context
Why do we need fancy methods?
• We need a way for the network
to remember information from
previous time steps
Recurrent neural networks
• Extremely popular way of modelling
sequential data
• Process data one time step at a
time, while updating a running
internal hidden state
Standard FF network to RNN
• At each time step, RNN
passes on its activations
from previous time step
• In theory all the way back
to the first time step
Standard FF network to RNN
*Activation function is typically tanh or ReLU
Standard FF network to RNN
• So you can say this is a
form of memory
• Cell hidden state
transferred
• Basis for RNNs
remembering context
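To make the recurrence concrete, here is a minimal NumPy sketch of a single vanilla RNN step; the shapes and the tanh choice are illustrative, not taken from the tutorial notebook:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN time step: combine the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Illustrative sizes: 4 input features, hidden state of size 8
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 8)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(8, 8)) * 0.1   # hidden-to-hidden weights (the "memory" path)
b_h = np.zeros(8)

h = np.zeros(8)                         # initial hidden state
for x_t in rng.normal(size=(5, 4)):     # a toy sequence of 5 time steps
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # the hidden state carries context forward
```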
Memory problems
• Basic RNNs not great at
long-term dependencies
but plenty of ways to
improve this
• Information gating
mechanisms
• Condensing input using
encoders
Gating mechanisms
• Gates regulate the flow of
information
• Very helpful: basic RNN cells are rarely
used on their own any more; gating is largely
responsible for RNNs’ recent popularity
• Add explicit mechanisms to remember
information and forget information
• Why use gates?
• Helps you learn long-term
dependencies
• Not all time points are equally relevant
– not everything has to be remembered
• Speeds up training/convergence
Gated recurrent
units (GRUs)
• GRUs were developed later
than LSTMs but are simpler
• Motivation is to get the main
benefits of LSTMs but with less
computation
• Reset gate: Mechanism to
decide when to remember vs.
forget/reset previous
information (hidden state)
• Update gate: Mechanism to
decide when to update
hidden state
GRU mechanics
• Reset gate controls how
much past info we use
• Rt = 0 means we are resetting
our RNN, not using any
previous information
• Rt = 1 means we use all of
previous information (back to
our normal vanilla RNN)
GRU mechanics
• Update gate controls whether
we bother updating our
hidden state using new
information
• Zt = 1 means you’re not
updating, you’re just using
previous hidden state
• Zt = 0 means you’re updating as
much as possible
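For reference, one standard way to write the GRU updates that matches the reset/update semantics described on these slides (note that some references swap the role of Zt):

```latex
\begin{aligned}
R_t &= \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r) && \text{(reset gate)}\\
Z_t &= \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z) && \text{(update gate)}\\
\tilde{H}_t &= \tanh\!\big(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h\big) && \text{(candidate hidden state)}\\
H_t &= Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t && \text{(blend old and new)}
\end{aligned}
```

With this convention, Rt = 1 recovers the vanilla RNN candidate, and Zt = 1 simply copies the previous hidden state forward.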
LSTM mechanics
• LSTMs add a memory unit to
further control the flow of
information through the cell
• Also whereas GRUs have 2
gates, an LSTM cell has 3
gates:
• An input gate – should I ignore
or consider the input?
• A forget gate – should I keep
or throw away the information
in memory?
• An output gate – how should I
use input, hidden state and
memory to output my next
hidden state?
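A minimal NumPy sketch of one LSTM step with the three gates described above; the parameter names and sizes are illustrative assumptions, not the notebook's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold parameters for the i/f/o gates and the candidate memory."""
    i = sigmoid(x_t @ W["i"] + h_prev @ U["i"] + b["i"])        # input gate: consider or ignore the input
    f = sigmoid(x_t @ W["f"] + h_prev @ U["f"] + b["f"])        # forget gate: keep or discard old memory
    o = sigmoid(x_t @ W["o"] + h_prev @ U["o"] + b["o"])        # output gate: how much memory to expose
    c_tilde = np.tanh(x_t @ W["c"] + h_prev @ U["c"] + b["c"])  # candidate memory
    c_t = f * c_prev + i * c_tilde                              # new memory cell
    h_t = o * np.tanh(c_t)                                      # new hidden state
    return h_t, c_t

# Illustrative sizes: 4 input features, hidden/memory size 8
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(4, 8)) * 0.1 for k in "ifoc"}
U = {k: rng.normal(size=(8, 8)) * 0.1 for k in "ifoc"}
b = {k: np.zeros(8) for k in "ifoc"}
h, c = np.zeros(8), np.zeros(8)
for x_t in rng.normal(size=(3, 4)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```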
GRUs vs. LSTMs
• GRUs are simpler + train
faster
• LSTMs more popular – can
give slightly better
performance, but GRU
performance often on par
• LSTMs would in theory
outperform GRUs in tasks
requiring very long-range
modelling
IV. Game of Thrones Language Model
Notebook
• ~30 mins
• Jupyter
notebook on
building an RNN-
based language
model
• Python 3 + Keras
for neural
networks
tinyurl.com/wbay5o3
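As a rough, hedged sketch of the kind of RNN-based language model the notebook builds with Keras; the vocabulary size and layer sizes here are illustrative placeholders, not the notebook's actual values:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 5000   # illustrative: number of distinct tokens in the corpus

model = Sequential([
    Embedding(vocab_size, 64),                    # token ids -> dense vectors
    LSTM(128),                                    # gated RNN reads the context window
    Dense(vocab_size, activation="softmax"),      # probability distribution over the next token
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# model.fit(X, y, ...)  # X: integer-encoded contexts, y: the next-token ids
```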
V. Components of SOTA RNN models
Encoder-Decoder architectures
• The problem so far: being forced to
immediately output a French word
for every English word
Encoder-Decoder architectures
• Tends to work a lot better than
using a single sequence-to-sequence
RNN to produce an output for each
input step
• You often need to
see the whole
sequence before
knowing what to
output
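A compressed Keras sketch of an RNN encoder-decoder in the spirit of this architecture; the vocabulary sizes, layer sizes and teacher-forcing setup are illustrative assumptions:

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

src_vocab, tgt_vocab, hidden = 8000, 9000, 256   # illustrative sizes

# Encoder: read the whole source sentence, keep only its final states
enc_in = Input(shape=(None,))
enc_emb = Embedding(src_vocab, 128)(enc_in)
_, state_h, state_c = LSTM(hidden, return_state=True)(enc_emb)

# Decoder: generate the target sequence, initialised with the encoder's states
dec_in = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, 128)(dec_in)
dec_out, _, _ = LSTM(hidden, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_pred = Dense(tgt_vocab, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], dec_pred)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```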
Bidirectionality in RNN encoder-decoders
• For the encoder,
bidirectional RNNs
(BRNNs) are often used
• BRNNs read the
input sequences
forwards and
backwards
Bidirectional
RNNs
• Process input
sequences in both
directions
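In Keras, bidirectionality is just a wrapper around the recurrent layer; a minimal illustration (the layer size is arbitrary):

```python
from tensorflow.keras.layers import Bidirectional, LSTM

# Reads the input sequence forwards and backwards and concatenates both hidden states
bi_encoder = Bidirectional(LSTM(128, return_sequences=True))
```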
The problem with RNN encoder-decoders
• Serious information
bottleneck
• Condense input
sequence down to a
small vector?!
• Memorise long
sequence + regurgitate
• Not how humans work
• Long computation
paths
Attention concept
• Has been very influential in
deep learning
• Originally developed for
MT (Bahdanau, 2014)
• As you’re producing your
output sequence, maybe
not every part of your input
is equally relevant
• Image captioning example
Lu et al. 2017. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
Attention intuition
• Attention allows the
network to refer back
to the input
sequence, instead of
forcing it to encode
all information into
one fixed-length
vector
• Encoder: a BRNN is used
to compute a rich set of
features about the source
words and their
surrounding words
• Decoder is asked to
choose which hidden
states to use and
ignore
• Weighted sum of
hidden states used to
predict the next word
Attention intuition
• Decoder RNN uses
attention parameters
to decide how much
to pay attention to
different parts of the
input
• Allows the model to
amplify the signal
from relevant parts of
the input sequence
• This improves
modelling
Main benefits
• Encoder passes a lot
more data to
the decoder
• Not just last hidden
state
• Passes all hidden states
at every time step
• Computation path
problem: relevant
information is now
closer by
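To make the weighted sum concrete, here is a minimal NumPy sketch of how a decoder step could combine encoder hidden states; the dot-product scoring used here is one simple choice (Bahdanau-style attention scores with a small feed-forward network instead):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Weighted sum of encoder hidden states, weighted by relevance to the current decoder state."""
    scores = encoder_states @ decoder_state        # one relevance score per source position
    weights = softmax(scores)                      # positive attention weights that sum to 1
    return weights @ encoder_states, weights       # context vector + the weights themselves

rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(6, 8))           # 6 source positions, hidden size 8
decoder_state = rng.normal(size=8)
context, weights = attention_context(decoder_state, encoder_states)
```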
Summary so far
• Sequence modelling
• Recurrent neural
networks
• Some key components
of SOTA RNN-based
models:
• Gating mechanisms
(GRUs and LSTMs)
• Encoder-decoders
• Bidirectional encoding
• Attention
VI. Transformers and self-attention
Transformers are taking over NLP
• Translation, language
models, question
answering, summarisation,
etc.
• Some of the best word
embeddings are based on
Transformers
• BERT, ELMo, OpenAI GPT-2
models
A single Transformer encoder block
• No recurrence, no convolutions
• “Attention is all you need” paper
• The core concept is the self-
attention mechanism
• Much more parallelisable than
RNN-based models, which
means faster training
Self-attention is a
sequence-to-sequence
operation
• At the highest level – self-
attention takes t input
vectors and outputs t
output vectors
• Take input embedding for
“the” and update it by
incorporating
information from its
context
How is the vector for “the” updated?
• Each output vector
is a weighted sum
of the input vectors
• But all of these
weights are
different
These are not learned weights in the
traditional neural network sense
• The weights are
calculated by taking
dot products
• Can use different
functions over input
Example calculation of a single weight
Calculating a weight matrix row
Attention weight matrix
• The dot product can be
anything (negative infinity to
positive infinity)
• We normalise by length
• We softmax this so that the
weights are positive values
summing to 1
• Attention weight matrix
summarises relationship
between words
• Because dot products capture
similarity between vectors
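Putting the recipe above together (dot products, scaling, softmax), a minimal NumPy sketch of basic self-attention without learned projections; scaling by the square root of the vector dimension is the usual reading of the "normalise by length" step:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """X: (t, d) input vectors -> (t, d) output vectors, each a weighted sum of all the inputs."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # raw dot products, scaled by the vector dimension
    weights = softmax(scores, axis=-1)  # each row is positive and sums to 1
    return weights @ X                  # updated representation for every position

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))             # e.g. embeddings for a 4-word sentence
Y = self_attention(X)
```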
Multi-headed attention
• Attention weight matrix
captures relationship
between words
• But there’s many
different ways words can
be related
• And which ones you want
to capture depends on
your task
• Different attention heads
learn different relations
between word pairs
Difference to RNNs
• Whereas RNNs update context
token by token by updating an
internal hidden state, self-
attention captures context by
updating all word representations
simultaneously
• Lower computational complexity,
scales better with more data
• More parallelisable = faster
training
Connecting all
these concepts
• “Useful” input representations are
learned
• “Useful” weights for transforming
input vectors are learned
• These quantities should produce
“useful” dot products
• That lead to “useful” updated input
vectors
• That lead to “useful” input to the
feed-forward network layer
• … etc. … that eventually lead to
lower overall loss on the training set
Summary
I. Introduction to sequence modelling
II. Quick neural network review
• How a single neuron functions
• Feed-forward networks
III. Recurrent neural networks
• From feed-forward networks to recurrence
• RNNs with gating mechanisms
IV. Practical: Building a language model for Game of Thrones
V. Components of state-of-the-art RNN models
• Encoder-decoder models
• Bidirectionality
• Attention
VI. Transformers and self-attention
Further Reading
• More accessible: Andrew Ng
Sequence Course on Coursera
• https://www.coursera.org/learn/nlp-sequence-models
• More technical: Deep Learning book
by Goodfellow et al.
• https://www.deeplearningbook.org/contents/rnn.html
• Also: Alex Smola Berkeley Lectures
• https://www.youtube.com/user/smolix/videos
Just for fun
• Talk to transformer
• https://talktotransformer.com/
• Using OpenAI’s “too
dangerous to release” GPT-
2 language model
Thanks, questions?
Extra slides
Sequences in natural language
• Sequence modelling is very popular in
NLP because language is sequential by
nature
• Text
• Sequences of words
• Sequences of characters
• We process text sequentially, though in
principle we could see all words at once
• Speech
• Sequence of amplitudes over time
• Frequency spectrogram over time
• Extracted frequency features over time
Sequences in biology
• Genomics, DNA and
RNA sequences
• Proteomics, protein
sequences,
structural biology
• Trying to represent
sequences in some
way, or predict some
function or
association of the
sequence
Sequences in finance
• Lots of time series data
• Numerical sequences (stocks,
indices)
• Lots of forecasting work –
predicting the future (trading
strategies)
• Deep learning for these
sequences perhaps not as
popular as you might think
• Quite well-developed methods
based on classical statistics,
interpretability important
Single neuron computation
• What computation is
happening inside 1
neuron?
• If you understand how 1
neuron computes output
given input, it’s a small
step to understand how an
entire network computes
output given input
Perceptrons
• Modelling a binary outcome using
binary input features
• Should I have a cup of tea?
• 0 = no
• 1 = yes
• Three features with 1 weight each:
• Do they have Earl Grey?
• earl_grey, w₁ = 3
• Have I just had a cup of tea?
• already_had, w₂ = -1
• Can I get it to go?
• to_go, w₃ = 2
Perceptrons
• Here weights are
cherry-picked, but
perceptrons learn these
weights automatically
from training data by
shifting parameters to
minimise error
Perceptrons
• Formalising the perceptron
calculation
• Instead of a threshold, more
common to see a bias term
• Instead of writing out the
sums using sigma notation,
more common to see dot
products.
• Vectorisation for efficiency
• Here, I manually chose these
values – but given a dataset of
past inputs/outputs, you could
learn the optimal parameter
values
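Using the weights from the tea example, a minimal NumPy version of the perceptron decision; the bias value is an illustrative stand-in for the threshold, which the slides leave unspecified:

```python
import numpy as np

w = np.array([3.0, -1.0, 2.0])   # earl_grey, already_had, to_go (weights from the slide)
b = -2.0                          # illustrative bias (negative threshold); not from the slide

def perceptron(x):
    """Binary decision: 1 = have the tea, 0 = don't."""
    return int(np.dot(w, x) + b > 0)

print(perceptron(np.array([1, 0, 1])))  # Earl Grey available, haven't had one, can take away -> 1
print(perceptron(np.array([0, 1, 0])))  # no Earl Grey, already had a cup -> 0
```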
Sigmoid neurons
• Want to handle continuous
values
• Where input can be
something other than just 0 or
1
• Where output can be
something other than just 0 or
1
• We put the weighted sum of
inputs through an activation
function
• Sigmoid or logistic function
Sigmoid neurons
• The sigmoid function is
basically a smoothed out
perceptron!
• Output no longer a
sudden jump
• It’s the smoothness of the
function that we care
about
Activation functions
• Which activation function
to use?
• Heuristics based on
experiments, not proof-
based
More layers!
• Increase
number of
layers to
increase
capacity for
abstraction,
hierarchical
processing of
input
Training on big window sizes
• How big should the window be? On a very long sequence, the unrolled
RNN becomes a very deep network
• Same problems with vanishing/exploding gradients as normal
networks
• And takes a longer time to train
• The normal tricks can help – good initialization of parameters, non-
saturating activation functions, gradient clipping, batch norm
• Training over a limited number of steps – truncated
backpropagation through time
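As a small, hedged example of two of those tricks in practice: gradient clipping can be requested directly on a Keras optimizer, and truncated backpropagation through time usually amounts to training on fixed-length windows of the sequence (the window length here is illustrative):

```python
from tensorflow.keras.optimizers import Adam

# Clip gradients by global norm to tame exploding gradients on long unrolled sequences
optimizer = Adam(learning_rate=1e-3, clipnorm=1.0)

# Truncated BPTT in practice: split a long token sequence into fixed-length training windows
def make_windows(tokens, window=50):
    return [tokens[i:i + window + 1] for i in range(0, len(tokens) - window, window)]
```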
LSTM mechanics
• Input, forget, output gates are
little neural networks within the
cell
• Memory being updated via
forget gate and candidate
memory
• Hidden state being updated by
output gate, which weighs up all
information
Query, Key, and Value transformations
• Notice that we are using
each input vector on 3
separate occasions
• E.g. vector x2
1. To take dot products
with each other input
vector when calculating
y2
2. In the dot products taken
when the other output vectors
(y1, y3, y4) are calculated
3. And in the weighted
sum to produce output
vector y2
Query, Key, and Value transformations
• To model these 3
different functions for
each input vector, and
give the model extra
expressivity and
flexibility, we are going
to modify the input
vectors
• Apply simple linear
transformations
Input transformation
matrices
• These weight matrices
are learnable
parameters
• Gives something else
to learn by gradient
descent
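Extending the earlier self-attention sketch with the query/key/value projections described here; the projection matrices are random stand-ins for parameters that a real model would learn by gradient descent:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_qkv(X, W_q, W_k, W_v):
    """Project each input into query, key and value roles, then attend."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # scaled dot products + softmax
    return weights @ V                                          # weighted sum of value vectors

rng = np.random.default_rng(4)
d = 8
X = rng.normal(size=(4, d))                                     # 4 tokens, dimension 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))  # learnable in a real model
Y = self_attention_qkv(X, W_q, W_k, W_v)
```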