Sequence Modelling
with Deep Learning
ODSC London 2019 Tutorial
Natasha Latysheva
Overview
I. Introduction to sequence modelling
II. Quick neural network review
• Feed-forward networks
III. Recurrent neural networks
• From feed-forward networks to recurrence
• RNNs with gating mechanisms
IV. Practical: Building a language model for Game of Thrones
V. Components of state-of-the-art RNN models
• Encoder-decoder models
• Bidirectionality
• Attention
VI. Transformers and self-attention
Speaker Intro
• Welocalize
• We provide language services
• Fairly large: 8th largest globally by revenue,
4th largest in the US; 1,500+ employees
• Lots of localisation (translation)
• International marketing, site optimisation
• NLP engineering team
• 14 people remote across US, Ireland, UK,
Germany, China
• Various NLP things: machine translation,
text-to-speech, NER, sentiment, topics,
classification, etc.
I. Introduction to Sequence Modelling
Other sequence problems
Less conventional sequence data
• Activity on a website:
• [click_button, move_cursor, wait,
wait, click_subscribe, close_tab]
• Customer history:
• [inactive -> mildly_active ->
payment_made -> complaint_filed
-> inactive -> account_closed]
• Code (constrained language) is
sequential data – can learn the
structure
II. Quick Neural Network Review
Feed-forward networks
Simplifying the notation
• Single neurons
• Weight matrices, bias vectors
• Fully-connected layer
III. Recurrent Neural Networks
Why do we need fancy methods to
model sequences?
• Say we are training a translation
model, English->French
• “The cat is black” to “Le chat est
noir”
• Could in theory use a feed-
forward network to translate
word-by-word
Why do we need fancy methods?
• A feed-forward network treats
time steps as completely
independent
• Even in this simple 1-to-1
correspondence example, things
are broken
• How you translate “black” depends
on noun gender (“noir” vs. “noire”)
• How you translate “The” also
depends on gender (“Le” vs. “La”)
• More generally, getting the
translation right requires context
Why do we need fancy methods?
• We need a way for the network
to remember information from
previous time steps
Recurrent neural networks
• Extremely popular way of modelling
sequential data
• Process data one time step at a
time, while updating a running
internal hidden state
Standard FF network to RNN
• At each time step, RNN
passes on its activations
from previous time step
• In theory all the way back
to the first time step
Standard FF network to RNN
*Activation function is typically tanh or ReLU
Standard FF network to RNN
• So you can say this is a
form of memory
• Cell hidden state
transferred
• Basis for RNNs
remembering context
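To make the recurrence concrete, here is a minimal NumPy sketch of a single vanilla RNN step; the shapes and the tanh choice are illustrative, not taken from the tutorial notebook:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN time step: combine the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Illustrative sizes: 4 input features, hidden state of size 8
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 8)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(8, 8)) * 0.1   # hidden-to-hidden weights (the "memory" path)
b_h = np.zeros(8)

h = np.zeros(8)                         # initial hidden state
for x_t in rng.normal(size=(5, 4)):     # a toy sequence of 5 time steps
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # the hidden state carries context forward
```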
Memory problems
• Basic RNNs not great at
long-term dependencies
but plenty of ways to
improve this
• Information gating
mechanisms
• Condensing input using
encoders
Gating mechanisms
• Gates regulate the flow of
information
• Very helpful: basic RNN cells are rarely
used on their own any more; gating is largely
responsible for RNNs’ recent popularity
• Add explicit mechanisms to remember
information and forget information
• Why use gates?
• Helps you learn long-term
dependencies
• Not all time points are equally relevant
– not everything has to be remembered
• Speeds up training/convergence
Gated recurrent
units (GRUs)
• GRUs were developed later
than LSTMs but are simpler
• Motivation is to get the main
benefits of LSTMs but with less
computation
• Reset gate: Mechanism to
decide when to remember vs.
forget/reset previous
information (hidden state)
• Update gate: Mechanism to
decide when to update
hidden state
GRU mechanics
• Reset gate controls how
much past info we use
• Rt = 0 means we are resetting
our RNN, not using any
previous information
• Rt = 1 means we use all of
previous information (back to
our normal vanilla RNN)
GRU mechanics
• Update gate controls whether
we bother updating our
hidden state using new
information
• Zt = 1 means you’re not
updating, you’re just using
previous hidden state
• Zt = 0 means you’re updating as
much as possible
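For reference, one standard way to write the GRU updates that matches the reset/update semantics described on these slides (note that some references swap the role of Zt):

```latex
\begin{aligned}
R_t &= \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r) && \text{(reset gate)}\\
Z_t &= \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z) && \text{(update gate)}\\
\tilde{H}_t &= \tanh\!\big(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h\big) && \text{(candidate hidden state)}\\
H_t &= Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t && \text{(blend old and new)}
\end{aligned}
```

With this convention, Rt = 1 recovers the vanilla RNN candidate, and Zt = 1 simply copies the previous hidden state forward.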
LSTM mechanics
• LSTMs add a memory unit to
further control the flow of
information through the cell
• Also whereas GRUs have 2
gates, an LSTM cell has 3
gates:
• An input gate – should I ignore
or consider the input?
• A forget gate – should I keep
or throw away the information
in memory?
• An output gate – how should I
use input, hidden state and
memory to output my next
hidden state?
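A minimal NumPy sketch of one LSTM step with the three gates described above; the parameter names and sizes are illustrative assumptions, not the notebook's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold parameters for the i/f/o gates and the candidate memory."""
    i = sigmoid(x_t @ W["i"] + h_prev @ U["i"] + b["i"])        # input gate: consider or ignore the input
    f = sigmoid(x_t @ W["f"] + h_prev @ U["f"] + b["f"])        # forget gate: keep or discard old memory
    o = sigmoid(x_t @ W["o"] + h_prev @ U["o"] + b["o"])        # output gate: how much memory to expose
    c_tilde = np.tanh(x_t @ W["c"] + h_prev @ U["c"] + b["c"])  # candidate memory
    c_t = f * c_prev + i * c_tilde                              # new memory cell
    h_t = o * np.tanh(c_t)                                      # new hidden state
    return h_t, c_t

# Illustrative sizes: 4 input features, hidden/memory size 8
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(4, 8)) * 0.1 for k in "ifoc"}
U = {k: rng.normal(size=(8, 8)) * 0.1 for k in "ifoc"}
b = {k: np.zeros(8) for k in "ifoc"}
h, c = np.zeros(8), np.zeros(8)
for x_t in rng.normal(size=(3, 4)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```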
GRUs vs. LSTMs
• GRUs are simpler + train
faster
• LSTMs more popular – can
give slightly better
performance, but GRU
performance often on par
• LSTMs would in theory
outperform GRUs in tasks
requiring very long-range
modelling
IV. Game of Thrones Language Model
Notebook
• ~30 mins
• Jupyter
notebook on
building an RNN-
based language
model
• Python 3 + Keras
for neural
networks
tinyurl.com/wbay5o3
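As a rough, hedged sketch of the kind of RNN-based language model the notebook builds with Keras; the vocabulary size and layer sizes here are illustrative placeholders, not the notebook's actual values:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 5000   # illustrative: number of distinct tokens in the corpus

model = Sequential([
    Embedding(vocab_size, 64),                    # token ids -> dense vectors
    LSTM(128),                                    # gated RNN reads the context window
    Dense(vocab_size, activation="softmax"),      # probability distribution over the next token
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# model.fit(X, y, ...)  # X: integer-encoded contexts, y: the next-token ids
```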
V. Components of SOTA RNN models
Encoder-Decoder architectures
• The problem so far: being forced to
immediately output a French word
for every English word
Encoder-Decoder architectures
• Tends to work a lot better than
using a single sequence-to-sequence
RNN to produce an output for each
input step
• You often need to
see the whole
sequence before
knowing what to
output
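A compressed Keras sketch of an RNN encoder-decoder in the spirit of this architecture; the vocabulary sizes, layer sizes and teacher-forcing setup are illustrative assumptions:

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

src_vocab, tgt_vocab, hidden = 8000, 9000, 256   # illustrative sizes

# Encoder: read the whole source sentence, keep only its final states
enc_in = Input(shape=(None,))
enc_emb = Embedding(src_vocab, 128)(enc_in)
_, state_h, state_c = LSTM(hidden, return_state=True)(enc_emb)

# Decoder: generate the target sequence, initialised with the encoder's states
dec_in = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, 128)(dec_in)
dec_out, _, _ = LSTM(hidden, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_pred = Dense(tgt_vocab, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], dec_pred)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```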
Bidirectionality in RNN encoder-decoders
• For the encoder,
bidirectional RNNs
(BRNNs) are often used
• BRNNs read the
input sequences
forwards and
backwards
Bidirectional
RNNs
• Process input
sequences in both
directions
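In Keras, bidirectionality is just a wrapper around the recurrent layer; a minimal illustration (the layer size is arbitrary):

```python
from tensorflow.keras.layers import Bidirectional, LSTM

# Reads the input sequence forwards and backwards and concatenates both hidden states
bi_encoder = Bidirectional(LSTM(128, return_sequences=True))
```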
The problem with RNN encoder-decoders
• Serious information
bottleneck
• Condense input
sequence down to a
small vector?!
• Memorise long
sequence + regurgitate
• Not how humans work
• Long computation
paths
Attention concept
• Has been very influential in
deep learning
• Originally developed for
MT (Bahdanau, 2014)
• As you’re producing your
output sequence, maybe
not every part of your input
is equally relevant
• Image captioning example
Lu et al. 2017. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
Attention intuition
• Attention allows the
network to refer back
to the input
sequence, instead of
forcing it to encode
all information into
one fixed-length
vector
• Encoder: a BRNN is used
to compute a rich set of
features about the source
words and their
surrounding words
• Decoder is asked to
choose which hidden
states to use and
ignore
• Weighted sum of
hidden states used to
predict the next word
Attention intuition
• Decoder RNN uses
attention parameters
to decide how much
to pay attention to
different parts of the
input
• Allows the model to
amplify the signal
from relevant parts of
the input sequence
• This improves
modelling
Main benefits
• Encoder passes a lot
more data to
the decoder
• Not just last hidden
state
• Passes all hidden states
at every time step
• Computation path
problem: relevant
information is now
closer by
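To make the weighted sum concrete, here is a minimal NumPy sketch of how a decoder step could combine encoder hidden states; the dot-product scoring used here is one simple choice (Bahdanau-style attention scores with a small feed-forward network instead):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Weighted sum of encoder hidden states, weighted by relevance to the current decoder state."""
    scores = encoder_states @ decoder_state        # one relevance score per source position
    weights = softmax(scores)                      # positive attention weights that sum to 1
    return weights @ encoder_states, weights       # context vector + the weights themselves

rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(6, 8))           # 6 source positions, hidden size 8
decoder_state = rng.normal(size=8)
context, weights = attention_context(decoder_state, encoder_states)
```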
Summary so far
• Sequence modelling
• Recurrent neural
networks
• Some key components
of SOTA RNN-based
models:
• Gating mechanisms
(GRUs and LSTMs)
• Encoder-decoders
• Bidirectional encoding
• Attention
VI. Transformers and self-attention
Transformers are taking over NLP
• Translation, language
models, question
answering, summarisation,
etc.
• Some of the best word
embeddings are based on
Transformers
• BERT, ELMo, OpenAI GPT-2
models
A single Transformer encoder block
• No recurrence, no convolutions
• “Attention is all you need” paper
• The core concept is the self-
attention mechanism
• Much more parallelisable than
RNN-based models, which
means faster training
Self-attention is a
sequence-to-sequence
operation
• At the highest level – self-
attention takes t input
vectors and outputs t
output vectors
• Take input embedding for
“the” and update it by
incorporating
information from its
context
How is the vector for “the” updated?
• Each output vector
is a weighted sum
of the input vectors
• But all of these
weights are
different
These are not learned weights in the
traditional neural network sense
• The weights are
calculated by taking
dot products
• Can use different
functions over input
Example calculation of a single weight
Calculating a weight matrix row
Attention weight matrix
• The dot product can be
anything (negative infinity to
positive infinity)
• We normalise by length
• We softmax this so that the
weights are positive values
summing to 1
• Attention weight matrix
summarises relationship
between words
• Because dot products capture
similarity between vectors
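Putting the recipe above together (dot products, scaling, softmax), a minimal NumPy sketch of basic self-attention without learned projections; scaling by the square root of the vector dimension is the usual reading of the "normalise by length" step:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """X: (t, d) input vectors -> (t, d) output vectors, each a weighted sum of all the inputs."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # raw dot products, scaled by the vector dimension
    weights = softmax(scores, axis=-1)  # each row is positive and sums to 1
    return weights @ X                  # updated representation for every position

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))             # e.g. embeddings for a 4-word sentence
Y = self_attention(X)
```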
Multi-headed attention
• Attention weight matrix
captures relationship
between words
• But there’s many
different ways words can
be related
• And which ones you want
to capture depends on
your task
• Different attention heads
learn different relations
between word pairs
Difference to RNNs
• Whereas RNNs update context
token by token by updating an
internal hidden state, self-
attention captures context by
updating all word representations
simultaneously
• Lower computational complexity,
scales better with more data
• More parallelisable = faster
training
Connecting all
these concepts
• “Useful” input representations are
learned
• “Useful” weights for transforming
input vectors are learned
• These quantities should produce
“useful” dot products
• That lead to “useful” updated input
vectors
• That lead to “useful” input to the
feed-forward network layer
• … etc. … that eventually lead to
lower overall loss on the training set
Summary
I. Introduction to sequence modelling
II. Quick neural network review
• How a single neuron functions
• Feed-forward networks
III. Recurrent neural networks
• From feed-forward networks to recurrence
• RNNs with gating mechanisms
IV. Practical: Building a language model for Game of Thrones
V. Components of state-of-the-art RNN models
• Encoder-decoder models
• Bidirectionality
• Attention
VI. Transformers and self-attention
Further Reading
• More accessible: Andrew Ng
Sequence Course on Coursera
• https://www.coursera.org/learn/nlp-sequence-models
• More technical: Deep Learning book
by Goodfellow et al.
• https://www.deeplearningbook.org/contents/rnn.html
• Also: Alex Smola Berkeley Lectures
• https://www.youtube.com/user/smolix/videos
Just for fun
• Talk to transformer
• https://talktotransformer.com/
• Using OpenAI’s “too
dangerous to release” GPT-
2 language model
Thanks, questions?
Extra slides
Sequences in natural language
• Sequence modelling is very popular in
NLP because language is sequential by
nature
• Text
• Sequences of words
• Sequences of characters
• We process text sequentially, though in
principle we could see all words at once
• Speech
• Sequence of amplitudes over time
• Frequency spectrogram over time
• Extracted frequency features over time
Sequences in biology
• Genomics, DNA and
RNA sequences
• Proteomics, protein
sequences,
structural biology
• Trying to represent
sequences in some
way, or predict some
function or
association of the
sequence
Sequences in finance
• Lots of time series data
• Numerical sequences (stocks,
indices)
• Lots of forecasting work –
predicting the future (trading
strategies)
• Deep learning for these
sequences perhaps not as
popular as you might think
• Quite well-developed methods
based on classical statistics,
interpretability important
Single neuron computation
• What computation is
happening inside 1
neuron?
• If you understand how 1
neuron computes output
given input, it’s a small
step to understand how an
entire network computes
output given input
Perceptrons
• Modelling a binary outcome using
binary input features
• Should I have a cup of tea?
• 0 = no
• 1 = yes
• Three features with 1 weight each:
• Do they have Earl Grey?
• earl_grey, w₁ = 3
• Have I just had a cup of tea?
• already_had, w₂ = -1
• Can I get it to go?
• to_go, w₃ = 2
Perceptrons
• Here weights are
cherry-picked, but
perceptrons learn these
weights automatically
from training data by
shifting parameters to
minimise error
Perceptrons
• Formalising the perceptron
calculation
• Instead of a threshold, more
common to see a bias term
• Instead of writing out the
sums using sigma notation,
more common to see dot
products.
• Vectorisation for efficiency
• Here, I manually chose these
values – but given a dataset of
past inputs/outputs, you could
learn the optimal parameter
values
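Using the weights from the tea example, a minimal NumPy version of the perceptron decision; the bias value is an illustrative stand-in for the threshold, which the slides leave unspecified:

```python
import numpy as np

w = np.array([3.0, -1.0, 2.0])   # earl_grey, already_had, to_go (weights from the slide)
b = -2.0                          # illustrative bias (negative threshold); not from the slide

def perceptron(x):
    """Binary decision: 1 = have the tea, 0 = don't."""
    return int(np.dot(w, x) + b > 0)

print(perceptron(np.array([1, 0, 1])))  # Earl Grey available, haven't had one, can take away -> 1
print(perceptron(np.array([0, 1, 0])))  # no Earl Grey, already had a cup -> 0
```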
Sigmoid neurons
• Want to handle continuous
values
• Where input can be
something other than just 0 or
1
• Where output can be
something other than just 0 or
1
• We put the weighted sum of
inputs through an activation
function
• Sigmoid or logistic function
Sigmoid neurons
• The sigmoid function is
basically a smoothed out
perceptron!
• Output no longer a
sudden jump
• It’s the smoothness of the
function that we care
about
Activation functions
• Which activation function
to use?
• Heuristics based on
experiments, not proof-
based
More layers!
• Increase
number of
layers to
increase
capacity for
abstraction,
hierarchical
processing of
input
Training on big window sizes
• How big should the window be? On a very long sequence, the unrolled
RNN becomes a very deep network
• Same problems with vanishing/exploding gradients as normal
networks
• And takes a longer time to train
• The normal tricks can help – good initialization of parameters, non-
saturating activation functions, gradient clipping, batch norm
• Training over a limited number of steps – truncated
backpropagation through time
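As a small, hedged example of two of those tricks in practice: gradient clipping can be requested directly on a Keras optimizer, and truncated backpropagation through time usually amounts to training on fixed-length windows of the sequence (the window length here is illustrative):

```python
from tensorflow.keras.optimizers import Adam

# Clip gradients by global norm to tame exploding gradients on long unrolled sequences
optimizer = Adam(learning_rate=1e-3, clipnorm=1.0)

# Truncated BPTT in practice: split a long token sequence into fixed-length training windows
def make_windows(tokens, window=50):
    return [tokens[i:i + window + 1] for i in range(0, len(tokens) - window, window)]
```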
LSTM mechanics
• Input, forget, output gates are
little neural networks within the
cell
• Memory being updated via
forget gate and candidate
memory
• Hidden state being updated by
output gate, which weighs up all
information
Query, Key, and Value transformations
• Notice that we are using
each input vector on 3
separate occasions
• E.g. vector x2
1. To take dot products
with each other input
vector when calculating
y2
2. In the dot products taken
when the other output vectors
(y1, y3, y4) are calculated
3. And in the weighted
sum to produce output
vector y2
Query, Key, and Value transformations
• To model these 3
different functions for
each input vector, and
give the model extra
expressivity and
flexibility, we are going
to modify the input
vectors
• Apply simple linear
transformations
Input transformation
matrices
• These weight matrices
are learnable
parameters
• Gives something else
to learn by gradient
descent
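Extending the earlier self-attention sketch with the query/key/value projections described here; the projection matrices are random stand-ins for parameters that a real model would learn by gradient descent:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_qkv(X, W_q, W_k, W_v):
    """Project each input into query, key and value roles, then attend."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # scaled dot products + softmax
    return weights @ V                                          # weighted sum of value vectors

rng = np.random.default_rng(4)
d = 8
X = rng.normal(size=(4, d))                                     # 4 tokens, dimension 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))  # learnable in a real model
Y = self_attention_qkv(X, W_q, W_k, W_v)
```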