Practical Deep Learning for NLP
Maarten Versteegh
NLP Research Engineer
Overview
● Deep Learning Recap
● Text classification:
– Convnet with word embeddings
● Sentiment analysis:
– ResNet
● Tips and tricks
What is this deep learning thing again?
[Diagram: feedforward network with input, hidden, and output layers; activations flow forward, the error signal propagates back]
Rectified Linear Units
Backpropagation involves repeated multiplication by the derivative of the activation function.
→ Problem if that derivative is always smaller than 1 (as with sigmoid or tanh): the gradient shrinks with every layer. The ReLU derivative is exactly 1 for positive inputs, which avoids this.
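A rough numpy illustration of the two derivatives, just for intuition: the sigmoid derivative never exceeds 0.25, while the ReLU derivative is 1 wherever the unit is active.

import numpy as np

# Compare activation-function derivatives that get multiplied together during backpropagation
x = np.linspace(-3, 3, 7)
sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # at most 0.25, so products over many layers vanish
d_relu = (x > 0).astype(float)          # exactly 1 for active units, so gradients pass through
print(d_sigmoid.max(), d_relu.max())    # ~0.25 vs 1.0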
Text Classification
Traditional approach: BOW + TFIDF
“The car might also need a front end alignment”
"alignment" (0.323)
"also" (0.137)
"car" (0.110)
"end" (0.182)
"front" (0.167)
"might" (0.178)
"need" (0.157)
"the" (0.053)
"also need" (0.343)
"car might" (0.358)
"end alignment" (0.358)
"front end" (0.296)
"might also" (0.358)
"need front" (0.358)
"the car" (0.161)
20 newsgroups performance
Model | F1-Score*
BOW+TFIDF+SVM | Some number
(*) Scores removed
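For reference, a hedged sketch of this baseline with scikit-learn; the vectorizer and SVM hyperparameters here are assumptions, not the settings behind the reported score.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Unigram + bigram TF-IDF features feeding a linear SVM, evaluated with macro F1
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC())
pipeline.fit(train.data, train.target)
print(f1_score(test.target, pipeline.predict(test.data), average='macro'))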
Deep Learning 1: Replace Classifier
[Diagram: BOW features (x 1000) → hidden layer (x 512) → hidden layer (x 256) → output]
from keras.layers import Input, Dense
from keras.models import Model
from keras.utils.np_utils import to_categorical

# 1000-dimensional BOW features -> two fully connected ReLU layers -> softmax
input_layer = Input(shape=(1000,))
fc_1 = Dense(512, activation='relu')(input_layer)
fc_2 = Dense(256, activation='relu')(fc_1)
output_layer = Dense(10, activation='softmax')(fc_2)  # softmax over the target classes

model = Model(input=input_layer, output=output_layer)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# categorical_crossentropy expects one-hot targets
model.fit(bow, to_categorical(newsgroups.target))
predictions = model.predict(features).argmax(axis=1)
20 newsgroups performance
Model | F1-Score*
BOW+TFIDF+SVM | Some number
BOW+TFIDF+SVD + 2-layer NN | Some slightly higher number
(*) Scores removed
What about the deep learning promise?
Convolutional Networks
[Figure: convolution operation, Source: Andrej Karpathy]
[Figure: pooling layer, Source: Andrej Karpathy]
Convolutional networks
Source: Y. Kim (2014) Convolutional Neural Networks for Sentence Classification
Word embedding
from keras.layers import Input, Embedding
# embedding_matrix: ndarray(vocab_size, embedding_dim)
input_layer = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
layer = Embedding(
    embedding_matrix.shape[0],     # vocabulary size
    embedding_matrix.shape[1],     # embedding dimension
    weights=[embedding_matrix],    # initialize with pre-trained word vectors
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False                # keep the embeddings fixed
)(input_layer)
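The code above takes embedding_matrix as given. A hedged sketch of one way to build it from pre-trained GloVe vectors; the file name and the word_index mapping (word → integer id, e.g. from a tokenizer) are assumptions.

import numpy as np

# Read pre-trained 300-dimensional GloVe vectors into a dict
embeddings = {}
with open('glove.6B.300d.txt') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')

# Row i holds the vector for the word with id i; unknown words stay all-zero
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        embedding_matrix[i] = vector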
from keras.layers import Convolution1D, MaxPooling1D, BatchNormalization, Activation

layer = Embedding(...)(input_layer)
layer = Convolution1D(
    128,                 # number of filters
    5,                   # filter size
    activation='relu',
)(layer)
layer = MaxPooling1D(5)(layer)
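A hedged sketch of how this stack might be capped with a classifier; the layer sizes and n_classes are assumptions for illustration, not taken from the original model.

from keras.layers import Flatten, Dense
from keras.models import Model

layer = Flatten()(layer)                                       # flatten the pooled feature maps
layer = Dense(128, activation='relu')(layer)
output_layer = Dense(n_classes, activation='softmax')(layer)   # n_classes: number of target categories (assumption)

model = Model(input=input_layer, output=output_layer)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])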
Performance
Model | F1-Score*
BOW+TFIDF+SVM | Some number
BOW+TFIDF+SVD + 2-layer NN | Some slightly higher number
ConvNet (3 layers) | Quite a bit higher now
ConvNet (6 layers) | Look mom, even higher!
(*) Scores removed
Sentiment Analysis
Data Set
Facebook posts from media organizations: CNN, MSNBC, NYTimes, The Guardian, Buzzfeed, Breitbart, Politico, The Wall Street Journal, Washington Post, Baltimore Sun
Measure sentiment as “reactions”
Title | Org | Like | Love | Wow | Haha | Sad | Angry
Poll: Clinton up big on Trump in Virginia | CNN | 4176 | 601 | 17 | 211 | 11 | 83
It's a fact: Trump has tiny hands. Will this be the one that sinks him? | Guardian | 595 | 17 | 17 | 225 | 2 | 8
Donald Trump Explains His Obama-Founded-ISIS Claim as ‘Sarcasm’ | NYTimes | 2059 | 32 | 284 | 1214 | 80 | 2167
Can hipsters stomach the unpalatable truth about avocado toast? | Guardian | 3655 | 0 | 396 | 44 | 773 | 69
Tim Kaine skewers Donald Trump's military policy | MSNBC | 1094 | 111 | 6 | 12 | 2 | 26
Top 5 Most Antisemitic Things Hillary Clinton Has Done | Breitbart | 1067 | 7 | 134 | 35 | 22 | 372
17 Hilarious Tweets About Donald Trump Explaining Movies | Buzzfeed | 11390 | 375 | 16 | 4121 | 4 | 5
Go deeper: ResNet
Convolutional Layers with shortcuts
He et al. (2015) Deep Residual Learning for Image Recognition
Go deeper: ResNet
from keras.layers import merge

input_layer = ...  # assumed to already have 128 channels, e.g. after an initial Convolution1D(128, ...)
# border_mode='same' keeps the sequence length unchanged so the shortcut sum works
layer = Convolution1D(128, 5, activation='linear', border_mode='same')(input_layer)
layer = BatchNormalization()(layer)
layer = Activation('relu')(layer)
layer = Convolution1D(128, 5, activation='linear', border_mode='same')(layer)
layer = BatchNormalization()(layer)
layer = Activation('relu')(layer)
# shortcut connection: add the block input to the block output
block_output = merge([layer, input_layer], mode='sum')
block_output = Activation('relu')(block_output)
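A hedged sketch of wrapping this block in a helper so several residual blocks can be stacked; the block count and default sizes are assumptions.

from keras.layers import Convolution1D, BatchNormalization, Activation, merge

def resnet_block(block_input, nb_filter=128, filter_length=5):
    # two conv / batch-norm / ReLU stages plus an identity shortcut
    layer = Convolution1D(nb_filter, filter_length, activation='linear',
                          border_mode='same')(block_input)
    layer = BatchNormalization()(layer)
    layer = Activation('relu')(layer)
    layer = Convolution1D(nb_filter, filter_length, activation='linear',
                          border_mode='same')(layer)
    layer = BatchNormalization()(layer)
    shortcut = merge([layer, block_input], mode='sum')
    return Activation('relu')(shortcut)

layer = input_layer            # must already have nb_filter channels
for _ in range(3):             # number of blocks is an assumption
    layer = resnet_block(layer)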
[Architecture diagram: Title + Message (e.g. "It's a fact: Trump has tiny hands.") → word embeddings (EMBEDDING_DIM=300) → Conv (128) x 10 → ResNet Block … ResNet Block → MaxPooling → Dense; News Org (e.g. The Guardian, 1-of-K encoded); both branches feed a final Dense layer that predicts the reaction distribution (%)]
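A hedged sketch of that two-input layout in the Keras 1.x functional API; text_features (the pooled output of the text branch), text_input, N_ORGS, and the layer sizes are placeholders for illustration.

from keras.layers import Input, Dense, merge
from keras.models import Model

org_input = Input(shape=(N_ORGS,))                    # 1-of-K encoded news organization
combined = merge([text_features, org_input], mode='concat')
combined = Dense(128, activation='relu')(combined)
reactions = Dense(6, activation='softmax')(combined)  # distribution over the six reaction types

model = Model(input=[text_input, org_input], output=reactions)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')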
Cherry-picked predicted response distribution*
Sentence | Org | Love | Haha | Wow | Sad | Angry
Trump wins the election | Guardian | 3% | 9% | 7% | 32% | 49%
Trump wins the election | Breitbart | 58% | 30% | 8% | 1% | 3%
*Your mileage may vary. By a lot. I mean it.
Tips and Tricks
Initialization
● Break symmetry:
– Never ever initialize all your weights to the same value
● Let initialization depend on the activation function (see the sketch below):
– ReLU/PReLU → He Normal
– sigmoid/tanh → Glorot Normal
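A minimal sketch using the Keras 1.x init argument; some_input and the layer sizes are placeholders.

from keras.layers import Dense

relu_layer = Dense(512, activation='relu', init='he_normal')(some_input)
tanh_layer = Dense(512, activation='tanh', init='glorot_normal')(some_input)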
Choose an adaptive optimizer
[Figure: comparison of optimizers, Source: Alec Radford]
Choose the right model size
● Start small and keep adding layers
– Check if test error keeps going down
● Cross-validate over the number of units
● You want to be able to overfit
Y. Bengio (2012) Practical recommendations for gradient-based training of deep architectures
Don't be scared of overfitting
● If your model can't overfit, it also can't learn enough
● So, check that your model can overfit:
– If not, make it bigger
– If so, get more data and/or regularize
Source: Wikipedia
Regularization
● Norm penalties on hidden-layer weights, never on the first and last layers
● Dropout
● Early stopping
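A hedged sketch combining the three techniques above in Keras 1.x; hidden, X_train, y_train, and all hyperparameter values are placeholders.

from keras.layers import Dense, Dropout
from keras.regularizers import l2
from keras.callbacks import EarlyStopping

layer = Dense(512, activation='relu', W_regularizer=l2(0.001))(hidden)  # norm penalty on a hidden layer
layer = Dropout(0.5)(layer)                                             # dropout
early_stop = EarlyStopping(monitor='val_loss', patience=3)              # early stopping
model.fit(X_train, y_train, validation_split=0.1, callbacks=[early_stop])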
Size of data set
● Just get more data already
● Augment data:
– Textual replacements
– Word vector perturbation (see the sketch after this list)
– Noise Contrastive Estimation
● Semi-supervised learning:
– Adapt word embeddings to your domain
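One way to read the word-vector-perturbation item: add small Gaussian noise to embedded inputs to generate extra training examples. A rough illustration; the noise scale is an assumption to tune per dataset.

import numpy as np

def perturb(embedded_batch, scale=0.01):
    # embedded_batch: ndarray(batch, sequence_length, embedding_dim)
    noise = np.random.normal(0.0, scale, size=embedded_batch.shape)
    return embedded_batch + noise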
Monitor your model
Training loss:
– Does the model converge?
– Is the learning rate too low or too high?
[Figure: training loss and learning rate, Source: Andrej Karpathy]
Monitor your model
Training and validation accuracy:
– Is there a large gap?
– Does the training accuracy increase while the validation accuracy decreases?
[Figure: training and validation accuracy, Source: Andrej Karpathy]
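A minimal sketch of getting these curves from Keras: fit() returns a History object whose history dict holds per-epoch metrics when validation data is supplied; X_train and y_train are placeholders.

import matplotlib.pyplot as plt

history = model.fit(X_train, y_train, validation_split=0.1, nb_epoch=20)
plt.plot(history.history['acc'], label='train')          # requires metrics=['accuracy'] at compile time
plt.plot(history.history['val_acc'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()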
Monitor your model
● Ratio of weights to updates
● Distribution of activations and gradients (per layer)
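A framework-agnostic sketch of the weights-to-updates check: compare a layer's weight matrix before and after one update step; an update-to-weight norm ratio around 1e-3 is a common heuristic, much larger or smaller suggests the learning rate is off.

import numpy as np

def update_weight_ratio(weights_before, weights_after):
    # ratio of the size of one update to the size of the weights themselves
    update = weights_after - weights_before
    return np.linalg.norm(update) / (np.linalg.norm(weights_before) + 1e-8)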
Hyperparameter optimization
After network architecture, continue with:
– Regularization strength
– Initial learning rate
– Optimization strategy (and LR decay schedule)
Friends don't let friends do a full grid search!
– Use a smart strategy like Bayesian optimization or Particle Swarm Optimization (Spearmint, SMAC, Hyperopt, Optunity)
– Even random search often beats grid search (sketched below)
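A hedged sketch of plain random search; train_and_evaluate is a hypothetical helper that trains a model with the given configuration and returns a validation score, and the search ranges are assumptions.

import numpy as np

best_config, best_score = None, -np.inf
for _ in range(20):
    config = {
        'lr': 10 ** np.random.uniform(-4, -2),        # log-uniform learning rate
        'dropout': np.random.uniform(0.2, 0.6),
        'l2': 10 ** np.random.uniform(-5, -2),
    }
    score = train_and_evaluate(config)                # hypothetical helper, not a real API
    if score > best_score:
        best_config, best_score = config, score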
Keep up to date: arxiv-sanity.com
We are hiring!
DevOps & Front-end
NLP engineers
Full-stack Python engineers
www.textkernel.com/jobs
Questions?
Source: http://visualqa.org/
