
Practical Deep Learning for NLP



Presentation by Maarten Versteegh, NLP Research Engineer at Textkernel, at the PyData Meetup (https://www.meetup.com/PyData-NL/events/232899698/).



  1. Practical Deep Learning for NLP. Maarten Versteegh, NLP Research Engineer
  2. Overview
     ● Deep learning recap
     ● Text classification: convnet with word embeddings
     ● Sentiment analysis: ResNet
     ● Tips and tricks
  3. What is this deep learning thing again?
  4. [Diagram of a simple feedforward network: input, hidden, and output layers, with the activation function and error signal indicated]
  5. Rectified Linear Units. Backpropagation involves repeated multiplication with the derivative of the activation function → a problem if that derivative is always smaller than 1 (the gradient vanishes)!
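A quick numerical illustration of that point (a sketch of mine, not from the slides): the sigmoid's derivative never exceeds 0.25, so backpropagated products shrink exponentially with depth, while ReLU's derivative is exactly 1 for positive inputs.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = 0.5
    sig_grad = sigmoid(x) * (1 - sigmoid(x))   # <= 0.25 for any x
    relu_grad = 1.0                            # for any x > 0

    for n_layers in (5, 20, 50):
        # Product of derivatives across n_layers layers
        print(n_layers, sig_grad ** n_layers, relu_grad ** n_layers)
    # The sigmoid product collapses toward zero; the ReLU product stays 1.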
  6. Text Classification
  7. Traditional approach: BOW + TFIDF
     “The car might also need a front end alignment”
     Unigrams: "alignment" (0.323), "also" (0.137), "car" (0.110), "end" (0.182), "front" (0.167), "might" (0.178), "need" (0.157), "the" (0.053)
     Bigrams: "also need" (0.343), "car might" (0.358), "end alignment" (0.358), "front end" (0.296), "might also" (0.358), "need front" (0.358), "the car" (0.161)
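A minimal scikit-learn sketch of this baseline (my addition; the slide only shows the resulting weights, and X_train/y_train in the final comment are hypothetical):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["The car might also need a front end alignment"]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams + bigrams
    X = vectorizer.fit_transform(docs)
    for term, weight in zip(vectorizer.get_feature_names(), X.toarray()[0]):
        print(term, weight)
    # Downstream: LinearSVC().fit(X_train, y_train) gives the SVM baseline.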
  8. 20 newsgroups performance
     Model           F1-Score*
     BOW+TFIDF+SVM   Some number
     (*) Scores removed
  9. Deep Learning 1: Replace the classifier. [Diagram: 1000 BOW features → hidden layer (512) → hidden layer (256) → output]
 10.
    from keras.layers import Input, Dense
    from keras.models import Model
    from keras.utils.np_utils import to_categorical

    input_layer = Input(shape=(1000,))
    fc_1 = Dense(512, activation='relu')(input_layer)
    fc_2 = Dense(256, activation='relu')(fc_1)
    output_layer = Dense(10, activation='softmax')(fc_2)

    model = Model(input=input_layer, output=output_layer)
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    # categorical_crossentropy expects one-hot targets
    model.fit(bow, to_categorical(newsgroups.target))
    predictions = model.predict(features).argmax(axis=1)  # features: test-set BOW
 11. 20 newsgroups performance
     Model                      F1-Score*
     BOW+TFIDF+SVM              Some number
     BOW+TFIDF+SVD+2-layer NN   Some slightly higher number
     (*) Scores removed
 12. What about the deep learning promise?
 13. Convolutional networks. Source: Andrej Karpathy
 14. Pooling layer. Source: Andrej Karpathy
 15. Convolutional networks. Source: Y. Kim (2014), Convolutional Neural Networks for Sentence Classification
 16. Word embedding
 17.
    from keras.layers import Embedding

    # embedding_matrix: ndarray of shape (vocab_size, embedding_dim)
    input_layer = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    layer = Embedding(
        embedding_matrix.shape[0],   # vocabulary size
        embedding_matrix.shape[1],   # embedding dimension
        weights=[embedding_matrix],
        input_length=MAX_SEQUENCE_LENGTH,
        trainable=False              # keep the pretrained vectors fixed
    )(input_layer)
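The slide assumes embedding_matrix already exists. One common way to build it from pretrained vectors (a sketch of mine; the GloVe file name and the word_index mapping, e.g. from a fitted keras Tokenizer, are assumptions):

    import numpy as np

    EMBEDDING_DIM = 300
    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    with open('glove.6B.300d.txt') as f:   # "word v1 v2 ... v300" per line
        for line in f:
            values = line.split()
            word = values[0]
            if word in word_index:
                embedding_matrix[word_index[word]] = np.asarray(
                    values[1:], dtype='float32')
    # Words missing from the pretrained vocabulary stay all-zero.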
 18.
    from keras.layers import Convolution1D, MaxPooling1D, BatchNormalization, Activation

    layer = Embedding(...)(input_layer)
    layer = Convolution1D(
        128,   # number of filters
        5,     # filter size
        activation='relu',
    )(layer)
    layer = MaxPooling1D(5)(layer)
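To turn these pieces into a complete classifier, something along these lines would follow (my sketch, not from the slides; the depth and layer sizes are illustrative, 20 = number of newsgroup classes):

    from keras.layers import Flatten, Dense
    from keras.models import Model

    layer = Convolution1D(128, 5, activation='relu')(layer)
    layer = MaxPooling1D(5)(layer)
    layer = Flatten()(layer)
    layer = Dense(128, activation='relu')(layer)
    output_layer = Dense(20, activation='softmax')(layer)

    model = Model(input=input_layer, output=output_layer)
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])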
 19. Performance
     Model                F1-Score*
     BOW+TFIDF+SVM        Some number
     CBOW+TFIDF+SVD+NN    Some slightly higher number
     ConvNet (3 layers)   Quite a bit higher now
     ConvNet (6 layers)   Look mom, even higher!
     (*) Scores removed
 20. Sentiment Analysis
 21. Data set
     Facebook posts from media organizations: CNN, MSNBC, NYTimes, The Guardian, Buzzfeed, Breitbart, Politico, The Wall Street Journal, Washington Post, Baltimore Sun
     Measure sentiment as “reactions”
 22. Title                                                                     Org        Like   Love  Wow  Haha  Sad  Angry
     Poll: Clinton up big on Trump in Virginia                                 CNN        4176   601   17   211   11   83
     It's a fact: Trump has tiny hands. Will this be the one that sinks him?   Guardian   595    17    17   225   2    8
     Donald Trump Explains His Obama-Founded-ISIS Claim as ‘Sarcasm’           NYTimes    2059   32    284  1214  80   2167
     Can hipsters stomach the unpalatable truth about avocado toast?           Guardian   3655   0     396  44    773  69
     Tim Kaine skewers Donald Trump's military policy                          MSNBC      1094   111   6    12    2    26
     Top 5 Most Antisemitic Things Hillary Clinton Has Done                    Breitbart  1067   7     134  35    22   372
     17 Hilarious Tweets About Donald Trump Explaining Movies                  Buzzfeed   11390  375   16   4121  4    5
 23. Go deeper: ResNet. Convolutional layers with shortcut connections. He et al. (2015), Deep Residual Learning for Image Recognition
 24. Go deeper: ResNet
    from keras.layers import merge

    input_layer = ...
    # border_mode='same' keeps the sequence length unchanged so that
    # the shortcut sum at the end of the block lines up
    layer = Convolution1D(128, 5, activation='linear',
                          border_mode='same')(input_layer)
    layer = BatchNormalization()(layer)
    layer = Activation('relu')(layer)
    layer = Convolution1D(128, 5, activation='linear',
                          border_mode='same')(layer)
    layer = BatchNormalization()(layer)
    layer = Activation('relu')(layer)
    block_output = merge([layer, input_layer], mode='sum')
    block_output = Activation('relu')(block_output)
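Wrapped in a helper, blocks like this stack to the depth shown on the next slide (my sketch; the function name is mine, and it reuses the imports from the slides above):

    def resnet_block(block_input, n_filters=128, filter_size=5):
        # One residual block in the style of the slide above
        layer = Convolution1D(n_filters, filter_size, activation='linear',
                              border_mode='same')(block_input)
        layer = BatchNormalization()(layer)
        layer = Activation('relu')(layer)
        layer = Convolution1D(n_filters, filter_size, activation='linear',
                              border_mode='same')(layer)
        layer = BatchNormalization()(layer)
        layer = Activation('relu')(layer)
        block_output = merge([layer, block_input], mode='sum')
        return Activation('relu')(block_output)

    layer = resnet_block(layer)   # repeat as deep as the data supports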
 25. [Architecture diagram: the post title + message (example: “It's a fact: Trump has tiny hands.”) enters as word embeddings (EMBEDDING_DIM=300), then Conv (128), then 10 ResNet blocks and MaxPooling; the news org (example: The Guardian) enters 1-of-K encoded; both meet in Dense layers that output the reaction distribution (%)]
 26. Cherry-picked predicted response distribution*
     Sentence                  Org        Love  Haha  Wow  Sad  Angry
     Trump wins the election   Guardian   3%    9%    7%   32%  49%
     Trump wins the election   Breitbart  58%   30%   8%   1%   3%
     *Your mileage may vary. By a lot. I mean it.
 27. Tips and Tricks
 28. Initialization
     ● Break symmetry: never, ever initialize all your weights to the same value
     ● Let initialization depend on the activation function (see the sketch below):
       – ReLU/PReLU → He Normal
       – sigmoid/tanh → Glorot Normal
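In Keras 1-style code this is a per-layer argument (a minimal sketch of mine; the layer sizes are illustrative):

    from keras.layers import Dense

    # He normal pairs with ReLU-family activations,
    # Glorot normal with sigmoid/tanh
    relu_layer = Dense(512, activation='relu', init='he_normal')
    tanh_layer = Dense(512, activation='tanh', init='glorot_normal')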
 29. Choose an adaptive optimizer. Source: Alec Radford
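For instance (a sketch; the learning rate is illustrative, and model is the Keras model built earlier):

    from keras.optimizers import Adam

    # Adaptive optimizers (Adam, RMSprop, Adagrad, ...) tune per-parameter
    # step sizes and need less learning-rate fiddling than plain SGD
    model.compile(optimizer=Adam(lr=1e-3),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])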
 30. Choose the right model size
     ● Start small and keep adding layers: check whether the test error keeps going down
     ● Cross-validate over the number of units
     ● You want to be able to overfit
     Y. Bengio (2012), Practical recommendations for gradient-based training of deep architectures
 31. Don't be scared of overfitting
     ● If your model can't overfit, it also can't learn enough
     ● So check that your model can overfit:
       – If not, make it bigger
       – If so, get more data and/or regularize
     Source: Wikipedia
 32. Regularization
     ● Norm penalties on hidden-layer weights, never on the first and last layers
     ● Dropout
     ● Early stopping
     (All three appear in the sketch below.)
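A Keras 1-style sketch combining the three (my addition; the penalty, dropout rate, patience, and the bow/labels arrays are illustrative):

    from keras.layers import Input, Dense, Dropout
    from keras.models import Model
    from keras.regularizers import l2
    from keras.callbacks import EarlyStopping

    input_layer = Input(shape=(1000,))
    hidden = Dense(256, activation='relu',
                   W_regularizer=l2(1e-4))(input_layer)   # norm penalty
    hidden = Dropout(0.5)(hidden)                         # dropout
    output_layer = Dense(10, activation='softmax')(hidden)

    model = Model(input=input_layer, output=output_layer)
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    model.fit(bow, labels, validation_split=0.1,          # early stopping
              callbacks=[EarlyStopping(monitor='val_loss', patience=3)])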
 33. Size of data set
     ● Just get more data already
     ● Augment data:
       – Textual replacements
       – Word vector perturbation (see the sketch below)
       – Noise Contrastive Estimation
     ● Semi-supervised learning: adapt word embeddings to your domain
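Word vector perturbation can be as simple as adding small Gaussian noise to the embedded inputs (my sketch; the noise scale is illustrative and worth tuning):

    import numpy as np

    def perturb_embeddings(embedded_batch, scale=0.01):
        # embedded_batch: ndarray (batch, seq_len, embedding_dim)
        # Returns a noisy copy; train on original plus perturbed copies
        noise = np.random.normal(0.0, scale, embedded_batch.shape)
        return embedded_batch + noise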
 34. Monitor your model: training loss
     – Does the model converge?
     – Is the learning rate too low or too high?
 35. Training loss and learning rate. Source: Andrej Karpathy
 36. Monitor your model: training and validation accuracy
     – Is there a large gap?
     – Does the training accuracy increase while the validation accuracy decreases?
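Keras records these curves in the History object returned by fit (a minimal sketch, assuming the model was compiled with metrics=['accuracy']):

    # A widening train/validation gap suggests overfitting
    history = model.fit(X, y, validation_split=0.1, nb_epoch=20)
    for epoch, (acc, val_acc) in enumerate(zip(history.history['acc'],
                                               history.history['val_acc'])):
        print(epoch, acc, val_acc, acc - val_acc)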
 37. Training and validation accuracy. Source: Andrej Karpathy
 38. Monitor your model
     ● Ratio of weights to updates (see the sketch below)
     ● Distribution of activations and gradients (per layer)
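The weight-to-update ratio can be estimated by snapshotting the weights around one training step (a sketch of mine; X_batch/y_batch are hypothetical, and a ratio around 1e-3 is a common rule of thumb):

    import numpy as np

    weights_before = [w.copy() for w in model.get_weights()]
    model.train_on_batch(X_batch, y_batch)
    for before, after in zip(weights_before, model.get_weights()):
        weight_scale = np.linalg.norm(before)
        update_scale = np.linalg.norm(after - before)
        if weight_scale > 0:
            print(update_scale / weight_scale)   # ~1e-3 is a healthy ballpark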
 39. Hyperparameter optimization
     After the network architecture, continue with:
     – Regularization strength
     – Initial learning rate
     – Optimization strategy (and LR decay schedule)
 40. Friends don't let friends do a full grid search!
 41. Hyperparameter optimization
     Friends don't let friends do a full grid search!
     – Use a smart strategy like Bayesian optimization or particle swarm optimization (Spearmint, SMAC, Hyperopt, Optunity)
     – Even random search often beats grid search
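Even the plain random-search baseline takes only a few lines (my sketch; the search ranges are illustrative and train_and_score is a hypothetical build-fit-evaluate helper):

    import numpy as np

    best_score, best_params = -np.inf, None
    for _ in range(50):
        params = {
            'lr': 10 ** np.random.uniform(-5, -2),    # log-uniform
            'dropout': np.random.uniform(0.2, 0.7),
            'l2': 10 ** np.random.uniform(-6, -3),
        }
        score = train_and_score(params)   # hypothetical helper
        if score > best_score:
            best_score, best_params = score, params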
 42. Keep up to date: arxiv-sanity.com
 43. We are hiring! DevOps & front-end, NLP engineers, full-stack Python engineers. www.textkernel.com/jobs
 44. Questions? Source: http://visualqa.org/
