In this presentation, we will first introduce RNNs as a concept. Then we will sketch how to implement them and cover the tricks necessary to make them work well. With the basics covered, we will investigate using RNNs as general text classification and regression models, examining where they succeed and where they fail compared to more traditional text analysis models. A straightforward open-source Python and Theano library for training RNNs with a scikit-learn style interface will be introduced, and we’ll see how to use it through a tutorial on a real-world text dataset.

Published in:
Technology


- 1. RECURRENT NEURAL NETWORKS FOR TEXT ANALYSIS Alec Radford, OPEN DATA SCIENCE CONFERENCE, BOSTON 2015, @opendatasci
- 2. Recurrent Neural Networks for text analysis From idea to practice ALEC RADFORD
- 3. Follow Along Slides at: http://goo.gl/WLsUWv
- 4. How ML Numerical (-0.15, 0.2, 0, 1.5): great! Categorical (A, B, C, D): great! Text ("The cat sat on the mat."): Uhhh...
- 5. How text is dealt with (ML perspective) Text Features (bow, TFIDF, LSA, etc...) Linear Model (SVM, softmax)
- 6. Structure is important! "The cat sat on the mat." vs. "sat the on mat cat the" ● For certain tasks, structure is essential: ○ Humor ○ Sarcasm ● For certain tasks, n-grams can get you a long way: ○ Sentiment analysis ○ Topic detection ● Specific words can be strong indicators: ○ useless, fantastic (sentiment) ○ hoop, green tea, NASDAQ (topic)
- 7. Structure is hard N-grams are the typical way of preserving some structure. Unigrams: sat, the, on, mat, cat. Bigrams: the cat, cat sat, sat on, on the, the mat. Beyond bigrams or trigrams, occurrences become very rare and dimensionality becomes huge (1 to 10 million+ features).
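The n-gram features on slide 7 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the vectorizer used in the talk; the helper names `ngrams` and `bag_of_ngrams` are made up for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(tokens, max_n=2):
    """Count all n-grams for n = 1..max_n (an unordered feature dictionary)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
features = bag_of_ngrams(tokens, max_n=2)

print(bigrams)
print(len(features), "distinct unigram and bigram features")
```

Every additional value of `n` adds a new set of mostly rare features, which is the dimensionality blow-up the slide describes.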
- 8. Structure is hard
- 9. How text is dealt with (ML perspective) Text Features (bow, TFIDF, LSA, etc...) Linear Model (SVM, softmax)
- 10. How text should be dealt with? Text RNN Linear Model (SVM, softmax)
- 11. How an RNN works the cat sat on the mat
- 12. How an RNN works the cat sat on the mat input to hidden
- 13. How an RNN works the cat sat on the mat input to hidden hidden to hidden
- 14. How an RNN works the cat sat on the mat input to hidden hidden to hidden
- 15. How an RNN works the cat sat on the mat projections (activities x weights) activities (vectors of values) input to hidden hidden to hidden
- 16. How an RNN works the cat sat on the mat projections (activities x weights) activities (vectors of values) Learned representation of sequence. input to hidden hidden to hidden
- 17. How an RNN works the cat sat on the mat projections (activities x weights) activities (vectors of values) cat hidden to output input to hidden hidden to hidden
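The three projections named on slides 12 to 17 (input to hidden, hidden to hidden, hidden to output) can be sketched with NumPy as follows. The `tanh` activation, the omission of biases, and all layer sizes are assumptions chosen for illustration; the slides do not specify them.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, h0):
    """Run a simple RNN over a sequence; return all hidden states and outputs."""
    h = h0
    hidden_states, outputs = [], []
    for x in inputs:
        # Add the input-to-hidden and hidden-to-hidden projections, then squash.
        h = np.tanh(x @ W_xh + h @ W_hh)
        hidden_states.append(h)
        # Hidden-to-output projection, e.g. scores for the next word.
        outputs.append(h @ W_hy)
    return np.stack(hidden_states), np.stack(outputs)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, seq_len = 3, 4, 6, 7  # arbitrary illustrative sizes
inputs = rng.normal(size=(seq_len, n_in))    # stand-in for embedded tokens
W_xh = rng.normal(scale=0.1, size=(n_in, n_hidden))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_hidden, n_out))

hidden_states, outputs = rnn_forward(inputs, W_xh, W_hh, W_hy, np.zeros(n_hidden))
sequence_representation = hidden_states[-1]  # last hidden state summarizes the sequence
print(hidden_states.shape, outputs.shape)
```

The final hidden state plays the role of the "learned representation of sequence" on slide 16 and could be fed to a linear classifier.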
- 18. From text to RNN input String input: "The cat sat on the mat." Tokenize: the | cat | sat | on | the | mat | . Assign index: 0 1 2 3 0 4 5. Embedding lookup: each index selects a row of a learned matrix, e.g. index 0 → [2.5, 0.3, -1.2], index 1 → [0.2, -3.3, 0.7], index 2 → [-4.1, 1.6, 2.8], index 3 → [1.1, 5.7, -0.2], index 4 → [1.4, 0.6, -3.9], index 5 → [-3.8, 1.5, 0.1]; both occurrences of "the" (index 0) map to the same vector.
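The pipeline on slide 18 (tokenize, assign indices, look up embeddings) can be reproduced with the numbers shown on the slide. The regular-expression tokenizer and the helper names are assumptions for this sketch, and in a real model the matrix would be learned rather than typed in.

```python
import re
import numpy as np

def tokenize(text):
    """Lowercase, then split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Give each distinct token the next free index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

# The 6 x 3 matrix from the slide: one row per vocabulary index.
embedding = np.array([
    [ 2.5,  0.3, -1.2],
    [ 0.2, -3.3,  0.7],
    [-4.1,  1.6,  2.8],
    [ 1.1,  5.7, -0.2],
    [ 1.4,  0.6, -3.9],
    [-3.8,  1.5,  0.1],
])

tokens = tokenize("The cat sat on the mat.")
vocab = build_vocab(tokens)
indices = [vocab[t] for t in tokens]
vectors = embedding[indices]  # integer-array indexing performs the lookup

print(tokens)
print(indices)
print(vectors.shape)
```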
- 19. You can stack them too the cat sat on the mat cat hidden to output input to hidden hidden to hidden
- 20. But aren’t RNNs unstable? Simple RNNs trained with SGD are unstable/difficult to learn. But modern RNNs with various tricks blow up much less often! ● Gating Units ● Gradient Clipping ● Steeper gates ● Better initialization ● Better optimizers ● Bigger datasets
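- 21. Simple Recurrent Unit [Diagram: h_{t-1} and x_t are added and passed through an activation function to give h_t; h_t and x_{t+1} then give h_{t+1} in the same way.] Legend: + element-wise addition; activation function; routes information can propagate along; units involved in modifying information flow and values.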
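- 22. Gated Recurrent Unit - GRU [Diagram: x_t and h_{t-1} feed a reset gate r and an update gate z; r gates h_{t-1} when forming the candidate state h̃_t, and z and 1-z mix h_{t-1} and h̃_t into h_t.] Legend: + element-wise addition; ⊙ element-wise multiplication; routes information can propagate along; units involved in modifying information flow and values.
- 23. Gated Recurrent Unit - GRU [Diagram: the same unit unrolled over two time steps: x_t with h_{t-1} gives h_t, then x_{t+1} with h_t gives h_{t+1}.]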
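A single GRU step, as drawn on slides 22 and 23, might look like the NumPy sketch below. The transcript does not make clear which of `z` and `1-z` multiplies the old state, so this follows the convention of Cho et al. (`z` keeps the old state); biases are omitted and the parameter names are invented for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU step; p maps names to weight matrices (no biases in this sketch)."""
    r = sigmoid(x @ p["W_r"] + h_prev @ p["U_r"])              # reset gate
    z = sigmoid(x @ p["W_z"] + h_prev @ p["U_z"])              # update gate
    h_tilde = np.tanh(x @ p["W_h"] + (r * h_prev) @ p["U_h"])  # candidate state
    # Assumed convention: z keeps the old state, (1 - z) admits the candidate.
    return z * h_prev + (1.0 - z) * h_tilde

def init_params(n_in, n_hidden, rng):
    """Small random weights: W_* act on the input, U_* on the previous state."""
    shapes = {"W": (n_in, n_hidden), "U": (n_hidden, n_hidden)}
    return {f"{k}_{g}": rng.normal(scale=0.1, size=shapes[k])
            for k in "WU" for g in "rzh"}

rng = np.random.default_rng(0)
params = init_params(n_in=3, n_hidden=4, rng=rng)
h = np.zeros(4)
for x in rng.normal(size=(7, 3)):  # seven random stand-in input vectors
    h = gru_step(x, h, params)
print(h)
```

Because the new state is an element-wise mix of the old state and the candidate, information can pass through many steps nearly unchanged when `z` stays close to 1.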
- 24. Gating is important For sentiment analysis of longer sequences of text (paragraph or so) a simple RNN has difficulty learning at all while a gated RNN does so easily.
- 25. Which One? There are two types of gated RNNs: ● Gated Recurrent Units (GRU) by K. Cho, recently introduced and used for machine translation and speech recognition tasks. ● Long Short-Term Memory (LSTM) by S. Hochreiter and J. Schmidhuber, which has been around since 1997 and has been used far more. Various modifications to it exist.
- 26. Which One? GRU is simpler, faster, and optimizes quicker (at least on sentiment). Because it has only two gates (compared to four), it is approximately 1.5-1.75x faster in the Theano implementation. If you have a huge dataset and don’t mind waiting, LSTM may be better in the long run due to its greater complexity - especially if you add peephole connections.
- 27. Exploding Gradients? Exploding gradients are a major problem for traditional RNNs trained with SGD, and one of the sources of RNNs' reputation for being hard to train. In 2012, R. Pascanu and T. Mikolov proposed clipping the norm of the gradient to alleviate this. Modern optimizers don’t seem to have this problem - at least for classification text analysis.
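Clipping the norm of the gradient, as mentioned on slide 27, can be sketched like this. The function name and the threshold of 1.0 are illustrative assumptions; the idea is to rescale all gradients together whenever their combined norm exceeds a limit.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Two made-up gradient arrays with joint norm sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(norm_before, norm_after)
```

Rescaling keeps the direction of the update and only limits its length, so one exploding step cannot throw the parameters far away.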
- 28. Better Gating Functions An interesting paper at a NIPS workshop (Q. Lyu, J. Zhu) makes the gates “steeper” so they change more rapidly from “off” to “on”, which lets the model learn to use them more quickly.
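One simple way to make a sigmoid gate steeper is to multiply its input by a constant greater than 1. The sketch below shows that idea only; whether it matches the exact formulation in the paper cited on slide 28 is an assumption, and the `steepness` value of 4.0 is arbitrary.

```python
import math

def gate(a, steepness=1.0):
    """Sigmoid gate; a larger steepness gives a sharper switch from off (0) to on (1)."""
    return 1.0 / (1.0 + math.exp(-steepness * a))

for a in (-1.0, -0.25, 0.0, 0.25, 1.0):
    print(f"a={a:+.2f}  standard={gate(a):.3f}  steeper={gate(a, steepness=4.0):.3f}")
```

Both gates sit at 0.5 when the input is 0, but the steeper one moves toward 0 or 1 after a much smaller change in the input.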
- 29. Better Initialization Andrew Saxe showed last year that initializing weight matrices with random orthogonal matrices works better than random Gaussian (or uniform) matrices. In addition, Richard Socher (and more recently Quoc Le) have used identity initialization schemes, which work great as well.
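The two initialization schemes on slide 29 can be sketched for a square hidden-to-hidden matrix as below. Building the orthogonal matrix from the QR decomposition of a Gaussian matrix is one common recipe, assumed here rather than taken from the talk.

```python
import numpy as np

def orthogonal_init(n, rng):
    """Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    a = rng.normal(size=(n, n))
    q, r = np.linalg.qr(a)
    # Flip column signs using the diagonal of r so the result is not biased
    # by the sign convention of the QR routine.
    return q * np.sign(np.diag(r))

def identity_init(n, scale=1.0):
    """Scaled identity matrix: the hidden state is initially copied forward."""
    return scale * np.eye(n)

rng = np.random.default_rng(0)
W_orth = orthogonal_init(4, rng)
W_id = identity_init(4)
print(np.round(W_orth.T @ W_orth, 6))  # approximately the identity matrix
```

An orthogonal matrix preserves the length of any vector it multiplies, so repeated hidden-to-hidden projections neither shrink nor blow up the signal at the start of training.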
- 30. Understanding Optimizers 2D moons dataset courtesy of scikit-learn
- 31. Comparing Optimizers Adam (D. Kingma) combines the early optimization speed of Adagrad (J. Duchi) with the better later convergence of various other methods like Adadelta (M. Zeiler) and RMSprop (T. Tieleman). Warning: Generalization performance of Adam seems slightly worse for smaller datasets.
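For reference, a single Adam update can be written as the following NumPy sketch. The default hyperparameters are those proposed in the Adam paper; the toy quadratic objective and the learning rate of 0.05 are assumptions chosen only to show the update converging.

```python
import numpy as np

def adam_step(param, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameter; state is updated in place."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad       # running mean of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = state["m"] / (1 - beta1 ** state["t"])             # correct the bias toward zero
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy problem: minimize f(w) = sum(w ** 2), whose gradient is 2 * w.
w = np.array([5.0, -3.0])
state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}
for _ in range(2000):
    w = adam_step(w, 2 * w, state, lr=0.05)
print(w)
```

Dividing by the square root of the running squared gradient gives each parameter its own step size, which is the Adagrad-like part of the method.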
- 32. It adds up Up to 10x more efficient training once you add all the tricks together compared to a naive implementation - much more stable - rarely diverges. Around 7.5x faster in practice, since the various tricks add a bit of computation time.
- 33. Too much? - Overfitting RNNs can overfit very well, as we will see. As they continue to fit the training dataset, their performance on test data will plateau or even worsen. Keep track of it using a validation set: save the model at each iteration over the training data and pick the earliest iteration with the best validation performance.
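The checkpoint-selection rule on slide 33 can be sketched as a small training loop. The `train_one_epoch` and `evaluate` callables are hypothetical placeholders for a real training and validation routine, and the validation scores are made-up numbers.

```python
import copy

def fit_with_checkpoints(model, train_one_epoch, evaluate, n_epochs):
    """Train for n_epochs and keep the earliest model with the best validation score."""
    best_score, best_epoch, best_model = float("-inf"), None, None
    for epoch in range(n_epochs):
        train_one_epoch(model)
        score = evaluate(model)
        # Strict '>' means a later tie never replaces an earlier best.
        if score > best_score:
            best_score, best_epoch = score, epoch
            best_model = copy.deepcopy(model)  # "save" the model at this epoch
    return best_model, best_epoch, best_score

# Stand-in model and made-up validation accuracies that rise, plateau, then fall.
fake_scores = [0.71, 0.80, 0.84, 0.86, 0.86, 0.85, 0.83]
model = {"epochs_trained": 0}

def train_one_epoch(m):
    m["epochs_trained"] += 1

def evaluate(m):
    return fake_scores[m["epochs_trained"] - 1]

best_model, best_epoch, best_score = fit_with_checkpoints(
    model, train_one_epoch, evaluate, n_epochs=len(fake_scores))
print(best_epoch, best_score, best_model)
```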
- 34. The Showdown Model #1: using bigrams and grid search on min_df for the vectorizer and on the regularization coefficient for the model. Model #2 (512-dim embedding, 512-dim hidden state, output): using whatever I tried that worked :) Adam, GRU, steeper sigmoid gates, ortho/identity initialization.
- 35. Sentiment & Helpfulness
- 36. Effect of Dataset Size ● RNNs have poor generalization properties on small datasets. ○ 1K labeled examples 25-50% worse than linear model… ● RNNs have better generalization properties on large datasets. ○ 1M labeled examples 0-30% better than linear model. ● Crossovers between 10K and 1M examples ○ Depends on dataset.
- 37. The Thing we don’t talk about For 1 million paragraph sized text examples to converge: ● Linear model takes 30 minutes on a single CPU core. ● RNN takes 90 minutes on a Titan X. ● RNN takes five days on a single CPU core. RNN is about 250x slower on CPU than linear model… This is why we use GPUs
- 38. Visualizing representations of words learned via sentiment t-SNE (L.J.P. van der Maaten). Individual words colored by average sentiment.
- 39. Negative Positive Model learns to separate negative and positive words, not too surprising
- 40. Clusters: quantities of time, qualifiers, product nouns, punctuation. Much cooler: the model also begins to learn components of language from only binary sentiment labels.
- 41. The library - Passage ● Tiny RNN library built on top of Theano ● https://github.com/IndicoDataSolutions/Passage ● Still alpha - we’re working on it! ● Supports simple, LSTM, and GRU recurrent layers ● Supports multiple recurrent layers ● Supports deep input to and deep output from hidden layers ○ no deep transitions currently ● Supports embedding and onehot input representations ● Can be used for both regression and classification problems ○ Regression needs preprocessing for stability - working on it ● Much more in the pipeline
- 42. An example Sentiment analysis of movie reviews - 25K labeled examples
- 43. RNN imports
- 44. RNN imports preprocessing
- 45. RNN imports preprocessing load training data
- 46. RNN imports preprocessing tokenize data load training data
- 47. RNN imports preprocessing configure model tokenize data load training data
- 48. RNN imports preprocessing make and train model tokenize data load training data configure model
- 49. RNN imports preprocessing load test data make and train model tokenize data load training data configure model
- 50. RNN imports preprocessing predict on test data load test data make and train model tokenize data load training data configure model
- 51. The results Top 10! - barely :)
- 52. Summary ● RNNs look to be a competitive tool in certain situations for text analysis. ● Especially if you have a large 1M+ example dataset ○ A GPU or great patience is essential ● Otherwise it can be difficult to justify over linear models ○ Speed ○ Complexity ○ Poor generalization with small datasets
- 53. Contact alec@indico.io
- 54. We’re hiring! ● Data Engineer ● Infrastructure Engineer ● Interested? ○ contact@indico.io (or talk to/email me after the presentation)
- 55. Questions?
