Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016


Comparing TensorFlow NLP Options: word2vec, GloVe, RNN/LSTM, SyntaxNet, and Penn Treebank. Through code samples and demos, we’ll compare the architectures and algorithms of the various TensorFlow NLP options. We’ll explore both feed-forward and recurrent neural networks such as word2vec, GloVe, RNN/LSTM, SyntaxNet, and Penn Treebank using the latest TensorFlow libraries.

Published in: Technology

  1. MLconf ATL! Sept 23rd, 2016. Chris Fregly, Research Scientist @ PipelineIO
  2. Who am I? Chris Fregly, Research Scientist @ PipelineIO, San Francisco. Previously, Engineer @ Netflix, Databricks, and IBM Spark. Contributor @ Apache Spark, Committer @ Netflix OSS. Founder @ Advanced Spark and TensorFlow Meetup. Author @ Advanced Spark (
  3. Advanced Spark and TensorFlow Meetup
  4. ATL Spark Meetup (9/22)
  5. ATL Hadoop Meetup (9/21)
  6. Confession #1: I Failed Linguistics in College! Chose Pass/Fail Option. (90 (mid-term) + 70 (final)) / 2 = 80 = C+. How did a C+ turn into an F? ZERO (0) CLASS PARTICIPATION?!
  7. Confession #2: I Hated Statistics in College. 2 Degrees: Mechanical + Manufacturing Engineering. Approximations were Bad! I Wasn’t a Fluffy Physics Major. Though, I Kinda Wish I Was!
  8. Wait… Please Don’t Leave! I’m Older and Wiser Now. Approximate is the New Exact. Computational Linguistics and NLP are My Jam!
  9. Agenda: TensorFlow + Neural Nets; NLP Fundamentals; NLP Models
  10. What is TensorFlow? General Purpose Numerical Computation Engine; happens to be good for neural nets! Tooling: TensorBoard (port 6006 == `goog`). DAG-based like Spark! Computation graph is the logical plan, stored in Protobufs; TF converts logical -> physical plan. Lots of Libraries: TFLearn (TensorFlow’s scikit-learn-like API); TensorFlow Serving (Prediction Layer). ^^ Distributed and GPU-Optimized
  11. What are Neural Networks? Like All ML, Goal is to Minimize Loss (Error): error relative to known outcome of labeled data. Mostly Supervised Learning: classification, labeled training data. Training Steps: Step 1: Randomly Guess Input Weights; Step 2: Calculate Error Against Labeled Data; Step 3: Determine Gradient Value, +/- Direction; Step 4: Back-propagate Gradient to Update Each Input Weight; Step 5: Repeat from Step 2 with New Weights until Convergence. Activation Function
  12. Activation Functions. Goal: Learn and Train a Model on Input Data. Non-Linear Functions: Find Non-Linear Fit of Input Data. Common Activation Functions: Sigmoid Function (sigmoid), output range (0, 1); Hyperbolic Tangent (tanh), output range (-1, 1)
  13. Back Propagation: Gradients Calculated by Comparing to Known Label; Use Gradients to Adjust Input Weights; Chain Rule
  14. Loss/Error Optimizers. Gradient Descent: Batch (entire dataset); Per-record (don’t do this!); Mini-batch (empirically 16 -> 512); Stochastic (approximation); Momentum (optimization). AdaGrad: SGD with adaptive learning rates per feature; set initial learning rate; more likely to incorrectly converge on local minima. advanced-spark-and-tensorflow-meetup-08042016
  15. The Math. Linear Algebra: Matrix Multiplication, Very Parallelizable. Calculus: Derivatives, Chain Rule
  16. Convolutional Neural Networks: Feed-forward (do not form a cycle). Apply Many Layers (aka Filters) to Input. Each Layer/Filter Picks up on Features; features not necessarily human-grokkable. Examples of Human-grokkable Filters: 3 color filters (RGB); moving average for time series. Brute Force: Try Different numLayers & layerSizes
  17. CNN Use Case: Stitch Fix. Stitch Fix Also Uses NLP to Analyze Return/Reject Comments. Stitch Fix, Strata Conf SF 2016: Using Deep Learning to Create New Clothing Styles!
  18. Recurrent Neural Networks: Forms a Cycle (vs. Feed-forward). Maintains State over Time: keeps track of context; learns sequential patterns; decays over time. Use Cases: Speech, Text/NLP, Prediction
  19. RNN Sequences. Input: Image, Output: Classification. Input: Image, Output: Text (Captions). Input: Text, Output: Class (Sentiment). Input: Text (English), Output: Text (Spanish). [Figure labels: Input Layer, Hidden Layer, Output Layer]
  20. Character-based RNNs: Tokens are Characters vs. Words/Phrases. Microsoft trains every 3 characters. Fewer Combinations of Possible Neighbors: only 26 alpha character tokens vs. millions of word tokens. Preserving state between 1st and 2nd ‘l’ improves prediction
  21. Long Short Term Memory (LSTM): More Complex State Update Function than Vanilla RNN
  22. LSTM State Update: Cell State; Forget Gate Layer (Sigmoid); Input Gate Layer (Sigmoid); Candidate Gate Layer (tanh); Output Layer
  23. Transfer Learning
  24. Agenda: TensorFlow + Neural Nets; NLP Fundamentals; NLP Models
  25. Use Cases. Document Summary: TextRank (TF/IDF + PageRank). Article Classification and Similarity: LDA (calculate top `k` topic distribution). Machine Translation: word2vec (compare word embedding vectors). Must Convert Text to Numbers!
  26. Core Concepts. Corpus: collection of text, e.g. documents, articles, genetic codes. Embeddings: tokens represented/embedded in vector space; learned, hidden features (~PCA, SVD); similar tokens cluster together, analogies cluster apart. k-skip-gram: skip k neighbors when defining tokens. n-gram: treat n consecutive tokens as a single token. Composable: 1-skip, bi-gram (every other word)
  27. Parsers and POS Taggers: Describe grammatical sentence structure. Requires context of entire sentence. Helps reason about sentence. 80% obvious, simple token neighbors. Major bottleneck in NLP pipeline!
  28. Pre-trained Parsers and Taggers. Penn Treebank: Parser and Part-of-Speech Tagger; human-annotated (!); trained on 4.5 million words. Parsey McParseface: trained by SyntaxNet
  29. Feature Engineering. Lower-case: preserve proper nouns using caret (`^`): “MLconf” => “^m^lconf”, “Varsity” => “^varsity”. Encode Common N-grams (Phrases): create a single token using underscore (`_`): “Senior Developer” => “senior_developer”. Stemming and Lemmatization: try to avoid, let the neural network figure this out; can preserve part of speech (POS) using “_noun”, “_verb”: “banking” => “banking_verb”
  30. Agenda: TensorFlow + Neural Nets; NLP Fundamentals; NLP Models
  31. Count-based Models. Goal: Convert Text to Vector of Neighbor Co-occurrences. Bag of Words (BOW): simple hashmap with word counts; loses neighbor context. Term Frequency / Inverse Document Frequency (TF/IDF): normalizes based on token frequency. GloVe: matrix factorization on co-occurrence matrix; highly parallelizable, reduces dimensions, captures global co-occurrence stats; log smoothing of probability ratios; stores word vector diffs for fast analogy lookups
  32. Neural-based Predictive Models. Goal: Predict Text using Learned Embedding Vectors. word2vec: shallow neural network; local: nearby words predict each other; fixed word embedding vector size (e.g. 300); optimizer: mini-batch Stochastic Gradient Descent (SGD). SyntaxNet: deep(er) neural network; global(er); not a Recurrent Neural Net (RNN)! Can combine with BOW-based models (i.e. word2vec CBOW)
  33. word2vec. CBOW word2vec: predict target word from source context; a single source context is an observation; loses useful distribution information; good for small datasets. Skip-gram word2vec (Inverse of CBOW): predict source context words from target word; each (source context, target word) tuple is an observation; better for large datasets
  34. word2vec Libraries. gensim: Python only; most popular. Spark ML: Python + Java/Scala; supports only synonyms
  35. *2vec. lda2vec: LDA (global) + word2vec (local); from Chris Moody @ Stitch Fix. like2vec: Embedding-based Recommender
  36. word2vec vs. GloVe. Both are Fundamentally Similar: capture local co-occurrence statistics (neighbors); capture distance between embedding vectors (analogies). GloVe: count-based; also captures global co-occurrence statistics; requires upfront pass through entire dataset
  37. SyntaxNet POS Tagging: Determine coarse-grained grammatical role of each word. Multiple contexts, multiple roles. Neural Net Inputs: stack, buffer. Results: POS probability distribution. [Figure label: Already Tagged]
  38. SyntaxNet Dependency Parser: Determine fine-grained roles using grammatical relationships. “Transition-based”, Incremental Dependency Parser. Globally Normalized using Beam Search with Early Update. Parsey McParseface: Pre-trained Parser/Tagger available in 40 languages. [Figure labels: Fine-grained, Coarse-grained]
  39. SyntaxNet Use Case: Nutrition. Nutrition and Health Startup in SF (Stealth) Using Google’s SyntaxNet. Rate Recipes and Menus by Nutritional Value. [Figure labels: Correct, Incorrect]
  40. Model Validation: Unsupervised Learning Requires Validation. Google has Published Analogy Tests for Model Validation. Thanks, Google!
  41. Thank You, Atlanta! Chris Fregly, Research Scientist @ PipelineIO. All Source Code, Demos, and Docker Images @ Join the Global Meetup for all Slides and Videos @
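
Sketch for slides 11 and 13 (training steps and back propagation). This is a minimal illustration, not code from the talk: it assumes a one-weight linear model with mean squared error, and the names `train_linear`, `learning_rate`, and `steps` are made up for the example. The gradient is the chain rule applied to the squared error.

```python
import random


def train_linear(xs, ys, learning_rate=0.05, steps=2000, seed=0):
    """Fit y ~ w * x + b with batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    # Step 1: randomly guess the weights.
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    n = len(xs)
    for _ in range(steps):
        # Step 2: calculate the error against the labeled data.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Step 3: gradient of the mean squared error (chain rule);
        # its sign gives the +/- direction of the update.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step 4: move each weight against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        # Step 5: the loop repeats from Step 2 with the new weights.
    return w, b


# Toy labeled data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train_linear(xs, ys)
print(round(w, 3), round(b, 3))
```

This uses the whole dataset on every step (the "Batch" option on slide 14); a mini-batch variant would compute the same gradients over a random subset per step.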
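
Sketch for slide 12 (activation functions). A small illustration using only the Python standard library; the helper names `sigmoid` and `sigmoid_grad` are chosen for this example. Sigmoid squashes any real input into (0, 1) and tanh into (-1, 1).

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x):
    """Derivative of sigmoid, the factor the chain rule needs in back propagation."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-5.0, 0.0, 5.0):
    print(x, sigmoid(x), math.tanh(x))
```

Both functions are non-linear, which is what lets stacked layers fit non-linear patterns in the input data.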
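
Sketch for slide 26 (n-grams and k-skip-grams). This follows the slide's simplified description, where exactly `k` neighbors are skipped between the tokens that get joined; other definitions allow up to `k` skips. The function name `skip_ngrams` is hypothetical.

```python
def skip_ngrams(tokens, n=2, k=0):
    """Join n tokens into one token, skipping exactly k neighbors between them."""
    step = k + 1
    span = (n - 1) * step  # distance from the first to the last joined token
    return [
        "_".join(tokens[i : i + span + 1 : step])
        for i in range(len(tokens) - span)
    ]


tokens = ["the", "quick", "brown", "fox"]
bigrams = skip_ngrams(tokens, n=2, k=0)
one_skip_bigrams = skip_ngrams(tokens, n=2, k=1)
print(bigrams)           # ['the_quick', 'quick_brown', 'brown_fox']
print(one_skip_bigrams)  # ['the_brown', 'quick_fox']
```

With `k=1` and `n=2` this is the "1-skip, bi-gram (every other word)" combination named on the slide.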
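
Sketch for slide 29 (feature engineering). One possible reading of the slide's encodings, not the speaker's actual code; `encode_case`, `join_phrases`, and `tag_pos` are hypothetical helper names, and the phrase list is assumed to be supplied by the caller.

```python
def encode_case(token):
    """Lower-case a token, marking each original capital with a caret."""
    return "".join("^" + ch.lower() if ch.isupper() else ch for ch in token)


def join_phrases(tokens, phrases):
    """Merge known two-word phrases into a single underscore token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def tag_pos(token, pos):
    """Append a part-of-speech suffix such as 'noun' or 'verb'."""
    return token + "_" + pos


print(encode_case("MLconf"))   # ^m^lconf
print(encode_case("Varsity"))  # ^varsity
print(join_phrases(["senior", "developer", "wanted"], {("senior", "developer")}))
print(tag_pos("banking", "verb"))  # banking_verb
```

The caret keeps the capitalization signal recoverable after lower-casing, so the vocabulary does not split into separate cased and uncased entries.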
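
Sketch for slide 31 (count-based models). A toy bag of words and one common TF/IDF variant (term frequency times the unsmoothed log of inverse document frequency). Libraries differ in smoothing and normalization, so treat the formula as an assumption; the three tiny documents are made up.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Simple hashmap of word counts; word order (neighbor context) is lost."""
    return Counter(tokens)


def tf_idf(docs):
    """Per-document scores: (count / doc length) * log(num docs / doc frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return scores


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
scores = tf_idf(docs)
print(bag_of_words(docs[0]))
print(scores[1])
```

A token that appears in every document ("the") scores zero, while a token unique to one document ("dog") scores highest there, which is the frequency normalization the slide refers to.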
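
Sketch for slide 33 (CBOW vs. skip-gram). This only builds the training observations each variant would see; it does not train embeddings. The `window` parameter and function names are assumptions for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (context words, target word) for every position in the sequence."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window) : i] + tokens[i + 1 : i + 1 + window]
        yield context, target


def cbow_examples(tokens, window=2):
    """CBOW: the whole source context is one observation that predicts the target."""
    return [(tuple(ctx), tgt) for ctx, tgt in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the target predicts each context word, one observation per pair."""
    return [(tgt, c) for ctx, tgt in context_windows(tokens, window) for c in ctx]


tokens = ["the", "quick", "brown", "fox"]
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
print(cbow)
print(skipgram)
```

The same four tokens produce four CBOW observations but six skip-gram observations, which is one way to see why skip-gram makes more use of a large dataset.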