Atlanta MLconf Machine Learning Conference 09-23-2016

MLconf ATL!
Sept 23rd, 2016
Chris Fregly
Research Scientist @ PipelineIO

Who am I?
Chris Fregly, Research Scientist @ PipelineIO, San Francisco
Previously, Engineer @ Netflix, Databricks, and IBM Spark
Contributor @ Apache Spark, Committer @ Netflix OSS
Founder @ Advanced Spark and TensorFlow Meetup
Author @ Advanced Spark (advancedspark.com)

Advanced Spark and Tensorflow Meetup

ATL Spark Meetup (9/22)
http://www.slideshare.net/cfregly/atlanta-spark-user-meetup-09-22-2016

ATL Hadoop Meetup (9/21)
http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016

Confession #1
I Failed Linguistics in College!
Chose Pass/Fail Option
(90 (mid-term) + 70 (final)) / 2 = 80 = C+
How did a C+ turn into an F?
ZER0 (0) CLASS PARTICIPATION?!

Confession #2
I Hated Statistics in College
2 Degrees: Mechanical + Manufacturing Engg
Approximations were Bad!
I Wasn’t a Fluffy Physics Major
Though, I Kinda Wish I Was!

Wait… Please Don’t Leave!
I’m Older and Wiser Now
Approximate is the New Exact
Computational Linguistics and NLP are My Jam!

Agenda
Tensorflow + Neural Nets
NLP Fundamentals
NLP Models

What is Tensorflow?
General Purpose Numerical Computation Engine
Happens to be good for neural nets!
Tooling
Tensorboard (port 6006 == `goog`)
DAG-based like Spark!
Computation graph is logical plan
Stored in Protobufs
TF converts logical -> physical plan
Lots of Libraries
TFLearn (Tensorflow’s Scikit-learn Impl)
Tensorflow Serving (Prediction Layer)
Distributed and GPU-Optimized

What are Neural Networks?
Like All ML, Goal is to Minimize Loss (Error)
Error relative to known outcome of labeled data
Mostly Supervised Learning Classification
Labeled training data
Training Steps
Step 1: Randomly Guess Input Weights
Step 2: Calculate Error Against Labeled Data
Step 3: Determine Gradient Value, +/- Direction
Step 4: Back-propagate Gradient to Update Each Input Weight
Step 5: Repeat from Step 2 with New Weights until Convergence
(Diagram label: Activation Function)
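
The training steps above can be sketched as a tiny gradient-descent loop. This is a minimal illustration, not the speaker's code: it assumes a single linear neuron (y ~ w * x + b), mean squared error as the loss, and toy data; the function name `train_linear_neuron` and all parameter values are hypothetical.

```python
import random

def train_linear_neuron(samples, learning_rate=0.1, epochs=500, seed=0):
    """Fit y ~ w * x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    # Step 1: randomly guess the initial weights.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(samples)
    for _ in range(epochs):
        # Step 2: calculate the error against the labeled data.
        errors = [(w * x + b) - y for x, y in samples]
        # Step 3: gradient (value and +/- direction) of the loss per weight.
        grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, samples)) / n
        grad_b = 2 * sum(errors) / n
        # Step 4: move each weight against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        # Step 5: the loop repeats the error calculation with the new weights.
    return w, b

# Toy labeled data generated from y = 2x + 1 (illustrative only).
data = [(x, 2 * x + 1) for x in (0.0, 1.0, 2.0, 3.0)]
w, b = train_linear_neuron(data)
print(round(w, 3), round(b, 3))
```

With a single neuron there is no hidden layer to propagate through, so Step 4 reduces to a direct weight update; deeper networks apply the same update layer by layer.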

Activation Functions
Goal: Learn and Train a Model on Input Data
Non-Linear Functions
Find Non-Linear Fit of Input Data
Common Activation Functions
Sigmoid Function (sigmoid)
Output range: (0, 1)
Hyperbolic Tangent (tanh)
Output range: (-1, 1)
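
A quick sketch of the two activation functions named above, using only Python's standard library. The sample inputs are arbitrary; the point is that sigmoid outputs stay strictly between 0 and 1 and tanh outputs stay strictly between -1 and 1.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-5.0, 0.0, 5.0):
    # tanh ships with the standard library and squashes into (-1, 1).
    print(x, round(sigmoid(x), 4), round(math.tanh(x), 4))
```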

Back Propagation
http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
Gradients Calculated by Comparing to Known Label
Use Gradients to Adjust Input Weights
Chain Rule
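
To make the chain rule concrete, here is a hedged sketch for the smallest possible case: one sigmoid neuron with a squared-error loss. The setup (single weight, no bias, the specific values of `w`, `x`, `y`) is an assumption for illustration. The analytic gradient from the chain rule is compared against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    """Squared error of a single sigmoid neuron against label y."""
    return (sigmoid(w * x) - y) ** 2

def backprop_gradient(w, x, y):
    """dL/dw via the chain rule: dL/da * da/dz * dz/dw."""
    a = sigmoid(w * x)
    dL_da = 2 * (a - y)      # loss with respect to the activation
    da_dz = a * (1 - a)      # sigmoid derivative
    dz_dw = x                # pre-activation with respect to the weight
    return dL_da * da_dz * dz_dw

w, x, y = 0.5, 1.5, 1.0
analytic = backprop_gradient(w, x, y)

# Central finite difference as an independent check of the analytic gradient.
eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
print(analytic, numeric)
```

The negative gradient here says the weight should increase to push the prediction toward the label of 1.0.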

Loss/Error Optimizers
Gradient Descent
Batch (entire dataset)
Per-record (don’t do this!)
Mini-batch (empirically 16 -> 512)
Stochastic (approximation)
Momentum (optimization)
AdaGrad
SGD with adaptive learning rates per feature
Set initial learning rate
More likely to incorrectly converge on local minima
http://www.slideshare.net/cfregly/gradient-descent-back-propagation-and-auto-differentiation-advanced-spark-and-tensorflow-meetup-08042016
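
The "adaptive learning rates per feature" idea behind AdaGrad can be sketched in a few lines. This is a minimal version under stated assumptions: a toy quadratic loss, a full gradient instead of mini-batches, and hypothetical names (`adagrad`, `toy_gradient`) and hyperparameters.

```python
import math

def adagrad(grad_fn, weights, learning_rate=0.5, steps=200, eps=1e-8):
    """AdaGrad: each weight gets its own effective learning rate."""
    weights = list(weights)
    sum_sq = [0.0] * len(weights)  # running sum of squared gradients per weight
    for _ in range(steps):
        grads = grad_fn(weights)
        for i, g in enumerate(grads):
            sum_sq[i] += g * g
            # The initial rate is scaled down by the gradient history.
            weights[i] -= learning_rate * g / (math.sqrt(sum_sq[i]) + eps)
    return weights

def toy_gradient(w):
    """Gradient of the toy loss f(w) = w0^2 + 10 * w1^2."""
    return [2 * w[0], 20 * w[1]]

final = adagrad(toy_gradient, [3.0, -2.0])
print(final)
```

Because each step is divided by that weight's own gradient history, the steeply curved second dimension does not need a separate hand-tuned learning rate.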

The Math
Linear Algebra
Matrix Multiplication
Very Parallelizable
Calculus
Derivatives
Chain Rule

Convolutional Neural Networks
Feed-forward
Do not form a cycle
Apply Many Layers (aka. Filters) to Input
Each Layer/Filter Picks up on Features
Features not necessarily human-grokkable
Examples of Human-grokkable Filters
3 color filters: RGB
Moving AVG for time series
Brute Force
Try Diff numLayers & layerSizes
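
The moving-average example above is a filter in exactly the sense a convolutional layer uses: a small kernel slid across the input. The sketch below assumes a 1-D signal and hand-picked kernels; a trained network would learn the kernel values instead. Like most deep learning libraries, it does not flip the kernel (strictly a cross-correlation).

```python
def convolve_1d(signal, kernel):
    """Slide the kernel over the signal, keeping only full overlaps."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Human-grokkable filter 1: a 3-point moving average.
smoothed = convolve_1d(series, [1 / 3, 1 / 3, 1 / 3])

# Human-grokkable filter 2: a difference filter that picks up on changes.
edges = convolve_1d(series, [-1.0, 1.0])
print(smoothed, edges)
```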

CNN Use Case: Stitch Fix
Stitch Fix Also Uses NLP to Analyze Return/Reject Comments
StitchFix Strata Conf SF 2016:
Using Deep Learning to Create New Clothing Styles!

Recurrent Neural Networks
Forms a Cycle (vs. Feed-forward)
Maintains State over Time
Keep track of context
Learns sequential patterns
Decay over time
Use Cases
Speech
Text/NLP Prediction

RNN Sequences
Input: Image
Output: Classification
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Input: Image
Output: Text (Captions)
Input: Text
Output: Class (Sentiment)
Input: Text (English)
Output: Text (Spanish)
(Diagram layers: Input Layer, Hidden Layer, Output Layer)

Character-based RNNs
Tokens are Characters vs. Words/Phrases
Microsoft trains every 3 characters
Fewer Combinations of Possible Neighbors
Only 26 alpha character tokens vs. millions of word tokens
Preserving state between the 1st and 2nd ‘l’ improves prediction

Long Short Term Memory (LSTM)
More Complex State Update Function than Vanilla RNN

LSTM State Update
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Cell State
Forget Gate Layer (Sigmoid)
Input Gate Layer (Sigmoid)
Candidate Gate Layer (tanh)
Output Layer
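
The gates listed above combine into one state update. Below is a minimal sketch following the standard LSTM formulation described in the linked colah post, simplified to scalar state for readability. The parameter dictionary `params` and its values are hypothetical; real implementations use weight matrices and learn them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM update with scalar state. p maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = p[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)              # how much old cell state to keep
    i = gate("input", sigmoid)               # how much of the candidate to write
    c_tilde = gate("candidate", math.tanh)   # proposed new cell content
    o = gate("output", sigmoid)              # how much cell state to expose

    c = f * c_prev + i * c_tilde             # new cell state
    h = o * math.tanh(c)                     # new hidden state (the output)
    return h, c

# Illustrative, untrained parameters shared by all four gates.
params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "candidate", "output")}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```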

Use Cases
Document Summary
TextRank: TF/IDF + PageRank
Article Classification and Similarity
LDA: calculate top `k` topic distribution
Machine Translation
word2vec: compare word embedding vectors
Must Convert Text to Numbers!

Core Concepts
Corpus
Collection of text
e.g. documents, articles, genetic codes
Embeddings
Tokens represented/embedded in vector space
Learned, hidden features (~PCA, SVD)
Similar tokens cluster together, analogies cluster apart
k-skip-gram
Skip k neighbors when defining tokens
n-gram
Treat n consecutive tokens as a single token
Composable: 1-skip bi-gram (every other word)
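
The n-gram and skip-gram definitions above compose into a single helper. This sketch follows the slide's reading, where members of a gram are exactly k tokens apart; note that some definitions of k-skip-n-grams allow up to k skips instead. The function name `skip_ngrams` is hypothetical.

```python
def skip_ngrams(tokens, n=2, k=0):
    """n-grams whose members are exactly k tokens apart (k=0: plain n-grams)."""
    step = k + 1
    span = (n - 1) * step  # distance from the first to the last member
    return [
        tuple(tokens[i + j * step] for j in range(n))
        for i in range(len(tokens) - span)
    ]

words = "the quick brown fox jumps".split()
bigrams = skip_ngrams(words, n=2, k=0)
one_skip_bigrams = skip_ngrams(words, n=2, k=1)  # every other word
print(bigrams)
print(one_skip_bigrams)
```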

Parsers and POS Taggers
Describe grammatical sentence structure
Requires context of entire sentence
Helps reason about sentence
80% obvious, simple token neighbors
Major bottleneck in NLP pipeline!

Pre-trained Parsers and Taggers
Penn Treebank
Parser and Part-of-Speech Tagger
Human-annotated (!)
Trained on 4.5 million words
Parsey McParseface
Trained by SyntaxNet

Feature Engineering
Lower-case
Preserve proper nouns using caret (`^`)
“MLconf” => “^m^lconf”
“Varsity” => “^varsity”
Encode Common N-grams (Phrases)
Create a single token using underscore (`_`)
“Senior Developer” => “senior_developer”
Stemming and Lemmatization
Try to avoid: let the neural network figure this out
Can preserve part of speech (POS) using “_noun”, “_verb”
“banking” => “banking_verb”
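
The lower-casing and phrase-encoding rules above are simple enough to sketch directly. The helper names (`caret_lowercase`, `join_phrases`) and the two-word phrase limit are assumptions for illustration; the slides do not specify an implementation.

```python
def caret_lowercase(token):
    """Lower-case a token, marking each original capital with a caret."""
    return "".join("^" + ch.lower() if ch.isupper() else ch for ch in token)

def join_phrases(tokens, phrases):
    """Merge known two-word phrases into single underscore-joined tokens."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            out.append("_".join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(caret_lowercase("MLconf"))   # ^m^lconf
print(caret_lowercase("Varsity"))  # ^varsity
print(join_phrases(["senior", "developer", "wanted"], {("senior", "developer")}))
```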

Count-based Models
Goal: Convert Text to Vector of Neighbor Co-occurrences
Bag of Words (BOW)
Simple hashmap with word counts
Loses neighbor context
Term Frequency / Inverse Document Frequency (TF/IDF)
Normalizes based on token frequency
GloVe
Matrix factorization on co-occurrence matrix
Highly parallelizable, reduce dimensions, capture global co-occurrence stats
Log smoothing of probability ratios
Stores word vector diffs for fast analogy lookups
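
Bag of words and TF/IDF can be sketched with the standard library alone. TF/IDF has several variants; this one assumes term frequency normalized by document length and an unsmoothed inverse document frequency of log(N / df). The three toy documents are made up for illustration.

```python
import math
from collections import Counter

def bag_of_words(tokens):
    """Simple hashmap of word counts; neighbor order is lost."""
    return Counter(tokens)

def tf_idf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = bag_of_words(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights

docs = [
    "spark is fast".split(),
    "tensorflow is flexible".split(),
    "spark and tensorflow".split(),
]
scores = tf_idf(docs)
print(scores[0])
```

In the first document, "fast" outweighs "spark" and "is" because it appears in only one of the three documents.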

Neural-based Predictive Models
Goal: Predict Text using Learned Embedding Vectors
word2vec
Shallow neural network
Local: nearby words predict each other
Fixed word embedding vector size (e.g. 300)
Optimizer: Mini-batch Stochastic Gradient Descent (SGD)
SyntaxNet
Deep(er) neural network
Global(er)
Not a Recurrent Neural Net (RNN)!
Can combine with BOW-based models (e.g. word2vec CBOW)

word2vec
CBOW word2vec
Predict target word from source context
A single source context is an observation
Loses useful distribution information
Good for small datasets
Skip-gram word2vec (Inverse of CBOW)
Predict source context words from target word
Each (source context, target word) tuple is observation
Better for large datasets
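
The difference between CBOW and skip-gram is easiest to see in the training examples each one generates. The sketch below only builds those examples (no neural network); the window size of 1 and the helper names are assumptions for illustration.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def cbow_examples(tokens, window=1):
    """One observation per position: (whole context) -> target word."""
    return [(tuple(context_window(tokens, i, window)), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=1):
    """One observation per (target word, single context word) pair."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

sentence = "the quick brown fox".split()
cbow = cbow_examples(sentence)
skipgram = skipgram_examples(sentence)
print(len(cbow), len(skipgram))
```

CBOW collapses each context into a single observation, while skip-gram emits one observation per context word, which is why skip-gram produces more training examples from the same text.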

word2vec Libraries
gensim
Python only
Most popular
Spark ML
Python + Java/Scala
Supports only synonyms

*2vec
lda2vec
LDA (global) + word2vec (local)
From Chris Moody @ Stitch Fix
like2vec
Embedding-based Recommender

word2vec vs. GloVe
Both are Fundamentally Similar
Capture local co-occurrence statistics (neighbors)
Capture distance between embedding vectors (analogies)
GloVe
Count-based
Also captures global co-occurrence statistics
Requires upfront pass through entire dataset

SyntaxNet POS Tagging
Determine coarse-grained grammatical role of each word
Multiple contexts, multiple roles
Neural Net
Inputs: stack, buffer
Results: POS probability distro
(Diagram label: Already Tagged)

SyntaxNet Dependency Parser
Determine fine-grained roles using grammatical relationships
“Transition-based”, Incremental Dependency Parser
Globally Normalized using Beam Search with Early Update
Parsey McParseface: Pre-trained Parser/Tagger avail in 40 langs
Fine-grained
Coarse-grained

SyntaxNet Use Case: Nutrition
Nutrition and Health Startup in SF (Stealth)
Using Google’s SyntaxNet
Rate Recipes and Menus by Nutritional Value
Correct
Incorrect

Model Validation
Unsupervised Learning Requires Validation
Google has Published Analogy Tests for Model Validation
Thanks, Google!
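
An analogy test asks whether vector arithmetic on the learned embeddings recovers a known relationship ("man is to king as woman is to ?"). The sketch below shows the mechanics only: the 3-dimensional `toy_embeddings` are hand-made for illustration, not trained vectors and not Google's published test set.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(embeddings, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = [vb - va + vc for va, vb, vc
              in zip(embeddings[a], embeddings[b], embeddings[c])]
    # Exclude the query words themselves, as analogy evaluations usually do.
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Hand-made toy vectors (hypothetical, for illustration only).
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.5, 0.5, 0.5],
}
answer = solve_analogy(toy_embeddings, "man", "king", "woman")
print(answer)
```

A validation run would repeat this over many analogy quadruples and report the fraction answered correctly.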

Thank You, Atlanta!
Chris Fregly, Research Scientist @ PipelineIO
All Source Code, Demos, and Docker Images
@ pipeline.io
Join the Global Meetup for all Slides and Videos
@ advancedspark.com
