MLconf ATL!
Sept 23rd, 2016
Chris Fregly
Research Scientist @ PipelineIO
Who am I?
Chris Fregly, Research Scientist @ PipelineIO, San Francisco
Previously, Engineer @ Netflix, Databricks, and IBM Spark
Contributor @ Apache Spark, Committer @ Netflix OSS
Founder @ Advanced Spark and TensorFlow Meetup
Author @ Advanced Spark (advancedspark.com)
Advanced Spark and Tensorflow Meetup
ATL Spark Meetup (9/22)
http://www.slideshare.net/cfregly/atlanta-spark-user-meetup-09-22-2016
ATL Hadoop Meetup (9/21)
http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016
Confession #1
I Failed Linguistics in College!
Chose Pass/Fail Option
(90 (mid-term) + 70 (final)) / 2 = 80 = C+
How did a C+ turn into an F?
ZER0 (0) CLASS PARTICIPATION?!
Confession #2
I Hated Statistics in College
2 Degrees: Mechanical + Manufacturing Engg
Approximations were Bad!
I Wasn’t a Fluffy Physics Major
Though, I Kinda Wish I Was!
Wait… Please Don’t Leave!
I’m Older and Wiser Now
Approximate is the New Exact
Computational Linguistics and NLP are My Jam!
Agenda
Tensorflow + Neural Nets
NLP Fundamentals
NLP Models
What is Tensorflow?
General Purpose Numerical Computation Engine
Happens to be good for neural nets!
Tooling
TensorBoard (port 6006 == `goog`)
DAG-based like Spark!
Computation graph is logical plan
Stored in Protobufs
TF converts logical -> physical plan
Lots of Libraries
TFLearn (Tensorflow’s Scikit-learn Impl)
Tensorflow Serving (Prediction Layer)
Distributed and GPU-Optimized
What are Neural Networks?
Like All ML, Goal is to Minimize Loss (Error)
Error relative to known outcome of labeled data
Mostly Supervised Learning Classification
Labeled training data
Training Steps
Step 1: Randomly Guess Input Weights
Step 2: Calculate Error Against Labeled Data
Step 3: Determine Gradient Value, +/- Direction
Step 4: Back-propagate Gradient to Update Each Input Weight
Step 5: Repeat from Step 2 with New Weights until Convergence
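The training steps above can be sketched for a single linear neuron. This is a minimal illustration, not code from the talk: the function name `train`, the toy data, the learning rate, and the squared-error loss are all assumptions chosen to keep the example small.

```python
import random

def train(data, lr=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by gradient descent on squared error (illustrative only)."""
    rng = random.Random(seed)
    # Step 1: randomly guess the weights.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            # Step 2: calculate the error against the labeled data.
            error = (w * x + b) - y
            # Step 3: gradient of 0.5 * error**2 with respect to w and b.
            grad_w += error * x
            grad_b += error
        # Step 4: move each weight against its (averaged) gradient.
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
        # Step 5: repeat with the new weights.
    return w, b

# Hypothetical labeled data generated from y = 2x + 1.
data = [(x, 2 * x + 1) for x in [-2, -1, 0, 1, 2]]
w, b = train(data)
print(round(w, 3), round(b, 3))
```

A fixed epoch count stands in for a real convergence check here; a production loop would more likely stop when the loss stops improving.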
Activation Functions
Goal: Learn and Train a Model on Input Data
Non-Linear Functions
Find Non-Linear Fit of Input Data
Common Activation Functions
Sigmoid Function (sigmoid)
Output range: (0, 1)
Hyperbolic Tangent (tanh)
Output range: (-1, 1)
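The two activation functions named above can be written in a few lines of plain Python. This is a sketch for intuition only; the sample inputs are arbitrary, and a real framework would apply these element-wise to tensors.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into the interval between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes any real input into the interval between -1 and 1."""
    return math.tanh(x)

# Arbitrary sample inputs showing how both curves saturate at the extremes.
for x in (-4.0, 0.0, 4.0):
    print(x, round(sigmoid(x), 4), round(tanh(x), 4))
```

The two are closely related: tanh(x) equals 2 * sigmoid(2x) - 1, so tanh is a rescaled sigmoid centered on zero.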
Back Propagation
http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
Gradients Calculated by Comparing to Known Label
Use Gradients to Adjust Input Weights
Chain Rule
Loss/Error Optimizers
Gradient Descent
Batch (entire dataset)
Per-record (don’t do this!)
Mini-batch (empirically 16 -> 512)
Stochastic (approximation)
Momentum (optimization)
AdaGrad
SGD with adaptive learning rates per feature
Set initial learning rate
More-likely to incorrectly converge on local minima
http://www.slideshare.net/cfregly/gradient-descent-back-propagation-and-auto-differentiation-advanced-spark-and-tensorflow-meetup-08042016
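To make the difference between the optimizers above concrete, here is a hedged sketch of one momentum update and one AdaGrad update on a single weight. The function names, the toy loss f(w) = (w - 3)^2, and the hyperparameter values are assumptions for illustration, not settings from the talk.

```python
import math

def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One momentum update: velocity is a decaying sum of past gradient steps."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def adagrad_step(w, grad, grad_sq_sum, lr=0.5, eps=1e-8):
    """One AdaGrad update: the step shrinks as squared gradients accumulate."""
    grad_sq_sum += grad * grad
    return w - lr * grad / (math.sqrt(grad_sq_sum) + eps), grad_sq_sum

# Minimize the toy loss f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w_m, velocity = 0.0, 0.0
w_a, grad_sq_sum = 0.0, 0.0
for _ in range(500):
    w_m, velocity = sgd_momentum_step(w_m, 2 * (w_m - 3), velocity)
    w_a, grad_sq_sum = adagrad_step(w_a, 2 * (w_a - 3), grad_sq_sum)

print(round(w_m, 3), round(w_a, 3))
```

With many weights, AdaGrad keeps one accumulator per weight, which is what gives each feature its own learning rate.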
The Math
Linear Algebra
Matrix Multiplication
Very Parallelizable
Calculus
Derivatives
Chain Rule
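As a worked example of the chain rule in back-propagation, consider one sigmoid neuron with a squared-error loss. The symbols below (input x, weight w, bias b, label y) are assumptions introduced for this illustration.

```latex
\begin{aligned}
z &= w x + b, \qquad a = \sigma(z), \qquad L = \tfrac{1}{2}(a - y)^2 \\
\frac{\partial L}{\partial w}
  &= \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial w}
   = (a - y)\,\sigma(z)\bigl(1 - \sigma(z)\bigr)\,x
\end{aligned}
```

Each factor is a local derivative of one layer, which is why the gradient can be computed layer by layer from the output backwards.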
Convolutional Neural Networks
Feed-forward
Do not form a cycle
Apply Many Layers (a.k.a. Filters) to Input
Each Layer/Filter Picks up on Features
Features not necessarily human-grokkable
Examples of Human-grokkable Filters
3 color filters: RGB
Moving AVG for time series
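The moving-average example above is a filter with fixed, equal weights slid across a series. A minimal sketch follows; the helper name `convolve1d` and the sample series are assumptions. A CNN would learn the kernel weights instead of fixing them.

```python
def convolve1d(signal, kernel):
    """Slide `kernel` across `signal` (no padding) and return the weighted sums."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Hypothetical time series and a 3-point moving-average kernel.
series = [1.0, 2.0, 6.0, 2.0, 1.0, 7.0]
moving_avg_kernel = [1 / 3, 1 / 3, 1 / 3]
smoothed = convolve1d(series, moving_avg_kernel)
print([round(v, 2) for v in smoothed])
```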
Brute Force
Try Diff numLayers & layerSizes
CNN Use Case: Stitch Fix
Stitch Fix Also Uses NLP to Analyze Return/Reject Comments
StitchFix Strata Conf SF 2016:
Using Deep Learning to Create New Clothing Styles!
Recurrent Neural Networks
Forms a Cycle (vs. Feed-forward)
Maintains State over Time
Keep track of context
Learns sequential patterns
Decay over time
Use Cases
Speech
Text/NLP Prediction
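How a recurrent network keeps state, and how that state decays, can be sketched with a one-unit vanilla RNN. The weights `w_x`, `w_h`, and `b` are hypothetical constants here; in practice they are learned.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """New hidden state from the current input and the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states

# A single input pulse followed by silence: the state remembers it, then fades.
states = run_rnn([1.0, 0.0, 0.0, 0.0])
print([round(h, 4) for h in states])
```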
RNN Sequences
Input: Image
Output: Classification
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Input: Image
Output: Text (Captions)
Input: Text
Output: Class (Sentiment)
Input: Text (English)
Output: Text (Spanish)
Diagram labels: Input Layer, Hidden Layer, Output Layer
Character-based RNNs
Tokens are Characters vs. Words/Phrases
Microsoft trains every 3 characters
Less Combination of Possible Neighbors
Only 26 alpha character tokens vs. millions of word tokens
Preserving state between 1st and 2nd ‘l’ improves prediction
Long Short Term Memory (LSTM)
More Complex State Update Function than Vanilla RNN
LSTM State Update
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Cell State
Forget Gate Layer (Sigmoid)
Input Gate Layer (Sigmoid)
Candidate Gate Layer (tanh)
Output Layer
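The gates listed above combine into a single state update. Below is a scalar sketch in the spirit of the linked colah.github.io post; the parameter layout `(w_x, w_h, bias)` per gate and all numeric values are assumptions picked to show the forget gate preserving the cell state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. `p` maps each gate name to hypothetical (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))             # how much old cell state to keep
    i = sigmoid(pre("input"))              # how much of the candidate to admit
    c_tilde = math.tanh(pre("candidate"))  # proposed new content
    c = f * c_prev + i * c_tilde           # updated cell state
    o = sigmoid(pre("output"))             # how much cell state to expose
    h = o * math.tanh(c)                   # new hidden state
    return h, c

params = {
    "forget": (0.0, 0.0, 10.0),     # gate near 1: keep the old cell state
    "input": (0.0, 0.0, -10.0),     # gate near 0: ignore the new candidate
    "candidate": (1.0, 1.0, 0.0),
    "output": (1.0, 1.0, 0.0),
}
h, c = lstm_step(x=0.5, h_prev=0.0, c_prev=2.0, p=params)
print(round(h, 4), round(c, 4))
```

With the forget gate saturated open and the input gate closed, the cell state passes through almost unchanged, which is the mechanism that lets an LSTM carry context further than a vanilla RNN.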
Transfer Learning
Agenda
Tensorflow + Neural Nets
NLP Fundamentals
NLP Models
Use Cases
Document Summary
TextRank: TF/IDF + PageRank
Article Classification and Similarity
LDA: calculate top `k` topic distribution
Machine Translation
word2vec: compare word embedding vectors
Must Convert Text to Numbers!
Core Concepts
Corpus
Collection of text
e.g. Documents, articles, genetic codes
Embeddings
Tokens represented/embedded in vector space
Learned, hidden features (~PCA, SVD)
Similar tokens cluster together, analogies cluster apart
k-skip-gram
Skip k neighbors when defining tokens
n-gram
Treat n consecutive tokens as a single token
Composable: 1-skip, bi-gram (every other word)
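The n-gram and k-skip definitions above compose into one small function. This sketch follows the slide's reading, where exactly k tokens are skipped between members; note that some of the literature instead allows up to k skips. The function name `skip_ngrams` and the sample sentence are assumptions.

```python
def skip_ngrams(tokens, n=2, k=0):
    """Return n-grams whose members are k tokens apart (k=0 gives plain n-grams)."""
    step = k + 1
    span = (n - 1) * step
    return [
        tuple(tokens[i + j * step] for j in range(n))
        for i in range(len(tokens) - span)
    ]

tokens = "the quick brown fox jumps".split()
bigrams = skip_ngrams(tokens, n=2, k=0)
one_skip_bigrams = skip_ngrams(tokens, n=2, k=1)  # pairs every other word
print(bigrams)
print(one_skip_bigrams)
```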
Parsers and POS Taggers
Describe grammatical sentence structure
Requires context of entire sentence
Helps reason about sentence
80% obvious, simple token neighbors
Major bottleneck in NLP pipeline!
Pre-trained Parsers and Taggers
Penn Treebank
Parser and Part-of-Speech Tagger
Human-annotated (!)
Trained on 4.5 million words
Parsey McParseface
Trained by SyntaxNet
Feature Engineering
Lower-case
Preserve proper nouns using caret (`^`)
“MLconf” => “^m^lconf”
“Varsity” => “^varsity”
Encode Common N-grams (Phrases)
Create a single token using underscore (`_`)
“Senior Developer” => “senior_developer”
Stemming and Lemmatization
Try to avoid: let the neural network figure this out
Can preserve part of speech (POS) using “_noun”, “_verb”
“banking” => “banking_verb”
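The lower-casing and phrase-encoding rules above can be sketched as two small helpers. The names `caret_lower` and `join_phrases` are hypothetical, and the phrase set is supplied by hand here rather than mined from a corpus.

```python
def caret_lower(text):
    """Lower-case `text`, marking each originally upper-case letter with a caret."""
    return "".join("^" + ch.lower() if ch.isupper() else ch for ch in text)

def join_phrases(tokens, phrases):
    """Merge known two-word phrases into single underscore-joined tokens."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(caret_lower("MLconf"))
print(join_phrases("senior developer wanted".split(), {("senior", "developer")}))
```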
Agenda
Tensorflow + Neural Nets
NLP Fundamentals
NLP Models
Count-based Models
Goal: Convert Text to Vector of Neighbor Co-occurrences
Bag of Words (BOW)
Simple hashmap with word counts
Loses neighbor context
Term Frequency / Inverse Document Frequency (TF/IDF)
Normalizes based on token frequency
GloVe
Matrix factorization on co-occurrence matrix
Highly parallelizable, reduce dimensions, capture global co-occurrence stats
Log smoothing of probability ratios
Stores word vector diffs for fast analogy lookups
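Bag of Words and TF/IDF from the list above fit in a few lines of standard-library Python. This is a sketch using one common weighting, tf * log(N / df); other variants exist, and the three toy documents are assumptions.

```python
import math
from collections import Counter

def bag_of_words(tokens):
    """Word counts only; neighbor order is lost."""
    return Counter(tokens)

def tf_idf(docs):
    """Return one {token: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights

docs = [
    "spark is fast".split(),
    "tensorflow is flexible".split(),
    "spark and tensorflow".split(),
]
weights = tf_idf(docs)
print(bag_of_words(docs[0]))
print(weights[0])
```

Tokens that appear in fewer documents ("fast") score higher than tokens shared across documents ("spark", "is"), which is the normalization the slide refers to.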
Neural-based Predictive Models
Goal: Predict Text using Learned Embedding Vectors
word2vec
Shallow neural network
Local: nearby words predict each other
Fixed word embedding vector size (e.g. 300)
Optimizer: Mini-batch Stochastic Gradient Descent (SGD)
SyntaxNet
Deep(er) neural network
Global(er)
Not a Recurrent Neural Net (RNN)!
Can combine with BOW-based models (e.g. word2vec CBOW)
word2vec
CBOW word2vec
Predict target word from source context
A single source context is an observation
Loses useful distribution information
Good for small datasets
Skip-gram word2vec (Inverse of CBOW)
Predict source context words from target word
Each (source context, target word) tuple is observation
Better for large datasets
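The difference between CBOW and skip-gram shows up in how training observations are built from the same text. The sketch below only generates the observations, it does not train a model; the helper names and the window size of 1 are assumptions.

```python
def context_windows(tokens, window=1):
    """Yield (context_words, target_word) for each position in `tokens`."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

def cbow_examples(tokens, window=1):
    # CBOW: the whole context is one observation that predicts the target.
    return [(tuple(ctx), target) for ctx, target in context_windows(tokens, window)]

def skipgram_examples(tokens, window=1):
    # Skip-gram: each (target, context word) pair is its own observation.
    return [(target, c) for ctx, target in context_windows(tokens, window) for c in ctx]

tokens = "approximate is the new exact".split()
cbow = cbow_examples(tokens)
skip = skipgram_examples(tokens)
print(cbow)
print(skip)
```

Skip-gram produces more observations from the same sentence, which is one reason it tends to benefit from larger datasets.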
word2vec Libraries
gensim
Python only
Most popular
Spark ML
Python + Java/Scala
Supports only synonyms
*2vec
lda2vec
LDA (global) + word2vec (local)
From Chris Moody @ Stitch Fix
like2vec
Embedding-based Recommender
word2vec vs. GloVe
Both are Fundamentally Similar
Capture local co-occurrence statistics (neighbors)
Capture distance between embedding vectors (analogies)
GloVe
Count-based
Also captures global co-occurrence statistics
Requires upfront pass through entire dataset
SyntaxNet POS Tagging
Determine coarse-grained grammatical role of each word
Multiple contexts, multiple roles
Neural Net
Inputs: stack, buffer
Results: POS probability distro
Already Tagged
SyntaxNet Dependency Parser
Determine fine-grained roles using grammatical relationships
“Transition-based”, Incremental Dependency Parser
Globally Normalized using Beam Search with Early Update
Parsey McParseface: Pre-trained Parser/Tagger avail in 40 langs
Figure labels: Fine-grained, Coarse-grained
SyntaxNet Use Case: Nutrition
Nutrition and Health Startup in SF (Stealth)
Using Google’s SyntaxNet
Rate Recipes and Menus by Nutritional Value
Figure labels: Correct, Incorrect
Model Validation
Unsupervised Learning Requires Validation
Google has Published Analogy Tests for Model Validation
Thanks, Google!
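An analogy test of the kind mentioned above checks whether vector arithmetic on embeddings lands near the expected word. The sketch below uses hand-made 2-D vectors purely for illustration; real tests run against trained embeddings with hundreds of dimensions, and the helper names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.7, 0.1),
}

def analogy(a, b, c):
    """Solve a : b :: c : ? by finding the word nearest to b - a + c."""
    target = tuple(vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("man", "king", "woman")
print(answer)
```

A validation run would count how many analogies from a published test set resolve to the expected word.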
Thank You, Atlanta!
Chris Fregly, Research Scientist @ PipelineIO
All Source Code, Demos, and Docker Images
@ pipeline.io
Join the Global Meetup for all Slides and Videos
@ advancedspark.com

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 

Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

  • 1. MLconf ATL! Sept 23rd, 2016 Chris Fregly Research Scientist @ PipelineIO
  • 2. Who am I? Chris Fregly, Research Scientist @ PipelineIO, San Francisco Previously, Engineer @ Netflix, Databricks, and IBM Spark Contributor @ Apache Spark, Committer @ Netflix OSS Founder @ Advanced Spark and TensorFlow Meetup Author @ Advanced Spark (advancedspark.com)
  • 3. Advanced Spark and Tensorflow Meetup
  • 4. ATL Spark Meetup (9/22) http://www.slideshare.net/cfregly/atlanta-spark-user-meetup-09-22-2016
  • 5. ATL Hadoop Meetup (9/21) http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016
  • 7. Confession #1 I Failed Linguistics in College! Chose Pass/Fail Option (90 (mid-term) + 70 (final)) / 2 = 80 = C+ How did a C+ turn into an F? ZER0 (0) CLASS PARTICIPATION?!
  • 8. Confession #2 I Hated Statistics in College 2 Degrees: Mechanical + Manufacturing Engg Approximations were Bad! I Wasn’t a Fluffy Physics Major Though, I Kinda Wish I Was!
  • 9. Wait… Please Don’t Leave! I’m Older and Wiser Now Approximate is the New Exact Computational Linguistics and NLP are My Jam!
  • 10. Agenda Tensorflow + Neural Nets NLP Fundamentals NLP Models
  • 11. What is Tensorflow? General Purpose Numerical Computation Engine Happens to be good for neural nets! Tooling Tensorboard (port 6006 == `goog`) DAG-based like Spark! Computation graph is logical plan Stored in Protobufs TF converts logical -> physical plan Lots of Libraries TFLearn (Tensorflow’s Scikit-learn Impl) Tensorflow Serving (Prediction Layer) Distributed and GPU-Optimized
  • 12. What are Neural Networks? Like All ML, Goal is to Minimize Loss (Error) Error relative to known outcome of labeled data Mostly Supervised Learning Classification Labeled training data Training Steps Step 1: Randomly Guess Input Weights Step 2: Calculate Error Against Labeled Data Step 3: Determine Gradient Value, +/- Direction Step 4: Back-propagate Gradient to Update Each Input Weight Step 5: Repeat from Step 2 with New Weights until Convergence Activation Function
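The training steps on this slide can be sketched with a single sigmoid neuron. This is a minimal illustration in plain Python, not code from the talk: the names `train_neuron`, `predict`, and `OR_DATA`, the learning rate, and the epoch count are all assumptions, and with only one neuron "back-propagation" reduces to one application of the chain rule.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, learning_rate=0.5, epochs=5000, seed=0):
    """Fit one sigmoid neuron to (inputs, label) pairs with batch gradient descent."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # Step 1: randomly guess the input weights (and the bias).
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
    bias = rng.uniform(-1.0, 1.0)
    for _ in range(epochs):
        grad_w = [0.0] * n_inputs
        grad_b = 0.0
        for inputs, label in samples:
            # Step 2: calculate the error against the labeled data.
            error = predict(weights, bias, inputs) - label
            # Step 3: accumulate the gradient of the cross-entropy loss.
            for i, x in enumerate(inputs):
                grad_w[i] += error * x
            grad_b += error
        # Step 4: move each weight against its gradient.
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
        # Step 5: the loop repeats with the updated weights.
    return weights, bias


# Hypothetical labeled data: the logical OR function.
OR_DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(OR_DATA)
for inputs, label in OR_DATA:
    print(inputs, label, round(predict(weights, bias, inputs), 3))
```

A fixed epoch budget stands in for a real convergence check to keep the sketch short.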
  • 13. Activation Functions Goal: Learn and Train a Model on Input Data Non-Linear Functions Find Non-Linear Fit of Input Data Common Activation Functions Sigmoid Function (sigmoid) range (0, 1) Hyperbolic Tangent (tanh) range (-1, 1)
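As a quick illustration of the two activation functions named on this slide, the sketch below evaluates both at a few points using only the Python standard library. The sample inputs are arbitrary; the last line checks the known identity tanh(z) = 2 * sigmoid(2z) - 1, which shows that tanh is a rescaled sigmoid.

```python
import math


def sigmoid(z):
    """Squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-4.0, 0.0, 4.0):
    # math.tanh squashes any real input into the open interval (-1, 1).
    print(z, sigmoid(z), math.tanh(z))

# tanh is a shifted and rescaled sigmoid.
identity_gap = abs(math.tanh(1.5) - (2.0 * sigmoid(3.0) - 1.0))
```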
  • 15. Loss/Error Optimizers Gradient Descent Batch (entire dataset) Per-record (don’t do this!) Mini-batch (empirically 16 -> 512) Stochastic (approximation) Momentum (optimization) AdaGrad SGD with adaptive learning rates per feature Set initial learning rate More likely to incorrectly converge on local minima http://www.slideshare.net/cfregly/gradient-descent-back-propagation-and-auto-differentiation-advanced-spark-and-tensorflow-meetup-08042016
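The "adaptive learning rates per feature" idea behind AdaGrad can be sketched as follows. This is an illustrative toy, not the talk's code: `adagrad_minimize`, `bowl_gradient`, the initial learning rate, and the step count are assumptions. Each parameter divides its step by the square root of its own running sum of squared gradients.

```python
import math


def adagrad_minimize(gradient_fn, start, learning_rate=1.0, steps=500, epsilon=1e-8):
    """Minimize a function given its gradient, using the AdaGrad update rule."""
    params = list(start)
    # One accumulator of squared gradients per feature.
    squared_sum = [0.0] * len(params)
    for _ in range(steps):
        grads = gradient_fn(params)
        for i, g in enumerate(grads):
            squared_sum[i] += g * g
            # Features with a history of large gradients take smaller steps.
            params[i] -= learning_rate * g / (math.sqrt(squared_sum[i]) + epsilon)
    return params


def bowl_gradient(p):
    """Gradient of the hypothetical loss f(x, y) = x^2 + 100 * y^2."""
    return [2.0 * p[0], 200.0 * p[1]]


result = adagrad_minimize(bowl_gradient, start=[5.0, 5.0])
print(result)
```

The two coordinates have curvatures that differ by a factor of 100, yet both shrink toward zero at nearly the same pace because each one is normalized by its own gradient history.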
  • 16. The Math Linear Algebra Matrix Multiplication Very Parallelizable Calculus Derivatives Chain Rule
  • 17. Convolutional Neural Networks Feed-forward Do not form a cycle Apply Many Layers (aka Filters) to Input Each Layer/Filter Picks up on Features Features not necessarily human-grokkable Examples of Human-grokkable Filters 3 color filters: RGB Moving AVG for time series Brute Force Try Diff numLayers & layerSizes
  • 18. CNN Use Case: Stitch Fix Stitch Fix Also Uses NLP to Analyze Return/Reject Comments Stitch Fix Strata Conf SF 2016: Using Deep Learning to Create New Clothing Styles!
  • 19. Recurrent Neural Networks Forms a Cycle (vs. Feed-forward) Maintains State over Time Keep track of context Learns sequential patterns Decay over time Use Cases Speech Text/NLP Prediction
  • 20. RNN Sequences Input: Image Output: Classification http://karpathy.github.io/2015/05/21/rnn-effectiveness/ Input: Image Output: Text (Captions) Input: Text Output: Class (Sentiment) Input: Text (English) Output: Text (Spanish) Input Layer Hidden Layer Output Layer
  • 21. Character-based RNNs Tokens are Characters vs. Words/Phrases Microsoft trains every 3 characters Fewer Combinations of Possible Neighbors Only 26 alpha character tokens vs. millions of word tokens Preserving state between 1st and 2nd ‘l’ improves prediction
  • 22. Long Short Term Memory (LSTM) More Complex State Update Function than Vanilla RNN
  • 23. LSTM State Update http://colah.github.io/posts/2015-08-Understanding-LSTMs/ Cell State Forget Gate Layer (Sigmoid) Input Gate Layer (Sigmoid) Candidate Gate Layer (tanh) Output Layer
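The gate layers listed on this slide combine into one state update per time step. The sketch below uses scalar states to keep the arithmetic visible; it follows the standard LSTM formulation in the linked post, but `lstm_step`, the parameter layout, and `ZERO_PARAMS` are assumptions made for illustration (a real cell uses weight matrices and vectors).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar input, hidden state, and cell state.

    params maps each layer name to a hypothetical (w_x, w_h, bias) triple.
    """
    def layer(name, squash):
        w_x, w_h, bias = params[name]
        return squash(w_x * x + w_h * h_prev + bias)

    forget = layer("forget", sigmoid)          # how much old cell state to keep
    input_gate = layer("input", sigmoid)       # how much new information to admit
    candidate = layer("candidate", math.tanh)  # proposed new cell content
    output = layer("output", sigmoid)          # how much cell state to expose

    c = forget * c_prev + input_gate * candidate
    h = output * math.tanh(c)
    return h, c


# With all weights at zero every sigmoid gate is half open and the candidate is 0,
# so the cell state simply halves.
ZERO_PARAMS = {name: (0.0, 0.0, 0.0)
               for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=ZERO_PARAMS)
print(h, c)
```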
  • 25. Agenda Tensorflow + Neural Nets NLP Fundamentals NLP Models
  • 26. Use Cases Document Summary TextRank: TF/IDF + PageRank Article Classification and Similarity LDA: calculate top `k` topic distribution Machine Translation word2vec: compare word embedding vectors Must Convert Text to Numbers!
  • 27. Core Concepts Corpus Collection of text, e.g. documents, articles, genetic codes Embeddings Tokens represented/embedded in vector space Learned, hidden features (~PCA, SVD) Similar tokens cluster together, analogies cluster apart k-skip-gram Skip k neighbors when defining tokens n-gram Treat n consecutive tokens as a single token Composable: 1-skip, bi-gram (every other word)
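The n-gram and k-skip-gram definitions on this slide can be sketched in a few lines. The function names are illustrative, and `skip_ngrams` follows the slide's simplified reading (skip exactly k neighbors between members); some references define k-skip-n-grams more broadly, as allowing up to k skips.

```python
def ngrams(tokens, n):
    """Treat every run of n consecutive tokens as a single token."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_ngrams(tokens, n, k):
    """Build n-grams whose members are separated by exactly k skipped tokens."""
    stride = k + 1
    span = (n - 1) * stride + 1
    return [tuple(tokens[i:i + span:stride])
            for i in range(len(tokens) - span + 1)]


tokens = "the quick brown fox jumps".split()
bigrams = ngrams(tokens, 2)
# 1-skip bi-grams pair every other word.
one_skip_bigrams = skip_ngrams(tokens, 2, 1)
print(bigrams)
print(one_skip_bigrams)
```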
  • 28. Parsers and POS Taggers Describe grammatical sentence structure Requires context of entire sentence Helps reason about sentence 80% obvious, simple token neighbors Major bottleneck in NLP pipeline!
  • 29. Pre-trained Parsers and Taggers Penn Treebank Parser and Part-of-Speech Tagger Human-annotated (!) Trained on 4.5 million words Parsey McParseface Trained by SyntaxNet
  • 30. Feature Engineering Lower-case Preserve proper nouns using caret (`^`) “MLconf” => “^m^lconf” “Varsity” => “^varsity” Encode Common N-grams (Phrases) Create a single token using underscore (`_`) “Senior Developer” => “senior_developer” Stemming and Lemmatization Try to avoid: let the neural network figure this out Can preserve part of speech (POS) using “_noun”, “_verb” “banking” => “banking_verb”
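The two encodings on this slide can be sketched as small helpers. The names `encode_case` and `join_phrases` and the phrase set are hypothetical, and the phrase matcher only handles two-word phrases to stay short.

```python
def encode_case(token):
    """Lower-case a token, marking each originally upper-case letter with a caret."""
    return "".join("^" + ch.lower() if ch.isupper() else ch for ch in token)


def join_phrases(tokens, phrases):
    """Merge known two-word phrases into single underscore-joined tokens."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


print(encode_case("MLconf"))
print(encode_case("Varsity"))
# Hypothetical phrase list; a real pipeline would mine frequent n-grams from the corpus.
print(join_phrases(["Senior", "Developer", "wanted"], {("senior", "developer")}))
```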
  • 31. Agenda Tensorflow + Neural Nets NLP Fundamentals NLP Models
  • 32. Count-based Models Goal: Convert Text to Vector of Neighbor Co-occurrences Bag of Words (BOW) Simple hashmap with word counts Loses neighbor context Term Frequency / Inverse Document Frequency (TF/IDF) Normalizes based on token frequency GloVe Matrix factorization on co-occurrence matrix Highly parallelizable, reduce dimensions, capture global co-occurrence stats Log smoothing of probability ratios Stores word vector diffs for fast analogy lookups
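To make the first two count-based models concrete, here is a small sketch of Bag of Words and TF/IDF in plain Python. The function names and the toy documents are assumptions, and the weighting shown (relative term frequency times unsmoothed log inverse document frequency) is only one of several common TF/IDF variants.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Simple hashmap of word counts; neighbor order is lost."""
    return Counter(tokens)


def tf_idf(documents):
    """Return one {token: weight} dict per document (documents are token lists)."""
    doc_count = len(documents)
    # Number of documents that contain each token at least once.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    weights = []
    for doc in documents:
        counts = bag_of_words(doc)
        weights.append({
            token: (count / len(doc)) * math.log(doc_count / doc_freq[token])
            for token, count in counts.items()
        })
    return weights


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
weights = tf_idf(docs)
print(weights[1])
```

A token that appears in every document, such as "the" here, gets weight zero, while a token unique to one document gets the largest weight.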
  • 33. Neural-based Predictive Models Goal: Predict Text using Learned Embedding Vectors word2vec Shallow neural network Local: nearby words predict each other Fixed word embedding vector size (e.g. 300) Optimizer: Mini-batch Stochastic Gradient Descent (SGD) SyntaxNet Deep(er) neural network Global(er) Not a Recurrent Neural Net (RNN)! Can combine with BOW-based models (e.g. word2vec CBOW)
  • 34. word2vec CBOW word2vec Predict target word from source context A single source context is an observation Loses useful distribution information Good for small datasets Skip-gram word2vec (Inverse of CBOW) Predict source context words from target word Each (source context, target word) tuple is observation Better for large datasets
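The difference between the two word2vec variants shows up in how training observations are built from a sentence. The sketch below only generates those observations and trains nothing; the function names and the window size are illustrative assumptions.

```python
def context_windows(tokens, window):
    """Yield (context words, target word) for every position in the sentence."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target


def cbow_observations(tokens, window=1):
    """CBOW: the whole context is one observation that predicts the target."""
    return [(tuple(context), target)
            for context, target in context_windows(tokens, window)]


def skipgram_observations(tokens, window=1):
    """Skip-gram: the target predicts each context word as its own observation."""
    return [(target, word)
            for context, target in context_windows(tokens, window)
            for word in context]


tokens = ["the", "quick", "brown"]
print(cbow_observations(tokens))
print(skipgram_observations(tokens))
```

Skip-gram produces more observations from the same text, one per context word instead of one per position.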
  • 35. word2vec Libraries gensim Python only Most popular Spark ML Python + Java/Scala Supports only synonyms
  • 36. *2vec lda2vec LDA (global) + word2vec (local) From Chris Moody @ Stitch Fix like2vec Embedding-based Recommender
  • 37. word2vec vs. GloVe Both are Fundamentally Similar Capture local co-occurrence statistics (neighbors) Capture distance between embedding vector (analogies) GloVe Count-based Also captures global co-occurrence statistics Requires upfront pass through entire dataset
  • 38. SyntaxNet POS Tagging Determine coarse-grained grammatical role of each word Multiple contexts, multiple roles Neural Net Inputs: stack, buffer Results: POS probability distro Already Tagged
  • 39. SyntaxNet Dependency Parser Determine fine-grained roles using grammatical relationships “Transition-based”, Incremental Dependency Parser Globally Normalized using Beam Search with Early Update Parsey McParseface: Pre-trained Parser/Tagger avail in 40 langs Fine-grained Coarse-grained
  • 40. SyntaxNet Use Case: Nutrition Nutrition and Health Startup in SF (Stealth) Using Google’s SyntaxNet Rate Recipes and Menus by Nutritional Value Correct Incorrect
  • 41. Model Validation Unsupervised Learning Requires Validation Google has Published Analogy Tests for Model Validation Thanks, Google!
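An analogy test of the kind mentioned on this slide can be sketched with vector arithmetic and cosine similarity. The embeddings in `TOY_EMBEDDINGS` are hand-made, hypothetical values chosen so the arithmetic is easy to follow; a real validation run would use trained vectors and a published analogy file.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def solve_analogy(embeddings, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    query = [vb - va + vc
             for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))


def analogy_accuracy(embeddings, analogies):
    """Fraction of (a, b, c, expected) analogies the embeddings answer correctly."""
    correct = sum(solve_analogy(embeddings, a, b, c) == expected
                  for a, b, c, expected in analogies)
    return correct / len(analogies)


# Hypothetical 3-dimensional embeddings, made up for this example.
TOY_EMBEDDINGS = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.5, 0.1],
}
accuracy = analogy_accuracy(TOY_EMBEDDINGS, [("man", "woman", "king", "queen"),
                                             ("king", "queen", "man", "woman")])
print(accuracy)
```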
  • 42. Thank You, Atlanta! Chris Fregly, Research Scientist @ PipelineIO All Source Code, Demos, and Docker Images @ pipeline.io Join the Global Meetup for all Slides and Videos @ advancedspark.com