Recipe2Vec:
Or how does my robot know
what recipes are related?
Meghan Heintz, Senior Data Scientist at BuzzFeed and Tasty
Tasty Ecosystem Before
Why Related Recipes?
Search behavior on the BuzzFeed Tasty vertical showed the need for them:
- Narrows a potentially exhaustive search down for the user
- Provides a channel for us to resurface older content in a coherent manner on new, highly trafficked recipes
Why do we even need to calculate related recipes?
Can’t the producers/chefs tag those for us???
Humans, while great chefs… are not great taggers.
My favorite example of
awful human tagging....
How will robots/computers understand the
content of our recipes?
Making text machine readable?
- Dummy coding, e.g. pandas.get_dummies
- Label encoding, e.g. sklearn.preprocessing.LabelEncoder
More advanced techniques…
- Polynomial
- Backward Difference
- Helmert
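A minimal sketch of the two baseline encodings named above; the ingredient column and values are made-up illustrations, not data from the talk:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column (purely illustrative).
df = pd.DataFrame({"ingredient": ["eggplant", "chicken", "eggplant", "tofu"]})

# Dummy coding: one binary indicator column per category.
dummies = pd.get_dummies(df["ingredient"], prefix="ingredient")

# Label encoding: one integer id per category.
le = LabelEncoder()
df["ingredient_id"] = le.fit_transform(df["ingredient"])

print(dummies)
print(df)
```

Neither encoding captures that "eggplant" and "zucchini" are more alike than "eggplant" and "chicken", which is where embeddings come in.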
Word Embeddings
Raw text corpus (recipes) → vector representations for the words in the corpus
E.g. eggplant = [0.1, 0.5, 0.2, 0.4, 0.9]
Plotting these vectors should show us that similar words end up spatially closer to each other than dissimilar words.
Ways to make word embeddings
● TF-IDF vectorization
● Word-word co-occurrence matrix
○ GloVe log-bilinear model with a weighted
least-squares objective
● Neural Network
○ word2vec two-layer neural network
"A word is characterized by the company it keeps" -Firth
How does this work??
Using the word2vec implementation with skip-gram…
- Take a sentence like “the quick brown fox jumped over the lazy dog”
- Decompose it into (context words, target word) pairs: ([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...
- Each word is initialized as a random vector with small values
- Predict the context words from the target word using a softmax regression classifier
- Update the word vectors by taking a small step to maximize the objective function, using stochastic gradient descent and backpropagation
- Rinse and repeat!
Image: http://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/
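To make the decomposition step concrete, here is a tiny pair-generation sketch (the window size of 1 matches the example above; the helper name is made up):

```python
# Generate (target, context) training pairs for skip-gram.
def skipgram_pairs(tokens, window=1):
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()
print(skipgram_pairs(sentence)[:6])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ...]
```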
How is word2vec different from other NN
implementations?
Word Pairs and “Phrases”
● Common pairings get treated as “phrases” rather than single words
Subsampling
● Subsampling frequent words to decrease the number of training
examples
Negative Sampling
● Only a small percentage of the model’s weights are updated with each training sample: a few “negative” words are selected and their vectors are updated along with the “positive” word.
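All three tricks are exposed as options in the gensim implementation; a sketch, with parameter values as illustrative assumptions rather than the talk’s settings:

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

# sentences = tokenized recipe steps, as in the earlier sketch (toy data here).
sentences = [["preheat", "the", "oven"], ["sour", "cream", "and", "onion", "dip"]]

# Word pairs / "phrases": merge frequent bigrams into single tokens, e.g. "sour_cream".
bigram = Phrases(sentences, min_count=1, threshold=1)
phrased = [bigram[s] for s in sentences]

model = Word2Vec(
    phrased,
    sg=1,
    sample=1e-3,   # subsample very frequent words to cut down training examples
    negative=5,    # negative sampling: update only a few "negative" words per example
    min_count=1,
)
```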
How do we evaluate these embeddings??
Maybe Dimensionality Reduction??
High-dimensional data is very difficult to visualize. However, there are methods to project high-dimensional data down to fewer dimensions. Principal component analysis, linear discriminant analysis, and t-distributed stochastic neighbor embedding are all dimensionality reduction methods. We can use one of them to visualize our embeddings.
Image: http://www.turingfinance.com/artificial-intelligence-and-statistics-principal-component-analysis-and-self-organizing-maps/
Why it’s great: t-SNE
● Won the Merck Visualization Challenge on Kaggle
● Better for visualization than PCA
● Solves “the crowding problem”
t-Distributed stochastic neighbor embedding (t-SNE) minimizes the divergence between two distributions: a distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise similarities of the corresponding low-dimensional points in the embedding.
What to be careful about
● Those hyperparameters really matter
● Cluster sizes in a t-SNE plot mean nothing
● Distances between clusters might not mean anything
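A minimal sketch of projecting word vectors to 2-D with scikit-learn’s t-SNE, continuing from the earlier gensim sketch but assuming the model was trained on the full recipe corpus rather than the toy sentences (perplexity and the number of words plotted are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stack vectors for the most frequent words from the trained model.
words = model.wv.index_to_key[:500]
X = np.array([model.wv[w] for w in words])

# Perplexity is one of the hyperparameters that "really matter"; try several values.
coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], s=3)
for w, (x, y) in zip(words[:50], coords[:50]):
    plt.annotate(w, (x, y), fontsize=6)
plt.show()
```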
Find the most similar vectors by sorting on cosine similarity
Image: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
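Cosine similarity is just the normalized dot product; a tiny self-contained sketch reusing the toy eggplant vector from earlier (the second vector is made up):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (very similar), 0.0 = orthogonal (unrelated).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

eggplant = np.array([0.1, 0.5, 0.2, 0.4, 0.9])
zucchini = np.array([0.2, 0.4, 0.1, 0.5, 0.8])   # made-up vector
print(cosine_similarity(eggplant, zucchini))
```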
Evaluating word embeddings using similarities for known relationships
Turns out word embeddings are sort of modular: when you add, subtract, or concatenate word embeddings, they retain their meaning.
Example:
“King” - “Man” + “Woman” = “Queen”
Image: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
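In gensim the same analogy check is a one-liner (this assumes a model trained on a large general-language corpus; on the toy recipe sentences above the result would be meaningless):

```python
# Expected to surface something like [("queen", 0.7...)] on a large corpus.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```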
recipe2vec
Sum all our word embeddings from each recipe’s preparation steps to create our recipe vector.
Evaluate recipe vectors using t-SNE
[t-SNE plot of recipe vectors, with clusters labeled Desserts, Comfort Food, Healthy-ish, and Happy Hour (Boozy)]
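A sketch of that summation step (tokenization and the helper name are assumptions; the talk only specifies summing the word vectors from each recipe’s preparation steps):

```python
import numpy as np

def recipe_vector(preparation_steps, wv):
    """Sum the word vectors of every in-vocabulary token in a recipe's preparation steps."""
    tokens = [t for step in preparation_steps for t in step.lower().split()]
    vectors = [wv[t] for t in tokens if t in wv]
    return np.sum(vectors, axis=0)

steps = ["Slice the eggplant into rounds", "Roast until tender"]
vec = recipe_vector(steps, model.wv)   # model from the earlier gensim sketch
```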
How does this stay fresh?
We publish ~15-20 new recipes a week
Recipe2Vec is applied with the existing (possibly stale) model every time a recipe is published
The model is completely retrained every 12 hours
Sample Results
Still needs some tweaks for recipes with similar preparations but different flavor profiles… e.g. smoothies vs. boozy bevvies
Other uses for recipe vectors
● Predicting the performance of new recipes based on the performance of older, similar recipes
● Creating context-aware recommendations for users by combining collaborative filtering recommendations with recipe similarity metrics
● Making recommendations to producers on the types of recipes to make, based on past performance
● Generally as useful features in machine learning applications
THE
END
Questions?
We are HIRING!!!
Contact me at:
meghan.heintz@buzzfeed.com
Or on Twitter @dot2dotseurat
Or come talk to me outside.
