This document describes BakeSearch, a recipe search tool that clusters recipes by their ingredients using natural language processing and machine learning techniques. It discusses the challenges of clustering and analyzing a dataset of 40,000 recipes and 4,000 ingredients, and how MapReduce, Amazon EMR, NumPy, SciPy, NLTK, and NetworkX are used to overcome those challenges at scale.
5. Disambiguating searches
• Classic chocolate chip cookies
• Patty’s best chocolate cookies
• Peanut butter cookies
• Sugar cookies with frosting
• Gooey butter cookies
• Banana pumpkin cookies
• Black and white cookies
• Halloween cookies
Bigrams + Trigrams
Candidate labels
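The slide suggests that bigrams and trigrams taken from recipe titles serve as candidate cluster labels. A minimal sketch of that idea follows, using only the Python standard library. The helper names `ngrams` and `candidate_labels`, the lowercasing, the whitespace tokenization, and the `min_count` threshold are all assumptions made for illustration; the deck does not say how candidates are filtered.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_labels(titles, min_count=2):
    """Count bigrams and trigrams across titles.

    min_count is a hypothetical filter: an n-gram is kept only if it
    occurs in at least that many titles.
    """
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()  # assumption: naive whitespace tokenizer
        for n in (2, 3):
            # set() so each n-gram is counted at most once per title
            counts.update(set(ngrams(tokens, n)))
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_count}

titles = [
    "Classic chocolate chip cookies",
    "Patty's best chocolate cookies",
    "Peanut butter cookies",
    "Sugar cookies with frosting",
    "Gooey butter cookies",
    "Banana pumpkin cookies",
    "Black and white cookies",
    "Halloween cookies",
]
labels = candidate_labels(titles)
print(labels)
```

On these eight titles the only n-gram shared by two titles is "butter cookies". A real run over 40k titles would surface many more candidates.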
7. Defining distance measure
Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4
Recipe 2: Ingr4, Ingr9, Ingr12
Jaccard = (Ingredients in both recipes) / (Ingredients in either recipe)
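The ratio on this slide can be worked through on the two example recipes. The sketch below treats each recipe as a set of ingredient names. The ratio as written is the Jaccard similarity; taking the distance to be one minus that ratio is the usual convention and is an assumption here, as is the handling of two empty recipes.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two collections of ingredients."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty recipes count as having no overlap
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - jaccard_similarity(a, b)

recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}

# One shared ingredient (Ingr4) out of six distinct ingredients in total
similarity = jaccard_similarity(recipe_1, recipe_2)  # 1/6
distance = jaccard_distance(recipe_1, recipe_2)      # 5/6
print(similarity, distance)
```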
8. Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
9. Challenges of big data
• 40k baking recipes, 4k ingredients
[Figure: histogram of # recipes (y-axis, 0 to 4000) by # ingredients in recipe (x-axis, 0 to 40)]
10. Challenges of big data
[Figure: histogram of # ingredients (y-axis, 0 to 900) by # recipes containing ingredient (x-axis, ticks from 1 to 10000)]
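The two distributions plotted on these slides (recipe size, and how many recipes contain each ingredient) can be tallied with `collections.Counter`. This is a small sketch on made-up data; the `recipes` dictionary and its contents are hypothetical and stand in for the real 40k-recipe dataset.

```python
from collections import Counter

# Hypothetical toy data: recipe id -> set of ingredients
recipes = {
    "r1": {"flour", "sugar", "butter"},
    "r2": {"flour", "sugar", "egg", "cocoa"},
    "r3": {"flour", "banana"},
}

# Recipe size -> number of recipes with that many ingredients
size_hist = Counter(len(ingredients) for ingredients in recipes.values())

# Ingredient -> number of recipes that contain it
ingredient_freq = Counter(
    ingredient for ingredients in recipes.values() for ingredient in ingredients
)

# Recipe count -> number of ingredients that appear in exactly that many recipes
freq_hist = Counter(ingredient_freq.values())

print(size_hist, freq_hist)
```

In the toy data most ingredients appear in a single recipe while one ("flour") appears in all of them.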
11. Challenges of big data
• Pre-calculate Jaccard distances between every pair of recipes (40k times 40k = 1.6 billion pairs!)
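The pair count is easy to check. The 1.6 billion figure counts ordered pairs, including each recipe paired with itself. Because Jaccard distance is symmetric, the number of distinct unordered pairs is about half of that. The sketch below shows the arithmetic and a brute-force all-pairs computation on hypothetical toy data. It illustrates the quadratic cost and is not the deck's actual pipeline.

```python
from itertools import combinations
from math import comb

n_recipes = 40_000
ordered_pairs = n_recipes * n_recipes  # 1,600,000,000, the figure on the slide
unordered_pairs = comb(n_recipes, 2)   # 799,980,000 distinct pairs

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are assumed to be at distance 1."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0

# Hypothetical toy data standing in for the real recipes
recipes = {
    "r1": {"flour", "sugar", "butter"},
    "r2": {"flour", "sugar", "egg"},
    "r3": {"flour", "banana"},
}

# Brute force: one distance per unordered pair, O(n^2) in the number of recipes
distances = {
    (a, b): jaccard_distance(recipes[a], recipes[b])
    for a, b in combinations(sorted(recipes), 2)
}
print(ordered_pairs, unordered_pairs, distances)
```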
12. Challenges of big data
• MapReduce on Amazon EMR
• Preload into NetworkX graph
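The deck does not show how the MapReduce job is structured, so the following is only one plausible shape, simulated in plain Python with no Hadoop or EMR involved. It assumes an inverted-index approach: the map step emits (ingredient, recipe) pairs, and the reduce step emits one count for every pair of recipes sharing that ingredient. The counts are then summed into intersection sizes and converted to Jaccard distances. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def map_recipe(recipe_id, ingredients):
    """Map step (assumed): emit (ingredient, recipe_id) for each ingredient."""
    for ingredient in ingredients:
        yield ingredient, recipe_id

def reduce_ingredient(recipe_ids):
    """Reduce step (assumed): emit ((a, b), 1) for each pair sharing the ingredient."""
    for a, b in combinations(sorted(recipe_ids), 2):
        yield (a, b), 1

def jaccard_edges(recipes):
    """Simulate the job locally; recipes maps recipe id -> set of ingredients."""
    # Shuffle: group recipe ids by ingredient
    by_ingredient = defaultdict(list)
    for recipe_id, ingredients in recipes.items():
        for ingredient, rid in map_recipe(recipe_id, ingredients):
            by_ingredient[ingredient].append(rid)
    # Sum the 1s per recipe pair to get the intersection size
    shared = defaultdict(int)
    for recipe_ids in by_ingredient.values():
        for pair, one in reduce_ingredient(recipe_ids):
            shared[pair] += one
    # |A ∪ B| = |A| + |B| - |A ∩ B| (valid because ingredients are sets)
    edges = []
    for (a, b), inter in sorted(shared.items()):
        union = len(recipes[a]) + len(recipes[b]) - inter
        edges.append((a, b, 1.0 - inter / union))
    return edges

# Hypothetical toy data
recipes = {
    "r1": {"flour", "sugar", "butter"},
    "r2": {"flour", "sugar", "egg"},
    "r3": {"banana", "oats"},
}
edges = jaccard_edges(recipes)
print(edges)
```

Pairs with no shared ingredient never appear in the output; their distance is 1 by definition. The resulting `(u, v, weight)` tuples are in the form that NetworkX's `Graph.add_weighted_edges_from` accepts, which is one way the "preload into NetworkX graph" step could be done. Note that an ingredient found in thousands of recipes makes its reduce step emit millions of pairs, so a production job would likely need extra handling for very common ingredients.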