5. Disambiguating searches
• Classic Chocolate chip cookies
• Patty’s best chocolate cookies
• Peanut butter cookies
• Sugar cookies with frosting
• Gooey butter cookies
• Banana pumpkin cookies
• Black and white cookies
• Halloween cookies
Bigrams + Trigrams → Candidate labels
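The step from recipe titles to candidate labels can be sketched roughly as follows. This is only an illustration, assuming simple lowercase whitespace tokenization and counting how many titles contain each bigram or trigram; the helper names `ngrams` and `candidate_labels` are hypothetical and not taken from the slides.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word phrases from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_labels(titles, sizes=(2, 3)):
    """Count, for each bigram/trigram, how many titles contain it."""
    counts = Counter()
    for title in titles:
        # Assumption: lowercase + whitespace split is enough for a sketch.
        tokens = title.lower().split()
        phrases = set()
        for n in sizes:
            phrases.update(ngrams(tokens, n))
        counts.update(phrases)  # each title votes once per phrase
    return counts


titles = [
    "Classic Chocolate chip cookies",
    "Patty's best chocolate cookies",
    "Peanut butter cookies",
    "Sugar cookies with frosting",
    "Gooey butter cookies",
    "Banana pumpkin cookies",
    "Black and white cookies",
    "Halloween cookies",
]

labels = candidate_labels(titles)
for phrase, n_titles in labels.most_common(3):
    print(phrase, n_titles)
```

Phrases shared by several titles, such as "butter cookies", rise to the top of the count, which is one plausible way to rank candidate labels.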
6. Defining distance measure
Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4
Recipe 2: Ingr4, Ingr9, Ingr12
Jaccard = (Ingredients in both recipes) / (Ingredients in either recipe)
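As a worked example of the ratio above, here is a minimal sketch using Python sets and the two example recipes from the slide. The ratio itself is a similarity; turning it into a distance as 1 minus the ratio is the usual convention and is an assumption here, since the slide does not spell it out. The function names are hypothetical.

```python
def jaccard_similarity(a, b):
    """Ingredients in both recipes divided by ingredients in either recipe."""
    a, b = set(a), set(b)
    either = a | b
    if not either:
        # Assumption: two empty recipes are treated as having no overlap.
        return 0.0
    return len(a & b) / len(either)


def jaccard_distance(a, b):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - jaccard_similarity(a, b)


recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}

# One shared ingredient (Ingr4) out of six distinct ingredients overall.
similarity = jaccard_similarity(recipe_1, recipe_2)
distance = jaccard_distance(recipe_1, recipe_2)
print(similarity, distance)
```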
9. Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
10. Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
• Pre-calculate Jaccard distances between every pair of recipes (1.6 billion pairs!)
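To see where a figure like 1.6 billion comes from: n recipes give n(n-1)/2 unordered pairs. The slides do not state the recipe count, so the value below is back-calculated from the pair count and is purely an assumption for illustration.

```python
import math


def pair_count(n_recipes):
    """Number of unordered recipe pairs: n * (n - 1) / 2."""
    return math.comb(n_recipes, 2)


# Hypothetical recipe count, chosen so the result lands near 1.6 billion.
n_assumed = 56_569
pairs = pair_count(n_assumed)
print(f"{n_assumed:,} recipes -> {pairs:,} pairs")
```

The quadratic growth is the point: doubling the number of recipes roughly quadruples the number of distances to pre-calculate.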
11. Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
• Pre-calculate Jaccard distances between every pair of recipes (1.6 billion pairs!)
[Figure: histogram of # Recipes (y-axis, 0 to 4000) by # Ingredients in recipe (x-axis, 0 to 40)]
12. Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
• Pre-calculate Jaccard distances between every pair of recipes (1.6 billion pairs!)
[Figure: histogram of # Recipes (y-axis, 0 to 4000) by # Ingredients in recipe (x-axis, 0 to 40)]
[Figure: histogram of # ingredients (y-axis, 0 to 900) by # recipes containing ingredient (x-axis, 1 to 10000, log scale)]
13. Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
• Pre-calculate Jaccard distances between every pair of recipes (1.6 billion pairs!)
• MapReduce on Amazon EMR
• Preload into networkx graph
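The bullets above can be illustrated with a small in-memory sketch of a MapReduce-style computation. This is not the authors' EMR job: it assumes one common formulation (map each recipe to its ingredients, group by ingredient, then count shared ingredients per recipe pair), and it stores the result in a plain dict of dicts as a stand-in for the networkx graph. Pairs that share no ingredient are never emitted and implicitly have distance 1. All function names and the sample recipes are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(recipes):
    """Map: emit (ingredient, recipe_id) for every ingredient of every recipe."""
    for recipe_id, ingredients in recipes.items():
        for ingredient in set(ingredients):
            yield ingredient, recipe_id


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: count how many ingredients each pair of recipes shares."""
    shared = defaultdict(int)
    for recipe_ids in grouped.values():
        for a, b in combinations(sorted(recipe_ids), 2):
            shared[(a, b)] += 1
    return shared


def jaccard_graph(recipes):
    """Build an adjacency dict: graph[a][b] = Jaccard distance (assumed 1 - ratio)."""
    shared = reduce_phase(shuffle(map_phase(recipes)))
    sizes = {rid: len(set(ings)) for rid, ings in recipes.items()}
    graph = defaultdict(dict)
    for (a, b), both in shared.items():
        either = sizes[a] + sizes[b] - both  # size of the union
        distance = 1.0 - both / either
        graph[a][b] = distance
        graph[b][a] = distance
    return dict(graph)


# Hypothetical sample data, for illustration only.
recipes = {
    "r1": ["flour", "sugar", "butter", "egg"],
    "r2": ["egg", "banana", "pumpkin"],
    "r3": ["salt"],
}
graph = jaccard_graph(recipes)
print(graph["r1"]["r2"])
```

With networkx installed, the same weighted edges could be loaded into a graph with `Graph.add_weighted_edges_from`, so that lookups at query time avoid recomputing distances.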