The document discusses BakeSearch, a recipe search tool that clusters recipes based on their ingredients using natural language processing and machine learning techniques to analyze large datasets of recipes. It outlines challenges in clustering over 40,000 recipes with 4,000 unique ingredients and describes using MapReduce on Amazon EMR to pre-calculate distances between all recipe pairs to cluster them into a graph. The tool is meant to help users find recipes based on enriched or depleted ingredients.
5. Disambiguating searches
• Classic Chocolate chip cookies
• Patty’s best chocolate cookies
• Peanut butter cookies
• Sugar cookies with frosting
• Gooey butter cookies
• Banana pumpkin cookies
• Black and white cookies
• Halloween cookies
Bigrams + Trigrams → Candidate labels
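As a rough illustration of how bigrams and trigrams from recipe titles could yield candidate labels, here is a minimal sketch. The function names `ngrams` and `candidate_labels`, the lowercasing and whitespace tokenization, and the idea of counting how often each n-gram occurs are all assumptions for illustration; the slide does not say how BakeSearch tokenizes or ranks labels.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_labels(titles, sizes=(2, 3)):
    """Count bigrams and trigrams across all titles (hypothetical helper)."""
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()  # assumption: simple whitespace tokens
        for n in sizes:
            counts.update(ngrams(tokens, n))
    return counts


titles = [
    "Classic Chocolate chip cookies",
    "Patty’s best chocolate cookies",
    "Peanut butter cookies",
    "Sugar cookies with frosting",
    "Gooey butter cookies",
    "Banana pumpkin cookies",
    "Black and white cookies",
    "Halloween cookies",
]
labels = candidate_labels(titles)

# Show the most frequent candidates first.
for label, count in labels.most_common(5):
    print(count, label)
```

With these eight titles, "butter cookies" is the only n-gram that appears in more than one title, so a frequency ranking would surface it as a shared label.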
7. Defining distance measure
Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4
Recipe 2: Ingr4, Ingr9, Ingr12
Jaccard = (ingredients in both recipes) / (ingredients in either recipe)
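The ratio on this slide can be worked through in a few lines. This sketch assumes one reading of the slide's two columns (Recipe 1 holds Ingr1 to Ingr4, Recipe 2 holds Ingr4, Ingr9 and Ingr12) and a hypothetical helper named `jaccard`. Note that the ratio as written is the Jaccard similarity; a distance is commonly taken as one minus that value, and the slides do not say which form BakeSearch stores.

```python
def jaccard(a, b):
    """Jaccard similarity: |a AND b| / |a OR b| for two ingredient collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty recipes are treated as dissimilar
    return len(a & b) / len(union)


recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}

similarity = jaccard(recipe_1, recipe_2)  # 1 shared ingredient out of 6 total
distance = 1 - similarity

print(f"similarity={similarity:.3f} distance={distance:.3f}")
```

Here only Ingr4 is shared, so the similarity is 1/6 and the corresponding distance is 5/6.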
8. Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
9. Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
• 40k baking recipes, 4k ingredients
[Figure: # recipes (y-axis, 0 to 4000) vs. # ingredients in recipe (x-axis, 0 to 40)]
10. Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
• 40k baking recipes, 4k ingredients
[Figure: # recipes (y-axis, 0 to 4000) vs. # ingredients in recipe (x-axis, 0 to 40)]
[Figure: # ingredients (y-axis, 0 to 900) vs. # recipes containing ingredient (x-axis, 1 to 10000)]
11. Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
• 40k baking recipes, 4k ingredients
• Pre-calculate Jaccard distances between every pair of recipes (40k times 40k = 1.6 billion pairs!)
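The pair count can be checked with quick arithmetic. The 1.6 billion figure counts every ordered pair, including each recipe paired with itself. Because the Jaccard measure is symmetric, only the distinct unordered pairs strictly need computing, which is roughly half as many. Whether the original job exploited this symmetry is not stated in the slides, so treat the second number as an observation rather than a description of the pipeline.

```python
n_recipes = 40_000

# Every (a, b) combination, including (a, a) and both (a, b) and (b, a).
ordered_pairs = n_recipes * n_recipes

# Distinct unordered pairs with a != b: n * (n - 1) / 2.
unordered_pairs = n_recipes * (n_recipes - 1) // 2

print(f"{ordered_pairs:,} ordered pairs")
print(f"{unordered_pairs:,} distinct unordered pairs")
```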
12. Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
• 40k baking recipes, 4k ingredients
• Pre-calculate Jaccard distances between every pair of recipes (40k times 40k = 1.6 billion pairs!)
• MapReduce on Amazon EMR
• Preload into networkx graph
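To make the pre-calculation step concrete, here is a small in-memory imitation of a map/shuffle/reduce job that produces pairwise Jaccard distances. It is a sketch under stated assumptions, not the author's EMR job: the inverted-index approach (group recipes by ingredient, then count shared ingredients per recipe pair), the function names, and the toy recipes are all hypothetical, and nothing here uses Hadoop or EMR APIs.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(recipes):
    """Map: emit (ingredient, recipe_id) for each ingredient of each recipe."""
    for recipe_id, ingredients in recipes.items():
        for ingredient in set(ingredients):
            yield ingredient, recipe_id


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a MapReduce framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: count shared ingredients for every recipe pair that shares any."""
    shared = defaultdict(int)
    for recipe_ids in grouped.values():
        for a, b in combinations(sorted(recipe_ids), 2):
            shared[(a, b)] += 1
    return shared


def jaccard_edges(recipes):
    """Return (recipe_a, recipe_b, distance) for pairs with a shared ingredient."""
    sizes = {rid: len(set(ings)) for rid, ings in recipes.items()}
    shared = reduce_phase(shuffle(map_phase(recipes)))
    edges = []
    for (a, b), both in shared.items():
        either = sizes[a] + sizes[b] - both  # size of the union
        edges.append((a, b, 1 - both / either))
    return edges


# Toy data, invented for illustration only.
recipes = {
    "r1": ["flour", "sugar", "butter", "egg"],
    "r2": ["flour", "sugar", "cocoa"],
    "r3": ["pumpkin", "banana"],
}
edges = jaccard_edges(recipes)
print(edges)
```

Pairs with no ingredient in common (distance 1) never appear in the output, which keeps the result sparse. The `(a, b, weight)` triples are in the shape that networkx's `Graph.add_weighted_edges_from` accepts, which is one plausible way to do the preload step named on the slide.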