Diane Wu insight demo

This document describes BakeSearch, a recipe search tool that clusters recipes by ingredient using natural language processing and machine learning techniques. It discusses the challenges of clustering and analyzing a dataset of 40,000 baking recipes and 4,000 ingredients, and how tools such as MapReduce, Amazon EMR, NumPy, SciPy, NLTK, and NetworkX are used to address them at scale.

BakeSearch
Make sense of recipes and bake like a pro

Disambiguating searches
•  Classic Chocolate chip cookies
•  Patty’s best chocolate cookies
•  Peanut butter cookies
•  Sugar cookies with frosting
•  Gooey butter cookies
•  Banana pumpkin cookies
•  Black and white cookies
•  Halloween cookies

Bigrams + Trigrams → Candidate labels
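
As a rough illustration of how bigrams and trigrams from recipe titles could produce candidate labels, the sketch below counts the n-grams shared by at least two titles. It uses plain Python rather than nltk to stay self-contained. The names `ngrams` and `candidate_labels` and the `min_count` threshold are hypothetical choices, not taken from the slides.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_labels(titles, sizes=(2, 3), min_count=2):
    """Count bigrams and trigrams across titles (once per title) and keep
    those seen in at least min_count titles (an assumed threshold)."""
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()
        for n in sizes:
            counts.update(set(ngrams(tokens, n)))
    return [(gram, c) for gram, c in counts.most_common() if c >= min_count]


titles = [
    "Classic Chocolate chip cookies",
    "Patty's best chocolate cookies",
    "Peanut butter cookies",
    "Sugar cookies with frosting",
    "Gooey butter cookies",
    "Banana pumpkin cookies",
    "Black and white cookies",
    "Halloween cookies",
]
labels = candidate_labels(titles)
print(labels)
```

With only eight titles very few n-grams repeat, so a real run over thousands of titles would need a higher threshold and some filtering of uninformative n-grams.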

Domain-specific data munging
•  Ingredients: nltk dictionary
•  Domain knowledge
•  Unit parsing
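
The slides do not say how unit parsing was done. One minimal way to split an ingredient line into quantity, unit, and name is sketched below, assuming quantities are written as integers, decimals, or simple fractions. The `UNITS` table, the `QUANTITY` pattern, and `parse_ingredient` are all hypothetical.

```python
import re
from fractions import Fraction

# Hypothetical unit table; a real one would be much longer.
UNITS = {
    "cup": "cup", "cups": "cup",
    "tablespoon": "tbsp", "tablespoons": "tbsp", "tbsp": "tbsp",
    "teaspoon": "tsp", "teaspoons": "tsp", "tsp": "tsp",
    "ounce": "oz", "ounces": "oz", "oz": "oz",
}

# Leading quantity: mixed number ("1 1/2"), fraction ("1/2"), or decimal ("2.5").
QUANTITY = re.compile(r"\s*(\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s*")


def parse_ingredient(line):
    """Split an ingredient line into (quantity, unit, name).

    quantity is a Fraction or None; unit is a normalized string or None.
    """
    quantity = None
    match = QUANTITY.match(line)
    if match:
        quantity = sum(Fraction(part) for part in match.group(1).split())
        line = line[match.end():]
    words = line.strip().split(None, 1)
    unit = None
    if words and words[0].lower().rstrip(".") in UNITS:
        unit = UNITS[words[0].lower().rstrip(".")]
        name = words[1] if len(words) > 1 else ""
    else:
        name = line.strip()
    return quantity, unit, name


print(parse_ingredient("1 1/2 cups flour"))
print(parse_ingredient("2 eggs"))
```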

Defining distance measure
Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4
Recipe 2: Ingr4, Ingr9, Ingr12
Jaccard = (Ingredients in both recipes) / (Ingredients in either recipe)
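
The ratio on this slide is the Jaccard index, a similarity. A distance is commonly taken as one minus that value. The sketch below computes both with Python sets, assuming Recipe 1 holds Ingr1 to Ingr4 and Recipe 2 holds Ingr4, Ingr9, and Ingr12. The function names are illustrative.

```python
def jaccard_similarity(a, b):
    """Ingredients in both recipes divided by ingredients in either recipe."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty recipes
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """One minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}

# One shared ingredient (Ingr4) out of six distinct ingredients.
print(jaccard_similarity(recipe_1, recipe_2))
print(jaccard_distance(recipe_1, recipe_2))
```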

Challenges of big data
•  Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
•  40k baking recipes, 4k ingredients
•  Pre-calculate Jaccard distances between every pair of recipes (40k times 40k = 1.6 billion pairs!)

[Figure: histogram of # recipes (0 to 4000) against # ingredients in recipe (0 to 40)]
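
The pair count can be checked with a few lines of arithmetic. The sketch below also estimates storage, assuming one 4-byte float per distance (the slides do not state a storage format). Because Jaccard distance is symmetric, only about half of the 1.6 billion ordered pairs are distinct.

```python
n_recipes = 40_000

# Every (row, column) cell of a full 40k x 40k distance matrix.
ordered_pairs = n_recipes * n_recipes

# Distinct unordered pairs, excluding a recipe paired with itself.
unordered_pairs = n_recipes * (n_recipes - 1) // 2

bytes_per_distance = 4  # assumption: float32 storage
full_matrix_gb = ordered_pairs * bytes_per_distance / 1e9
condensed_gb = unordered_pairs * bytes_per_distance / 1e9

print(f"ordered pairs:   {ordered_pairs:,}")
print(f"unordered pairs: {unordered_pairs:,}")
print(f"full matrix:     {full_matrix_gb:.1f} GB")
print(f"condensed form:  {condensed_gb:.1f} GB")
```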

Find enriched/depleted ingredients
abs(log2 ratio) > 2
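
The slide gives only the threshold, not the definition of the ratio. One plausible reading, sketched below, compares the fraction of recipes in a cluster that use an ingredient with the fraction across all recipes, and flags ingredients whose log2 ratio exceeds 2 in absolute value. The pseudocount that avoids log(0) and the function name are assumptions.

```python
import math
from collections import Counter


def ingredient_log2_ratios(cluster, background, pseudocount=0.5):
    """log2 of (cluster frequency / background frequency) per ingredient.

    cluster and background are lists of ingredient sets. The pseudocount
    is an assumed smoothing term so absent ingredients do not give log(0).
    """
    cluster_counts = Counter(i for recipe in cluster for i in set(recipe))
    background_counts = Counter(i for recipe in background for i in set(recipe))
    ratios = {}
    for ingredient in background_counts.keys() | cluster_counts.keys():
        f_cluster = (cluster_counts[ingredient] + pseudocount) / (len(cluster) + 2 * pseudocount)
        f_background = (background_counts[ingredient] + pseudocount) / (len(background) + 2 * pseudocount)
        ratios[ingredient] = math.log2(f_cluster / f_background)
    return ratios


# Toy data: 4 pumpkin recipes form the cluster; 100 recipes in total.
cluster = [{"flour", "pumpkin"}] * 4
others = [{"flour", "chocolate"}] * 60 + [{"flour"}] * 36
background = cluster + others

ratios = ingredient_log2_ratios(cluster, background)
flagged = {i: r for i, r in ratios.items() if abs(r) > 2}
for ingredient, ratio in sorted(flagged.items()):
    print(ingredient, "enriched" if ratio > 0 else "depleted", round(ratio, 2))
```

In this toy data pumpkin comes out enriched and chocolate depleted, while flour, which appears everywhere, stays near zero and is not flagged.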

Tools

Back end
•  Yummly API
•  Python
   –  Pycurl
   –  Nltk wordnet
•  MySQL
•  Amazon AWS

Analysis
•  Numpy, Scipy
•  Nltk, networkx
•  Python, R
•  Amazon EMR

Front end
•  HTML/CSS/JavaScript
•  Twitter Bootstrap
•  Flask

Diane Wu
•  PhD Genetics, Stanford University, CA
•  BSc Computing Science, Simon Fraser, Canada

2. Recipe search

4. BakeSearch Make sense of recipes and bake like a pro

5. Disambiguating searches •  Classic Chocolate chip cookies •  Patty’s best chocolate cookies •  Peanut butter cookies •  Sugar cookies with frosting •  Gooey butter cookies •  Banana pumpkin cookies •  Black and white cookies •  Halloween cookies Bigrams + Trigrams → Candidate labels

6. Domain-specific data munging •  Ingredients: nltk dictionary •  Domain knowledge •  Unit parsing

7. Defining distance measure Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4 Recipe 2: Ingr4, Ingr9, Ingr12 Jaccard = (Ingredients in both recipes) / (Ingredients in either recipe)

8. Challenges of big data •  Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds

9. Challenges of big data •  Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds •  40k baking recipes, 4k ingredients [Figure: histogram of # recipes (0 to 4000) against # ingredients in recipe (0 to 40)]

10. Challenges of big data •  Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds •  40k baking recipes, 4k ingredients [Figure: histogram of # recipes (0 to 4000) against # ingredients in recipe (0 to 40)] [Figure: histogram of # ingredients (0 to 900) against # recipes containing ingredient (1 to 10000)]

11. Challenges of big data •  Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds •  40k baking recipes, 4k ingredients •  Pre-calculate Jaccard distances between every pair of recipes (40k times 40k = 1.6 billion pairs!)

12. Challenges of big data •  Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds •  40k baking recipes, 4k ingredients •  Pre-calculate Jaccard distances between every pair of recipes (40k times 40k = 1.6 billion pairs!) •  MapReduce on Amazon EMR •  Preload into networkx graph
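
The slides do not show the MapReduce job itself. The sketch below is one common way to phrase pairwise Jaccard distance as map, shuffle, and reduce steps, run here in a single Python process rather than on Amazon EMR. All function names are hypothetical, and this is not presented as the author's implementation. Each recipe is assumed to be a set of ingredients.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(recipes):
    """Map: emit one (ingredient, recipe_id) pair per ingredient use."""
    for recipe_id, ingredients in recipes.items():
        for ingredient in ingredients:
            yield ingredient, recipe_id


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: count shared ingredients for each pair of recipes that co-occur."""
    shared = defaultdict(int)
    for recipe_ids in groups.values():
        for a, b in combinations(sorted(recipe_ids), 2):
            shared[(a, b)] += 1
    return shared


def jaccard_edges(recipes):
    """Return (recipe_a, recipe_b, jaccard_distance) for pairs sharing an ingredient."""
    shared = reduce_phase(shuffle(map_phase(recipes)))
    edges = []
    for (a, b), n_both in sorted(shared.items()):
        n_either = len(recipes[a]) + len(recipes[b]) - n_both
        edges.append((a, b, 1 - n_both / n_either))
    return edges


recipes = {
    "r1": {"flour", "sugar", "butter", "egg"},
    "r2": {"egg", "pumpkin", "banana"},
    "r3": {"cocoa"},
}
edges = jaccard_edges(recipes)
print(edges)
```

Pairs with no shared ingredient are never emitted and implicitly have distance 1. Very common ingredients produce large groups in the reduce step, so a real job would likely need to cap or special-case them. The resulting triples are in the shape accepted by networkx's `Graph.add_weighted_edges_from`, which is one way the graph could be preloaded.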

13. Cluster recipes based on ingredient

15. Find enriched/depleted ingredients abs(log2 ratio) > 2

16. Tools Back end: •  Yummly API •  Python –  Pycurl –  Nltk wordnet •  MySQL •  Amazon AWS Analysis: •  Numpy, Scipy •  Nltk, networkx •  Python, R •  Amazon EMR Front end: •  HTML/CSS/JavaScript •  Twitter Bootstrap •  Flask

17. Diane Wu •  PhD Genetics, Stanford University, CA •  BSc Computing Science, Simon Fraser, Canada
