Recipe search
BakeSearch
Make sense of recipes and bake like a pro
Disambiguating searches

•  Classic chocolate chip cookies
•  Patty’s best chocolate cookies
•  Peanut butter cookies
•  Sugar cookies with frosting
•  Gooey butter cookies
•  Banana pumpkin cookies
•  Black and white cookies
•  Halloween cookies

Bigrams + trigrams give candidate labels
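The slide does not say how the bigrams and trigrams are turned into labels. A minimal sketch, assuming labels are simply the n-grams that recur across recipe titles (the function names and the `min_count` threshold are hypothetical, not from the deck):

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_labels(titles, min_count=2):
    """Count bigrams and trigrams across titles; keep the recurring ones.

    min_count is an assumed cut-off, not a value from the slides.
    """
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()
        # Count each n-gram once per title so one title cannot dominate.
        counts.update(set(ngrams(tokens, 2)) | set(ngrams(tokens, 3)))
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_count}


titles = [
    "Classic chocolate chip cookies",
    "Patty's best chocolate cookies",
    "Peanut butter cookies",
    "Sugar cookies with frosting",
    "Gooey butter cookies",
    "Banana pumpkin cookies",
    "Black and white cookies",
    "Halloween cookies",
]
labels = candidate_labels(titles)
```

With only these eight titles, "butter cookies" is the one n-gram that appears in more than one title; a larger set of search results would surface more candidates.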
Domain-specific data munging
•  Ingredients: nltk dictionary
•  Domain knowledge
•  Unit parsing
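Unit parsing is only named on the slide, not described. As an illustration, one way to split an ingredient line into quantity, unit, and name might look like the sketch below; the `UNITS` table, the regular expression, and the function names are all assumptions made for this example.

```python
import re
from fractions import Fraction

# Hypothetical unit table; a real one would be far longer.
UNITS = {
    "cup": "cup", "cups": "cup",
    "tsp": "tsp", "teaspoon": "tsp", "teaspoons": "tsp",
    "tbsp": "tbsp", "tablespoon": "tbsp", "tablespoons": "tbsp",
    "g": "g", "oz": "oz",
}

# Leading quantity: "2", "1/2", "1 1/2" or "0.25", followed by the rest.
LINE = re.compile(r"^\s*(?P<qty>\d+(?:\s+\d+/\d+|/\d+|\.\d+)?)\s*(?P<rest>.*)$")


def parse_quantity(text):
    """Turn '2', '1/2', '1 1/2' or '0.25' into an exact Fraction."""
    return sum(Fraction(part) for part in text.split())


def parse_ingredient(line):
    """Return (quantity, unit, name); quantity and unit may be None."""
    match = LINE.match(line)
    if not match:
        return None, None, line.strip().lower()
    quantity = parse_quantity(match.group("qty"))
    words = match.group("rest").lower().split()
    unit = None
    if words and words[0].rstrip(".") in UNITS:
        unit = UNITS[words[0].rstrip(".")]
        words = words[1:]
    return quantity, unit, " ".join(words)
```

Normalising units to one spelling ("cups" and "cup" both become "cup") keeps the ingredient name clean, which matters when ingredient sets are later compared across recipes.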
  
Defining distance measure

Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4
Recipe 2: Ingr4, Ingr9, Ingr12

Jaccard = (ingredients in both recipes) / (ingredients in either recipe)
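The ratio on the slide can be worked through with the two example recipes. The sketch below treats each recipe as a set of ingredient names; the function names are illustrative, and taking the distance as one minus the ratio is the usual convention rather than something the slide states.

```python
def jaccard_similarity(a, b):
    """Shared ingredients divided by ingredients in either recipe."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty recipes are identical
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """One minus the similarity, so identical recipes are at distance 0."""
    return 1.0 - jaccard_similarity(a, b)


recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}

# One shared ingredient (Ingr4) out of six distinct ingredients.
similarity = jaccard_similarity(recipe_1, recipe_2)
distance = jaccard_distance(recipe_1, recipe_2)
```

For the slide's example the ratio is 1/6, giving a distance of 5/6.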
Challenges of big data
•  Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
•  40k baking recipes, 4k ingredients

[Figure: histogram of # recipes (0 to 4000) against # ingredients in recipe (0 to 40)]
[Figure: histogram of # ingredients (0 to 900) against # recipes containing ingredient (axis ticks from 1 to 10000)]
•  Pre-calculate Jaccard distances between every pair of recipes (40k times 40k = 1.6 billion pairs!)
•  MapReduce on Amazon EMR
•  Preload into networkx graph
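The pair count and the map/reduce idea can be sketched in plain Python. This is not the EMR job from the deck, whose details are not given; it is a single-machine illustration, assuming an inverted index from ingredient to recipes so that only pairs sharing at least one ingredient are ever touched. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

n = 40_000
ordered_pairs = n * n               # 1,600,000,000, the figure on the slide
unordered_pairs = n * (n - 1) // 2  # 799,980,000 distinct recipe pairs


def shared_counts(recipes):
    """Map: group recipe ids by ingredient. Reduce: count shared ingredients per pair."""
    by_ingredient = defaultdict(list)
    for recipe_id, ingredients in recipes.items():
        for ingredient in set(ingredients):
            by_ingredient[ingredient].append(recipe_id)
    shared = defaultdict(int)
    for ids in by_ingredient.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return shared


def jaccard_edges(recipes):
    """Return {(a, b): Jaccard distance} for pairs with a shared ingredient."""
    sizes = {rid: len(set(ings)) for rid, ings in recipes.items()}
    edges = {}
    for (a, b), both in shared_counts(recipes).items():
        either = sizes[a] + sizes[b] - both  # size of the union
        edges[(a, b)] = 1.0 - both / either
    return edges


recipes = {
    "r1": ["flour", "sugar", "butter", "egg"],
    "r2": ["egg", "banana", "pumpkin"],
    "r3": ["cocoa"],
}
edges = jaccard_edges(recipes)
```

Pairs with no ingredient in common never appear in `edges`; their distance is 1 by definition. A dictionary like this could then be loaded into a networkx graph, for example as `(a, b, distance)` triples passed to `Graph.add_weighted_edges_from`. Very common ingredients still generate many pairs under this scheme, so it reduces the work rather than removing it.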
Cluster recipes based on ingredients
Find enriched/depleted ingredients

abs(log2 ratio) > 2
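The slide gives the threshold but not what the ratio compares. A minimal sketch, assuming the ratio is an ingredient's frequency within one cluster divided by its frequency across all recipes, with a pseudocount (also an assumption) so that ingredients absent from the cluster do not produce log(0):

```python
import math
from collections import Counter


def log2_enrichment(cluster, background, pseudocount=1.0):
    """log2 of (frequency in cluster) / (frequency in background) per ingredient.

    cluster and background are lists of recipes, each a list of ingredients.
    The pseudocount is an assumed smoothing choice, not from the slides.
    """
    in_cluster = Counter(i for recipe in cluster for i in set(recipe))
    in_background = Counter(i for recipe in background for i in set(recipe))
    scores = {}
    for ingredient, count in in_background.items():
        f_cluster = (in_cluster[ingredient] + pseudocount) / (len(cluster) + pseudocount)
        f_background = (count + pseudocount) / (len(background) + pseudocount)
        scores[ingredient] = math.log2(f_cluster / f_background)
    return scores


# Toy data: 8 pumpkin recipes form the cluster, 56 cocoa recipes make up the rest.
cluster = [["flour", "pumpkin"]] * 8
background = cluster + [["flour", "cocoa"]] * 56

scores = log2_enrichment(cluster, background)
flagged = {i for i, s in scores.items() if abs(s) > 2}
```

In this toy data, pumpkin scores above +2 (enriched), cocoa scores below -2 (depleted), and flour, which appears everywhere, scores 0.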
Tools

Back end
•  Yummly API
•  Python
    –  Pycurl
    –  Nltk wordnet
•  MySQL

Analysis
•  Numpy, Scipy
•  Nltk, networkx
•  Python, R
•  Amazon EMR

Front end
•  HTML/CSS/JavaScript
•  Twitter Bootstrap
•  Flask
•  Amazon AWS
Diane Wu
•  PhD Genetics, Stanford University, CA
•  BSc Computing Science, Simon Fraser, Canada

Diane Wu insight demo
