Recipe search
BakeSearch
Make sense of recipes and bake like a pro
Disambiguating searches
Classic Chocolate chip cookies
Patty’s best chocolate cookies
Peanut butter cookies
Sugar cookies with frosting
Gooey butter cookies
Banana pumpkin cookies
Black and white cookies
Halloween cookies
Bigrams + Trigrams
Candidate labels
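One way to read this slide: count the two-word and three-word phrases that recur across recipe titles and treat the recurring ones as candidate labels. The sketch below uses only the standard library. The function names and the `min_count` cutoff are assumptions made for illustration, not part of the original deck.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-word tuples of a token list, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_labels(titles, min_count=2):
    """Count bigrams and trigrams across titles; keep those seen in at least min_count titles."""
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()
        # Count each n-gram once per title so a single title cannot dominate.
        grams = set(ngrams(tokens, 2)) | set(ngrams(tokens, 3))
        counts.update(grams)
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_count}


titles = [
    "Classic Chocolate chip cookies",
    "Patty's best chocolate cookies",
    "Peanut butter cookies",
    "Sugar cookies with frosting",
    "Gooey butter cookies",
    "Banana pumpkin cookies",
    "Black and white cookies",
    "Halloween cookies",
]
labels = candidate_labels(titles)
```

On these eight titles only "butter cookies" recurs, so a larger set of titles is needed before labels such as "chocolate chip" emerge.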
Defining distance measure
Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4
Recipe 2: Ingr4, Ingr9, Ingr12
Jaccard = (Ingredients in both recipes) / (Ingredients in either recipe)
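The ratio on this slide is the Jaccard similarity. The matching distance is usually taken as one minus that value. A minimal sketch using the two recipes from the slide follows; the function names and the handling of two empty sets are assumptions.

```python
def jaccard_similarity(a, b):
    """Shared items divided by all distinct items across both sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty ingredient lists count as identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Distance form: 0 for identical sets, 1 for sets with nothing in common."""
    return 1.0 - jaccard_similarity(a, b)


recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}

similarity = jaccard_similarity(recipe_1, recipe_2)  # 1 shared of 6 distinct
distance = jaccard_distance(recipe_1, recipe_2)
```

Here the recipes share only Ingr4 out of six distinct ingredients, so the similarity is 1/6 and the distance is 5/6.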
Cluster recipes based on ingredient
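The deck does not say which clustering algorithm was used in the end. As one possible illustration of a graph-based approach, the sketch below links any two recipes whose Jaccard distance falls under a threshold and reports the connected components as clusters. The threshold of 0.5, the function names, and the toy recipes are all invented for this example.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus shared items over all distinct items; empty sets count as identical."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 1.0)


def cluster_by_threshold(recipes, max_distance=0.5):
    """Connected components of the graph linking recipes within max_distance."""
    neighbours = {name: set() for name in recipes}
    for r1, r2 in combinations(recipes, 2):
        if jaccard_distance(recipes[r1], recipes[r2]) <= max_distance:
            neighbours[r1].add(r2)
            neighbours[r2].add(r1)

    clusters, seen = [], set()
    for start in recipes:
        if start in seen:
            continue
        # Depth-first walk collects every recipe reachable from `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Toy data, made up for illustration only.
recipes = {
    "chocolate chip": {"flour", "butter", "sugar", "egg", "chocolate chips"},
    "double chocolate": {"flour", "butter", "sugar", "egg", "chocolate chips", "cocoa"},
    "peanut butter": {"peanut butter", "sugar", "egg"},
    "flourless peanut": {"peanut butter", "sugar", "egg", "vanilla"},
}
clusters = cluster_by_threshold(recipes)
```

With this toy data the two chocolate recipes form one cluster and the two peanut butter recipes form another.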
Challenges of big data
•  Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
•  Pre-calculate Jaccard distances between every pair of recipes (1.6 billion pairs!)
•  MapReduce on Amazon EMR
•  Preload into networkx graph
[Figure: # recipes (y-axis, 0 to 4000) by # ingredients in recipe (x-axis, 0 to 40)]
[Figure: # ingredients (y-axis, 0 to 900) by # recipes containing ingredient (x-axis, 1 to 10000)]
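The pre-calculation step can be pictured as a map, shuffle, and reduce over an inverted index from ingredient to recipes, so that only recipe pairs sharing at least one ingredient are ever compared. The sketch below imitates those three phases in a single Python process. It is not the EMR job from the deck, and the function names and sample recipes are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(recipes):
    """Emit (ingredient, recipe_id) so recipes sharing an ingredient meet at one key."""
    for recipe_id, ingredients in recipes.items():
        for ingredient in ingredients:
            yield ingredient, recipe_id


def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Count shared ingredients for every recipe pair that shares at least one."""
    shared = defaultdict(int)
    for recipe_ids in grouped.values():
        for pair in combinations(sorted(recipe_ids), 2):
            shared[pair] += 1
    return shared


def jaccard_distances(recipes):
    """Jaccard distance for each pair with overlap; other pairs are implicitly 1.0."""
    shared = reduce_phase(shuffle(map_phase(recipes)))
    distances = {}
    for (r1, r2), both in shared.items():
        either = len(recipes[r1]) + len(recipes[r2]) - both
        distances[(r1, r2)] = 1.0 - both / either
    return distances


# Toy data, made up for illustration only; ingredients must be sets.
recipes = {
    "r1": {"flour", "sugar", "egg"},
    "r2": {"flour", "sugar", "butter"},
    "r3": {"banana", "pumpkin"},
}
distances = jaccard_distances(recipes)
```

Pairs with no ingredient in common never appear in the output. That keeps the stored result sparse, and the resulting dictionary could then be loaded as weighted edges of a graph.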
Find enriched/depleted ingredients
abs(log2 ratio) > 2
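The slide gives only the threshold, not what the ratio compares. A plausible reading is the share of recipes in a cluster that use an ingredient, divided by the share in a background set. The sketch below follows that reading. The pseudocount, the function names, and the sample data are assumptions added for illustration.

```python
import math
from collections import Counter


def log2_enrichment(cluster, background, pseudocount=1.0):
    """log2 of (ingredient frequency in cluster) / (frequency in background).

    The pseudocount keeps the ratio finite when an ingredient is missing
    from one of the two groups.
    """
    cluster_counts = Counter(i for recipe in cluster for i in set(recipe))
    background_counts = Counter(i for recipe in background for i in set(recipe))
    scores = {}
    for ingredient in set(cluster_counts) | set(background_counts):
        in_cluster = (cluster_counts[ingredient] + pseudocount) / (len(cluster) + pseudocount)
        in_background = (background_counts[ingredient] + pseudocount) / (len(background) + pseudocount)
        scores[ingredient] = math.log2(in_cluster / in_background)
    return scores


def enriched_or_depleted(scores, threshold=2.0):
    """Keep ingredients whose absolute log2 ratio exceeds the threshold."""
    return {i: s for i, s in scores.items() if abs(s) > threshold}


# Toy data, made up for illustration only.
cluster = [{"flour", "pumpkin"}] * 3
background = [{"flour", "pumpkin"}] + [{"flour"}] * 30

scores = log2_enrichment(cluster, background)
flagged = enriched_or_depleted(scores)
```

A log2 ratio above 2 means the ingredient is more than four times as frequent in the cluster as in the background; below -2 means it is less than a quarter as frequent.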
Domain-specific data munging
•  Ingredients: nltk dictionary
•  Domain knowledge
•  Unit parsing
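Unit parsing here presumably means splitting an ingredient line into a quantity, a unit, and the ingredient name. The deck does not show how this was done, so the sketch below is a hypothetical regex-based version. The `UNITS` table, the pattern, and the function names are all assumptions, and real recipe text needs many more cases.

```python
import re
from fractions import Fraction

# Hypothetical unit table mapping spellings to one canonical name.
UNITS = {
    "cup": "cup", "cups": "cup",
    "tbsp": "tablespoon", "tablespoon": "tablespoon", "tablespoons": "tablespoon",
    "tsp": "teaspoon", "teaspoon": "teaspoon", "teaspoons": "teaspoon",
    "oz": "ounce", "ounce": "ounce", "ounces": "ounce",
}

# Optional leading quantity: mixed number ("2 1/2"), fraction ("1/2"),
# or integer/decimal ("3", "1.5"), followed by the rest of the line.
LINE = re.compile(r"^\s*(?P<qty>\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)?\s*(?P<rest>.*)$")


def parse_quantity(text):
    """Turn '2 1/2', '1/2', '3' or '1.5' into an exact Fraction."""
    return sum(Fraction(part) for part in text.split())


def parse_ingredient_line(line):
    """Return (quantity or None, canonical unit or None, remaining ingredient text)."""
    match = LINE.match(line)
    qty_text = match.group("qty")
    words = match.group("rest").split()
    quantity = parse_quantity(qty_text) if qty_text else None
    unit = None
    if words:
        first = words[0].lower().rstrip(".")
        if first in UNITS:
            unit = UNITS[first]
            words = words[1:]
    return quantity, unit, " ".join(words)


parsed = parse_ingredient_line("2 1/2 cups flour")
```

Using `Fraction` keeps quantities such as 1/3 exact, which helps when amounts are later scaled or compared.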
  
Tools
Back end
•  Yummly API
•  Python
    –  Pycurl
    –  Nltk wordnet
•  MySQL
Analysis
•  Numpy, Scipy
•  Nltk, networkx
•  Python, R
•  Amazon EMR
Front end
•  HTML/CSS/JavaScript
•  Twitter Bootstrap
•  Flask
•  Amazon AWS
Diane Wu
•  PhD Genetics, Stanford University, CA
•  BSc Computing Science, Simon Fraser, Canada

Diane wu insight final demo
