Recipe search

BakeSearch
Make sense of recipes and bake like a pro

Disambiguating searches

Classic Chocolate chip cookies
Patty’s best chocolate cookies
Peanut butter cookies
Sugar cookies with frosting
Gooey butter cookies
Banana pumpkin cookies
Black and white cookies
Halloween cookies

Bigrams + Trigrams
Candidate labels

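
The slide suggests that candidate labels for a group of search results come from the bigrams and trigrams in recipe titles. A minimal sketch of that idea, assuming lowercase whitespace tokenization and a simple frequency cutoff; the names `ngrams` and `candidate_labels` and the `min_count` parameter are hypothetical, and the deck does not say how candidates were actually scored.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_labels(titles, min_count=2):
    """Count bigrams and trigrams across titles; keep those seen in at least min_count titles."""
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()
        for n in (2, 3):
            # set() so a phrase repeated inside one title is counted once
            counts.update(set(ngrams(tokens, n)))
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_count}

titles = [
    "Classic chocolate chip cookies",
    "Patty's best chocolate cookies",
    "Peanut butter cookies",
    "Sugar cookies with frosting",
    "Gooey butter cookies",
    "Banana pumpkin cookies",
    "Black and white cookies",
    "Halloween cookies",
]
labels = candidate_labels(titles)
```

On the eight titles from the slide, only "butter cookies" appears in more than one title, so it is the only candidate that survives the cutoff. A larger result set would surface more phrases.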
Domain-specific data munging
•  Ingredients: nltk dictionary
•  Domain knowledge
•  Unit parsing

Defining distance measure

Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4
Recipe 2: Ingr4, Ingr9, Ingr12

Jaccard = (Ingredients in both recipes) / (Ingredients in either recipe)

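
The ratio on the slide can be computed directly on ingredient sets. A minimal sketch using the slide's own example; the function names are hypothetical. The slide shows the similarity ratio under a "distance" title, and taking distance as one minus that ratio is the usual convention, assumed here.

```python
def jaccard_similarity(a, b):
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - jaccard_similarity(a, b)

recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}
similarity = jaccard_similarity(recipe_1, recipe_2)
```

The two recipes share one ingredient (Ingr4) out of six distinct ingredients, so the similarity is 1/6 and the distance is 5/6.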
Challenges of big data
•  Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
•  40k baking recipes, 4k ingredients
[Figure: histogram of # recipes by # ingredients in recipe]
[Figure: histogram of # ingredients by # recipes containing ingredient]
•  Pre-calculate Jaccard distances between every pair of recipes (40k times 40k = 1.6 billion pairs!)
•  MapReduce on Amazon EMR
•  Preload into networkx graph

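
The 1.6 billion figure counts ordered pairs, including each recipe paired with itself. The number of distinct unordered pairs is 40,000 × 39,999 / 2 = 799,980,000. The deck does not show the MapReduce job itself. One common formulation, sketched below in plain Python with the map and reduce phases simulated in memory, keys the map output by ingredient so that only recipe pairs sharing at least one ingredient are ever emitted. All function names are hypothetical, and this is an assumption about the approach rather than the author's job.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(recipes):
    """Emit (ingredient, recipe_id) for every ingredient of every recipe."""
    for recipe_id, ingredients in recipes.items():
        for ingredient in ingredients:
            yield ingredient, recipe_id

def reduce_phase(pairs):
    """Group recipe ids by ingredient, then count shared ingredients per recipe pair."""
    by_ingredient = defaultdict(set)
    for ingredient, recipe_id in pairs:
        by_ingredient[ingredient].add(recipe_id)
    shared = defaultdict(int)
    for recipe_ids in by_ingredient.values():
        for a, b in combinations(sorted(recipe_ids), 2):
            shared[(a, b)] += 1
    return shared

def pairwise_jaccard_distances(recipes):
    """Return {(id_a, id_b): distance} for pairs sharing at least one ingredient."""
    sizes = {rid: len(set(ings)) for rid, ings in recipes.items()}
    distances = {}
    for (a, b), n_both in reduce_phase(map_phase(recipes)).items():
        # |A or B| = |A| + |B| - |A and B|
        n_either = sizes[a] + sizes[b] - n_both
        distances[(a, b)] = 1.0 - n_both / n_either
    return distances

recipes = {
    "r1": {"flour", "sugar", "butter", "egg"},
    "r2": {"flour", "sugar", "cocoa"},
    "r3": {"banana", "pumpkin"},
}
distances = pairwise_jaccard_distances(recipes)
```

Pairs that never appear in the output share no ingredient and have distance 1 by definition. Each remaining entry maps naturally onto a weighted edge between two recipe nodes, which is one way the results could be preloaded into a graph.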
Cluster recipes based on ingredient

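
The deck does not say which clustering method was used. As an illustration of the graph-based family it mentions, the sketch below links two recipes when their Jaccard distance is at or below a cutoff and treats each connected component as a cluster. It uses a plain union-find structure to stay dependency-free; the function name and the `max_distance` value are hypothetical.

```python
def connected_clusters(recipe_ids, distances, max_distance=0.7):
    """Group recipes into connected components of the graph whose edges
    join pairs with distance <= max_distance."""
    parent = {rid: rid for rid in recipe_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d <= max_distance:
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for rid in recipe_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    return sorted(clusters.values(), key=lambda c: sorted(c))

distances = {("r1", "r2"): 0.6, ("r2", "r3"): 0.5, ("r3", "r4"): 0.95}
clusters = connected_clusters(["r1", "r2", "r3", "r4", "r5"], distances)
```

Here r1, r2, and r3 end up in one cluster because each consecutive pair is close enough, while r4 and r5 remain singletons. Connected components tend to chain loosely related recipes together, so a stricter method may be preferable on real data.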
Find enriched/depleted ingredients

abs(Log-2 ratio) > 2

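
The slide gives the cutoff but not the ratio being logged. A minimal sketch, assuming the ratio is the fraction of recipes in a cluster that contain an ingredient divided by the fraction of all recipes that contain it, and that the cluster is a subset of all recipes. The function names are hypothetical.

```python
import math
from collections import Counter

def log2_enrichment(cluster_recipes, all_recipes):
    """log2 of (ingredient frequency in cluster) / (ingredient frequency overall)."""
    in_cluster = Counter(i for r in cluster_recipes for i in set(r))
    overall = Counter(i for r in all_recipes for i in set(r))
    n_cluster, n_all = len(cluster_recipes), len(all_recipes)
    scores = {}
    for ingredient, total in overall.items():
        if in_cluster[ingredient] == 0:
            # Absent from the cluster: treat as fully depleted.
            scores[ingredient] = float("-inf")
            continue
        f_cluster = in_cluster[ingredient] / n_cluster
        f_all = total / n_all
        scores[ingredient] = math.log2(f_cluster / f_all)
    return scores

def enriched_or_depleted(scores, cutoff=2.0):
    """Keep ingredients whose absolute log2 ratio exceeds the cutoff."""
    return {ing: s for ing, s in scores.items() if abs(s) > cutoff}

cluster = [{"flour", "pumpkin"}, {"flour", "pumpkin"}]
others = [{"flour", "chocolate"}] * 8 + [{"flour"}] * 6
scores = log2_enrichment(cluster, cluster + others)
flagged = enriched_or_depleted(scores)
```

In this toy data, pumpkin is in every cluster recipe but only 2 of 16 recipes overall, an 8-fold enrichment (log2 ratio of 3). Flour is everywhere (log2 ratio of 0), and chocolate is absent from the cluster. A cutoff of 2 corresponds to a 4-fold difference in either direction.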
Tools

Back end
•  Yummly API
•  Python
    –  Pycurl
    –  Nltk wordnet
•  MySQL

Analysis
•  Numpy, Scipy
•  Nltk, networkx
•  Python, R
•  Amazon EMR

Front end
•  HTML/CSS/JavaScript
•  Twitter Bootstrap
•  Flask
•  Amazon AWS

Diane Wu
•  PhD Genetics, Stanford University, CA
•  BSc Computing Science, Simon Fraser, Canada

Diane Wu Insight demo
