Thinking Lucene
Think Lucid

Enhancing Discovery with Solr and Mahout

Grant Ingersoll
Chief Scientist
Lucid Imagination

CONFIDENTIAL

Evolution

Documents
• Models
• Feature Selection

Content Relationships
• Page Rank, etc.
• Organization

User Interaction
• Clicks
• Ratings/Reviews
• Learning to Rank
• Social Graph

Queries
• Phrases
• NLP

Copyright Lucid Imagination    CONFIDENTIAL

Minding the Intersection




[Diagram: Search, Analytics, Discovery]



Topics

• Background
   – Apache Mahout
   – Apache Solr and Lucene
• Recommendations with Mahout
   – Collaborative Filtering
• Discovery with Solr and Mahout
• Discussion
  




Apache Lucene in a Nutshell

• http://lucene.apache.org/java
• Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
• Fast and efficient scoring and indexing algorithms
• Lots of contributions to make common tasks easier:
   – Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
• Most widely deployed search library on the planet
  




Apache Solr in a Nutshell

• http://lucene.apache.org/solr
• Lucene-based Search Server + other features and functionality
• Access Lucene over HTTP:
   – Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
• Most programming tasks in Lucene are taken care of in Solr
• Faceting (guided navigation, filters, etc.)
• Replication and distributed search support
• Lucene Best Practices
  




Apache Mahout in a Nutshell

http://dictionary.reference.com/browse/mahout

• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
   – http://mahout.apache.org
• The Three C’s:
   – Collaborative Filtering (recommenders)
   – Clustering
   – Classification
• Others:
   – Frequent Item Mining
   – Primitive collections
   – Math stuff
  


Recommendations with Mahout

Recommenders

• Collaborative Filtering (CF)
   – Provide recommendations based solely on preferences expressed between users and items
   – “People who watched this also watched that”
• Content-based Recommendations (CBR)
   – Provide recommendations based on the attributes of the items and the user profile
   – ‘Modern Family’ is a sitcom, Bob likes sitcoms
      • => Suggest Modern Family to Bob
• Mahout is geared towards CF, but can be extended to do CBR
   – Classification can also be used for CBR
• Aside: search engines can also solve these problems
  


To Rate or Not?

• In many instances, users don’t provide actual ratings
   – Clicks, views, etc.
• Non-Boolean ratings can also often introduce unnecessary noise
   – Even a low rating often has a positive correlation with highly rated items in the real world
• Example: Should we recommend Frankenstein to Bob?

         Dracula   Jane Eyre   Frankenstein   Java Programming
Bob      1         4           ???            -
Mary     5         1           4              -
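
One way to see the point of the Bob and Mary example is to compare a rating-aware similarity with a Boolean one. The sketch below is illustrative only and does not use Mahout: the function names are assumptions, and Pearson correlation and the Tanimoto coefficient are chosen here simply as common representatives of the two styles. On the ratings, Bob and Mary look like opposites; on "has read or not", they overlap heavily, which argues for suggesting Frankenstein to Bob.

```python
from math import sqrt

# Ratings from the example table; Frankenstein is unknown for Bob.
ratings = {
    "Bob": {"Dracula": 1, "Jane Eyre": 4},
    "Mary": {"Dracula": 5, "Jane Eyre": 1, "Frankenstein": 4},
}


def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    shared = sorted(set(a) & set(b))
    xs = [a[item] for item in shared]
    ys = [b[item] for item in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = sqrt(sum((x - mean_x) ** 2 for x in xs)) * sqrt(
        sum((y - mean_y) ** 2 for y in ys)
    )
    return cov / norm if norm else 0.0


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: ignores rating values entirely."""
    items_a, items_b = set(a), set(b)
    return len(items_a & items_b) / len(items_a | items_b)


rating_similarity = pearson(ratings["Bob"], ratings["Mary"])
boolean_similarity = tanimoto(ratings["Bob"], ratings["Mary"])
print(rating_similarity, boolean_similarity)
```

With only two shared items the correlation is extreme by construction, so treat the numbers as a demonstration of the effect rather than a measurement.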



Collaborative Filtering with Mahout

• Extensive framework for collaborative filtering
• Recommenders
   – User based
   – Item based
   – Slope One
• Online and Offline support
   – Offline can utilize Hadoop

            Item 1   Item 2   …   Item m
User 1      -        0.5          0.9
User 2      0.1      0.3          -
…
User n      0.8      0.7          0.1

Recommendations for User X

User Similarity

What should we recommend for User 1?

[Diagram: User 1, User 2, User 3, User 4 and Item 1, Item 2, Item 3, Item 4]
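
A minimal sketch of the user-based idea, assuming Boolean preferences. The preference data below is hypothetical (the links in the diagram are not recoverable from the slide), and the function names are illustrative rather than Mahout API: find the users most similar to User 1, then suggest what they have that User 1 lacks.

```python
# Hypothetical Boolean preferences; not taken from the slide diagram.
prefs = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 3", "Item 4"},
    "User 4": {"Item 4"},
}


def tanimoto(a, b):
    """Overlap between two users' item sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(user, prefs, neighbours=1):
    """Score unseen items by the similarity of the neighbours who have them."""
    others = [u for u in prefs if u != user]
    ranked = sorted(
        others, key=lambda u: tanimoto(prefs[user], prefs[u]), reverse=True
    )
    scores = {}
    for other in ranked[:neighbours]:
        weight = tanimoto(prefs[user], prefs[other])
        for item in prefs[other] - prefs[user]:
            scores[item] = scores.get(item, 0.0) + weight
    return sorted(scores, key=scores.get, reverse=True)


user_based_recs = recommend_user_based("User 1", prefs)
print(user_based_recs)
```

Here User 2 is the nearest neighbour of User 1, so the only candidate is Item 3.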
  



Item Similarity

What should we recommend for User 1?

[Diagram: User 1, User 2, User 3, User 4 and Item 1, Item 2, Item 3, Item 4]
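
The item-based view can be sketched the same way, again on hypothetical data and with illustrative names rather than Mahout classes. Instead of comparing users, it counts how often two items appear together in the same user's history, then scores unseen items by how often they co-occur with items User 1 already has.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical Boolean preferences; not taken from the slide diagram.
prefs = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 3", "Item 4"},
    "User 4": {"Item 4"},
}


def cooccurrence(prefs):
    """counts[a][b] = number of users who have both item a and item b."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts


def recommend_item_based(user, prefs):
    """Rank unseen items by total co-occurrence with the user's items."""
    counts = cooccurrence(prefs)
    scores = defaultdict(int)
    for owned in prefs[user]:
        for other, n in counts[owned].items():
            if other not in prefs[user]:
                scores[other] += n
    return sorted(scores, key=lambda item: (-scores[item], item))


item_based_recs = recommend_item_based("User 1", prefs)
print(item_based_recs)
```

Item similarities tend to change more slowly than user similarities, which is one reason the co-occurrence counts are often precomputed.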
  



Slope One

User    Item 1    Item 2
A       3.5       2
B       ?         3

User A: 3.5 - 2 = 1.5
Item 1 (User B) = 3 + 1.5 = 4.5

• Intuition: There is a linear relationship between rated items
   – Y = mX + b, where m = 1
• Solve for b up front based on existing ratings: b = (Y - X)
   – Find the average difference in preference value for every pair of items
• Online can be very fast, but requires up-front computation and memory
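
The arithmetic above can be written out as a small sketch. This is plain Python under simple assumptions (unweighted averaging, illustrative function names), not Mahout's Slope One implementation: the average differences are computed once up front, and a prediction is then just a lookup plus an addition.

```python
from collections import defaultdict

# Ratings from the table above; User B has not rated Item 1.
ratings = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}


def average_diffs(ratings):
    """Precompute b = average(Y - X) for every ordered item pair (Y, X)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_prefs in ratings.values():
        for y, y_value in user_prefs.items():
            for x, x_value in user_prefs.items():
                if x != y:
                    totals[(y, x)] += y_value - x_value
                    counts[(y, x)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


def predict(user, target, ratings, diffs):
    """Estimate Y = X + b from each item X the user has already rated."""
    estimates = [
        value + diffs[(target, item)]
        for item, value in ratings[user].items()
        if (target, item) in diffs
    ]
    return sum(estimates) / len(estimates) if estimates else None


diffs = average_diffs(ratings)
prediction = predict("B", "Item 1", ratings, diffs)
print(prediction)  # 4.5
```

The `diffs` table holds one entry per ordered item pair, which is the memory cost the last bullet refers to.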
  


Online and Offline Recommendations

• Online
   – Predates Hadoop
   – Designed to run on a single node
      • Matrix size of ~100M interactions
   – API for integrating with your application
• Offline
   – Hadoop based
   – Designed to run on large cluster
   – Several approaches:
      • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob
  




RecommenderJob

• Essentially does matrix multiplication using distributed techniques
• $MAHOUT_HOME/bin/examples/asf-email-examples.sh

        101   102   103   104   105         User A         Recs
101     7     2     0     1     3           3.0            30
102     2     8     3     5     2           0              37
103     0     3     3     6     4      X    4.0      =     38
104     1     5     6     4     7           3.0            53
105     3     2     4     7     9           2.0            64
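
The multiplication in the table can be reproduced on a single machine in a few lines. This sketch reads the 5 x 5 matrix as an item-to-item co-occurrence matrix, which is an assumption (the slide does not label it), and it says nothing about how RecommenderJob distributes the work across Hadoop.

```python
item_ids = [101, 102, 103, 104, 105]

# Item-to-item matrix from the slide (assumed to be co-occurrence counts).
cooccurrence = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]

# User A's preference vector; 0 means "no preference expressed".
user_a = [3.0, 0.0, 4.0, 3.0, 2.0]

# Matrix-vector product: one score per item.
scores = [sum(c * p for c, p in zip(row, user_a)) for row in cooccurrence]

# Only items the user has not rated are candidates for recommendation.
unrated = [
    (item, score)
    for item, score, pref in zip(item_ids, scores, user_a)
    if pref == 0.0
]
best = max(unrated, key=lambda pair: pair[1])

print(scores)  # [30.0, 37.0, 38.0, 53.0, 64.0]
print(best)    # (102, 37.0)
```

Items 104 and 105 score highest overall, but User A has already rated them, so item 102 is the only candidate left.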



Discovery with Solr

Discovery with Solr

• Goals:
   – Guide users to results without having to guess at keywords
   – Encourage serendipity
   – Never show empty results
• Out of the Box:
   – Faceting
   – Spell Checking
   – More Like This
   – Clustering (Carrot2)
• Extend
   – Clustering (with Mahout)
   – Frequent Item Mining (with Mahout)
  


Clustering

• Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
• Solr has search result clustering
   – Pluggable
   – Default implementation uses Carrot2
• Mahout has Hadoop-based large-scale clustering
   – K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.
  




Discovery In Action

• Pre-reqs:
   – Apache Ant 1.7.x, Subversion (SVN)
• Command Line 1:
   – svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
   – cd solr-trunk/solr/
   – ant example
   – cd example
   – java -Dsolr.clustering.enabled=true -jar start.jar
• Command Line 2:
   – cd exampledocs; java -jar post.jar *.xml
• http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
  
        	
  


Solr + Mahout

Basics

• Most Mahout tasks are offline
• Solr provides many touch points for integration:
   – ClusteringEngine
      • Clustering results
   – SearchComponent
      • Suggestions: related searches, clusters, MLT, spellchecking
   – UpdateProcessor
      • Classification of documents
   – FunctionQuery
  




Example: Frequent Itemset Mining

• Discover frequently co-occurring items
• Use Case: Related Searches from Solr Logs
• Hadoop and sequential versions
   – Parallel FP Growth
• Input:
   – <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
   – Comma, pipe also allowed as delimiters
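
To make "frequently co-occurring items" concrete, the sketch below reads lines in the input layout described above and counts token pairs that appear together at least a minimum number of times. It is a brute-force pair count for illustration, not the FP-Growth algorithm Mahout uses, and the sample lines, function names, and `min_support` parameter are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical input lines: "<optional id>TAB<token> <token> ...".
lines = [
    "q1\tsolr faceting",
    "q2\tsolr faceting tutorial",
    "q3\tmahout clustering",
    "solr tutorial",
]


def parse(line):
    """Drop the optional id and split tokens on space, comma, or pipe."""
    _, _, tokens = line.rpartition("\t")
    return sorted({t for t in re.split(r"[ ,|]+", tokens.strip()) if t})


def frequent_pairs(lines, min_support=2):
    """Return token pairs that co-occur in at least min_support lines."""
    counts = Counter()
    for line in lines:
        counts.update(combinations(parse(line), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(lines)
print(pairs)
```

Counting every pair scales poorly as transactions grow, which is the kind of cost FP-Growth is designed to avoid.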
  



FIM on Solr Query Logs

• Goal:
   – Extract user queries from Solr logs
   – Feed into FIM to generate Related Keyword Searches
• Context:
   – Solr Query logs
   – bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
   – bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
   – bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
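
The extraction step can be tried outside Mahout to check what a regex pulls from the logs before running the full pipeline. The sketch below is a rough Python analogue of that step, not a reimplementation of regexconverter: the log lines are hypothetical (real Solr log layouts vary by version and configuration), the pattern is a simplified variant of the one in the command, and the function name is illustrative.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical request lines; adjust to the actual log layout.
log_lines = [
    "GET /solr/select?q=solr+faceting&rows=10",
    "GET /solr/select?rows=10&q=mahout%20clustering",
    "GET /solr/admin/ping",
]

# Match the value of the q parameter, up to the next '&' or whitespace.
QUERY = re.compile(r"(?<=[?&]q=)[^&\s]*")


def extract_queries(lines):
    """Return URL-decoded q= values, skipping lines without a query."""
    queries = []
    for line in lines:
        match = QUERY.search(line)
        if match and match.group(0):
            queries.append(unquote_plus(match.group(0)))
    return queries


queries = extract_queries(log_lines)

# One space-separated transaction per query, ready for itemset mining.
fpg_input = [" ".join(q.split()) for q in queries]
print(fpg_input)
```

Requiring `?` or `&` directly before `q=` keeps parameters such as `fq=` from being picked up by mistake.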
  

Output

• Key: Chris: Value:
   ([Chris, Hostetter],870),
   ([Chris],870),
   ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18),
   ([Search, Faceted, Chris, Hostetter, Webcast, Power],18),
   ([Search, Faceted, Chris, Hostetter],18),
   ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12),
   ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12),
   ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12),
   ([Solr, new, Chris, Hostetter, webcast, along],12),
   ([Solr, new, Chris, Hostetter, webcast],12),
   ([Solr, new, Chris, Hostetter],12)
  




Resources

• http://lucene.apache.org
• http://mahout.apache.org
• http://manning.com/owen
• http://manning.com/ingersoll

• http://www.lucidimagination.com
• grant@lucidimagination.com
• @gsingers
  




Appendix

Mahout Overview

[Layered diagram, top to bottom]
Applications
Examples
Genetic | Freq. Pattern Mining | Classification | Clustering | Recommenders
Utilities/Integration (Lucene/Vectorizer) | Math (Vectors/Matrices/SVD) | Collections (primitives) | Apache Hadoop

    See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

