Enhance discovery Solr and Mahout

Los Angeles/OC Apache Lucene/Solr User Group meeting, held at Shopzilla in LA on January 19th, 2012.


Published in: Technology, Spiritual


Transcript

  • 1. Enhancing Discovery with Solr and Mahout
      Grant Ingersoll, Chief Scientist, Lucid Imagination
      Thinking Lucene, Think Lucid
      CONFIDENTIAL
  • 2. Evolution
      - Documents: models, feature selection
      - Content relationships: Page Rank, etc.; organization
      - User interaction: clicks, ratings/reviews, learning to rank, social graph
      - Queries: phrases, NLP
      Copyright Lucid Imagination
  • 3. Minding the Intersection
      (Diagram: Search, Analytics, Discovery)
  • 4. Topics
      - Background
          - Apache Mahout
          - Apache Solr and Lucene
      - Recommendations with Mahout
          - Collaborative Filtering
      - Discovery with Solr and Mahout
      - Discussion
  • 5. Apache Lucene in a Nutshell
      - http://lucene.apache.org/java
      - Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
      - Fast and efficient scoring and indexing algorithms
      - Lots of contributions to make common tasks easier:
          - Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
      - Most widely deployed search library on the planet
  • 6. Apache Solr in a Nutshell
      - http://lucene.apache.org/solr
      - Lucene-based Search Server + other features and functionality
      - Access Lucene over HTTP:
          - Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
      - Most programming tasks in Lucene are taken care of in Solr
      - Faceting (guided navigation, filters, etc.)
      - Replication and distributed search support
      - Lucene Best Practices
  • 7. Apache Mahout in a Nutshell
      - http://dictionary.reference.com/browse/mahout
      - An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
          - http://mahout.apache.org
      - The Three C's:
          - Collaborative Filtering (recommenders)
          - Clustering
          - Classification
      - Others:
          - Frequent Item Mining
          - Primitive collections
          - Math stuff
  • 8. Recommendations with Mahout
  • 9. Recommenders
      - Collaborative Filtering (CF)
          - Provide recommendations solely based on preferences expressed between users and items
          - "People who watched this also watched that"
      - Content-based Recommendations (CBR)
          - Provide recommendations based on the attributes of the items and user profile
          - 'Modern Family' is a sitcom, Bob likes sitcoms
              - => Suggest Modern Family to Bob
      - Mahout geared towards CF, can be extended to do CBR
          - Classification can also be used for CBR
      - Aside: search engines can also solve these problems
  • 10. To Rate or Not?
      - In many instances, users don't provide actual ratings
          - Clicks, views, etc.
      - Non-Boolean ratings can also often introduce unnecessary noise
          - Even a low rating often has a positive correlation with highly rated items in the real world
      - Example: Should we recommend Frankenstein to Bob?

               Dracula   Jane Eyre   Frankenstein   Java Programming
        Bob    1         4           ???            -
        Mary   5         1           4              -
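The Boolean-preference idea on this slide can be sketched in a few lines of Python. This is only an illustration, not Mahout code: it assumes the slide's table reads as Bob having rated Dracula and Jane Eyre, and Mary having rated Dracula, Jane Eyre and Frankenstein. The rating values are discarded and only the fact that a user interacted with an item is kept. The `tanimoto` helper is a hypothetical name for the standard Tanimoto (Jaccard) coefficient.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two sets of item names."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Ratings as read from the slide's table (an assumption about the garbled layout).
ratings = {
    "Bob": {"Dracula": 1, "Jane Eyre": 4},
    "Mary": {"Dracula": 5, "Jane Eyre": 1, "Frankenstein": 4},
}

# Drop the rating values: keep only "this user interacted with this item".
boolean_prefs = {user: set(items) for user, items in ratings.items()}

similarity = tanimoto(boolean_prefs["Bob"], boolean_prefs["Mary"])

# Items Mary has seen that Bob has not are the candidates for Bob.
candidates = boolean_prefs["Mary"] - boolean_prefs["Bob"]

print(similarity, candidates)
```

Even though Bob and Mary disagree on the ratings themselves, the overlap in what they chose to read makes them similar under Boolean preferences, which is the slide's point.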
  • 11. Collaborative Filtering with Mahout
      - Extensive framework for collaborative filtering
      - Recommenders
          - User based
          - Item based
          - Slope One
      - Online and Offline support
          - Offline can utilize Hadoop

                 Item 1   Item 2   …   Item m
        User 1   -        0.5          0.9
        User 2   0.1      0.3          -
        …
        User n   0.8      0.7          0.1

      Recommendations for User X
  • 12. User Similarity
      What should we recommend for User 1?
      (Diagram: Users 1-4 linked to Items 1-4)
  • 13. Item Similarity
      What should we recommend for User 1?
      (Diagram: Users 1-4 linked to Items 1-4)
  • 14. Slope One

        User   Item 1   Item 2
        A      3.5      2
        B      ?        3

      User A: 3.5 - 2 = 1.5
      Item 1 (User B) = 3 + 1.5 = 4.5
      - Intuition: There is a linear relationship between rated items
          - Y = mX + b where m = 1
      - Solve for b upfront based on existing ratings: b = (Y - X)
          - Find the average difference in preference value for every pair of items
      - Online can be very fast, but requires up front computation and memory
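The Slope One arithmetic on this slide can be reproduced with a small Python sketch. It is not Mahout's implementation; the function names `slope_one_diffs` and `predict` are made up for illustration, and the weighting by co-rating counts is the common "weighted Slope One" variant, assumed here rather than taken from the slide.

```python
from collections import defaultdict


def slope_one_diffs(ratings):
    """Precompute the average difference (item_i - item_j) over users who rated both."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for prefs in ratings.values():
        for i, rating_i in prefs.items():
            for j, rating_j in prefs.items():
                if i != j:
                    totals[(i, j)] += rating_i - rating_j
                    counts[(i, j)] += 1
    diffs = {pair: totals[pair] / counts[pair] for pair in totals}
    return diffs, dict(counts)


def predict(ratings, diffs, counts, user, target):
    """Predict the user's rating for target; None if no co-rated item exists."""
    numerator = 0.0
    denominator = 0
    for item, rating in ratings[user].items():
        pair = (target, item)
        if pair in diffs:
            # Y = X + b, weighted by how many users supplied this b.
            numerator += (rating + diffs[pair]) * counts[pair]
            denominator += counts[pair]
    return numerator / denominator if denominator else None


# The two users from the slide.
ratings = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}

diffs, counts = slope_one_diffs(ratings)
prediction = predict(ratings, diffs, counts, "B", "Item 1")
print(prediction)  # 4.5
```

The `diffs` table is the up-front computation and memory cost the slide mentions; once it exists, each online prediction is a short loop over the user's own ratings.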
  • 15. Online and Offline Recommendations
      - Online
          - Predates Hadoop
          - Designed to run on a single node
              - Matrix size of ~100M interactions
          - API for integrating with your application
      - Offline
          - Hadoop based
          - Designed to run on large cluster
          - Several approaches:
              - RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob
  • 16. RecommenderJob
      - Essentially does matrix multiplication using distributed techniques
      - $MAHOUT_HOME/bin/examples/asf-email-examples.sh

               101  102  103  104  105     User A     Recs
        101      7    2    0    1    3       3.0       30
        102      2    8    3    5    2       0         37
        103      0    3    3    6    4   X   4.0   =   38
        104      1    5    6    4    7       3.0       53
        105      3    2    4    7    9       2.0       64
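The multiplication shown on this slide can be checked with plain Python. This single-machine sketch only reproduces the arithmetic; RecommenderJob distributes the same kind of computation over Hadoop. Treating the 5x5 matrix as an item-item matrix (for example, co-occurrence counts) is an assumption, since the slide does not label it.

```python
items = [101, 102, 103, 104, 105]

# Item-item matrix from the slide (rows and columns are items 101..105).
item_matrix = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]

# User A's preference vector from the slide; 0 means "no preference yet".
user_a = [3.0, 0.0, 4.0, 3.0, 2.0]

# Matrix-vector product: one score per item.
scores = [sum(c * p for c, p in zip(row, user_a)) for row in item_matrix]

# Only items the user has not rated yet are useful as recommendations.
recommendations = [
    (item, score)
    for item, score, pref in zip(items, scores, user_a)
    if pref == 0.0
]

print(scores)           # [30.0, 37.0, 38.0, 53.0, 64.0]
print(recommendations)  # [(102, 37.0)]
```

The scores match the "Recs" column on the slide; item 102 is the only one User A has not rated, so it is the only candidate.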
  • 17. Discovery with Solr
  • 18. Discovery with Solr
      - Goals:
          - Guide users to results without having to guess at keywords
          - Encourage serendipity
          - Never show empty results
      - Out of the Box:
          - Faceting
          - Spell Checking
          - More Like This
          - Clustering (Carrot2)
      - Extend
          - Clustering (with Mahout)
          - Frequent Item Mining (with Mahout)
  • 19. Clustering
      - Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
      - Solr has search result clustering
          - Pluggable
          - Default implementation uses Carrot2
      - Mahout has Hadoop based large scale clustering
          - K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.
  • 20. Discovery In Action
      - Pre-reqs:
          - Apache Ant 1.7.x, Subversion (SVN)
      - Command Line 1:
          - svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
          - cd solr-trunk/solr/
          - ant example
          - cd example
          - java -Dsolr.clustering.enabled=true -jar start.jar
      - Command Line 2:
          - cd exampledocs; java -jar post.jar *.xml
      - http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
  • 21. Solr + Mahout
  • 22. Basics
      - Most Mahout tasks are offline
      - Solr provides many touch points for integration:
          - ClusteringEngine
              - Clustering results
          - SearchComponent
              - Suggestions: related searches, clusters, MLT, spellchecking
          - UpdateProcessor
              - Classification of documents
          - FunctionQuery
  • 23. Example: Frequent Itemset Mining
      - Discover frequently co-occurring items
      - Use Case: Related Searches from Solr Logs
      - Hadoop and sequential versions
          - Parallel FP Growth
      - Input:
          - <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
          - Comma, pipe also allowed as delimiters
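As a rough illustration of the input format described on this slide, the following Python sketch turns a list of search queries into one transaction per line: a document ID, a TAB, then space-separated tokens. The helper name `to_fpg_line` and the sample queries are hypothetical, and the sketch always writes a document ID because the slide does not say how a line without one should look.

```python
def to_fpg_line(doc_id, tokens, delimiter=" "):
    """Format one transaction as <document id>TAB<TOKEN1><delimiter><TOKEN2>..."""
    return f"{doc_id}\t{delimiter.join(tokens)}"


# Hypothetical queries standing in for what would come out of the Solr logs.
queries = ["solr faceted search", "mahout clustering"]

# Each query becomes one transaction; its position serves as the document ID.
lines = [to_fpg_line(i, query.split()) for i, query in enumerate(queries)]

for line in lines:
    print(line)
```

Passing `delimiter=","` or `delimiter="|"` would produce the comma- or pipe-delimited variants the slide mentions.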
  • 24. FIM on Solr Query Logs
      - Goal:
          - Extract user queries from Solr logs
          - Feed into FIM to generate Related Keyword Searches
      - Context:
          - Solr Query logs
          - bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
          - bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
          - bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
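To see what the extraction step is doing, here is a small Python sketch that applies a regex of the same shape as the one given to regexconverter (with the literal "?" escaped) to a few request lines. The log lines are invented for illustration and do not reproduce the real Solr log format. The URL decoding is an assumption about what the `url` transformer presumably does.

```python
import re
from urllib.parse import unquote_plus

# Lookbehind: the match must follow "?q=" or "&q=".
# Lookahead: the match ends at the next "&" or at end of line.
QUERY_RE = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")


def extract_query(line):
    """Return the URL-decoded value of the q parameter, or None if absent."""
    match = QUERY_RE.search(line)
    return unquote_plus(match.group(0)) if match else None


# Hypothetical request lines, not actual Solr log output.
log_lines = [
    "GET /solr/select?q=chris+hostetter&rows=10",
    "GET /solr/select?fl=id&q=faceted%20search",
    "GET /solr/admin/ping",
]

extracted = [extract_query(line) for line in log_lines]
print(extracted)  # ['chris hostetter', 'faceted search', None]
```

Each extracted query, split into tokens, is one transaction for the frequent pattern mining step.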
  • 25. Output
      - Key: Chris: Value:
          - ([Chris, Hostetter],870)
          - ([Chris],870)
          - ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18)
          - ([Search, Faceted, Chris, Hostetter, Webcast, Power],18)
          - ([Search, Faceted, Chris, Hostetter],18)
          - ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12)
          - ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12)
          - ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12)
          - ([Solr, new, Chris, Hostetter, webcast, along],12)
          - ([Solr, new, Chris, Hostetter, webcast],12)
          - ([Solr, new, Chris, Hostetter],12)
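One possible way to turn this kind of output into related-search suggestions is sketched below in Python. It is an illustration only: the `related_terms` helper and its ranking rule (highest support of any pattern containing the term) are assumptions, not something the slides prescribe, and only a few of the patterns above are used as sample data.

```python
def related_terms(key, patterns, limit=5):
    """Rank terms that co-occur with key by the best support of any shared pattern."""
    best = {}
    for terms, support in patterns:
        for term in terms:
            if term != key:
                best[term] = max(best.get(term, 0), support)
    # Highest support first; ties broken alphabetically for a stable order.
    return sorted(best, key=lambda term: (-best[term], term))[:limit]


# A subset of the (pattern, support) pairs listed for the key "Chris".
patterns = [
    (["Chris", "Hostetter"], 870),
    (["Chris"], 870),
    (["Search", "Faceted", "Chris", "Hostetter"], 18),
    (["Solr", "new", "Chris", "Hostetter"], 12),
]

suggestions = related_terms("Chris", patterns)
print(suggestions)  # ['Hostetter', 'Faceted', 'Search', 'Solr', 'new']
```

In practice a stop-word filter would likely be needed as well, since terms such as "new" carry little meaning as a related search.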
  • 26. Resources
      - http://lucene.apache.org
      - http://mahout.apache.org
      - http://manning.com/owen
      - http://manning.com/ingersoll
      - http://www.lucidimagination.com
      - grant@lucidimagination.com
      - @gsingers
  • 27. Appendix
  • 28. Mahout Overview
      (Architecture diagram, top to bottom)
      - Applications
      - Examples
      - Genetic, Freq. Pattern Mining, Classification, Clustering, Recommenders
      - Math (Vectors/Matrices/SVD), Utilities/Integration (Lucene/Vectorizer), Collections (primitives), Apache Hadoop
      See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms