Enhance discovery Solr and Mahout

Los Angeles/OC Apache Lucene/Solr User Group meeting, held at Shopzilla in LA on January 19th, 2012.

    Enhance discovery Solr and Mahout: Presentation Transcript

    • Thinking Lucene, Think Lucid. Enhancing Discovery with Solr and Mahout. Grant Ingersoll, Chief Scientist, Lucid Imagination. CONFIDENTIAL
    • Evolution
        • Documents
            – Models
            – Feature Selection
        • User Interaction
            – Clicks
            – Ratings/Reviews
            – Learning to Rank
            – Social Graph
        • Content Relationships
            – Page Rank, etc.
            – Organization
        • Queries
            – Phrases
            – NLP
        Copyright Lucid Imagination
    • Minding the Intersection: Search, Analytics, Discovery
    • Topics
        • Background
            – Apache Mahout
            – Apache Solr and Lucene
        • Recommendations with Mahout
            – Collaborative Filtering
        • Discovery with Solr and Mahout
        • Discussion
    • Apache Lucene in a Nutshell
        • http://lucene.apache.org/java
        • Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
        • Fast and efficient scoring and indexing algorithms
        • Lots of contributions to make common tasks easier:
            – Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
        • Most widely deployed search library on the planet
    • Apache Solr in a Nutshell
        • http://lucene.apache.org/solr
        • Lucene-based Search Server + other features and functionality
        • Access Lucene over HTTP:
            – Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
        • Most programming tasks in Lucene are taken care of in Solr
        • Faceting (guided navigation, filters, etc.)
        • Replication and distributed search support
        • Lucene Best Practices
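Because Solr exposes Lucene over HTTP, a faceted search is ultimately just a URL. The sketch below only builds such a URL; it does not contact a server. It assumes a Solr instance at `http://localhost:8983/solr` (the host and port used later in the deck), the standard `select` handler, and a hypothetical `cat` field to facet on. The helper name `solr_select_url` is made up for illustration.

```python
from urllib.parse import urlencode


def solr_select_url(base, query, facet_field, rows=10):
    """Build a Solr select URL with one facet field (illustrative helper)."""
    params = [
        ("q", query),                  # the user's keywords
        ("wt", "json"),                # ask for a JSON response
        ("rows", rows),                # number of hits to return
        ("facet", "true"),             # turn faceting on
        ("facet.field", facet_field),  # assumed field name, not from the slides
    ]
    return base.rstrip("/") + "/select?" + urlencode(params)


url = solr_select_url("http://localhost:8983/solr", "ipod", "cat")
print(url)
```

Any HTTP client in any of the languages listed on the slide could fetch a URL of this shape.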
    • Apache Mahout in a Nutshell
        • http://dictionary.reference.com/browse/mahout
        • An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
            – http://mahout.apache.org
        • The Three C’s:
            – Collaborative Filtering (recommenders)
            – Clustering
            – Classification
        • Others:
            – Frequent Item Mining
            – Primitive collections
            – Math stuff
    • Thinking Lucene, Think Lucid. Recommendations with Mahout
    • Recommenders
        • Collaborative Filtering (CF)
            – Provide recommendations solely based on preferences expressed between users and items
            – “People who watched this also watched that”
        • Content-based Recommendations (CBR)
            – Provide recommendations based on the attributes of the items and user profile
            – ‘Modern Family’ is a sitcom, Bob likes sitcoms => Suggest Modern Family to Bob
        • Mahout geared towards CF, can be extended to do CBR
            – Classification can also be used for CBR
        • Aside: search engines can also solve these problems
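The content-based case on this slide (Bob likes sitcoms, so suggest a sitcom) can be sketched in a few lines. This is only an illustration of attribute matching, not Mahout code. The catalog entries other than 'Modern Family', the attribute labels, and the function name `content_based` are all assumptions made for the example.

```python
def content_based(liked_attributes, seen, catalog):
    """Rank unseen items by how many attributes they share with the profile."""
    scored = []
    for title, attributes in catalog.items():
        if title in seen:
            continue  # do not suggest what the user already knows
        overlap = len(attributes & liked_attributes)
        if overlap:
            scored.append((overlap, title))
    # Highest overlap first; ties broken alphabetically for stable output.
    return [title for _, title in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical catalog: only 'Modern Family' and its genre come from the slide.
catalog = {
    "Modern Family": {"sitcom"},
    "Seinfeld": {"sitcom"},
    "Dracula": {"horror", "novel"},
}
bob_suggestions = content_based({"sitcom"}, seen={"Seinfeld"}, catalog=catalog)
print(bob_suggestions)
```

Collaborative filtering differs in that it never looks at the attributes, only at which users preferred which items.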
    • To Rate or Not?
        • In many instances, users don’t provide actual ratings
            – Clicks, views, etc.
        • Non-Boolean ratings can also often introduce unnecessary noise
            – Even a low rating often has a positive correlation with highly rated items in the real world
        • Example: Should we recommend Frankenstein to Bob?

                    Dracula   Jane Eyre   Frankenstein   Java Programming
            Bob     1         4           ???            -
            Mary    5         1           4              -
    • Collaborative Filtering with Mahout
        • Extensive framework for collaborative filtering
        • Recommenders
            – User based
            – Item based
            – Slope One
        • Online and Offline support
            – Offline can utilize Hadoop
        • Preference matrix used to produce recommendations for User X:

                      Item 1   Item 2   …   Item m
            User 1    -        0.5          0.9
            User 2    0.1      0.3          -
            …
            User n    0.8      0.7          0.1
    • User Similarity: What should we recommend for User 1? (Diagram of Users 1 to 4 and Items 1 to 4.)
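The diagram's edges did not survive in the transcript, so the preferences below are invented purely to illustrate the idea of user-based recommendation: find the users most similar to User 1, then suggest what they liked that User 1 has not seen. Similarity here is the Jaccard (Tanimoto) overlap of Boolean preferences. This is a single-machine sketch, not Mahout's implementation.

```python
def jaccard(a, b):
    """Overlap of two sets: |a & b| / |a | b| (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(target, prefs, k=2):
    """Score unseen items by the summed similarity of the k nearest users."""
    mine = prefs[target]
    neighbours = sorted(
        ((jaccard(mine, items), user)
         for user, items in prefs.items() if user != target),
        key=lambda pair: (-pair[0], pair[1]),
    )[:k]
    scores = {}
    for similarity, user in neighbours:
        for item in prefs[user] - mine:
            scores[item] = scores.get(item, 0.0) + similarity
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical preferences; the slide's actual user-item links are unknown.
prefs = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 4"},
    "User 4": {"Item 2", "Item 3", "Item 4"},
}
user_based_recs = recommend_user_based("User 1", prefs)
print(user_based_recs)
```

With this made-up data, User 2 is the closest neighbour, so Item 3 ranks ahead of Item 4.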
    • Item Similarity: What should we recommend for User 1? (Diagram of Users 1 to 4 and Items 1 to 4.)
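Item-based recommendation turns the comparison around: two items are similar when largely the same users like them, and a candidate is scored by its similarity to the items the target user already has. As before, the preference data is invented because the slide's diagram edges are not in the transcript, and this is an illustrative sketch rather than Mahout's code.

```python
def jaccard(a, b):
    """Overlap of two sets: |a & b| / |a | b| (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def users_by_item(prefs):
    """Invert user -> items into item -> users."""
    inverted = {}
    for user, items in prefs.items():
        for item in items:
            inverted.setdefault(item, set()).add(user)
    return inverted


def recommend_item_based(target, prefs):
    """Score each unseen item by its similarity to the target's own items."""
    by_item = users_by_item(prefs)
    mine = prefs[target]
    scores = {}
    for candidate, fans in by_item.items():
        if candidate in mine:
            continue
        score = sum(jaccard(fans, by_item[owned]) for owned in mine)
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical preferences; the slide's actual user-item links are unknown.
prefs = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 4"},
    "User 4": {"Item 2", "Item 3", "Item 4"},
}
item_based_recs = recommend_item_based("User 1", prefs)
print(item_based_recs)
```

Item-to-item similarities tend to change more slowly than user-to-user ones, which is one reason they are often precomputed offline.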
    • Slope One

                      Item 1   Item 2
            User A    3.5      2
            User B    ?        3

        User A: 3.5 - 2 = 1.5
        Item 1 (User B) = 3 + 1.5 = 4.5
        • Intuition: There is a linear relationship between rated items
            – Y = mX + b where m = 1
        • Solve for b upfront based on existing ratings: b = (Y - X)
            – Find the average difference in preference value for every pair of items
        • Online can be very fast, but requires up front computation and memory
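The arithmetic on this slide can be reproduced with a small sketch. It uses the slide's two users and two items; the function name `slope_one_predict` and the choice to weight each item pair by the number of users who rated both are assumptions of this example, not something the slide specifies.

```python
from collections import defaultdict


def slope_one_predict(ratings, user, target):
    """Predict `user`'s rating of `target` from average pairwise differences."""
    # b = (Y - X): difference between the target item and every other item,
    # collected from users who rated both.
    diffs = defaultdict(list)
    for other_ratings in ratings.values():
        if target in other_ratings:
            for item, value in other_ratings.items():
                if item != target:
                    diffs[item].append(other_ratings[target] - value)

    numerator = 0.0
    denominator = 0
    for item, value in ratings[user].items():
        if item != target and item in diffs:
            count = len(diffs[item])
            average_diff = sum(diffs[item]) / count
            numerator += (value + average_diff) * count  # Y = X + b, weighted
            denominator += count
    return numerator / denominator if denominator else None


ratings = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}
prediction = slope_one_predict(ratings, "B", "Item 1")
print(prediction)  # 3 + (3.5 - 2) = 4.5, matching the slide
```

The `diffs` table is the up-front computation the slide mentions: once it is built and held in memory, each prediction is a handful of additions.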
    • Online and Offline Recommendations
        • Online
            – Predates Hadoop
            – Designed to run on a single node
                • Matrix size of ~100M interactions
            – API for integrating with your application
        • Offline
            – Hadoop based
            – Designed to run on large cluster
            – Several approaches:
                • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob
    • RecommenderJob
        • Essentially does matrix multiplication using distributed techniques
        • $MAHOUT_HOME/bin/examples/asf-email-examples.sh

                   101  102  103  104  105       User A       Recs
            101      7    2    0    1    3         3.0          30
            102      2    8    3    5    2         0            37
            103      0    3    3    6    4    X    4.0    =     38
            104      1    5    6    4    7         3.0          53
            105      3    2    4    7    9         2.0          64
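The numbers on this slide can be checked with a plain matrix-vector product. The sketch below reads the 5 x 5 grid as an item-to-item matrix over items 101 to 105 and the column of ratings as User A's preference vector; that reading is an assumption, since the slide does not label the matrix. It runs on one machine and says nothing about how RecommenderJob distributes the work.

```python
item_ids = [101, 102, 103, 104, 105]

# Item-item matrix as read from the slide (assumed to be co-occurrence counts).
item_matrix = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]

# User A's preferences, one entry per item; 0 means no preference expressed.
user_a = [3.0, 0.0, 4.0, 3.0, 2.0]

# Matrix times vector: each score is the dot product of a row with user_a.
scores = [sum(c * p for c, p in zip(row, user_a)) for row in item_matrix]

# Only items the user has not rated are candidates for recommendation.
candidates = sorted(
    ((score, item) for score, item, pref in zip(scores, item_ids, user_a)
     if pref == 0.0),
    reverse=True,
)
print(scores)
print(candidates)
```

The computed scores match the slide's Recs column (30, 37, 38, 53, 64). In this data only item 102 is unrated, so it is the single candidate.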
    • Thinking Lucene, Think Lucid. Discovery with Solr
    • Discovery with Solr
        • Goals:
            – Guide users to results without having to guess at keywords
            – Encourage serendipity
            – Never show empty results
        • Out of the Box:
            – Faceting
            – Spell Checking
            – More Like This
            – Clustering (Carrot2)
        • Extend
            – Clustering (with Mahout)
            – Frequent Item Mining (with Mahout)
    • Clustering
        • Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
        • Solr has search result clustering
            – Pluggable
            – Default implementation uses Carrot2
        • Mahout has Hadoop based large scale clustering
            – K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.
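To make "group similar content together" concrete, here is a minimal single-machine sketch of the K-Means idea named on this slide. It is not Mahout's Hadoop implementation. The 2-D points stand in for document vectors and, like the starting centroids and function names, are invented for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iterations=10):
    """Alternate assignment and centroid update until centroids stop moving."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, clusters


# Hypothetical "document vectors": two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(clusters)
```

Real text clustering would first turn documents into term-weight vectors and would need a strategy for choosing the number of clusters and the starting centroids.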
    • Discovery In Action
        • Pre-reqs:
            – Apache Ant 1.7.x, Subversion (SVN)
        • Command Line 1:
            – svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
            – cd solr-trunk/solr/
            – ant example
            – cd example
            – java -Dsolr.clustering.enabled=true -jar start.jar
        • Command Line 2:
            – cd exampledocs; java -jar post.jar *.xml
        • http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
    • Thinking Lucene, Think Lucid. Solr + Mahout
    • Basics
        • Most Mahout tasks are offline
        • Solr provides many touch points for integration:
            – ClusteringEngine
                • Clustering results
            – SearchComponent
                • Suggestions: Related searches, clusters, MLT, spellchecking
            – UpdateProcessor
                • Classification of documents
            – FunctionQuery
    • Example: Frequent Itemset Mining
        • Discover frequently co-occurring items
        • Use Case: Related Searches from Solr Logs
        • Hadoop and sequential versions
            – Parallel FP Growth
        • Input:
            – <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
            – Comma, pipe also allowed as delimiters
    • FIM on Solr Query Logs
        • Goal:
            – Extract user queries from Solr logs
            – Feed into FIM to generate Related Keyword Searches
        • Context:
            – Solr Query logs
            – bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
            – bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
            – bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
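The two stages of this pipeline (pull the `q` parameter out of each log line, then find terms that occur together often) can be illustrated on a single machine. The sketch below is not FP-Growth and does not call Mahout: it counts term pairs by brute force, which only works for small inputs. The log lines are invented, the regex is a simplified equivalent of the one on the slide, and the minimum support of 2 is chosen to echo the `-s 2` flag, whose exact meaning is assumed here.

```python
import re
from collections import Counter
from itertools import combinations
from urllib.parse import unquote_plus

# Text after "?q=" or "&q=" up to the next "&" or the end of the line.
QUERY_RE = re.compile(r"(?<=[?&]q=).*?(?=&|$)")


def extract_query(line):
    """Return the URL-decoded q parameter of a log line, or None if absent."""
    match = QUERY_RE.search(line)
    return unquote_plus(match.group(0)) if match else None


def frequent_pairs(queries, min_support=2):
    """Count unordered term pairs and keep those seen at least min_support times."""
    counts = Counter()
    for query in queries:
        terms = sorted(set(query.lower().split()))
        counts.update(combinations(terms, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical request lines; real Solr logs carry more fields per line.
log_lines = [
    "/solr/select?q=solr+faceting&rows=10",
    "/solr/select?rows=10&q=faceting+solr",
    "/solr/select?q=mahout+clustering&wt=json",
    "/solr/select?wt=json",
]
queries = [q for q in map(extract_query, log_lines) if q]
pairs = frequent_pairs(queries)
print(queries)
print(pairs)
```

Pairs that survive the support threshold are the raw material for "related searches": a user who typed one term of a pair can be offered the other.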
    • Output
        • Key: Chris: Value: ([Chris, Hostetter],870), ([Chris],870), ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18), ([Search, Faceted, Chris, Hostetter, Webcast, Power],18), ([Search, Faceted, Chris, Hostetter],18), ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12), ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12), ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12), ([Solr, new, Chris, Hostetter, webcast, along],12), ([Solr, new, Chris, Hostetter, webcast],12), ([Solr, new, Chris, Hostetter],12)
    • Resources
        • http://lucene.apache.org
        • http://mahout.apache.org
        • http://manning.com/owen
        • http://manning.com/ingersoll
        • http://www.lucidimagination.com
        • grant@lucidimagination.com
        • @gsingers
    • Thinking Lucene, Think Lucid. Appendix
    • Mahout Overview (layer diagram, top to bottom)
        • Applications
        • Examples
        • Genetic, Freq. Pattern Mining, Classification, Clustering, Recommenders
        • Math (Vectors/Matrices/SVD), Utilities/Integration (Lucene/Vectorizer), Collections (primitives), Apache Hadoop
        • See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms