Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr


Published on

Published in: Technology, Business

Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr

  1. 1. Search              Discover              Analyze  Large  Scale  Search,  Discovery  and  Analy5cs  with  Solr,  Mahout  and  Hadoop  Grant  Ingersoll  Chief  Scien:st  Lucid  Imagina:on              |    1    
  2. 2. Search  is  Dead,  Long  Live  Search  l  Good  keyword  search  is  a   Documents commodity  and  easy  to  get   up  and  running  l  The  Bar  is  Raised   Content User Relationships Interaction –  Relevance  is  (always  will   be?)  hard  l  Holis:c  view  of  the  data   AND  the  users  is  cri:cal   Access            |    2    
  3. 3. Topics  l  Quick  Background  and  needs  l  Architecture   –  Abstract   –  Prac:cal  l  SDA  In  Prac:ce   –  Components   –  Challenges  and  Lessons  Learned  l  Wrap  Up              |    3    
  4. 4. Why  Search,  Discovery  and  Analy8cs  (SDA)?   l  User  Needs   –  Real-­‐:me,  ad  hoc  access  to  content   –  Aggressive  Priori:za:on  based  on  Importance   Search –  Serendipity   –  Feedback/Learning  from  past   l  Business  Needs   Analytics Discovery –  Deeper  insight  into  users   –  Leverage  exis:ng  internal  knowledge   –  Cost  effec:ve              |    4    
  5. 5. What  Do  Developers  Need  for  SDA?  l  Fast, efficient, scalable search –  Bulk and Near Real Time Indexing –  Handle billions of records w/ sub-second search and facetingl  Large scale, cost effective storage and processing capabilities –  Need whole data consumption and analysis –  Experimentation/Sampling tools –  Distributed In Memory where appropriatel  NLP and machine learning tools that scale to enhance discovery and analysis            |    5    
  6. 6. Abstract  -­‐>  Prac8cal  SDA  Architecture   Access (API, UI,Visualization) Search, Discovery and Analytics Glue Stats Mahout, R, GATE, Others Pig, Machine Docs User Admin Package Learning Access Modeling Experiment Mgmt Service Mgmt Content Computation and Storage Acquisition DB Dist. Data Search NoSQL Process Mgmt KV Shards Shards Shards Shards Shards Shards Shards Logs DFS Provisioning, Monitoring, Infrastructure            |    6    
  7. 7. Computa8on  and  Storage   Solr Hadoop HBase•  Document Index •  Stores Logs, •  Metric Storage•  Document Raw files, •  User Histories Storage? intermediate •  Document files, etc. Storage?•  SolrCloud makes •  WebHDFS sharding easy •  Small file are an unnatural actChallenges   •  Who  is  the  authorita:ve  store?  Solr  or  HBase?   •  Real  :me  vs.  Batch   •  Where  should  analysis  be  done?              |    7    
  8. 8. Search  In  Prac8ce  l  Three  primary  concerns   –  Performance/Scaling   –  Relevance   –  Opera:ons:  monitoring,  failover,  etc.  l  Business  typically  cares  more  about  relevance  l  Devs  more  about  performance  (and  then  ops)              |    8    
  9. 9. Search  with  Solr:  Scaling  and  NRT  l  SolrCloud  takes  care  of  distributed  indexing  and  search  needs   –  Transac:on  logs  for  recovery   –  Automa:c  leader  elec:on,  so  no  more  master/worker   –  Have  to  declare  number  of  shards  now,  but  spliang  coming  soon   –  Use  CloudSolrServer  in  SolrJ  l  NRT  Config  :ps:   –  1  second  sod  commits  for  NRT  updates   –  1  minute  hard  commits  (no  searcher  reopen)              |    9    
  10. 10. Search:  Relevance  l  ABT  –  Always  Be  Tes:ng   –  Experiment  management  is  cri:cal   –  Top  X  +  Random  Sampling  of  Long  Tail   –  Click  logs  l  Track  Everything!   –  Queries   –  Clicks   –  Displayed  Documents     –  Mouse/Scroll  tracking???  l  Phrases  are  your  friend              |    10    
  11. 11. Discovery  Components   Serendipity Organization Data Quality•  Trends •  Importance •  Document factor•  Topics •  Clustering Distributions•  Recommendations •  Classification •  Length•  Related Items •  Named Entities •  Boosts•  More Like This •  Time Factors •  Duplicates•  Did you mean? •  Faceting•  Stat. Interesting PhrasesChallenges   •  Many  of  these  are  intense  calcula:ons  or  itera:ve   •  Many  are  subjec:ve  and  require  a  lot  of  experimenta:on              |    11    
  12. 12. Discovery  with  Mahout  l  Mahout’s  3  “C”s  provide  tools  for  helping  across  many  aspects  of  discovery   –  Collabora:ve  Filtering   –  Classifica:on   –  Clustering  l  Also:     –  Colloca:ons  (Sta:s:cally  Interes:ng  Phrases)   –  SVD   –  Others  l  Challenges:   –  High  cost  to  itera:ve  machine  learning  algorithms   –  Mahout  is  very  command  line  oriented   –  Some  areas  less  mature              |    12    
  13. 13. Aside:  Experiment  Management  l  Plan  for  running  experiments  from  the  beginning  across  Search  and   Discovery  components   –  Your  analy:cs  engine  should  help!  l  Types  of  Experiments  to  consider   –  Indexing/Analysis   –  Query  parsing   –  Scoring  formulas   –  Machine  Learning  Models   –  Recommenda:ons,  many  more  l  Make  it  easy  to  do  A/B  tes:ng  across  all  experiments  and  compare  and   contrast  the  results              |    13    
  14. 14. Analy8cs  Components  l  Commonly  used  components   –  Solr   –  R  Stats   –  Hive   –  Pig   –  Commercial  l  Star:ng  with  Search  and  Discovery  metrics  and  analysis  gives  context  into   where  to  make  investments  for  broader  analy:cs              |    14    
  15. 15. Analy8cs  in  Prac8ce  l  Simple  Counts:   –  Facets   –  Term  and  Document  frequencies   –  Clicks  l  Search  and  Discovery  example  metrics   –  Relevance  measures  like  Mean  Reciprocal  Rank   –  Histograms/Drilldowns  around  Number  of  Results   –  Log  and  naviga:on  analysis  l  Data  cleanliness  analysis  is  helpful  for  finding  poten:al  issues  in  content              |    15    
  16. 16. Wrap  l  Search,  Discovery  and  Analy:cs,  when  combined  into  a  single,  coherent   system  provides  powerful  insight  into  both  your  content  and  your  users  l  Lucid  has  combined  many  of  these  things  into  LucidWorks  Big  Data   –  hrp://­‐search-­‐plasorm/ lucidworks-­‐big-­‐data  l  Design  for  the  big  picture  when  building  search-­‐based  applica:ons              |    16    
  17. 17. Find  It  l  hrp://­‐search-­‐plasorm/ lucidworks-­‐big-­‐data  l  hrp://  l  l  @gsingers              |    17