Search	  	  	  	  	  	  	  Discover	  	  	  	  	  	  	  Analyze	  Large	  Scale	  Search,	  Discovery	  and	  Analy5cs	  w...
Search	  is	  Dead,	  Long	  Live	  Search	  l    Good	  keyword	  search	  is	  a	                       Documents      ...
Topics	  l    Quick	  Background	  and	  needs	  l    Architecture	         –  Abstract	         –  Prac:cal	  l    SDA...
Why	  Search,	  Discovery	  and	  Analy8cs	  (SDA)?	                                       l    User	  Needs	            ...
What	  Do	  Developers	  Need	  for	  SDA?	  l    Fast, efficient, scalable search       –  Bulk and Near Real Time Index...
Abstract	  -­‐>	  Prac8cal	  SDA	  Architecture	                                Access (API, UI,Visualization)            ...
Computa8on	  and	  Storage	           Solr                              Hadoop                                 HBase•  Doc...
Search	  In	  Prac8ce	  l    Three	  primary	  concerns	         –  Performance/Scaling	         –  Relevance	         – ...
Search	  with	  Solr:	  Scaling	  and	  NRT	  l    SolrCloud	  takes	  care	  of	  distributed	  indexing	  and	  search	...
Search:	  Relevance	  l    ABT	  –	  Always	  Be	  Tes:ng	         –  Experiment	  management	  is	  cri:cal	         –  ...
Discovery	  Components	         Serendipity                      Organization                        Data Quality•    Tren...
Discovery	  with	  Mahout	  l    Mahout’s	  3	  “C”s	  provide	  tools	  for	  helping	  across	  many	  aspects	  of	  d...
Aside:	  Experiment	  Management	  l    Plan	  for	  running	  experiments	  from	  the	  beginning	  across	  Search	  a...
Analy8cs	  Components	  l    Commonly	  used	  components	         –  Solr	         –  R	  Stats	         –  Hive	       ...
Analy8cs	  in	  Prac8ce	  l    Simple	  Counts:	         –  Facets	         –  Term	  and	  Document	  frequencies	      ...
Wrap	  l    Search,	  Discovery	  and	  Analy:cs,	  when	  combined	  into	  a	  single,	  coherent	        system	  prov...
Find	  It	  l    hrp://www.lucidimagina:on.com/products/lucidworks-­‐search-­‐plasorm/      lucidworks-­‐big-­‐data	  l ...
Upcoming SlideShare
Loading in...5
×

Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr

3,033

Published on

Published in: Technology, Business
0 Comments
8 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,033
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
0
Comments
0
Likes
8
Embeds 0
No embeds

No notes for slide

Transcript of "Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr"

  1. 1. Search              Discover              Analyze  Large  Scale  Search,  Discovery  and  Analy5cs  with  Solr,  Mahout  and  Hadoop  Grant  Ingersoll  Chief  Scien:st  Lucid  Imagina:on              |    1    
  2. 2. Search  is  Dead,  Long  Live  Search  l  Good  keyword  search  is  a   Documents commodity  and  easy  to  get   up  and  running  l  The  Bar  is  Raised   Content User Relationships Interaction –  Relevance  is  (always  will   be?)  hard  l  Holis:c  view  of  the  data   AND  the  users  is  cri:cal   Access            |    2    
  3. 3. Topics  l  Quick  Background  and  needs  l  Architecture   –  Abstract   –  Prac:cal  l  SDA  In  Prac:ce   –  Components   –  Challenges  and  Lessons  Learned  l  Wrap  Up              |    3    
  4. 4. Why  Search,  Discovery  and  Analy8cs  (SDA)?   l  User  Needs   –  Real-­‐:me,  ad  hoc  access  to  content   –  Aggressive  Priori:za:on  based  on  Importance   Search –  Serendipity   –  Feedback/Learning  from  past   l  Business  Needs   Analytics Discovery –  Deeper  insight  into  users   –  Leverage  exis:ng  internal  knowledge   –  Cost  effec:ve              |    4    
  5. 5. What  Do  Developers  Need  for  SDA?  l  Fast, efficient, scalable search –  Bulk and Near Real Time Indexing –  Handle billions of records w/ sub-second search and facetingl  Large scale, cost effective storage and processing capabilities –  Need whole data consumption and analysis –  Experimentation/Sampling tools –  Distributed In Memory where appropriatel  NLP and machine learning tools that scale to enhance discovery and analysis            |    5    
  6. 6. Abstract  -­‐>  Prac8cal  SDA  Architecture   Access (API, UI,Visualization) Search, Discovery and Analytics Glue Stats Mahout, R, GATE, Others Pig, Machine Docs User Admin Package Learning Access Modeling Experiment Mgmt Service Mgmt Content Computation and Storage Acquisition DB Dist. Data Search NoSQL Process Mgmt KV Shards Shards Shards Shards Shards Shards Shards Logs DFS Provisioning, Monitoring, Infrastructure            |    6    
  7. 7. Computa8on  and  Storage   Solr Hadoop HBase•  Document Index •  Stores Logs, •  Metric Storage•  Document Raw files, •  User Histories Storage? intermediate •  Document files, etc. Storage?•  SolrCloud makes •  WebHDFS sharding easy •  Small file are an unnatural actChallenges   •  Who  is  the  authorita:ve  store?  Solr  or  HBase?   •  Real  :me  vs.  Batch   •  Where  should  analysis  be  done?              |    7    
  8. 8. Search  In  Prac8ce  l  Three  primary  concerns   –  Performance/Scaling   –  Relevance   –  Opera:ons:  monitoring,  failover,  etc.  l  Business  typically  cares  more  about  relevance  l  Devs  more  about  performance  (and  then  ops)              |    8    
  9. 9. Search  with  Solr:  Scaling  and  NRT  l  SolrCloud  takes  care  of  distributed  indexing  and  search  needs   –  Transac:on  logs  for  recovery   –  Automa:c  leader  elec:on,  so  no  more  master/worker   –  Have  to  declare  number  of  shards  now,  but  spliang  coming  soon   –  Use  CloudSolrServer  in  SolrJ  l  NRT  Config  :ps:   –  1  second  sod  commits  for  NRT  updates   –  1  minute  hard  commits  (no  searcher  reopen)              |    9    
  10. 10. Search:  Relevance  l  ABT  –  Always  Be  Tes:ng   –  Experiment  management  is  cri:cal   –  Top  X  +  Random  Sampling  of  Long  Tail   –  Click  logs  l  Track  Everything!   –  Queries   –  Clicks   –  Displayed  Documents     –  Mouse/Scroll  tracking???  l  Phrases  are  your  friend              |    10    
  11. 11. Discovery  Components   Serendipity Organization Data Quality•  Trends •  Importance •  Document factor•  Topics •  Clustering Distributions•  Recommendations •  Classification •  Length•  Related Items •  Named Entities •  Boosts•  More Like This •  Time Factors •  Duplicates•  Did you mean? •  Faceting•  Stat. Interesting PhrasesChallenges   •  Many  of  these  are  intense  calcula:ons  or  itera:ve   •  Many  are  subjec:ve  and  require  a  lot  of  experimenta:on              |    11    
  12. 12. Discovery  with  Mahout  l  Mahout’s  3  “C”s  provide  tools  for  helping  across  many  aspects  of  discovery   –  Collabora:ve  Filtering   –  Classifica:on   –  Clustering  l  Also:     –  Colloca:ons  (Sta:s:cally  Interes:ng  Phrases)   –  SVD   –  Others  l  Challenges:   –  High  cost  to  itera:ve  machine  learning  algorithms   –  Mahout  is  very  command  line  oriented   –  Some  areas  less  mature              |    12    
  13. 13. Aside:  Experiment  Management  l  Plan  for  running  experiments  from  the  beginning  across  Search  and   Discovery  components   –  Your  analy:cs  engine  should  help!  l  Types  of  Experiments  to  consider   –  Indexing/Analysis   –  Query  parsing   –  Scoring  formulas   –  Machine  Learning  Models   –  Recommenda:ons,  many  more  l  Make  it  easy  to  do  A/B  tes:ng  across  all  experiments  and  compare  and   contrast  the  results              |    13    
  14. 14. Analy8cs  Components  l  Commonly  used  components   –  Solr   –  R  Stats   –  Hive   –  Pig   –  Commercial  l  Star:ng  with  Search  and  Discovery  metrics  and  analysis  gives  context  into   where  to  make  investments  for  broader  analy:cs              |    14    
  15. 15. Analy8cs  in  Prac8ce  l  Simple  Counts:   –  Facets   –  Term  and  Document  frequencies   –  Clicks  l  Search  and  Discovery  example  metrics   –  Relevance  measures  like  Mean  Reciprocal  Rank   –  Histograms/Drilldowns  around  Number  of  Results   –  Log  and  naviga:on  analysis  l  Data  cleanliness  analysis  is  helpful  for  finding  poten:al  issues  in  content              |    15    
  16. 16. Wrap  l  Search,  Discovery  and  Analy:cs,  when  combined  into  a  single,  coherent   system  provides  powerful  insight  into  both  your  content  and  your  users  l  Lucid  has  combined  many  of  these  things  into  LucidWorks  Big  Data   –  hrp://www.lucidimagina:on.com/products/lucidworks-­‐search-­‐plasorm/ lucidworks-­‐big-­‐data  l  Design  for  the  big  picture  when  building  search-­‐based  applica:ons              |    16    
  17. 17. Find  It  l  hrp://www.lucidimagina:on.com/products/lucidworks-­‐search-­‐plasorm/ lucidworks-­‐big-­‐data  l  hrp://www.lucidimagina:on.com  l  grant@lucidimagina:on.com  l  @gsingers              |    17    

×