KEYNOTE: Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop


Presented by Grant Ingersoll, Chief Scientist, Lucid Imagination

Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond batch processing. In many cases, one needs both ad hoc, real-time access to the content and the ability to discover interesting
information through features such as recommendations, summaries and other insights. Furthermore, analyzing how users interact with the content can both improve the quality of the system and deliver much-needed insight into the users and the content for the business. In this talk, we'll discuss a platform that enables large-scale search, discovery and analytics over a wide variety of content using tools like Solr, Hadoop, Mahout and others. The talk will cover the architecture and capabilities of the system, along with
how the capabilities of Solr 4 help drive real-time access for content discovery and analytics.



  1. Search | Discover | Analyze
     Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop
     Grant Ingersoll, Chief Scientist, Lucid Imagination
  2. We All Know the Pain
     - ________ data growth in the next ___ days/months/years
       - Many estimate 80-90% of data is "unstructured" (multi-structured?)
     - The Age of "Data Paranoia"
       - What if I don't collect it all?
       - What if I miss something or lose something?
       - What if I can't store it long enough?
       - How do I secure it?
       - Can I afford to do any of this? Can I afford not to?
       - What if I can't make sense of it?
  3. Big Data Premise and Promise
       Premise                                  Promise
       Large Scale Data Collection/Storage      ✔
       Prevents Data Loss                       ✔
       Long Term Storage                        ✔
       Affordable                               ✔
       New Science Delivering New Insights      ?
  4. Why Search, Discovery and Analytics (SDA)?
     - User needs:
       - Real-time, ad hoc access to content
       - Aggressive prioritization based on importance
       - Serendipity
     - Batch processing isn't enough
     - Search is built for multi-structured data
     - Deeper analysis yields:
       - Business insight into users
       - Better search and discovery for users
  5. What do you need for SDA?
     - Fast, efficient, scalable search
       - Bulk and near-real-time indexing
     - Large scale, cost-effective storage
     - Large scale processing power
       - Large scale and distributed for whole-data consumption and analysis
       - Sampling tools
       - Distributed in-memory where appropriate
     - NLP and machine learning tools that scale, to enhance discovery and analysis
  6. Example Use Cases
     - Dark Data: petabytes (and beyond) of content in storage with little insight into what's in it
       - Forensics, intelligence gathering, risk analysis, etc.
     - Financial: enable a total customer view to better understand risks and opportunities
     - Medical: extend research capabilities through deeper analysis of scientific data, publications and field usage
     - Social Media Monitoring: understand and analyze social networks and their trends all the time, no matter the scale
     - Commerce: drive more sales through metric-driven search and discovery without the guesswork
  7. Announcing LucidWorks Big Data Beta
     An application development platform aimed at enabling search, discovery and analysis of your content and user interactions, no matter the volume, variety and velocity of that content, nor the number of users
  8. Architecture (diagram)
  9. Key Features of Beta
     - Combines the real-time, ad hoc data accessibility of LucidWorks with the compute and storage capabilities of Hadoop
     - Delivers analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and users
     - RESTful API supporting JSON input/output formats for easy integration
     - Full stack: minimizes the impact of provisioning Hadoop, LucidWorks and other components
     - Hosted in the cloud and supported by Lucid Imagination
  10. APIs
     - Search and Indexing
       - Full power of LucidWorks (Solr)
       - Bulk and near-real-time indexing
       - Sharded via SolrCloud
     - Workflows
       - Predefined workflows ease common data tasks such as bulk indexing
     - Administration
       - Access to key system information
       - User management
     - Analytics
       - Common search analytics for better understanding of relevancy, based on log analysis
       - Historical views
     - Machine Learning
       - Clustering
       - Statistically Interesting Phrases
       - Future enhancements planned
     - Proxy APIs
       - LucidWorks
       - WebHDFS
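Since the platform exposes search over a RESTful, JSON-speaking API backed by Solr, a client request can be as simple as a parameterized GET. A minimal sketch, assuming a Solr-style /select endpoint; the host, port and collection name here are hypothetical placeholders, and the actual LucidWorks Big Data API paths may differ:

```python
from urllib.parse import urlencode

def build_search_url(base_url, query, rows=10, start=0):
    """Build a Solr-style /select URL requesting JSON output."""
    params = urlencode({"q": query, "rows": rows, "start": start, "wt": "json"})
    return f"{base_url}/select?{params}"

# Hypothetical local collection for illustration only.
url = build_search_url("http://localhost:8983/solr/collection1", "risk analysis")
print(url)
```

The same parameter set (`q`, `rows`, `start`, `wt=json`) covers the common paging and format options for standard Solr query handlers.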
  11. Under the Hood
     LucidWorks 2.1:
       - Lucene/Solr 4.0-dev
       - Sharded with SolrCloud
         - 1 second (default) soft commits for NRT updates
         - 1 minute (default) hard commits (no searcher reopen)
         - Transaction logs for recovery
         - Solr takes care of leader election, etc., so no more master/worker
       - See Mark Miller's talk on SolrCloud
     SDA Engine:
       - RESTful services built on Restlet 2.1
       - Service discovery, load balancing, failover enabled via ZooKeeper + Netflix Curator
       - Authentication and authorization over SSL (optional)
       - Proxies for LucidWorks and WebHDFS API
       - Workflow engine coordinates data flow
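The commit defaults described above (1-second soft commits for near-real-time visibility, 1-minute hard commits without reopening the searcher) correspond to standard Solr autocommit settings. A sketch of how such defaults would look in solrconfig.xml; the exact values shipped in LucidWorks may differ:

```xml
<!-- Soft commit every 1s: new documents become searchable quickly (NRT). -->
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

<!-- Hard commit every 60s: flush to durable storage, but do not reopen
     the searcher (visibility is already handled by the soft commits). -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
```

Keeping `openSearcher` false on hard commits is what makes the transaction log, rather than the commit itself, responsible for recovery, as the slide notes.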
  12. Under the Hood (continued)
     - Apache Hadoop
       - Map-Reduce (MR) jobs for ETL and bulk indexing into the SolrCloud-sharded system
       - Leverage Pig and custom MR jobs for log processing and metric calculation
       - WebHDFS
     - Apache Mahout
       - K-Means clustering
       - Statistically Interesting Phrases
       - More to come
     - Apache HBase
       - Key-value and time series of all calculated metrics
     - Apache Pig
       - ETL
       - Log analysis -> HBase
     - Apache ZooKeeper
       - Netflix Curator for service discovery and higher-level ZK client
     - Apache Kafka
       - Pub-sub for collecting logs from LucidWorks into HDFS
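The "Statistically Interesting Phrases" feature mentioned above is the kind of collocation detection Mahout performs by scoring word pairs with Dunning's log-likelihood ratio. A minimal standalone sketch of that scoring (not Mahout's distributed implementation) over a 2x2 contingency table of counts:

```python
import math

def xlogx(x):
    """x * ln(x), defined as 0 when x == 0."""
    return x * math.log(x) if x > 0 else 0.0

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 table of bigram counts:
    k11 = the two words together, k12/k21 = each word without the other,
    k22 = neither word. Higher scores mean a more interesting phrase."""
    row = xlogx(k11 + k12) + xlogx(k21 + k22)
    col = xlogx(k11 + k21) + xlogx(k12 + k22)
    mat = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    total = xlogx(k11 + k12 + k21 + k22)
    return 2.0 * (total + mat - row - col)

# A pair that always co-occurs scores high; an independent pair scores ~0.
print(llr(10, 0, 0, 10))   # strongly associated
print(llr(5, 5, 5, 5))     # statistically independent
```

In the batch pipeline described on this slide, the counts would come from MR jobs over the corpus, and only pairs above a score threshold would be kept as phrases.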
  13. The Road Ahead
     - Our approach is from search and discovery outwards to analytics
       - Analytics in beta are focused around analysis of search logs
     - Analytics themes
       - Relevance
       - Data quality
       - Discovery
       - Integration with other packages (R?)
     - Machine learning
       - Classification
       - NLP
     - More analytics on the index itself?
  14. Contacts
     - http://…-big-data
     - http://…
     - @gsingers