SolrCloud on Hadoop

An overview of building and serving Lucene indexes on a Hadoop cluster with Solr for text and parametric searching, as presented at Cleveland Hadoop User Group on 13 January 2014.

Published in: Technology, Education

  1. SolrCloud on Hadoop
     Cleveland Big Data and Hadoop User Group, January 2014
     Alex Moundalexis, @technmsg
  2. Disclaimer
     • Technologies, not products
     • Cloudera builds software
       • most donated to Apache
       • some closed-source
     • I will likely mention “Cloudera Something”
     • Cloudera “products” I reference are open source
       • Apache Licensed
       • Source code is on GitHub
         • https://github.com/cloudera
  3. What This Talk Isn’t About
     • Deploying
       • Puppet, Chef, Ansible, homegrown scripts, intern labor
     • Sizing & Tuning
       • Depends heavily on data and workload
     • Coding
       • Unless you count XML or CSV
     • Algorithms
  4. The Apache Hadoop Ecosystem
     Quick and dirty, more time for use cases.
  5. Why “Ecosystem?”
     • In the beginning, just Hadoop
       • HDFS
       • MapReduce
     • Today, dozens of interrelated components
       • I/O
       • Processing
       • Specialty Applications
       • Configuration
       • Workflow
  6. Partial Ecosystem
     [Diagram with Hadoop at the center. Labels: external system, RDBMS / DWH, web server, device logs, API access, log collection, DB table import, batch processing, machine learning, DB table export, BI tool + JDBC/ODBC, Search, SQL, user.]
  7. HDFS
     • Distributed, highly fault-tolerant filesystem
     • Optimized for large streaming access to data
     • Based on Google File System
       • http://research.google.com/archive/gfs.html
  8. Lots of Commodity Machines
     [Image: Yahoo! Hadoop cluster, OSCON ’07]
  9. MapReduce (MR)
     • Programming paradigm
     • Batch oriented, not realtime
     • Works well with distributed computing
     • Lots of Java, but other languages supported
     • Based on Google’s paper
       • http://research.google.com/archive/mapreduce.html
  10. Under the Covers
  11. You specify map() and reduce() functions.
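
To make the map() and reduce() idea concrete, here is a minimal word-count sketch in plain Python. It is not from the talk and does not use the Hadoop API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and `run_job` is only a single-process stand-in for the grouping and scheduling that the framework would normally do across a cluster.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: sum all the counts collected for one word."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Hypothetical local driver: map, group values by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    results = {}
    for key in sorted(groups):  # process keys in sorted order
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


# Input records are (line number, line text) pairs.
lines = ["Hadoop stores data", "Solr indexes data"]
word_counts = run_job(enumerate(lines), map_fn, reduce_fn)
print(word_counts)
```

The user-written part is only the two small functions; everything in `run_job` stands for work the framework takes care of.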
