Search in the Apache Hadoop Ecosystem: Thoughts from the Field

This presentation describes the Hadoop ecosystem and gives examples of how these open source tools are combined to solve specific and sometimes very complex problems. Drawing upon case studies from the field, Mr. Moundalexis demonstrates that rigid, one-size-fits-all traditional systems do not suit every problem, and that combinations of tools in the Apache Hadoop ecosystem provide a versatile and flexible platform for integrating, finding, and analyzing information.

Published in: Technology, Education

Search in the Apache Hadoop Ecosystem: Thoughts from the Field

  1. Search in the Apache Hadoop Ecosystem: Thoughts from the Field • Open Source Search Conference, November 2013 • Alex Moundalexis • @technmsg
  2. Thoughts of a Former SA
  3. Thoughts of a Former SA Field Guy
  4. Disclaimer • Technologies, not products • Cloudera builds software • most donated to Apache • some closed-source • I will likely mention “Cloudera Something” • Cloudera “products” I reference are open source • Apache Licensed • Source code is on GitHub • https://github.com/cloudera
  5. What This Talk Isn’t About • Deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor • Sizing & Tuning • Depends heavily on data and workload • Coding • Algorithms
  6. “The answer to most Hadoop questions is it depends.”
  7. The Apache Hadoop Ecosystem • Quick and dirty, more time for use cases.
  8. Why “Ecosystem?” • In the beginning, just Hadoop • HDFS • MapReduce • Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow
  9. Partial Ecosystem [Diagram labels: Hadoop; external system; RDBMS / DWH; web server; device logs; API access; log collection; DB table import; batch processing; machine learning; user; DB table export; BI tool + JDBC/ODBC; Search; SQL]
  10. HDFS • Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System • http://research.google.com/archive/gfs.html
  11. Lots of Commodity Machines [Image: Yahoo! Hadoop cluster, OSCON ’07]
  12. MapReduce (MR) • Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper • http://research.google.com/archive/mapreduce.html
  13. Under the Covers
  14. You specify map() and reduce() functions.
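
To make the last slide concrete, here is a minimal word-count sketch of what "specifying map() and reduce()" can look like. It is not taken from the talk and does not use any Hadoop API: the names `map_fn`, `reduce_fn`, and `run_local` are hypothetical, and `run_local` is a single-process stand-in for the grouping ("shuffle") work that the framework would normally do across many machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce step: sum all counts that were emitted for one word."""
    return word, sum(counts)


def run_local(lines: Iterable[str]) -> dict[str, int]:
    """Hypothetical local driver (assumption): stands in for the framework."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)  # "shuffle": group values by key
    # Call the reducer once per distinct key, as a framework would.
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["search in hadoop", "hadoop search at scale"]
    print(run_local(sample))
```

The user code is limited to the two small functions; reading input, grouping by key, and distributing the work are left to the driver, which is the division of labor the slide describes.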