Hadoop operations

Lynx Consultants training about Hadoop Operations


Transcript

  • 1. Marc Cluet – Lynx Consultants: How Hadoop Works
  • 2. What we'll cover
      • Understand Hadoop in detail
      • See how Hadoop works operationally
      • Be able to start asking the right questions of your data
    Lynx Consultants © 2013
  • 3. Hadoop Distributions
      • Cloudera CDH
      • Hortonworks
      • MapR
  • 4. Hadoop Components
      • HDFS
      • HBase
      • MapRed
      • YARN
  • 5. Hadoop Components
      • HDFS
          • Hadoop Distributed File System
          • Everything sits on top of it
          • Keeps 3 copies of every block by default
      • HBase
      • MapRed
      • YARN
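    The default 3-way block replication can be sketched as a small placement simulation. This is a toy model, not HDFS's actual rack-aware placement policy; the block and node names are illustrative:

    ```python
    import itertools

    REPLICATION_FACTOR = 3  # HDFS default (dfs.replication)

    def place_blocks(blocks, datanodes, replication=REPLICATION_FACTOR):
        """Assign each block to `replication` distinct datanodes, round-robin."""
        placement = {}
        ring = itertools.cycle(datanodes)
        for block in blocks:
            # Take the next `replication` nodes from the ring; they are
            # distinct as long as replication <= number of datanodes.
            placement[block] = [next(ring) for _ in range(replication)]
        return placement

    placement = place_blocks(["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
    print(placement["blk_1"])  # ['dn1', 'dn2', 'dn3']
    ```

    The point of the model: losing any single datanode still leaves two replicas of every block it held.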
  • 6. Hadoop Components
      • HDFS
      • HBase
          • Hadoop schemaless database
          • Key-value store
          • Sits on top of HDFS
      • MapRed
      • YARN
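    HBase's data model can be approximated as a sorted map of (row key, column) → value. A minimal in-memory sketch of that idea (illustrative only; this is not the HBase client API, and the class and column names are made up):

    ```python
    class ToyKeyValueStore:
        """A sorted key-value store, loosely modelled on HBase's data model."""

        def __init__(self):
            self._rows = {}  # row key -> {column: value}

        def put(self, row, column, value):
            self._rows.setdefault(row, {})[column] = value

        def get(self, row, column):
            return self._rows.get(row, {}).get(column)

        def scan(self, start_row, stop_row):
            """Yield rows in key order within [start_row, stop_row),
            like an HBase range scan."""
            for row in sorted(self._rows):
                if start_row <= row < stop_row:
                    yield row, self._rows[row]

    store = ToyKeyValueStore()
    store.put("user#1001", "info:name", "Ada")
    store.put("user#1002", "info:name", "Linus")
    print(store.get("user#1001", "info:name"))  # Ada
    ```

    Keeping rows sorted by key is what makes range scans cheap, and it is also why row-key design matters so much in real HBase schemas.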
  • 7. Hadoop Components
      • HDFS
      • HBase
      • MapRed
          • Hadoop Map/Reduce
          • Non-pluggable, archaic
          • Requires HDFS for temp storage
      • YARN
  • 8. Hadoop Components
      • HDFS
      • HBase
      • MapRed
      • YARN
          • Hadoop Map/Reduce version 2.0
          • Pluggable: you can add your own
          • Faster and less memory-hungry
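    The Map/Reduce model that both MapRed and YARN execute can be sketched locally as a toy word count. This is the programming model only, not actual Hadoop job code; the function names are illustrative:

    ```python
    from collections import defaultdict

    def map_phase(records):
        """Map: emit a (word, 1) pair for every word in every input record."""
        for record in records:
            for word in record.split():
                yield word, 1

    def shuffle(pairs):
        """Shuffle: group values by key, as Hadoop does between map and reduce."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        """Reduce: sum the counts for each word."""
        return {word: sum(counts) for word, counts in groups.items()}

    counts = reduce_phase(shuffle(map_phase(["big data big cluster", "big job"])))
    print(counts["big"])  # 3
    ```

    In a real cluster the map and reduce phases run as parallel tasks and the shuffle moves data over the network; the logic per key is the same.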
  • 9. Hadoop Component Breakdown
      • All these components divide into
          • client/server and
          • master/slave scenarios
      • We will now look at each individual component's breakdown
  • 10. Hadoop Components Breakdown
      • HDFS
          • Master Namenode
              • Keeps track of all file allocation on the datanodes
              • Re-replicates data if one of the datanodes goes down
              • Is rack aware
          • Secondary Namenode
              • Performs cleanup services for the namenode
              • Not necessarily a separate server
          • Datanode
              • Stores the data
              • Best with non-RAID disks for extra I/O speed
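    The namenode's re-replication behaviour can be sketched as: when a datanode dies, every block it held gets a new replica on a live node that does not already hold that block. A toy model (not HDFS code; names are illustrative):

    ```python
    def handle_dead_datanode(placement, dead_node, live_nodes):
        """For each block that lost a replica, add a replica on a live node
        that does not already hold the block (toy namenode behaviour)."""
        for block, replicas in placement.items():
            if dead_node in replicas:
                replicas.remove(dead_node)
                candidates = [n for n in live_nodes if n not in replicas]
                if candidates:
                    replicas.append(candidates[0])
        return placement

    placement = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn3", "dn4"]}
    handle_dead_datanode(placement, "dn2", ["dn1", "dn3", "dn4", "dn5"])
    print(placement["blk_1"])  # ['dn1', 'dn3', 'dn4']
    ```

    This is why datanode disks do not need RAID: the redundancy lives at the block-replica level, across machines, rather than inside one server.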
  • 11. Hadoop Components Breakdown
      • HDFS
          • How to access
              • Clients can connect with the hadoop client to hdfs://namenode:8020
              • Supports all basic Unix commands
          • Configuration files
              • /etc/hadoop/conf/core-site.xml
                  • Defines major configuration such as the HDFS namenode and default parameters
              • /etc/hadoop/conf/hdfs-site.xml
                  • Defines configuration specific to the namenode or datanode, such as file locations
              • /etc/hadoop/conf/slaves
                  • Lists the servers that are available in this cluster
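    A minimal core-site.xml pointing clients at the namenode above might look like this (a sketch; a real distribution's defaults add many more properties, and the hostname is illustrative):

    ```xml
    <?xml version="1.0"?>
    <configuration>
      <property>
        <!-- Where clients find HDFS; matches hdfs://namenode:8020 above -->
        <name>fs.defaultFS</name>
        <value>hdfs://namenode:8020</value>
      </property>
    </configuration>
    ```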
  • 12. Hadoop Components Breakdown
      • HBase
          • Master
              • Controls the HBase cluster, knows where the data is allocated, and provides a client listening socket using Thrift and/or a RESTful API
          • Regionserver
              • HBase node; stores some of the information in one of the regions (the equivalent of sharding)
          • Thrift / REST
              • Interfaces for connecting to HBase
  • 13. Hadoop Components Breakdown
      • HBase
          • How to access
              • Through the HBase client (using Thrift)
              • Through the RESTful API
          • Configuration files
              • /etc/hbase/conf/hbase-site.xml
                  • Defines all the basic configuration for accessing HBase
              • /etc/hbase/conf/hbase-policy.xml
                  • Defines the security (ACL) rules and HBase memory tweaks
              • /etc/hbase/conf/regionservers
                  • Lists all the regionservers available to this cluster
  • 14. Hadoop Components Breakdown
      • MapRed
          • JobTracker
              • Creates the Map/Reduce jobs
              • Stores all the intermediate data
              • Keeps track of all previous results through the HistoryServer
          • TaskTracker
              • Executes the tasks belonging to a Map/Reduce job
              • Very CPU and memory intensive
              • Stores intermediate results, which are then pushed to the JobTracker
  • 15. Hadoop Components Breakdown
      • MapRed
          • How to access
              • Through the Hadoop client
              • Through any MapRed client such as Pig or Hive
              • Through your own Java code
          • Configuration files
              • /etc/hadoop/conf/mapred-site.xml
                  • Defines how to contact this MapRed cluster
              • /etc/hadoop/conf/mapred-queue-acls.xml
                  • Defines the ACL structure for accessing MapRed; normally not necessary
              • /etc/hadoop/conf/slaves
                  • Lists the TaskTrackers in this cluster
  • 16. Hadoop Components Breakdown
      • YARN
          • Same structure as MapRed (lives on top of it)
          • Configuration files
              • /etc/hadoop/conf/yarn-site.xml
                  • All required configuration for YARN
  • 17. Hadoop Cluster Breakdown
      • Namenode server
          • HDFS Namenode
          • HBase Master
      • Secondary Namenode server
          • HDFS Secondary Namenode
      • JobTracker server
          • MapRed JobTracker
          • MapRed HistoryServer
  • 18. Hadoop Cluster Breakdown
      • Datanode server
          • HDFS Datanode
          • HBase RegionServer
          • MapRed TaskTracker
  • 19. Hadoop Hardware Requirements
      • Namenode server
          • Redundant power supplies
          • RAID1 drives
          • Enough memory (16 GB)
      • Secondary Namenode server
          • Almost none
  • 20. Hadoop Hardware Requirements
      • JobTracker server
          • Redundant power supplies
          • RAID1 drives
          • Enough memory (16 GB)
      • Datanode server
          • Lots of cheap disk (no RAID)
          • Lots of memory (32 GB)
          • Lots of CPU
  • 21. Hadoop Default Ports
      • HDFS
          • 8020: HDFS Namenode
          • 50010: HDFS Datanode data transfer
      • MapRed
          • No defaults
      • HBase
          • 60010: Master
          • 60020: Regionserver
  • 22. Hadoop HDFS Workflow (diagram)
  • 23. Hadoop MapRed Workflow (diagram)
  • 24. Hadoop MapRed Workflow (diagram, continued)
  • 25. Flume
      • Transports streams of data from point A to point B
      • Source
          • Where the data is read from
      • Channel
          • How the data is buffered
      • Sink
          • Where the data is written
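    The source → channel → sink pipeline can be sketched as three small pieces wired together. This is a toy model in the spirit of Flume's agent design, not Flume's actual API; all names are illustrative:

    ```python
    from collections import deque

    class MemoryChannel:
        """Buffers events between source and sink, like Flume's memory channel."""

        def __init__(self):
            self._queue = deque()

        def put(self, event):
            self._queue.append(event)

        def take(self):
            return self._queue.popleft() if self._queue else None

    def run_agent(source_events, channel, sink):
        # Source: read events and put them on the channel
        for event in source_events:
            channel.put(event)
        # Sink: drain the channel and write each event out
        while (event := channel.take()) is not None:
            sink.append(event)

    sink = []
    run_agent(["line 1", "line 2"], MemoryChannel(), sink)
    print(sink)  # ['line 1', 'line 2']
    ```

    The channel is the key design choice: because source and sink only ever talk to the channel, either end can fail and retry without the other noticing.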
  • 26. Flume
      • Flume is fault tolerant
      • Sources keep pointers into their input
          • With some exceptions, most sources are in a known state
      • Channels can be fault tolerant
          • A channel written to disk can recover from where it left off
      • Sinks can be redundant
          • More than one sink for the same data
          • Data is serialised and deduplicated using Avro
  • 27. Flume (diagram)
  • 28. Flume
      • Configuration files
          • /etc/flume-ng/conf/flume.conf
              • Defines the agent configuration: source, channel, sink
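    A flume.conf wiring an exec source through a memory channel to an HDFS sink might look like the sketch below. The agent name `agent`, the log path, and the HDFS path are illustrative:

    ```
    agent.sources = tail1
    agent.channels = mem1
    agent.sinks = hdfs1

    # Source: tail a log file
    agent.sources.tail1.type = exec
    agent.sources.tail1.command = tail -F /var/log/app.log
    agent.sources.tail1.channels = mem1

    # Channel: buffer events in memory
    agent.channels.mem1.type = memory

    # Sink: write events into HDFS
    agent.sinks.hdfs1.type = hdfs
    agent.sinks.hdfs1.hdfs.path = hdfs://namenode:8020/flume/events
    agent.sinks.hdfs1.channel = mem1
    ```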
  • 29. Flume (diagram)
  • 30. Hadoop Recommended Reads
  • 31. Hadoop References
      • Hadoop
          • http://hadoop.apache.org/docs/stable/cluster_setup.html
          • http://rc.cloudera.com/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
          • http://pig.apache.org/docs/r0.7.0/setup.html
          • http://wiki.apache.org/hadoop/NameNodeFailover
      • HBase
          • http://hbase.apache.org/book/book.html
      • Flume
          • http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html
  • 32. Questions?