Hadoop operations

Lynx Consultants training about Hadoop Operations


Usage Rights

CC Attribution License

Hadoop operations Presentation Transcript

  • 1. Marc Cluet – Lynx Consultants: How Hadoop Works
  • 2. What we'll cover
      • Understand Hadoop in detail
      • See how Hadoop works operationally
      • Be able to start asking the right questions of your data
    Lynx Consultants © 2013
  • 3. Hadoop Distributions
      • Cloudera CDH
      • Hortonworks
      • MapR
  • 4. Hadoop Components
      • HDFS
      • HBase
      • MapRed
      • YARN
  • 5. Hadoop Components
      • HDFS
          • Hadoop Distributed File System
          • Everything sits on top of it
          • Keeps 3 copies of every block by default
      • HBase
      • MapRed
      • YARN
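Because every block is stored three times by default, raw disk capacity and usable capacity differ by the replication factor. A minimal sketch of that arithmetic follows; the function names are illustrative, and it ignores space reserved for temporary data and the operating system.

```python
def usable_capacity_tb(raw_tb: float, replication: int = 3) -> float:
    """Approximate usable HDFS capacity for a given amount of raw disk."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return raw_tb / replication


def raw_needed_tb(data_tb: float, replication: int = 3) -> float:
    """Raw disk needed to hold data_tb of data at the given replication."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return data_tb * replication


if __name__ == "__main__":
    # 30 TB of raw disk holds roughly 10 TB of data at replication 3.
    print(usable_capacity_tb(30))
    # Storing 4 TB of data needs roughly 12 TB of raw disk.
    print(raw_needed_tb(4))
```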
  • 6. Hadoop Components
      • HDFS
      • HBase
          • Hadoop schemaless database
          • Key-value store
          • Sits on top of HDFS
      • MapRed
      • YARN
  • 7. Hadoop Components
      • HDFS
      • HBase
      • MapRed
          • Hadoop Map/Reduce
          • Non-pluggable, archaic
          • Requires HDFS for temporary storage
      • YARN
  • 8. Hadoop Components
      • HDFS
      • HBase
      • MapRed
      • YARN
          • Hadoop Map/Reduce version 2.0
          • Pluggable, you can add your own
          • Fast and less memory-hungry
  • 9. Hadoop Component Breakdown
      • All these components divide themselves into
          • client/server
          • master/slave scenarios
      • We will now check each individual component breakdown
  • 10. Hadoop Components Breakdown
      • HDFS
          • Master Namenode
              • Keeps track of all file allocation on the Datanodes
              • Rebalances (re-replicates) data if one of the Datanodes goes down
              • Is rack aware
          • Secondary Namenode
              • Performs housekeeping (checkpointing) for the Namenode
              • Does not necessarily need a separate server from the Namenode
          • Datanode
              • Stores the data
              • Non-RAID disks are preferable, for extra I/O speed
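The Namenode's bookkeeping role can be pictured with a toy model: it holds a map from blocks to the Datanodes storing them, and when a Datanode disappears it schedules new copies elsewhere. The class below is only a sketch under simplifying assumptions (no rack awareness, no real I/O, least-loaded placement chosen for illustration); `ToyNamenode` and the node names are hypothetical and not part of Hadoop.

```python
class ToyNamenode:
    """Toy block map: tracks which Datanodes hold each block."""

    def __init__(self, datanodes, replication=3):
        self.replication = replication
        self.live = set(datanodes)
        self.block_map = {}  # block id -> set of Datanode names

    def _load(self, node):
        # Number of blocks currently assigned to this Datanode.
        return sum(1 for holders in self.block_map.values() if node in holders)

    def _least_loaded(self, count, exclude):
        # Pick the least-loaded live Datanodes that do not already hold the block.
        candidates = sorted(self.live - exclude, key=lambda n: (self._load(n), n))
        return candidates[:max(0, count)]

    def add_block(self, block_id):
        targets = self._least_loaded(self.replication, exclude=set())
        self.block_map[block_id] = set(targets)
        return targets

    def datanode_failed(self, node):
        """Drop the node and re-replicate every block it held."""
        self.live.discard(node)
        repaired = {}
        for block_id, holders in self.block_map.items():
            if node in holders:
                holders.discard(node)
                missing = self.replication - len(holders)
                new_targets = self._least_loaded(missing, exclude=holders)
                holders.update(new_targets)
                repaired[block_id] = new_targets
        return repaired


if __name__ == "__main__":
    nn = ToyNamenode(["dn1", "dn2", "dn3", "dn4"])
    nn.add_block("blk_1")
    print(nn.datanode_failed("dn1"))
```

In a real cluster the placement policy also takes racks into account, so that the copies of a block do not all sit in the same rack.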
  • 11. Hadoop Components Breakdown
      • HDFS
          • How to access
              • A client can connect with the hadoop client to hdfs://namenode:8020
              • Supports the basic Unix-style file commands (such as ls, cat, mkdir and rm)
          • Configuration files
              • /etc/hadoop/conf/core-site.xml
                  • Defines the major configuration, such as the HDFS Namenode and default parameters
              • /etc/hadoop/conf/hdfs-site.xml
                  • Defines configuration specific to the Namenode or Datanode, such as file locations
              • /etc/hadoop/conf/slaves
                  • Defines the list of servers that are available in this cluster
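Hadoop's `*-site.xml` files share one simple layout: a `<configuration>` element containing `<property>` entries with a `<name>` and a `<value>`. The sketch below reads such a file into a dictionary. The sample content is an assumption made for illustration: it uses the `fs.defaultFS` property name from Hadoop 2 (older releases use `fs.default.name`) together with the Namenode address from the slide.

```python
import xml.etree.ElementTree as ET

# Illustrative core-site.xml content; not taken from a real cluster.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>
"""


def read_hadoop_config(xml_text: str) -> dict:
    """Parse Hadoop-style XML configuration into a name -> value dict."""
    root = ET.fromstring(xml_text)
    config = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            config[name.strip()] = (value or "").strip()
    return config


if __name__ == "__main__":
    print(read_hadoop_config(SAMPLE_CORE_SITE)["fs.defaultFS"])
```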
  • 12. Hadoop Components Breakdown
      • HBase
          • Master
              • Controls the HBase cluster, knows where the data is allocated, and provides a client listening socket using Thrift and/or a RESTful API
          • Regionserver
              • HBase node; stores part of the data in one of the regions, roughly equivalent to sharding
          • Thrift / REST
              • Interface to connect to HBase
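The comparison with sharding can be made concrete: each region covers a contiguous range of row keys, and a lookup finds the region whose start key is the largest one not greater than the row key. The sketch below assumes that simplified model; `RegionMap` and the server names are hypothetical, not HBase APIs.

```python
import bisect


class RegionMap:
    """Maps a row key to the server holding the region that contains it."""

    def __init__(self, regions):
        # regions: (start_key, server) pairs; the first region starts at "".
        self.regions = sorted(regions)
        self.start_keys = [start for start, _ in self.regions]

    def locate(self, row_key: str) -> str:
        # Find the last region whose start key is <= row_key.
        idx = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.regions[idx][1]


if __name__ == "__main__":
    regions = RegionMap([("", "rs1"), ("g", "rs2"), ("p", "rs3")])
    for key in ("apple", "grape", "zebra"):
        print(key, "->", regions.locate(key))
```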
  • 13. Hadoop Components Breakdown
      • HBase
          • How to access
              • Through the HBase client (using Thrift)
              • Through the RESTful API
          • Configuration files
              • /etc/hbase/conf/hbase-site.xml
                  • Defines all the basic configuration for accessing HBase
              • /etc/hbase/conf/hbase-policy.xml
                  • Defines all the security (ACL) and all the HBase memory tweaks
              • /etc/hbase/conf/regionservers
                  • Lists all the regionservers available to this cluster
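When rows are read through the RESTful API as JSON, row keys, column names, and values arrive base64-encoded. The sketch below decodes such a response. The payload layout (`Row`, `key`, `Cell`, `column`, `$`) is an assumption based on the HBase REST gateway's JSON format and should be checked against the version in use; the sample row is made up for illustration.

```python
import base64
import json


def _b64(text: str) -> str:
    """Helper used only to build the illustrative sample payload."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_rows(payload: str) -> dict:
    """Turn a REST-style JSON row payload into {row_key: {column: value}}."""
    result = {}
    for row in json.loads(payload).get("Row", []):
        key = base64.b64decode(row["key"]).decode("utf-8")
        cells = {}
        for cell in row.get("Cell", []):
            column = base64.b64decode(cell["column"]).decode("utf-8")
            cells[column] = base64.b64decode(cell["$"]).decode("utf-8")
        result[key] = cells
    return result


# Hypothetical response for a single row with one cell.
SAMPLE_PAYLOAD = json.dumps({
    "Row": [{
        "key": _b64("user1"),
        "Cell": [{"column": _b64("info:name"), "timestamp": 1, "$": _b64("Marc")}],
    }]
})

if __name__ == "__main__":
    print(decode_rows(SAMPLE_PAYLOAD))
```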
  • 14. Hadoop Components Breakdown
      • MapRed
          • JobTracker
              • Creates the Map/Reduce jobs
              • Stores all the intermediate data
              • Keeps track of all previous results through the HistoryServer
          • TaskTracker
              • Executes the tasks related to the Map/Reduce job
              • Very CPU and memory intensive
              • Stores intermediate results, which are then pushed to the JobTracker
  • 15. Hadoop Components Breakdown
      • MapRed
          • How to access
              • Through the Hadoop client
              • Through any MapRed client such as Pig or Hive
              • Through your own Java code
          • Configuration files
              • /etc/hadoop/conf/mapred-site.xml
                  • Defines how to contact this MapRed cluster
              • /etc/hadoop/conf/mapred-queue-acls.xml
                  • Defines the ACL structure for accessing MapRed; normally not necessary
              • /etc/hadoop/conf/slaves
                  • Defines the list of TaskTrackers in this cluster
  • 16. Hadoop Components Breakdown
      • YARN
          • Same structure as MapRed (lives on top of it)
          • Configuration files
              • /etc/hadoop/conf/yarn-site.xml
                  • All required configuration for YARN
  • 17. Hadoop Cluster Breakdown
      • Namenode Server
          • HDFS Namenode
          • HBase Master
      • Secondary Namenode Server
          • HDFS Secondary Namenode
      • JobTracker Server
          • MapRed JobTracker
          • MapRed History Server
  • 18. Hadoop Cluster Breakdown
      • Datanode Server
          • HDFS Datanode
          • HBase RegionServer
          • MapRed TaskTracker
  • 19. Hadoop Hardware Requirements
      • Namenode Server
          • Redundant power supplies
          • RAID1 drives
          • Enough memory (16 GB)
      • Secondary Namenode Server
          • Almost none
  • 20. Hadoop Hardware Requirements
      • JobTracker Server
          • Redundant power supplies
          • RAID1 drives
          • Enough memory (16 GB)
      • Datanode Server
          • Lots of cheap disk (no RAID)
          • Lots of memory (32 GB)
          • Lots of CPU
  • 21. Hadoop Default Ports
      • HDFS
          • 8020: HDFS Namenode
          • 50010: HDFS Datanode FS transfer
      • MapRed
          • No defaults
      • HBase
          • 60010: Master
          • 60020: Regionserver
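A quick way to use the port list operationally is a reachability check against each service. The sketch below only tests whether a TCP connection can be opened; the port numbers come from the slide, while the host names passed in are whatever your cluster uses (the ones in the example are hypothetical).

```python
import socket

# Default ports as listed on the slide.
DEFAULT_PORTS = {
    "HDFS Namenode": 8020,
    "HDFS Datanode transfer": 50010,
    "HBase Master": 60010,
    "HBase Regionserver": 60020,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_cluster(hosts_by_service: dict, timeout: float = 1.0) -> dict:
    """Check each service for which a host was supplied."""
    return {
        service: is_port_open(hosts_by_service[service], port, timeout)
        for service, port in DEFAULT_PORTS.items()
        if service in hosts_by_service
    }


if __name__ == "__main__":
    # Hypothetical host names; replace with the servers in your cluster.
    print(check_cluster({"HDFS Namenode": "namenode", "HBase Master": "namenode"}))
```

An open port only shows that something is listening; it does not confirm that the daemon behind it is healthy.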
  • 22. Hadoop HDFS Workflow
  • 23. Hadoop MapRed Workflow
  • 24. Hadoop MapRed Workflow
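The workflow slides are diagrams that the transcript does not carry over. As a stand-in, the general shape of a Map/Reduce job (map, then group by key, then reduce) can be sketched in a few lines of single-process Python. This is a word-count illustration of the programming model only, not of how the JobTracker and TaskTrackers distribute the work.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["Hadoop stores data", "Hadoop processes data"]))
```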
  • 25. Flume
      • Transports streams of data from point A to point B
      • Source
          • Where the data is read from
      • Channel
          • How the data is buffered
      • Sink
          • Where the data is written
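The source, channel, and sink roles can be mimicked with a small in-memory pipeline: events come from an iterable (the source), wait in a bounded buffer (the channel), and are handed to a callable (the sink). This is a conceptual sketch only; `MemoryChannel` and `run_agent` are hypothetical names, and real Flume agents run these parts concurrently with transactions between them.

```python
from collections import deque


class MemoryChannel:
    """Bounded in-memory buffer between a source and a sink."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.events = deque()

    def is_full(self) -> bool:
        return len(self.events) >= self.capacity

    def is_empty(self) -> bool:
        return not self.events

    def put(self, event) -> None:
        self.events.append(event)

    def take(self):
        return self.events.popleft()


def run_agent(source, channel, sink) -> None:
    """Move every event from the source through the channel to the sink."""
    for event in source:
        if channel.is_full():
            # Make room by delivering the oldest buffered event first.
            sink(channel.take())
        channel.put(event)
    while not channel.is_empty():
        sink(channel.take())


if __name__ == "__main__":
    delivered = []
    run_agent(iter(["event-1", "event-2", "event-3"]), MemoryChannel(2), delivered.append)
    print(delivered)
```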
  • 26. Flume
      • Flume is fault tolerant
      • Sources keep a pointer to their position
          • With some exceptions, but most sources are in a known state
      • Channels can be fault tolerant
          • A channel written to disk can recover from where it left off
      • Sinks can be redundant
          • More than one sink for the same data
          • Data is serialised and deduplicated using Avro
  • 27. Flume
  • 28. Flume
      • Configuration files
          • /etc/flume-ng/conf/flume.conf
              • Defines the agent configuration with source, channel, and sink
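The agent configuration is a properties file that names the sources, channels, and sinks and then wires them together. The sketch below renders a minimal one from Python. The agent and component names (`agent1`, `r1`, `c1`, `k1`), the bind address, and the port are arbitrary choices for illustration; the netcat source, memory channel, and logger sink are used here only because they need no external systems.

```python
def render_flume_conf(agent: str = "agent1", source: str = "r1",
                      channel: str = "c1", sink: str = "k1") -> str:
    """Render a minimal single-source, single-sink Flume agent configuration."""
    lines = [
        # Declare the components that belong to this agent.
        f"{agent}.sources = {source}",
        f"{agent}.channels = {channel}",
        f"{agent}.sinks = {sink}",
        # Source: reads lines of text from a TCP port.
        f"{agent}.sources.{source}.type = netcat",
        f"{agent}.sources.{source}.bind = localhost",
        f"{agent}.sources.{source}.port = 44444",
        # Channel: buffers events in memory.
        f"{agent}.channels.{channel}.type = memory",
        # Sink: writes events to the agent's log.
        f"{agent}.sinks.{sink}.type = logger",
        # Wiring: a source may feed several channels, a sink reads from one.
        f"{agent}.sources.{source}.channels = {channel}",
        f"{agent}.sinks.{sink}.channel = {channel}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_flume_conf())
```

Note the asymmetry in the wiring lines: the source property is `channels` (plural) while the sink property is `channel` (singular).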
  • 29. Flume
  • 30. Hadoop Recommended Reads
  • 31. Hadoop References
      • Hadoop
          • http://hadoop.apache.org/docs/stable/cluster_setup.html
          • http://rc.cloudera.com/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
          • http://pig.apache.org/docs/r0.7.0/setup.html
          • http://wiki.apache.org/hadoop/NameNodeFailover
      • HBase
          • http://hbase.apache.org/book/book.html
      • Flume
          • http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html
  • 32. Questions?