Hadoop operations

Lynx Consultants training about Hadoop Operations

1. How Hadoop Works
Marc Cluet, Lynx Consultants

2. What we'll cover
- Understand Hadoop in detail
- See how Hadoop works operationally
- Be able to start asking the right questions of your data
Lynx Consultants © 2013

3. Hadoop Distributions
- Cloudera CDH
- Hortonworks
- MapR

4. Hadoop Components
- HDFS
- HBase
- MapRed
- YARN

5. Hadoop Components
- HDFS
  - Hadoop Distributed File System
  - Everything sits on top of it
  - Keeps three copies of every block by default
- HBase
- MapRed
- YARN
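The default block replication can be seen (or overridden) in hdfs-site.xml; a minimal sketch, assuming the CDH-style configuration path used later in these slides:

```xml
<!-- /etc/hadoop/conf/hdfs-site.xml (fragment) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- three copies of every block: the default -->
    <value>3</value>
  </property>
</configuration>
```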
6. Hadoop Components
- HDFS
- HBase
  - Hadoop's schemaless database
  - Key-value store
  - Sits on top of HDFS
- MapRed
- YARN

7. Hadoop Components
- HDFS
- HBase
- MapRed
  - Hadoop Map/Reduce
  - Non-pluggable, archaic
  - Requires HDFS for temporary storage
- YARN

8. Hadoop Components
- HDFS
- HBase
- MapRed
- YARN
  - Hadoop Map/Reduce version 2.0
  - Pluggable: you can add your own frameworks
  - Fast and less memory-hungry

9. Hadoop Component Breakdown
- All these components divide into client/server and master/slave roles
- We will now look at each individual component's breakdown

10. Hadoop Components Breakdown
- HDFS
  - Master Namenode
    - Keeps track of all file allocation on Datanodes
    - Rebalances data if one of the datanodes goes down
    - Is rack-aware
  - Secondary Namenode
    - Performs cleanup services for the Namenode
    - Not necessarily a separate server
  - Datanode
    - Stores the data
    - Better to use non-RAID disks for extra I/O speed

11. Hadoop Components Breakdown
- HDFS
  - How to access
    - Clients can connect with the hadoop client to hdfs://namenode:8020
    - Supports all basic Unix commands
  - Configuration files
    - /etc/hadoop/conf/core-site.xml
      - Defines major configuration such as the HDFS namenode and default parameters
    - /etc/hadoop/conf/hdfs-site.xml
      - Defines configuration specific to the namenode or datanodes, such as file locations
    - /etc/hadoop/conf/slaves
      - Defines the list of servers available in this cluster
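A few of the Unix-style commands the hadoop client supports, as a sketch (the namenode hostname and file paths are placeholders, and a running cluster is assumed):

```shell
# List the HDFS root directory
hadoop fs -ls hdfs://namenode:8020/

# Create a directory and copy a local file into it
hadoop fs -mkdir /data
hadoop fs -put localfile.txt /data/

# Read the file back
hadoop fs -cat /data/localfile.txt
```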
12. Hadoop Components Breakdown
- HBase
  - Master
    - Controls the HBase cluster, knows where the data is allocated, and provides a client listening socket using Thrift and/or a RESTful API
  - RegionServer
    - HBase node; stores part of the data in one of the regions (roughly equivalent to sharding)
  - Thrift / REST
    - Interfaces for connecting to HBase

13. Hadoop Components Breakdown
- HBase
  - How to access
    - Through the HBase client (using Thrift)
    - Through the RESTful API
  - Configuration files
    - /etc/hbase/conf/hbase-site.xml
      - Defines all the basic configuration for accessing HBase
    - /etc/hbase/conf/hbase-policy.xml
      - Defines the security (ACLs) and the HBase memory tweaks
    - /etc/hbase/conf/regionservers
      - Lists all the regionservers available to this cluster
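Access through the RESTful API might look like this sketch (assumes the REST gateway is running; the hostname, port, table name, and row key are illustrative placeholders):

```shell
# Ask the REST gateway for its version
curl http://hbase-master:8080/version

# Fetch a row from a hypothetical table "users" by row key
curl -H "Accept: application/json" \
  http://hbase-master:8080/users/row1
```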
14. Hadoop Components Breakdown
- MapRed
  - JobTracker
    - Creates the Map/Reduce jobs
    - Stores all the intermediate data
    - Keeps track of all previous results through the HistoryServer
  - TaskTracker
    - Executes the tasks of a Map/Reduce job
    - Very CPU- and memory-intensive
    - Stores intermediate results, which are then pushed to the JobTracker
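The map/shuffle/reduce flow that the JobTracker coordinates across TaskTrackers can be sketched locally as a word count. This is a hypothetical in-memory simulation of the dataflow, not Hadoop API code:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) pairs for each word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

lines = ["hadoop stores data", "hadoop processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In the real cluster the map and reduce functions run on different TaskTrackers and the shuffle moves data over the network, but the three phases are the same.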
15. Hadoop Components Breakdown
- MapRed
  - How to access
    - Through the Hadoop client
    - Through any MapRed client, such as Pig or Hive
    - Through your own Java code
  - Configuration files
    - /etc/hadoop/conf/mapred-site.xml
      - Defines how to contact this MapRed cluster
    - /etc/hadoop/conf/mapred-queue-acls.xml
      - Defines the ACL structure for accessing MapRed; normally not necessary
    - /etc/hadoop/conf/slaves
      - Defines the list of TaskTrackers in this cluster
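A minimal mapred-site.xml sketch showing how a client finds the JobTracker (the hostname is a placeholder; 8021 is a common CDH choice, not a universal default):

```xml
<!-- /etc/hadoop/conf/mapred-site.xml (fragment) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- hostname:port of the JobTracker for this cluster -->
    <value>jobtracker:8021</value>
  </property>
</configuration>
```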
16. Hadoop Components Breakdown
- YARN
  - Same structure as MapRed (lives on top of it)
  - Configuration files
    - /etc/hadoop/conf/yarn-site.xml
      - All required configuration for YARN
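A yarn-site.xml sketch with two commonly set properties (the ResourceManager hostname is a placeholder):

```xml
<!-- /etc/hadoop/conf/yarn-site.xml (fragment) -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager</value>
  </property>
  <property>
    <!-- lets MapReduce jobs shuffle their intermediate data -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```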
17. Hadoop Cluster Breakdown
- Namenode server
  - HDFS Namenode
  - HBase Master
- Secondary Namenode server
  - HDFS Secondary Namenode
- JobTracker server
  - MapRed JobTracker
  - MapRed History Server

18. Hadoop Cluster Breakdown
- Datanode server
  - HDFS Datanode
  - HBase RegionServer
  - MapRed TaskTracker

19. Hadoop Hardware Requirements
- Namenode server
  - Redundant power supplies
  - RAID1 drives
  - Enough memory (16 GB)
- Secondary Namenode server
  - Almost none

20. Hadoop Hardware Requirements
- JobTracker server
  - Redundant power supplies
  - RAID1 drives
  - Enough memory (16 GB)
- Datanode server
  - Lots of cheap disk (no RAID)
  - Lots of memory (32 GB)
  - Lots of CPU

21. Hadoop Default Ports
- HDFS
  - 8020: HDFS Namenode
  - 50010: HDFS Datanode FS transfer
- MapRed
  - No defaults
- HBase
  - 60010: Master
  - 60020: RegionServer
22. Hadoop HDFS Workflow (diagram slide)

23. Hadoop MapRed Workflow (diagram slide)

24. Hadoop MapRed Workflow (diagram slide, continued)
25. Flume
- Transports streams of data from point A to point B
- Source
  - Where the data is read from
- Channel
  - How the data is buffered
- Sink
  - Where the data is written

26. Flume
- Flume is fault-tolerant
- Sources keep a pointer to their read position
  - With some exceptions, but most sources are in a known state
- Channels can be fault-tolerant
  - A channel written to disk can recover from where it left off
- Sinks can be redundant
  - More than one sink can handle the same data
  - Data is serialised and deduplicated using Avro

27. Flume (diagram slide)
28. Flume
- Configuration files
  - /etc/flume-ng/conf/flume.conf
    - Defines the agent configuration with source, channel, and sink
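A flume.conf agent definition might look like this sketch. The agent name, paths, and the choice of exec source, file channel, and HDFS sink are illustrative assumptions:

```
# Hypothetical agent "agent1" with one source, one channel, one sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: tail a log file
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

# Channel: file-backed, so it can recover after a crash
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDirs = /var/flume/data

# Sink: write events into HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.sink1.channel = ch1
```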
29. Flume (diagram slide)

30. Hadoop Recommended Reads (diagram slide)

31. Hadoop References
- Hadoop
  - http://hadoop.apache.org/docs/stable/cluster_setup.html
  - http://rc.cloudera.com/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
  - http://pig.apache.org/docs/r0.7.0/setup.html
  - http://wiki.apache.org/hadoop/NameNodeFailover
- HBase
  - http://hbase.apache.org/book/book.html
- Flume
  - http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html

32. Questions?
Lynx Consultants © 2013