HDFS NameNode High Availability

The HDFS NameNode is a robust and reliable service as seen in practice in production at Yahoo and other customers. However, the NameNode does not have automatic failover support. A hot failover solution called HA NameNode is currently under active development (HDFS-1623). This talk will cover the architecture, design and setup. We will also discuss the future direction for HA NameNode.



Transcript

  • 1. HDFS High Availability – Suresh Srinivas (Hortonworks), Aaron T. Myers (Cloudera)
  • 2. Overview
    •  Part 1 – Suresh Srinivas (Hortonworks)
       − HDFS Availability and Reliability – what is the record?
       − HA Use Cases
       − HA Design
    •  Part 2 – Aaron T. Myers (Cloudera)
       − NN HA Design Details
         ✓ Automatic failure detection and NN failover
         ✓ Client-NN connection failover
       − Operations and Admin of HA
       − Future Work
  • 3. Availability, Reliability and Maintainability
    Reliability = MTBF/(1 + MTBF)
    •  Probability that a system performs its functions without failure for a desired period of time
    Maintainability = 1/(1 + MTTR)
    •  Probability that a failed system can be restored within a given timeframe
    Availability = MTTF/MTBF
    •  Probability that a system is up when requested for use
    •  Depends on both Reliability and Maintainability
    Mean Time To Failure (MTTF): average time a system runs before it fails
    Mean Time To Repair/Restore (MTTR): average time to repair a failed system
    Mean Time Between Failures (MTBF): average time between successive failures = MTTR + MTTF
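    To make the relationship concrete, here is a small worked example with hypothetical numbers (not figures from the talk): a service that runs 1000 hours between failures and takes 1 hour to restore.

    \[ \mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR} = 1000\,\mathrm{h} + 1\,\mathrm{h} = 1001\,\mathrm{h} \]
    \[ \text{Availability} = \frac{\mathrm{MTTF}}{\mathrm{MTBF}} = \frac{1000}{1001} \approx 0.999 \]

    Even this very reliable service is down roughly 0.1% of the time, about 8.8 hours per year; when failures are already rare, it is shrinking MTTR (for example with a hot standby and automatic failover) that pushes availability up.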
  • 4. Current HDFS Availability & Data Integrity
    •  Simple design for higher reliability
       − Storage: rely on the native file system on the OS rather than using raw disk
       − Single NameNode master
         ✓ Entire file system state is in memory
       − DataNodes simply store and deliver blocks
         ✓ All sophisticated recovery mechanisms are in the NN
    •  Fault tolerance
       − Design assumes disks, nodes and racks fail
       − Multiple replicas of blocks
         ✓ Active monitoring and replication
         ✓ DNs actively monitor for block deletion and corruption
       − Restart/migrate the NameNode on failure
         ✓ Persistent state: multiple copies + checkpoints
         ✓ Functions as a cold standby
       − Restart/replace the DNs on failure
       − DNs tolerate individual disk failures
  • 5. How Well Did HDFS Work?
    •  Data Reliability
       − Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009
       − Seven 9's of reliability
       − Related bugs fixed in releases 0.20 and 0.21
    •  NameNode Availability
       − 18-month study: 22 failures on 25 clusters – 0.58 failures per year per cluster
       − Only 8 would have benefited from HA failover!! (0.23 failures per cluster year)
       − NN is very reliable
         ✓ Resilient against overload caused by misbehaving apps
    •  Maintainability
       − Large clusters see the failure of one DataNode per day and more frequent disk failures
       − Maintenance once in 3 months to repair or replace DataNodes
  • 6. Why NameNode HA?
    •  NameNode is highly reliable (high MTTF)
       − But Availability is not the same as Reliability
    •  NameNode MTTR depends on
       − Restarting the NameNode daemon on failure
         ✓ Operator restart – (failure detection + manual restore) time
         ✓ Automatic restart – 1-2 minutes
       − NameNode startup time
         ✓ Small/medium cluster: 1-2 minutes
         ✓ Very large cluster: 5-15 minutes
    •  Affects applications that have real-time requirements
    •  For higher HDFS Availability
       − Need a redundant NameNode to eliminate the SPOF
       − Need automatic failover to reduce MTTR and improve Maintainability
       − Need a hot standby to reduce MTTR for very large clusters
         ✓ A cold standby is sufficient for small clusters
  • 7. NameNode HA – Initial Goals
    •  Support for an Active and a single Standby
       − Active and Standby with manual failover
         ✓ Standby could be cold/warm/hot
         ✓ Addresses downtime during upgrades – the main cause of unavailability
       − Active and Standby with automatic failover
         ✓ Hot standby
         ✓ Addresses downtime during upgrades and other failures
    •  Backward-compatible configuration
    •  Standby performs checkpointing
       − Secondary NameNode not needed
    •  Management and monitoring tools
    •  Design philosophy – choose data integrity over service availability
  • 8. High Level Use Cases
    •  Planned downtime
       − Upgrades
       − Config changes
       − Main reason for downtime
    •  Unplanned downtime
       − Hardware failure
       − Software failures
       − Occurs infrequently
    •  Supported failures
       − Single hardware failure
         ✓ Double hardware failure not supported
       − Some software failures
         ✓ Server unresponsive
         ✓ Same software failure affects both active and standby
  • 9. High Level Design
    •  Service monitoring and leader election outside the NN
       − Similar to industry standard HA frameworks
    •  Parallel block reports to both Active and Standby NN
    •  Shared or non-shared NN file system state
    •  Fencing of shared resources/data
       − DataNodes
       − Shared NN state (if any)
    •  Client failover
       − Client-side failover (based on configuration or ZooKeeper)
       − IP failover
  • 10. Design Considerations
    •  Sharing state between Active and Hot Standby
       − File system state and block locations
    •  Automatic failover
       − Monitoring the Active NN and performing failover on failure
    •  Making a NameNode active during startup
       − Reliable mechanism for choosing only one NN as active and the other as standby
    •  Prevent data corruption on split brain
       − Shared resource fencing
         ✓ DataNodes and shared storage for NN metadata
       − NameNode fencing
         ✓ When the shared resource cannot be fenced
    •  Client failover
       − Clients connect to the new Active NN during failover
  • 11. Failover Control Outside NN
    •  Similar to industry standard HA frameworks
    •  HA daemon outside the NameNode
       − Simpler to build
       − Immune to NN failures
    •  Daemon manages resources
       − Resources – OS, HW, network, etc.
       − NameNode is just another resource
    •  Performs
       − Active NN election during startup
       − Automatic failover
       − Fencing
         ✓ Shared resources
         ✓ NameNode
    [Diagram: a Failover Controller, coordinated via ZooKeeper, manages resources (start, stop, failover, monitor, ...), including shared resources]
  • 12. Architecture
    [Diagram: a ZooKeeper ensemble (ZK x3) provides leader election; a Failover Controller beside each NameNode monitors its health and issues commands; the Active NN writes editlogs and the Standby NN reads them, with fencing; DataNodes send block reports to both the Active and the Standby NN]
  • 13. First Phase – Hot Standby
    [Diagram: Active and Standby NNs share editlogs via shared NFS storage, which itself needs to be HA; failover is manual; DataNodes send block reports to both NNs; DN fencing]
  • 14. HA Design Details
  • 15. Client Failover Design Details
    •  Smart clients (client-side failover)
       − Users use one logical URI; the client selects the correct NN to connect to
       − Clients know which operations are idempotent, and therefore safe to retry on a failover
       − Clients have configurable failover/retry strategies
    •  Current implementation
       − Client configured with the addresses of all NNs
    •  Other implementations in the future (more later)
  • 16. Client Failover Configuration Example
    ...
    <property>
      <name>dfs.namenode.rpc-address.name-service1.nn1</name>
      <value>host1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.name-service1.nn2</name>
      <value>host2.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.http-address.name-service1.nn1</name>
      <value>host1.example.com:50070</value>
    </property>
    ...
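    The snippet above lists only the per-NameNode addresses; in a full client-side failover setup the nameservice itself and a failover proxy provider are also named. A minimal sketch of those companion properties, reusing the example's name-service1/nn1/nn2 placeholders (property names as in the Hadoop 2.x HA configuration; treat the values as illustrative):

    <property>
      <name>dfs.nameservices</name>
      <value>name-service1</value>
    </property>
    <property>
      <!-- logical NN IDs within the nameservice; must match the address keys above -->
      <name>dfs.ha.namenodes.name-service1</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <!-- client-side class that retries and fails over between the configured NNs -->
      <name>dfs.client.failover.proxy.provider.name-service1</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>

    Clients then address the cluster by the logical URI (e.g. hdfs://name-service1/path) rather than a specific host, which is what makes the failover transparent to applications.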
  • 17. Automatic Failover Design Details
    •  Automatic failover requires ZooKeeper
       − Not required for manual failover
       − ZK makes it easy to:
         ✓ Detect failure of the active NN
         ✓ Determine which NN should become the Active NN
    •  On both NN machines, run another daemon
       − ZKFailoverController (ZooKeeper Failover Controller)
    •  Each ZKFC is responsible for:
       − Health monitoring of its associated NameNode
       − ZK session management / ZK-based leader election
    •  See HDFS-2185 and HADOOP-8206 for more details
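    A minimal sketch of how automatic failover is typically switched on (hedged; these are the standard Hadoop 2.x property names, and the ZooKeeper hosts are placeholders):

    <!-- hdfs-site.xml: have the ZKFCs, not the operator, coordinate failover -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>
    <!-- core-site.xml: ZooKeeper ensemble used by the ZKFCs for leader election -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>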
  • 18. Automatic Failover Design Details (cont)
  • 19. Ops/Admin: Shared Storage
    •  To share NN state, you need shared storage
       − Needs to be HA itself to avoid just shifting the SPOF
       − Many filers come with IP fencing options
       − Recommended mount options:
         ✓ tcp,soft,intr,timeo=60,retrans=10
    •  Still configure local edits dirs, but the shared dir is special
    •  Work is currently underway to do away with the shared storage requirement (more later)
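    For the NFS-based first phase, the shared directory is named by a single property in addition to the usual local storage directories. A hedged sketch (the mount path is a placeholder; both NNs must be able to mount it):

    <property>
      <!-- shared edits directory on the HA filer, used by both the Active and Standby NN -->
      <name>dfs.namenode.shared.edits.dir</name>
      <value>file:///mnt/filer1/namenode-shared</value>
    </property>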
  • 20. Ops/Admin: NN fencing
    •  Critical for correctness that only one NN is active at a time
    •  Out of the box
       − RPC to the active NN to tell it to go to standby (graceful failover)
       − SSH to the active NN and `kill -9` the NN
    •  Pluggable options
       − Many filers have protocols for IP-based fencing options
       − Many PDUs have protocols for IP-based plug-pulling (STONITH)
         ✓ Nuke the node from orbit. It's the only way to be sure.
    •  Configure extra options if available to you
       − Will be tried in order during a failover event
       − Escalate the aggressiveness of the method
       − Fencing is critical for correctness of NN metadata
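    These methods are configured as an ordered list. A hedged sketch combining the built-in sshfence method with a custom script, tried in that order (the script path and key file are placeholders):

    <property>
      <!-- fencing methods, tried in order during a failover; escalate aggressiveness down the list -->
      <name>dfs.ha.fencing.methods</name>
      <value>
        sshfence
        shell(/path/to/custom-fencing-script.sh)
      </value>
    </property>
    <property>
      <!-- key used by sshfence to log in to the old active NN and kill the process -->
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/home/hdfs/.ssh/id_rsa</value>
    </property>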
  • 21. Ops/Admin: Automatic Failover
    •  Deploy ZK as usual (3 or 5 nodes) or reuse an existing ZK
       − ZK daemons have light resource requirements
       − OK to collocate one on each NN; many collocate the 3rd on the YARN RM
       − Advisable to configure ZK daemons with dedicated disks for isolation
       − Fine to use the same ZK quorum as for HBase, etc.
    •  Fencing methods are still required
       − The ZKFC that wins the election is responsible for performing fencing
       − Fencing script(s) must be configured and work from the NNs
    •  Admin commands which manually initiate failovers still work
       − But rather than coordinating the failover themselves, they use the ZKFCs
  • 22. Ops/Admin: Monitoring
    •  New NN metrics
       − Size of pending DN message queues
       − Seconds since the standby NN last read from the shared edit log
       − DN block report lag
       − All measurements of standby NN lag – monitor/alert on all of these
    •  Monitor the shared storage solution
       − Volumes fill up, disks go bad, etc.
       − Should configure a paranoid edit log retention policy (default is 2)
    •  Canary-based monitoring of HDFS is a good idea
       − Pinging both NNs is not sufficient
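    A hedged sketch of a more paranoid retention setting, assuming the slide's "default is 2" refers to the number of retained checkpoints (property names as in Hadoop 2.x; the values are illustrative, not recommendations from the talk):

    <property>
      <!-- keep more checkpoint images than the default of 2 -->
      <name>dfs.namenode.num.checkpoints.retained</name>
      <value>12</value>
    </property>
    <property>
      <!-- keep extra edit log transactions beyond what the retained checkpoints require -->
      <name>dfs.namenode.num.extra.edits.retained</name>
      <value>10000000</value>
    </property>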
  • 23. Ops/Admin: Hardware
    •  Active/Standby NNs should be on separate racks
    •  The shared storage system should be on a separate rack
    •  Active/Standby NNs should have close to the same hardware
       − Same amount of RAM – need to store the same things
       − Same # of processors – need to serve the same number of clients
    •  All the same recommendations still apply for the NN
       − ECC memory, 48GB
       − Several separate disks for NN metadata directories
       − Redundant disks for OS drives, probably RAID 5 or mirroring
       − Redundant power
  • 24. Future Work
    •  Other options to share NN metadata
       − Journal daemons with the list of active JDs stored in ZK (HDFS-3092)
       − Journal daemons with quorum writes (HDFS-3077)
    •  More advanced client failover/load shedding
       − Serve stale reads from the standby NN
       − Speculative RPC
       − Non-RPC clients (IP failover, DNS failover, proxy, etc.)
       − Less client-side configuration (ZK, custom DNS records, HDFS-3043)
    •  Even higher HA
       − Multiple standby NNs
  • 25. QA
    •  HA design: HDFS-1623
       − First released in Hadoop 2.0.0-alpha
    •  Auto failover design: HDFS-3042 / HDFS-2185
       − First released in Hadoop 2.0.1-alpha
    •  Community effort
