What's New and Upcoming in HDFS - the Hadoop Distributed File System


Published on

Todd Lipcon gives shares at the Federal Big Data Forum what is new and upcoming in HDFS (Hadoop Distributed File System).

What's New and Upcoming in HDFS - the Hadoop Distributed File System

  1. 1. What’s  new  and  upcoming  in  HDFS   January  30,  2013   Todd  Lipcon,  SoAware  Engineer   todd@cloudera.com      @tlipcon  1
  2. 2. IntroducGons   •  SoAware  engineer  on  Cloudera’s  Storage  Engineering   team   •  CommiIer  and  PMC  Member  for  Apache  Hadoop  and   Apache  HBase   •  Projects  in  2012   •  Responsible  for  >50%  of  the  code  for  all  phases  of  HA   development   •  Also  worked  on  many  performance  and  stability   improvements   •  This  presentaGon  is  highly  technical  –  please  feel  free   to  grab/email  me  later  if  you’d  like  to  clarify  anything!   ©2013 Cloudera, Inc. All Rights2 Reserved.
  3. 3. Outline   •  HDFS  2.0  –  what’s  new  in  2012?   •  HA  Phase  1  (Q1  2012)   •  HA  Phase  2  (Q2-­‐Q4  2012)   •  Performance  improvements  and  other  new  features   •  What’s  coming  in  2013?   •  HDFS  Snapshots   •  BeIer  storage  density  and  file  formats   •  Caching  and  Hierarchical  Storage  Management   ©2013 Cloudera, Inc. All Rights3   Reserved.
  4. 4. HDFS  HA  Phase  1  Review   HDFS-­‐1623:  completed  March  2012  4
  5. 5. HDFS  HA  Background   •  HDFS’s  strength  is  its  simple  and  robust  design   •  Single  master  NameNode  maintains  all  metadata   •  Scales  to  mul4-­‐petabyte  clusters  easily  on  modern   hardware   •  TradiGonally,  the  single  master  was  also  a  single  point   of  failure   •  Generally  good  availability,  but  not  ops-­‐friendly   •  No  hot  patch  ability,  no  hot  reconfiguraGon   •  No  hot  hardware  replacement   •  Hadoop  is  now  mission  cri4cal:  SPOF  not  OK!   ©2013 Cloudera, Inc. All Rights5 Reserved.
  6. 6. HDFS  HA  Development  Phase  1   •  Completed  March  2012  (HDFS-­‐1623)   •  Introduced  the  StandbyNode,  a  hot  backup  for  the  HDFS   NameNode.   •  Relied  on  shared  storage  to  synchronize  namespace  state   •  (e.g.  a  NAS  filer  appliance)   •  Allowed  operators  to  manually  trigger  failover  to  the   Standby   •  Sufficient  for  many  HA  use  cases:  avoided  planned   down4me  for  hardware  and  soAware  upgrades,  planned   machine/OS  maintenance,  configuraGon  changes,  etc.   ©2013 Cloudera, Inc. All Rights6 Reserved.
  7. 7. HDFS  HA  Architecture  Phase  1   •  Parallel  block  reports  sent  to  AcGve  and  Standby   NameNodes   •  NameNode  state  shared  by  locaGng  edit  log  on  NAS   over  NFS   •  AcGve  NameNode  writes  while  Standby  Node  “tails”     •  Client  failover  done  via  client  configuraGon   •  Each  client  configured  with  the  address  of  both  NNs:  try   both  to  find  acGve   ©2013 Cloudera, Inc. All Rights7 Reserved.
  8. 8. HDFS  HA  Architecture  Phase  1   ©2013 Cloudera, Inc. All Rights8   Reserved.
  9. 9. Fencing  and  NFS   •  Must  avoid  split-­‐brain  syndrome   •  Both  nodes  think  they  are  acGve  and  try  to  write  to  the   same  edit  log.  Your  metadata  becomes  corrupt  and   requires  manual  intervenGon  to  restart   •  Configure  a  fencing  script   •  Script  must  ensure  that  prior  acGve  has  stopped  wriGng   •  STONITH:  shoot-­‐the-­‐other-­‐node-­‐in-­‐the-­‐head   •  Storage  fencing:  e.g  using  NetApp  ONTAP  API  to  restrict   filer  access   •  Fencing  script  must  succeed  to  have  a  successful   failover   ©2013 Cloudera, Inc. All Rights9 Reserved.
  10. 10. Shortcomings  of  Phase  1   •  Insufficient  to  protect  against  unplanned  down4me   •  Manual  failover  only:  requires  an  operator  to  step  in   quickly  aAer  a  crash   •  Various  studies  indicated  this  was  the  minority  of   downGme,  but  sGll  important  to  address   •  Requirement  of  a  NAS  device  made  deployment   complex,  expensive,  and  error-­‐prone     (we  always  knew  this  was  just  the  first  phase!) ©2013 Cloudera, Inc. All Rights10 Reserved.
  11. 11. HDFS  HA  Development  Phase  2   •  MulGple  new  features  for  high  availability   •  Automa4c  failover,  based  on  Apache  ZooKeeper   •  Remove  dependency  on  NAS  (network-­‐aIached  storage)   •  Address  new  HA  use  cases   •  Avoid  unplanned  downGme  due  to  soAware  or  hardware   faults   •  Deploy  in  filer-­‐less  environments   •  Completely  stand-­‐alone  HA  with  no  external  hardware  or   soAware  dependencies   •  no  Linux-­‐HA,  filers,  etc   ©2013 Cloudera, Inc. All Rights11 Reserved.
  12. 12. AutomaGc  Failover  Overview   HDFS-­‐3042:  completed  May  2012  12
  13. 13. AutomaGc  Failover  Goals   •  Automa4cally  detect  failure  of  the  AcGve  NameNode   •  Hardware,  soAware,  network,  etc.   •  Do  not  require  operator  interven4on  to  iniGate   failover   •  Once  failure  is  detected,  process  completes  automaGcally   •  Support  manually  ini4ated  failover  as  first-­‐class   •  Operators  can  sGll  trigger  failover  without  having  to  stop   AcGve   •  Do  not  introduce  a  new  SPOF   •  All  parts  of  auto-­‐failover  deployment  must  themselves  be   HA   ©2013 Cloudera, Inc. All Rights13 Reserved.
  14. 14. AutomaGc  Failover  Architecture   •  AutomaGc  failover  requires  ZooKeeper   •  Not  required  for  manual  failover   •  ZK  makes  it  easy  to:   •  Detect  failure  of  AcGve  NameNode   •  Determine  which  NameNode  should   become  the  AcGve  NN   ©2013 Cloudera, Inc. All Rights14   Reserved.
  15. 15. AutomaGc  Failover  Architecture   •  New  daemon:  ZooKeeper  Failover  Controller  (ZKFC)   •  In  an  auto  failover  deployment,  run  two  ZKFCs   •  One  per  NameNode,  on  that  NameNode  machine   •  ZKFC  has  three  simple  responsibili4es:   •  Monitors  health  of  associated  NameNode   •  ParGcipates  in  leader  elec4on  of  NameNodes   •  Fences  the  other  NameNode  if  it  wins  elecGon   ©2013 Cloudera, Inc. All Rights15   Reserved.
  16. 16. AutomaGc  Failover  Architecture   ©2013 Cloudera, Inc. All Rights16   Reserved.
  17. 17. Removing  the  NAS  dependency   HDFS-­‐3077:  completed  October  2012  17
  18. 18. Shared  Storage  in  HDFS  HA   •  The  Standby  NameNode  synchronizes  the  namespace   by  following  the  AcGve  NameNode’s  transacGon  log   •  Each  operaGon  (eg  mkdir(/foo))  is  wriIen  to  the  log  by  the   AcGve   •  The  StandbyNode  periodically  reads  all  new  edits  and   applies  them  to  its  own  metadata  structures   •  Reliable  shared  storage  is  required  for  correct   opera4on   •  In  phase  1,  shared  storage  was  synonymous  with  NFS-­‐ mounted  NAS   ©2013 Cloudera, Inc. All Rights18 Reserved.
  19. 19. Shortcomings  of  NFS-­‐based  approach   •  Custom  hardware   •  Lots  of  our  customers  don’t  have  SAN/NAS  available  in  their   datacenters   •  Costs  money,  Gme  and  experGse   •  Extra  “stuff”  to  monitor  outside  HDFS   •  We  just  moved  the  SPOF,  didn’t  eliminate  it!   •  Complicated   •  Storage  fencing,  NFS  mount  opGons,  mulGpath  networking,  etc   •  OrganizaGonally  complicated:  dependencies  on  storage  ops   team   •  NFS  issues   •  Buggy  client  implementaGons,  liIle  control  over  Gmeout   behavior,  etc   ©2013 Cloudera, Inc. All Rights19 Reserved.
  20. 20. Primary  Requirements  for  Improved  Storage   •  No  special  hardware  (PDUs,  NAS)   •  No  custom  fencing  configuraGon   •  Too  complicated  ==  too  easy  to  misconfigure   •  No  SPOFs   •  punGng  to  filers  isn’t  a  good  opGon   •  need  something  inherently  distributed   ©2013 Cloudera, Inc. All Rights20 Reserved.
  21. 21. Secondary  Requirements   •  Configurable  degree  of  fault  tolerance   •  Configure  N  nodes  to  tolerate  (N-­‐1)/2   •  Making  N  bigger  (within  reasonable  bounds)   shouldn’t  hurt  performance.  Implies:   •  Writes  done  in  parallel,  not  pipelined   •  Writes  should  not  wait  on  slowest  replica   •  Locate  replicas  on  exisGng  hardware  investment  (eg   share  with  JobTracker,  NN,  SBN)   ©2013 Cloudera, Inc. All Rights21 Reserved.
  22. 22. OperaGonal  Requirements   •  Should  be  operable  by  exisGng  Hadoop  admins.   Implies:   •  Same  metrics  system  (“hadoop  metrics”)   •  Same  configuraGon  system  (xml)   •  Same  logging  infrastructure  (log4j)   •  Same  security  system  (Kerberos-­‐based)   •  Allow  exisGng  ops  to  easily  deploy  and  manage  the   new  feature   •  Allow  exisGng  Hadoop  tools  to  monitor  the  feature   •  (eg  Cloudera  Manager,  Ganglia,  etc)   ©2013 Cloudera, Inc. All Rights22 Reserved.
  23. 23. Our  soluGon:  QuorumJournalManager   •  QuorumJournalManager  (client)   •  Plugs  into  JournalManager  abstracGon  in  NN  (instead  of   exisGng  FileJournalManager)   •  Provides  edit  log  storage  abstracGon   •  JournalNode  (server)   •  Standalone  daemon  running  on  an  odd  number  of  nodes   •  Provides  actual  storage  of  edit  logs  on  local  disks   •  Could  run  inside  other  daemons  in  the  future     ©2013 Cloudera, Inc. All Rights23 Reserved.
  24. 24. Architecture   ©2013 Cloudera, Inc. All Rights24 Reserved.
  25. 25. Commit  protocol   •  NameNode  accumulates  edits  locally  as  they  are   logged   •  On  logSync(),  sends  accumulated  batch  to  all  JNs  via   Hadoop  RPC   •  Waits  for  success  ACK  from  a  majority  of  nodes   •  Majority  commit  means  that  a  single  lagging  or  crashed   replica  does  not  impact  NN  latency   •  Latency  @  NN  =  median(Latency  @  JNs)   •  Uses  the  well-­‐known  Paxos  algorithm  to  perform   recovery  of  any  in-­‐flight  edits  on  leader  switchover   ©2013 Cloudera, Inc. All Rights25 Reserved.
  26. 26. JN  Fencing   •  How  do  we  prevent  split-­‐brain?   •  Each  instance  of  QJM  is  assigned  a  unique  epoch   number   •  provides  a  strong  ordering  between  client  NNs   •  Each  IPC  contains  the  client’s  epoch   •  JN  remembers  on  disk  the  highest  epoch  it  has  seen   •  Any  request  from  an  earlier  epoch  is  rejected.  Any  from  a   newer  one  is  recorded  on  disk   •  Distributed  Systems  folks  may  recognize  this  technique   from  Paxos  and  other  literature   ©2013 Cloudera, Inc. All Rights26 Reserved.
  27. 27. Fencing  with  epochs   •  Fencing  is  now  implicit   •  The  act  of  becoming  acGve  causes  any  earlier  acGve   NN  to  be  fenced  out   •  Since  a  quorum  of  nodes  has  accepted  the  new  acGve,  any   other  IPC  by  an  earlier  epoch  number  can’t  get  quorum   •  Eliminates  confusing  and  error-­‐prone  custom  fencing   configura4on   ©2013 Cloudera, Inc. All Rights27 Reserved.
  28. 28. Other  implementaGon  features   •  Hadoop  Metrics   •  lag,  percenGle  latencies,  etc  from  perspecGve  of  JN,  NN   •  metrics  for  queued  txns,  %  of  Gme  each  JN  fell  behind,  etc,   to  help  suss  out  a  slow  JN  before  it  causes  problems   •  Security   •  full  Kerberos  and  SSL  support:  edits  can  be  opGonally   encrypted  in-­‐flight,  and  all  access  is  mutually  authenGcated   ©2013 Cloudera, Inc. All Rights28 Reserved.
  29. 29. TesGng   •  Randomized  fault  test   •  Runs  all  communicaGons  in  a  single  thread  with   determinisGc  order  and  fault  injecGons  based  on  a  seed   •  Caught  a  number  of  really  subtle  bugs  along  the  way   •  Run  as  an  MR  job:  5000  fault  tests  in  parallel   •  MulGple  CPU-­‐years  of  stress  tesGng:  found  2  bugs  in  JeIy!     •  Cluster  tesGng:  100-­‐node,  MR,  HBase,  Hive,  etc   •  Commit  latency  in  pracGce:  within  same  range  as  local   disks  (beIer  than  one  of  two  local  disks,  worse  than  the   other  one)   ©2013 Cloudera, Inc. All Rights30 Reserved.
  30. 30. Deployment   •  Most  customers  running  3  JNs  (tolerate  1  failure)   •  1  on  NN,  1  on  SBN,  1  on  JobTracker/ResourceManager   •  OpGonally  run  2  more  (eg  on  basGon/gateway  nodes)  to   tolerate  2  failures   •  No  new  hardware  investment     •  Refer  to  docs  for  detailed  configuraGon  info   ©2013 Cloudera, Inc. All Rights31 Reserved.
  31. 31. Status   •  Merged  into  Hadoop  development  trunk  in  early   October   •  Available  in  CDH4.1,  will  be  in  upcoming  Hadoop  2.1   •  Deployed  at  several  customer/community  sites  with   good  success  so  far  (no  lost  data)   •  In  contrast,  we’ve  had  several  issues  with  misconfigured   NFS  filers  causing  downGme   •  Highly  recommend  you  use  Quorum  Journaling  instead  of   NFS!   ©2013 Cloudera, Inc. All Rights32 Reserved.
  32. 32. Summary  of  HA  Improvements   •  Run  an  acGve  NameNode  and  a  hot  Standby   NameNode   •  AutomaGcally  triggers  seamless  failover  using  Apache   ZooKeeper   •  Stores  shared  metadata  on  QuorumJournalManager:   a  fully  distributed,  redundant,  low  latency  journaling   system.   •  All  improvements  available  now  in  HDFS  branch-­‐2  and   CDH4.1   ©2013 Cloudera, Inc. All Rights33 Reserved.
  33. 33. HDFS  Performance  Update  34
  34. 34. Performance  Improvements  (overview)   •  Several  improvements  made  for  Impala   •  Much  faster  libhdfs   •  APIs  for  spindle-­‐based  scheduling   •  Other  more  general  improvements  (especially  for   HBase  and  Accumulo)   •  Ability  to  read  directly  from  block  files  in  secure   environments   •  Ability  for  applicaGons  to  perform  their  own  checksums   and  eliminate  IOPS   ©2013 Cloudera, Inc. All Rights35 Reserved.
  35. 35. libhdfs  “direct  read”  support  (HDFS-­‐2834)  •  This  can  also  benefit  apps  like  HBase,  Accumulo,  and  MR  with  a  bit  more   work  (TBD  in  2013)  36  
  36. 36. Disk  locaGons  API  (HDFS-­‐3672)   •  HDFS  has  always  exposed  node  locality  informaGon   •  Map<Block,  List<Datanode  Addresess>>   •  Now  also  can  expose  disk  locality  informaGon   •  Map<Replica,  List<Spindle  IdenGfiers>>   •  Impala  uses  this  API  to  keep  all  disks  spinning  at  full   throughput     •  ~2x  improvement  on  IO-­‐bound  workloads  on  12-­‐spindle   machines   ©2013 Cloudera, Inc. All Rights37 Reserved.
  37. 37. Short-­‐circuit  reads   •  “Short  circuit”  allows  HDFS  clients  to  open  HDFS  block   files  directly  from  the  local  filesystem   •  Avoids  context  switches  and  trips  back  and  forth  from  user   space  to  kernel  space  memory,  TCP  stack,  etc   •  Uses  50%  less  CPU,  avoids  significant  latency  when  reading   data  from  Linux  buffer  cache   •  SequenGal  IO  performance:  2x  improvement   •  Random  IO  performance:  3.5x  improvement   •  This  has  existed  for  a  while  in  insecure  setups  only!   •  Clients  need  read  access  to  all  block  files  L   ©2013 Cloudera, Inc. All Rights38 Reserved.
  38. 38. Secure  short-­‐circuit  reads  (HDFS-­‐347)   •  DataNode  conGnues  to  arbitrate  access  to  block  files   •  Opens  input  streams  and  passes  them  to  the  DFS   client  aAer  authenGcaGon  and  authorizaGon  checks   •  Uses  a  trick  involving  Unix  Domain  Sockets  (sendmsg  with   SCM_RIGHTS)   •  Now  perf-­‐sensiGve  apps  like  HBase,  Accumulo,  and   Impala  can  safely  configure  this  feature  in  all   environments   ©2013 Cloudera, Inc. All Rights39 Reserved.
  39. 39. Checksum  skipping  (HDFS-­‐3429)   •  Problem:  HDFS  stores  block  data  and  block   checksums  in  separate  files   •  A  truly  random  read  incurs  two  seeks  instead  of  one!   •  Solu4on:  HBase  now  stores  its  own  checksums  on  its   own  internal  64KB  blocks   •  But  it  turns  out  that  prior  versions  of  HDFS  sGll  read  the   checksum,  even  if  the  client  flipped  verificaGon  off   •  Fixing  this  yielded  a  40%  reduc4on  in  IOPS  and   latency  for  a  mulG-­‐TB  uniform  random-­‐read   workload!   ©2013 Cloudera, Inc. All Rights40 Reserved.
  40. 40. SGll  more  to  come?   •  Not  a  ton  leA  on  the  read  path   •  Write  path  sGll  has  some  low  hanging  fruit  –  hang   Gght  for  next  year   •  Reality  check  (mulG-­‐threaded  random-­‐read)   •  Hadoop  1.0:    264MB/sec   •  Hadoop  2.x:  1393MB/sec   •  We’ve  come  a  long  way  (5x)  in  a  few  years!   ©2013 Cloudera, Inc. All Rights41 Reserved.
  41. 41. Other  key  new  features  42
  42. 42. On-­‐the-­‐wire  EncrypGon   •  Strong  encrypGon  now  supported  for  all  traffic  on  the   wire   •  both  data  and  RPC   •  Configurable  cipher  (eg  RC5,  DES,  3DES)   •  Developed  specifically  based  on  requirements  from   the  IC   •  Reviewed  by  some  experts  here  today  (thanks!)   ©2013 Cloudera, Inc. All Rights43 Reserved.
  43. 43. Rolling  Upgrades  and  Wire  CompaGbility   •  RPC  and  Data  Transfer  now  using  Protocol  Buffers   •  Easy  for  developers  to  add  new  features  without   breaking  compaGbility   •  Allows  zero-­‐downGme  upgrade  between  minor   releases   •  Planning  to  lock  down  client-­‐server  compaGbility  even  for   more  major  releases  in  2013   ©2013 Cloudera, Inc. All Rights44 Reserved.
  44. 44. What’s  up  next  in  2013?  45
  45. 45. HDFS  Snapshots   •  Full  support  for  efficent  subtree  snapshots   •  Point-­‐in-­‐Gme  “copy”  of  a  part  of  the  filesystem   •  Like  a  NetApp  NAS:  simple  administraGve  API   •  Copy-­‐on-­‐write  (instantaneous  snapshoyng)   •  Can  serve  as  input  for  MR,  distcp,  backups,  etc   •  IniGally  read-­‐only,  some  thought  about  read-­‐write  in   the  future   •  In  progress  now,  hoping  to  merge  into  trunk  by   summerGme   ©2013 Cloudera, Inc. All Rights46 Reserved.
  46. 46. Hierarchical  storage   •  Early  exploraGon  into  SSD/Flash   •  AnGcipaGng  “hybrid”  storage  will  become  common  soon   •  What  performance  improvements  do  we  need  to  take   good  advantage  of  it?   •  Tiered  caching  of  hot  data  onto  flash?   •  Explicit  storage  “pools”  for  apps  to  manage?   •  Big-­‐RAM  boxes   •  256GB/box  not  so  expensive  anymore   •  How  can  we  best  make  use  of  all  this  RAM?  Caching!   ©2013 Cloudera, Inc. All Rights47 Reserved.
  47. 47. Storage  efficiency   •  Transparent  re-­‐compression  of  cold  data?   •  More  efficient  file  formats   •  Columnar  storage  for  Hive,  Impala   •  Faster  to  operate  on  and  more  compact   •  Work  on  “fat  datanodes”   •  36-­‐72TB/node  will  require  some  investment  in  DataNode   scaling   •  More  parallelism,  more  efficient  use  of  RAM,  etc.   ©2013 Cloudera, Inc. All Rights48 Reserved.