What's new and upcoming in HDFS
January 30, 2013
Todd Lipcon, Software Engineer
todd@cloudera.com
@tlipcon

Introductions
• Software engineer on Cloudera's Storage Engineering team
• Committer and PMC Member for Apache Hadoop and Apache HBase
• Projects in 2012
    • Responsible for >50% of the code for all phases of HA development
    • Also worked on many performance and stability improvements
• This presentation is highly technical – please feel free to grab/email me later if you'd like to clarify anything!

Outline
• HDFS 2.0 – what's new in 2012?
    • HA Phase 1 (Q1 2012)
    • HA Phase 2 (Q2-Q4 2012)
    • Performance improvements and other new features
• What's coming in 2013?
    • HDFS Snapshots
    • Better storage density and file formats
    • Caching and Hierarchical Storage Management
HDFS HA Phase 1 Review
HDFS-1623: completed March 2012

HDFS HA Background
• HDFS's strength is its simple and robust design
    • Single master NameNode maintains all metadata
    • Scales to multi-petabyte clusters easily on modern hardware
• Traditionally, the single master was also a single point of failure
    • Generally good availability, but not ops-friendly
    • No hot patch ability, no hot reconfiguration
    • No hot hardware replacement
• Hadoop is now mission critical: SPOF not OK!

HDFS HA Development Phase 1
• Completed March 2012 (HDFS-1623)
• Introduced the StandbyNode, a hot backup for the HDFS NameNode.
• Relied on shared storage to synchronize namespace state
    • (e.g. a NAS filer appliance)
• Allowed operators to manually trigger failover to the Standby
• Sufficient for many HA use cases: avoided planned downtime for hardware and software upgrades, planned machine/OS maintenance, configuration changes, etc.
HDFS HA Architecture Phase 1
• Parallel block reports sent to Active and Standby NameNodes
• NameNode state shared by locating the edit log on NAS over NFS
    • Active NameNode writes while the Standby Node "tails"
• Client failover done via client configuration
    • Each client configured with the address of both NNs: try both to find the active (see the configuration sketch below)
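A minimal sketch of the client-side failover configuration described above, assuming the standard HDFS HA client property names and expressed through the Configuration API rather than hdfs-site.xml; the nameservice ID "mycluster" and the hostnames are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Logical nameservice instead of a single NameNode host ("mycluster" is a placeholder).
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
    // Proxy provider that tries both NNs until it finds the active one.
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    // Clients address the nameservice, not an individual NameNode.
    FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
    System.out.println(fs.exists(new Path("/")));
  }
}
```

With this in place, a failover is invisible to applications: they keep talking to hdfs://mycluster and the proxy provider retries against whichever NameNode is currently active.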
HDFS HA Architecture Phase 1
(architecture diagram)
Fencing and NFS
• Must avoid split-brain syndrome
    • Both nodes think they are active and try to write to the same edit log. Your metadata becomes corrupt and requires manual intervention to restart
• Configure a fencing script (see the sketch after this slide)
    • Script must ensure that the prior active has stopped writing
    • STONITH: shoot-the-other-node-in-the-head
    • Storage fencing: e.g. using the NetApp ONTAP API to restrict filer access
• Fencing script must succeed to have a successful failover
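A hedged sketch of the fencing configuration this slide refers to, again using the Configuration API. dfs.ha.fencing.methods and dfs.ha.fencing.ssh.private-key-files are the standard property names; the fence-filer.sh path and the key path are made-up placeholders.

```java
import org.apache.hadoop.conf.Configuration;

public class FencingConfigSketch {
  public static Configuration fencingConf() {
    Configuration conf = new Configuration();
    // Methods are tried in order until one succeeds. sshfence logs into the old
    // active and kills the NameNode process (a STONITH-style approach); the
    // shell(...) entry points at a custom storage-fencing script, for example
    // one that calls a filer API to revoke access. The script path is a placeholder.
    conf.set("dfs.ha.fencing.methods",
        "sshfence\nshell(/usr/local/bin/fence-filer.sh)");
    conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
    return conf;
  }
}
```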
Shortcomings of Phase 1
• Insufficient to protect against unplanned downtime
    • Manual failover only: requires an operator to step in quickly after a crash
    • Various studies indicated this was the minority of downtime, but still important to address
• Requirement of a NAS device made deployment complex, expensive, and error-prone

(we always knew this was just the first phase!)

HDFS HA Development Phase 2
• Multiple new features for high availability
    • Automatic failover, based on Apache ZooKeeper
    • Remove dependency on NAS (network-attached storage)
• Address new HA use cases
    • Avoid unplanned downtime due to software or hardware faults
    • Deploy in filer-less environments
    • Completely stand-alone HA with no external hardware or software dependencies
        • no Linux-HA, filers, etc
Automatic Failover Overview
HDFS-3042: completed May 2012

Automatic Failover Goals
• Automatically detect failure of the Active NameNode
    • Hardware, software, network, etc.
• Do not require operator intervention to initiate failover
    • Once failure is detected, the process completes automatically
• Support manually initiated failover as first-class
    • Operators can still trigger failover without having to stop the Active
• Do not introduce a new SPOF
    • All parts of an auto-failover deployment must themselves be HA

Automatic Failover Architecture
• Automatic failover requires ZooKeeper
    • Not required for manual failover
• ZK makes it easy to:
    • Detect failure of the Active NameNode
    • Determine which NameNode should become the Active NN
Automatic Failover Architecture
• New daemon: ZooKeeper Failover Controller (ZKFC)
• In an auto failover deployment, run two ZKFCs (an example configuration follows this slide)
    • One per NameNode, on that NameNode machine
• ZKFC has three simple responsibilities:
    • Monitors health of its associated NameNode
    • Participates in leader election of NameNodes
    • Fences the other NameNode if it wins the election
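A minimal sketch of what enabling automatic failover looks like in configuration terms, assuming the standard property names; the ZooKeeper hostnames are placeholders.

```java
import org.apache.hadoop.conf.Configuration;

public class AutoFailoverConfigSketch {
  public static Configuration autoFailoverConf() {
    Configuration conf = new Configuration();
    // Tell the NameNodes and ZKFCs that failover is automatic for this cluster.
    conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
    // ZooKeeper ensemble used by the ZKFCs for failure detection and leader
    // election (hostnames are placeholders).
    conf.set("ha.zookeeper.quorum",
        "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
    return conf;
  }
}
```

In practice each NameNode host also runs the ZKFC daemon alongside its NameNode; the exact start-up and ZNode initialization steps are covered in the HA documentation.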
Automatic Failover Architecture
(architecture diagram)
Removing the NAS dependency
HDFS-3077: completed October 2012

Shared Storage in HDFS HA
• The Standby NameNode synchronizes the namespace by following the Active NameNode's transaction log
    • Each operation (eg mkdir(/foo)) is written to the log by the Active
    • The StandbyNode periodically reads all new edits and applies them to its own metadata structures
• Reliable shared storage is required for correct operation
    • In phase 1, shared storage was synonymous with NFS-mounted NAS
Shortcomings of NFS-based approach
• Custom hardware
    • Lots of our customers don't have SAN/NAS available in their datacenters
    • Costs money, time and expertise
    • Extra "stuff" to monitor outside HDFS
    • We just moved the SPOF, didn't eliminate it!
• Complicated
    • Storage fencing, NFS mount options, multipath networking, etc
    • Organizationally complicated: dependencies on the storage ops team
• NFS issues
    • Buggy client implementations, little control over timeout behavior, etc

Primary Requirements for Improved Storage
• No special hardware (PDUs, NAS)
• No custom fencing configuration
    • Too complicated == too easy to misconfigure
• No SPOFs
    • punting to filers isn't a good option
    • need something inherently distributed

Secondary Requirements
• Configurable degree of fault tolerance
    • Configure N nodes to tolerate (N-1)/2 failures
• Making N bigger (within reasonable bounds) shouldn't hurt performance. Implies:
    • Writes done in parallel, not pipelined
    • Writes should not wait on the slowest replica
• Locate replicas on existing hardware investment (eg share with JobTracker, NN, SBN)
Operational Requirements
• Should be operable by existing Hadoop admins. Implies:
    • Same metrics system ("hadoop metrics")
    • Same configuration system (xml)
    • Same logging infrastructure (log4j)
    • Same security system (Kerberos-based)
• Allow existing ops to easily deploy and manage the new feature
• Allow existing Hadoop tools to monitor the feature
    • (eg Cloudera Manager, Ganglia, etc)

Our solution: QuorumJournalManager
• QuorumJournalManager (client)
    • Plugs into the JournalManager abstraction in the NN (instead of the existing FileJournalManager)
    • Provides an edit log storage abstraction
• JournalNode (server)
    • Standalone daemon running on an odd number of nodes (a configuration sketch follows this slide)
    • Provides actual storage of edit logs on local disks
    • Could run inside other daemons in the future
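A hedged sketch of how a cluster points the NameNodes at a set of JournalNodes instead of an NFS path; the property names follow the quorum-journal documentation, while the hostnames, nameservice ID, and local directory are placeholders.

```java
import org.apache.hadoop.conf.Configuration;

public class QuorumJournalConfigSketch {
  public static Configuration qjmConf() {
    Configuration conf = new Configuration();
    // NameNode side: write shared edits to a quorum of JournalNodes
    // rather than to an NFS-mounted directory.
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
    // JournalNode side: where each JN keeps its copy of the edit log on local disk.
    conf.set("dfs.journalnode.edits.dir", "/data/1/dfs/jn");
    return conf;
  }
}
```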
Architecture
(architecture diagram)
Commit protocol
• NameNode accumulates edits locally as they are logged
• On logSync(), sends the accumulated batch to all JNs via Hadoop RPC
• Waits for a success ACK from a majority of nodes (a simplified sketch follows this slide)
    • Majority commit means that a single lagging or crashed replica does not impact NN latency
    • Latency @ NN = median(Latency @ JNs)
• Uses the well-known Paxos algorithm to perform recovery of any in-flight edits on leader switchover
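This is not the actual QuorumJournalManager code, just a self-contained sketch of the majority-ACK idea described above: the batch goes to every journal in parallel and the call returns as soon as a majority has acknowledged, so one slow or dead replica does not add latency. The Journal interface here is a hypothetical stand-in for the JournalNode RPC proxy.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MajorityCommitSketch {

  /** Minimal stand-in for a JournalNode RPC proxy (hypothetical interface). */
  public interface Journal {
    void journal(byte[] editBatch) throws IOException;
  }

  private final ExecutorService pool = Executors.newCachedThreadPool();

  /** Send one edit batch to all journals in parallel; return once a majority has ACKed. */
  public void logSync(List<Journal> journals, byte[] editBatch)
      throws IOException, InterruptedException {
    int majority = journals.size() / 2 + 1;
    CompletionService<Void> acks = new ExecutorCompletionService<>(pool);
    for (Journal j : journals) {
      // Writes go out in parallel, not pipelined through a chain of replicas.
      acks.submit(() -> { j.journal(editBatch); return null; });
    }
    int succeeded = 0;
    int failed = 0;
    while (succeeded < majority) {
      Future<Void> next = acks.take();   // blocks until some journal responds
      try {
        next.get();
        succeeded++;
      } catch (ExecutionException e) {
        failed++;
        // If too many journals have failed, a quorum can never be reached.
        if (failed > journals.size() - majority) {
          throw new IOException("could not get a quorum of ACKs", e);
        }
      }
    }
    // Slower journals finish in the background, so NameNode latency tracks the
    // median journal latency rather than the slowest replica.
  }
}
```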
JN Fencing
• How do we prevent split-brain?
• Each instance of QJM is assigned a unique epoch number
    • provides a strong ordering between client NNs
    • Each IPC contains the client's epoch
    • JN remembers on disk the highest epoch it has seen (the check is sketched after this slide)
    • Any request from an earlier epoch is rejected. Any from a newer one is recorded on disk
    • Distributed Systems folks may recognize this technique from Paxos and other literature
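A toy illustration of the epoch rule the slide describes, not the real JournalNode implementation: every request carries the caller's epoch, the server remembers the highest epoch it has promised to, and anything older is rejected.

```java
import java.io.IOException;

public class EpochCheckSketch {
  // In the real system this value is persisted to local disk so that it
  // survives JournalNode restarts; a field stands in for that here.
  private long highestPromisedEpoch = 0;

  /** Called at the start of every journal RPC with the client's epoch. */
  public synchronized void checkEpoch(long clientEpoch) throws IOException {
    if (clientEpoch < highestPromisedEpoch) {
      // A request from a fenced-out (earlier) writer: reject it.
      throw new IOException("Epoch " + clientEpoch
          + " is older than promised epoch " + highestPromisedEpoch);
    }
    // A newer writer implicitly fences all earlier ones.
    highestPromisedEpoch = clientEpoch;
  }
}
```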
Fencing with epochs
• Fencing is now implicit
• The act of becoming active causes any earlier active NN to be fenced out
    • Since a quorum of nodes has accepted the new active, any other IPC by an earlier epoch number can't get quorum
• Eliminates confusing and error-prone custom fencing configuration

Other implementation features
• Hadoop Metrics
    • lag, percentile latencies, etc from the perspective of JN, NN
    • metrics for queued txns, % of time each JN fell behind, etc, to help suss out a slow JN before it causes problems
• Security
    • full Kerberos and SSL support: edits can be optionally encrypted in-flight, and all access is mutually authenticated
Testing
• Randomized fault test
    • Runs all communications in a single thread with deterministic order and fault injections based on a seed
    • Caught a number of really subtle bugs along the way
    • Run as an MR job: 5000 fault tests in parallel
    • Multiple CPU-years of stress testing: found 2 bugs in Jetty!
• Cluster testing: 100-node, MR, HBase, Hive, etc
    • Commit latency in practice: within the same range as local disks (better than one of two local disks, worse than the other one)

Deployment
• Most customers running 3 JNs (tolerate 1 failure)
    • 1 on NN, 1 on SBN, 1 on JobTracker/ResourceManager
    • Optionally run 2 more (eg on bastion/gateway nodes) to tolerate 2 failures
• No new hardware investment
• Refer to docs for detailed configuration info
Status
• Merged into Hadoop development trunk in early October
• Available in CDH4.1, will be in upcoming Hadoop 2.1
• Deployed at several customer/community sites with good success so far (no lost data)
    • In contrast, we've had several issues with misconfigured NFS filers causing downtime
    • Highly recommend you use Quorum Journaling instead of NFS!

Summary of HA Improvements
• Run an active NameNode and a hot Standby NameNode
• Automatically triggers seamless failover using Apache ZooKeeper
• Stores shared metadata on QuorumJournalManager: a fully distributed, redundant, low latency journaling system.
• All improvements available now in HDFS branch-2 and CDH4.1
HDFS Performance Update
Performance Improvements (overview)
• Several improvements made for Impala
    • Much faster libhdfs
    • APIs for spindle-based scheduling
• Other more general improvements (especially for HBase and Accumulo)
    • Ability to read directly from block files in secure environments
    • Ability for applications to perform their own checksums and eliminate IOPS
libhdfs "direct read" support (HDFS-2834)
• This can also benefit apps like HBase, Accumulo, and MR with a bit more work (TBD in 2013); the underlying ByteBuffer read path is sketched below
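HDFS-2834 added a ByteBuffer-based read path to the HDFS client, which is what libhdfs's "direct read" support builds on. The sketch below shows the Java side of that path, assuming the opened stream implements ByteBufferReadable (streams that do not will throw UnsupportedOperationException); the file path is a placeholder.

```java
import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectReadSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Reading into a direct ByteBuffer avoids an extra copy through an
    // intermediate byte[]; libhdfs uses the same path from C.
    ByteBuffer buf = ByteBuffer.allocateDirect(128 * 1024);
    try (FSDataInputStream in = fs.open(new Path("/data/sample.bin"))) {
      while (in.read(buf) > 0) {
        buf.flip();
        // ... consume the buffer ...
        buf.clear();
      }
    }
  }
}
```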
  
Disk locations API (HDFS-3672)
• HDFS has always exposed node locality information
    • Map<Block, List<Datanode Addresses>>
• Now it can also expose disk locality information (see the sketch after this slide)
    • Map<Replica, List<Spindle Identifiers>>
• Impala uses this API to keep all disks spinning at full throughput
    • ~2x improvement on IO-bound workloads on 12-spindle machines
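A sketch of how a client might query the disk-locality information; it assumes the 2.x-era DistributedFileSystem#getFileBlockStorageLocations call and the dfs.datanode.hdfs-blocks-metadata.enabled DataNode setting that gates it. Treat the exact names as an assumption and check them against the release you run; the file path is a placeholder.

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.BlockStorageLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class DiskLocationsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.defaultFS points at HDFS, and that DataNodes run with
    // dfs.datanode.hdfs-blocks-metadata.enabled=true (assumed property name).
    Path file = new Path("/data/sample.bin");          // placeholder path
    DistributedFileSystem dfs =
        (DistributedFileSystem) file.getFileSystem(conf);

    FileStatus stat = dfs.getFileStatus(file);
    BlockLocation[] blocks = dfs.getFileBlockLocations(stat, 0, stat.getLen());
    // Node locality plus, per replica, an opaque identifier for the disk it lives on.
    BlockStorageLocation[] volumes =
        dfs.getFileBlockStorageLocations(Arrays.asList(blocks));
    for (BlockStorageLocation loc : volumes) {
      System.out.println(Arrays.toString(loc.getHosts()) + " -> "
          + Arrays.toString(loc.getVolumeIds()));
    }
  }
}
```

A scheduler like Impala's can group its scan ranges by volume ID so that every spindle stays busy instead of several readers piling onto the same disk.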
Short-circuit reads
• "Short circuit" allows HDFS clients to open HDFS block files directly from the local filesystem
    • Avoids context switches and trips back and forth from user space to kernel space memory, TCP stack, etc
    • Uses 50% less CPU, avoids significant latency when reading data from the Linux buffer cache
• Sequential IO performance: 2x improvement
• Random IO performance: 3.5x improvement
• This has existed for a while in insecure setups only!
    • Clients need read access to all block files :(
Secure short-circuit reads (HDFS-347)
• DataNode continues to arbitrate access to block files
• Opens input streams and passes them to the DFS client after authentication and authorization checks
    • Uses a trick involving Unix Domain Sockets (sendmsg with SCM_RIGHTS)
• Now perf-sensitive apps like HBase, Accumulo, and Impala can safely configure this feature in all environments (a configuration sketch follows this slide)
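A hedged sketch of the settings that turn on domain-socket-based short-circuit reads. The property names below are the ones used by the HDFS-347 implementation; the socket path is a placeholder and must exist (with appropriate permissions) on each DataNode host.

```java
import org.apache.hadoop.conf.Configuration;

public class ShortCircuitConfigSketch {
  public static Configuration shortCircuitConf() {
    Configuration conf = new Configuration();
    // Let the DFS client bypass the DataNode's TCP data path for local replicas.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // Unix domain socket shared by the DataNode and local clients; the DataNode
    // passes open file descriptors over it (sendmsg with SCM_RIGHTS).
    conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");  // placeholder path
    return conf;
  }
}
```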
Checksum skipping (HDFS-3429)
• Problem: HDFS stores block data and block checksums in separate files
    • A truly random read incurs two seeks instead of one!
• Solution: HBase now stores its own checksums on its own internal 64KB blocks
    • But it turns out that prior versions of HDFS still read the checksum, even if the client flipped verification off (the client-side switch is sketched after this slide)
• Fixing this yielded a 40% reduction in IOPS and latency for a multi-TB uniform random-read workload!
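For reference, the client-side switch the slide alludes to is FileSystem#setVerifyChecksum. A minimal sketch of an application that, like HBase, verifies its own checksums and asks HDFS to skip its checksum path; the file path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumSkipSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // The application verifies its own checksums (as HBase does with its
    // per-64KB-block checksums), so HDFS-level verification is turned off.
    // With HDFS-3429, this also avoids touching the separate checksum file,
    // saving a seek on every random read.
    fs.setVerifyChecksum(false);
    try (FSDataInputStream in = fs.open(new Path("/hbase/some-hfile"))) {  // placeholder path
      byte[] buf = new byte[4096];
      in.readFully(0L, buf);   // positioned read, typical of random-read workloads
    }
  }
}
```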
Still more to come?
• Not a ton left on the read path
• Write path still has some low hanging fruit – hang tight for next year
• Reality check (multi-threaded random-read)
    • Hadoop 1.0: 264MB/sec
    • Hadoop 2.x: 1393MB/sec
    • We've come a long way (5x) in a few years!

Other key new features
On-the-wire Encryption
• Strong encryption now supported for all traffic on the wire (a configuration sketch follows this slide)
    • both data and RPC
    • Configurable cipher (eg RC5, DES, 3DES)
• Developed specifically based on requirements from the IC
• Reviewed by some experts here today (thanks!)
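A hedged sketch of the knobs involved: RPC protection is a SASL quality-of-protection setting, and block data transfer encryption has its own flags. The property names are the standard ones; the cipher value shown is one commonly supported option, not a recommendation for any particular cluster.

```java
import org.apache.hadoop.conf.Configuration;

public class WireEncryptionConfigSketch {
  public static Configuration encryptionConf() {
    Configuration conf = new Configuration();
    // RPC: "privacy" means authentication + integrity + encryption (SASL QOP).
    conf.set("hadoop.rpc.protection", "privacy");
    // Block data transfer between clients and DataNodes.
    conf.setBoolean("dfs.encrypt.data.transfer", true);
    conf.set("dfs.encrypt.data.transfer.algorithm", "3des");  // cipher is configurable
    return conf;
  }
}
```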
Rolling Upgrades and Wire Compatibility
• RPC and Data Transfer now using Protocol Buffers
• Easy for developers to add new features without breaking compatibility
• Allows zero-downtime upgrade between minor releases
    • Planning to lock down client-server compatibility even for more major releases in 2013

What's up next in 2013?
HDFS Snapshots
• Full support for efficient subtree snapshots
    • Point-in-time "copy" of a part of the filesystem
    • Like a NetApp NAS: simple administrative API (a usage sketch follows this slide)
    • Copy-on-write (instantaneous snapshotting)
    • Can serve as input for MR, distcp, backups, etc
• Initially read-only, some thought about read-write in the future
• In progress now, hoping to merge into trunk by summertime
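Since the feature was still under development when this deck was written, the sketch below is only an assumption about the shape of the API: a FileSystem-level createSnapshot call on a directory that an administrator has first marked snapshottable, with the resulting snapshot readable like any other path. The directory and snapshot name are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotUsageSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/user/todd/dataset");    // placeholder directory

    // An admin must first mark the directory snapshottable (an administrative
    // operation, omitted here). Then a point-in-time, copy-on-write snapshot
    // can be taken instantly:
    Path snap = fs.createSnapshot(dir, "s20130130");

    // The snapshot is read-only and can be used as ordinary input,
    // e.g. for MapReduce jobs or distcp-based backups.
    System.out.println("Snapshot available at " + snap);
  }
}
```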
Hierarchical storage
• Early exploration into SSD/Flash
    • Anticipating "hybrid" storage will become common soon
    • What performance improvements do we need to take good advantage of it?
    • Tiered caching of hot data onto flash?
    • Explicit storage "pools" for apps to manage?
• Big-RAM boxes
    • 256GB/box not so expensive anymore
    • How can we best make use of all this RAM? Caching!

Storage efficiency
• Transparent re-compression of cold data?
• More efficient file formats
    • Columnar storage for Hive, Impala
    • Faster to operate on and more compact
• Work on "fat datanodes"
    • 36-72TB/node will require some investment in DataNode scaling
    • More parallelism, more efficient use of RAM, etc.
What's New and Upcoming in HDFS - the Hadoop Distributed File System

More Related Content

What's hot

Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
SANG WON PARK
 
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDownscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Databricks
 

What's hot (20)

Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 
2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard
 
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
 
Kafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced ProducersKafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced Producers
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021
 
Under The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database ArchitectureUnder The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database Architecture
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
 
Building images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKitBuilding images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKit
 
Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database Architecture
 
FastR+Apache Flink
FastR+Apache FlinkFastR+Apache Flink
FastR+Apache Flink
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and Containers
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
 
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDownscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
 
Crimson: Ceph for the Age of NVMe and Persistent Memory
Crimson: Ceph for the Age of NVMe and Persistent MemoryCrimson: Ceph for the Age of NVMe and Persistent Memory
Crimson: Ceph for the Age of NVMe and Persistent Memory
 

Viewers also liked

Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HA
Hortonworks
 
[G6]hadoop이중화왜하는거지
[G6]hadoop이중화왜하는거지[G6]hadoop이중화왜하는거지
[G6]hadoop이중화왜하는거지
NAVER D2
 
하둡 타입과 포맷
하둡 타입과 포맷하둡 타입과 포맷
하둡 타입과 포맷
진호 박
 

Viewers also liked (20)

Strata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and FutureStrata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and Future
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS Federation
 
Hortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User GroupHortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User Group
 
Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formats
 
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User Reference
 
Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HA
 
[G6]hadoop이중화왜하는거지
[G6]hadoop이중화왜하는거지[G6]hadoop이중화왜하는거지
[G6]hadoop이중화왜하는거지
 
2012.04.11 미래사회와 빅 데이터(big data) 기술 nipa
2012.04.11 미래사회와 빅 데이터(big data) 기술 nipa2012.04.11 미래사회와 빅 데이터(big data) 기술 nipa
2012.04.11 미래사회와 빅 데이터(big data) 기술 nipa
 
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
 
Introducción a Hadoop
Introducción a HadoopIntroducción a Hadoop
Introducción a Hadoop
 
하둡 타입과 포맷
하둡 타입과 포맷하둡 타입과 포맷
하둡 타입과 포맷
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
HDFS NameNode High Availability
HDFS NameNode High AvailabilityHDFS NameNode High Availability
HDFS NameNode High Availability
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 
HDFS Namenode High Availability
HDFS Namenode High AvailabilityHDFS Namenode High Availability
HDFS Namenode High Availability
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
 
HBase 훑어보기
HBase 훑어보기HBase 훑어보기
HBase 훑어보기
 

Similar to What's New and Upcoming in HDFS - the Hadoop Distributed File System

Operate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmineOperate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmine
DataWorks Summit
 
ActiveMQ Performance Tuning
ActiveMQ Performance TuningActiveMQ Performance Tuning
ActiveMQ Performance Tuning
Christian Posta
 
SAP Virtualization Week 2012 - The Lego Cloud
SAP Virtualization Week 2012 - The Lego CloudSAP Virtualization Week 2012 - The Lego Cloud
SAP Virtualization Week 2012 - The Lego Cloud
aidanshribman
 

Similar to What's New and Upcoming in HDFS - the Hadoop Distributed File System (20)

Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2
Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2
Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2
 
Building an Apache Hadoop data application
Building an Apache Hadoop data applicationBuilding an Apache Hadoop data application
Building an Apache Hadoop data application
 
Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)
 
Webinar: The Future of Hadoop
Webinar: The Future of HadoopWebinar: The Future of Hadoop
Webinar: The Future of Hadoop
 
Flume and HBase
Flume and HBase Flume and HBase
Flume and HBase
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
The State of HBase Replication
The State of HBase ReplicationThe State of HBase Replication
The State of HBase Replication
 
Operate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmineOperate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmine
 
Postgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster SuitePostgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster Suite
 
ActiveMQ Performance Tuning
ActiveMQ Performance TuningActiveMQ Performance Tuning
ActiveMQ Performance Tuning
 
SAP Virtualization Week 2012 - The Lego Cloud
SAP Virtualization Week 2012 - The Lego CloudSAP Virtualization Week 2012 - The Lego Cloud
SAP Virtualization Week 2012 - The Lego Cloud
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
 

More from Cloudera, Inc.

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

What's New and Upcoming in HDFS - the Hadoop Distributed File System

  • 1. What’s  new  and  upcoming  in  HDFS   January  30,  2013   Todd  Lipcon,  SoAware  Engineer   todd@cloudera.com      @tlipcon   1
  • 2. IntroducGons   •  SoAware  engineer  on  Cloudera’s  Storage  Engineering   team   •  CommiIer  and  PMC  Member  for  Apache  Hadoop  and   Apache  HBase   •  Projects  in  2012   •  Responsible  for  >50%  of  the  code  for  all  phases  of  HA   development   •  Also  worked  on  many  performance  and  stability   improvements   •  This  presentaGon  is  highly  technical  –  please  feel  free   to  grab/email  me  later  if  you’d  like  to  clarify  anything!   ©2013 Cloudera, Inc. All Rights 2 Reserved.
  • 3. Outline   •  HDFS  2.0  –  what’s  new  in  2012?   •  HA  Phase  1  (Q1  2012)   •  HA  Phase  2  (Q2-­‐Q4  2012)   •  Performance  improvements  and  other  new  features   •  What’s  coming  in  2013?   •  HDFS  Snapshots   •  BeIer  storage  density  and  file  formats   •  Caching  and  Hierarchical  Storage  Management   ©2013 Cloudera, Inc. All Rights 3   Reserved.
  • 4. HDFS  HA  Phase  1  Review   HDFS-­‐1623:  completed  March  2012   4
  • 5. HDFS  HA  Background   •  HDFS’s  strength  is  its  simple  and  robust  design   •  Single  master  NameNode  maintains  all  metadata   •  Scales  to  mul4-­‐petabyte  clusters  easily  on  modern   hardware   •  TradiGonally,  the  single  master  was  also  a  single  point   of  failure   •  Generally  good  availability,  but  not  ops-­‐friendly   •  No  hot  patch  ability,  no  hot  reconfiguraGon   •  No  hot  hardware  replacement   •  Hadoop  is  now  mission  cri4cal:  SPOF  not  OK!   ©2013 Cloudera, Inc. All Rights 5 Reserved.
  • 6. HDFS  HA  Development  Phase  1   •  Completed  March  2012  (HDFS-­‐1623)   •  Introduced  the  StandbyNode,  a  hot  backup  for  the  HDFS   NameNode.   •  Relied  on  shared  storage  to  synchronize  namespace  state   •  (e.g.  a  NAS  filer  appliance)   •  Allowed  operators  to  manually  trigger  failover  to  the   Standby   •  Sufficient  for  many  HA  use  cases:  avoided  planned   down4me  for  hardware  and  soAware  upgrades,  planned   machine/OS  maintenance,  configuraGon  changes,  etc.   ©2013 Cloudera, Inc. All Rights 6 Reserved.
  • 7. HDFS  HA  Architecture  Phase  1   •  Parallel  block  reports  sent  to  AcGve  and  Standby   NameNodes   •  NameNode  state  shared  by  locaGng  edit  log  on  NAS   over  NFS   •  AcGve  NameNode  writes  while  Standby  Node  “tails”     •  Client  failover  done  via  client  configuraGon   •  Each  client  configured  with  the  address  of  both  NNs:  try   both  to  find  acGve   ©2013 Cloudera, Inc. All Rights 7 Reserved.
  • 8. HDFS  HA  Architecture  Phase  1   ©2013 Cloudera, Inc. All Rights 8   Reserved.
  • 9. Fencing  and  NFS   •  Must  avoid  split-­‐brain  syndrome   •  Both  nodes  think  they  are  acGve  and  try  to  write  to  the   same  edit  log.  Your  metadata  becomes  corrupt  and   requires  manual  intervenGon  to  restart   •  Configure  a  fencing  script   •  Script  must  ensure  that  prior  acGve  has  stopped  wriGng   •  STONITH:  shoot-­‐the-­‐other-­‐node-­‐in-­‐the-­‐head   •  Storage  fencing:  e.g  using  NetApp  ONTAP  API  to  restrict   filer  access   •  Fencing  script  must  succeed  to  have  a  successful   failover   ©2013 Cloudera, Inc. All Rights 9 Reserved.
  • 10. Shortcomings  of  Phase  1   •  Insufficient  to  protect  against  unplanned  down4me   •  Manual  failover  only:  requires  an  operator  to  step  in   quickly  aAer  a  crash   •  Various  studies  indicated  this  was  the  minority  of   downGme,  but  sGll  important  to  address   •  Requirement  of  a  NAS  device  made  deployment   complex,  expensive,  and  error-­‐prone     (we  always  knew  this  was  just  the  first  phase!) ©2013 Cloudera, Inc. All Rights 10 Reserved.
  • 11. HDFS  HA  Development  Phase  2   •  MulGple  new  features  for  high  availability   •  Automa4c  failover,  based  on  Apache  ZooKeeper   •  Remove  dependency  on  NAS  (network-­‐aIached  storage)   •  Address  new  HA  use  cases   •  Avoid  unplanned  downGme  due  to  soAware  or  hardware   faults   •  Deploy  in  filer-­‐less  environments   •  Completely  stand-­‐alone  HA  with  no  external  hardware  or   soAware  dependencies   •  no  Linux-­‐HA,  filers,  etc   ©2013 Cloudera, Inc. All Rights 11 Reserved.
  • 12. AutomaGc  Failover  Overview   HDFS-­‐3042:  completed  May  2012   12
  • 13. AutomaGc  Failover  Goals   •  Automa4cally  detect  failure  of  the  AcGve  NameNode   •  Hardware,  soAware,  network,  etc.   •  Do  not  require  operator  interven4on  to  iniGate   failover   •  Once  failure  is  detected,  process  completes  automaGcally   •  Support  manually  ini4ated  failover  as  first-­‐class   •  Operators  can  sGll  trigger  failover  without  having  to  stop   AcGve   •  Do  not  introduce  a  new  SPOF   •  All  parts  of  auto-­‐failover  deployment  must  themselves  be   HA   ©2013 Cloudera, Inc. All Rights 13 Reserved.
  • 14. AutomaGc  Failover  Architecture   •  AutomaGc  failover  requires  ZooKeeper   •  Not  required  for  manual  failover   •  ZK  makes  it  easy  to:   •  Detect  failure  of  AcGve  NameNode   •  Determine  which  NameNode  should   become  the  AcGve  NN   ©2013 Cloudera, Inc. All Rights 14   Reserved.
  • 15. AutomaGc  Failover  Architecture   •  New  daemon:  ZooKeeper  Failover  Controller  (ZKFC)   •  In  an  auto  failover  deployment,  run  two  ZKFCs   •  One  per  NameNode,  on  that  NameNode  machine   •  ZKFC  has  three  simple  responsibili4es:   •  Monitors  health  of  associated  NameNode   •  ParGcipates  in  leader  elec4on  of  NameNodes   •  Fences  the  other  NameNode  if  it  wins  elecGon   ©2013 Cloudera, Inc. All Rights 15   Reserved.
  • 16. AutomaGc  Failover  Architecture   ©2013 Cloudera, Inc. All Rights 16   Reserved.
  • 17. Removing  the  NAS  dependency   HDFS-­‐3077:  completed  October  2012   17
  • 18. Shared  Storage  in  HDFS  HA   •  The  Standby  NameNode  synchronizes  the  namespace   by  following  the  AcGve  NameNode’s  transacGon  log   •  Each  operaGon  (eg  mkdir(/foo))  is  wriIen  to  the  log  by  the   AcGve   •  The  StandbyNode  periodically  reads  all  new  edits  and   applies  them  to  its  own  metadata  structures   •  Reliable  shared  storage  is  required  for  correct   opera4on   •  In  phase  1,  shared  storage  was  synonymous  with  NFS-­‐ mounted  NAS   ©2013 Cloudera, Inc. All Rights 18 Reserved.
  • 19. Shortcomings  of  NFS-­‐based  approach   •  Custom  hardware   •  Lots  of  our  customers  don’t  have  SAN/NAS  available  in  their   datacenters   •  Costs  money,  Gme  and  experGse   •  Extra  “stuff”  to  monitor  outside  HDFS   •  We  just  moved  the  SPOF,  didn’t  eliminate  it!   •  Complicated   •  Storage  fencing,  NFS  mount  opGons,  mulGpath  networking,  etc   •  OrganizaGonally  complicated:  dependencies  on  storage  ops   team   •  NFS  issues   •  Buggy  client  implementaGons,  liIle  control  over  Gmeout   behavior,  etc   ©2013 Cloudera, Inc. All Rights 19 Reserved.
  • 20. Primary  Requirements  for  Improved  Storage   •  No  special  hardware  (PDUs,  NAS)   •  No  custom  fencing  configuraGon   •  Too  complicated  ==  too  easy  to  misconfigure   •  No  SPOFs   •  punGng  to  filers  isn’t  a  good  opGon   •  need  something  inherently  distributed   ©2013 Cloudera, Inc. All Rights 20 Reserved.
  • 21. Secondary  Requirements   •  Configurable  degree  of  fault  tolerance   •  Configure  N  nodes  to  tolerate  (N-­‐1)/2   •  Making  N  bigger  (within  reasonable  bounds)   shouldn’t  hurt  performance.  Implies:   •  Writes  done  in  parallel,  not  pipelined   •  Writes  should  not  wait  on  slowest  replica   •  Locate  replicas  on  exisGng  hardware  investment  (eg   share  with  JobTracker,  NN,  SBN)   ©2013 Cloudera, Inc. All Rights 21 Reserved.
  • 22. OperaGonal  Requirements   •  Should  be  operable  by  exisGng  Hadoop  admins.   Implies:   •  Same  metrics  system  (“hadoop  metrics”)   •  Same  configuraGon  system  (xml)   •  Same  logging  infrastructure  (log4j)   •  Same  security  system  (Kerberos-­‐based)   •  Allow  exisGng  ops  to  easily  deploy  and  manage  the   new  feature   •  Allow  exisGng  Hadoop  tools  to  monitor  the  feature   •  (eg  Cloudera  Manager,  Ganglia,  etc)   ©2013 Cloudera, Inc. All Rights 22 Reserved.
  • 23. Our  soluGon:  QuorumJournalManager   •  QuorumJournalManager  (client)   •  Plugs  into  JournalManager  abstracGon  in  NN  (instead  of   exisGng  FileJournalManager)   •  Provides  edit  log  storage  abstracGon   •  JournalNode  (server)   •  Standalone  daemon  running  on  an  odd  number  of  nodes   •  Provides  actual  storage  of  edit  logs  on  local  disks   •  Could  run  inside  other  daemons  in  the  future     ©2013 Cloudera, Inc. All Rights 23 Reserved.
  • 24. Architecture   ©2013 Cloudera, Inc. All Rights 24 Reserved.
  • 25. Commit  protocol   •  NameNode  accumulates  edits  locally  as  they  are   logged   •  On  logSync(),  sends  accumulated  batch  to  all  JNs  via   Hadoop  RPC   •  Waits  for  success  ACK  from  a  majority  of  nodes   •  Majority  commit  means  that  a  single  lagging  or  crashed   replica  does  not  impact  NN  latency   •  Latency  @  NN  =  median(Latency  @  JNs)   •  Uses  the  well-­‐known  Paxos  algorithm  to  perform   recovery  of  any  in-­‐flight  edits  on  leader  switchover   ©2013 Cloudera, Inc. All Rights 25 Reserved.
  • 26. JN  Fencing   •  How  do  we  prevent  split-­‐brain?   •  Each  instance  of  QJM  is  assigned  a  unique  epoch   number   •  provides  a  strong  ordering  between  client  NNs   •  Each  IPC  contains  the  client’s  epoch   •  JN  remembers  on  disk  the  highest  epoch  it  has  seen   •  Any  request  from  an  earlier  epoch  is  rejected.  Any  from  a   newer  one  is  recorded  on  disk   •  Distributed  Systems  folks  may  recognize  this  technique   from  Paxos  and  other  literature   ©2013 Cloudera, Inc. All Rights 26 Reserved.
  • 27. Fencing  with  epochs   •  Fencing  is  now  implicit   •  The  act  of  becoming  acGve  causes  any  earlier  acGve   NN  to  be  fenced  out   •  Since  a  quorum  of  nodes  has  accepted  the  new  acGve,  any   other  IPC  by  an  earlier  epoch  number  can’t  get  quorum   •  Eliminates  confusing  and  error-­‐prone  custom  fencing   configura4on   ©2013 Cloudera, Inc. All Rights 27 Reserved.
  • 28. Other  implementaGon  features   •  Hadoop  Metrics   •  lag,  percenGle  latencies,  etc  from  perspecGve  of  JN,  NN   •  metrics  for  queued  txns,  %  of  Gme  each  JN  fell  behind,  etc,   to  help  suss  out  a  slow  JN  before  it  causes  problems   •  Security   •  full  Kerberos  and  SSL  support:  edits  can  be  opGonally   encrypted  in-­‐flight,  and  all  access  is  mutually  authenGcated   ©2013 Cloudera, Inc. All Rights 28 Reserved.
  • 29.
  • 30. TesGng   •  Randomized  fault  test   •  Runs  all  communicaGons  in  a  single  thread  with   determinisGc  order  and  fault  injecGons  based  on  a  seed   •  Caught  a  number  of  really  subtle  bugs  along  the  way   •  Run  as  an  MR  job:  5000  fault  tests  in  parallel   •  MulGple  CPU-­‐years  of  stress  tesGng:  found  2  bugs  in  JeIy!     •  Cluster  tesGng:  100-­‐node,  MR,  HBase,  Hive,  etc   •  Commit  latency  in  pracGce:  within  same  range  as  local   disks  (beIer  than  one  of  two  local  disks,  worse  than  the   other  one)   ©2013 Cloudera, Inc. All Rights 30 Reserved.
• 31. Deployment
   •  Most customers running 3 JNs (tolerate 1 failure)
         •  1 on NN, 1 on SBN, 1 on JobTracker/ResourceManager
         •  Optionally run 2 more (e.g. on bastion/gateway nodes) to tolerate 2 failures
   •  No new hardware investment
   •  Refer to the docs for detailed configuration info
• 32. Status
   •  Merged into Hadoop development trunk in early October
   •  Available in CDH4.1, will be in the upcoming Hadoop 2.1
   •  Deployed at several customer/community sites with good success so far (no lost data)
         •  In contrast, we’ve had several issues with misconfigured NFS filers causing downtime
         •  Highly recommend you use Quorum Journaling instead of NFS!
• 33. Summary of HA Improvements
   •  Run an active NameNode and a hot Standby NameNode
   •  Automatically triggers seamless failover using Apache ZooKeeper
   •  Stores shared metadata on QuorumJournalManager: a fully distributed, redundant, low-latency journaling system
   •  All improvements available now in HDFS branch-2 and CDH4.1
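As a rough sketch, the ZooKeeper-based automatic failover is switched on with a couple of settings like the following; the ZooKeeper hostnames are placeholders, and the full HA/nameservice configuration lives in the HDFS HA documentation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class AutoFailoverConfigSketch {
    public static Configuration autoFailoverConfig() {
        Configuration conf = new HdfsConfiguration();
        // Enable the ZKFailoverController-driven automatic failover.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        // ZooKeeper ensemble used for leader election and fencing state.
        conf.set("ha.zookeeper.quorum",
                 "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        return conf;
    }
}
```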
• 35. Performance Improvements (overview)
   •  Several improvements made for Impala
         •  Much faster libhdfs
         •  APIs for spindle-based scheduling
   •  Other more general improvements (especially for HBase and Accumulo)
         •  Ability to read directly from block files in secure environments
         •  Ability for applications to perform their own checksums and eliminate IOPS
• 36. libhdfs “direct read” support (HDFS-2834)
   •  This can also benefit apps like HBase, Accumulo, and MR with a bit more work (TBD in 2013)
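The headline gains from HDFS-2834 are on the libhdfs (C) side, but the Java API it builds on is the ByteBuffer read path on FSDataInputStream, which lets a client read into a caller-supplied direct buffer without an extra copy. A minimal sketch, with a placeholder path:

```java
import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/example.bin"))) {
            // Direct buffer: data lands here without an intermediate byte[] copy.
            ByteBuffer buf = ByteBuffer.allocateDirect(128 * 1024);
            int n = in.read(buf);   // ByteBufferReadable path added for HDFS-2834
            System.out.println("read " + n + " bytes");
        }
    }
}
```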
• 37. Disk locations API (HDFS-3672)
   •  HDFS has always exposed node locality information
         •  Map<Block, List<Datanode Addresses>>
   •  Now it can also expose disk locality information
         •  Map<Replica, List<Spindle Identifiers>>
   •  Impala uses this API to keep all disks spinning at full throughput
         •  ~2x improvement on IO-bound workloads on 12-spindle machines
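A hedged sketch of the disk-locations API as it appeared in Hadoop 2.x releases of this era (the API has changed in later releases); the file path is a placeholder, and the DataNodes must have dfs.datanode.hdfs-blocks-metadata.enabled set to true for volume information to be returned.

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.BlockStorageLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.VolumeId;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class DiskLocationsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("/data/example.bin");
        try (DistributedFileSystem dfs =
                 (DistributedFileSystem) file.getFileSystem(conf)) {
            FileStatus stat = dfs.getFileStatus(file);
            // Node-level locality: which DataNodes hold each block.
            BlockLocation[] blocks =
                dfs.getFileBlockLocations(stat, 0, stat.getLen());
            // Disk-level locality: which volume (spindle) each replica lives on.
            BlockStorageLocation[] locations =
                dfs.getFileBlockStorageLocations(Arrays.asList(blocks));
            for (BlockStorageLocation loc : locations) {
                VolumeId[] volumes = loc.getVolumeIds();
                System.out.println(Arrays.toString(loc.getHosts())
                    + " -> " + Arrays.toString(volumes));
            }
        }
    }
}
```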
• 38. Short-circuit reads
   •  “Short circuit” allows HDFS clients to open HDFS block files directly from the local filesystem
         •  Avoids context switches and trips back and forth from user space to kernel space memory, the TCP stack, etc.
         •  Uses 50% less CPU, avoids significant latency when reading data from the Linux buffer cache
         •  Sequential IO performance: 2x improvement
         •  Random IO performance: 3.5x improvement
   •  This has existed for a while in insecure setups only!
         •  Clients need read access to all block files :(
• 39. Secure short-circuit reads (HDFS-347)
   •  DataNode continues to arbitrate access to block files
   •  Opens input streams and passes them to the DFS client after authentication and authorization checks
         •  Uses a trick involving Unix domain sockets (sendmsg with SCM_RIGHTS)
   •  Now perf-sensitive apps like HBase, Accumulo, and Impala can safely configure this feature in all environments
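A minimal sketch of the client-side settings involved; the socket path is a placeholder and must match what the DataNode is configured to create, with its parent directory restricted to the DataNode user.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class ShortCircuitConfigSketch {
    public static Configuration shortCircuitConfig() {
        Configuration conf = new HdfsConfiguration();
        // Ask the DFS client to bypass the DataNode data path for local reads.
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        // Unix domain socket shared by the client and the local DataNode.
        conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");
        return conf;
    }
}
```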
• 40. Checksum skipping (HDFS-3429)
   •  Problem: HDFS stores block data and block checksums in separate files
         •  A truly random read incurs two seeks instead of one!
   •  Solution: HBase now stores its own checksums on its own internal 64KB blocks
         •  But it turns out that prior versions of HDFS still read the checksum, even if the client flipped verification off
   •  Fixing this yielded a 40% reduction in IOPS and latency for a multi-TB uniform random-read workload!
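A sketch of the client-side switch an application that maintains its own checksums (as HBase does) can flip; the path is a placeholder, and the second seek actually disappears only once HDFS-3429 stops touching the .meta file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SkipChecksumSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // The application verifies its own checksums, so HDFS can skip its own.
            fs.setVerifyChecksum(false);
            try (FSDataInputStream in = fs.open(new Path("/hbase/example.hfile"))) {
                byte[] buf = new byte[64 * 1024];
                // Positional random read: one data seek, no checksum-file seek.
                int n = in.read(0L, buf, 0, buf.length);
                System.out.println("read " + n + " bytes");
            }
        }
    }
}
```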
• 41. Still more to come?
   •  Not a ton left on the read path
   •  Write path still has some low-hanging fruit – hang tight for next year
   •  Reality check (multi-threaded random-read)
         •  Hadoop 1.0: 264 MB/sec
         •  Hadoop 2.x: 1393 MB/sec
   •  We’ve come a long way (5x) in a few years!
• 42. Other key new features
• 43. On-the-wire Encryption
   •  Strong encryption now supported for all traffic on the wire
         •  both data and RPC
         •  Configurable cipher (e.g. RC5, DES, 3DES)
   •  Developed specifically based on requirements from the IC
   •  Reviewed by some experts here today (thanks!)
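A minimal sketch of the relevant settings, assuming Kerberos security is already configured; the cipher value shown is just an example, chosen from whatever your release supports.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class WireEncryptionConfigSketch {
    public static Configuration wireEncryptionConfig() {
        Configuration conf = new HdfsConfiguration();
        // Encrypt block data on the DataNode data-transfer protocol.
        conf.setBoolean("dfs.encrypt.data.transfer", true);
        conf.set("dfs.encrypt.data.transfer.algorithm", "3des"); // example cipher
        // Encrypt RPC payloads (SASL quality of protection = privacy).
        conf.set("hadoop.rpc.protection", "privacy");
        return conf;
    }
}
```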
• 44. Rolling Upgrades and Wire Compatibility
   •  RPC and Data Transfer now using Protocol Buffers
   •  Easy for developers to add new features without breaking compatibility
   •  Allows zero-downtime upgrade between minor releases
         •  Planning to lock down client-server compatibility even for more major releases in 2013
• 45. What’s up next in 2013?
• 46. HDFS Snapshots
   •  Full support for efficient subtree snapshots
         •  Point-in-time “copy” of a part of the filesystem
         •  Like a NetApp NAS: simple administrative API
         •  Copy-on-write (instantaneous snapshotting)
         •  Can serve as input for MR, distcp, backups, etc.
   •  Initially read-only, some thought about read-write in the future
   •  In progress now, hoping to merge into trunk by summertime
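The snapshot work was still in progress when this talk was given; the administrative API that eventually shipped looks roughly like the sketch below. The directory path and snapshot name are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path dir = new Path("/user/todd/dataset");
        try (DistributedFileSystem dfs =
                 (DistributedFileSystem) dir.getFileSystem(conf)) {
            dfs.allowSnapshot(dir);                         // admin: mark the subtree snapshottable
            Path snap = dfs.createSnapshot(dir, "jan30");   // instantaneous, copy-on-write
            // The snapshot is a read-only view usable as input to MR, distcp, backups, etc.
            System.out.println("created " + snap);
        }
    }
}
```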
• 47. Hierarchical storage
   •  Early exploration into SSD/Flash
         •  Anticipating “hybrid” storage will become common soon
         •  What performance improvements do we need to take good advantage of it?
         •  Tiered caching of hot data onto flash?
         •  Explicit storage “pools” for apps to manage?
   •  Big-RAM boxes
         •  256GB/box not so expensive anymore
         •  How can we best make use of all this RAM? Caching!
• 48. Storage efficiency
   •  Transparent re-compression of cold data?
   •  More efficient file formats
         •  Columnar storage for Hive, Impala
         •  Faster to operate on and more compact
   •  Work on “fat datanodes”
         •  36-72TB/node will require some investment in DataNode scaling
         •  More parallelism, more efficient use of RAM, etc.