HDFS ARCHITECTURE
How HDFS is evolving to meet new needs
✛  Aaron T. Myers
    ✛  Hadoop PMC Member / Committer at ASF
    ✛  Software Engineer at Cloudera
    ✛  Primarily work on HDFS and Hadoop Security




✛  HDFS architecture circa 2010
    ✛  New requirements for HDFS
       >  Random read patterns
       >  Higher scalability
       >  Higher availability
    ✛  HDFS evolutions to address requirements
       >  Read pipeline performance improvements
       >  Federated namespaces
       >  Highly available Name Node



HDFS ARCHITECTURE: 2010
✛  Each cluster has…
       >  A single Name Node
           ∗  Stores file system metadata
           ∗  Stores “Block ID” -> Data Node mapping
       >  Many Data Nodes
           ∗  Store actual file data
       >  Clients of HDFS…
           ∗  Communicate with Name Node to browse file system, get
              block locations for files
           ∗  Communicate directly with Data Nodes to read/write files
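The metadata/data split above can be sketched as a toy model (plain Python, not the real Hadoop API; class and path names are illustrative): the NameNode answers only metadata queries, and file bytes flow directly from DataNodes.

```python
# Toy model of the HDFS read path: one metadata lookup at the NameNode,
# then direct block reads from DataNodes.

class NameNode:
    """Holds only metadata: path -> block IDs, block ID -> DataNode addresses."""
    def __init__(self):
        self.blocks = {}      # path -> [block_id, ...]
        self.locations = {}   # block_id -> [datanode address, ...]

class DataNode:
    """Stores the actual block bytes."""
    def __init__(self):
        self.store = {}       # block_id -> bytes

def read_file(nn, datanodes, path):
    data = b""
    # 1. Ask the NameNode which blocks make up the file and where they live.
    for block_id in nn.blocks[path]:
        # 2. Stream each block directly from a DataNode holding a replica.
        dn_addr = nn.locations[block_id][0]
        data += datanodes[dn_addr].store[block_id]
    return data

# Wire up a one-file, two-block "cluster".
nn = NameNode()
dn = DataNode()
dn.store = {"blk_1": b"hello ", "blk_2": b"world"}
nn.blocks = {"/user/demo.txt": ["blk_1", "blk_2"]}
nn.locations = {"blk_1": ["dn-a"], "blk_2": ["dn-a"]}
print(read_file(nn, {"dn-a": dn}, "/user/demo.txt"))  # b'hello world'
```

The key design point the model captures: the NameNode is never on the data path, so its load scales with metadata operations, not bytes read.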




✛  Want to support larger clusters
       >  ~4,000 node limit with 2010 architecture
       >  New nodes beefier than old nodes
          ∗  2009: 8 cores, 16GB RAM, 4x1TB disks
          ∗  2012: 16 cores, 48GB RAM, 12x3TB disks

    ✛  Want to increase availability
       >  With rise of HBase, HDFS now serving live traffic
       >  Downtime means immediate user-facing impact
    ✛  Want to improve random read performance
       >  HBase usually does small, random reads, not bulk


✛  Single Name Node
       >  If Name Node goes offline, cluster is unavailable
       >  Name Node must fit all FS metadata in memory
    ✛  Inefficiencies in read pipeline
       >  Designed for large, streaming reads
       >  Not small, random reads (like HBase use case)




✛  Fine for offline, batch-oriented applications
    ✛  If cluster goes offline, external customers don’t
      notice
    ✛  Can always use separate clusters for different
      groups
    ✛  HBase didn’t exist when Hadoop was first created
       >  MapReduce was the only client application




HDFS PERFORMANCE IMPROVEMENTS
HDFS CPU Improvements: Checksumming

•  HDFS checksums every piece of data in/out
•  Significant CPU overhead
   •  Measure by putting ~1G in HDFS, cat file in a loop
   •  0.20.2: ~30-50% of CPU time is CRC32 computation!
•  Optimizations:
   •  Switch to “bulk” API: verify/compute 64KB at a time
      instead of 512 bytes (better instruction cache locality,
      amortize JNI overhead)
   •  Switch to CRC32C polynomial, SSE4.2, highly tuned
      assembly (~8 bytes per cycle with instruction level
      parallelism!)
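The bulk-API change can be illustrated with a short sketch (zlib's CRC32 stands in for CRC32C, which needs SSE4.2 or a native library; chunk sizes match the slide):

```python
# Per-chunk checksumming at two granularities: the win of the bulk API is
# far fewer checksum calls per block (amortized JNI/call overhead, better
# instruction-cache locality), not a different checksum result per chunk.
import zlib

def chunk_checksums(data, chunk_size):
    """One CRC per chunk, as HDFS stores alongside block data."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

data = bytes(range(256)) * 1024          # 256 KB of sample data
fine = chunk_checksums(data, 512)        # old path: one call per 512 B
bulk = chunk_checksums(data, 64 * 1024)  # bulk API: one call per 64 KB

print(len(fine), len(bulk))  # 512 4 -- 128x fewer calls per block
```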


    11                 Copyright 2011 Cloudera Inc. All rights reserved
Checksum improvements (lower is better)

 [Chart: CDH3u0 vs optimized, comparing random-read latency, random-read
 CPU usage, and sequential-read CPU usage; random-read latency drops
 from 1360us to 760us]

 Post-optimization: only 16% overhead vs un-checksummed access
 Maintain ~800MB/sec from a single thread reading OS cache

HDFS Random access

•  0.20.2:
    •  Each individual read operation reconnects to
       DataNode
    •  Significant TCP handshake overhead, thread
       creation, etc.
•  2.0.0:
    •  Clients cache open sockets to each datanode (like
       HTTP Keepalive)
    •  Local readers can bypass the DN in some
       circumstances to directly read data
    •  Rewritten BlockReader to eliminate a data copy
    •  Eliminated lock contention in DataNode’s
       FSDataset class
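The socket-caching change can be modeled in a few lines (a toy cache, not the real client code): reuse one open connection per DataNode instead of reconnecting on every read.

```python
# Toy keepalive-style connection cache in the spirit of the 2.0.0 client:
# the first read to a DataNode pays the connection cost, later reads reuse it.
class SocketCache:
    def __init__(self):
        self.cache = {}
        self.dials = 0        # counts real connection setups

    def get(self, dn_addr):
        conn = self.cache.get(dn_addr)
        if conn is None:
            self.dials += 1   # TCP handshake + thread setup happen here
            conn = f"conn-to-{dn_addr}"
            self.cache[dn_addr] = conn
        return conn

cache = SocketCache()
for _ in range(1000):         # 1000 random reads against one DataNode
    cache.get("dn-a:50010")
print(cache.dials)            # 1 -- versus 1000 reconnects in 0.20.2
```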

Random-read micro benchmark (higher is better)

       Speed (MB/sec)        0.20.2   Trunk (no native)   Trunk (native)
       4 threads, 1 file        106                 253              299
       16 threads, 1 file       247                 488              635
       8 threads, 2 files       187                 477              633

       TestParallelRead benchmark, modified to 100% random read
       proportion.
       Quad core Core i7 Q820@1.73GHz
Random-read macro benchmark (HBase YCSB)

       [Chart: reads/sec over time; CDH4 sustains substantially higher
       reads/sec than CDH3u1]
HDFS FEDERATION ARCHITECTURE
✛  Instead of one Name Node per cluster, several
   >  Before: Only one Name Node, many Data Nodes
   >  Now: A handful of Name Nodes, many Data Nodes
✛  Distribute file system metadata between the
  NNs
✛  Each Name Node operates independently
   >  Potentially overlapping ranges of block IDs
   >  Introduce a new concept: block pool ID
   >  Each Name Node manages a single block pool
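Block pool IDs resolve the overlapping-ID problem above. A minimal sketch (pool names are illustrative): DataNodes key their storage by (block pool ID, block ID), so two independent NameNodes can allocate the same block ID without colliding.

```python
# Why block pool IDs exist: federated NameNodes allocate block IDs
# independently, so a DataNode disambiguates by (pool, block) pairs.
storage = {}

def store_block(pool_id, block_id, data):
    storage[(pool_id, block_id)] = data

# Both NameNodes independently hand out block ID 1073741825 -- no clash.
store_block("BP-nn1", 1073741825, b"hbase data")
store_block("BP-nn2", 1073741825, b"user data")
print(len(storage))  # 2
```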
HDFS Architecture: Federation
✛  Improve scalability to 6,000+ Data Nodes
    >  Bumping into single Data Node scalability now
 ✛  Allow for better isolation
    >  Could locate HBase dirs on dedicated Name Node
    >  Could locate /user dirs on dedicated Name Node
 ✛  Clients still see unified view of FS namespace
    >  Use ViewFS – client side mount table configuration
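ViewFS-style resolution can be sketched as a longest-prefix match over a client-side mount table (mount points and NameNode hostnames below are illustrative, not real config syntax):

```python
# Client-side mount table: each namespace prefix maps to the NameNode
# responsible for it; the longest matching prefix wins.
MOUNTS = {
    "/hbase": "nn-hbase.example.com",
    "/user":  "nn-user.example.com",
    "/":      "nn-default.example.com",
}

def resolve(path):
    matches = (m for m in MOUNTS
               if path == m or path.startswith(m.rstrip("/") + "/"))
    return MOUNTS[max(matches, key=len)]

print(resolve("/hbase/data"))   # nn-hbase.example.com
print(resolve("/tmp/scratch"))  # nn-default.example.com
```

Because the table lives in client configuration, the cluster still presents one namespace even though metadata is split across NameNodes.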


     Note: Federation != Increased Availability

HDFS HIGH AVAILABILITY ARCHITECTURE
Current HDFS Availability & Data Integrity

•  Simple design, storage fault tolerance
   •  Storage: Rely on OS’s file system rather
      than use raw disk
   •  Storage Fault Tolerance: multiple replicas,
      active monitoring
   •  Single NameNode Master
  •  Persistent state: multiple copies + checkpoints
  •  Restart on failure




Current HDFS Availability & Data Integrity

•  How well did it work?

•  Lost 19 out of 329 Million blocks on 10 clusters with 20K
  nodes in 2009
   •  7-9’s of reliability, and that bug was fixed in 0.20


•  18-month study: 22 failures on 25 clusters - 0.58 failures
  per cluster per year
   •  Only 8 would have benefited from HA failover!! (0.23
     failures per cluster-year)



So why build an HA NameNode?

•  Most cluster downtime in practice is planned
  downtime
   •  Cluster restart for a NN configuration change (e.g.
      new JVM configs, new HDFS configs)
   •  Cluster restart for a NN hardware upgrade/repair
   •  Cluster restart for a NN software upgrade (e.g. new
      Hadoop, new kernel, new JVM)
•  Planned downtimes cause the vast majority of
  outages!

•  Manual failover solves all of the above!
   •  Failover to NN2, fix NN1, fail back to NN1, zero
      downtime
Approach and Terminology
•  Initial goal: Active-Standby with Hot
  Failover

•  Terminology
   •  Active NN: actively serves read/write
      operations from clients
   •  Standby NN: waits, becomes active when
      Active dies or is unhealthy
   •  Hot failover: standby able to take over
      instantly

HDFS Architecture: High Availability

•  Single NN configuration; no failover
•  Active and Standby with manual failover
   •  Addresses downtime during upgrades – main
      cause of unavailability
•  Active and Standby with automatic
  failover
   •  Addresses downtime during unplanned outages
       (kernel panics, bad memory, double PDU failure,
       etc)
    •  See HDFS-1623 for detailed use cases
•  With Federation each namespace volume has an
   active-standby NameNode pair

HDFS Architecture: High Availability

•  Failover controller outside NN
•  Parallel Block reports to Active and
   Standby
•  NNs share namespace state via a shared
   edit log
   •  NAS or Journal Nodes
   •  Like RDBMS “log shipping replication”
•  Client failover
   •  Smart clients (e.g. configuration, or ZooKeeper for
      coordination)
   •  IP Failover in the future
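The "smart client" idea can be sketched as a configuration-based failover loop (hostnames and the simulated active/standby behavior are illustrative): try each configured NameNode in order until one accepts the call.

```python
# Toy client-side failover: walk the configured NameNode list, retrying
# on connection failure, as a configuration-driven failover proxy would.
def call_with_failover(namenodes, op):
    last_err = None
    for nn in namenodes:
        try:
            return op(nn)
        except ConnectionError as e:   # standby or dead NN rejects the call
            last_err = e
    raise last_err

def get_listing(nn):
    if nn != "nn2.example.com":        # pretend only nn2 is currently active
        raise ConnectionError(f"{nn} is not active")
    return ["/user", "/hbase"]

print(call_with_failover(["nn1.example.com", "nn2.example.com"], get_listing))
```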
HDFS ARCHITECTURE: WHAT’S NEXT
✛  Increase scalability of single Data Node
   >  Currently the most-noticed scalability limit
✛  Support for point-in-time snapshots
   >  To better support DR, backups
✛  Completely separate block / namespace layers
   >  Increase scalability even further, new use cases
✛  Fully distributed NN metadata
   >  No pre-determined “special nodes” in the system