[B4]deview 2012-hdfs


  1. HDFS ARCHITECTURE: How HDFS is evolving to meet new needs
  2. ✛  Aaron T. Myers
     ✛  Hadoop PMC Member / Committer at ASF
     ✛  Software Engineer at Cloudera
     ✛  Primarily work on HDFS and Hadoop Security
  3. ✛  HDFS architecture circa 2010
     ✛  New requirements for HDFS
        >  Random read patterns
        >  Higher scalability
        >  Higher availability
     ✛  HDFS evolutions to address requirements
        >  Read pipeline performance improvements
        >  Federated namespaces
        >  Highly available Name Node
  4. HDFS ARCHITECTURE: 2010
  5. ✛  Each cluster has…
        >  A single Name Node
           ∗  Stores file system metadata
           ∗  Stores “Block ID” -> Data Node mapping
        >  Many Data Nodes
           ∗  Store actual file data
     ✛  Clients of HDFS…
        >  Communicate with the Name Node to browse the file system and get block locations for files
        >  Communicate directly with Data Nodes to read/write files (see the client sketch below)
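To make the Name Node / Data Node / client split concrete, here is a minimal sketch of reading a file through the standard Hadoop FileSystem API: the Name Node lookup and the Data Node streaming both happen behind fs.open(). The cluster URI and path are illustrative placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // open() asks the NameNode for the file's block locations;
        // the returned stream then reads block data directly from DataNodes.
        try (FSDataInputStream in = fs.open(new Path("/user/example/data.txt"))) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
        }
    }
}
```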
  6. (diagram slide: HDFS architecture, 2010)
  7. ✛  Want to support larger clusters
        >  ~4,000 node limit with the 2010 architecture
        >  New nodes beefier than old nodes
           ∗  2009: 8 cores, 16GB RAM, 4x1TB disks
           ∗  2012: 16 cores, 48GB RAM, 12x3TB disks
     ✛  Want to increase availability
        >  With the rise of HBase, HDFS is now serving live traffic
        >  Downtime means immediate user-facing impact
     ✛  Want to improve random read performance
        >  HBase usually does small, random reads, not bulk reads
  8. ✛  Single Name Node
        >  If the Name Node goes offline, the cluster is unavailable
        >  Name Node must fit all FS metadata in memory
     ✛  Inefficiencies in the read pipeline
        >  Designed for large, streaming reads
        >  Not small, random reads (like the HBase use case)
  9. ✛  Fine for offline, batch-oriented applications
     ✛  If the cluster goes offline, external customers don’t notice
     ✛  Can always use separate clusters for different groups
     ✛  HBase didn’t exist when Hadoop was first created
        >  MapReduce was the only client application
  10. HDFS PERFORMANCE IMPROVEMENTS
  11. HDFS CPU Improvements: Checksumming
      •  HDFS checksums every piece of data in/out
      •  Significant CPU overhead
         •  Measured by putting ~1GB in HDFS and cat-ing the file in a loop
         •  0.20.2: ~30-50% of CPU time is CRC32 computation!
      •  Optimizations (see the checksum sketch below):
         •  Switch to a “bulk” API: verify/compute 64KB at a time instead of 512 bytes (better instruction cache locality, amortized JNI overhead)
         •  Switch to the CRC32C polynomial, SSE4.2, highly tuned assembly (~8 bytes per cycle with instruction-level parallelism!)
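As a rough illustration of the bulk-checksum idea (not the actual HDFS native code path), the sketch below computes CRC32C over 64KB chunks using java.util.zip.CRC32C (Java 9+), which modern JVMs back with hardware-accelerated CRC32C instructions. The chunk size matches the slide; the class and method names here are invented for the example.

```java
import java.util.zip.CRC32C;

public class BulkChecksumSketch {
    // Checksum data in 64KB chunks rather than 512-byte chunks,
    // amortizing per-call overhead and improving cache locality.
    static final int CHUNK_SIZE = 64 * 1024;

    public static long[] checksumChunks(byte[] data) {
        int numChunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        long[] sums = new long[numChunks];
        CRC32C crc = new CRC32C();
        for (int i = 0; i < numChunks; i++) {
            int off = i * CHUNK_SIZE;
            int len = Math.min(CHUNK_SIZE, data.length - off);
            crc.reset();
            crc.update(data, off, len);   // one bulk update per chunk
            sums[i] = crc.getValue();
        }
        return sums;
    }
}
```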
  12. Checksum improvements (lower is better)
      (chart: random-read latency, random-read CPU usage, and sequential-read CPU usage for CDH3u0 vs optimized; random-read latency drops from 1360us to 760us)
      •  Post-optimization: only 16% overhead vs un-checksummed access
      •  Maintain ~800MB/sec from a single thread reading the OS cache
  13. HDFS Random access
      •  0.20.2:
         •  Each individual read operation reconnects to the DataNode
         •  Much TCP handshake overhead, thread creation, etc.
      •  2.0.0:
         •  Clients cache open sockets to each DataNode (like HTTP keepalive; see the cache sketch below)
         •  Local readers can bypass the DataNode in some circumstances to read data directly
         •  Rewritten BlockReader to eliminate a data copy
         •  Eliminated lock contention in the DataNode’s FSDataset class
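The socket-caching idea can be pictured with the toy cache below: sockets are keyed by DataNode address and reused across reads instead of paying a fresh TCP handshake each time. This is a simplified stand-in, not the real DFSClient code; the class and field names are invented for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of keepalive-style socket reuse per DataNode.
public class DataNodeSocketCache {
    private final Map<InetSocketAddress, Deque<Socket>> cache = new HashMap<>();

    // Reuse a cached socket if one is available, otherwise connect.
    public synchronized Socket getSocket(InetSocketAddress dataNode) throws IOException {
        Deque<Socket> idle = cache.get(dataNode);
        if (idle != null) {
            while (!idle.isEmpty()) {
                Socket s = idle.pollFirst();
                if (!s.isClosed()) {
                    return s;           // skip the TCP handshake entirely
                }
            }
        }
        return new Socket(dataNode.getAddress(), dataNode.getPort());
    }

    // Return a socket to the cache after a read completes.
    public synchronized void release(InetSocketAddress dataNode, Socket s) {
        cache.computeIfAbsent(dataNode, k -> new ArrayDeque<>()).addLast(s);
    }
}
```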
  14. Random-read micro benchmark (higher is better), speed in MB/sec:

      | Workload           | 0.20.2 | Trunk (no native) | Trunk (native) |
      |--------------------|--------|-------------------|----------------|
      | 4 threads, 1 file  | 106    | 253               | 299            |
      | 16 threads, 1 file | 247    | 488               | 635            |
      | 8 threads, 2 files | 187    | 477               | 633            |

      TestParallelRead benchmark, modified to 100% random read proportion. Quad-core Core i7 Q820 @ 1.73GHz.
  15. Random-read macro benchmark (HBase YCSB)
      (chart: reads/sec over time; CDH4 sustains higher throughput than CDH3u1)
  16. HDFS FEDERATION ARCHITECTURE
  17. ✛  Instead of one Name Node per cluster, several
        >  Before: Only one Name Node, many Data Nodes
        >  Now: A handful of Name Nodes, many Data Nodes
     ✛  Distribute file system metadata between the NNs
     ✛  Each Name Node operates independently
        >  Potentially overlapping ranges of block IDs
        >  Introduce a new concept: block pool ID (see the sketch below)
        >  Each Name Node manages a single block pool
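Conceptually, a block is no longer identified by its block ID alone but by the pair (block pool ID, block ID), so independent Name Nodes can hand out overlapping block IDs without colliding on the shared Data Nodes. The record below is a simplified stand-in for such an identifier; the class and field names are illustrative, not HDFS's actual classes.

```java
import java.util.Objects;

// Hypothetical, simplified identifier: a block is unique only within its block pool,
// so federated NameNodes may reuse block IDs without conflict.
public final class FederatedBlockId {
    private final String blockPoolId;  // one pool per NameNode
    private final long blockId;        // unique only within the pool

    public FederatedBlockId(String blockPoolId, long blockId) {
        this.blockPoolId = blockPoolId;
        this.blockId = blockId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FederatedBlockId)) return false;
        FederatedBlockId other = (FederatedBlockId) o;
        return blockId == other.blockId && blockPoolId.equals(other.blockPoolId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(blockPoolId, blockId);
    }
}
```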
  18. HDFS Architecture: Federation (diagram slide)
  19. ✛  Improve scalability to 6,000+ Data Nodes
        >  Bumping into single Data Node scalability now
     ✛  Allow for better isolation
        >  Could locate HBase dirs on a dedicated Name Node
        >  Could locate /user dirs on a dedicated Name Node
     ✛  Clients still see a unified view of the FS namespace
        >  Use ViewFS, a client-side mount table configuration (see the sketch below)
     Note: Federation != Increased Availability
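The ViewFS mount table is purely client-side configuration. The sketch below sets it programmatically through Hadoop's Configuration object so that /hbase and /user resolve to different Name Nodes; the cluster name, hostnames, and paths are illustrative, and in practice these keys usually live in core-site.xml rather than code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Clients address the cluster through a viewfs:// URI...
        conf.set("fs.defaultFS", "viewfs://clusterX");

        // ...and a client-side mount table maps paths to specific NameNodes.
        conf.set("fs.viewfs.mounttable.clusterX.link./hbase",
                 "hdfs://nn-hbase.example.com:8020/hbase");
        conf.set("fs.viewfs.mounttable.clusterX.link./user",
                 "hdfs://nn-user.example.com:8020/user");

        FileSystem fs = FileSystem.get(conf);
        // This open() is routed to nn-user, while /hbase paths go to nn-hbase.
        fs.open(new Path("/user/example/data.txt")).close();
    }
}
```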
  20. HDFS HIGH AVAILABILITY ARCHITECTURE
  21. Current HDFS Availability & Data Integrity
      •  Simple design, storage fault tolerance
         •  Storage: rely on the OS’s file system rather than raw disk
         •  Storage fault tolerance: multiple replicas, active monitoring
      •  Single NameNode master
         •  Persistent state: multiple copies + checkpoints
         •  Restart on failure
  22. Current HDFS Availability & Data Integrity
      •  How well did it work?
      •  Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009
         •  Seven 9’s of reliability, and that bug was fixed in 0.20
      •  18-month study: 22 failures on 25 clusters, i.e. 0.58 failures per cluster per year
         •  Only 8 would have benefited from HA failover!! (0.23 failures per cluster-year)
  23. So why build an HA NameNode?
      •  Most cluster downtime in practice is planned downtime
         •  Cluster restart for a NN configuration change (e.g. new JVM configs, new HDFS configs)
         •  Cluster restart for a NN hardware upgrade/repair
         •  Cluster restart for a NN software upgrade (e.g. new Hadoop, new kernel, new JVM)
      •  Planned downtime causes the vast majority of outages!
      •  Manual failover solves all of the above!
         •  Fail over to NN2, fix NN1, fail back to NN1: zero downtime
  24. Approach and Terminology
      •  Initial goal: Active-Standby with hot failover
      •  Terminology
         •  Active NN: actively serves read/write operations from clients
         •  Standby NN: waits, becomes active when the Active dies or is unhealthy
         •  Hot failover: the Standby is able to take over instantly
  25. HDFS Architecture: High Availability
      •  Single NN configuration; no failover
      •  Active and Standby with manual failover
         •  Addresses downtime during upgrades, the main cause of unavailability
      •  Active and Standby with automatic failover
         •  Addresses downtime during unplanned outages (kernel panics, bad memory, double PDU failure, etc.)
         •  See HDFS-1623 for detailed use cases
      •  With Federation, each namespace volume has an active-standby NameNode pair
  26. HDFS Architecture: High Availability
      •  Failover controller outside the NN
      •  Parallel block reports to Active and Standby
      •  NNs share namespace state via a shared edit log
         •  NAS or Journal Nodes
         •  Like RDBMS “log shipping” replication
      •  Client failover
         •  Smart clients (e.g. configuration, or ZooKeeper for coordination); see the configuration sketch below
         •  IP failover in the future
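A “smart client” learns about both NameNodes purely from configuration. The sketch below shows the relevant HA settings set programmatically (they normally live in hdfs-site.xml); the nameservice name, hostnames, and JournalNode addresses are illustrative. The shared-edits key is a server-side setting, included only to show how the shared edit log is expressed; the failover proxy provider is the one shipped with Hadoop 2.x.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Clients address the HA pair by a logical nameservice, not a single host.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");

        // Shared edit log: either an NFS/NAS path or a JournalNode quorum (server-side setting).
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");

        // Client-side proxy that retries against whichever NN is currently Active.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The resulting FileSystem transparently fails over between nn1 and nn2.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.getUri());
    }
}
```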
  27. HDFS Architecture: High Availability (diagram slide)
  28. HDFS ARCHITECTURE: WHAT’S NEXT
  29. ✛  Increase scalability of a single Data Node
        >  Currently the most-noticed scalability limit
     ✛  Support for point-in-time snapshots
        >  To better support DR and backups
     ✛  Completely separate the block / namespace layers
        >  Increase scalability even further, enable new use cases
     ✛  Fully distributed NN metadata
        >  No pre-determined “special nodes” in the system
