
Strata + Hadoop World 2012: HDFS: Now and Future


Hadoop 1.0 is a significant milestone: the most stable and robust Hadoop release, tested in production against a variety of applications. It offers improved performance, support for HBase, disk-fail-in-place, WebHDFS, and more over previous releases. The next major release, Hadoop 2.0, offers several significant HDFS improvements, including a new append pipeline, federation, wire compatibility, NameNode HA, and further performance improvements. We describe how to take advantage of the new features and their benefits. We also discuss some of the misconceptions and myths about HDFS.



  1. HDFS: Now and Future – Todd Lipcon (Cloudera), Sanjay Radia (Hortonworks)
  2. Outline
     Part 1 – Todd Lipcon (Cloudera)
     • Namenode HA
     • HDFS Performance improvements
     • Taking advantage of next-gen hardware
     • Storage Efficiency (RAID and compression)
     Part 2 – Sanjay Radia (Hortonworks)
     • Federation and Generalized storage service
       – Leverage it for further innovation
     • Snapshots
     • Other
       – WebHDFS
       – Wire compatibility
     (O'Reilly Strata & Hadoop World)
  3. HDFS HA in Hadoop 2.0.0
     • Initial implementation last year
       – Introduced Standby NameNode and manual hot failover (see Hadoop World 2011 presentation)
         • Handled planned maintenance (e.g. upgrades) but not unplanned failures
       – Required a highly available NFS filer to store NameNode metadata
         • Complicated and expensive to set up
  4. HDFS HA Phase 2
     • Automatic failover
       – Uses Apache ZooKeeper to automatically detect NameNode failures and trigger a failover
       – Ops may invoke manual failover for planned maintenance windows
     • Removed dependency on NFS storage
       – HDFS HA is entirely self-contained
       – No special hardware or software required
       – No SPOF anywhere in the system
  5. Automatic Failover
     • Each NameNode has a new companion process, the ZooKeeperFailoverController (ZKFC)
       – Maintains a session to ZooKeeper
       – Periodically runs a health check against its local NameNode to verify that it is running properly
     • Triggers failover if the health check fails or the ZK session expires
     • Operators may still issue manual failover commands for planned maintenance
     • Failover time: 30-40 seconds unplanned; 0-3 seconds planned
     • Handles all types of faults: machine, software, network, etc.
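The ZKFC decision rule above can be sketched as a tiny loop. This is a minimal illustration under stated assumptions, not the actual org.apache.hadoop.ha.ZKFailoverController code: the health check, ZooKeeper session, and failover trigger are hypothetical stand-in objects.

```python
class FailoverController:
    """Minimal sketch of the ZKFC decision loop: trigger a failover when
    the local NameNode health check fails or the ZK session expires."""

    def __init__(self, health_check, zk_session, trigger_failover):
        self.health_check = health_check        # callable -> bool
        self.zk_session = zk_session            # object with an .expired flag
        self.trigger_failover = trigger_failover

    def tick(self):
        """One monitoring cycle; returns True if a failover was triggered."""
        if not self.health_check() or self.zk_session.expired:
            self.trigger_failover()
            return True
        return False
```

In the real system the ZKFC also competes for a ZooKeeper lock so that exactly one NameNode becomes active; that election step is omitted here.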
  6. Removed NFS/filer dependency
     • Shared storage on NFS is practical for some organizations, but difficult for others
       – Complex configuration, custom fencing scripts
       – The filer itself must be highly available
       – Expensive to buy, expensive to support
       – Buggy NFS clients in Linux
     • Introduced a new system for reliable edit log storage: QuorumJournalManager
  7. QuorumJournalManager
     • Run 3 or 5 JournalNodes, collocated on existing hardware
     • Each edit must be committed to a majority of the nodes (i.e., a quorum)
       – A minority of nodes may crash or be slow without affecting system availability
       – Run N nodes to tolerate (N-1)/2 failures (same as ZooKeeper)
     • Built into HDFS
       – Designed for existing Hadoop ops teams to understand
       – Hadoop Metrics support, full Kerberos support, etc.
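The quorum arithmetic in the bullets above is worth making concrete; a few lines capture both the commit rule and the fault tolerance:

```python
def quorum_size(n):
    """An edit is committed once a majority of the n JournalNodes ack it."""
    return n // 2 + 1

def tolerated_failures(n):
    """N JournalNodes tolerate (N-1)//2 failures, same as ZooKeeper."""
    return (n - 1) // 2
```

So 3 JournalNodes commit on 2 acks and tolerate 1 failure; 5 commit on 3 acks and tolerate 2. Note that 4 nodes tolerate no more failures than 3, which is why odd ensemble sizes are recommended.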
  8. HDFS HA Architecture (with Automatic Failover and QuorumJournalManager)
     [Architecture diagram: a ZooKeeper ensemble; a FailoverController beside each NameNode, heartbeating to ZooKeeper and monitoring the health of its local NN, OS, and hardware; Active and Standby NameNodes sharing edit-log state through a quorum of JournalNodes; DataNodes send block reports to both the Active and the Standby; DN fencing: DataNodes only obey commands from the Active.]
  9. HA Improvements Summary
     • Automatic failover
       – Avoids both planned and unplanned downtime
     • Non-NFS shared storage
       – No need to buy or configure a filer
     • Result: HA with no external dependencies
     • Available now in HDFS trunk and CDH4.1
     • Come to our 5pm talk in this room for more details on these HA improvements!
  10. HDFS Performance Update: 2.x vs 1.x
      • Significant speedups from SSE4.2 hardware checksum calculation (2.5-3x less CPU on the read path)
      • Rewritten read path with fewer memory copies
      • Short-circuit reads past the DataNode for 2-3x faster random reads (HBase workloads)
      • I/O scheduling improvements: push down hints to Linux using posix_fadvise()
      • Covered in my presentation from Hadoop World 2011
  11. HDFS Performance: Recent Work
      • Completed
        – Zero-copy read for libhdfs (2-3x improvement for C++ clients like Impala reading cached data)
        – Expose the mapping of blocks to disks: 2x improvement by avoiding contention on slower drives (HDFS-3672)
      • In progress
        – Using native checksum computation on the write path
        – Avoiding copies and allocation on the write path
  12. HDFS Performance Benchmarks (as of June 2012)
      [Bar chart: read and write throughput in MB/sec for raw ext4, HDFS, and HDFS with disk awareness; reads approach the raw-disk maximum.]
      • Dual quad-core, 12x2T 7200RPM drives; measured max disk throughput at 900MB/sec
      • Write throughput is CPU-bound; improvements in progress bring it to max disk throughput as well
      • Easily saturates SATA3 bus bandwidth on common hardware
  13. Hardware Trends
      • Denser storage
        – 36T per node already common
        – Millions of blocks per DN
          • New need to invest in scaling DataNode memory usage
      • More RAM
        – 64GB common today; 256GB soon inexpensive
        – Customers want to explicitly pin recently ingested data in RAM (especially with efficient query engines like Impala)
      • Solid state storage (SSD, FusionIO, etc.)
        – HDFS should transparently or explicitly migrate hot random-access data to/from flash
        – Hierarchical storage management
  14. HDFS Storage Efficiency
      • Many customers are expanding their clusters simply to add storage
        – How can we better utilize the disks they already have?
      • RAID (Reed-Solomon coding)
        – Store blocks at low replication; keep parity blocks to allow reconstruction if blocks are lost
        – Effective replication: 1.5x with the same durability, but less locality
      • Transparent compression
        – Automatically detect infrequently used files; transparently re-compress with Snappy, GZip, bz2, or LZMA
        – Cloudera workload traces indicate 10% of files are accessed 90% of the time!
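The "effective replication" figure above follows directly from the stripe geometry. The slide states only the 1.5x result; the 10-data/5-parity split below is an illustrative assumption that yields that ratio, not a scheme the talk specifies:

```python
def effective_replication(data_blocks, parity_blocks):
    """Storage overhead of Reed-Solomon coding relative to the raw data:
    each stripe stores data_blocks + parity_blocks blocks, each at
    replication 1, instead of 3 full copies of every data block."""
    return (data_blocks + parity_blocks) / data_blocks
```

A (10, 5) stripe gives 15 blocks stored for 10 blocks of data, i.e. 1.5x, versus 3.0x for plain triplication; the trade-off is that reconstruction requires reading surviving stripe members, and data locality suffers.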
  15. Outline
      Part 1 – Todd Lipcon (Cloudera)
      • Namenode HA
      • HDFS Performance improvements
      • Taking advantage of next-gen hardware
      • Storage Efficiency (RAID and compression)
      Part 2 – Sanjay Radia (Hortonworks)
      • Federation and Generalized storage service
        – Leverage it for further innovation
      • Snapshots
      • Other
        – WebHDFS
        – Wire compatibility
      (HA in Hadoop 1!)
  16. Federation: Generalized Block Storage
      [Diagram: NameNodes NN-1 … NN-k … NN-n, each owning a namespace (NS1 … NS k … NS n, plus a foreign namespace) with its own block pool; DataNodes DN 1 … DN m below provide common storage for all block pools.]
      • Block storage as a generic storage service
        – The set of blocks for a Namespace Volume is called a Block Pool
        – DNs store blocks for all the Namespace Volumes – no partitioning
      • Multiple independent Namenodes and Namespace Volumes in a cluster
        – Namespace Volume = Namespace + Block Pool
  17. HDFS’ Generic Storage Service: Opportunities for Innovation
      [Diagram: the HDFS namespace, alternate NN implementations, HBase, and MR tmp all layered over the shared storage service.]
      • Federation – distributed (partitioned) namespace
        – Simple and robust due to independent masters
        – Scalability, isolation, availability
      • New services – independent block pools
        – New FS with only a partial namespace in memory
        – MR tmp storage directly on block storage
        – Shadow file system – caches HDFS, NFS, S3
      • Future: move block management into the DataNodes
        – Simplifies namespace/application implementation
        – A distributed namenode becomes significantly simpler
  18. Managing Namespaces
      [Diagram: a client-side mount table mapping /data, /project, /home, /tmp onto namespaces NS1-NS4.]
      • Federation has multiple namespaces
      • Don’t you need a single global namespace?
        – Some tenants want a private namespace
        – Do you create a single DB or a single table? Many volumes; share what you want
        – Global? The key is to share the data and the names used to access the data
      • A client-side mount table can implement global or private namespaces
        – Shared mount table => “global” shared view
        – Personalized mount table => per-application view
          • Share the data that matters by mounting it
      • Client-side implementation of mount tables
        – xInclude from a shared place – global view
        – No single point of failure
        – No hotspot for root and top-level directories
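A client-side mount table of the kind described above is what Hadoop's ViewFs provides. A configuration might look roughly like the following sketch; the mount-table name `clusterX` and the namenode hostnames are placeholders, not values from the talk:

```xml
<configuration>
  <!-- Clients resolve paths through the viewfs mount table "clusterX" -->
  <property>
    <name>fs.defaultFS</name>
    <value>viewfs://clusterX</value>
  </property>
  <!-- /data is served by one namespace volume ... -->
  <property>
    <name>fs.viewfs.mounttable.clusterX.link./data</name>
    <value>hdfs://nn1.example.com/data</value>
  </property>
  <!-- ... and /tmp by another; the mapping is invisible to applications -->
  <property>
    <name>fs.viewfs.mounttable.clusterX.link./tmp</name>
    <value>hdfs://nn2.example.com/tmp</value>
  </property>
</configuration>
```

Because the table lives in client configuration (shareable via xInclude), there is no central mount server to fail or to become a hotspot.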
  19. Next Steps… First-Class Support for Volumes
      [Diagram: NameServers as containers of namespaces above a DataNode storage layer.]
      • NameServer – a container for namespaces
        – Lots of small namespace volumes
          • Chosen per user, tenant, data feed
          • Management policies (quota, …)
        – Mount tables for a unified namespace
          • Centrally managed (xInclude, ZK, …)
      • Keep only the working set of the namespace in memory
        – Break away from the old NN’s full-namespace-in-memory design
        – Faster startup; billions of names; hundreds of volumes
      • Number of NameServers determined by
        – Sum of (namespace working set)
        – Sum of (namespace throughput)
        – Move namespaces between servers for balancing
  20. Snapshots
      • Take a snapshot of any directory
        – Multiple snapshots allowed
      • Snapshot metadata is stored in the NameNode
        – Datanodes have no knowledge of snapshots
        – Blocks are shared
      • All regular commands/APIs can be used against snapshots
        – cp /foo/bar/.snapshot/x/y /a/b/z
      • New CLIs to create and delete snapshots
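The snapshot CLI was still under development at the time of this talk; as eventually shipped in Hadoop 2.1, the workflow looks roughly like this (illustrative only, requires a running cluster; paths follow the slide's example):

```shell
# Mark a directory as snapshottable (administrator)
hdfs dfsadmin -allowSnapshot /foo/bar

# Take a named snapshot; no data is copied, blocks are shared
hdfs dfs -createSnapshot /foo/bar s0

# Read from the snapshot with ordinary commands via the .snapshot path
hdfs dfs -cp /foo/bar/.snapshot/s0/y /a/b/z

# Delete the snapshot when it is no longer needed
hdfs dfs -deleteSnapshot /foo/bar s0
```

Creating a snapshot is a metadata-only operation in the NameNode, which is why it is effectively instantaneous regardless of directory size.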
  21. Snapshots – Status
      • HDFS-2802 (feature branch)
        – Initial design and prototype – March 2012
        – Development active
          • Updated design document and test plan posted
          • Review meeting – 1st week of November
          • 15+ patches
        – Expected completion – early December!
  22. Enterprise Use Cases
      • Storage fault tolerance – built into the HDFS architecture
        – Over seven 9s of data reliability
      • High availability
      • Standard interfaces
        – WebHDFS (REST), FUSE, and NFS access
        – HTTPFS (WebHDFS as a farm of proxy servers)
        – libwebhdfs – a pure C library for HDFS
      • Wire protocol compatibility
        – Protocol buffers
      • Rolling upgrades
        – Rolling upgrades for dot-releases
      • Snapshots – under active development
      • Disaster recovery
        – Distcp does parallel and incremental copies across clusters
        – Future: enhance using the journal interface and snapshots
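The WebHDFS REST interface mentioned above addresses files through plain HTTP URLs of the form `http://<namenode>:<port>/webhdfs/v1/<path>?op=<OP>`. A small helper makes the shape of these URLs concrete (the hostname is a placeholder; 50070 is the NameNode HTTP port default of this era):

```python
from urllib.parse import urlencode

def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS REST URL: http://host:port/webhdfs/v1<path>?op=OP&..."""
    query = urlencode(dict(op=op, **params))
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, path, query)
```

For example, `webhdfs_url("nn.example.com", "/user/todd/f.txt", "OPEN", offset=1024)` produces a GET URL that any HTTP client can fetch, which is what makes WebHDFS wire-compatible across Hadoop versions and usable without Java libraries.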
  23. Summary
      • HA for the Namenode
        – Hot failover; shared storage not required (QJM)
      • Performance improvements
      • Utilize today’s and tomorrow’s hardware to full potential
      • Federation and a generalized storage layer
        – Opportunities for innovation
          • Partial namespace in memory, shadow/caching file system, MR tmp, etc.
      • Wire compatibility, WebHDFS, …
      • Snapshots – development well in progress
  24. Questions?