March 2011 HUG: HDFS Federation


Slides from the HDFS Federation talk at the March 2011 HUG by Suresh Srinivas.


  1. HDFS Federation
     Suresh Srinivas, Yahoo! Inc
  2. Single Namenode Limitations
     - Namespace
       - The NN process stores the entire metadata in memory
       - The number of objects (files + blocks) is limited by the heap size
       - A 50 GB heap for 200 million objects supports 4000 DNs and 12 PB of storage at a 40 MB average file size
       - Storage growth: DN storage grows from 4 TB to 36 TB, and cluster size to 8000 DNs => storage grows from 12 PB to over 100 PB
     - Performance
       - File system operations are limited to a single NN's throughput
       - Bottleneck for the next generation of MapReduce
     - Isolation
       - Experimental apps can affect production apps
     - Cluster availability
       - Failure of the single namenode brings down the entire cluster
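The sizing figures on this slide can be reproduced with a back-of-envelope calculation. This sketch uses the slide's numbers; the assumptions of roughly one block per file at a 40 MB average file size and 3x replication are mine, inferred to make the arithmetic match, not stated in the deck:

```python
# Back-of-envelope namenode capacity estimate using the slide's numbers.
# Assumptions (not from the deck): ~1 block per file at a 40 MB average
# file size, and 3x replication to get from logical to raw cluster storage.

HEAP_BYTES = 50 * 1024**3      # 50 GB namenode heap
OBJECTS = 200_000_000          # files + blocks held in that heap
AVG_FILE_MB = 40               # average file size
REPLICATION = 3                # default HDFS replication factor

bytes_per_object = HEAP_BYTES / OBJECTS        # ~268 bytes of heap per object
files = OBJECTS // 2                           # half the objects are files, half blocks
raw_storage_pb = files * AVG_FILE_MB * REPLICATION / 1e9   # 1 PB ~ 1e9 MB (decimal)

print(f"~{bytes_per_object:.0f} bytes of heap per namespace object")
print(f"~{raw_storage_pb:.0f} PB of raw storage across the cluster")
```

With these assumptions, 100 million files at 40 MB each, replicated 3x, comes out to the slide's 12 PB, and 12 PB spread over 4000 DNs is about 3 TB per node, consistent with DN capacities of the era.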
  3. Scaling the Name Service
     [Chart, not to scale: number of names (100M, 200M, 1B, 2B, 10B, 20B) vs. number of clients (1x, 4x, 20x, 50x, 100x), comparing scaling options: archives; all namespace in memory; partial namespace (cache) in memory; multiple namespace volumes; a distributed namenode with partial namespace in memory; and separating block management from the NN. Namespace volumes give good isolation properties. Block reports for billions of blocks require rethinking the block layer.]
  4. Why Vertical Scaling Is Not Sufficient
     - Why not use NNs with 512 GB of memory?
       - Startup time is huge: currently 30 minutes to 2 hours for a 50 GB NN heap
       - Stop-the-world GC pauses can bring down the cluster
         - All DNs could be declared dead
       - Debugging problems with a large JVM heap is harder
     - Optimizing NN memory usage is expensive
       - Changes in trunk reduce memory used, but at the cost of development time and code complexity
       - Diminishing returns
  5. Why Federation?
     - Simplicity
       - Simpler, more robust design
       - Multiple independent namenodes
       - Core development in 3.5 months
       - Changes mostly in the Datanode, configuration, and tools
       - Very little change in the Namenode
     - Simpler implementation than a distributed namenode
       - Less scalability, but it serves the immediate needs
     - Federation is an optional feature
       - The existing single-NN configuration is supported as is
  6. HDFS Background
     [Diagram: a Namenode (namespace + block management) above a row of Datanodes providing physical storage.]
     HDFS has 2 main layers:
     - Namespace management
       - Manages the namespace consisting of directories, files, and blocks
       - Supports file system operations such as create/modify/list files and directories
     - Block storage
       - Block management
         - Manages DN membership
         - Supports add/delete/modify/get block location
         - Manages replication and replica placement
       - Physical storage
         - Supports read/write access to blocks
  7. Federation
     [Diagram: namenodes NN-1 ... NN-k ... NN-n, each serving a namespace (NS1 ... NSk ... foreign NS n) with its own block pool; datanodes 1 ... m store blocks for all the block pools; a balancer works across them.]
     - Multiple independent namenodes/namespaces in a cluster
     - NNs provide both namespace and block management
     - DNs form a common storage layer
       - Each stores blocks for all the block pools
     - Non-HDFS namespaces can share the same storage
  Federated HDFS Cluster
     Current HDFS:
     - 1 namespace
     - 1 set of blocks
     - Implemented as 1 namenode and a set of datanodes
     Federated HDFS:
     - Multiple independent namespaces; a namespace uses 1 block pool
     - Multiple independent sets of blocks; a block pool is the set of blocks belonging to a single namespace
     - Implemented as multiple namenodes and a set of datanodes; each datanode stores blocks for all block pools
  12. Datanode Changes
     - A thread per NN
       - Registers with all the NNs
       - Sends a periodic heartbeat to all the NNs with a utilization summary
       - Sends a block report to each NN for its block pool
       - NNs can be added/removed/upgraded on the fly
     - Block pools
       - Created automatically when the DN talks to an NN
       - A block is identified by ExtendedBlockID = BlockPoolID + BlockID
       - Block pool IDs are unique across clusters, which enables merging clusters
     - DN data structures are "indexed" by BPID
       - BlockMap, storage, etc. are indexed by BPID
       - Upgrade/rollback happens per block pool / per NN
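The per-block-pool indexing described above can be sketched as follows. This is an illustrative model in Python, not the actual Java implementation; the class and method names are invented, and the block pool ID strings merely mimic the real format:

```python
from collections import defaultdict

# Illustrative sketch of datanode state keyed by block pool ID (BPID).
# An extended block ID is simply the pair (block_pool_id, block_id).

class DataNodeState:
    def __init__(self):
        # Per-pool block maps: BPID -> {block_id: replica_info}.
        # defaultdict mirrors how a DN creates a pool lazily the first
        # time it talks to that pool's namenode.
        self.block_maps = defaultdict(dict)

    def add_block(self, bpid, block_id, replica_info):
        """Store a replica under its block pool."""
        self.block_maps[bpid][block_id] = replica_info

    def block_report(self, bpid):
        """A block report goes to one NN and covers only that NN's pool."""
        return sorted(self.block_maps[bpid])

dn = DataNodeState()
dn.add_block("BP-1-nn1.example.com-1300000000000", 1001, {"len": 64})
dn.add_block("BP-2-nn2.example.com-1300000000001", 1001, {"len": 32})
# The same numeric BlockID can exist in two pools without conflict:
print(dn.block_report("BP-1-nn1.example.com-1300000000000"))  # [1001]
```

Because every structure hangs off the BPID, operations like upgrade and rollback can act on one pool's sub-tree without touching the others, which is the property the slide is pointing at.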
  13. Other Changes
     - Decommissioning
       - Tools to initiate and monitor decommissioning at all the NNs
     - Balancer
       - Allows balancing at the datanode or block pool level
     - Datanode daemons
       - Disk scanner and directory scanner adapted to federation
     - NN Web UI
       - Additionally shows the NN's block pool storage utilization
  14. New Cluster Manager Web UI
     - Cluster summary
       - Shows overall cluster storage utilization
     - List of namenodes
       - For each NN: BPID, storage utilization, number of missing blocks, number of live and dead DNs
       - NN link to go to the NN Web UI
     - Decommissioning status of DNs
  15. Managing Namespaces
     - Federation has multiple namespaces: don't you need a single global namespace?
     - The key is to share the data and the names used to access the shared data
     - A global namespace is one way to do that, but even then we speak of several large "global" namespaces
     - A client-side mount table is another way to share
       - Shared mount table => "global" shared view
       - Personalized mount table => per-application view
       - Share the data that matters by mounting it
     [Diagram: a client-side mount table rooted at /, with mount points tmp, home, project, and data.]
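A client-side mount table of this shape can be expressed with ViewFs, the client file system that shipped alongside federation. The sketch below is an approximate `core-site.xml` fragment; the hostnames and paths are illustrative, and the exact property names should be checked against the Hadoop version in use:

```xml
<!-- Hypothetical client configuration: a ViewFs mount table stitching the
     mount points from the slide (/tmp, /home, /project, /data) out of two
     federated namespaces. Hostnames are made up for illustration. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>viewfs:///</value>
  </property>
  <!-- /home and /tmp served by one namenode... -->
  <property>
    <name>fs.viewfs.mounttable.default.link./home</name>
    <value>hdfs://nn1.example.com/home</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.default.link./tmp</name>
    <value>hdfs://nn1.example.com/tmp</value>
  </property>
  <!-- ...while /project and /data live in a different namespace. -->
  <property>
    <name>fs.viewfs.mounttable.default.link./project</name>
    <value>hdfs://nn2.example.com/project</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.default.link./data</name>
    <value>hdfs://nn2.example.com/data</value>
  </property>
</configuration>
```

Distributing a shared mount table gives every client the same "global" view, while an application can carry its own table for a personalized view, which is exactly the distinction the slide draws.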
  16. Impact on Existing Deployments
     - Very little impact on clusters with a single NN
       - The old configuration runs as is
       - Two commands change: NN format and the first upgrade take a new ClusterID option
     - During design and implementation, a lot of effort went into ensuring that single-NN deployments work as is
       - A lot of testing effort went into validating this
  17. Summary
     - Federated HDFS (JIRA HDFS-1052)
       - Existing single-namenode deployments run as is
       - Scale by adding independent namenodes
       - Preserves the robustness of the namenode; not much code change to the namenode
     - Generalizes the block storage layer
       - Can add other implementations of the namenode, or even other name services (HBase?)
       - Could move block management out of the namenode in the future
     - Other benefits
       - Improved isolation, and hence availability
       - Isolate different application categories, e.g. a separate namenode for HBase
  19. Questions?