
March 2011 HUG: HDFS Federation



Slides from the HDFS Federation talk at the March 2011 HUG by Suresh Srinivas.



  1. HDFS Federation<br />Suresh Srinivas<br />Yahoo! Inc<br />
  2. Single Namenode Limitations<br />Namespace<br />The NN process stores the entire metadata in memory<br />The number of objects (files + blocks) is limited by the heap size<br />A 50G heap for 200 million objects supports 4000 DNs and 12 PB of storage at a 40 MB average file size<br />Storage growth – DN storage 4TB to 36TB; cluster size to 8000 DNs => storage from 12PB to > 100PB<br />Performance<br />File system operations limited to a single NN's throughput<br />Bottleneck for the next generation of MapReduce<br />Isolation<br />Experimental apps can affect production apps<br />Cluster Availability<br />Failure of the single namenode brings down the entire cluster<br />
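A rough back-of-envelope check of the figures on this slide (the bytes-per-object number is derived here, not stated in the talk):

```python
# Sanity-check the namenode heap and storage-growth figures from the slide.
heap_bytes = 50 * 10**9            # 50 GB NN heap (slide figure)
objects = 200 * 10**6              # 200 million files + blocks (slide figure)

bytes_per_object = heap_bytes / objects
print(f"heap per object: {bytes_per_object:.0f} bytes")   # ~250 bytes/object

# Storage growth: 8000 DNs at 36 TB each (slide's projected cluster)
future_pb = 8000 * 36 / 1000       # TB -> PB
print(f"projected raw storage: {future_pb:.0f} PB")       # 288 PB, i.e. > 100 PB
```

At roughly 250 bytes of heap per namespace object, scaling metadata linearly with a ~25x storage growth is what pushes a single in-memory namenode past practical JVM heap sizes.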
  3. Scaling the Name Service<br />[Chart, not to scale: scaling options plotted against # names (100M, 200M, 1B, 2B, 10B, 20B) and # clients (1x, 4x, 20x, 50x, 100x) – All NS in memory; Partial NS (cache) in memory; Archives; Multiple Namespace volumes; Separate Block Management from NN; Distributed Namenode. Notes: block reports for billions of blocks require rethinking the block layer; namespace volumes have good isolation properties.]<br />
  4. Why Vertical Scaling Is Not Sufficient<br />Why not use NNs with 512GB memory?<br />Startup time is huge – currently 30 mins to 2 hrs for a 50GB NN heap<br />Stop-the-world GC pauses can bring down the cluster<br />All DNs could be declared dead<br />Debugging problems with a large JVM heap is harder<br />Optimizing NN memory usage is expensive<br />Changes in trunk reduce used memory, but at the cost of development time and code complexity<br />Diminishing returns<br />
  5. Why Federation?<br />Simplicity<br />Simpler, more robust design<br />Multiple independent namenodes<br />Core development in 3.5 months<br />Changes mostly in the Datanode, config and tools<br />Very little change in the Namenode<br />Simpler implementation than a Distributed Namenode<br />Less scalable – but serves the immediate needs<br />Federation is an optional feature<br />Existing single-NN configuration supported as is<br />
  6. HDFS Background<br />[Diagram: two layers – Namespace (NS, in the Namenode) on top of Block Storage (block management in the Namenode; physical storage spread across Datanodes)]<br />HDFS has 2 main layers:<br />Namespace management<br />Manages a namespace consisting of directories, files and blocks<br />Supports file system operations such as create/modify/list files & dirs<br />Block storage<br />Block management<br />Manages DN membership<br />Supports add/delete/modify/get block location<br />Manages replication and replica placement<br />Physical storage<br />Supports read/write access to blocks<br />
  7. Federation<br />[Diagram: Datanodes 1 to m form a common storage layer holding block pools 1 to n; Namenodes NN-1 to NN-n each serve one namespace (NS1 … NS k, plus a foreign NS); a Balancer works across the cluster]<br /><ul><li>Multiple independent namenodes/namespaces in a cluster</li><li>NNs provide both namespace and block management</li><li>DNs are a common storage layer</li><li>Each DN stores blocks for all the block pools</li><li>Non-HDFS namespaces can share the same storage</li></ul>
  8. Federated HDFS Cluster<br />Current:<br />1 namespace and 1 set of blocks<br />Implemented as 1 Namenode and a set of datanodes<br />Federated HDFS:<br />Multiple independent namespaces; a namespace uses 1 block pool<br />Multiple independent sets of blocks; a block pool is the set of blocks belonging to a single namespace<br />Implemented as multiple namenodes and a set of datanodes; each datanode stores blocks for all block pools<br />
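As a sketch of how a federated cluster is wired together, a minimal hdfs-site.xml might look like the following. This is an assumption-laden illustration: the property names follow the Hadoop 0.23-era federation configuration (later releases renamed `dfs.federation.nameservices` to `dfs.nameservices`), and the nameservice IDs `ns1`/`ns2` and hostnames are made up – check your release's documentation:

```xml
<configuration>
  <!-- Two independent namenodes, identified by nameservice IDs (illustrative) -->
  <property>
    <name>dfs.federation.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <!-- Each nameservice gets its own RPC address (hostnames are hypothetical) -->
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn-host1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn-host2:8020</value>
  </property>
</configuration>
```

Every datanode reads the full list of nameservices from this one file, which is how it knows to register with all the namenodes.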
  12. Datanode Changes<br />A thread per NN<br />Registers with all the NNs<br />Sends periodic heartbeats to all the NNs with a utilization summary<br />Sends a block report to each NN for its block pool<br />NNs can be added/removed/upgraded on the fly<br />Block Pools<br />Automatically created when a DN talks to an NN<br />Block identified by ExtendedBlockID = BlockPoolID + BlockID<br />Block Pool IDs are unique across clusters – enables merging clusters<br />DN data structures are “indexed” by BPID<br />BlockMap, storage etc. indexed by BPID<br />Upgrade/rollback happens per block pool/per NN<br />
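The ExtendedBlockID scheme above can be sketched as follows. This is an illustrative Python model, not the actual Java implementation; the class and field names here are made up:

```python
from collections import defaultdict

class ExtendedBlockID:
    """A block is globally identified by (block pool ID, block ID)."""
    def __init__(self, bpid, block_id):
        self.bpid = bpid          # unique per namespace, and across clusters
        self.block_id = block_id  # unique only within its own block pool

# DN-side structures are "indexed" by BPID first, so one datanode can
# serve several namenodes without block ID collisions between pools.
block_map = defaultdict(dict)     # bpid -> {block_id -> replica info}

def add_replica(eb, replica):
    block_map[eb.bpid][eb.block_id] = replica

# The same numeric block ID can coexist in two pools:
add_replica(ExtendedBlockID("BP-ns1", 1001), "replica on /data1")
add_replica(ExtendedBlockID("BP-ns2", 1001), "replica on /data2")
print(len(block_map))  # 2 independent block pools
```

Keying every datanode structure by BPID first is also what makes per-block-pool upgrade and rollback possible: each pool's state is a self-contained sub-map.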
  13. Other Changes<br />Decommissioning<br />Tools to initiate and monitor decommissioning at all the NNs<br />Balancer<br />Allows balancing at the datanode or block pool level<br />Datanode daemons<br />Disk scanner and directory scanner adapted to federation<br />NN Web UI<br />Additionally shows the NN’s block pool storage utilization<br />
  14. New Cluster Manager Web UI<br />Cluster summary<br />Shows overall cluster storage utilization<br />List of namenodes<br />For each NN – BPID, storage utilization, number of missing blocks, number of live & dead DNs<br />NN link to go to the NN Web UI<br />Decommissioning status of DNs<br />
  15. Managing Namespaces<br />[Diagram: client-side mount table rooted at / with mount points tmp, home, project, data]<br />Federation has multiple namespaces – don’t you need a single global namespace?<br />The key is to share the data and the names used to access the shared data<br />A global namespace is one way to do that – but even there we talk of several large “global” namespaces<br />A client-side mount table is another way to share<br />Shared mount table => “global” shared view<br />Personalized mount table => per-application view<br />Share the data that matters by mounting it<br />
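The client-side mount table described above is the idea behind what Hadoop ships as ViewFs. A core-site.xml sketch might look like this – the `clusterX` name and hosts are made up, and the mount points mirror the slide's diagram; verify the property names against your Hadoop version:

```xml
<configuration>
  <!-- Make the default filesystem a client-side mount table -->
  <property>
    <name>fs.defaultFS</name>
    <value>viewfs://clusterX</value>
  </property>
  <!-- Each link maps a client-visible path to a directory in some namespace.
       Different links can point at different namenodes. -->
  <property>
    <name>fs.viewfs.mounttable.clusterX.link./home</name>
    <value>hdfs://nn-host1:8020/home</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.clusterX.link./data</name>
    <value>hdfs://nn-host2:8020/data</value>
  </property>
</configuration>
```

Shipping this file cluster-wide gives the shared "global" view; an application can instead carry its own mount table for a personalized, per-application view.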
  16. Impact On Existing Deployments<br />Very little impact on clusters with a single NN<br />Old configuration runs as is<br />Two commands change<br />NN format and first upgrade have a new ClusterID option<br />During design and implementation, a lot of effort went into ensuring single-NN deployments work as is<br />A lot of testing effort to validate this<br />
  17. Summary<br />Federated HDFS (Jira HDFS-1052)<br /><ul><li>Existing single Namenode deployments run as is</li><li>Scale by adding independent Namenodes</li></ul>Preserves the robustness of the Namenode<br />Not much code change to the Namenode<br /><ul><li>Generalizes the Block Storage layer</li></ul>Can add other implementations of the Namenode<br />Even other name services (HBase?)<br />Could move block management out of the Namenode in the future<br /><ul><li>Other Benefits</li></ul>Improved isolation and hence availability<br />Isolate different application categories – e.g. a separate Namenode for HBase<br />
  19. Questions?<br />