HDFS Federation
Suresh Srinivas, Yahoo! Inc.
March 2011 HUG
Single Namenode Limitations

Namespace
- The NN process stores the entire metadata in memory, so the number of objects (files + blocks) is limited by the heap size.
- A 50 GB heap for 200 million objects supports about 4000 DNs and 12 PB of storage at a 40 MB average file size.

Storage growth
- DN storage is growing from 4 TB to 36 TB and cluster sizes to 8000 DNs, pushing total storage from 12 PB to more than 100 PB.

Performance
- File system operations are limited by the throughput of a single NN, which is a bottleneck for the next generation of MapReduce.

Isolation
- Experimental apps can affect production apps.

Cluster availability
- Failure of the single namenode brings down the entire cluster.
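As a rough sanity check on those capacity numbers (assuming roughly one block per file and the default 3x replication, neither of which is stated on the slide): 200M objects is about 100M files plus 100M blocks; 100M files at 40 MB each is about 4 PB of data, or roughly 12 PB of raw storage after 3x replication; spread over 4000 DNs that is about 3 TB per node, consistent with 4 TB datanodes.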
Scaling the Name Service
- Options compared: separating block management from the NN, multiple namespace volumes (all NS in memory, or partial NS/cache in memory), archives, and a distributed namenode.
- Block reports for billions of blocks require rethinking the block layer; good isolation properties.
- (Chart, not to scale: number of clients, roughly 1x to 100x, versus number of names, roughly 100M to 20B, for the options above.)
Why Vertical Scaling Is Not Sufficient
Why not use NNs with 512 GB of memory?
- Startup time is huge: currently 30 minutes to 2 hours for a 50 GB NN heap.
- Stop-the-world GC failures can bring down the cluster: all DNs could be declared dead.
- Debugging problems with a large JVM heap is harder.
- Optimizing NN memory usage is expensive: changes in trunk reduce the memory used, but cost development time and code complexity.
- Diminishing returns.
Why Federation?
Simplicity
- Simpler, robust design: multiple independent namenodes.
- Core development took 3.5 months; changes are mostly in the Datanode, configuration, and tools, with very little change in the Namenode.
- Simpler implementation than a distributed namenode: less scalability, but it serves the immediate needs.
- Federation is an optional feature; the existing single-NN configuration is supported as is.
HDFS Background
HDFS has 2 main layers:
Namespace management
- Manages the namespace consisting of directories, files, and blocks.
- Supports file system operations such as create/modify/list files and directories.
Block storage
- Block management: manages DN membership, supports add/delete/modify/get block location, and manages replication and replica placement.
- Physical storage: supports read/write access to blocks.
(Diagram: Namenode = namespace + block management; Datanodes = physical storage.)
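To make the two layers concrete, here is a minimal Java sketch (not from the talk; the path and file contents are invented): create() and listStatus() exercise the namespace layer, while getFileBlockLocations() surfaces what the block-management layer tracks about a file's blocks and the datanodes holding them.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TwoLayersExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);          // fs.defaultFS points at the NN

    // Namespace layer: create and list files and directories.
    Path file = new Path("/user/demo/data.txt");   // hypothetical path
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("hello, hdfs\n");
    }
    for (FileStatus s : fs.listStatus(file.getParent())) {
      System.out.println(s.getPath());
    }

    // Block storage layer: ask which datanodes hold the file's blocks.
    FileStatus status = fs.getFileStatus(file);
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(loc);                     // offset, length, DN hosts per block
    }
  }
}
```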
Federation
Multiple independent namenodes/namespaces in a cluster:
- NNs provide both namespace and block management.
- DNs are a common storage layer and store blocks for all the block pools.
- Non-HDFS namespaces can share the same storage.
(Diagram: NN-1 .. NN-k .. NN-n each own a namespace (NS 1 .. NS k, plus a foreign NS n) and a block pool; datanodes 1 .. m each hold blocks from pools 1 .. k .. n; the balancer works across the cluster.)

Federated HDFS Cluster
Current HDFS:
- 1 namespace, 1 set of blocks.
- Implemented as 1 namenode and a set of datanodes.
Federated HDFS:
- Multiple independent namespaces; each namespace uses 1 block pool.
- Multiple independent sets of blocks; a block pool is the set of blocks belonging to a single namespace.
- Implemented as multiple namenodes and a set of datanodes; each datanode stores blocks for all block pools.
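As a rough illustration of what multiple independent namenodes look like to a client, the sketch below configures two hypothetical nameservices. The hostnames are made up, and the configuration keys shifted across releases (later Hadoop uses dfs.nameservices; the 0.23-era federation key was dfs.federation.nameservices), so treat the property names as illustrative rather than definitive.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FederationConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Two independent nameservices (namespace volumes) in one cluster.
    conf.set("dfs.nameservices", "ns1,ns2");
    conf.set("dfs.namenode.rpc-address.ns1", "nn1.example.com:8020"); // hypothetical hosts
    conf.set("dfs.namenode.rpc-address.ns2", "nn2.example.com:8020");

    // Every datanode registers with both NNs and stores blocks for both
    // block pools; a client simply picks the namespace it wants to use.
    FileSystem fs1 = FileSystem.get(URI.create("hdfs://nn1.example.com:8020/"), conf);
    FileSystem fs2 = FileSystem.get(URI.create("hdfs://nn2.example.com:8020/"), conf);
    System.out.println(fs1.getUri() + " and " + fs2.getUri() + " are independent namespaces");
  }
}
```

Nothing changes for the client beyond which namenode URI it talks to; the datanodes serve both block pools from the same disks.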
Datanode Changes
A thread per NN:
- Registers with all the NNs.
- Sends a periodic heartbeat to all the NNs with a utilization summary.
- Sends a block report to each NN for its block pool.
- NNs can be added/removed/upgraded on the fly.
Block pools:
- Automatically created when the DN talks to an NN.
- A block is identified by ExtendedBlockID = BlockPoolID + BlockID.
- Block pool IDs are unique across clusters, which enables merging clusters.
- DN data structures (block map, storage, etc.) are "indexed" by BPID.
- Upgrade/rollback happens per block pool / per NN.
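The ExtendedBlockID and BPID-indexed data structures can be pictured with a small Java sketch. This is a simplified stand-in, not the real Hadoop classes (the actual org.apache.hadoop.hdfs.protocol.ExtendedBlock also carries a length and generation stamp), and the block-pool-ID format in the comment is only indicative.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: a block is named by its pool plus its ID.
final class ExtendedBlockId {
  final String blockPoolId;   // unique across clusters, e.g. "BP-<random>-<nn-ip>-<time>"
  final long blockId;

  ExtendedBlockId(String blockPoolId, long blockId) {
    this.blockPoolId = blockPoolId;
    this.blockId = blockId;
  }
}

// A datanode-side structure "indexed by BPID": one replica map per block pool,
// so pools can be added, removed, or upgraded independently.
final class BlockMaps {
  private final Map<String, Map<Long, String>> replicasByPool = new ConcurrentHashMap<>();

  void addReplica(ExtendedBlockId id, String storageDir) {
    replicasByPool
        .computeIfAbsent(id.blockPoolId, bp -> new ConcurrentHashMap<>())
        .put(id.blockId, storageDir);
  }

  Map<Long, String> blockReportFor(String blockPoolId) {
    // A block report goes only to the NN that owns this block pool.
    return replicasByPool.getOrDefault(blockPoolId, Collections.emptyMap());
  }
}
```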
Other Changes
- Decommissioning: tools to initiate and monitor decommissioning at all the NNs.
- Balancer: allows balancing at the datanode or block pool level.
- Datanode daemons: the disk scanner and directory scanner are adapted to federation.
- NN Web UI: additionally shows the NN's block pool storage utilization.
New Cluster Manager Web UI
Cluster summary:
- Shows overall cluster storage utilization.
- Lists the namenodes; for each NN: BPID, storage utilization, number of missing blocks, number of live and dead DNs, and a link to that NN's Web UI.
- Shows the decommissioning status of DNs.
Managing Namespaces: Client-Side Mount Table
- Federation has multiple namespaces; don't you need a single global namespace?
- The key is to share the data and the names used to access the shared data.
- A global namespace is one way to do that, but even there we talk of several large "global" namespaces.
- A client-side mount table is another way to share:
  - A shared mount table gives a "global" shared view.
  - A personalized mount table gives a per-application view.
- Share the data that matters by mounting it (e.g., /tmp, /home, /project, /data mounted under /); see the sketch after this list.
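A minimal sketch of such a client-side mount table using viewfs, mounting the /tmp, /home, /project, and /data volumes from the slide into a single client view; the mount-table name, hostnames, and port are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MountTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Client-side mount table via viewfs: the application sees one
    // "global-looking" namespace while paths are served by different NNs.
    conf.set("fs.defaultFS", "viewfs://clusterX/");
    conf.set("fs.viewfs.mounttable.clusterX.link./tmp",     "hdfs://nn1.example.com:8020/tmp");
    conf.set("fs.viewfs.mounttable.clusterX.link./home",    "hdfs://nn1.example.com:8020/home");
    conf.set("fs.viewfs.mounttable.clusterX.link./project", "hdfs://nn2.example.com:8020/project");
    conf.set("fs.viewfs.mounttable.clusterX.link./data",    "hdfs://nn2.example.com:8020/data");

    // Listing "/" on the view shows the mount points, not any single NN's root.
    FileSystem viewFs = FileSystem.get(conf);
    for (FileStatus s : viewFs.listStatus(new Path("/"))) {
      System.out.println(s.getPath());
    }
  }
}
```

Distributing one such table to everyone gives the shared "global" view; shipping a per-application table gives the personalized view described above.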
