Apache Hadoop YARN, NameNode HA, HDFS Federation

An introduction to the new features available in recent versions of Apache Hadoop: YARN, NameNode HA, and HDFS Federation.

  1. Introduction To YARN, NameNode HA and HDFS Federation. Adam Kawa, Spotify
  2. About Me: Data Engineer at Spotify, Sweden; Hadoop Instructor at Compendium (a Cloudera Training Partner); 2.5+ years of experience with Hadoop.
  3. “Classical” Apache Hadoop Cluster: can consist of one, five, a hundred or 4,000 nodes.
  4. Hadoop Golden Era: one of the most exciting open-source projects on the planet; becomes the standard for large-scale processing; successfully deployed by hundreds/thousands of companies; like a salutary virus... but it has some little drawbacks.
  5. HDFS Main Limitations: a single NameNode that keeps all metadata in RAM, performs all metadata operations, and becomes a single point of failure (SPOF).
  6. HDFS Main Improvement: introduce multiple NameNodes: HDFS Federation and HDFS High Availability (HA).
  7. Motivation for HDFS Federation: limited namespace scalability, decreased metadata operations performance, and lack of isolation.
  8. HDFS – master and slaves: the master (NameNode) manages all metadata information; the slaves (DataNodes) store blocks of data and serve them to the client.
  9. HDFS NameNode (NN): supports all the namespace-related file system operations. Keeps information in RAM for fast lookup: the filesystem tree, metadata for all files/directories (e.g. ownerships, permissions), and the names and locations of blocks. Metadata (not all of it) is additionally stored on disk(s): a filesystem snapshot (fsimage) plus edit log (edits) files.
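     As an illustration of where that on-disk copy lives, the fsimage and edits files sit in the directories listed under dfs.namenode.name.dir. A minimal hdfs-site.xml sketch; the paths are made up for the example:

        <!-- hdfs-site.xml: local directories holding the fsimage and edits files -->
        <!-- (two directories for redundancy; the paths below are illustrative)   -->
        <property>
          <name>dfs.namenode.name.dir</name>
          <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
        </property>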
  10. HDFS DataNode (DN): stores and retrieves blocks. A block is stored as a regular file on the local disk, e.g. blk_-992391354910561645; a block itself does not know which file it belongs to. Sends the block report to the NN periodically. Sends heartbeat signals to the NN to say that it is alive.
  11. NameNode scalability issues: the number of files is limited by the amount of RAM on one machine, and more RAM also means heavier GC. Stats based on Y! clusters: an average file ≈ 1.5 blocks (block size = 128 MB); an average file ≈ 600 bytes of metadata (1 file object and 2 block objects); 100M files ≈ 60 GB of metadata; 1 GB of metadata ≈ 1 PB of physical storage (but usually less).
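     A rough back-of-the-envelope check of the numbers above, assuming the default 3x block replication (the replication factor is an assumption, not stated on the slide):

        100M files x ~600 B of metadata             ≈ 60 GB of NameNode heap
        1 GB of metadata / ~600 B per file          ≈ ~1.7M files
        1.7M files x 1.5 blocks x 128 MB per block  ≈ ~330 TB of logical data
        330 TB x 3 replicas                         ≈ ~1 PB of physical storage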
  12. NameNode performance: read/write operations throughput is limited by one machine (~120K read ops/sec, ~6K write ops/sec). MapReduce tasks are also HDFS clients. The internal load increases as the cluster grows: block reports and heartbeats from DataNodes, and bigger snapshots transferred from/to the Secondary NameNode. NNBench is a benchmark that stresses the NameNode.
  13. Lack of isolation: experimental applications (can) overload the NN and slow down production jobs; production applications with different requirements (can) benefit from their “own” NameNode, e.g. HBase requires extremely fast responses.
  14. HDFS Federation: partition the filesystem namespace over multiple separate NameNodes.
  15. Multiple independent namespaces: NameNodes do not talk to each other; a DataNode can store blocks managed by any NameNode; each NameNode manages only a slice of the namespace.
  16. The client-view of the cluster: there are multiple namespaces (and corresponding NNs), and a client can use any of them to compose its own view of HDFS. The idea is much like the Linux /etc/fstab file: a mapping between paths and NameNodes.
  17. Client-side mount tables
  18. (More funky) Client-side mount tables
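     As a concrete illustration of slides 16-18, the client-side mount table is expressed with ViewFS properties in the client's core-site.xml. A minimal sketch, assuming two federated NameNodes; the cluster name, hostnames and paths are made up for the example:

        <!-- core-site.xml on the client: compose one view of HDFS from two namespaces -->
        <property>
          <name>fs.defaultFS</name>
          <value>viewfs://myCluster/</value>
        </property>
        <property>
          <!-- /user is served by the first NameNode -->
          <name>fs.viewfs.mounttable.myCluster.link./user</name>
          <value>hdfs://nn1.example.com:8020/user</value>
        </property>
        <property>
          <!-- /logs is served by the second NameNode -->
          <name>fs.viewfs.mounttable.myCluster.link./logs</name>
          <value>hdfs://nn2.example.com:8020/logs</value>
        </property>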
  19. Deploying HDFS Federation: useful for small (+isolation) and large (+scalability) clusters, however not so many clusters use it. NameNodes can be added/removed at any time; the cluster does not have to be restarted. When adding Federation to an existing cluster, do not format the existing NN; the newly added NN will be “empty”.
  20. Motivation for HDFS High Availability: a single NameNode is a single point of failure (SPOF), and the Secondary NN and HDFS Federation do not change that. Recovery from a failed NN may take even tens of minutes. Two types of downtime: unexpected failures (infrequent) and planned maintenance (common).
  21. HDFS High Availability: introduces a pair of redundant NameNodes, one Active and one Standby. If the Active NameNode crashes or is intentionally stopped, the Standby takes over quickly.
  22. NameNode responsibilities: the Active NameNode is responsible for all client operations. The Standby NameNode is watchful: it maintains enough state to provide a fast failover, and it also does the checkpointing (so the Secondary NN disappears).
  23. Synchronizing the state of metadata: the Standby must keep its state as up-to-date as possible. Edit logs - two alternative ways of sharing them: shared storage using NFS, or Quorum-based storage (recommended). Block locations: DataNodes send block reports to both NameNodes.
  24. Quorum-based storage
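     A minimal sketch of the Quorum-based storage (QJM) setup in hdfs-site.xml, assuming three JournalNode hosts and a nameservice called mycluster; hostnames and paths are illustrative:

        <!-- hdfs-site.xml: both NameNodes write their edit log to a quorum of JournalNodes -->
        <property>
          <name>dfs.namenode.shared.edits.dir</name>
          <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
        </property>
        <property>
          <!-- local directory where each JournalNode keeps its copy of the edits -->
          <name>dfs.journalnode.edits.dir</name>
          <value>/var/hadoop/journal</value>
        </property>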
  25. Supported failover modes: manual failover, initiated by the administrator with a dedicated command (0-3 seconds); automatic failover, triggered when a fault in the active NN is detected (more complicated and longer, 30-40 seconds).
  26. Manual failover: initiate a failover from the first NN to the second one: hdfs haadmin -failover <to-be-standby-NN> <to-be-active-NN>
  27. Automatic failover: requires ZooKeeper and an additional process, the “ZKFailoverController” (ZKFC).
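     A hedged sketch of the two settings that enable automatic failover, assuming a three-node ZooKeeper ensemble (the hostnames are illustrative):

        <!-- hdfs-site.xml: let the ZKFC trigger failover instead of the administrator -->
        <property>
          <name>dfs.ha.automatic-failover.enabled</name>
          <value>true</value>
        </property>
        <!-- core-site.xml: the ZooKeeper ensemble used for active-NameNode election -->
        <property>
          <name>ha.zookeeper.quorum</name>
          <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
        </property>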
  28. The split-brain scenario: a potential scenario when two NameNodes both think they are active and both make conflicting changes to the namespace. Example: the ZKFC crashes, but its NN is still running.
  29. Fencing: ensure that the previous Active NameNode is no longer able to make any changes to the system metadata.
  30. Fencing with Quorum-based storage: only a single NN can be a writer to the JournalNodes at a time. Whenever a NN becomes active, it generates an “epoch number”; only the NN with the higher “epoch number” can be a writer. This prevents corruption of the file system metadata.
  31. “Read” requests fencing: the previous active NN will be fenced when it tries to write to the JNs. Until that happens, it may still serve HDFS read requests, so try to fence this NN before that happens; this prevents clients from reading outdated metadata. It might be done automatically by the ZKFC or with hdfs haadmin.
  32. HDFS Highly-Available Federated Cluster: any combination is possible. Federation without HA, HA without Federation, HA with Federation, e.g. NameNode HA for HBase, but not for the others.
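     A sketch of how HA and Federation coexist in hdfs-site.xml: two federated nameservices, of which only one (say, the one serving HBase) runs an Active/Standby pair. All names below are illustrative, and the non-HA nameservice would still need its own rpc-address setting:

        <!-- hdfs-site.xml: two federated namespaces, only ns-hbase is HA -->
        <property>
          <name>dfs.nameservices</name>
          <value>ns-hbase,ns-user</value>
        </property>
        <property>
          <name>dfs.ha.namenodes.ns-hbase</name>
          <value>nn1,nn2</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address.ns-hbase.nn1</name>
          <value>nn1.example.com:8020</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address.ns-hbase.nn2</name>
          <value>nn2.example.com:8020</value>
        </property>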
  33. “Classical” MapReduce Limitations: limited scalability, poor resource utilization, lack of support for alternative frameworks, and lack of wire-compatible protocols.
  34. Limited Scalability: scales to only ~4,000-node clusters and ~40,000 concurrent tasks, so smaller and less-powerful clusters must be created. (Image source: http://www.flickr.com/photos/rsms/660332848/sizes/l/in/set-72157607427138760/)
  35. Poor Cluster Resources Utilization: hard-coded values; TaskTrackers must be restarted after a change.
  36. Poor Cluster Resources Utilization: map tasks are waiting for slots which are NOT currently used by reduce tasks. Hard-coded values; TaskTrackers must be restarted after a change.
  37. Resource Needs That Vary Over Time: how do you hard-code the number of map and reduce slots efficiently?
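     For context, this is how MRv1 slots are hard-coded. A sketch of the relevant mapred-site.xml entries on a TaskTracker; the values are illustrative, and changing them requires restarting the TaskTracker:

        <!-- mapred-site.xml (MRv1): fixed split into map and reduce slots -->
        <property>
          <name>mapred.tasktracker.map.tasks.maximum</name>
          <value>8</value>
        </property>
        <property>
          <name>mapred.tasktracker.reduce.tasks.maximum</name>
          <value>4</value>
        </property>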
  38. (Secondary) Benefits Of Better Utilization: the same calculation on a more efficient Hadoop cluster means less data center space, less silicon waste, less power utilization, and less carbon emissions = less disastrous environmental consequences.
  39. “Classical” MapReduce Daemons
  40. JobTracker Responsibilities: manages the computational resources (map and reduce slots), schedules all user jobs, schedules all tasks that belong to a job, monitors task executions, restarts failed tasks and speculatively runs slow ones, calculates job counter totals.
  41. Simple Observation: the JobTracker does a lot...
  42. JobTracker Redesign Idea: reduce the responsibilities of the JobTracker. Separate cluster resource management from job coordination. Use slaves (many of them!) to manage the jobs' life-cycle. Scale to (at least) 10K nodes, 10K jobs, 100K tasks.
  43. Disappearance Of The Centralized JobTracker
  44. YARN – Yet Another Resource Negotiator
  45. Resource Manager and Application Master
  46. YARN Applications: MRv2 (simply MRv1 rewritten to run on top of YARN, so there is no need to rewrite existing MapReduce jobs); Distributed Shell; Spark; Apache S4 (real-time processing); Hamster (MPI on Hadoop); Apache Giraph (graph processing); Apache HAMA (matrix, graph and network algorithms)... and http://wiki.apache.org/hadoop/PoweredByYarn
  47. NodeManager: more flexible and efficient than the TaskTracker. Executes any computation that makes sense to the ApplicationMaster, not only map or reduce tasks. Containers with variable resource sizes (e.g. RAM, CPU, network, disk); no hard-coded split into map and reduce slots.
  48. NodeManager Containers: the NodeManager creates a container for each task. A container has a variable resource size, e.g. 2 GB RAM, 1 CPU, 1 disk. The number of created containers is limited by the total sum of resources available on the NodeManager.
  49. Application Submission In YARN
  50. Resource Manager Components
  51. Uberization: running small jobs in the same JVM as the ApplicationMaster. mapreduce.job.ubertask.enable = true (false is the default); mapreduce.job.ubertask.maxmaps = 9*; mapreduce.job.ubertask.maxreduces = 1*; mapreduce.job.ubertask.maxbytes = dfs.block.size*. (* Users may override these values, but only downward.)
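     The same uberization settings as they would appear in a job configuration or mapred-site.xml; a minimal sketch that enables ubertasks and keeps the defaults quoted on the slide:

        <!-- run sufficiently small jobs "uber", i.e. inside the ApplicationMaster's JVM -->
        <property>
          <name>mapreduce.job.ubertask.enable</name>
          <value>true</value>
        </property>
        <property>
          <!-- a job qualifies only if it has at most this many map tasks ... -->
          <name>mapreduce.job.ubertask.maxmaps</name>
          <value>9</value>
        </property>
        <property>
          <!-- ... and at most this many reduce tasks -->
          <name>mapreduce.job.ubertask.maxreduces</name>
          <value>1</value>
        </property>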
  52. Interesting Differences: task progress, status and counters are passed directly to the Application Master. Shuffle handlers (long-running auxiliary services in the NodeManagers) serve map outputs to reduce tasks. YARN does not support JVM reuse for map/reduce tasks.
  53. Fault-Tolerance: failure of running tasks or NodeManagers is handled similarly to MRv1. Applications can be retried several times: yarn.resourcemanager.am.max-retries = 1; yarn.app.mapreduce.am.job.recovery.enable = false. The Resource Manager can start the Application Master in a new container.
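     A hedged sketch of how those two settings might be changed so that a failed ApplicationMaster is retried and the job's completed tasks are recovered; the retry count of 3 is only an example, and the property names are the ones quoted on the slide (Hadoop 2.0.x era):

        <!-- yarn-site.xml: allow the ResourceManager to start the AM up to 3 times -->
        <property>
          <name>yarn.resourcemanager.am.max-retries</name>
          <value>3</value>
        </property>
        <!-- mapred-site.xml: recover completed tasks after an AM restart -->
        <property>
          <name>yarn.app.mapreduce.am.job.recovery.enable</name>
          <value>true</value>
        </property>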
  54. Resource Manager Failure: two interesting features. The RM restarts without the need to re-run currently running/submitted applications [YARN-128]. ZK-based High Availability (HA) for the RM [YARN-149] (still in progress).
  55. Wire-Compatible Protocol: the client and the cluster may use different versions. More manageable upgrades: rolling upgrades without disrupting the service; Active and Standby NameNodes upgraded independently. Protocol Buffers chosen for serialization (instead of Writables).
  56. YARN Maturity: still considered both production-ready and not production-ready. The code is already promoted to the trunk. Yahoo! runs it on 2,000- and 6,000-node clusters, using Apache Hadoop 2.0.2 (Alpha). There is no production support for YARN yet. Anyway, it will replace MRv1 sooner rather than later.
  57. How To Quickly Set Up The YARN Cluster
  58. Basic YARN Configuration Parameters: yarn.nodemanager.resource.memory-mb = 8192; yarn.nodemanager.resource.cpu-cores = 8; yarn.scheduler.minimum-allocation-mb = 1024; yarn.scheduler.maximum-allocation-mb = 8192; yarn.scheduler.minimum-allocation-vcores = 1; yarn.scheduler.maximum-allocation-vcores = 32.
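     The same parameters as they would appear in yarn-site.xml; a short sketch showing two of the values from the slide:

        <!-- yarn-site.xml: total memory a single NodeManager offers to containers -->
        <property>
          <name>yarn.nodemanager.resource.memory-mb</name>
          <value>8192</value>
        </property>
        <property>
          <!-- smallest memory allocation the scheduler will hand out -->
          <name>yarn.scheduler.minimum-allocation-mb</name>
          <value>1024</value>
        </property>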
  59. Basic MR Apps Configuration Parameters: mapreduce.map.cpu.vcores = 1; mapreduce.reduce.cpu.vcores = 1; mapreduce.map.memory.mb = 1536; mapreduce.map.java.opts = -Xmx1024M; mapreduce.reduce.memory.mb = 3072; mapreduce.reduce.java.opts = -Xmx2560M; mapreduce.task.io.sort.mb = 512.
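     A sketch of how two of these settings relate in mapred-site.xml: the container size (mapreduce.map.memory.mb) should stay larger than the JVM heap (-Xmx in mapreduce.map.java.opts), which is the pattern the slide's values follow:

        <!-- mapred-site.xml: map-task container size and the JVM heap inside it -->
        <property>
          <name>mapreduce.map.memory.mb</name>
          <value>1536</value>
        </property>
        <property>
          <name>mapreduce.map.java.opts</name>
          <value>-Xmx1024M</value>
        </property>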
  60. Apache Whirr Configuration File
  61. Running And Destroying Cluster: $ whirr launch-cluster --config hadoop-yarn.properties; $ whirr destroy-cluster --config hadoop-yarn.properties
  63. It Might Be A Demo... but what about running production applications on a real cluster consisting of hundreds of nodes? Join Spotify! jobs@spotify.com
  64. Thank You!
