YARN High Availability

Speakers: Karthik Kambatla – Cloudera Inc and Xuan Gong – Hortonworks Inc

Transcript

  • 1. YARN High Availability
      Karthik Kambatla – Cloudera Inc
      Xuan Gong – Hortonworks Inc
  • 2. Outline
      • Background
          – YARN architecture and need for HA
      • RM HA architecture
          – Persisting the state
          – Active/Standby pair and Fencing
          – Failover and redirection
      • Configuring HA
      • Demo
      6/30/2014 YARN High Availability, Hadoop Summit
  • 3. YARN Architecture [diagram: Resource Manager holding Cluster State and Applications State; Node Managers; App Master; Container; Clients]
  • 4. Fault-tolerance [diagram: Resource Manager holding Cluster State and Applications State; Node Managers; App Masters; Containers; Clients]
  • 5. Naïve RM Restart [diagram: Resource Manager with Cluster State and Applications State; Node Managers; App Master; Clients]
  • 6. ResourceManager is a YARN cluster’s single point of failure. Need stateful restart and multiple RMs.
  • 7. Highly Available Resource Manager a.k.a. HARMful YARN
      • Currently shipped
          – Beta in Apache Hadoop 2.3.0
          – Stable in Apache Hadoop 2.4.0
          – More stable in Apache Hadoop 2.4.1
  • 8. Stateful RM Restart (Phase 1) [diagram: Resource Manager with Cluster State and Applications State, backed by an RM Store; Node Managers; App Masters; Containers; Clients]
  • 9. RM Store Implementations
      • Memory store
          – testing purposes
      • Filesystem based store
          – Any file system: local, HDFS or any other
      • ZooKeeper based store (ZKRMStateStore)
          – Recommended (for fencing)
          – Loading 10,000 applications takes about 8.5 secs.
  • 10. Implications to Running applications
      • In-flight work is lost.
      • AMs are restarted.
      • AMs could checkpoint completed work.
          – MapReduce AM does.
          – Consider a job with 100 map tasks
              • If RM goes down after 90 map tasks finish.
              • After restart, only the remaining 10 are run.
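The 100-map-task example on slide 10 can be sketched in a few lines. This is only an illustration of the bookkeeping, assuming the restarted AM can read back the set of tasks it had checkpointed as complete; the task names and the `remaining_tasks` helper are hypothetical and are not part of the MapReduce AM.

```python
def remaining_tasks(all_tasks, checkpointed_done):
    """Return the tasks a restarted AM still has to run, in original order."""
    done = set(checkpointed_done)
    return [task for task in all_tasks if task not in done]


# Slide example: 100 map tasks, 90 of them finished before the RM went down.
map_tasks = [f"map_{i:03d}" for i in range(100)]   # hypothetical task names
completed_before_failure = map_tasks[:90]          # what the AM checkpointed

to_rerun = remaining_tasks(map_tasks, completed_before_failure)
print(len(to_rerun))  # 10
```

Without the checkpoint, `completed_before_failure` would be empty after the restart and all 100 tasks would run again.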
  • 11. Stateful RM Restart (Phase 2)
      • Under development (YARN-556)
          – No loss of in-flight work
      • Related work
          – Work-preserving NodeManager restart (YARN-1336)
          – Work-preserving ApplicationMaster restart (YARN-1489)
  • 12. Multiple RMs
      • Active / Standby architecture
          – Potentially multiple standbys
          – Warm standby
              • Running
              • Loads state and starts RPC servers on becoming Active
          – Manual / automatic failover
          – Clients and Web UI failover automatically
  • 13. Active / Standby [diagram: Active Resource Manager and Standby Resource Manager sharing an RM Store; Node Managers; App Master; Clients]
  • 14. Manual Failover through CLI [diagram: Active Resource Manager and Standby Resource Manager sharing an RM Store; Node Managers; App Master; Clients]
  • 15. Client Failover (ConfiguredFailoverProxyProvider) [diagram: Active Resource Manager and Standby Resource Manager sharing an RM Store; Node Managers; App Masters; Clients]
  • 16. Automatic Failover [diagram: Active Resource Manager and Standby Resource Manager, each with an Elector connected to ZK, sharing an RM Store; Node Managers; App Master; Clients]
  • 17. Automatic Failover [diagram: Active Resource Manager and Standby Resource Manager, each with an Elector connected to ZK, sharing an RM Store; Node Managers; App Master; Clients]
  • 18. Automatic Failover
      • ZooKeeper based
          – Uses ActiveStandbyElector for Active election
              • No need for a FailoverController
          – Can’t monitor RM process health and recover
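The election idea on slide 18 can be illustrated with a toy, in-memory model. This is a sketch under stated assumptions, not the ActiveStandbyElector code: it assumes election works by competing for a single lock that disappears when the holder's session is lost, and `ToyEphemeralLock` and `ToyElector` are hypothetical classes that do not talk to ZooKeeper.

```python
class ToyEphemeralLock:
    """In-memory stand-in for a lock node that vanishes with its owner's session."""

    def __init__(self):
        self.holder = None

    def try_create(self, session_id):
        # Succeeds if the lock is free or already held by this session.
        if self.holder in (None, session_id):
            self.holder = session_id
            return True
        return False

    def expire_session(self, session_id):
        # Models a lost session: the lock held by that session is released.
        if self.holder == session_id:
            self.holder = None


class ToyElector:
    """One elector per RM; whoever holds the lock is Active."""

    def __init__(self, rm_id, lock):
        self.rm_id = rm_id
        self.lock = lock
        self.state = "standby"

    def join_election(self):
        self.state = "active" if self.lock.try_create(self.rm_id) else "standby"
        return self.state

    def on_session_lost(self):
        self.lock.expire_session(self.rm_id)
        self.state = "standby"


lock = ToyEphemeralLock()
rm1, rm2 = ToyElector("rm1", lock), ToyElector("rm2", lock)
rm1.join_election()
rm2.join_election()
states_before = (rm1.state, rm2.state)

rm1.on_session_lost()      # e.g. the Active RM loses its connection
rm2.join_election()        # the standby retries and wins the lock
states_after = (rm1.state, rm2.state)
print(states_before, states_after)
```

A real elector is notified when the lock disappears rather than retrying by hand; the explicit second `join_election()` call stands in for that notification.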
  • 19. Network Hiccup [diagram: Active Resource Manager and Standby Resource Manager, each with an Elector connected to ZK, sharing an RM Store; Node Managers; App Master; Clients]
  • 20. Multiple Actives? [diagram: two Active Resource Managers, each with an Elector connected to ZK, sharing an RM Store; Node Managers; App Master; Clients]
  • 21. Fencing
      • The state store gets corrupted when multiple RMs assume the Active role.
      • Exclusive access to a single RM.
          – ZKRMStateStore takes care of this.
          – Shared “admin” access.
          – Exclusive “create-delete” access on transition to Active
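The fencing rule on slide 21 can be sketched as a toy store in which every RM may reassign write permission (the shared "admin" access), but only the most recent claimant may write (the exclusive "create-delete" access). This is a simplified model under those assumptions; `ToyStateStore` and `FencedError` are hypothetical names and do not reflect how ZKRMStateStore is implemented.

```python
class FencedError(Exception):
    """Raised when an RM writes after losing exclusive access."""


class ToyStateStore:
    def __init__(self):
        self.writer = None   # RM currently holding exclusive write access
        self.apps = {}

    def claim_exclusive_access(self, rm_id):
        # Any RM may do this (shared "admin" access); it revokes the
        # previous writer's permission as a side effect.
        self.writer = rm_id

    def store_app(self, rm_id, app_id, state):
        if rm_id != self.writer:
            raise FencedError(f"{rm_id} is fenced; current writer is {self.writer}")
        self.apps[app_id] = state


store = ToyStateStore()
store.claim_exclusive_access("rm1")
store.store_app("rm1", "app_1", "RUNNING")

# rm2 transitions to Active while rm1 still believes it is Active.
store.claim_exclusive_access("rm2")
try:
    store.store_app("rm1", "app_2", "SUBMITTED")
    rm1_fenced = False
except FencedError:
    rm1_fenced = True

print(rm1_fenced, sorted(store.apps))
```

The stale Active is stopped at its first write, so the store never sees interleaved updates from two RMs.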
  • 22. Network Hiccup [diagram: Active Resource Manager and Standby Resource Manager, each with an Elector connected to ZK, sharing an RM Store; Node Managers; App Master; Clients]
  • 23. Active / Standby [diagram: Active Resource Manager and Standby Resource Manager, each with an Elector connected to ZK, sharing an RM Store; Node Managers; App Master; Clients]
  • 24. In-flight RPCs
      • In-flight RPCs: Retry or not?
          – E.g. Submit application – we clearly don’t want two applications submitted.
      • Depends on whether failover happens before, during, or after the RM acts on the call.
      • Solution
          – Annotate APIs as Idempotent or AtMostOnce
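The retry decision on slide 24 can be sketched as a marker on each call plus a client-side wrapper that consults it. This is a toy model, not the Hadoop retry framework: it models only an idempotent marker (AtMostOnce is not modeled), and `idempotent`, `call_with_failover`, `FailoverError`, and both toy RPCs are hypothetical names introduced for illustration.

```python
class FailoverError(Exception):
    """Toy stand-in for a connection lost while the RM fails over."""


def idempotent(fn):
    """Mark a call as safe to repeat after a failover."""
    fn.retry_safe = True
    return fn


def call_with_failover(fn, *args, max_attempts=2):
    # Unmarked calls get exactly one attempt: the client cannot tell whether
    # the RM acted on the call before the connection dropped.
    attempts = max_attempts if getattr(fn, "retry_safe", False) else 1
    for attempt in range(1, attempts + 1):
        try:
            return fn(*args)
        except FailoverError:
            if attempt == attempts:
                raise


calls = {"get_report": 0, "submit": 0}


@idempotent
def get_application_report(app_id):
    calls["get_report"] += 1
    if calls["get_report"] == 1:
        raise FailoverError("RM failed over")
    return f"{app_id}: RUNNING"


def submit_application(app_id):   # deliberately left unmarked in this toy
    calls["submit"] += 1
    raise FailoverError("RM failed over; outcome unknown")


report = call_with_failover(get_application_report, "app_1")
try:
    call_with_failover(submit_application, "app_2")
except FailoverError:
    pass

print(report, calls)
```

The read-only call is retried and succeeds on the second attempt, while the unmarked submit is attempted once and its failure is surfaced to the caller.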
  • 25. Web UI
      • Standby RM has no/stale information.
      • Users don’t know which RM is Active.
      • Redirect Web UI and REST calls to Active RM.
          – Except a few pages that give information about the RM.
  • 26. Admin Refresh
      • Admin refresh ($ yarn rmadmin -refresh):
          – Refreshes that particular RM – Active/Standby
          – Uses local configuration file
      • FileSystemBasedConfigurationProvider
          – Upload the configuration files to (potentially shared) filesystem like HDFS.
  • 27. Setting up HA
      Config name                              Value
      yarn.resourcemanager.ha.enabled          true
      yarn.resourcemanager.ha.rm-ids           rm1,rm2
      yarn.resourcemanager.hostname.rm1        <host1>
      yarn.resourcemanager.hostname.rm2        <host2>
      yarn.resourcemanager.recovery.enabled    true
      yarn.resourcemanager.store.class         ZKRMStateStore [1]
      yarn.resourcemanager.zk-address          <zk-quorum>
      yarn.resourcemanager.cluster-id          <cluster-id>
      [1] org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
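The table on slide 27 maps directly onto `<property>` entries in a Hadoop configuration file. The sketch below renders those eight settings as XML, assuming they are destined for yarn-site.xml; the `render_yarn_site` helper is hypothetical, and `<host1>`, `<host2>`, `<zk-quorum>`, and `<cluster-id>` are the slide's own placeholders, which must be replaced with real values before use.

```python
import xml.etree.ElementTree as ET

# Settings copied from slide 27; angle-bracket values are placeholders.
ha_settings = {
    "yarn.resourcemanager.ha.enabled": "true",
    "yarn.resourcemanager.ha.rm-ids": "rm1,rm2",
    "yarn.resourcemanager.hostname.rm1": "<host1>",
    "yarn.resourcemanager.hostname.rm2": "<host2>",
    "yarn.resourcemanager.recovery.enabled": "true",
    "yarn.resourcemanager.store.class":
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
    "yarn.resourcemanager.zk-address": "<zk-quorum>",
    "yarn.resourcemanager.cluster-id": "<cluster-id>",
}


def render_yarn_site(settings):
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # Unreplaced placeholders are XML-escaped (&lt;host1&gt;) in the output.
    return ET.tostring(root, encoding="unicode")


xml_text = render_yarn_site(ha_settings)
print(xml_text)
```

Both RMs would normally share the same file, since the `rm-ids` list and the per-id hostnames describe the whole pair rather than one node.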
  • 28. Demo!
  • 29. Questions?