YARN High Availability
Speakers: Karthik Kambatla – Cloudera Inc and Xuan Gong – Hortonworks Inc

Published in: Software, Technology, Business
Transcript

  • 1. YARN High Availability. Karthik Kambatla (Cloudera Inc), Xuan Gong (Hortonworks Inc)
  • 2. Outline
      • Background
          – YARN architecture and need for HA
      • RM HA architecture
          – Persisting the state
          – Active/Standby pair and fencing
          – Failover and redirection
      • Configuring HA
      • Demo
    6/30/2014 YARN High Availability, Hadoop Summit
  • 3. YARN Architecture [diagram: 2 × Client, Resource Manager (Cluster State, Applications State), 3 × Node Manager, App Master, Container]
  • 4. Fault-tolerance [diagram: 2 × Client, Resource Manager (Cluster State, Applications State), 3 × Node Manager, 2 × App Master, 2 × Container]
  • 5. Naïve RM Restart [diagram: 2 × Client, Resource Manager (Cluster State, Applications State), 2 × Node Manager, App Master]
  • 6. ResourceManager is a YARN cluster’s single point of failure. Need stateful restart and multiple RMs.
  • 7. Highly Available Resource Manager, a.k.a. HARMful YARN
      • Currently shipped
          – Beta in Apache Hadoop 2.3.0
          – Stable in Apache Hadoop 2.4.0
          – More stable in Apache Hadoop 2.4.1
  • 8. Stateful RM Restart (Phase 1) [diagram: 2 × Client, Resource Manager (Cluster State, Applications State), RM Store, 2 × Node Manager, 2 × App Master, 2 × Container]
  • 9. RM Store Implementations
      • Memory store
          – For testing purposes
      • Filesystem-based store
          – Any file system: local, HDFS or any other
      • ZooKeeper-based store (ZKRMStateStore)
          – Recommended (for fencing)
          – Loading 10,000 applications takes about 8.5 seconds
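The store choices on slide 9 can be pictured as interchangeable back ends behind one small interface. The sketch below is a toy model only: `MemoryStore`, `FileSystemStore`, `store_app`, `load_all` and the JSON file layout are hypothetical names chosen for illustration and are not the YARN state-store API. It shows why a memory store suits only testing, while a store on a filesystem lets a restarted RM reload application state.

```python
import json
import tempfile
from pathlib import Path


class MemoryStore:
    """Toy store: keeps application state in a dict, lost when the process exits."""

    def __init__(self):
        self._apps = {}

    def store_app(self, app_id, state):
        self._apps[app_id] = dict(state)

    def load_all(self):
        return {app_id: dict(state) for app_id, state in self._apps.items()}


class FileSystemStore:
    """Toy store: writes one JSON file per application under a root directory."""

    def __init__(self, root):
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)

    def store_app(self, app_id, state):
        (self._root / f"{app_id}.json").write_text(json.dumps(state))

    def load_all(self):
        return {p.stem: json.loads(p.read_text()) for p in self._root.glob("*.json")}


# An RM persists an application, then "restarts" as a fresh object.
root = tempfile.mkdtemp()
before_restart = FileSystemStore(root)
before_restart.store_app("application_0001", {"user": "alice", "queue": "default"})

after_restart = FileSystemStore(root)      # new instance, same directory
recovered = after_restart.load_all()       # state survives the restart

fresh_memory_store = MemoryStore()         # a restarted memory store starts empty
lost = fresh_memory_store.load_all()
```

The same two-method shape would also fit a ZooKeeper-backed store; that variant is omitted here because it needs a running ensemble.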
  • 10. Implications for running applications
      • In-flight work is lost.
      • AMs are restarted.
      • AMs could checkpoint completed work.
          – The MapReduce AM does.
          – Consider a job with 100 map tasks:
              • The RM goes down after 90 map tasks finish.
              • After restart, only the remaining 10 are run.
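The 100-task example on slide 10 amounts to a set difference between all tasks and the checkpointed ones. A minimal sketch, assuming a hypothetical helper `tasks_to_run_after_restart` and made-up task names (neither comes from the MapReduce AM):

```python
def tasks_to_run_after_restart(all_tasks, checkpointed_done):
    """Return the tasks a restarted AM still has to run, in original order."""
    return [task for task in all_tasks if task not in checkpointed_done]


map_tasks = [f"map_{i:03d}" for i in range(100)]

# 90 map tasks finished and were checkpointed before the RM went down.
completed_before_failure = set(map_tasks[:90])

remaining = tasks_to_run_after_restart(map_tasks, completed_before_failure)
print(len(remaining))  # 10
```

Without the checkpoint, `completed_before_failure` would be empty and all 100 tasks would run again.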
  • 11. Stateful RM Restart (Phase 2)
      • Under development (YARN-556)
          – No loss of in-flight work
      • Related work
          – Work-preserving NodeManager restart (YARN-1336)
          – Work-preserving ApplicationMaster restart (YARN-1489)
  • 12. Multiple RMs
      • Active / Standby architecture
          – Potentially multiple standbys
          – Warm standby
              • Running
              • Loads state and starts RPC servers on becoming Active
          – Manual / automatic failover
          – Clients and Web UI fail over automatically
  • 13. Active / Standby [diagram: 2 × Client, Active Resource Manager, Standby Resource Manager, RM Store, 2 × Node Manager, App Master]
  • 14. Manual Failover through CLI [diagram: 2 × Client, Active Resource Manager, Standby Resource Manager, RM Store, 2 × Node Manager, App Master]
  • 15. Client Failover (ConfiguredRMFailoverProxyProvider) [diagram: 2 × Client, Active Resource Manager, Standby Resource Manager, RM Store, 2 × Node Manager, 2 × App Master]
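Client failover on slide 15 can be sketched as a client that walks a configured list of RMs until one answers as Active. Everything below is a hypothetical toy (`FakeRM`, `StandbyError`, `RoundRobinFailoverClient`, `get_cluster_metrics`); it illustrates the idea of a configured failover provider and is not the Hadoop class.

```python
class StandbyError(Exception):
    """Raised by a (hypothetical) RM endpoint that is not currently Active."""


class FakeRM:
    """Stand-in for an RM endpoint; only an Active RM serves requests."""

    def __init__(self, rm_id, active):
        self.rm_id = rm_id
        self.active = active

    def get_cluster_metrics(self):
        if not self.active:
            raise StandbyError(self.rm_id)
        return {"served_by": self.rm_id}


class RoundRobinFailoverClient:
    """Tries the configured RMs in order, moving on when one is not Active."""

    def __init__(self, rms):
        self._rms = list(rms)
        self._current = 0  # index of the RM last believed to be Active

    def call(self, method_name):
        for _ in range(len(self._rms)):
            rm = self._rms[self._current]
            try:
                return getattr(rm, method_name)()
            except StandbyError:
                # Fail over to the next configured RM and try again.
                self._current = (self._current + 1) % len(self._rms)
        raise RuntimeError("no Active RM found among the configured RMs")


rm1, rm2 = FakeRM("rm1", active=True), FakeRM("rm2", active=False)
client = RoundRobinFailoverClient([rm1, rm2])
first = client.call("get_cluster_metrics")

rm1.active, rm2.active = False, True   # failover happens on the cluster side
second = client.call("get_cluster_metrics")
```

The client remembers the last RM that worked, so it pays the failover cost once rather than on every call.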
  • 16. Automatic Failover [diagram: 2 × Client, Active Resource Manager, Standby Resource Manager, 2 × Elector, ZK, RM Store, 2 × Node Manager, App Master]
  • 17. Automatic Failover [diagram: 2 × Client, Active Resource Manager, Standby Resource Manager, 2 × Elector, ZK, RM Store, 2 × Node Manager, App Master]
  • 18. Automatic Failover
      • ZooKeeper based
          – Uses ActiveStandbyElector for Active election
      • No need for a FailoverController
          – Downside: can’t monitor RM process health and recover
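The election on slide 18 can be pictured as a race for a single lock that disappears when its owner's session ends. The sketch below replaces ZooKeeper with an in-memory object; `ToyCoordinator`, `ToyRM` and their methods are hypothetical names, and the real ActiveStandbyElector is considerably more involved.

```python
class ToyCoordinator:
    """In-memory stand-in for a ZooKeeper ensemble holding one ephemeral lock."""

    def __init__(self):
        self.lock_owner = None
        self._watchers = {}

    def try_acquire(self, candidate_id):
        """Take the lock if it is free; report whether the candidate owns it."""
        if self.lock_owner is None:
            self.lock_owner = candidate_id
        return self.lock_owner == candidate_id

    def watch(self, candidate_id, callback):
        self._watchers[candidate_id] = callback

    def expire_session(self, candidate_id):
        """The lock vanishes with its owner's session; other candidates are notified."""
        if self.lock_owner != candidate_id:
            return
        self.lock_owner = None
        for watcher_id, callback in self._watchers.items():
            if watcher_id != candidate_id:
                callback()


class ToyRM:
    def __init__(self, rm_id, coordinator):
        self.rm_id = rm_id
        self.state = "standby"
        self._coordinator = coordinator
        coordinator.watch(rm_id, self.join_election)

    def join_election(self):
        won = self._coordinator.try_acquire(self.rm_id)
        self.state = "active" if won else "standby"

    def lose_session(self):
        self.state = "standby"
        self._coordinator.expire_session(self.rm_id)


zk = ToyCoordinator()
rm1, rm2 = ToyRM("rm1", zk), ToyRM("rm2", zk)
rm1.join_election()
rm2.join_election()
states_before = (rm1.state, rm2.state)

rm1.lose_session()                      # rm1's session ends, rm2 is notified
states_after = (rm1.state, rm2.state)
```

In this toy the old Active steps down the moment its session ends. During a real network hiccup it may not notice right away, which is the situation the following slides address with fencing.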
  • 19. Network Hiccup [diagram: 2 × Client, Active Resource Manager, Standby Resource Manager, 2 × Elector, ZK, RM Store, 2 × Node Manager, App Master]
  • 20. Multiple Actives? [diagram: 2 × Client, 2 × Active Resource Manager, 2 × Elector, ZK, RM Store, 2 × Node Manager, App Master]
  • 21. Fencing
      • The state store gets corrupted when multiple RMs assume the Active role.
      • Exclusive access to a single RM.
          – ZKRMStateStore takes care of this.
          – Shared “admin” access.
          – Exclusive “create-delete” access on transition to Active.
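The fencing idea on slide 21 can be sketched as a store that accepts writes only from the RM that most recently claimed exclusive access. This is a toy model under stated assumptions: `ToyFencedStore`, `claim_exclusive_access` and `FencedError` are hypothetical, and the shared "admin" access from the slide is modelled simply by letting any RM call the claim method.

```python
class FencedError(Exception):
    """Raised when an RM without create-delete access tries to write."""


class ToyFencedStore:
    def __init__(self):
        self._writer = None   # RM currently holding exclusive create-delete access
        self._data = {}

    def claim_exclusive_access(self, rm_id):
        """Called on transition to Active; implicitly revokes the previous writer."""
        self._writer = rm_id

    def write(self, rm_id, key, value):
        if rm_id != self._writer:
            raise FencedError(f"{rm_id} has been fenced out")
        self._data[key] = value

    def read(self, key):
        return self._data.get(key)


store = ToyFencedStore()
store.claim_exclusive_access("rm1")
store.write("rm1", "application_0001", "RUNNING")

# rm2 wins the election while rm1 still believes it is Active.
store.claim_exclusive_access("rm2")
try:
    store.write("rm1", "application_0001", "FAILED")
    stale_write_rejected = False
except FencedError:
    stale_write_rejected = True

store.write("rm2", "application_0002", "ACCEPTED")
```

Because the check lives in the store rather than in the RMs, a stale Active cannot corrupt state even if it never learns that it lost the election.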
  • 22. Network Hiccup [diagram: 2 × Client, Active Resource Manager, Standby Resource Manager, 2 × Elector, ZK, RM Store, 2 × Node Manager, App Master]
  • 23. Active / Standby [diagram: 2 × Client, Active Resource Manager, Standby Resource Manager, 2 × Elector, ZK, RM Store, 2 × Node Manager, App Master]
  • 24. In-flight RPCs
      • In-flight RPCs: retry or not?
          – E.g. submit application: we clearly don’t want two applications submitted.
      • Depends on whether failover happens before, during, or after the RM acts on the call.
      • Solution
          – Annotate APIs as Idempotent or AtMostOnce
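The retry question on slide 24 can be illustrated with two markers and a client that resends a call whose reply was lost. The sketch is a toy under loud assumptions: the decorators, `ToyRMService`, the `call_id` argument and the response cache are all hypothetical, and the cache is only one common way to honour an at-most-once contract, not a statement of how the RM does it.

```python
def idempotent(fn):
    """Mark a call as safe to execute any number of times."""
    fn.retry_policy = "idempotent"
    return fn


def at_most_once(fn):
    """Mark a call whose effect must be applied no more than once."""
    fn.retry_policy = "at_most_once"
    return fn


class ToyRMService:
    def __init__(self):
        self.applications = []
        self._responses = {}   # hypothetical reply cache keyed by call id

    @idempotent
    def get_application_count(self, call_id):
        return len(self.applications)

    @at_most_once
    def submit_application(self, call_id, name):
        if call_id in self._responses:
            return self._responses[call_id]   # duplicate arrival, replay the reply
        self.applications.append(name)
        self._responses[call_id] = f"accepted:{name}"
        return self._responses[call_id]


def call_with_retry(method, *args, lost_replies=1):
    """Resend the call after each lost reply; refuse to retry unannotated calls."""
    if getattr(method, "retry_policy", None) is None:
        raise ValueError("refusing to retry an unannotated call")
    result = None
    for _ in range(lost_replies + 1):
        result = method(*args)   # the server acts every time the call arrives
    return result


service = ToyRMService()
reply = call_with_retry(service.submit_application, "call-1", "wordcount")
count = call_with_retry(service.get_application_count, "call-2")
```

The submit call reaches the service twice, but only one application is recorded, which is the outcome the slide asks for.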
  • 25. Web UI
      • Standby RM has no/stale information.
      • Users don’t know which RM is Active.
      • Redirect Web UI and REST calls to Active RM.
          – Except a few pages that give information about the RM.
  • 26. Admin Refresh
      • Admin refresh ($ yarn rmadmin -refresh*, for example -refreshQueues):
          – Refreshes only the RM it is issued against, whether Active or Standby
          – Uses local configuration file
      • FileSystemBasedConfigurationProvider
          – Upload the configuration files to a (potentially shared) filesystem like HDFS.
  • 27. Setting up HA
        Config name                              Value
        yarn.resourcemanager.ha.enabled          true
        yarn.resourcemanager.ha.rm-ids           rm1,rm2
        yarn.resourcemanager.hostname.rm1        <host1>
        yarn.resourcemanager.hostname.rm2        <host2>
        yarn.resourcemanager.recovery.enabled    true
        yarn.resourcemanager.store.class         ZKRMStateStore [1]
        yarn.resourcemanager.zk-address          <zk-quorum>
        yarn.resourcemanager.cluster-id          <cluster-id>
        [1] org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
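The settings on slide 27 normally live in yarn-site.xml as name/value properties. The sketch below renders them from a dictionary. The host names, ZooKeeper quorum and cluster id are placeholder assumptions standing in for `<host1>`, `<host2>`, `<zk-quorum>` and `<cluster-id>`, and `render_site_xml` is a hypothetical helper, not a Hadoop tool.

```python
import xml.etree.ElementTree as ET

# Property names come from the slide; values marked "placeholder" are assumptions.
ha_properties = {
    "yarn.resourcemanager.ha.enabled": "true",
    "yarn.resourcemanager.ha.rm-ids": "rm1,rm2",
    "yarn.resourcemanager.hostname.rm1": "rm1.example.com",   # placeholder
    "yarn.resourcemanager.hostname.rm2": "rm2.example.com",   # placeholder
    "yarn.resourcemanager.recovery.enabled": "true",
    "yarn.resourcemanager.store.class":
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
    "yarn.resourcemanager.zk-address": "zk1.example.com:2181",  # placeholder
    "yarn.resourcemanager.cluster-id": "example-cluster",       # placeholder
}


def render_site_xml(properties):
    """Build a <configuration> document with one <property> per setting."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):          # pretty-print where available (Python 3.9+)
        ET.indent(configuration)
    return ET.tostring(configuration, encoding="unicode")


site_xml = render_site_xml(ha_properties)
print(site_xml)
```

Both RMs would typically share the same file, since each property is either cluster-wide or keyed by an RM id.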
  • 28. Demo!
  • 29. Questions?
