YARN High Availability
Karthik Kambatla – Cloudera Inc
Xuan Gong – Hortonworks Inc
Speakers: Karthik Kambatla – Cloudera Inc and Xuan Gong – Hortonworks Inc

Transcript of "YARN High Availability"

  1. YARN High Availability
     Karthik Kambatla – Cloudera Inc
     Xuan Gong – Hortonworks Inc
  2. Outline
     • Background
       – YARN architecture and need for HA
     • RM HA architecture
       – Persisting the state
       – Active/Standby pair and fencing
       – Failover and redirection
     • Configuring HA
     • Demo
     6/30/2014 YARN High Availability, Hadoop Summit
  3. YARN Architecture
     [Diagram: Resource Manager; Node Manager (x3); App Master; Container; Client (x2); Cluster State; Applications State]
  4. Fault-tolerance
     [Diagram: Resource Manager; Node Manager (x3); App Master (x2); Container (x2); Client (x2); Cluster State; Applications State]
  5. Naïve RM Restart
     [Diagram: Resource Manager; Node Manager (x2); Client (x2); App Master; Cluster State; Applications State]
  6. ResourceManager is a YARN cluster’s single point of failure.
     Need stateful restart and multiple RMs.
  7. Highly Available Resource Manager, a.k.a. HARMful YARN
     • Currently shipped
       – Beta in Apache Hadoop 2.3.0
       – Stable in Apache Hadoop 2.4.0
       – More stable in Apache Hadoop 2.4.1
  8. Stateful RM Restart (Phase 1)
     [Diagram: Node Manager (x2); App Master (x2); Container (x2); Client (x2); Resource Manager; Cluster State; Applications State; RM Store]
  9. RM Store Implementations
     • Memory store: for testing purposes
     • Filesystem-based store
       – Any file system: local, HDFS or any other
     • ZooKeeper-based store (ZKRMStateStore)
       – Recommended (for fencing)
       – Loading 10,000 applications takes about 8.5 seconds.
  10. Implications for running applications
      • In-flight work is lost.
      • AMs are restarted.
      • AMs could checkpoint completed work.
        – MapReduce AM does.
        – Consider a job with 100 map tasks:
          • The RM goes down after 90 map tasks finish.
          • After restart, only the remaining 10 are run.
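The 100-map-task example can be sketched in a few lines. This is only an illustration of the checkpointing idea, not MapReduce AM code: the task names, the `tasks_to_run_after_restart` helper, and the assumption that the checkpoint is a plain list of finished task IDs are all hypothetical.

```python
def tasks_to_run_after_restart(all_tasks, checkpointed_done):
    """Return the tasks a restarted AM still has to run, in original order."""
    done = set(checkpointed_done)
    return [task for task in all_tasks if task not in done]


# Hypothetical job: 100 map tasks, the first 90 finished and were
# checkpointed before the RM went down.
map_tasks = [f"map_{i:03d}" for i in range(100)]
finished_before_failure = map_tasks[:90]

remaining = tasks_to_run_after_restart(map_tasks, finished_before_failure)
print(len(remaining))  # 10
```

An AM that does not checkpoint would correspond to passing an empty list, in which case all 100 tasks run again.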
  11. Stateful RM Restart (Phase 2)
      • Under development (YARN-556)
        – No loss of in-flight work
      • Related work
        – Work-preserving NodeManager restart (YARN-1336)
        – Work-preserving ApplicationMaster restart (YARN-1489)
  12. Multiple RMs
      • Active / Standby architecture
        – Potentially multiple standbys
        – Warm standby
          • Running
          • Loads state and starts RPC servers on becoming Active
        – Manual / automatic failover
        – Clients and Web UI fail over automatically
  13. Active / Standby
      [Diagram: Node Manager (x2); App Master; Client (x2); Active Resource Manager; RM Store; Standby Resource Manager]
  14. Manual Failover through CLI
      [Diagram: Node Manager (x2); App Master; Client (x2); Active Resource Manager; RM Store; Standby Resource Manager]
  15. Client Failover (ConfiguredFailoverProxyProvider)
      [Diagram: Node Manager (x2); App Master (x2); Client (x2); Active Resource Manager; RM Store; Standby Resource Manager]
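A rough sketch of what client-side failover over a configured list of RMs could look like. It assumes the client simply moves to the next configured RM ID when the current one does not answer as Active; `FailoverClient`, `RMUnavailable`, and the `connect` factory are hypothetical names for illustration and are not Hadoop classes.

```python
class RMUnavailable(Exception):
    """Raised by a (hypothetical) RM stub that is not Active or not reachable."""


class FailoverClient:
    def __init__(self, rm_ids, connect):
        self.rm_ids = list(rm_ids)   # e.g. the configured rm1, rm2
        self.connect = connect       # hypothetical factory: rm_id -> callable(request)
        self.current = 0             # index of the RM we currently believe is Active

    def call(self, request, max_attempts=None):
        attempts = max_attempts or 2 * len(self.rm_ids)
        last_error = None
        for _ in range(attempts):
            rm_id = self.rm_ids[self.current]
            try:
                return self.connect(rm_id)(request)
            except RMUnavailable as err:
                # Fail over: try the next configured RM, wrapping around.
                last_error = err
                self.current = (self.current + 1) % len(self.rm_ids)
        raise RMUnavailable(f"no Active RM after {attempts} attempts") from last_error


# Toy cluster: only the RM named in `active` accepts requests.
active = {"id": "rm2"}


def connect(rm_id):
    def stub(request):
        if rm_id != active["id"]:
            raise RMUnavailable(rm_id)
        return f"{rm_id} handled {request}"
    return stub


client = FailoverClient(["rm1", "rm2"], connect)
first = client.call("getApplications")   # rm1 refuses, client fails over to rm2
active["id"] = "rm1"                      # simulate a failover back to rm1
second = client.call("getApplications")
```

The client remembers the last RM that worked, so it only pays the failover cost again when the Active RM changes.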
  16. Automatic Failover
      [Diagram: Node Manager (x2); App Master; Client (x2); Active Resource Manager; RM Store; Standby Resource Manager; Elector (x2); ZK]
  17. Automatic Failover
      [Diagram: Node Manager (x2); App Master; Client (x2); Active Resource Manager; RM Store; Standby Resource Manager; Elector (x2); ZK]
  18. Automatic Failover
      • ZooKeeper-based
        – Uses ActiveStandbyElector for Active election
      • No need for a FailoverController
        – But can’t monitor RM process health and recover
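The election idea can be sketched with an in-memory stand-in for ZooKeeper. The sketch assumes the usual lock-node pattern: whoever creates the lock becomes Active, the others watch it and retry when it disappears. `FakeZK` and `Elector` are hypothetical classes for illustration; they are not the ActiveStandbyElector API and no real ZooKeeper client is involved.

```python
class FakeZK:
    """In-memory stand-in for a single ephemeral lock node."""

    def __init__(self):
        self.lock_owner = None
        self.watchers = []

    def try_create_lock(self, session_id):
        if self.lock_owner is None:
            self.lock_owner = session_id
            return True
        return False

    def watch_lock(self, callback):
        self.watchers.append(callback)

    def release(self, session_id):
        # Models the ephemeral node vanishing when its owner's session ends.
        if self.lock_owner != session_id:
            return
        self.lock_owner = None
        pending, self.watchers = self.watchers, []
        for callback in pending:
            callback()


class Elector:
    def __init__(self, rm_id, zk):
        self.rm_id = rm_id
        self.zk = zk
        self.state = "STANDBY"

    def join_election(self):
        if self.zk.try_create_lock(self.rm_id):
            self.state = "ACTIVE"
        else:
            self.state = "STANDBY"
            self.zk.watch_lock(self.join_election)  # retry when the lock goes away

    def session_expired(self):
        self.state = "STANDBY"
        self.zk.release(self.rm_id)


zk = FakeZK()
rm1, rm2 = Elector("rm1", zk), Elector("rm2", zk)
rm1.join_election()
rm2.join_election()
states_before = (rm1.state, rm2.state)

rm1.session_expired()                    # e.g. rm1 loses its ZooKeeper session
states_after = (rm1.state, rm2.state)
```

In this toy version the old Active is told about its session loss immediately; when it is not, two RMs can briefly both believe they are Active, which is the case fencing has to handle.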
  19. Network Hiccup
      [Diagram: Node Manager (x2); App Master; Client (x2); Active Resource Manager; RM Store; Standby Resource Manager; Elector (x2); ZK]
  20. Multiple Actives?
      [Diagram: Node Manager (x2); App Master; Client (x2); Active Resource Manager (x2); RM Store; Elector (x2); ZK]
  21. Fencing
      • The state store gets corrupted when multiple RMs assume the Active role.
      • Exclusive access to a single RM.
        – ZKRMStateStore takes care of this.
        – Shared “admin” access.
        – Exclusive “create-delete” access on transition to Active
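A minimal sketch of the fencing idea, modelling only the exclusive "create-delete" access as a single owner field. This is an assumption-laden toy: `FakeStateStore`, `Fenced`, and the method names are hypothetical, and the real ZKRMStateStore mechanism (ZooKeeper permissions) is not reproduced here.

```python
class Fenced(Exception):
    """Raised when an RM without exclusive access tries to write."""


class FakeStateStore:
    def __init__(self):
        self.create_delete_owner = None   # RM currently holding exclusive access
        self.apps = {}

    def claim_exclusive_access(self, rm_id):
        # Called on transition to Active: the new Active takes over write
        # access, implicitly revoking it from the previous holder.
        self.create_delete_owner = rm_id

    def store_app(self, rm_id, app_id, state):
        if rm_id != self.create_delete_owner:
            raise Fenced(f"{rm_id} no longer has create-delete access")
        self.apps[app_id] = state


store = FakeStateStore()
store.claim_exclusive_access("rm1")
store.store_app("rm1", "app_1", "RUNNING")

# rm2 becomes Active while rm1 still believes it is Active.
store.claim_exclusive_access("rm2")
try:
    store.store_app("rm1", "app_1", "FINISHED")
    stale_write_rejected = False
except Fenced:
    stale_write_rejected = True

store.store_app("rm2", "app_2", "RUNNING")
```

The stale RM's write fails instead of overwriting state, so the store stays consistent even if both RMs think they are Active for a moment.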
  22. Network Hiccup
      [Diagram: Node Manager (x2); App Master; Client (x2); Active Resource Manager; RM Store; Standby Resource Manager; Elector (x2); ZK]
  23. Active / Standby
      [Diagram: Node Manager (x2); App Master; Client (x2); Active Resource Manager; RM Store; Standby Resource Manager; Elector (x2); ZK]
  24. In-flight RPCs
      • In-flight RPCs: Retry or not?
        – E.g. Submit application – we clearly don’t want two applications submitted.
      • Depends on whether failover happens before, during, or after the RM acts on the call.
      • Solution
        – Annotate APIs as Idempotent or AtMostOnce
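To make the submit-application example concrete, here is a sketch of a call that is safe to retry because the server recognizes a repeated application ID. It does not use Hadoop's annotations; `FakeRM`, `call_with_retry`, and the `fail_next_response` switch are hypothetical, and keying on the application ID is just one assumed way to get retry-safe behaviour.

```python
class FakeRM:
    def __init__(self):
        self.apps = {}
        self.fail_next_response = False   # simulate failover after the RM acted

    def submit_application(self, app_id, spec):
        # Retry-safe: a repeated app_id is recognized and not submitted twice.
        if app_id not in self.apps:
            self.apps[app_id] = spec
        if self.fail_next_response:
            self.fail_next_response = False
            raise ConnectionError("response lost during failover")
        return app_id


def call_with_retry(fn, *args, retries=1):
    for attempt in range(retries + 1):
        try:
            return fn(*args)
        except ConnectionError:
            if attempt == retries:
                raise


rm = FakeRM()
rm.fail_next_response = True   # the RM stores the app, then the reply is lost
result = call_with_retry(rm.submit_application, "app_1", {"name": "wordcount"})
submitted = len(rm.apps)
```

The client cannot tell whether the first attempt was applied, so the retry is only acceptable because a duplicate submission has no second effect.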
  25. Web UI
      • Standby RM has no/stale information.
      • Users don’t know which RM is Active.
      • Redirect Web UI and REST calls to Active RM.
        – Except a few pages that give information about the RM.
  26. Admin Refresh
      • Admin refresh ($ yarn rmadmin -refresh* commands):
        – Refreshes that particular RM – Active/Standby
        – Uses local configuration file
      • FileSystemBasedConfigurationProvider
        – Upload the configuration files to a (potentially shared) filesystem like HDFS.
  27. Setting up HA
      Config name                              Value
      yarn.resourcemanager.ha.enabled          true
      yarn.resourcemanager.ha.rm-ids           rm1,rm2
      yarn.resourcemanager.hostname.rm1        <host1>
      yarn.resourcemanager.hostname.rm2        <host2>
      yarn.resourcemanager.recovery.enabled    true
      yarn.resourcemanager.store.class         ZKRMStateStore (1)
      yarn.resourcemanager.zk-address          <zk-quorum>
      yarn.resourcemanager.cluster-id          <cluster-id>
      (1) org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
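As a sketch, the table above can be rendered into the property/name/value layout that Hadoop configuration files use, with a small check that every RM ID has a hostname entry. The host names, ZooKeeper addresses, and cluster ID below are placeholders chosen for illustration, and the two helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Keys are from the table; host, quorum and cluster-id values are placeholders.
ha_properties = {
    "yarn.resourcemanager.ha.enabled": "true",
    "yarn.resourcemanager.ha.rm-ids": "rm1,rm2",
    "yarn.resourcemanager.hostname.rm1": "host1.example.com",
    "yarn.resourcemanager.hostname.rm2": "host2.example.com",
    "yarn.resourcemanager.recovery.enabled": "true",
    "yarn.resourcemanager.store.class":
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
    "yarn.resourcemanager.zk-address": "zk1.example.com:2181,zk2.example.com:2181",
    "yarn.resourcemanager.cluster-id": "example-cluster",
}


def missing_hostnames(properties):
    """Return RM IDs listed in ha.rm-ids that have no hostname property."""
    rm_ids = [i.strip() for i in properties["yarn.resourcemanager.ha.rm-ids"].split(",")]
    return [i for i in rm_ids if f"yarn.resourcemanager.hostname.{i}" not in properties]


def render_yarn_site(properties):
    """Render properties as a <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


problems = missing_hostnames(ha_properties)
xml_text = render_yarn_site(ha_properties)
print(xml_text)
```

The same properties would normally be placed in yarn-site.xml on both RM hosts.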
  28. Demo!
  29. Questions?
