• Save
YARN High Availability
 

YARN High Availability

on

  • 560 views

Speakers: Karthik Kambatla – Cloudera Inc and Xuan Gong – Hortonworks Inc

Speakers: Karthik Kambatla – Cloudera Inc and Xuan Gong – Hortonworks Inc

Statistics

Views

Total Views
560
Views on SlideShare
550
Embed Views
10

Actions

Likes
1
Downloads
0
Comments
0

2 Embeds 10

http://www.slideee.com 7
https://www.linkedin.com 3

Accessibility

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

YARN High Availability YARN High Availability Presentation Transcript

  • YARN High Availability Karthik Kambatla – Cloudera Inc Xuan Gong – Hortonworks Inc
  • Outline • Background – YARN architecture and need for HA • RM HA architecture – Persisting the state – Active/ Standby pair and Fencing – Failover and redirection • Configuring HA • Demo 6/30/2014 YARN High Availability, Hadoop Summit 2
  • YARN Architecture 6/30/2014 YARN High Availability, Hadoop Summit 3 Resource Manager Node Manager Node Manager Node Manager App Master Container Client Client Cluster State Applications State View slide
  • Fault-tolerance 6/30/2014 YARN High Availability, Hadoop Summit 4 Resource Manager Node Manager Node Manager Node Manager App Master Container Client Client App Master ContainerCluster State Applications State View slide
  • Naïve RM Restart 6/30/2014 YARN High Availability, Hadoop Summit 5 Resource Manager Node Manager Node Manager Client Client App Master Cluster State Applications State
  • ResourceManager is a YARN cluster’s single point of failure. 6/30/2014 YARN High Availability, Hadoop Summit 6 Need stateful restart and multiple RMs.
  • Highly Available Resource Manager a.k.a. HARMful YARN • Currently shipped – Beta in Apache Hadoop 2.3.0 – Stable in Apache Hadoop 2.4.0 – More stable in Apache Hadoop 2.4.1 6/30/2014 YARN High Availability, Hadoop Summit 7
  • Stateful RM Restart (Phase 1) 6/30/2014 YARN High Availability, Hadoop Summit 8 Node Manager Node Manager App Master Container Client Client Resource Manager Cluster State Applications State RM Store App Master Container
  • RM Store Implementations • Memory store – testing purposes • Filesystem based store – Any file system: local, HDFS or any other • Zookeeper based store (ZKRMStateStore) – Recommended (for fencing) – Loading 10,000 applications takes about 8.5 secs. 6/30/2014 YARN High Availability, Hadoop Summit 9
  • Implications to Running applications • In-flight work is lost. • AMs are restarted. • AMs could checkpoint completed work. – MapReduce AM does. – Consider a job with 100 map tasks • If RM goes down after 90 map tasks finish. • After restart, only the remaining 10 are run. 6/30/2014 YARN High Availability, Hadoop Summit 10
  • Stateful RM Restart (Phase 2) • Under development (YARN-556) – No loss of in-flight work • Related work – Work-preserving NodeManager restart (YARN- 1336) – Work-preserving ApplicationMaster restart (YARN- 1489) 6/30/2014 YARN High Availability, Hadoop Summit 11
  • Multiple RMs • Active / Standby architecture – Potentially multiple standbys – Warm standby • Running • Loads state and starts RPC servers on becoming Active – Manual / automatic failover – Clients and Web UI failover automatically 6/30/2014 YARN High Availability, Hadoop Summit 12
  • Active / Standby 6/30/2014 YARN High Availability, Hadoop Summit 13 Node Manager Node Manager App Master Client Client Active Resource Manager RM Store Standby Resource Manager
  • Manual Failover through CLI 6/30/2014 YARN High Availability, Hadoop Summit 14 Node Manager Node Manager App Master Client Client Active Resource Manager RM Store Standby Resource Manager
  • Client Failover (ConfiguredFailoverProxyProvider) 6/30/2014 YARN High Availability, Hadoop Summit 15 Node Manager Node Manager App Master Client Client Active Resource Manager RM Store Standby Resource Manager App Master
  • Automatic Failover 6/30/2014 YARN High Availability, Hadoop Summit 16 Node Manager Node Manager App Master Client Client Active Resource Manager RM Store Standby Resource Manager Elector Elector ZK
  • Automatic Failover 6/30/2014 YARN High Availability, Hadoop Summit 17 Node Manager Node Manager App Master Client Client Active Resource Manager RM Store Standby Resource Manager Elector Elector ZK
  • Automatic Failover • Zookeeper based – Uses ActiveStandbyElector for Active election • No need for a FailoverController – Can’t monitor RM process health and recover 6/30/2014 YARN High Availability, Hadoop Summit 18
  • Network Hiccup 6/30/2014 YARN High Availability, Hadoop Summit 19 Node Manager Node Manager App Master Client Client Active Resource Manager RM Store Standby Resource Manager Elector Elector ZK
  • Multiple Actives? 6/30/2014 YARN High Availability, Hadoop Summit 20 Node Manager Node Manager App Master Client Client Active Resource Manager RM Store Active Resource Manager Elector Elector ZK
  • Fencing • The state store gets corrupted when multiple RMs assume the Active role. • Exclusive access to a single RM. – ZKRMStateStore takes care of this. – Shared “admin” access. – Exclusive “create-delete” access on transition to Active 6/30/2014 YARN High Availability, Hadoop Summit 21
  • Network Hiccup 6/30/2014 YARN High Availability, Hadoop Summit 22 Node Manager Node Manager App Master Client Client Active Resource Manager RM Store Standby Resource Manager Elector Elector ZK
  • Active / Standby 6/30/2014 YARN High Availability, Hadoop Summit 23 Node Manager Node Manager App Master Client Client Active Resource Manager RM Store Standby Resource Manager Elector Elector ZK
  • In-flight RPCs • In-flight RPCs: Retry or not? – E.g. Submit application – we clearly don’t want two applications submitted. • Depends on whether failover happens before, during, or after the RM acts on the call. • Solution – Annotate APIs as Idempotent or AtMostOnce 6/30/2014 YARN High Availability, Hadoop Summit 24
  • Web UI • Standby RM has no/stale information. • Users don’t know which RM is Active. • Redirect Web UI and REST calls to Active RM. – Except a few pages that give information about the RM. 6/30/2014 YARN High Availability, Hadoop Summit 25
  • Admin Refresh • Admin refresh ($ yarn rmadmin –refresh): – Refreshes that particular RM – Active/Standby – Uses local configuration file • FileSystemBasedConfigurationProvider – Upload the configuration files to (potentially shared) filesystem like HDFS. 6/30/2014 YARN High Availability, Hadoop Summit 26
  • Setting up HA Config name Value yarn.resourcemanager.ha.enabled true yarn.resourcemanager.ha.rm-ids rm1,rm2 yarn.resourcemanager.hostname.rm1 <host1> yarn.resourcemanager.hostname.rm2 <host2> yarn.resourcemanager.recovery.enabled true yarn.resourcemanager.store.class ZKRMStateStore1 yarn.resourcemanager.zk-address <zk-quorum> yarn.resourcemanager.cluster-id <cluster-id> 6/30/2014 YARN High Availability, Hadoop Summit 27 1. org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
  • Demo! 6/30/2014 YARN High Availability, Hadoop Summit 28
  • Questions? 6/30/2014 YARN High Availability, Hadoop Summit 29