Hadoop 2.0 beta
Stable YARN APIs
MR binary compatibility
Testing with the whole stack
Ready for prime-time!
YARN API stability
Broke APIs one last time
Security: tokens used irrespective of Kerberos
Read-only IDs, factories for creating records
Compatibility for existing applications
Old mapred APIs binary compatible
New mapreduce APIs source compatible
Pig, Hive, Oozie, etc. work with the latest versions.
No need to rewrite your scripts.
Design and Work Plan
• YARN-128 – RM Restart
– Creates a framework to store and load state information,
forming the basis of failover and HA. Work is close to
completion and being actively tested.
• YARN-149 – RM HA
– Adds an HA service to the RM in order to fail over between
instances. Work is actively in progress.
• YARN-556 – Work preserving RM Restart
– Supports lossless recovery of cluster state when the RM
restarts or fails over
– Design proposal up
• All the work is being done in a carefully planned
manner directly on trunk; the code is always kept stable.
RM Restart (YARN-128)
Current state of the implementation
Impact on applications/frameworks
How to use
• Supports ZooKeeper, HDFS, and local
FileSystem as the underlying store.
– All clients of the RM (NMs, AMs, job clients) have the same
retry behavior while the RM is down.
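This uniform retry behavior is driven by client-side configuration; a sketch of the relevant yarn-site.xml knobs (property names as in the Hadoop 2.x line — verify against your version's documentation):

```xml
<!-- How long clients (NMs, AMs, job clients) keep retrying the RM -->
<property>
  <name>yarn.resourcemanager.connect.max-wait.ms</name>
  <value>900000</value> <!-- give up after 15 minutes -->
</property>
<!-- How often they retry in the meantime -->
<property>
  <name>yarn.resourcemanager.connect.retry-interval.ms</name>
  <value>30000</value> <!-- retry every 30 seconds -->
</property>
```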
• RM restart is working in secure environment
• Two types of state info:
– Application-related state info, saved asynchronously:
• ApplicationSubmissionContext (including the AM
ContainerLaunchContext), AM container, AMRMToken,
ClientTokenMasterKey, etc.
– RMDelegationTokenSecretManager state (not
application-specific), saved synchronously:
• RMDelegationToken master key
• RMDelegationToken sequence number
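On the ZooKeeper store, the two kinds of state live under separate znode subtrees. An illustrative layout (znode names modeled on ZKRMStateStore; exact paths may differ across versions):

```
/rmstore/ZKRMStateRoot/
├── RMAppRoot/                  # application state, written asynchronously
│   └── application_<cluster-ts>_0001/
│       └── appattempt_<cluster-ts>_0001_000001   # AMRMToken, ClientTokenMasterKey
└── RMDTSecretManagerRoot/      # delegation-token state, written synchronously
    ├── DelegationKey_*         # RMDelegationToken master keys
    └── RMDelegationToken_*     # tokens + sequence number
```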
RM Recovery Workflow
• Save the app on app submission
– User-provided credentials (HDFSDelegationToken)
• Save the attempt on AM attempt launch
– AMRMToken, ClientToken
– Save the token and sequence number on token
generation/renewal
– Save the master key when it rolls
• RM crashes…
What happens after RM restarts?
• Instruct the old AM to shut down
• Load the ApplicationSubmissionContext
– Submit the application
• Load the earlier attempts
– Load the attempt credentials (AMRMToken, ClientToken)
• Launch a new attempt
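The save-on-submit / recover-on-restart cycle above can be sketched as a toy simulation (plain Python; all names are hypothetical stand-ins — the real RM persists to a pluggable state store and relaunches actual AM containers):

```python
class StateStore:
    """Stand-in for the FileSystem/ZK RMStateStore."""
    def __init__(self):
        self.apps = {}       # app_id -> submission context (saved on submit)
        self.attempts = {}   # app_id -> list of attempt credentials

class RM:
    def __init__(self, store):
        self.store = store
        self.running = {}    # app_id -> current attempt number

    def submit(self, app_id, context):
        # Save the app on submission, before acting on it
        self.store.apps[app_id] = context
        self.launch_attempt(app_id)

    def launch_attempt(self, app_id):
        creds = {"AMRMToken": f"amrm-{app_id}", "ClientToken": f"client-{app_id}"}
        # Save the attempt (and its credentials) on AM launch
        self.store.attempts.setdefault(app_id, []).append(creds)
        self.running[app_id] = len(self.store.attempts[app_id])

    def recover(self):
        # After restart: reload every saved app and start a fresh attempt;
        # the real RM also instructs any still-running old AM to shut down.
        for app_id in self.store.apps:
            self.launch_attempt(app_id)

store = StateStore()
rm = RM(store)
rm.submit("app_0001", {"queue": "default"})

rm2 = RM(store)   # "crash" and restart against the same store
rm2.recover()
print(len(store.attempts["app_0001"]))  # 2 attempts: original + recovered
```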
Consistency between Downstream
consumers of AM and YARN
• AM should notify its consumers that the job is
done only after YARN reports it’s done
– The user is expected to retry this API until it succeeds
– Similarly for kill-application (fix in progress)
For MR AM
– JobClient: AM crashes after the JobClient sees
FINISHED but before the RM removes the app when
the app finishes
• Bug: relaunching a FINISHED application (succeeded,
failed, or killed)
– HistoryServer: history files are flushed before the RM
removes the app when the app finishes
How to use: 3 steps
• 1. Enable RM restart
• 2. Choose the underlying store you want (HDFS, ZooKeeper,
or local FileSystem)
– FileSystemRMStateStore / ZKRMStateStore
• 3. Configure the address of the store
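A yarn-site.xml sketch of the three steps for the ZooKeeper store (property names follow the Hadoop 2.x releases of this work; check your version's docs, as the ZK address key in particular has varied between releases):

```xml
<!-- 1. Enable RM restart/recovery -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<!-- 2. Pick the store implementation -->
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<!-- 3. Point it at the store's address -->
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```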
● Active / Standby
Standby is powered up, but doesn't run any of the
active services
● Restructure RM services (YARN-1098)
Always On services
Active Services (e.g. Client <-> RM, AM <-> RM)
● RMHAService (YARN-1027)
Failover / Admin
● CLI: yarn rmhaadmin
● Manual failover
● Automatic failover (YARN-1177)
Start it as an RM service instead of a separate daemon.
Revisit and strip out unnecessary parts.
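A manual failover with the admin CLI might look like the following console sketch (the `yarn rmhaadmin` subcommands shown here are modeled on `hdfs haadmin`, and the RM ids `rm1`/`rm2` are hypothetical):

```
$ yarn rmhaadmin -transitionToStandby rm1   # demote the current active
$ yarn rmhaadmin -transitionToActive rm2    # promote the standby
$ yarn rmhaadmin -getServiceState rm2       # confirm the new active
```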
● Implicit fencing through ZK RM StateStore
● ACL-based fencing on store.load() during
transition to active
Shared read-write-admin access to the store
Claim exclusive create-delete access
All store operations create and then delete a fencing node
The other RM can no longer write to the store
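The create-delete fencing scheme can be sketched as a toy simulation (plain Python; the dictionary-backed store and `claim_cd_access` method are stand-ins for ZooKeeper ACLs, not the real ZKRMStateStore code):

```python
class FencedStore:
    """Toy store: all RMs share read/write/admin access, but only the
    holder of exclusive create-delete rights can complete a write,
    because every write creates and deletes a fencing node."""
    def __init__(self):
        self.cd_owner = None   # who holds exclusive create-delete access
        self.data = {}

    def claim_cd_access(self, rm_id):
        # Transition-to-active: claim exclusive create-delete rights,
        # implicitly revoking them from any previous owner.
        self.cd_owner = rm_id

    def write(self, rm_id, key, value):
        self._touch_fence(rm_id)   # create the fencing node
        self.data[key] = value
        self._touch_fence(rm_id)   # delete the fencing node

    def _touch_fence(self, rm_id):
        # Create/delete on the fencing node fails for a fenced-out RM
        if self.cd_owner != rm_id:
            raise PermissionError(f"{rm_id} is fenced out")

store = FencedStore()
store.claim_cd_access("rm1")
store.write("rm1", "app_0001", "RUNNING")    # ok: rm1 is active

store.claim_cd_access("rm2")                 # rm2 goes active, fencing rm1
try:
    store.write("rm1", "app_0001", "FINISHED")
except PermissionError:
    pass                                     # old active can no longer write
store.write("rm2", "app_0001", "FINISHED")   # new active writes normally
```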