1.
Building Applications on YARN
Chris Riccomini
10/11/2012
2.
Staff Software Engineer at LinkedIn
http://riccomini.name
@criccomini
3.
What I want to Talk About
Anatomy of a YARN Application
Things to consider when building your application
Architecture
Operations
4.
Anatomy of a YARN App
Client
Application Master
Container Code
Resource Manager
Node Manager
5.
Anatomy of a YARN App
[Diagram: Clients, Resource Manager (RM), Node Managers (NM), Application Masters (AM), Container Code (CC)]
* simplified
6.
A lot to consider
Deployment
Metrics
Configuration
Security
Language
Logging
Fault Tolerance
Isolation
Dashboard
State
7.
Deployment
HDFS
HTTP
File (NFS)
DDOS’ing your servers
What we do: Tarball over HTTP. Life is easier with HDFS,
but operational overhead is too high.
8.
Metrics
Application-level metrics
YARN-level metrics
metrics2
Containers are transient
What we do: Both app-level and framework-level metrics use
the same metrics framework. Piped to an in-house metrics
dashboard. We don’t use metrics2 since we don’t want a
dependency on Hadoop in our core jar.
10.
Configuration
YARN config (yarn-site.xml, core-site.xml, etc.)
Application Configuration
Transporting Configuration
What we do: Config is fully resolved at client execution time.
No admin-override/locked config protection yet. Config is
passed from client to AM to containers via environment
variables.
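A minimal sketch of carrying fully resolved config through environment variables, from client to AM to containers. The variable name APP_CONFIG_JSON and both helper functions are hypothetical, chosen for illustration; the slides do not say how the config is actually encoded.

```python
import json
import os

# Hypothetical variable name; not taken from the talk.
CONFIG_ENV_VAR = "APP_CONFIG_JSON"


def export_config(config, env):
    """Return a copy of env with the resolved config serialized into one variable."""
    env = dict(env)
    env[CONFIG_ENV_VAR] = json.dumps(config, sort_keys=True)
    return env


def load_config(env=None):
    """Read the config back on the receiving side (AM or container)."""
    env = os.environ if env is None else env
    return json.loads(env.get(CONFIG_ENV_VAR, "{}"))


# The client resolves config once, then each hop re-exports it unchanged.
resolved = {"job.name": "demo", "task.count": 4}
client_env = export_config(resolved, {})
am_env = export_config(load_config(client_env), {"PATH": "/bin"})
container_config = load_config(am_env)
```

Packing everything into one JSON value keeps keys and types intact across hops, at the cost of being limited by the platform's environment size limit.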
11.
Security
Kerberos?
Firewalls are your friend
Gateway machine
Dashboard
What we do: Firewall all YARN machines so they can only
talk to each other. All users go through an LDAP-controlled
dashboard.
12.
Language
Favor complexity in Application Master, and make
container-logic thin
Talk to RM via REST
Potential to talk to RM via Protobuf RPC
What we do: The Application Master is written in Java. The task
side of the application has Python and Java implementations.
13.
Logging
Local storage (application is running)
HDFS storage (application has stopped for a while)
Be careful with STDOUT/STDERR (rollover)
What we do: No HDFS. Logs sit for 7 days, then disappear.
Not ideal.
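To illustrate what a 7-day retention window means for locally stored logs, here is a rough sketch that lists files past the cutoff. The function name and the use of file modification time are assumptions made for this example; the talk does not describe how the logs are actually expired.

```python
import os
import time

RETENTION_SECONDS = 7 * 24 * 60 * 60  # the 7 days mentioned on the slide


def expired_logs(log_dir, now=None, retention=RETENTION_SECONDS):
    """Return sorted paths under log_dir whose mtime is older than the retention window."""
    now = time.time() if now is None else now
    expired = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            if now - os.path.getmtime(path) > retention:
                expired.append(path)
    return sorted(expired)
```

A sweep like this only reports candidates; archiving them somewhere durable before deletion would address the "not ideal" part.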
14.
Fault Tolerance
Failure matrix
HA RM/NM
Orphaned processes
Pay attention to process trees
What we do: No HA. Manual failover when RM dies.
Orphaned process monitor (proc start time < RM start time).
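The orphan check (proc start time < RM start time) can be sketched as a simple filter. ProcessInfo and find_orphans are hypothetical names for this example, and how the start times are collected (for instance from the OS process table) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessInfo:
    pid: int
    start_time: float  # epoch seconds


def find_orphans(processes, rm_start_time):
    """Return processes that started before the current RM instance did.

    Assumption: without HA, an RM restart loses track of everything that
    was running before it, so any older container process is an orphan.
    """
    return [p for p in processes if p.start_time < rm_start_time]


procs = [ProcessInfo(101, 1000.0), ProcessInfo(102, 2500.0)]
orphans = find_orphans(procs, rm_start_time=2000.0)
```

Killing an orphan should take its whole process tree with it, which is why the slide says to pay attention to process trees.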
16.
Isolation
Memory
Disk
CPU
Network
What we do: Nothing, right now. Hoping YARN will solve
this before we need it (cgroups?).
17.
Dashboard
Application-specific information
Integrate with YARN
Application Master or Standalone?
What we do: Dashboard enforces security, talks to RM/AM
via HTTP/JSON to get information about jobs.
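A hedged sketch of the dashboard side of that HTTP/JSON exchange. It assumes the RM's REST endpoint /ws/v1/cluster/apps and its {"apps": {"app": [...]}} response shape; the function names and the fields picked out are choices made for this example, not the dashboard described in the talk.

```python
import json
import urllib.request


def fetch_apps_json(rm_base_url):
    """Fetch the application list from the RM as raw JSON text.

    rm_base_url is a placeholder such as "http://rm-host:8088".
    """
    request = urllib.request.Request(
        rm_base_url + "/ws/v1/cluster/apps",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def summarize_apps(payload):
    """Reduce the RM response to the fields a job listing might show."""
    doc = json.loads(payload)
    # "apps" can be null when nothing has been submitted, so guard both levels.
    apps = (doc.get("apps") or {}).get("app") or []
    return [
        {"id": a.get("id"), "name": a.get("name"), "state": a.get("state")}
        for a in apps
    ]
```

Keeping the parsing separate from the fetch lets the dashboard apply its own access checks before any RM data reaches a user.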
19.
State
HDFS
Deployed with Application
Remote data store
What we do: Nothing, right now.
20.
Takeaways
There’s a lot more than just the YARN API
Look for examples (Spark, Storm, MapReduce)
Decide your level of Hadoop integration
Metrics2
HDFS
Config
Kerberos and doAs