Writing Application Frameworks
on Apache Hadoop YARN


Hitesh Shah
hitesh@hortonworks.com




© Hortonworks Inc. 2011      Page 1
Hitesh Shah - Background
• Member of Technical Staff at Hortonworks Inc.
• Committer for Apache MapReduce and Ambari
• Earlier, spent 8+ years at Yahoo! building various
  infrastructure pieces all the way from data storage
  platforms to high throughput online ad-serving
  systems.




     Architecting the Future of Big Data
Agenda

• YARN Architecture and Concepts
• Writing a New Framework




YARN Architecture
• Resource Manager
  –Global resource scheduler
  –Hierarchical queues
• Node Manager
  –Per-machine agent
  –Manages the container life cycle
  –Container resource monitoring
• Application Master
  –Per-application
  –Manages application scheduling and task execution
  –E.g. MapReduce Application Master

YARN Architecture

[Diagram: two Clients, the Resource Manager, and three Node Managers hosting Containers and App Mstrs. Arrow legend: MapReduce Status, Job Submission, Node Status, Resource Request.]




YARN Concepts
• Application ID
  –Application Attempt IDs
• Container
  –ContainerLaunchContext
• ResourceRequest
  –Host/Rack/Any match
  –Priority
  –Resource constraints
• Local Resource
  –File/Archive
  –Visibility – public/private/application
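
The Host/Rack/Any match on a ResourceRequest can be pictured as a three-level check. What follows is a plain-Java toy model, not YARN scheduler code: the class `LocalityMatch`, its `match` method, and the host and rack names used with it are hypothetical. Only the `*` wildcard for "any" is taken from the ResourceRequest example in this deck.

```java
/** Toy model of Host/Rack/Any matching; hypothetical, not YARN scheduler code. */
class LocalityMatch {
    static final String ANY = "*";

    /**
     * Returns the level at which a requested location matches a node:
     * "HOST", "RACK", "ANY", or "NONE" when the request names a
     * different host or rack.
     */
    static String match(String requested, String nodeHost, String nodeRack) {
        if (requested.equals(nodeHost)) {
            return "HOST";
        }
        if (requested.equals(nodeRack)) {
            return "RACK";
        }
        if (requested.equals(ANY)) {
            return "ANY";
        }
        return "NONE";
    }
}
```

For instance, a request for `host1` matches a node `host1` on `/rack1` at the HOST level, while a request for `*` matches every node at the ANY level.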


What you need for a new Framework
• Application Submission Client
  –For example, the MR Job Client
• Application Master
  –The core framework library
• Application History ( optional )
  –History of all previously run instances
• Auxiliary Services ( optional )
  –Long-running application-specific services running on the
   NodeManager




Use Case: Distributed Shell
• Take a user-provided script or application and run it on a
  set of nodes in the cluster

• Input:
   – User script to execute
   – Number of containers to run on
   – Variable arguments for each different container
   – Memory requirements for the shell script
   – Output location/dir

[Diagram: three Node Managers; one hosts the DS AppMaster and the other two each run a Shell Script.]


Client: RPC calls
• Uses ClientRMProtocol between the CLIENT and the RM

• Get a new Application ID from the RM
  –ClientRMProtocol#getNewApplication

• Application Submission
  –ClientRMProtocol#submitApplication

• Application Monitoring
  –ClientRMProtocol#getApplicationReport

• Kill the Application?
  –ClientRMProtocol#forceKillApplication



Client
• Registration with the RM
  –New Application ID


• Application Submission
  –User information
  –Scheduler queue
  –Define the container for the Distributed Shell App Master via
   the ContainerLaunchContext

• Application Monitoring
  – AppMaster host details with tokens if needed, tracking URL
  – Application Status (submitted/running/finished)
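
The monitoring step usually ends up as a polling loop: fetch the report, stop on a final state, otherwise wait and try again. A minimal plain-Java sketch of that loop follows. `AppMonitor`, its `AppState` enum, and the `Supplier` standing in for the getApplicationReport call are assumptions made for illustration; the real report carries more states and more detail than this.

```java
import java.util.function.Supplier;

/** Toy polling loop; AppState is a hypothetical stand-in for the report state. */
class AppMonitor {
    enum AppState { SUBMITTED, RUNNING, FINISHED, FAILED, KILLED }

    static boolean isTerminal(AppState s) {
        return s == AppState.FINISHED || s == AppState.FAILED || s == AppState.KILLED;
    }

    /**
     * Polls until a terminal state is seen or maxPolls is used up.
     * Returns the last state observed, so the caller can decide whether
     * to kill an application that is still running after the timeout.
     */
    static AppState waitForCompletion(Supplier<AppState> reportSource,
                                      int maxPolls, long sleepMillis) {
        AppState last = AppState.SUBMITTED;
        for (int i = 0; i < maxPolls; i++) {
            last = reportSource.get();
            if (isTerminal(last)) {
                return last;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop polling early.
                Thread.currentThread().interrupt();
                return last;
            }
        }
        return last;
    }
}
```

A client that gets back a non-terminal state from `waitForCompletion` has hit its own timeout, which is the point where the "Kill the Application?" call would come in.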


Defining a Container
• ContainerLaunchContext class
  –Can run a shell script, a java process or launch a VM


• Command(s) to run
• Local resources needed for the process to run
  –Dependent jars, native libs, data files/archives
• Environment to set up
  –Java Classpath
• Security-related data
  –Container Tokens
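
The pieces listed above (commands, local resources, environment) can be assembled as in the plain-Java sketch below. It is a toy holder, not the ContainerLaunchContext API: `LaunchSpec`, `forJava`, the jar name, and the classpath layout are hypothetical. The sketch also assumes that `${JAVA_HOME}` is left unexpanded so that the shell launching the container on the node can expand it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy holder for what a container launch needs; not the YARN API. */
class LaunchSpec {
    final List<String> commands = new ArrayList<>();
    final Map<String, String> environment = new LinkedHashMap<>();
    final List<String> localResources = new ArrayList<>();

    /** Builds a spec that starts a Java main class with the given arguments. */
    static LaunchSpec forJava(String mainClass, List<String> args, List<String> jars) {
        LaunchSpec spec = new LaunchSpec();
        // Local resources: files that must be present before the process starts.
        spec.localResources.addAll(jars);
        // Environment: a classpath made of the working directory plus each jar.
        StringBuilder cp = new StringBuilder(".");
        for (String jar : jars) {
            cp.append(':').append(jar);
        }
        spec.environment.put("CLASSPATH", cp.toString());
        // Command: one shell line; ${JAVA_HOME} is kept literal on purpose.
        StringBuilder cmd = new StringBuilder("${JAVA_HOME}/bin/java ").append(mainClass);
        for (String arg : args) {
            cmd.append(' ').append(arg);
        }
        spec.commands.add(cmd.toString());
        return spec;
    }
}
```

Security-related data such as container tokens is left out of the sketch, since its shape depends on the cluster's security setup.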



Application Master: RPC calls
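• AMRM and CM protocols

• Register AM with RM
  –AMRM.registerAM

• Ask RM to allocate resources
  –AMRM.allocate

• Launch tasks on allocated containers
  –CM.startContainer

• Manage tasks to final completion

• Inform RM of completion
  –AMRM.finishAM

[Diagram: the AM sits between the Client, the RM and two NMs; the AMRM calls go from the AM to the RM, CM.startContainer goes from the AM to the NMs, and an app-specific RPC link is also shown.]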



Application Master
• Set up RPC to handle requests from the Client and/or tasks launched
  on Containers

• Register and send regular heartbeats to the RM

• Request resources from the RM.

• Launch user shell script on containers as and when allocated.

• Monitor the status of the user script on remote containers and manage
  failures by retrying if needed.

• Inform RM of completion when application is done.
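
"Manage failures by retrying if needed" comes down to bookkeeping inside the AM. The plain-Java sketch below shows one possible policy, a fixed number of attempts per task. `RetryTracker`, the task ids, and the convention that exit status 0 means success are assumptions for illustration, not something the deck or YARN prescribes.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy retry bookkeeping for an AM; task ids and limits are hypothetical. */
class RetryTracker {
    private final int maxAttempts;
    private final Map<String, Integer> attempts = new HashMap<>();
    private int succeeded;
    private int failedForGood;

    RetryTracker(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /**
     * Records one completed container for a task.
     * Returns true when the task should be requested and launched again.
     */
    boolean onContainerCompleted(String taskId, int exitStatus) {
        int used = attempts.merge(taskId, 1, Integer::sum);
        if (exitStatus == 0) {
            succeeded++;
            return false;
        }
        if (used < maxAttempts) {
            return true;
        }
        failedForGood++;
        return false;
    }

    int succeeded() {
        return succeeded;
    }

    int failedForGood() {
        return failedForGood;
    }
}
```

When every task is counted in either `succeeded()` or `failedForGood()`, the AM has what it needs to report a final status to the RM.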


AMRM#allocate
• Request:
  – Containers needed
      – Not a delta protocol
      – Locality constraints: Host/Rack/Any
      – Resource constraints: memory
      – Priority-based assignments

  – Containers to release – extra/unwanted?
      – Only non-launched containers

• Response:
  – Allocated Containers
      – Launch or release

  – Completed Containers
      – Status of completion
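
"Not a delta protocol" means each allocate call states the total number of containers still wanted for a given priority and location, not the change since the last call. The plain-Java sketch below models that bookkeeping on the AM side. `AskTable` and its string-based keys are hypothetical and much simpler than the real ResourceRequest records.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a non-delta ask table; not the YARN API. */
class AskTable {
    // key: "priority/location" -> total containers still wanted
    private final Map<String, Integer> outstanding = new LinkedHashMap<>();

    private static String key(int priority, String location) {
        return priority + "/" + location;
    }

    /** The AM wants 'count' more containers at this priority and location. */
    void addRequest(int priority, String location, int count) {
        outstanding.merge(key(priority, location), count, Integer::sum);
    }

    /** A container arrived for this entry, so one fewer is outstanding. */
    void containerAllocated(int priority, String location) {
        outstanding.computeIfPresent(key(priority, location),
                (k, v) -> Math.max(0, v - 1));
    }

    /** Content of the next allocate call: absolute totals, not changes. */
    List<String> buildAsks() {
        List<String> asks = new ArrayList<>();
        for (Map.Entry<String, Integer> e : outstanding.entrySet()) {
            asks.add(e.getKey() + "=" + e.getValue());
        }
        return asks;
    }
}
```

After asking for three containers and receiving one, the next ask in this model reads `0/*=2` (two still wanted), not `-1`.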

YARN Applications
• Data Processing:
  – OpenMPI on Hadoop
  – Spark (UC Berkeley)
       – Shark ( Hive-on-Spark )

  – Real-time data processing
       – Storm ( Twitter )
       – Apache S4

  – Graph processing – Apache Giraph
• Beyond data:
  – Deploying Apache HBase via YARN (HBASE-4329)
  – HBase Co-processors via YARN (HBASE-4047)




References

• Doc on writing new applications:
  –WritingYarnApplications.html ( available at
   http://hadoop.apache.org/common/docs/r2.0.0-alpha/ )




Questions?


Thank You!
Hitesh Shah
hitesh@hortonworks.com




Appendix: Code Examples



Client: Registration
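// Assumed to exist already: rpc (a YarnRPC), conf and
// appsManagerServerConf (Hadoop Configuration objects)
ClientRMProtocol applicationsManager;
YarnConfiguration yarnConf = new YarnConfiguration(conf);
InetSocketAddress rmAddress = NetUtils.createSocketAddr(
  yarnConf.get(YarnConfiguration.RM_ADDRESS,
               YarnConfiguration.DEFAULT_RM_ADDRESS));

// Open a proxy to the RM's client-facing service
applicationsManager = ((ClientRMProtocol)
  rpc.getProxy(ClientRMProtocol.class,
               rmAddress, appsManagerServerConf));

// Ask the RM for a new application ID
GetNewApplicationRequest request =
  Records.newRecord(GetNewApplicationRequest.class);
GetNewApplicationResponse response =
  applicationsManager.getNewApplication(request);
ApplicationId appId = response.getApplicationId();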




Client: App Submission
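// appId is the ID returned by getNewApplication
ApplicationSubmissionContext appContext =
  Records.newRecord(ApplicationSubmissionContext.class);
appContext.setApplicationId(appId);

// Describe the container that will run the Application Master
ContainerLaunchContext amContainer =
  Records.newRecord(ContainerLaunchContext.class);
amContainer.setLocalResources(localResources); // Map<String, LocalResource>
amContainer.setEnvironment(env);               // Map<String, String>
String command = "${JAVA_HOME}" + "/bin/java" + " MyAppMaster" + " arg1 arg2";
List<String> commands = new ArrayList<String>();
commands.add(command);
amContainer.setCommands(commands);
Resource capability = Records.newRecord(Resource.class);
capability.setMemory(amMemory);
amContainer.setResource(capability);

appContext.setAMContainerSpec(amContainer);

SubmitApplicationRequest appRequest =
  Records.newRecord(SubmitApplicationRequest.class);
appRequest.setApplicationSubmissionContext(appContext);

applicationsManager.submitApplication(appRequest);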


Client: App Monitoring
• Get Application Status

GetApplicationReportRequest reportRequest =
  Records.newRecord(GetApplicationReportRequest.class);
reportRequest.setApplicationId(appId);
GetApplicationReportResponse reportResponse =
  applicationsManager.getApplicationReport(reportRequest);
ApplicationReport report = reportResponse.getApplicationReport();


• Kill the application

KillApplicationRequest killRequest =
  Records.newRecord(KillApplicationRequest.class);
killRequest.setApplicationId(appId);
applicationsManager.forceKillApplication(killRequest);

AM: Ask RM for Containers
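// One request entry: where, at what priority, how big, how many
ResourceRequest rsrcRequest = Records.newRecord(ResourceRequest.class);
rsrcRequest.setHostName("*"); // hostname, rack, or "*" for any
rsrcRequest.setPriority(pri);
Resource capability = Records.newRecord(Resource.class);
capability.setMemory(containerMemory);
rsrcRequest.setCapability(capability);
rsrcRequest.setNumContainers(numContainers);

List<ResourceRequest> requestedContainers = new ArrayList<ResourceRequest>();
requestedContainers.add(rsrcRequest);
List<ContainerId> releasedContainers = new ArrayList<ContainerId>();

// The allocate call also serves as the heartbeat to the RM
AllocateRequest req = Records.newRecord(AllocateRequest.class);
req.setResponseId(rmRequestID);
req.addAllAsks(requestedContainers);
req.addAllReleases(releasedContainers);
req.setProgress(currentProgress);
AllocateResponse allocateResponse = resourceManager.allocate(req);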



AM: Launch Containers
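AMResponse amResp = allocateResponse.getAMResponse();

List<Container> allocatedContainers = amResp.getAllocatedContainers();
for (Container allocatedContainer : allocatedContainers) {
  // Connect to the NodeManager that hosts this container
  InetSocketAddress cmAddress = NetUtils.createSocketAddr(
    allocatedContainer.getNodeId().getHost() + ":"
    + allocatedContainer.getNodeId().getPort());
  ContainerManager cm = (ContainerManager) rpc.getProxy(
    ContainerManager.class, cmAddress, conf);

  ContainerLaunchContext ctx =
    Records.newRecord(ContainerLaunchContext.class);
  ctx.setContainerId(allocatedContainer.getId());
  ctx.setResource(allocatedContainer.getResource());
  // set env, command, local resources, ...

  StartContainerRequest startReq =
    Records.newRecord(StartContainerRequest.class);
  startReq.setContainerLaunchContext(ctx);
  cm.startContainer(startReq);
}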

AM: Monitoring Containers
• Running Containers
GetContainerStatusRequest statusReq =
  Records.newRecord(GetContainerStatusRequest.class);
statusReq.setContainerId(containerId);
GetContainerStatusResponse statusResp =
  cm.getContainerStatus(statusReq);


• Completed Containers
AMResponse amResp = allocateResponse.getAMResponse();
List<ContainerStatus> completedContainers =
  amResp.getCompletedContainersStatuses();
for (ContainerStatus containerStatus : completedContainers) {
    // containerStatus.getContainerId()
    // containerStatus.getExitStatus()
    // containerStatus.getDiagnostics()
}



AM: I am done
FinishApplicationMasterRequest finishReq =
  Records.newRecord(FinishApplicationMasterRequest.class);
finishReq.setAppAttemptId(appAttemptID);

finishReq.setFinishApplicationStatus(
  FinalApplicationStatus.SUCCEEDED); // or FAILED

finishReq.setDiagnostics(diagnostics);

resourceManager.finishApplicationMaster(finishReq);





Writing Yarn Applications Hadoop Summit 2012
