Storm-on-YARN: Convergence of Low-Latency and Big-Data



Hadoop plays a central role for Yahoo! in providing personalized experiences for our users and creating value for our advertisers. In this talk, we will discuss the convergence of low-latency processing and the Hadoop platform. To enable the convergence, we have developed Storm-on-YARN, which allows Storm streaming/micro-batch applications and Hadoop batch applications to be hosted in a single cluster. Storm applications can leverage YARN for resource management, and apply Hadoop-style security to Hadoop datasets on HDFS and HBase. In Storm-on-YARN, YARN is used to launch the Storm application master (Nimbus) and to enable Nimbus to request resources for Storm workers (supervisors). The YARN resource manager and the Storm scheduler work together to support multi-tenancy and high availability. HDFS enables Storm to achieve higher availability of Nimbus itself. We are introducing Hadoop-style security into Storm through JAAS authentication (Kerberos and digest). Storm servers (Nimbus and DRPC) can be configured with authorization plugins for access control and auditing. The security context enables Storm applications to access only authorized datasets, including those created by Hadoop applications. Yahoo! is making our contributions to Storm and YARN available as open source. We will work with industry partners to foster the convergence of low-latency processing and big data.

  • This talk shares our work at Yahoo! to bring Hadoop and Storm together.
  • I am here representing work by the Yahoo! Hadoop team. We have worked hard to reduce the latency of big-data processing. We have been very active in the Storm community, and have made significant contributions there. In the past, I have worked on real-time serving systems: online ad serving and personalized web services. Those experiences have given me extra motivation to bring low latency into big data.
  • This talk consists of three sections. We will explain the business motivation behind Storm-on-YARN. Then we will take a look at a technical overview of Storm-on-YARN. Finally, Yahoo! has made Storm-on-YARN available as open source, and we will talk about that as well.
  • Let’s talk about business. Recently, we re-launched the Yahoo! homepage with an emphasis on personalization. One of the main changes is the story stream below the Today module. This is a stream of stories that a user is likely interested in reading. These articles are identified based on the user’s interests. In this screenshot, I got several interesting articles about Silicon Valley companies, including HP and Yahoo!. This article about Yahoo! certainly got my attention.
  • To deliver a personalized user experience, Yahoo! homepage servers select content and ads that are relevant to user interests. User interests and related data are stored in our online stores. Traditionally, user interests were computed exclusively through batch processing of huge amounts of data about users, content, and ads. User interests are reflected in users’ own activities: they search different keywords and visit different web sites. For 700 million users, Yahoo! processes over 100 petabytes of data. At Yahoo!, Hadoop continues to be our primary technology for batch processing. Hadoop is an excellent solution for batch processing, but it is not a good solution when latency is critical; our MapReduce-based solution took x minutes.
  • The latency could become a barrier to providing the ideal experience for our users, for two reasons. First, user interests can change rapidly. We want a low-latency processing mechanism that adjusts users’ interest graphs based on their very recent activities. Second, the low-latency requirement also applies to content. Once a piece of content is provided to Yahoo!, we want to make it available to users very quickly; we want to reduce that to a few seconds. Therefore, we expanded our offline tier to include low-latency processing components. These new components enable Yahoo! to update user interest profiles in near-real-time fashion, and to bring content online very quickly. At Yahoo!, we have benefited from the convergence of batch processing and low-latency processing. The Yahoo! homepage has seen user engagement improve by x%.
  • Personalization is one success story at Yahoo!. The design pattern of combining big data with low latency can be applied in many other use cases. At Yahoo!, we were positively surprised by the number of use cases that adopt this design pattern. For example, an online advertising system needs to control budget spending. Batch processing can help define the pacing of ad campaigns, and low-latency processing ensures that we don’t overspend the daily budget. Such an application has an impact of over $xx.
  • Next, let’s have a technical overview of our proposed solution: Storm-on-YARN.
  • Storm-on-YARN is based on two technologies. The first is YARN, also known as Hadoop 2.0. Compared with Hadoop 1.0, YARN is designed to support applications beyond MapReduce. A variety of applications can coexist in a single cluster: MapReduce applications for batch processing, and other applications for low latency. YARN is a cornerstone of the future of big-data processing frameworks. As the birthplace of YARN, Yahoo! is committed to making YARN widely adopted. In addition to our contributions in design and development, Yahoo! deployed YARN into our Hadoop clusters in Q1 2013. Bobby Evans will share our experience with YARN deployment in his talk titled “Running YARN at scale” at 2pm tomorrow.
  • Storm is an emerging technology for distributed low-latency processing. In Storm, applications are modeled as a DAG of daemon processes. There are two types of processing units: spouts and bolts. Spouts emit events, and bolts process events. Bolts can also produce new events. All these processes are long-lived, and events are processed as they arrive, without delay. Therefore, you can achieve end-to-end latency within a few seconds. As in Hadoop, Storm has built-in support for parallel computation (you can easily state the number of processes required for your spouts/bolts) and failover (the failure of any executor has no impact on applications). Storm is designed to process any type of stream. At Yahoo!, we are processing the various streams listed here.
  • This diagram illustrates how Storm applications are managed on the Hadoop grid. Storm application owners launch their topologies at the grid gateway. From there, Storm applications are submitted to the Storm Nimbus server. The Nimbus server is similar to the JobTracker in MapReduce: it is the control center of a Storm cluster. Nimbus then decides which processes should handle the submitted Storm topology, and makes its decision known to the cluster via ZooKeeper. Once informed by ZooKeeper, each designated supervisor installs the submitted topology and starts its execution. Since Storm is integrated with Hadoop, Storm applications can access Hadoop resources easily; for example, they can access data from HBase. Furthermore, we enable Storm clusters and applications to be monitored by the grid infrastructure.
  • Setting up a Storm cluster is not so simple when it needs to cooperate with the Hadoop ecosystem. Storm-YARN aims to make launching Storm clusters very simple. We provide a command-line tool, storm-yarn. To launch a Storm cluster, you give it the “launch” command and a configuration file. The configuration states the size of your Storm cluster and the resource allocation for YARN containers. With such a command, our launcher client asks YARN to allocate a container to host a Storm application master. The Storm master then launches Storm Nimbus and the UI server as separate processes in that YARN container. The Storm application master asks the YARN resource manager for additional containers to launch Storm supervisors.
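  The launch flow just described boils down to one command; the file path and the configuration key names sketched in the comments are illustrative, not the tool's required names:

```shell
# storm.yaml states the cluster size and container resources, e.g.
# (illustrative key names):
#   master.initial-num-supervisors: 2
#   master.container.size-mb: 1024
storm-yarn launch ./conf/storm.yaml
# storm-yarn prints the <appID> of the newly launched Storm master
```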
  • Once the Storm cluster is launched, you can manage it by specifying the application ID associated with your Storm master. The Storm master has a built-in Thrift server, which accepts a set of admin requests. The admin requests let you add more supervisors, and start and stop various daemon processes.
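  The admin requests listed on the slide are issued through the same storm-yarn command; the supervisor count here is just an example:

```shell
# Add two more supervisors to the cluster identified by <appID>
storm-yarn addSupervisors <appID> 2
# Bounce the Nimbus daemon
storm-yarn stopNimbus <appID>
storm-yarn startNimbus <appID>
```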
  • Once the Nimbus server is up and running, you can submit your Storm topologies via the standard storm command. With “storm jar”, you provide the JAR file of your Storm topology, and the topology is submitted to Nimbus. Nimbus assigns topologies to supervisors and announces the assignments via ZooKeeper. Supervisors then download the topology from Nimbus and launch spouts and bolts in worker processes.
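  A minimal submission sequence, using the getStormConfig command shown on the slides; the topology JAR and class names are hypothetical:

```shell
# Point the storm client at the YARN-hosted cluster
storm-yarn getStormConfig <appID> --output ./storm.yaml
# Submit the topology; Nimbus assigns it to supervisors via ZooKeeper
storm jar my-topology.jar com.example.MyTopology
```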
  • In addition to simplifying the deployment process, we also want to make sure Storm clusters are secure. Toward that end, Yahoo! has led an effort to introduce authentication, authorization, and auditing into Storm. With the latest Storm release, you can write plugins for authentication and authorization. We have included digest authentication in the current release, and will release Kerberos authentication in the near future. We have released some simple plugins for authorization and access logging. The new authentication/authorization/audit mechanisms help us secure access to the Nimbus and DRPC servers, ensuring that only authorized users are allowed to perform certain actions in a Storm cluster.
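  As a sketch of how such plugins are wired in, a storm.yaml fragment along these lines selects digest authentication and a whitelist authorizer; treat the exact property and class names as assumptions to verify against your Storm release:

```yaml
# Illustrative security settings; verify names against your Storm version
storm.thrift.transport: "backtype.storm.security.auth.digest.DigestSaslTransportPlugin"
nimbus.authorizer: "backtype.storm.security.auth.authorizer.SimpleWhitelistAuthorizer"
```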
  • Now, it’s time for you to play with Storm-YARN.
  • Recently, Yahoo! made Storm-YARN available as open source for early access. The code is available on GitHub under the Apache 2.0 license. We plan to move it to Apache later. We would love to see contributions from the community. Please check it out and give us your feedback.
  • We encourage you to try out Storm-YARN. You can just check out the code and run “mvn test”. “mvn test” performs a few key steps: 1. launch a Storm cluster; 2. retrieve the storm.yaml config; 3. submit a Storm topology; 4. stop the topology; 5. shut down the Storm cluster.
  • To run Storm-YARN in a true YARN cluster, you take four steps. First, install the Storm software into HDFS; by putting it in HDFS, the Storm software is accessible to all nodes in the Hadoop cluster. Then, invoke the storm-yarn command to launch a Storm cluster of the desired size. Next, use storm-yarn to obtain the Storm configuration file associated with the newly launched cluster. Finally, submit your Storm topology by specifying your JAR file and the storm.yaml file.
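  The deployment steps above can be sketched end to end; all paths, version strings, and topology names below are illustrative:

```shell
# 1. Install the Storm software into HDFS so every node can fetch it
hadoop fs -put storm.zip /lib/storm/<version>/storm.zip
# 2. Launch a Storm cluster of the desired size
storm-yarn launch ./conf/storm.yaml
# 3. Obtain the storm.yaml associated with the new cluster
storm-yarn getStormConfig <appID> --output ./storm.yaml
# 4. Submit your topology with the JAR and the storm.yaml
storm jar my-topology.jar com.example.MyTopology
```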
  • In summary, Yahoo! believes in the importance of the convergence of big data and low-latency processing. The convergence enables Yahoo! to solve various business problems more efficiently. We are releasing two YARN applications to assist this convergence. In this talk, we covered Storm-YARN. We have also developed a Spark-YARN plugin to enable Spark apps to be executed in a YARN cluster.

    1. Storm-on-YARN: Convergence of Low-Latency and Big-Data Andrew Feng
    2. Self Introduction • Current – Distinguished Architect, Yahoo! Hadoop Team – Core contributor at Storm project • Past – Online advertisement – Personalization – Serving containers – Cloud services – NoSQL database – Application server
    3. Agenda • Business motivation • Technical overview • Open source
    4. Yahoo!: Personalized Web
    5. Personalization w/ Hadoop • Understand user & content/ads • Select relevant content & ads
    6. Personalization w/ Low-Latency • Latest content per current interests
    7. Big Data + Low Latency: Design Pattern • Personalization • Ad targeting • Reporting • Ad budgeting • Fraud detection • Trending topics
    8. Agenda • Business motivation • Technical overview • Open source
    9. Hadoop YARN: MapReduce & Beyond • Yahoo! deployed YARN into 30k+ nodes in production. • YARN Apps … MapReduce, Storm, etc.
    10. Storm: Distributed Stream Processing • Streams: – User activities – Ad beacons – Content feeds – Social feeds – …
    11. Storm Clusters on Hadoop Grid
    12. Storm-YARN: Launch Cluster • storm-yarn launch <conf> – Initial # of supervisors – Memory size of allocated container • Result: <appID> of the newly launched Storm master
    13. Storm-YARN: Manage Cluster 1. addSupervisors <appID> <count> 2. getStormConfig <appID> 3. setStormConfig <appID> 4. startNimbus <appID> 5. stopNimbus <appID> 6. startUI <appID> 7. stopUI <appID> 8. startSupervisors <appID> 9. stopSupervisors <appID>
    14. Storm-YARN: Deploy Apps • storm jar <appJar>
    15. Authentication/Authorization/Audit • Authentication plugins – Digest – Kerberos (soon) – None – Bring your own • Authorization plugins – Accept all – Limited operations only – User whitelist – Bring your own • Audit – Access log
    16. Agenda • Business motivation • Technical overview • Open source
    17. Storm-YARN: Open Source • Code released for early access – under the Apache 2.0 License – move to Apache later • Welcome contributions! – Submit proposals – Sign Apache-style CLA – Submit git pull requests
    18. Storm-YARN: mvn test 1. storm-yarn launch ./conf/storm.yaml --stormZip lib/ --appname storm-on-yarn-test --output target/appId.txt 2. storm-yarn getStormConfig ./conf/storm.yaml --appId application_1372121842369_0001 --output ./lib/storm/storm.yaml 3. storm jar lib/storm-starter-0.0.1-SNAPSHOT.jar storm.starter.WordCountTopology word-count-topology 4. storm kill word-count-topology 5. storm-yarn shutdown ./conf/storm.yaml --appId application_1372121842369_0001
    19. Storm-YARN: Deployment • Install Storm S/W: 1. hadoop fs -put /lib/storm/<version>/stor • Apply Storm-YARN: 2. storm-yarn launch -> <appID> 3. storm-yarn getStormConfig <appID> -> <storm.yaml> 4. storm jar <appJar>
    20. Conclusion • YARN empowers the convergence of big-data & low-latency processing • Yahoo! open source: – Storm-yarn @ github/yahoo – Spark-yarn @ spark-
    21. Questions?