Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Yahoo compares Storm and Spark


Published on

Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo will talk about how these technologies are used on Yahoo's grids and reasons why to use one or the other.

Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work).

Tom Graves a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.

Published in: Technology
  • Nice !! Download 100 % Free Ebooks, PPts, Study Notes, Novels, etc @
    Are you sure you want to  Yes  No
    Your message goes here
  • My dear, How are you today? i will like to be your friend My name is Sheikha Ghunaim , am a 43 years old divorcee. Please write to me in my email ( ). im honest and open mind single woman. im going to tell more when i see your response. Regards Sheikha.
    Are you sure you want to  Yes  No
    Your message goes here

Yahoo compares Storm and Spark

  1. 1. Spark and Storm at Yahoo Wh y c h o o s e o n e o v e r t h e o t h e r ? P R E S E N T E D B Y B o b b y E v a n s a n d T o m G r a v e s
  2. 2. Tom Graves Bobby Evans ( 2  Committers and PMC/PPMC Members for › Apache Storm incubating (Bobby) › Apache Hadoop (Tom and Bobby) › Apache Spark (Tom and Bobby) › Apache TEZ (Tom and Bobby)  Low Latency Big Data team at Yahoo (Part of the Hadoop Team) › Apache Storm as a service • 1,300+ nodes total, 250 node cluster (soon to be 4000 nodes). › Apache Spark on YARN • 40,000 nodes total, 5000+ node cluster › Help with distributed ML and deep learning.
  3. 3. Where we come from Yahoo Champaign: • 100+ engineers • Located in UIUC Research Park • Split between Advertising and Data Platform team and Hadoop team. • Hadoop team provides the Hadoop ecosystem as a service to all of Yahoo. • Site is 7 years old, and we are building a new building with room for 200. • We are Hiring • •
  4. 4. Agenda Spark Overview (1.1) Storm Overview (0.9.2) Things to Consider Example Architectures 4 Yahoo Confidential & Proprietary
  5. 5. Apache Spark 5
  6. 6. Spark Key Concepts Write programs in terms of transformations on distributed Resilient Distributed Datasets  Collections of objects spread across a cluster, stored in RAM or on Disk  Built through parallel transformations  Automatically rebuilt on failure Operations  Transformations (e.g. map, filter, groupBy)  Actions (e.g. count, collect, save) datasets
  7. 7. Working With RDDs RDD RDD RDD RDD Transformations textFile = sc.textFile(”SomeFile.txt”) Action Value linesWithSpark = textFile.filter(lambda line: "Spark” in line) linesWithSpark.count() 74 linesWithSpark.first() # Apache Spark
  8. 8. Example: Word Count > lines = sc.textFile(“hamlet.txt”) > counts = lines.flatMap(lambda line: line.split(“ ”)) .map(lambda word => (word, 1)) .reduceByKey(lambda x, y: x + y) “to be or” “not to be” “to” “be” “or” “not” “to” “be” (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1) (be, 2) (not, 1) (or, 1) (to, 2)
  9. 9. Spark Streaming Word Count updateFunc = (values: Seq[Int], state: Option[Int]) => { val currentCount = values.foldLeft(0)(_ + _) val previousCount = state.getOrElse(0) Some(currentCount + previousCount) } … lines = ssc.socketTextStream(args(0), args(1).toInt) Words = lines.flatMap(lambda line: line.split(“ ”)) wordDstream = word => (word, 1)) stateDstream = wordDstream.updateStateByKey[Int](updateFunc) ssc.start() ssc.awaitTermination()
  10. 10. 10 Apache Storm
  11. 11. Storm Concepts 1. Streams › Unbounded sequence of tuples 2. Spout › Source of Stream › E.g. Read from Twitter streaming API 3. Bolts › Processes input streams and produces new streams › E.g. Functions, Filters, Aggregation, Joins 4. Topologies › Network of spouts and bolts
  12. 12. Storm Architecture Master Node Cluster Coordination Worker Worker Worker Worker Processes Nimbus Zookeeper Zookeeper Zookeeper Supervisor Supervisor Supervisor Supervisor Worker Launches Workers
  13. 13. Trident (Storm) Word Count TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")).parallelismHint(6); “to be or” “to” “be” “or” (to, 1) (be, 1) (or, 1) 1) 1) “not to be” “not” “to” “be” (not, 1) (to, 1) (be, 1) (be, 2) (not, 1) (or, 1) (to, 2)
  14. 14. Use the Right Tool for the Job 14
  15. 15. Things to Consider 15 Scale Latency  Iterative Processing › Are there suitable non-iterative alternatives? Use What You Know Code Reuse Maturity
  16. 16. When We Recommend Spark 16  Iterative Batch Processing (most Machine Learning) › There really is nothing else right now. › Has some scale issues.  Tried ETL (Not at Yahoo scale yet)  Tried Shark/Interactive Queries (Not at Yahoo scale yet)  < 1 TB (or memory size of your cluster)  Tuning it to run well can be a pain  Data Bricks and others are working on scaling.  Streaming is all μ-batch so latency is at least 1 sec  Streaming has single points of failure still  All streaming inputs are replicated in memory
  17. 17. When We Recommend Storm 17  Latency < 1 second (single event at a time) › There is little else (especially not open source)  “Real Time” … › Analytics › Budgeting › ML › Anything  Lower Level API than Spark  No built-in concept of look back aggregations  Takes more effort to combine batch with streaming
  18. 18. Fictitious Example: My Commute App 18  Mobile App that lets users track their commute.  Cities, users, companies, etc. compete daily for › Shortest commute time › Greenest commute  Make money by selling location based ads and aggregate data to › Governments › Advertisers  Feel free to steal my crazy idea, I just want to be invited to the launch party, and I wouldn't say no to some stock.
  19. 19. Chicago vs. Champaign Urbana 19 Champaign Urbana: 14-15 min Chicago: 20-30 min 35 30 25 20 15 10 5 0 Bobby CU Chicago Source:
  20. 20. Things to Consider 20 Scale › everyone in the world!!! Latency › a few seconds max  Iterative Processing › Possibly for targeting, but there are alternatives
  21. 21. Architecture App Web Service (User, Commute ID, Location History, MPG) Kafka Storm HBase/NoSQ L HDFS Spark Customer 21
  22. 22. Architecture (Alternative) App Web Service (User, Commute ID, Location History, MPG) HBase/NOS QL HDFS Spark Customer 22 Go directly to Spark Streaming, but data loss potential goes up.
  23. 23. Architecture (Alternative 2) App Web Service (User, Commute ID, Location History, MPG) Kafka Storm HBase/NOS QL Customer 23 Streaming Operations Only (Kappa Architecture)
  24. 24. Fictitious Example 2: Web Scale Monitoring 24  Look for trends that can indicate a problem. › Alert or provide automated corrections  Provide an interface to visualize › Current data very quickly › Historical data in depth  If you commercialize this one please give me/Yahoo a free license for life (open source works too)
  25. 25. Things to Consider 25 Scale › Lots of events from many different servers Latency › a few seconds max, but the fewer the better  Iterative Processing › For in depth analysis definetly
  26. 26. Fictitious Example 2: Web Scale Monitoring 26 Servers HBase Kafka Storm HDFS Spark UI Alert!! JDBC Server Rules ML and trend analysis
  27. 27. Questions?