Yahoo compares Storm and Spark

Spark and Storm at Yahoo
Why choose one over the other?
Presented by Bobby Evans and Tom Graves

Tom Graves
Bobby Evans (bobby@apache.org)
Committers and PMC/PPMC Members for
› Apache Storm (incubating) (Bobby)
› Apache Hadoop (Tom and Bobby)
› Apache Spark (Tom and Bobby)
› Apache Tez (Tom and Bobby)
Low Latency Big Data team at Yahoo (Part of the Hadoop Team)
› Apache Storm as a service
• 1,300+ nodes total, 250 node cluster (soon to be 4000 nodes).
› Apache Spark on YARN
• 40,000 nodes total, 5000+ node cluster
› Help with distributed ML and deep learning.

Where we come from
Yahoo Champaign:
• 100+ engineers
• Located in UIUC Research Park http://researchpark.illinois.edu/
• Split between Advertising and Data Platform team and Hadoop team.
• Hadoop team provides the Hadoop ecosystem as a service to all of Yahoo.
• Site is 7 years old, and we are building a new building with room for 200.
• We are Hiring
• resume-hadoop@yahoo-inc.com
• http://bit.ly/1ybTXMe

Agenda
Spark Overview (1.1)
Storm Overview (0.9.2)
Things to Consider
Example Architectures

Spark Key Concepts
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets
› Collections of objects spread across a cluster, stored in RAM or on Disk
› Built through parallel transformations
› Automatically rebuilt on failure
Operations
› Transformations (e.g. map, filter, groupBy)
› Actions (e.g. count, collect, save)
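
The split between transformations and actions can be illustrated without a cluster. The sketch below is a toy model in plain Python, not Spark's API: `ToyRDD`, `load_lines`, and the sample lines are made up for illustration. It assumes only what the slide states: transformations build a new dataset from an old one, actions return a value, and a dataset can be rebuilt by replaying how it was derived.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records its lineage, computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source      # callable that re-reads the base data
        self._lineage = lineage    # tuple of (kind, function) steps

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replay the lineage from the source. Rebuilding lost data after a
        # failure can follow the same path.
        data = iter(self._source())
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation and return a value.
    def count(self):
        return sum(1 for _ in self._compute())

    def collect(self):
        return list(self._compute())


reads = []

def load_lines():
    reads.append(1)  # track how often the source is actually read
    return ["# Apache Spark", "Storm is not mentioned here", "Spark on YARN"]

lines = ToyRDD(load_lines)
with_spark = lines.filter(lambda line: "Spark" in line).map(str.upper)
reads_before_action = len(reads)   # nothing has been read yet
result = with_spark.collect()      # the action triggers the work
```

In this toy model every action replays the full lineage, which is why keeping a dataset in RAM matters for iterative jobs that reuse it.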

Working With RDDs
[Diagram: a chain of RDDs linked by transformations, ending in an action that returns a value]
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()
74
linesWithSpark.first()
# Apache Spark

Example: Word Count
> lines = sc.textFile("hamlet.txt")
> counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))
Input lines: "to be or", "not to be"
After flatMap: "to", "be", "or", "not", "to", "be"
After map: (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
After reduceByKey: (be, 2), (not, 1), (or, 1), (to, 2)
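
The same three steps can be traced in plain Python without a Spark installation. This is a sketch of the data flow only, using the two sample lines from the slide; `reduce_by_key` is a hypothetical helper written for this example, not a Spark function.

```python
from itertools import chain

lines = ["to be or", "not to be"]

# flatMap: split every line and flatten the results into one sequence.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair each word with an initial count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values of identical keys with the given function.
def reduce_by_key(pairs, combine):
    totals = {}
    for key, value in pairs:
        totals[key] = combine(totals[key], value) if key in totals else value
    return totals

counts = reduce_by_key(pairs, lambda x, y: x + y)
print(sorted(counts.items()))
# prints [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```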

Spark Streaming Word Count
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.foldLeft(0)(_ + _)
  val previousCount = state.getOrElse(0)
  Some(currentCount + previousCount)
}
…
val lines = ssc.socketTextStream(args(0), args(1).toInt)
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(word => (word, 1))
val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
ssc.start()
ssc.awaitTermination()
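
To see what the update function does across batches, here is a plain-Python sketch of the same idea. It is a simulation, not Spark Streaming code: `update_state_by_key` is a hypothetical helper, and feeding the two sample sentences as two separate micro-batches is an assumption made for illustration.

```python
def update_func(values, state):
    """Python rendering of the slide's updateFunc: new running total for one key."""
    current_count = sum(values)
    previous_count = state if state is not None else 0
    return current_count + previous_count


def update_state_by_key(state, batch_pairs, update):
    """Apply `update` per key for one micro-batch and return the new state."""
    grouped = {}
    for key, value in batch_pairs:
        grouped.setdefault(key, []).append(value)
    new_state = {}
    # Keys with no new values in this batch are updated with an empty list.
    for key in set(state) | set(grouped):
        new_state[key] = update(grouped.get(key, []), state.get(key))
    return new_state


# Two micro-batches standing in for two consecutive batch intervals.
batches = [["to be or"], ["not to be"]]
state = {}
history = []
for batch in batches:
    pairs = [(word, 1) for line in batch for word in line.split(" ")]
    state = update_state_by_key(state, pairs, update_func)
    history.append(dict(state))
```

After the first batch every word has a count of 1; after the second, "to" and "be" have been carried forward to 2.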

Storm Concepts
1. Streams
› Unbounded sequence of tuples
2. Spouts
› Source of streams
› E.g. Read from Twitter streaming API
3. Bolts
› Process input streams and produce new streams
› E.g. Functions, Filters, Aggregation, Joins
4. Topologies
› Network of spouts and bolts
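
The four concepts can be mimicked with Python generators. This is a toy model, not Storm's API: the function names are hypothetical, and the spout is finite so the example terminates, whereas a real stream is unbounded.

```python
def sentence_spout():
    """Spout: the source of the stream (finite here so the example ends)."""
    yield from ["to be or", "not to be"]


def split_bolt(sentences):
    """Bolt: consumes a stream of sentences and emits a stream of words."""
    for sentence in sentences:
        for word in sentence.split(" "):
            yield word


def count_bolt(words):
    """Bolt: keeps a running count and emits (word, count) for each input tuple."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Topology: wire the spout and the bolts into a network.
topology = count_bolt(split_bolt(sentence_spout()))
emitted = list(topology)
```

Each tuple flows through the whole chain as soon as it is produced, so an updated count is emitted per event rather than per batch.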

Storm Architecture
[Diagram: Nimbus (master node), three ZooKeeper nodes (cluster coordination), four Supervisors (each launches workers), Worker processes]

Trident (Storm) Word Count
TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
    .each(new Fields("sentence"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
        new Fields("count")).parallelismHint(6);
"to be or" → "to", "be", "or" → (to, 1), (be, 1), (or, 1)
"not to be" → "not", "to", "be" → (not, 1), (to, 1), (be, 1)
Aggregated counts: (be, 2), (not, 1), (or, 1), (to, 2)

Use the Right Tool for the Job
https://www.flickr.com/photos/hikingartist/4193330368/

Things to Consider
Scale
Latency
Iterative Processing
› Are there suitable non-iterative alternatives?
Use What You Know
Code Reuse
Maturity

When We Recommend Spark
Iterative Batch Processing (most Machine Learning)
› There really is nothing else right now.
› Has some scale issues.
Tried ETL (Not at Yahoo scale yet)
Tried Shark/Interactive Queries (Not at Yahoo scale yet)
< 1 TB (or memory size of your cluster)
Tuning it to run well can be a pain
Databricks and others are working on scaling.
Streaming is all μ-batch, so latency is at least 1 sec
Streaming still has single points of failure
All streaming inputs are replicated in memory
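
Why micro-batching puts a floor under latency can be shown with a little arithmetic. The sketch below models only the time an event waits for its batch to close; the function name and the 1000 ms interval are assumptions for illustration, and real end-to-end latency also includes the time to schedule and process the batch, which is not modeled here.

```python
def micro_batch_wait_ms(arrival_ms, batch_interval_ms):
    """Milliseconds an event waits until its batch closes and can be processed."""
    batch_end_ms = (arrival_ms // batch_interval_ms + 1) * batch_interval_ms
    return batch_end_ms - arrival_ms


arrivals_ms = [0, 250, 900, 1000, 1999]
waits_ms = [micro_batch_wait_ms(t, 1000) for t in arrivals_ms]
print(waits_ms)
# prints [1000, 750, 100, 1000, 1]
```

An event that arrives just after a batch boundary waits almost the full interval before any processing starts, which a one-event-at-a-time system avoids.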

When We Recommend Storm
Latency < 1 second (single event at a time)
› There is little else (especially not open source)
"Real Time" …
› Analytics
› Budgeting
› ML
› Anything
Lower Level API than Spark
No built-in concept of look back aggregations
Takes more effort to combine batch with streaming
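
With no built-in look-back aggregation, windowed state has to be kept by hand, for example inside a bolt. The following is a minimal sketch in plain Python, not Storm code: `SlidingWindowCounter` is a hypothetical class, and it assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts the events seen in the last `window_s` seconds (hand-rolled)."""

    def __init__(self, window_s):
        self.window_s = window_s
        self._timestamps = deque()

    def add(self, timestamp_s):
        """Record one event and return the count inside the look-back window."""
        self._timestamps.append(timestamp_s)
        # Evict events that have fallen out of the window.
        while self._timestamps[0] <= timestamp_s - self.window_s:
            self._timestamps.popleft()
        return len(self._timestamps)


counter = SlidingWindowCounter(window_s=10)
counts = [counter.add(t) for t in [0, 3, 9, 10, 12, 25]]
print(counts)
# prints [1, 2, 3, 3, 4, 1]
```

Eviction, ordering, and recovery of this state after a failure are all left to the application, which is part of the extra effort the slide refers to.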

Fictitious Example: My Commute App
Mobile App that lets users track their commute.
Cities, users, companies, etc. compete daily for
› Shortest commute time
› Greenest commute
Make money by selling location based ads and aggregate data to
› Governments
› Advertisers
Feel free to steal my crazy idea, I just want to be invited to the launch party, and I wouldn't say no to some stock.
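
The daily "shortest commute" competition boils down to a grouped aggregate. A minimal sketch in plain Python follows; the record layout `(user, city, commute_minutes)` and all sample values are made up for illustration and are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical records: (user, city, commute_minutes). Values are invented.
commutes = [
    ("alice", "Champaign Urbana", 14),
    ("bob", "Champaign Urbana", 16),
    ("carol", "Chicago", 22),
    ("dave", "Chicago", 30),
]


def average_commute_by_city(records):
    """Return the mean commute time in minutes for each city."""
    totals = defaultdict(lambda: [0, 0])  # city -> [sum of minutes, count]
    for _user, city, minutes in records:
        totals[city][0] += minutes
        totals[city][1] += 1
    return {city: total / count for city, (total, count) in totals.items()}


averages = average_commute_by_city(commutes)
winner = min(averages, key=averages.get)  # city with the shortest average
```

The same aggregate could be computed incrementally as events arrive or recomputed daily over stored history, which is the choice the rest of the example explores.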

Chicago vs. Champaign Urbana
Champaign Urbana: 14-15 min
Chicago: 20-30 min
[Bar chart: commute time in minutes (axis 0 to 35) for Bobby, CU, and Chicago]
Source: http://project.wnyc.org/commute-times-us/embed.html#5.00/42.000/-89.500

Things to Consider
Scale
› everyone in the world!!!
Latency
› a few seconds max
› Possibly for targeting, but there are alternatives

Architecture
[Diagram: App, Web Service (User, Commute ID, Location History, MPG), Kafka, Storm, HBase/NoSQL, HDFS, Spark, Customer]

Architecture (Alternative)
[Diagram: App, Web Service (User, Commute ID, Location History, MPG), HBase/NoSQL, HDFS, Spark, Customer]
Go directly to Spark Streaming,
but data loss potential goes up.

Architecture (Alternative 2)
[Diagram: App, Web Service (User, Commute ID, Location History, MPG), Kafka, Storm, HBase/NoSQL, Customer]
Streaming Operations Only
(Kappa Architecture)

Fictitious Example 2: Web Scale Monitoring
Look for trends that can indicate a problem.
› Alert or provide automated corrections
Provide an interface to visualize
› Current data very quickly
› Historical data in depth
If you commercialize this one please give me/Yahoo a free license for life (open source works too)
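
A very simple form of "look for trends and alert" is a rule on a rolling mean. The sketch below is plain Python with made-up numbers: `check_trend`, the window of 3 samples, the threshold of 4, and the sample error rates are all assumptions for illustration, not part of the talk.

```python
from statistics import mean


def check_trend(samples, window, threshold):
    """Return the indexes where the mean of the last `window` samples exceeds `threshold`."""
    alerts = []
    for i in range(window - 1, len(samples)):
        recent = samples[i - window + 1 : i + 1]
        if mean(recent) > threshold:
            alerts.append(i)
    return alerts


# Invented per-interval error counts from one server.
error_rates = [1, 1, 2, 1, 6, 8, 9, 2, 1]
alerts = check_trend(error_rates, window=3, threshold=4)
print(alerts)
# prints [5, 6, 7]
```

Averaging over a window keeps a single spike from firing an alert, at the cost of reacting a little later to a real problem.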

Things to Consider
Scale
› Lots of events from many different servers
Latency
› a few seconds max, but the fewer the better
› For in-depth analysis, definitely

Fictitious Example 2: Web Scale Monitoring
[Diagram: Servers, Kafka, Storm, HBase, HDFS, Spark, ML and trend analysis, Rules, Alert!!, JDBC Server, UI]

Questions?
bobby@apache.org resume-hadoop@yahoo-inc.com
http://bit.ly/1ybTXMe
