Apache Storm vs. Spark Streaming - two stream processing platforms compared

2015 © Trivadis
BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA
2014 © Trivadis
Apache Storm vs. Spark Streaming –
Two Stream Processing Platforms compared
Juni 2015
Guido Schmutz
Juli 2015
Apache Storm vs. Spark Streaming - Two Stream Processing Platforms compared
1

2015 © Trivadis
Guido Schmutz
§  Working for Trivadis for more than 18 years
§  Oracle ACE Director for Fusion Middleware and SOA
§  Co-Author of different books
§  Consultant, Trainer Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
§  Member of Trivadis Architecture Board
§  Technology Manager @ Trivadis
§  More than 25 years of software development
experience
§  Contact: guido.schmutz@trivadis.com
§  Blog: http://guidoschmutz.wordpress.com
§  Twitter: gschmutz
Juli 2015
2

2015 © Trivadis
Trivadis is a market leader in IT consulting, system integration,
solution engineering and the provision of IT services focusing
on and technologies in Switzerland,
Germany and Austria.
We offer our services in the following strategic business fields:
Trivadis Services takes over the interacting operation of your IT systems.
Trivadis
O P E R A T I O N
Juli 2015
3

2015 © Trivadis
Agenda
1.  Introduction / Motivation
2.  Apache Storm
3.  Apache Spark (Streaming)
4.  Stream Processing in the Architecture
Juli 2015
4

2015 © Trivadis
What is Stream Processing?
Infrastructure for continuous data processing
Computational model can be as general as MapReduce but with the ability
to produce low-latency results
Data collected continuously is naturally processed continuously
aka. Event Processing / Complex Event Processing (CEP)
Juli 2015
5

2015 © Trivadis
Trivadis Stream Processing Demo System
Juli 2015
6
Use Hashtag #JFS2015 plus #storm and/or #spark

2015 © Trivadis
How to design a Stream Processing System?
Juli 2015
7
Event
Stream
event
Collecting
event
Queue
(Persist)
Event
Stream
event
Collecting
event
Processing
event
Processing
result
result
Event
Stream
event
Collecting/
Processing
result

2015 © Trivadis
How to scale a Stream Processing System?
Juli 2015
8
Queue
(Persist)
Event
Stream
event
Collecting
Thread 1 event event
Processing
Thread 1 result
Collecting
Thread 2
Processing
Thread 2
event event event result
Collecting
Thread n
Processing
Thread n

2015 © Trivadis
Collecting
Process 1
Collecting
Process 1
Collecting
Process 1
Collecting
Process 1
Collecting
Process 1
Juli 2015
9
Queue 1
(Persist)
Event
Stream
event
Collecting
Thread 1
event event
Processing
Process 1 result
Collecting
Thread 1
Processing
Process 1
Queue 2
(Persist)event
event result
Processing
Process 1
Queue n
(Persist)
event

2015 © Trivadis
Collecting
Process 1
Collecting
Process 2
Processing
A
Process 2
Processing
B
Process 2
Processing
A
Process 1
Processing
B
Process 1
Juli 2015
10
Event
Stream
Collecting
Process 1
Collecting
Process 2
Processing A
Thread 2
Q2
e
Processing B
Thread 2
Q2
e
Processing A
Thread 1
Q1
e
Processing B
Thread 1
Q1
e
Processing
A
Process 2
Processing A
Thread n
Qn
e

2015 © Trivadis
How to make (stateful) Stream Processing System
reliable?
Faults and stragglers inevitable in large clusters running big data
applications
Streaming applications must recover from them quickly
Juli 2015
11
Collecting
Process 2
Processing
A
Process 2
Processing
B
Process 2
Event
Stream
Collecting
Process 2
Processing A
Thread 2
Q2
e
Processing B
Thread 2
Q2
e
Collecting
Process 2
Processing
A
Process 2
Processing
B
Process 2
Event
Stream
Collecting
Process 2
Processing A
Thread 2
Q2
e
Processing B
Thread 2
Q2
e

2015 © Trivadis
reliable?
Solution 1: using active/passive system (hot replication)
•  Both systems process the full load
•  In case of a failure, automatically switch and use the “passive” system
•  Stragglers slow down both active and passive system
Juli 2015
12
State = State in-memory and/or on-disk
Collecting
Process 2
Processing
A
Process 2
Processing
B
Process 2
Event
Stream
Collecting
Process 2
Processing A
Thread 2
Q2
e
Processing B
Thread 2
Q2
e
Active
Collecting
Process 2
Processing
A
Process 2
Processing
B
Process 2
Collecting
Process 2
Processing A
Thread 2
Q2
e
Processing B
Thread 2
Q2
e
Passive
State
State

2015 © Trivadis
reliable?
Solution 2: Upstream backup
•  Nodes buffer messages and reply them to new node in case of failure
•  Stragglers are treated as failures
Juli 2015
13
State = State in-memory and/or on-disk
buffer = Buffer for replay in-memory and/or on-disk
Collecting
Process 2
Processing
A
Process 2
Processing
B
Process 2
Event
Stream
Collecting
Process 2
Processing A
Thread 2
Q2
e
Processing B
Thread 2
Q2
e
State

2015 © Trivadis
Processing Models
Batch Processing
•  Familiar concept of processing data en masse
•  Generally incurs a high-latency
(Event-) Stream Processing
•  A one-at-a-time processing model
•  A datum is processed as it arrives
•  Sub-second latency
•  Difficult to process state data efficiently
Micro-Batching
•  A special case of batch processing with very small batch sizes (tiny)
•  A nice mix between batching and streaming
•  At cost of latency
•  Allows Stateful computation, making windowing an easy task
Juli 2015
14

2015 © Trivadis
Message Delivery Semantics
At most once [0,1]
•  Messages my be lost
•  Messages never redelivered
At least once [1 .. n]
•  Messages will never be lost
•  but messages may be redelivered (might be ok if consumer can handle it)
Exactly once [1]
•  Messages are never lost
•  Messages are never redelivered
•  Perfect message delivery
•  Incurs higher latency for transactional semantics
Juli 2015
15

2015 © Trivadis
Agenda
2.  Apache Storm
Juli 2015
16

2015 © Trivadis
Apache Storm
A platform for doing analysis on streams of data as they come in, so you
can react to data as it happens.
•  highly distributed real-time computation system
•  Provides general primitives to do
real-time computation
•  To simplify working with queues & workers
•  scalable and fault-tolerant
Originated at Backtype, acquired by Twitter in 2011
Open Sourced late 2011
Part of Apache since September 2013
Juli 2015
17

2015 © Trivadis
Apache Storm – Core concepts
Tuple
•  Immutable Set of Key/value pairs
Stream
•  an unbounded sequence of tuples that can be processed in parallel by Storm
Topology
•  Wires data and functions via a DAG (directed acyclic graph)
•  Executes on many machines similar to a MR job in Hadoop
Spout
•  Source of data streams (tuples)
•  can be run in “reliable” and “unreliable” mode
Bolt
•  Consumes 1+ streams and produces new streams
•  Complex operations often require multiple
steps and thus multiple bolts
Juli 2015
18
Spout
Spout
Bolt
Bolt
Bolt
Bolt
Source of
Stream B
Subscribes: A
Emits: C
Subscribes: A
Emits: D
Subscribes: A & B
Emits: -
Subscribes: C & D
Emits: -
T T T T T T T T

2015 © Trivadis
Apache Storm – Core concepts
Each Spout or Bolt are running N instances in parallel
Juli 2015
19
Split Text
nth
Text
Spout
Word
Count nth
Split Text
1th
Word
Count 1st
Shuffle Fields
Shuffle grouping is random grouping
Fields grouping is grouped by value, such that equal value results in equal task
All grouping replicates to all tasks
Global grouping makes all tuples go to one task
None grouping makes bolt run in the same thread as bolt/spout it subscribes to
Direct grouping producer (task that emits) controls which consumer will receive
Local or Shuffle
grouping
similar to the shuffle grouping but will shuffle tuples among bolt tasks
running in the same worker process, if any. Falls back to shuffle
grouping behavior.
ReportGlobal

2015 © Trivadis
Storm – How does it work ?
Juli 2015
20
Who will win: Barca,
Real, Juve or Bayern?
… bit.ly/1yRsPmE #fcb
#barca
Sentence
Splitter
Twitter
Spout
Sentence
Splitter
… #barca
Shuffle
Grouping
Sentence
Splitter
… #fcb
bayern
fcb
juve
real
barca
barca

2015 © Trivadis
Juli 2015
21
Sentence
Splitter
Twitter
Spout
Word
Counter
Sentence
Splitter
Word
Counter
Sentence
Splitter
#barca
Shuffle
Grouping
… #barca
… #fcb
Fields
Grouping
real
juve
barca
barca
bayern
fcb

2015 © Trivadis
Juli 2015
22
Sentence
Splitter
Twitter
Spout
Word
Counter
Sentence
Splitter
Word
Counter
Sentence
Splitter
#barca
Shuffle
Grouping
real
juve
barca
barca
bayern
fcb
… #barca
… #fcb
Fields
Grouping
INCR
barca
INCR
real
INCR
juve
real = 1
juve = 1
INCR
barca
INCR
bayern bayern = 1
barca = 1
barca = 2
INCR
fcb fcb = 1

2015 © Trivadis
Juli 2015
23
Sentence
Splitter
Twitter
Spout
Word
Counter
Sentence
Splitter
Word
Counter
Report
real = 1
juve = 1
barca = 2
bayern = 1
Sentence
Splitter
#barca
Shuffle
Grouping
real
juve
barca
barca
bayern
fcb
… #barca
… #fcb
Fields
Grouping
Global
Grouping
real = 1
juve = 1
bayern = 1
barca = 2
30sec
fcb = 1
fcb = 1

2015 © Trivadis
Using a NoSQL datastore for persisting
results
Keep state in a NoSQL datastore
Using counter type columns of Cassandra
Juli 2015
24
Twitter
Stream
Sentence
Splitter
Twitter
Spout
Word
Counter
Sentence
Splitter
Word
Counter
#barca
… #barca
… #fcb real = 1
juve = 1
barca = 2
bayern = 1
INCR
barca
INCR
real
INCR
juve
INCR
barca
INCR
bayern
real
juve
barca
barca
bayern
fcb
fcb = 1
INCR
fcb

2015 © Trivadis
Storm Trident
High-Level abstraction on top of storm
•  Processing as a series of batches (micro-batches)
•  Stream is partitioned among nodes in cluster
5 kinds of operations in Trident
•  Operations that apply locally to each partition and cause no network transfer
•  Repartitioning operations that don‘t change the contents
•  Aggregation operations that do network transfer
•  Operations on grouped streams
•  Merges and Joins
Juli 2015
25
Twitter
Stream
tweet tweet
Sentence
Splitter
Twitter
Spout
hashtag Sentence
Normalizer
Persistent
Aggregate
hashtag
groupBylocal
Bolt Bolt

2015 © Trivadis
Storm Core vs. Storm Trident
Juli 2015
26
Storm Core Storm Trident
Community > 100 contributors > 100 contributors
Adoption *** *
Language Options Java, Clojure, Scala,
Python, Ruby, …
Java, Clojure,
Scala
Processing Models Event-Streaming Micro-Batching
Processing DSL No Yes
Stateful Ops No Yes
Distributed RPC Yes Yes
Delivery
Guarantees
At most once / At least
once
Exactly Once
Latency sub-second seconds
Platform Storm Cluster, YARN Storm Cluster, YARN

2015 © Trivadis
Agenda
1.  Introduction
2.  Apache Storm
4.  Unified Log (Enterprise Event Bus)
Juli 2015
27

2015 © Trivadis
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing
•  The hot trend in Big Data!
•  Based on 2007 Microsoft Dryad paper
•  Written in Scala, supports Java, Python, SQL and R
•  Can run programs up to 100x faster than Hadoop MapReduce in memory, or
10x faster on disk
•  Runs everywhere – runs on Hadoop, Mesos, standalone or in the cloud
•  One of the largest OSS communities in big data with over 200 contributors in
50+ organizations
•  Originally developed 2009 in UC Berkley’s AMPLab
•  Open Sourced in 2010 – since 2014 part of Apache Software foundation
Juli 2015
28

2015 © Trivadis
Apache Spark
Spark Core
•  General execution engine for the Spark platform
•  In-memory computing capabilities deliver speed
•  General execution model supports wide variety of use cases
•  DAG-based
•  Ease of development – native APIs in Java, Scala and Python
Spark Streaming
•  Run a streaming computation as a series of very small, deterministic batch jobs
•  Batch size as low as ½ sec, latency of about 1 sec
•  Exactly-once semantics
•  Potential for combining batch and streaming processing in same system
•  Started in 2012, first alpha release in 2013
Juli 2015
29

2015 © Trivadis
Apache Spark - Generality
Juli 2015
30
Spark SQL
(Batch
Processing)
Blink DB
(Approximate
Querying)
Spark Streaming
(Real-Time)
MLlib, Spark R
(Machine
Learning)
GraphX
(Graph
Processing)
Spark Core API and Execution Model
Spark
Standalone
MESOS YARN HDFS
Elastic
Search
Cassandra
S3 /
DynamoDB
Libraries
Core Runtime
Cluster Resource Managers Data Stores
Adapted from C. Fregly: http://slidesha.re/11PP7FV

2015 © Trivadis
Apache Spark – Core concepts
Resilient Distributed Dataset (RDD)
•  Core Spark abstraction
•  Collections of objects (partitions) spread across cluster
•  Can be stored in-memory or on-disk (local)
•  Enables parallel processing on data sets
•  Build through parallel transformations
•  Immutable, re-computable, fault tolerant
•  Contains transformation history (“lineage”) for whole data set
Operations
•  Stateless Transformations (map, filter, groupBy)
•  Actions (count, collect, save)
Juli 2015
31

2015 © Trivadis
RDD Lineage Example
Juli 2015
32
HDFS File Input 1
HadoopRDD
FilteredRDD
MappedRDD
ShuffledRDD
HDFS File
Output
HadoopRDD
MappedRDD
HDFS File Input
2
SparkContext.hadoopFile()

SparkContext.hadoopFile()
filter()

map()
map()

join()

SparkContext.saveAsHadoopFile()

Transformations
(Lazy)
Action
(Execute Transformations)
Adapted from Chris Fregly: http://slidesha.re/11PP7FV

2015 © Trivadis
Apache Spark Streaming – Core concepts
Discretized Stream (DStream)
•  Core Spark Streaming abstraction
•  micro batches of RDD’s
•  Operations similar to RDD
Input DStreams
•  Represents the stream of raw data received from streaming sources
•  Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter,
ZeroMQ, TCP Socket, Akka actors, etc.
•  Custom Sources can be easily written for custom data sources
Operations
•  Same as Spark Core
•  Additional Stateful transformations (window, reduceByWindow)
Juli 2015
33

2015 © Trivadis
Discretized Stream (DStream)
Juli 2015
34
time 1 time 2 time 3
message

time n….
f(message
1)

RDD @time 1
f(message
2)

f(message
n)

….
message
1

RDD @time 1
message
2

message
n

….
result
1

result
2

result
n

….
message
message
message

f(message
1)

RDD @time 2
f(message
2)

f(message
n)

….
message
1

RDD @time 2
message
2

message
n

….
result
1

result
2

result
n

….
f(message
1)

RDD @time 3
f(message
2)

f(message
n)

….
message
1

RDD @time 3
message
2

message
n

….
result
1

result
2

result
n

….
f(message
1)

RDD @time n
f(message
2)

f(message
n)

….
message
1

RDD @time n
message
2

message
n

….
result
1

result
2

result
n

….
Input Stream
DStream
MappedDStream
map()

saveAsHadoopFiles()

Time Increasing
DStreamTransformationLineageActions
TriggerSpark
Jobs
Adapted from Chris Fregly: http://slidesha.re/11PP7FV

2015 © Trivadis
Storm Core vs. Storm Trident vs. Spark Streaming
Juli 2015
35
Storm Core Storm Trident Spark Streaming
Community > 100 contributors > 100 contributors > 280 contributors
Adoption *** * *
Language
Options
Java, Clojure, Scala,
Python, Ruby, …
Java, Clojure,
Scala
Java, Scala
Python (coming)
Processing
Models
Event-Streaming Micro-Batching Micro-Batching
Batch (Spark Core)
Processing DSL No Yes Yes
Stateful Ops No Yes Yes
Distributed RPC Yes Yes No
Delivery
Guarantees
At most once / At
least once
Exactly Once Exactly Once
Latency sub-second seconds seconds
Platform Storm Cluster, YARN Storm Cluster, YARN YARN, Mesos
Standalone, DataStax EE

2015 © Trivadis
Architectural Pattern: Standalone Event Stream
Processing
Juli 2015
3737
Event Processing
(ESP / CEP)
State Store /
Event Store
EnterpriseEventBus
(Ingress)
Event
Cloud
Internet of
Things
Social Media
Streams
Enterprise
EventBus
37
Analytical
Applications
DB
Enterprise
Service
Bus
Business Rule
Management
SystemRules
Event Processing
Result
Store

2015 © Trivadis
Hadoop Big Data
Infrastructure
Architectural Pattern: Event Stream Processing as part
of Lambda Architecture
Juli 2015
3838
Event Processing
(ESP / CEP)
State Store /
Event Store
EnterpriseEventBus
(Ingress)
Event
Cloud
Internet of
Things
Social Media
Streams
Enterprise
EventBus
38
Analytical
Applications
DB
Enterprise
Service
Bus
Event Processing
Map/
Reduce
HDF
S
Result
Store
Result
Store

2015 © Trivadis
Hadoop Big Data
Infrastructure
Architectural Pattern: Event Stream Processing as part
of Kappa Architecture
Juli 2015
3939
Event Processing
(ESP / CEP)
State Store /
Event Store
EnterpriseEventBus
(Ingress)
Event
Cloud
Internet of
Things
Social Media
Streams
39
Analytical
Applications
DB
Enterprise
Service
Bus
Event Processing
Replay
HDF
S
Result
Store

2015 © Trivadis
Unified Log (Event) Architecture
Stream processing
allows
for computing feeds
off of other feeds
Derived feeds
are no different
than original feeds
they are computed off
Single deployment of
“Unified Log” but
logically different
feeds
Juli 2015
40
Meter
Readings
Collector
Enrich /
Transform
Aggregate
by Minute
Raw Meter
Readings
Meter with
Customer
Meter by Customer
by Minute
Customer
Aggregate
by Minute
Meter by
Minute
Persist
Meter by
Minute
Persist
Raw Meter
Readings

2015 © Trivadis
Juli 2015
41
Tweets
Filter
Persist
Filtered
Tweets
Persist
Sensor
Readings
Tweet
Distribution Layer
Kafka Storm Cassandra ElasticsearchTitan
Speed Layer
Feature
extractor
Count
Skill
Matcher
sensor
reading
Feature
Occurrences
Matches
Feature
counter Skill
Unified Log/
Event
Architecture
for Trivadis
Streaming
Demo System

2015 © Trivadis
Juli 2015
42
Tweets
Filter
Persist
Filtered
Tweets
Persist
Sensor
Readings
Tweet
Distribution Layer
Kafka Storm Cassandra ElasticsearchTitan
Speed Layer
Feature
extractor
Count
Skill
Matcher
sensor
reading
Feature
Occurrences
Matches
Feature
counter Skill
Unified Log/
Event
Architecture
for Trivadis
Streaming
Demo System
Storm Topology
Splitter
Kafka
Spout
Word
Remover
Splitter
Word
Remover
Shuffle Fields
Kafka
Kafka
Word
Remover

2015 © Trivadis
Central Unified Log for (real-time) subscription
Take all the organization’s data and put it into a central log for subscription
Properties of the Unified Log:
•  Unified: “Enterprise”, single deployment
•  Append-Only: events are appended, no update in place => immutable
•  Ordered: each event has an offset, which is unique within a shard
•  Fast: should be able to handle thousands of messages / sec
•  Distributed: lives on a cluster of machines
Juli 2015
43
0 1 2 3 4 5 6 7 8 9 10 11
reads
writes
Collector
Consumer
System A
(time = 6)
Consumer
System B
(time = 10)
reads

2015 © Trivadis
Apache Kafka - Overview
•  A distributed publish-subscribe messaging system
•  Designed for processing of real time activity stream data (logs, metrics
collections, social media streams, …)
•  Initially developed at LinkedIn, now part of Apache
•  Does not follow JMS Standards and does not use JMS API
•  Kafka maintains feeds of messages in topics
Juli 2015
44
Kafka Cluster
Consumer Consumer Consumer
Producer Producer Producer
0 1 2 3 4 5 6 7 8 9
1
0
1
1
1
2
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
1
0
1
1
1
2
Anatomy of a topic:
Partition 0
Partition 1
Partition 2
Writes
old new

2015 © Trivadis
Questions and answers ...
2014 © Trivadis
BASEL BERN BRUGES LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA
Guido Schmutz
Technology Manager
Juli 2015
46

Apache Storm vs. Spark Streaming - two stream processing platforms compared

More Related Content

What's hot

Viewers also liked

Similar to Apache Storm vs. Spark Streaming - two stream processing platforms compared

More from Guido Schmutz

Recently uploaded

Apache Storm vs. Spark Streaming - two stream processing platforms compared