Flink, Spark, and Storm are three popular streaming platforms compared here on performance. A benchmark simulated an advertising analytics pipeline with events streamed through Kafka. Flink and Storm showed similar, roughly linear latency growth with throughput. Spark had higher latency due to micro-batching but sustained higher throughput. At very high throughput, Storm performed best with acknowledgments disabled, while Flink provided low latency while retaining processing guarantees. Overall, the platforms demonstrated tradeoffs among latency, throughput, and exactly-once processing.
Introducing Apache Kafka - a visual overview. Presented at the Canberra Big Data Meetup 7 February 2019. We build a Kafka "postal service" to explain the main Kafka concepts, and explain how consumers receive different messages depending on whether there's a key or not.
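The key-vs-no-key routing described above can be sketched in a few lines. This is a toy model, not Kafka's actual partitioner (the real default hashes keys with murmur2); the hash function and partition count here are illustrative assumptions:

```python
# Toy model of Kafka partition routing: records with a key always land in the
# same partition; keyless records are spread round-robin across partitions.
import zlib
from itertools import count

NUM_PARTITIONS = 3
_round_robin = count()

def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Keyed records map deterministically; keyless records rotate."""
    if key is None:
        return next(_round_robin) % num_partitions
    # Real Kafka uses murmur2; crc32 stands in for any stable hash here.
    return zlib.crc32(key.encode()) % num_partitions

# Same key -> same partition, so one consumer sees all of that key's messages.
assert choose_partition("letterbox-42") == choose_partition("letterbox-42")
```

This is why, in the "postal service" analogy, a keyed letter always reaches the same letterbox, while unkeyed letters are delivered to whichever box is next in rotation.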
The document discusses routed networks in OpenStack Neutron. It describes how routed networks implement layer 3 connectivity while allowing scalability by associating subnets to network segments. Key points include new Neutron APIs for segments and ports in routed networks, integration with the Nova scheduler, and options for implementing distributed virtual routing with features like floating IPs, multiple availability zones, and BGP routing.
The document discusses the history and growth of Jenkins, an open source automation server. It began in 2004 as a personal project by Kohsuke Kawaguchi to automate builds. Over time it grew popular and now has over 470 plugins to support various tasks. The number of plugins and releases has increased dramatically each year as more developers contribute to and use Jenkins.
The engineering teams within Splunk have used several technologies (Kinesis, SQS, RabbitMQ, and Apache Kafka) for enterprise-wide messaging over the past few years, but recently decided to pivot toward Apache Pulsar, migrating existing use cases and embedding it into new cloud-native service offerings such as the Splunk Data Stream Processor (DSP).
No data loss pipeline with Apache Kafka (Jiangjie Qin)
The document discusses how to configure Apache Kafka to prevent data loss and message reordering in a data pipeline. It recommends settings such as blocking when the producer buffer is full, using acks=all for synchronous message acknowledgment, limiting in-flight requests, and committing offsets only after messages are processed. It also suggests replicating topics across at least 3 brokers and setting a minimum in-sync replica count of 2. Mirror makers can further ensure no data loss or reordering by consuming from one cluster and producing to another in order while committing offsets. Custom consumer listeners and message handlers allow for mirroring optimizations.
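The settings listed above map to concrete Kafka configuration keys. The sketch below uses the 0.9-era key names (block.on.buffer.full was later replaced by max.block.ms, and replication factor is actually set at topic creation, not in a config file); treat it as a summary of the talk's recommendations, not a drop-in config:

```python
# No-data-loss settings from the talk, as plain Kafka config keys.
producer_config = {
    "acks": "all",                 # wait for all in-sync replicas to ack
    "retries": 2147483647,         # retry transient errors instead of dropping
    "max.in.flight.requests.per.connection": 1,  # avoid reordering on retry
    "block.on.buffer.full": True,  # back-pressure instead of discarding
}
topic_config = {
    "replication.factor": 3,       # each partition on at least 3 brokers
    "min.insync.replicas": 2,      # acks=all requires >= 2 live replicas
}
consumer_config = {
    "enable.auto.commit": False,   # commit offsets only after processing
}
```

The combination matters: acks=all without min.insync.replicas >= 2 still acknowledges writes that only one broker holds, and retries without capping in-flight requests can reorder messages.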
The document provides an overview of the Agile methodology, including its history, principles, characteristics, and popular methods like Scrum and Extreme Programming (XP). It describes how Agile evolved in the 1990s as an alternative to heavyweight methods like the Waterfall model. Key aspects of Agile include iterative development, frequent delivery of working software, collaboration between self-organizing cross-functional teams, and responding to change over following a plan.
Software Engineering Management Framework - Building an Awesome Software Engi... (Jonathan Fulton)
A framework I codified to help me manage and scale our Software Engineering team at VideoBlocks. If you're in engineering management, whether at a small startup or larger company, you'll likely find some useful tidbits that will help you on your journey.
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t... (Spark Summit)
The document discusses securing Spark notebooks for data science by integrating Kerberos authentication. It begins with an overview of Spark notebooks and the current authentication approach. It then covers the requirements for Kerberos integration, how Kerberos works in HDFS and Yarn clusters, and a proposed design to integrate Kerberos into JupyterHub, SparkMagic and Livy to authenticate users and allow secured access to HDFS and Spark from notebooks. Key aspects of the design include custom JupyterHub authenticators and spawners, obtaining service tickets from the KDC, and propagating user identities through the system.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2.... (Databricks)
Richard Garris presented on ways to productionize machine learning models built with Apache Spark MLlib. He discussed serializing models using MLlib 2.X to save models for production use without reimplementation. This allows data scientists to build models in Python/R and deploy them directly for scoring. He also reviewed model scoring architectures and highlighted Databricks' private beta solution for deploying serialized Spark MLlib models for low latency scoring outside of Spark.
This document provides definitions and explanations of key terms and artifacts used in Scrum project management. It describes the product backlog, sprint backlog, daily scrum, sprint planning meeting, sprint review, and sprint retrospective. It also outlines the roles of the product owner, scrum master, and scrum team, and includes a glossary of additional Scrum terms.
This document discusses best practices for migrating from an existing Rundeck installation to a newer version. It outlines key questions to consider regarding project structure and settings storage. Two migration approaches are described - an in-place upgrade or new server installation. Preparation steps like backups and shared resource configuration are covered. The document provides guidance on project import, database settings, and other post-migration configuration topics.
Openstack Neutron, interconnections with BGP/MPLS VPNs (Thomas Morin)
This document discusses the OpenStack Neutron networking-bgpvpn project, which provides a Neutron API and service plugin that allow tenants to interconnect their OpenStack networks and routers with BGP/MPLS VPNs. The API exposes constructs such as BGPVPNs, network associations, and router associations. It works with drivers for Neutron/OVS, OpenDaylight, OpenContrail, and others. The goal is to provide a common, controller-agnostic way for tenants to control interconnections. The project is part of OpenStack and OPNFV, and provides a model for integrating telco functionality into OpenStack.
This SlideShare will help users understand the agile software development methodology and how it works. It also defines the whole process of implementing the Scrum methodology.
Patterns (et anti-patterns) d’architecture ou comment mieux concevoir ses app... (Microsoft)
Developing enterprise and line-of-business applications is becoming increasingly difficult. Applications are ever more complex, and functional specifications are unstable, changing at the whim of clients. To manage these difficulties, proven good practices exist, and new architecture and design trends are emerging. Whether you are a developer or an architect, this technically oriented session will interest you: it presents various patterns (and anti-patterns to avoid) for better designing the architecture of your business applications and thus better absorbing these difficulties. It covers several recurring problems and how to solve them effectively, illustrated throughout with examples of possible implementations. It also introduces emerging approaches such as Domain-Driven Design (DDD) and architectural styles such as Command and Query Responsibility Segregation (CQRS), and what these concepts can bring.
Scrum is an iterative and incremental agile software development framework for managing product development. Diceus follows this methodology in a variety of projects, which gives us and our clients an invaluable advantage during the development life cycle. The result of this approach is a stable and successful product.
You can find more information about the Scrum methodology and Business Intelligence in our blog:
http://blog.diceus.com/
This document provides an introduction to agile project management. It begins by contrasting traditional project management, which relies on upfront planning, with agile project management, which uses iterative development cycles. The key principles of agile project management are then outlined, including a focus on customer value, iterative and incremental delivery, experimentation and adaptation, self-organization, and continuous improvement. Popular agile methods like Scrum, Extreme Programming, and others are briefly described. The remainder of the document focuses on how the Scrum methodology works in practice and some of the challenges of applying agile principles to large projects.
Apache Kafka 0.8 basic training - Verisign (Michael Noll)
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
Kafka is a distributed messaging system that allows for publishing and subscribing to streams of records, known as topics. Producers write data to topics and consumers read from topics. The data is partitioned and replicated across clusters of machines called brokers for reliability and scalability. A common data format like Avro can be used to serialize the data.
Kafka and Avro with Confluent Schema Registry (Jean-Paul Azar)
The document discusses Confluent Schema Registry, which stores and manages Avro schemas for Kafka clients. It allows producers and consumers to serialize and deserialize Kafka records to and from Avro format. The Schema Registry performs compatibility checks between the schema used by producers and consumers, and handles schema evolution if needed to allow schemas to change over time in a backwards compatible manner. It provides APIs for registering, retrieving, and checking compatibility of schemas.
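The compatibility checking described above can be illustrated with a toy version of the backward-compatibility rule the Schema Registry enforces for Avro: a new schema version may add a field only if that field has a default that old records can fill in. This sketch is an illustration of the rule, not the registry's implementation; the field names are invented:

```python
# Toy backward-compatibility check: can records written with the old schema
# still be read with the new one?
def is_backward_compatible(old_fields, new_fields):
    """Fields are dicts of name -> default value (None means no default)."""
    for name, default in new_fields.items():
        if name not in old_fields and default is None:
            return False  # new required field: old records can't supply it
    return True

v1 = {"id": None, "name": None}
v2 = {"id": None, "name": None, "email": ""}    # added with a default: OK
v3 = {"id": None, "name": None, "ssn": None}    # added, no default: breaks

assert is_backward_compatible(v1, v2)
assert not is_backward_compatible(v1, v3)
```

The real registry supports several modes (backward, forward, full) and applies the full Avro resolution rules, but the principle is the same: evolution is allowed only when existing readers or writers are not broken.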
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce (Cloudera, Inc.)
Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.
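One of the recurring row-key patterns in HBase schema-design talks is a composite key combining a salt (to spread hot writers across regions) with an inverted timestamp (so the newest rows sort first). The sketch below is a hedged illustration of that pattern under assumed field names, not a schema from the talk:

```python
# Composite HBase-style row key: salt | entity id | inverted timestamp.
import struct
import zlib

MAX_LONG = 2**63 - 1

def row_key(user_id: str, event_ts_ms: int) -> bytes:
    inverted = MAX_LONG - event_ts_ms             # newest-first scan order
    salt = zlib.crc32(user_id.encode()) % 16      # spread hot writers
    return f"{salt:02d}|{user_id}|".encode() + struct.pack(">q", inverted)

# Within one user's bucket, a newer event's key sorts before an older one,
# so a scan from the row-key prefix returns newest events first.
k_new = row_key("u1", 2_000)
k_old = row_key("u1", 1_000)
assert k_new < k_old
```

The design trade-off is that salting sacrifices a single global scan order for write parallelism: a full time-ordered read must merge the 16 salt buckets.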
Developing Real-Time Data Pipelines with Apache Kafka (Joe Stein)
Apache Kafka is a distributed streaming platform that allows for building real-time data pipelines and streaming apps. It provides a publish-subscribe messaging system with persistence that allows for building real-time streaming applications. Producers publish data to topics which are divided into partitions. Consumers subscribe to topics and process the streaming data. The system handles scaling and data distribution to allow for high throughput and fault tolerance.
Hadoop Summit Europe 2014: Apache Storm Architecture (P. Taylor Goetz)
Storm is an open-source distributed real-time computation system. It uses a distributed messaging system to reliably process streams of data. The core abstractions in Storm are spouts, which are sources of streams, and bolts, which are basic processing elements. Spouts and bolts are organized into topologies that represent the flow of data. Storm provides fault tolerance through message acknowledgments, which give at-least-once processing guarantees. Trident, a high-level abstraction built on Storm, supports operations like aggregations, joins, and state management through its micro-batch oriented, stream-based API, and can provide exactly-once semantics.
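The spout/bolt/topology abstraction can be shown with a deliberately minimal toy model (real Storm runs spouts and bolts as distributed, acked tasks; here they are just chained Python generators, and the word-count example is a stand-in):

```python
# Toy Storm-style topology: a spout emits tuples, bolts transform them,
# and the "topology" is simply how they are wired together.
def sentence_spout():
    """Spout: the source of the stream."""
    yield from ["storm processes streams", "spouts feed bolts"]

def split_bolt(stream):
    """Bolt: split each sentence tuple into word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Terminal bolt: aggregate word counts."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology: spout -> split -> count.
counts = count_bolt(split_bolt(sentence_spout()))
assert counts["storm"] == 1 and counts["bolts"] == 1
```

What the toy omits is exactly what Storm adds: parallel task instances per bolt, stream groupings that route tuples between them, and the ack tree that replays a tuple if any downstream bolt fails to process it.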
Kafka Tutorial - basics of the Kafka streaming platform (Jean-Paul Azar)
Introduction to the Kafka streaming platform. Covers Kafka architecture with some small examples from the command line, then expands on this with a multi-server example. Lastly, it adds simple Java client examples for a Kafka producer and a Kafka consumer. The Java examples have been expanded to correlate with the design discussion of Kafka, and the Kafka design section has been expanded with added references.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
Real time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A combined presentation and workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing streaming applications to be written quickly and easily in both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper, and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a user clicks on while web browsing.
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv (Amazon Web Services)
Low-latency analytics is becoming a very popular scenario. In this session we will discuss several architectural options for doing analytics on moving data using Amazon Kinesis and EMR/Spark Streaming, and share some best practices and real-world examples.
Headaches and Breakthroughs in Building Continuous Applications (Databricks)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Spark Streaming allows processing live data streams using small batch sizes to provide low latency results. It provides a simple API to implement complex stream processing algorithms across hundreds of nodes. Spark SQL allows querying structured data using SQL or the Hive query language and integrates with Spark's batch and interactive processing. MLlib provides machine learning algorithms and pipelines to easily apply ML to large datasets. GraphX extends Spark with an API for graph-parallel computation on property graphs.
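The micro-batch model behind Spark Streaming is simple to state: chop the stream into small batches and process each with ordinary batch logic, trading a little latency for throughput. A toy illustration (not Spark's implementation, which batches by time interval rather than by count):

```python
# Micro-batching: group a stream of records into small fixed-size batches,
# then apply a batch computation to each.
def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

events = [1, 2, 3, 4, 5, 6, 7]
# Per-batch aggregation, as a stand-in for a Spark Streaming job.
results = [sum(b) for b in micro_batches(events, 3)]
assert results == [6, 15, 7]
```

The batch size (interval, in Spark's case) is the latency knob: smaller batches mean fresher results but more per-batch scheduling overhead, which is exactly the trade-off the Flink/Spark/Storm benchmark at the top of this page observed.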
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap... (Landon Robinson)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Presented by Landon Robinson and Jack Chapa
Extending Spark Streaming to Support Complex Event Processing (Oh Chan Kwon)
In this talk, we introduce the extensions of Spark Streaming to support (1) SQL-based query processing and (2) elastic-seamless resource allocation. First, we explain the methods of supporting window queries and query chains. As we know, last year, Grace Huang and Jerry Shao introduced the concept of “StreamSQL” that can process streaming data with SQL-like queries by adapting SparkSQL to Spark Streaming. However, we made advances in supporting complex event processing (CEP) based on their efforts. In detail, we implemented the sliding window concept to support a time-based streaming data processing at the SQL level. Here, to reduce the aggregation time of large windows, we generate an efficient query plan that computes the partial results by evaluating only the data entering or leaving the window and then gets the current result by merging the previous one and the partial ones. Next, to support query chains, we made the result of a query over streaming data be a table by adding the “insert into” query. That is, it allows us to apply stream queries to the results of other ones. Second, we explain the methods of allocating resources to streaming applications dynamically, which enable the applications to meet a given deadline. As the rate of incoming events varies over time, resources allocated to applications need to be adjusted for high resource utilization. However, the current Spark's resource allocation features are not suitable for streaming applications. That is, the resources allocated will not be freed when new data are arriving continuously to the streaming applications even though the quantity of the new ones is very small. In order to resolve the problem, we consider their resource utilization. If the utilization is low, we choose victim nodes to be killed. Then, we do not feed new data into the victims to prevent a useless recovery issuing when they are killed. Accordingly, we can scale-in/-out the resources seamlessly.
This document summarizes a presentation on extending Spark Streaming to support complex event processing. It discusses:
1) Motivations for supporting CEP in Spark Streaming, as current Spark is not enough to support continuous query languages or auto-scaling of resources.
2) Proposed solutions including extending Intel's Streaming SQL package, improving windowed aggregation performance, supporting "Insert Into" queries to enable query chains, and implementing elastic resource allocation through auto-scaling in/out of resources.
3) Evaluation of the Streaming SQL extensions showing low processing delays despite heavy loads or large windows, though more memory optimization is needed.
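The windowed-aggregation optimization described above (evaluate only the data entering or leaving the window, then merge with the previous result) can be sketched for the simplest case, a sliding sum. This is an illustration of the incremental idea, not the talk's query-plan implementation:

```python
# Incremental sliding-window sum: instead of re-aggregating the whole
# window on each slide, adjust the previous result by the one record
# entering and (once the window is full) the one record leaving.
from collections import deque

class SlidingSum:
    def __init__(self, window_size):
        self.window = deque()
        self.size = window_size
        self.total = 0

    def add(self, value):
        self.total += value                      # entering record
        self.window.append(value)
        if len(self.window) > self.size:
            self.total -= self.window.popleft()  # leaving record
        return self.total

w = SlidingSum(window_size=3)
assert [w.add(v) for v in [1, 2, 3, 4]] == [1, 3, 6, 9]
```

Each slide costs O(1) regardless of window length, which is why the approach keeps aggregation time low even for large windows; the same merge trick extends to any aggregate with an inverse (count, sum, mean), though not to ones like max without extra bookkeeping.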
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber (WSO2)
The Marketplace data team at Uber has built a scalable complex event processing platform to solve many challenging real-time data needs for various Uber products. This platform has been in production for more than a year and supports over 100 real-time data use cases with a team of 3. In this talk, we will share the details of the design and our experience, and how we employ Siddhi, Kafka, and Samza at scale.
Data Stream Processing with Apache Flink (Fabian Hueske)
This talk is an introduction to stream processing with Apache Flink. I gave this talk at the Madrid Apache Flink Meetup on February 25th, 2016.
The talk discusses Flink's features, shows its DataStream API, and explains the benefits of event-time stream processing. It gives an outlook on some features that will be added after the 1.0 release.
The need for gleaning answers from unbounded data streams is moving from a nicety to a necessity. Netflix is a data-driven company that needs to process over 1 trillion events a day, amounting to 3 PB of data, to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so the user can focus on data analysis. I'll share our experience using Flink to help build the platform.
Flexible and Real-Time Stream Processing with Apache Flink (DataWorks Summit)
This document provides an overview of stream processing with Apache Flink. It discusses the rise of stream processing and how it enables low-latency applications and real-time analysis. It then describes Flink's stream processing capabilities, including pipelining of data, fault tolerance through checkpointing and recovery, and integration with batch processing. The document also summarizes Flink's programming model, state management, and roadmap for further development.
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ... (Landon Robinson)
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
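The Listener interface itself is a Scala/Java callback API (classes like `SparkListener` with hooks such as `onStageCompleted`). As a language-neutral illustration only, the callback-registry pattern behind it can be sketched in Python; the class and event names below are simplified stand-ins, not the real Spark API:

```python
# Illustrative sketch of the callback pattern behind Spark's Listener
# interface; class and event names are simplified stand-ins, not the
# real Spark API.
class StageCompleted:
    def __init__(self, stage_id, duration_ms):
        self.stage_id = stage_id
        self.duration_ms = duration_ms

class MetricsListener:
    """Collects per-stage durations, as a monitoring listener might."""
    def __init__(self):
        self.durations = {}
    def on_stage_completed(self, event):
        self.durations[event.stage_id] = event.duration_ms

class ListenerBus:
    """Dispatches each posted event to every registered listener."""
    def __init__(self):
        self._listeners = []
    def add_listener(self, listener):
        self._listeners.append(listener)
    def post(self, event):
        for listener in self._listeners:
            listener.on_stage_completed(event)

bus = ListenerBus()
metrics = MetricsListener()
bus.add_listener(metrics)
bus.post(StageCompleted(stage_id=1, duration_ms=420))
bus.post(StageCompleted(stage_id=2, duration_ms=135))
print(metrics.durations)  # {1: 420, 2: 135}
```

In real Spark you would subclass `SparkListener` and register it on the `SparkContext`; the point here is only that monitoring amounts to registering a small callback object on an event bus.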
Spark Streaming & Kafka - The Future of Stream Processing by Hari Shreedharan of... (Data Con LA)
Abstract:-
With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming, how it integrates natively with Kafka with no data loss, and how to achieve exactly-once processing!
Bio:-
Hari Shreedharan is a PMC member and committer on the Apache Flume Project. As a PMC member, he is involved in making decisions on the direction of the project. Author of the O’Reilly book Using Flume, Hari is also a software engineer at Cloudera, where he works on Apache Flume, Apache Spark, and Apache Sqoop. He also ensures that customers can successfully deploy and manage Flume, Spark, and Sqoop on their clusters, by helping them resolve any issues they are facing.
Spark Streaming & Kafka - The Future of Stream Processing (Jack Gudenkauf)
Hari Shreedharan/Cloudera @Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming, how it integrates natively with Kafka with no data loss, and how to achieve exactly-once processing!
At Improve Digital we collect and store large volumes of machine-generated and behavioural data from our fleet of ad servers. For some time we have performed mostly batch processing through a data warehouse that combines traditional RDBMSs (MySQL), columnar stores (Infobright, Impala+Parquet) and Hadoop.
We wish to share our experiences in enhancing this capability with systems and techniques that process the data as streams in near-realtime. In particular we will cover:
• The architectural need for an approach to data collection and distribution as a first-class capability
• The different needs of the ingest pipeline required by streamed realtime data, the challenges faced in building these pipelines and how they forced us to start thinking about the concept of production-ready data.
• The tools we used, in particular Apache Kafka as the message broker, Apache Samza for stream processing and Apache Avro to allow schema evolution; an essential element to handle data whose formats will change over time.
• The unexpected capabilities enabled by this approach, including the value in using realtime alerting as a strong adjunct to data validation and testing.
• What this has meant for our approach to analytics and how we are moving to online learning and realtime simulation.
This is still a work in progress at Improve Digital with differing levels of production-deployed capability across the topics above. We feel our experiences can help inform others embarking on a similar journey and hopefully allow them to learn from our initiative in this space.
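On the Avro point above: schema evolution works because the reader's schema can supply defaults for fields the writer never knew about. A minimal pure-Python sketch of that resolution rule (illustrative only; the real Avro library does this during decoding, and `resolve`, the field names, and the schema dicts here are invented for the example):

```python
# Sketch of reader-side schema resolution with defaults, in the spirit
# of Avro schema evolution; illustrative pure Python, not the Avro API.
def resolve(record, reader_schema):
    """Project a written record onto the reader's schema:
    drop unknown fields, fill missing ones from defaults."""
    out = {}
    for field in reader_schema["fields"]:
        name, default = field["name"], field.get("default")
        if name in record:
            out[name] = record[name]
        elif default is not None:
            out[name] = default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out

# The v2 schema adds a 'region' field with a default, so v1 records
# written before the change still decode cleanly.
reader_v2 = {"fields": [
    {"name": "user_id"},
    {"name": "clicks"},
    {"name": "region", "default": "unknown"},
]}
old_record = {"user_id": 7, "clicks": 3}  # written with the v1 schema
print(resolve(old_record, reader_v2))
# {'user_id': 7, 'clicks': 3, 'region': 'unknown'}
```

This is why adding fields with defaults is the canonical backward-compatible change: old producers keep working while new consumers see a complete record.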
http://www.oreilly.com/pub/e/3764
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. Monal Daxini details how they used Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. He'll also share plans on offering Stream Processing as a Service for all of Netflix.
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016 (Monal Daxini)
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. We will explore in detail how we leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. We will also share our plans on offering Stream Processing as a Service for all of Netflix.
What no one tells you about writing a streaming app (hadooparchbook)
This document discusses 5 things that are often not addressed when writing streaming applications:
1. Managing and monitoring long-running streaming jobs can be challenging as frameworks were not originally designed for streaming workloads. Options include using cluster mode to ensure jobs continue if clients disconnect and leveraging monitoring tools to track metrics.
2. Preventing data loss requires different approaches depending on the data source. File and receiver-based sources benefit from checkpointing while Kafka's commit log ensures data is not lost.
3. Spark Streaming is well-suited for tasks involving windowing, aggregations, and machine learning but may not be needed for all streaming use cases.
4. Achieving exactly-once semantics requires techniques
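Point 2 above rests on replayability: with a log-based source like Kafka, recovery means committing positions only after processing, then resuming from the last committed position. A toy sketch of that commit-then-resume contract (all names are illustrative; this is not the Kafka client API):

```python
# Sketch of offset-based recovery with an at-least-once contract:
# commit the offset only AFTER the batch is processed, so a crash
# replays (rather than loses) in-flight records. Names are illustrative.
log = ["e0", "e1", "e2", "e3", "e4"]  # stand-in for a Kafka partition
committed = 0                          # durable "checkpoint" position
processed = []

def run(from_offset, crash_after=None):
    global committed
    n = 0
    for offset in range(from_offset, len(log)):
        if crash_after is not None and n == crash_after:
            return                     # simulate a crash mid-stream
        processed.append(log[offset])  # 1. process the record
        committed = offset + 1         # 2. only then advance the commit
        n += 1

run(committed, crash_after=2)          # crashes after e0, e1
run(committed)                         # restart resumes from the commit
print(processed)  # ['e0', 'e1', 'e2', 'e3', 'e4'] with no loss
```

Committing before processing would invert the failure mode: a crash would then skip records (at-most-once) instead of replaying them.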
What No One Tells You About Writing a Streaming App: Spark Summit East talk b... (Spark Summit)
So you know you want to write a streaming app, but any non-trivial streaming app developer has to think through these questions:
How do I manage offsets?
How do I manage state?
How do I make my Spark Streaming job resilient to failures? Can I avoid some failures?
How do I gracefully shut down my streaming job?
How do I monitor and manage (e.g., retry logic) my streaming job?
How can I better manage the DAG in my streaming job?
When should I use checkpointing, and for what? When should I not use it?
Do I need a WAL when using a streaming data source? Why? When don't I need one?
In this talk, we’ll share practices that no one talks about when you start writing your streaming app, but you’ll inevitably need to learn along the way.
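Several of these questions (offsets, state, failure recovery) converge on one well-known pattern: at-least-once delivery combined with an idempotent sink yields effectively-once results. A minimal sketch, with invented names, of why redelivery is harmless under keyed upserts:

```python
# At-least-once delivery + idempotent writes = effectively-once results.
# Keyed upserts make redelivered records harmless. Illustrative only.
class IdempotentSink:
    def __init__(self):
        self.store = {}
    def upsert(self, key, value):
        self.store[key] = value        # redelivery just overwrites

sink = IdempotentSink()
batch = [("user:7", 3), ("user:9", 1)]
for key, value in batch:
    sink.upsert(key, value)
# A failure before the offset commit redelivers the whole batch...
for key, value in batch:
    sink.upsert(key, value)
# ...but the stored result is unchanged:
print(sink.store)  # {'user:7': 3, 'user:9': 1}
```

The same reasoning is why append-only sinks are the hard case: without a natural key to upsert on, exactly-once needs transactions or deduplication instead.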
Intro to Apache Apex - Next Gen Platform for Ingest and Transform (Apache Apex)
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platform and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs.
Similar to Performance Comparison of Streaming Big Data Platforms (20)
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year with the goal of an open, collaborative ecosystem around shared
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many organizations currently process various types of data in different formats. Most often this data will be in free form. As the consumers of this data grow, it's imperative that this free-flowing data adhere to a schema. It helps data consumers form an expectation about the type of data they are getting, and it shields them from immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline a really easy way to integrate with and support various systems that use different data formats.
SchemaRegistry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume it without any impact if the schema changes. Users can tag different schemas and versions, register for notifications of schema changes by version, etc.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
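At its core, the registry described above is a versioned map with a compatibility gate at registration time. A toy in-memory sketch (the real Schema Registry is a networked service with a richer API; the class, method names, and the "new fields need defaults" rule here are a simplified stand-in for backward-compatibility checking):

```python
class SchemaRegistry:
    """Toy versioned schema store with a backward-compatibility check:
    a new version may add fields only if they carry defaults. This is a
    simplified stand-in, not the real Schema Registry API."""
    def __init__(self):
        self.versions = {}  # subject -> list of schema dicts

    def register(self, subject, schema):
        history = self.versions.setdefault(subject, [])
        if history:
            prev_fields = {f["name"] for f in history[-1]["fields"]}
            for f in schema["fields"]:
                if f["name"] not in prev_fields and "default" not in f:
                    raise ValueError(
                        f"incompatible: new field {f['name']!r} has no default")
        history.append(schema)
        return len(history)  # 1-based version number

reg = SchemaRegistry()
v1 = reg.register("clicks", {"fields": [{"name": "user_id"}]})
v2 = reg.register("clicks", {"fields": [{"name": "user_id"},
                                        {"name": "region", "default": "unknown"}]})
print(v1, v2)  # 1 2
```

Rejecting the incompatible schema at registration time is the whole point: producers fail fast at deploy, instead of consumers failing at read.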
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams are commonly believed to be more restricted and hence less accurate than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
Deep learning is not just hype - it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used for detecting anomalies on IoT sensor data streams at high speed, using DeepLearning4J on top of different big data engines like Apache Spark and Apache Flink. Key in this talk is the absence of any large training corpus, since we are using unsupervised machine learning - a domain that current DL research treats step-motherly. As we can see in this demo, LSTM networks can learn very complex system behavior - in this case, data coming from a physical model simulating bearing vibration data. One drawback of deep learning is that normally a very large labeled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with deep learning - no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with ten-fold confidence. All examples and all code will be made publicly available and open-sourced. Only open source components are used.
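The talk's detector is an LSTM network; as a deliberately simpler stand-in for the same unsupervised idea (model normal behavior, flag large deviations, no labels needed), here is a rolling-mean residual check. This is not the method the talk uses, just an illustration of the principle:

```python
from collections import deque

def anomalies(stream, window=5, threshold=3.0):
    """Flag points deviating from the rolling mean by more than
    `threshold` times the rolling std. A deliberately simple stand-in
    for the unsupervised 'learn normal, flag deviation' idea; the talk
    itself uses LSTM networks, not this."""
    buf = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(stream):
        if len(buf) == window:
            mean = sum(buf) / window
            var = sum((v - mean) ** 2 for v in buf) / window
            std = var ** 0.5
            if std > 0 and abs(x - mean) > threshold * std:
                flagged.append(i)
        buf.append(x)
    return flagged

# Steady vibration-like signal with one spike at index 6:
signal = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 9.0, 1.0, 1.1]
print(anomalies(signal))  # [6]
```

An LSTM plays the same role as the rolling mean here, but learns a far richer model of "normal", so it can flag subtle temporal patterns a simple statistic misses.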
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases and cross-cutting components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, false positives outnumber actual defects, and chasing them is generally wasteful.
At Hortonworks, we’ve designed and implemented an Automated Log Analysis System - Mool - using statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline that feeds into the recommendation engine. The system identifies the root cause of test failures by correlating the failing test cases with current and historical error records across multiple components. The system works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file or reopen past tickets, and compares run profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
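The "intelligent key design" point usually comes down to avoiding region hotspots: monotonically increasing row keys (timestamps, sequence IDs) all land in the newest region, so a common remedy is a salt prefix. A sketch of modulo salting (the bucket count and key format are illustrative, not an HBase API):

```python
# Salting row keys so that monotonically increasing keys (e.g.
# timestamps) spread across regions instead of hammering one; the
# bucket count and key format here are illustrative, not an HBase API.
N_BUCKETS = 4

def salted_key(ts):
    bucket = ts % N_BUCKETS      # simple modulo salt; a hash also works
    return f"{bucket:02d}-{ts}"  # scans then need one pass per bucket

keys = [salted_key(ts) for ts in range(1000, 1008)]
buckets = {k.split("-")[0] for k in keys}
print(sorted(buckets))  # ['00', '01', '02', '03'], load spread over 4 buckets
```

The trade-off is exactly the one the abstract hints at: salting spreads write load evenly, but a time-range scan must now fan out across all buckets and merge results.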
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity Scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you might be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly, and what is needed to implement and guarantee continuous operation of Hadoop cluster-based solutions.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Performance Comparison of Streaming Big Data Platforms
1. Performance Comparison of Streaming Big Data Platforms
Reza Farivar
Capital One Inc.
Kyle Knusbaum
Yahoo Inc.
2. Streaming Computation engines
• Designed to process a continuous stream of data.
• Designed to process data with low latency – data (ideally) doesn’t buffer up before
being processed. Contrasts with batch processing - MapReduce.
• Designed to handle big data. The systems are distributed by design.
3. • Apache Storm has the TopologyBuilder API to create a directed graph (topology) through
which streams of data flow.
• “Spouts” are the entry point to the graph, and “bolts” perform the processing.
• Data flows through the system as individual tuples.
• Graphs are not necessarily acyclic (although that is often the case)
(Diagram: a topology reading from a Kafka spout and writing to a database)
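The spout/bolt flow can be illustrated in plain Python (a conceptual stand-in, not the Storm Java `TopologyBuilder` API; the sample data and field names are illustrative):

```python
# Conceptual sketch of a Storm-style dataflow: a "spout" turns raw
# messages into tuples, and a "bolt" processes each tuple in turn.

def kafka_spout(raw_messages):
    """Spout: entry point to the graph; emits tuples downstream."""
    for msg in raw_messages:
        yield (msg["ad_id"], msg["ad_type"])

def filter_bolt(tuples, wanted="click"):
    """Bolt: processing node; keeps only tuples of the wanted type."""
    for ad_id, ad_type in tuples:
        if ad_type == wanted:
            yield ad_id

raw = [{"ad_id": "a1", "ad_type": "click"},
       {"ad_id": "a2", "ad_type": "view"},
       {"ad_id": "a1", "ad_type": "click"}]
clicks = list(filter_bolt(kafka_spout(raw)))  # tuples flow spout -> bolt
```

In real Storm the wiring between spouts and bolts is declared up front as a topology; here the generator chaining plays that role.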
4. • Apache Flink has the DataStream API to perform operations on streams of data. (map,
filter, reduce, join, etc.)
• These operations are turned into a graph at job submission time by Flink.
• Underlying graph works similarly to Storm’s model.
• Also supports a Storm-compatible API
(Diagram: Flink DataStream pipeline writing to a database)
5. • Apache Spark has the DStream API to perform operations on streams of data. (map,
filter, reduce, join, etc.) Based on Spark’s RDD (Resilient Distributed Dataset)
abstraction.
• Similar to Flink’s API.
• Streaming accomplished through micro-batches.
• Spark streaming job consists of one small batch after another.
(Diagram: Spark Streaming as a sequence of small RDD batches flowing to a database)
6. Benchmark
• We would like to compare the platforms, but which benchmark?
– How to compare the relative effectiveness of these systems?
• Throughput (events per second)
• End-to-end latency (How long for an event to get through the system)
• Completeness (Is the computation correct?)
– Current benchmarks did not test with workloads similar to a real world use
case
• Speed of light tests only reveal so much information
• So we created a new benchmark (on github)
– A simple advertisement counting application
– Mimic some common ETL operations on data streams
7. Our Streaming benchmark
• Goal is to correlate latency with throughput.
• Simulation of an advertisement analytics pipeline.
• Must be implemented and run in all three engines.
• Initial data:
– Some number of advertising campaigns.
– Some number of ads per campaign.
• Initial data stored in Redis.
• Our producers read the initial data, and start generating various events. (view, click, purchase)
• Events are then sent to a Kafka cluster.
(Diagram: benchmark event producer feeding events into Kafka)
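The producer side can be sketched roughly as follows. This is a hedged stand-in: the real producers read campaign data from Redis and send to a Kafka cluster, both of which are stubbed out here; only the field names (`ad_id`, `ad_type`, `event_time`) follow the benchmark's event schema.

```python
import json
import random
import time

# Sketch of an event producer: pick a random ad, pick one of the three
# event types (view, click, purchase), and serialize as JSON, as the
# events would be written into Kafka.

def make_event(ad_ids, now_ms):
    return json.dumps({
        "ad_id": random.choice(ad_ids),
        "ad_type": random.choice(["view", "click", "purchase"]),
        "event_time": now_ms,
    })

ad_ids = [f"ad-{i}" for i in range(10)]  # e.g. 10 ads per campaign
event = json.loads(make_event(ad_ids, int(time.time() * 1000)))
```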
9. Measuring Latency
– Windows periodically stored into Redis along with a timestamp of when the window
was written into Redis.
• Application given an SLA (Service-Level Agreement) as part of the simulation,
demanding that tuples be processed in under 1 second.
• The period of writes was chosen to meet the SLA. Writes to Redis were performed once per second; Spark was the exception, writing windows out once per batch.
10. Measuring Latency
• Ten second window
• First event generated
• 10 seconds of events – 10’s of thousands of events
per second
• Last event generated near end of window
• At some point later, the window is written into Redis.
• We know the time of the end of the window,
and the time the window was written.
• This time gives us a data point of latency – length of
time between event generation and being written in
database.
• Events processed late will cause their windows to be
written at a later time, and will be reflected in the
data.
(Timeline diagram: a 10 s window, from the 1st event in the window to the last event in the window; the window data is written into Redis some time later. The gap between the window's end and the Redis write is the latency data point, ideally less than the SLA.)
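The latency measurement described above amounts to one subtraction per window; a minimal sketch:

```python
# The latency data point for a window is the gap between the end of the
# window (when the last event was generated) and the time the window's
# counts were written into Redis.

WINDOW_MS = 10_000  # 10-second windows, as in the benchmark

def latency_ms(window_start_ms, write_time_ms):
    window_end = window_start_ms + WINDOW_MS
    return write_time_ms - window_end

# A window starting at t=0 whose data lands in Redis at t=10.8s yields
# an 800 ms latency point (under the 1 s SLA); late-processed events
# push the write time, and thus the latency, higher.
assert latency_ms(0, 10_800) == 800
```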
11. Our methodology
• Generate a particular throughput of events, then measure the latency.
– Throughputs measured varied between 50,000 events/s and 170,000 events/s
• 100 advertising campaigns
• 10 ads per campaign
• SLA set at 1 second
• 10 second windows
• 5 Kafka nodes with 5 topic partitions
• 1 Redis node
• 3 ZooKeeper nodes (cluster-coordination software)
• 10 worker nodes (doing computation)
• Handful of nodes used by the systems as masters, other non-compute servers.
12. Our methodology
1. Totally clear Kafka of data
2. Populate Redis with initial data
3. Launch the advertising analytics application on Spark, Flink, or Storm
4. Wait a bit for all workers to finish launching
5. Start up producers with instructions to produce tuples at a given rate – this rate determines the throughput.
– Ex: 5 producers writing 10,000 events per second generates a throughput of 50,000 events/s.
6. Let the system run for 30 minutes after starting the producers, then shut the producers down.
7. Run data gathering tool on the Redis database to generate latency points from the windows.
13. Hardware Setup
• Homogeneous nodes, each with two Intel E5530 @2.4GHz, 16 hyperthreading cores per
node
• 24GiB of memory
• Machines on the same rack
• Gigabit Ethernet switch
• The cluster has 40 nodes, 20-25 used in benchmark
• Multiple instances of Kafka producers to create load
– individual producers fall behind at around 17,000 events per second
• The use of 10 workers for a topology is near the average number we see being used by
topologies internal to Yahoo
– The Storm clusters are larger, but multi-tenant & run many topologies
14. About the implementations
• Apache Flink
– Tested 0.10.1-SNAPSHOT (commit hash 7364ce1).
– Application written in Java using the DataStream API.
– Checkpointing – a feature that guarantees at-least-once processing – was disabled.
• Apache Spark
– Tested version 1.5
– Application written in Scala using the DStreams API.
– At-least-once processing not implemented.
• Apache Storm
– Tested both versions 0.10 and 0.11-SNAPSHOT (commit hash a8d253a).
– Application written using the Java API.
– Acking provides at-least-once processing – turned off for high throughputs in 0.11-SNAPSHOT
15. Flink
• Most tuples finished within the 1-second SLA.
• Sharp curve indicates
there was a very small
number of straggling
tuples that were written
into Redis late.
• Red dots mark the 1st, 10th, 25th, 50th, 75th, 90th, 99th, and 100th percentiles.
16. Flink
Late Tuples
• Of late tuples, most were
written within a few
milliseconds of the SLA’s
deadline.
• This emphasizes only a
very small number were
significantly late.
• Beyond about 170,000 events/s, Flink was unable to handle the throughput, and tuples backed up.
17. Spark Streaming
• Benchmark written in Scala, using DStreams (a.k.a streaming RDDs) and direct
Kafka Consumer
• Micro-batching
– different than the pure streaming nature of Storm and Flink
– To meet 1 sec SLA, the batch duration was set to 1 second
• Forced to increase the batch duration for larger throughputs
• Transformations (e.g. maps and filters) applied on the Dstreams
• Joining data with Redis is a special case
– Don't create a separate connection to Redis for each record; instead, use a mapPartitions operation that gives our code control of a whole RDD partition
• Create one connection to Redis and use that single connection to query information from Redis for all the events in that RDD partition.
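The one-connection-per-partition idea can be sketched without Spark. `FakeRedis`, the ad-to-campaign table, and the sample partition are stand-ins; `mapPartitions` semantics are mimicked by handing the function a whole partition at once.

```python
# Sketch of the mapPartitions pattern: because the function receives the
# entire partition (not one record), a single connection can be opened
# once and reused for every record in that partition.

class FakeRedis:
    opened = 0  # counts how many connections were created

    def __init__(self, table):
        FakeRedis.opened += 1
        self.table = table

    def get(self, key):
        return self.table[key]

AD_TO_CAMPAIGN = {"ad-1": "c-1", "ad-2": "c-1", "ad-3": "c-2"}

def join_partition(partition):
    conn = FakeRedis(AD_TO_CAMPAIGN)  # one connection per partition...
    for ad_id, event_time in partition:
        # ...reused for every record's campaign lookup
        yield (conn.get(ad_id), ad_id, event_time)

part = [("ad-1", 100), ("ad-3", 101), ("ad-2", 102)]
joined = list(join_partition(part))
```

With a per-record connection, `opened` would equal the record count; here it stays at 1 no matter how many events the partition holds.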
18. Spark 2-dimensional Parameter Adjustment
• Micro-batch duration
– This is a control dimension that is not present in a pure streaming system like Storm
– Increasing the duration increases latency while reducing overhead and therefore increasing
maximum throughput
– Finding optimal batch duration that minimizes latency while allowing spark to handle the
throughput is a time consuming process
• Set a batch duration, run the benchmark for 30 minutes, check the results decrease/increase the
duration
• Parallelism
– increasing parallelism is easier said than done in Spark
– In a true streaming system like Storm, one bolt instance can send its results to any number of
subsequent bolt instances
– In a micro-batch system like Spark, increasing parallelism requires a reshuffle operation
• similar to how intermediate data in a Hadoop MapReduce program are shuffled and merged across the
cluster.
• But the reshuffling itself introduces considerable overhead.
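The trial-and-error batch-duration tuning described above can be sketched as a search loop. `run_benchmark` is a toy stand-in for a real 30-minute run, with an invented rule ("falls behind when batches are shorter than some threshold, latency grows with batch size"); the real process was manual.

```python
# Sketch of the tuning loop: set a batch duration, run the benchmark,
# and increase the duration until Spark keeps up with the throughput.

def run_benchmark(batch_s, min_stable_s=3):
    """Toy model of one 30-minute run at a given batch duration."""
    falls_behind = batch_s < min_stable_s   # too-short batches back up
    latency_s = batch_s * 1.5               # longer batches -> higher latency
    return falls_behind, latency_s

def tune_batch_duration(start_s=1, max_s=10):
    batch_s = start_s
    while batch_s <= max_s:
        falls_behind, latency = run_benchmark(batch_s)
        if not falls_behind:
            return batch_s, latency         # smallest duration that keeps up
        batch_s += 1                        # increase duration, try again
    raise RuntimeError("no stable batch duration found")

best, latency = tune_batch_duration()
```

The tension the slide describes is visible in the model: the smallest stable duration minimizes latency, but anything below it makes latency unbounded as tuples back up.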
19. Spark
• Spark had more
interesting results than
Flink.
• Due to the micro-batch
design, it was unable to
process events at low
latencies
• The overhead of
scheduling and
launching a task per
batch is very high
• Batch size had to be
increased – this
overcame the launch
overhead.
20. Spark
• If we reduce the batch
duration sufficiently, we
get into a region where
the incoming events are
processed within 3 or 4
subsequent batches.
• The system on the verge
of falling behind, but is
still manageable, and
results in better latency.
21. Spark
Falling behind
• Without increasing the
batch size, Spark was
unable to keep up with
the throughput, tuples
backed up, and latencies
continuously increased
until the job was shut
down.
• After increasing the
batch size, Spark handled
larger throughputs than
either Storm or Flink.
22. Spark
• Tuning the batch size was time-consuming, since it had to be done manually – this was one of the largest
problems we faced in testing Spark’s Streaming capabilities.
• If the batch size was set too high, latency numbers would be bad. If it was set too low, Spark would fall behind,
tuples would back up, and latency numbers would be worse.
• Spark had a new feature at the time called ‘backpressure’ which was supposed to help address this, but we were
unable to make it work properly. In fact, enabling backpressure hindered our numbers in all cases.
23. Storm Results
• Benchmark uses Java API, One worker process per host, each worker has 16 tasks to run in 16
executors - one for each core.
• In 0.11.0, Storm added a simple back pressure controller to avoid the overhead of acking
– In 0.10.0 benchmark topology, acking was used for flow control but not for processing guarantees.
• With acking disabled, Storm even beat Flink for latency at high throughput.
– But no tuple failure handling
(Graphs: Storm 0.10.0 vs. Storm 0.11.0)
24. Storm
• Storm behaved very
similarly to Flink.
• However, Storm was
unable to handle more
than 130,000 events/s
with its acking system
enabled.
• Acking keeps track of
successfully processed
events within Storm.
• With acking disabled,
Storm achieved numbers
similar to Flink at
throughputs up to
170,000 events/s.
25. Storm
Late Tuples
• Similar to Flink’s late
tuple graph.
• Tuples that were late
were slightly less late
than Flink’s.
26. Three-way Comparison
• Flink and Storm have
similar linear
performance profiles
– These two systems
process an incoming
event as it becomes
available
• Spark Streaming has
much higher latency,
but is expected to
handle higher
throughputs
– The system behaves as a step function, a direct result of its micro-batching nature
27. (Graph: 99th-percentile latency vs. throughput for Flink, Spark, and Storm)
• Comparisons of 99-th
percentile latencies are
revealing.
• Storm 0.11 consistently
lower latency than Flink
and Spark.
• Flink’s latency comparable
to Storm 0.10, but
handled higher
throughput with at-least-
once guarantees.
• Spark had the highest
latency, but was able to
handle higher throughput
than either Storm or Flink
28. Future work
• Many variables involved – many we didn’t adjust.
• Applications were not optimized – all were written in a fairly plain manner and configuration
settings were not tweaked
• SLA deadline of 1 second is very low. We did this to test the limits of the low-latency streaming
systems. Higher SLA deadlines are reasonable, and testing those would be worthwhile – likely
showing Spark being highly competitive with the others.
• The throughputs we tested at were incredibly high.
– 170,000 events/s comes to 14,688,000,000 events per day – about 1.4×10^10 events per day
• Didn’t test with exactly-once semantics.
• Ran small tests and checked for correctness of computations, but didn’t check correctness at
large scale.
• There are many more tests that can be run.
• Other streaming engines can be added.
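The per-day figure quoted for the highest throughput is easy to sanity-check:

```python
# 170,000 events/s sustained for a full day.
events_per_sec = 170_000
seconds_per_day = 24 * 60 * 60           # 86,400
events_per_day = events_per_sec * seconds_per_day
assert events_per_day == 14_688_000_000  # about 1.4e10 events per day
```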
29. Conclusions
• The competition between near real time streaming systems is
heating up, and there is no clear winner at this point
• Each of the platforms studied here have their advantages and
disadvantages
• Other important factors:
– Security or integration with tools and libraries
• Active communities for these and other big data processing
projects continue to innovate and benefit from each other’s
advancements
Editor's Notes
Streaming computation engines – what are they.
They are systems designed to process a continuous stream of data.
They are designed to have very low latency. What this means is that – ideally – data gets processed as soon as it reaches the system; it doesn’t buffer up.
This is in contrast to something like Hadoop’s MapReduce, where incoming data goes into a file somewhere, and every couple hours or so a job runs that processes it all in one big batch.
These are so-called “big-data” systems. They’re designed to be distributed and handle massive quantities of data.
We have three of them here that we’re going to look at today.
The first one we’re going to look at is Apache Storm.
Storm’s API gives users tools to create a directed graph, called a topology in Storm, through which data flows. Each node of this graph is a piece of user code that does some processing.
Nodes are either spouts or bolts. Spouts are the entry point to the graph, and bolts perform the processing.
The data moves through the system as individual tuples. It’s the job of the spout to take incoming data and turn it into tuples to pass on to the bolts.
Storm’s graphs are not necessarily acyclic – which is interesting. Most use cases we’ve seen seem to involve acyclic data flows, but it is possible to have cycles.
Flink!
Flink has its DataStream API to perform operations on streams of data, operations like map, filter, reduce, join and so on.
Instead of having the users construct a graph, users just describe what they want to happen to the data, and Flink builds a graph for them.
The underlying graph works very similarly to Storm’s
So similar, that Flink actually built a Storm-compatible API, and they claim you can run unmodified storm applications on Flink.
Spark Streaming!
Spark Streaming has the DStream API to perform operations on streams of data. It is based on Spark’s RDDs, or Resilient Distributed Datasets
The API is super similar to Flink’s
The underlying model, however, is very different than both Storm’s and Flink’s.
Spark’s streaming capabilities are accomplished through something called micro-batching.
Micro-batching is basically just running very small batch jobs in quick succession.
So each one of these RDDs down here would be a tiny batch of data in a spark streaming job.
We used our benchmark to correlate latency and throughput in the systems.
We simulated an advertisement analytics pipeline, which counts clicks in ad-campaigns.
The application needed to be implemented and run in all three engines.
We started out with some initial data, which were some number of advertising campaigns, and some number of ads in each campaign. We made these numbers adjustable.
-
The initial data we stored in a Redis instance.
-
We had some producer processes then read the initial data out of Redis, and begin generating various events for advertisements like views, clicks, and purchases.
-
These events it then sent into Kafka - Kafka is a distributed pub/sub system. Events go into Kafka from publishers and go out of Kafka to subscribers.
The application itself performs operations on each event, and they go like this:
First: deserialize the JSON string and turn it into a native data structure.
Second: Filter the events. We’re only counting clicks in this application, so we drop all events that don’t have an ad_type of “click”.
Third: We take what’s called a projection of the events – That just means we drop all of the fields in the tuple that we aren’t interested in. We’re left with just ad_id and event_time.
If you remember earlier I highlighted three fields that were important. We’re down to two important fields now because we already used ad_type and we’re done with it.
All of our events have the same ad_type now, so we can drop it.
Fourth: Go and pull the campaign_id associated with the ad_id out of Redis. This is part of the initial data that we put into Redis. Join this field into the tuple.
Fifth: Take a windowed count of events per campaign – so we keep track of how many clicks each campaign has gotten in each time window.
Last: Periodically write these windows into Redis – This will be the data we use to calculate latencies.
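The six operations above can be sketched end-to-end in a single process. This is a hedged stand-in: a dict replaces Redis, the window length and field names follow the benchmark, and everything else is illustrative.

```python
import json
from collections import Counter

# Per-event pipeline: deserialize -> filter to clicks -> project to
# (ad_id, event_time) -> join campaign_id -> windowed count per campaign.

WINDOW_MS = 10_000                         # 10-second windows
AD_TO_CAMPAIGN = {"ad-1": "c-1", "ad-2": "c-2"}  # stand-in for Redis

def process(raw_events):
    counts = Counter()
    for raw in raw_events:
        event = json.loads(raw)                          # 1. deserialize JSON
        if event["ad_type"] != "click":                  # 2. filter non-clicks
            continue
        ad_id, t = event["ad_id"], event["event_time"]   # 3. projection
        campaign = AD_TO_CAMPAIGN[ad_id]                 # 4. join campaign_id
        window = t - (t % WINDOW_MS)                     # truncate into a window
        counts[(campaign, window)] += 1                  # 5. windowed count
    return counts           # 6. these windows would be written into Redis

raw = [json.dumps({"ad_id": "ad-1", "ad_type": "click", "event_time": 1_000}),
       json.dumps({"ad_id": "ad-1", "ad_type": "view",  "event_time": 2_000}),
       json.dumps({"ad_id": "ad-2", "ad_type": "click", "event_time": 12_000})]
counts = process(raw)
```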
The system needs to be able to take late events into account – This is just a constraint we put on the application since it’s one we see often in the real world.
As I mentioned, the windows are periodically written into Redis along with a timestamp of when the window was written into Redis.
This last part is important. Each window has a timestamp like this, and it represents when that window was last written into Redis.
The application is given an SLA or Service-Level Agreement as part of the simulation, which says that tuples must be processed completely end-to-end in under 1 second.
This is just another constraint that we put on our application as part of simulating a real-world use case. The 1-second SLA is basically just a target end-to-end latency; it’s what the systems are trying to achieve.
To this end, we had the applications write their windows out once per second. Spark is the exception here. Its computation model doesn’t allow us to write windows out once per second. Instead, we write the windows out once per batch.
Now we actually get to look at how we acquire the latency data.
For our experiment we ran with 10-second windows.
-
In every window the first event is generated basically right when the window begins
-
After that, it’s 10 seconds of events – 10’s of thousands of events per second.
-
The last event is generated very near the end of the window – within microseconds before it.
The last event goes off to be processed…
-
Some time later, the window is written into Redis by the application.
-
Now, we know the time of the end of the window – where the last event was written, and we know the time when the window was written to Redis.
-
This gives us a latency data point. This chunk of time here is the amount of time that passed between the last event’s generation and when it was written into Redis. – This is the end-to-end latency of the application.
-
You can see how events that are processed late will cause their windows to be written at a later time, and will be reflected as higher end-to-end latency in the data.
So that’s how we measure latency. Next
Our methodology for testing was pretty simple.
We have our producers generate a certain event throughput, and then we measure the latency of tuples going through the system.
Throughputs measured varied between 50,000 events per second and 170,000 events per second.
We had…
Steps were:
Now we’re going to look at the benchmark results from each system.
-
First is Flink:
The version we tested was a 0.10.1-SNAPSHOT
We wrote the application using the Java DataStream API.
Checkpointing was disabled – so there were no processing guarantees.
-
Spark:
The version we tested was 1.5
We wrote this one in Scala using the DStreams API.
In addition, we did not implement at-least-once semantics.
-
Storm:
For storm, we tested both versions 0.10 and a 0.11-SNAPSHOT
Application written using the java TopologyBuilder API.
Storm’s acking provides at-least-once processing and flow control, but a new feature allowed us to turn that off for high throughputs in 0.11
Some things we noticed about flink:
Most of the tuples were processed within the 1-second SLA we specified.
The graph here shows percentiles - so the red dots in the middle there are the 50th-percentile mark; 50% of the tuples were in at about 0.75 seconds.
The sharp curve at the end is interesting – shows that a small number were quite late.
Here is a graph of the latency for late tuples in Flink.
Late tuples are the ones that finished processing after the 1 second SLA.
This graph emphasizes that most tuples were on time or very nearly on time. Only a small percentage were late by any significant amount.
Initially, we thought our operations were CPU-bound, and so the benefits of reshuffling to a higher number of partitions would outweigh the cost of reshuffling. Instead, we found the bottleneck to be scheduling, and so reshuffling only added overhead. We suspect that at higher throughput rates or with operations that are CPU-bound, the reverse would be true.
Spark was more difficult to get results out of, but the results were more interesting.
The micro-batching prevented Spark from being able to meet the 1 second SLA for anything but very low throughputs.
This was due to the large overhead of scheduling and launching a task for each micro-batch.
Once we increased the batch size, Spark was able to keep up with the various throughputs.
This graph shows a spark streaming job that’s keeping up with the throughput.
If we didn’t increase the batch size enough, Spark wasn’t able to keep up with the throughput, tuples got backed up and buffered in kafka, and the latency figures increased until the job was killed.
This is a graph of a spark streaming job that’s falling behind in its processing duties, and latencies have grown to almost 70 seconds.
However, after increasing the batch size enough, Spark was able to handle more throughput than either Storm or Flink.
So… Tuning the batch size was very time consuming and frustrating. It was a manual trial-and-error process and was a big obstacle while we were testing Spark.
If the batch size was too high, latency would be high, if batch size was set too low, Spark wouldn’t keep up with the throughput, tuples would back up, and latency would be even higher.
We were trying to get fair numbers out of Spark, so we didn’t just want to turn the batch size way up. We wanted to find the lowest latency we could get for a particular throughput.
When we benchmarked Spark, there was a new feature called “backpressure” which was supposed to help address this difficulty. We tried this, but unfortunately we were unable to get it to improve our latency or prevent Spark falling behind. Instead, Spark’s backpressure actually made our numbers worse whenever we enabled it.
Storm –
Storm had results very similar to Flink. The graphs look almost identical.
The problem we found with Storm was that beyond 130,000 events per second, Storm couldn't keep up with the throughput, tuples backed up, and latencies grew, just like in Spark.
This was caused by the acking system, which keeps track of successfully-processed events within storm, and performs flow control.
A new feature in 0.11 allowed us to disable acking, and it got numbers similar to Flink at throughputs up to 170,000 events per second.
Storm’s late tuple graph is, again, almost identical to Flink’s. There aren’t really any surprises here.
This is a graph comparing the 99-th percentile latencies of the various engines at different throughputs.
We can see Storm 0.11 has consistently lower latency than Flink and Spark.
Flink’s latency is comparable to Storm 0.10’s, but Flink was able to handle more throughput without falling over.
Spark had the highest latency by far, but was able to handle higher throughput than either Storm or Flink.
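For readers unfamiliar with the metric, a 99th-percentile latency like the ones plotted can be computed from raw latency samples with a nearest-rank calculation. This is a minimal sketch, not the benchmark’s own code; the sample values are made up for illustration:

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile: the smallest sample such that at least
    # p% of all samples are less than or equal to it.
    xs = sorted(samples)
    rank = math.ceil(p / 100 * len(xs))
    return xs[rank - 1]

# Ten illustrative latency samples in milliseconds; one bad outlier
# is enough to dominate the p99.
latencies_ms = [120, 95, 300, 110, 105, 98, 2500, 115, 101, 99]
print(percentile(latencies_ms, 99))  # -> 2500
```

This is why p99 is a harsher yardstick than the mean: a single slow window drags it up, which is exactly the behavior the benchmark wants to surface.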
Future work!
So, there are a lot of variables involved and many of them we didn’t adjust.
We didn’t optimize any of the applications. They were written plainly and we didn’t really mess with the configs.
The SLA is important. An SLA of 1 second is extremely low. We did this to try to test the low-latency limits of the low-latency systems.
Many real SLAs are on the order of minutes, and it would be worthwhile to test with those SLAs.
We expect that Spark would be more competitive in these time frames.
The throughputs we tested were incredibly high. Our highest throughput of 170,000 events per second is equivalent to about 1.4 × 10¹⁰ events per day. Most workloads are many orders of magnitude less than that. Writing a benchmark that performs heavier computation on a smaller throughput might better reflect real workloads.
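The daily-rate figure follows from simple arithmetic, since there are 86,400 seconds in a day:

```python
events_per_sec = 170_000
seconds_per_day = 24 * 60 * 60      # 86,400 seconds in a day
events_per_day = events_per_sec * seconds_per_day
print(f"{events_per_day:.2e}")      # -> 1.47e+10, i.e. ~1.4 x 10^10 events/day
```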
We didn’t test exactly-once semantics. This is an important feature, and something that can add a lot of overhead. Testing competing implementations could yield interesting results.
Correctness. We ran some small tests for each of the systems to ensure they were processing data correctly, but we didn’t check correctness when running the benchmarks at full scale.
The project is open source, so you can go run your own tests; there are many, many more possible configurations.
That also means you can add an implementation for your favorite streaming engine. There are a few other popular ones out there.
How do we actually measure the latency?
We start by having the producers write into each event an integer timestamp representing the time of the event’s creation. This becomes the field event_time.
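As a sketch of what that stamping looks like on the producer side (the field name event_time comes from the talk; the other fields and the JSON layout here are illustrative assumptions, not the benchmark’s actual schema):

```python
import json
import time

def make_event(ad_id, event_type):
    # Stamp the event with its creation time. Downstream stages use
    # event_time to assign the event to a window and to compute
    # end-to-end latency.
    return json.dumps({
        "ad_id": ad_id,
        "event_type": event_type,
        "event_time": int(time.time()),  # integer seconds since the epoch
    })

event = json.loads(make_event("ad-42", "view"))
print(sorted(event))  # -> ['ad_id', 'event_time', 'event_type']
```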
We next need to understand how the windowing scheme works.
The window an event belongs in is determined by truncating the event_time of incoming tuples.
If these are timestamps representing seconds, what we have then are 10-second windows of events. So in our example window here, all events with timestamps in the range of 12340 – 12349 seconds will belong to the same window.
Window sizes can be adjusted by truncating more or fewer digits from the timestamps. If you cut off two digits, you end up with 100-second windows. If you don’t cut off any, you end up with 1-second windows.
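The truncation scheme above can be sketched in a few lines. This is a minimal illustration of the idea, not the benchmark’s code; the digits parameter (how many trailing digits to cut) is a name introduced here for clarity:

```python
def window_start(event_time, digits=1):
    # Truncate the last `digits` digits of a seconds timestamp:
    # one digit cut  -> 10-second windows,
    # two digits cut -> 100-second windows,
    # zero digits    -> 1-second windows.
    size = 10 ** digits
    return (event_time // size) * size

# All events with timestamps 12340-12349 land in the same 10-second window.
print(window_start(12345))      # -> 12340
print(window_start(12349))      # -> 12340
print(window_start(12345, 2))   # -> 12300 (100-second window)
```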