Extending the Yahoo! Streaming Benchmark
Jamie Grier
@jamiegrier
jamie@data-artisans.com
Who am I?
• Director of Applications Engineering at data Artisans
• Previously working on streaming computation at Twitter, Gnip and Boulder Imaging
• Involved in various kinds of stream processing for about a decade
• High-speed video, social media streaming, general frameworks for stream processing
Overview
• Yahoo! performed a benchmark comparing Apache Flink, Storm and Spark
• The benchmark never actually pushed Flink to its throughput limits but stopped at Storm's limits
• I knew Flink was capable of much more, so I repeated the benchmarks myself
• I wrote a follow-up blog post explaining my findings and will summarize them here
Yahoo! Benchmark
• Count ad impressions grouped by campaign
• Compute aggregates over a 10 second window
• Emit the current value of the window aggregates to Redis every second for querying
• Map ads to campaigns using Redis as well
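As a rough illustration of the job described above, the sketch below counts events per campaign in 10-second tumbling windows. It is not the benchmark's code: the event shape `(ad_id, event_time_ms)`, the function names, and the in-memory dict standing in for the Redis ad-to-campaign lookup are all assumptions, and the once-per-second emit to Redis is left out.

```python
from collections import defaultdict

WINDOW_MS = 10_000  # 10-second window, as stated on the slide


def window_start(event_time_ms, window_ms=WINDOW_MS):
    """Return the start timestamp of the tumbling window containing the event."""
    return event_time_ms - (event_time_ms % window_ms)


def count_impressions(events, ad_to_campaign, window_ms=WINDOW_MS):
    """Count events per (campaign, window start).

    `events` is an iterable of (ad_id, event_time_ms) pairs (assumed shape).
    `ad_to_campaign` is a plain dict standing in for the Redis lookup.
    """
    counts = defaultdict(int)
    for ad_id, event_time_ms in events:
        campaign = ad_to_campaign.get(ad_id)
        if campaign is None:
            continue  # unknown ad: skipped in this sketch
        counts[(campaign, window_start(event_time_ms, window_ms))] += 1
    return dict(counts)


# Hypothetical sample data
events = [("ad1", 1_000), ("ad2", 4_500), ("ad1", 9_999), ("ad3", 10_000)]
ad_to_campaign = {"ad1": "c1", "ad2": "c1", "ad3": "c2"}
result = count_impressions(events, ad_to_campaign)
# result == {("c1", 0): 3, ("c2", 10000): 1}
```

The first three events fall into the window starting at 0 and map to campaign `c1`; the fourth lands exactly on the next window boundary.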
Any questions so far?
Storm Code
Flink Code
Hardware Specs
• 10 Kafka brokers with 2 partitions each
• 10 compute nodes (Flink / Storm)
• Each machine has 1 Xeon E3-1230-V2@3.30GHz CPU
• 4 cores w/ hyperthreading
• 32 GB RAM (only 8GB allocated to JVMs)
• 10 GigE Ethernet between compute nodes
• 1 GigE Ethernet between Kafka cluster and compute nodes
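The link speeds matter because they put a ceiling on how many messages can reach the compute nodes. A back-of-the-envelope sketch follows. The message size is an assumption (the slides do not give one), and protocol overhead and the number of links used in parallel are ignored.

```python
def max_messages_per_sec(link_gbits_per_sec, message_bytes):
    """Upper bound on messages per second over a single link, ignoring overhead."""
    bits_per_message = message_bytes * 8
    return link_gbits_per_sec * 1_000_000_000 / bits_per_message


ASSUMED_MESSAGE_BYTES = 250  # hypothetical size, not taken from the slides

one_gige = max_messages_per_sec(1, ASSUMED_MESSAGE_BYTES)   # 500,000 msgs/sec
ten_gige = max_messages_per_sec(10, ASSUMED_MESSAGE_BYTES)  # 5,000,000 msgs/sec
```

Whatever the real message size is, the ratio stays the same: a 10 GigE link can carry ten times as many messages as a 1 GigE link.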
Logical Deployment
[Diagram: Data Generator → Kafka → Stream Processor (Source → Filter → Project → Join → Window → Sink) → Redis. The Join step also reads from Redis.]
Apache Storm Deployment
[Diagram: Data Generator, Kafka brokers and Redis around an Apache Storm topology with separate Source, Filter, Project, Join, Window and Sink stages and a Shuffle step. Legend: 10 GigE link, 1 GigE link.]
Apache Flink Deployment
[Diagram build across several slides: starting from the same layout, Source, Filter, Project and Join are merged step by step into a single Source / Filter / Project / Join stage, and Window and Sink into a single Window / Sink stage, with a Shuffle step. Data Generator, Kafka brokers and Redis are unchanged. Legend: 10 GigE link, 1 GigE link.]
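The grouping in the Flink deployment diagram can be read as operator chaining: Source, Filter, Project and Join run back to back in one task, and records cross a task boundary only at the Shuffle before Window / Sink. The sketch below illustrates that idea in plain Python and is not Flink code. The event fields (`type`, `ad_id`, `ts`), the `"view"` filter condition, and the function names are assumptions.

```python
import zlib


def chained_task(raw_events, ad_to_campaign):
    """Source -> Filter -> Project -> Join fused into one pass over each event."""
    for event in raw_events:                     # source
        if event["type"] != "view":              # filter (assumed condition)
            continue
        ad_id, ts = event["ad_id"], event["ts"]  # project: keep only needed fields
        campaign = ad_to_campaign.get(ad_id)     # join: dict stands in for Redis
        if campaign is not None:
            yield campaign, ts


def shuffle(records, parallelism):
    """Hash-partition by campaign so each key reaches exactly one window task."""
    partitions = [[] for _ in range(parallelism)]
    for campaign, ts in records:
        index = zlib.crc32(campaign.encode("utf-8")) % parallelism
        partitions[index].append((campaign, ts))
    return partitions


# Hypothetical sample data
raw_events = [
    {"type": "view", "ad_id": "ad1", "ts": 1_000},
    {"type": "click", "ad_id": "ad1", "ts": 2_000},
    {"type": "view", "ad_id": "ad2", "ts": 3_000},
    {"type": "view", "ad_id": "ad1", "ts": 4_000},
]
ad_to_campaign = {"ad1": "c1", "ad2": "c2"}
joined = list(chained_task(raw_events, ad_to_campaign))
partitions = shuffle(joined, parallelism=4)
```

Within `chained_task` each event is handed from step to step as a local variable. The hash partitioning in `shuffle` is the only point where a record would need to move to another task.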
Processing Guarantees
Apples and Oranges

Apache Storm                    | Apache Flink
At least once semantics         | Exactly once semantics
Double counting after failures  | No double counting
Lost state after failures       | No state loss
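To make the double-counting row concrete, here is a toy simulation of a counter that fails once and replays its input from the last checkpoint. It is a simplified model and not how either system is implemented: the function name, the checkpoint interval and the failure point are all made up for illustration, and state loss is not modeled.

```python
def count_with_one_failure(events, fail_at, checkpoint_every, rollback_state):
    """Count events, crash once at index `fail_at`, then replay from the last checkpoint.

    rollback_state=True:  the count is restored together with the input offset,
                          so replayed events are not counted twice.
    rollback_state=False: only the input offset is rewound, so replayed events
                          are counted again.
    """
    count = 0
    checkpoint_offset, checkpoint_count = 0, 0
    offset = 0
    failed = False
    while offset < len(events):
        if not failed and offset == fail_at:
            failed = True
            offset = checkpoint_offset    # source replays from the checkpointed offset
            if rollback_state:
                count = checkpoint_count  # state restored consistently with the offset
            continue
        count += 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint_offset, checkpoint_count = offset, count
    return count


events = list(range(10))
at_least_once = count_with_one_failure(events, fail_at=7, checkpoint_every=5, rollback_state=False)
exactly_once = count_with_one_failure(events, fail_at=7, checkpoint_every=5, rollback_state=True)
# at_least_once == 12 (events 5 and 6 counted twice), exactly_once == 10
```

With ten input events, the run that rewinds only the offset reports 12, while the run that restores state and offset together reports the correct 10.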
Benchmark: Baseline
[Bar chart: throughput in msgs/sec for Storm and Flink; axis from 0 to 3,750,000.]
Bottleneck Analysis: Apache Storm
[Diagram: the Apache Storm deployment, repeated with the bottleneck labeled CPU.]
Bottleneck Analysis: Apache Flink
[Diagram: the Apache Flink deployment, repeated with the bottleneck labeled Network.]
Eliminate the Bottleneck
[Diagram build: the Kafka brokers are removed from the Apache Flink deployment.]
Apache Flink Deployment: Round 2
[Diagram: Data Generator, Source / Filter / Project / Join, Shuffle, Window / Sink and Redis, with no Kafka brokers. Legend: 10 GigE link, 1 GigE link.]
Benchmark: Round 2
[Bar chart: throughput in msgs/sec for Storm, Flink and Flink (10 GigE), the last being 10 GigE end-to-end; axis from 0 to 16,000,000.]
Results
• Apache Flink achieved 15 million messages / sec on the Yahoo! benchmark
• Much stronger processing guarantees: exactly once
• Throughput 80x higher than what was reported in the original Yahoo! benchmark on similar hardware
Questions?
Storm Compatibility
• Lots of companies already have applications written using the Storm API
• Flink provides a Storm compatibility layer
• Run your Storm jobs on Flink with a one-line code change
• Flink also allows you to reuse your existing Storm spout and bolt code from a Flink job
• Give it a try!
Thanks!
