Extending the Yahoo Streaming Benchmark + MapR Benchmarks


Published in: Engineering

  1. Extending the Yahoo! Streaming Benchmark
     Jamie Grier, @jamiegrier, jamie@data-artisans.com
  2. Who am I?
     • Director of Applications Engineering at data Artisans
     • Previously working on streaming computation at Twitter, Gnip and Boulder Imaging
     • Involved in various kinds of stream processing for about a decade
     • High-speed video, social media streaming, general frameworks for stream processing
  3. Overview
     • Yahoo! performed a benchmark comparing Apache Flink, Storm and Spark
     • The benchmark never actually pushed Flink to its throughput limits but stopped at Storm's limits
     • I knew Flink was capable of much more, so I repeated the benchmarks myself
     • I did a follow-up blog post explaining my findings and will summarize them here
  4. Yahoo! Benchmark
     • Count ad impressions grouped by campaign
     • Compute aggregates over a 10 second window
     • Emit current value of window aggregates to Redis every second for query
     • Map ads to campaigns using Redis as well
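The counting logic described on this slide can be sketched in a few lines of plain Python. This is only an illustration of the Filter / Project / Join / Window steps, not the benchmark's Storm or Flink code: the event field names, the `"view"` event type, the in-memory `AD_TO_CAMPAIGN` dictionary standing in for Redis, and the use of non-overlapping (tumbling) windows are all assumptions made for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # window length taken from the slide

# Hypothetical stand-in for the Redis ad -> campaign lookup.
AD_TO_CAMPAIGN = {"ad-1": "campaign-A", "ad-2": "campaign-A", "ad-3": "campaign-B"}


def count_impressions(events):
    """Count 'view' events per (campaign, window start) in tumbling windows."""
    counts = defaultdict(int)
    for event in events:
        if event["event_type"] != "view":        # Filter: keep impressions only
            continue
        ad_id = event["ad_id"]                   # Project: keep the fields we need
        ts = event["event_time"]
        campaign = AD_TO_CAMPAIGN.get(ad_id)     # Join: map the ad to its campaign
        if campaign is None:
            continue
        window_start = ts - ts % WINDOW_SECONDS  # Window: 10-second buckets
        counts[(campaign, window_start)] += 1
    return dict(counts)


# Hypothetical events with timestamps in whole seconds.
events = [
    {"ad_id": "ad-1", "event_type": "view", "event_time": 1},
    {"ad_id": "ad-2", "event_type": "view", "event_time": 9},
    {"ad_id": "ad-3", "event_type": "click", "event_time": 5},
    {"ad_id": "ad-3", "event_type": "view", "event_time": 12},
    {"ad_id": "ad-1", "event_type": "view", "event_time": 15},
]
result = count_impressions(events)
```

In a real deployment the counts would be written out to Redis once per second, as the slide describes; the sketch only returns them.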
  5. Any questions so far?
  6. Storm Code (code listing not captured in this transcript)
  7. Flink Code (code listing not captured in this transcript)
  8. Hardware Specs
     • 10 Kafka brokers with 2 partitions each
     • 10 compute nodes (Flink / Storm)
     • Each machine has 1 Xeon E3-1230-V2 @ 3.30GHz CPU
     • 4 cores, 8 vCores (hyperthreading)
     • 32 GB RAM (only 8 GB allocated to JVMs)
     • 10 GigE Ethernet between compute nodes
     • 1 GigE Ethernet between Kafka cluster and compute nodes
  9. Logical Deployment (diagram): Data Generator → Kafka → Stream Processor (Source → Filter → Project → Join → Window → Sink); Redis serves the Join and receives the Sink output
  10. Apache Storm Deployment (diagram): Flink Data Generator, Kafka (three boxes), Shuffle, Redis (two boxes); operators Source, Filter, Project, Join, Window, Sink; legend: 10 GigE link, 1 GigE link
  11. (diagram build, same components) Operators: Source, Filter, Project, Join, Window, Sink
  12. (diagram build) Operators: Source / Filter, Project, Join, Window, Sink
  13. (diagram build) Operators: Source / Filter / Project, Join, Window, Sink
  14. (diagram build) Operators: Source / Filter / Project / Join, Window, Sink
  15. (diagram build) Operators: Source / Filter / Project / Join, Window / Sink
  16. (diagram build) Operators: Source / Filter / Project / Join, Window / Sink
  17. Apache Flink Deployment (diagram): Flink Data Generator, Kafka (three boxes), Shuffle, Redis (two boxes); operators Source / Filter / Project / Join and Window / Sink; legend: 10 GigE link, 1 GigE link
  18. Processing Guarantees: Apples and Oranges
      • Apache Storm: at-least-once semantics; double counting after failures; lost state after failures
      • Apache Flink: exactly-once semantics; no double counting; no state loss
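The double-counting difference on this slide can be illustrated with a toy model. This is a sketch under simplifying assumptions, not how Storm or Flink are implemented: a single counter, one simulated crash, and a source that rewinds to the last checkpointed offset. The function name `run` and its parameters are hypothetical. With `restore_state=False` the counter keeps what it had already counted when the source replays (at-least-once); with `restore_state=True` the counter is rolled back together with the offset (exactly-once). The model does not cover the "lost state" row of the slide.

```python
def run(events, checkpoint_at, fail_at, restore_state):
    """Count events, crash once at `fail_at`, then replay from the checkpoint.

    Assumes 0 <= checkpoint_at < fail_at <= len(events).
    """
    count = 0
    saved_count = 0
    # First attempt: process events until the simulated failure.
    for offset in range(fail_at):
        if offset == checkpoint_at:
            saved_count = count      # snapshot taken before this offset is processed
        count += 1
    # Recovery: the source rewinds to the checkpointed offset.
    if restore_state:
        count = saved_count          # state is rolled back together with the offset
    for _ in range(checkpoint_at, len(events)):
        count += 1                   # replayed and remaining events
    return count


events = list(range(10))
at_least_once = run(events, checkpoint_at=5, fail_at=8, restore_state=False)
exactly_once = run(events, checkpoint_at=5, fail_at=8, restore_state=True)
print(at_least_once, exactly_once)  # 13 10
```

Offsets 5 to 7 are processed twice in the first case, which is where the three extra counts come from.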
  19. Benchmark: Baseline (bar chart, throughput in msgs/sec)
      • Storm (Kafka, 1 GigE): 0M
      • Flink (Kafka, 1 GigE): 3M
  20. Bottleneck Analysis: Apache Storm (diagram: Apache Storm deployment with Kafka, Shuffle, Redis; 10 GigE and 1 GigE links)
  21. Bottleneck Analysis: Apache Storm (same diagram, label added: CPU)
  22. Bottleneck Analysis: Apache Flink (diagram: Apache Flink deployment with Kafka, Shuffle, Redis; 10 GigE and 1 GigE links)
  23. Bottleneck Analysis: Apache Flink (same diagram, label added: Network)
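To see why a 1 GigE link can cap throughput, a rough upper bound can be computed from link speed and message size. The helper below is a back-of-the-envelope sketch: the name `max_msgs_per_sec`, the 250-byte message size, and the choice of 1 or 10 parallel links are assumptions for illustration. The slides do not state the message size or how many 1 GigE links carried Kafka traffic, and the calculation ignores protocol overhead.

```python
def max_msgs_per_sec(link_gbit_per_sec, num_links, bytes_per_msg):
    """Upper bound on messages per second that fit through the given links."""
    bytes_per_sec_per_link = link_gbit_per_sec * 1_000_000_000 // 8
    return bytes_per_sec_per_link * num_links // bytes_per_msg


# Hypothetical 250-byte messages over a single 1 GigE link, then over ten.
single_link = max_msgs_per_sec(1, 1, 250)
ten_links = max_msgs_per_sec(1, 10, 250)
print(single_link, ten_links)  # 500000 5000000
```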
  24. Eliminate the Bottleneck (diagram: Apache Flink deployment with Flink Data Generator, Kafka, Shuffle, Redis; 10 GigE and 1 GigE links)
  25. Eliminate the Bottleneck (diagram build: Kafka no longer shown; Flink Data Generator, Shuffle, Redis, Source / Filter / Project / Join, Window / Sink)
  26. Eliminate the Bottleneck (diagram build: Data Generator, Shuffle, Redis, Source / Filter / Project / Join, Window / Sink)
  27. Apache Flink Deployment, Round 2 (diagram: Data Generator, Shuffle, Redis, Source / Filter / Project / Join, Window / Sink; 10 GigE and 1 GigE links)
  28. Benchmark: Baseline (bar chart, throughput in msgs/sec)
      • Storm (Kafka, 1 GigE): 0M
      • Flink (Kafka, 1 GigE): 3M
  29. Benchmark: Round 2 (bar chart, throughput in msgs/sec, 10 GigE end-to-end)
      • Storm (Kafka, 1 GigE): 0M
      • Flink (Kafka, 1 GigE): 3M
      • Flink (DataGen, 10 GigE): 15M
  30. Results
      • Apache Flink achieved 15 million messages / sec on Yahoo! benchmark
      • Much stronger processing guarantees: Exactly once
      • 80x higher than what was reported in the original Yahoo! benchmark on similar hardware
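Taken together, the 15 million messages / sec and the 80x figure imply a baseline throughput for the original benchmark. The short calculation below derives that number from the two values on the slide; it is a back-of-the-envelope figure, not one quoted from Yahoo!'s report.

```python
flink_msgs_per_sec = 15_000_000  # from the slide
speedup = 80                     # from the slide

# Throughput the original benchmark would have reported if the 80x ratio is exact.
implied_baseline = flink_msgs_per_sec // speedup
print(implied_baseline)  # 187500
```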
  31. Questions?
  32. Apache Flink and MapR Streams (diagram): MapR Cluster running MapR Streams (three boxes); Data Generator, Shuffle, Redis (two boxes); operators Source / Filter / Project / Join and Window / Sink; legend: 10 GigE link
  33. MapR Benchmark Hardware Specs
      • 10 MapR nodes, 3X data replication
      • Each node has 1 Xeon E5-2660-v3 @ 2.60GHz CPU
      • 10 cores, 20 vCores (hyperthreading)
      • 16 vCores used for Flink on each node
      • 256 GB RAM (only 8 GB allocated to Flink)
      • 40 GigE Ethernet between compute nodes
  34. Benchmarking on MapR HPC Cluster (bar chart, throughput in msgs/sec, 40 GigE end-to-end): 10 million msgs/sec (with 3x replication)
  35. Benchmarking on MapR HPC Cluster (bar chart, throughput in msgs/sec, 40 GigE end-to-end)
      • Flink (MapR Streams): 10M
      • Flink (w/ Data Generator): 72M
  36. Benchmarking Summary (bar chart, throughput in msgs/sec)
      • Storm (Kafka, 1 GigE): 0M
      • Flink (Kafka, 1 GigE): 3M
      • Flink (MapR, 40 GigE): 10M
      • Flink (DataGen, 10 GigE): 15M
      • Flink (DataGen, 40 GigE): 72M
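The summary chart is easier to compare as ratios. The sketch below uses the rounded bar labels from the slides (in millions of msgs/sec) and expresses each Flink result relative to the Flink (Kafka, 1 GigE) baseline. Storm is left out because its bar is labeled 0M after rounding, so a ratio from that label would be meaningless. The variable names are illustrative.

```python
# Rounded chart labels from the slides, in millions of msgs/sec.
results_millions = {
    "Flink (Kafka, 1 GigE)": 3,
    "Flink (MapR, 40 GigE)": 10,
    "Flink (DataGen, 10 GigE)": 15,
    "Flink (DataGen, 40 GigE)": 72,
}

baseline = results_millions["Flink (Kafka, 1 GigE)"]
speedups = {name: value / baseline for name, value in results_millions.items()}

# On the MapR cluster: direct data generation versus reading from MapR Streams.
datagen_vs_mapr = (
    results_millions["Flink (DataGen, 40 GigE)"]
    / results_millions["Flink (MapR, 40 GigE)"]
)

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

Because the inputs are rounded labels read off the charts, the ratios are approximate.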
  37. What’s missing? (bar chart, throughput in msgs/sec)
      • Flink (Kafka, 10 GigE): ???
      • Flink (Kafka, 40 GigE): ???
  38. Results
      • Apache Flink achieved 10 million messages / sec on Yahoo! benchmark when paired with MapR Streams and a high-performance 10-node cluster
      • On the same cluster hardware Apache Flink achieved 72 million messages / sec when using direct data generation
  39. Storm Compatibility
      • Lots of companies already have applications written using the Storm API
      • Flink provides a Storm compatibility layer
      • Run your Storm jobs on Flink with a one-line code change
      • Flink also allows you to reuse your existing Storm spout and bolt code from a Flink job
      • Give it a try!
  40. Thanks to MapR! Special thanks to: Terry He, Ted Dunning
