Extending the Yahoo Streaming Computation Benchmark to Apache Apex
- Application topology
- Comparison of results between Storm, Flink and Apex
- Variation of the Apex Benchmarking App with event time and 'results query' support
1. Extending the Yahoo Streaming Benchmark for Apache Apex
San Jose Apache Apex Meetup
May 4th, 2016
Sandesh Hegde
sandesh@apache.org
2. Background
• Yahoo created a benchmark for stream processing systems and used it to compare Storm, Flink and Spark Streaming [1]
• dataArtisans extended the benchmark by comparing Flink and Storm under different scenarios [2]
• No stream processing benchmark comparison is complete without Apache Apex.
3. Yahoo Streaming Benchmark
Simple advertisement application: count how many times an ad campaign has been seen in a window.
• Read ads from Kafka
• Deserialize JSON string
• Filter unnecessary ads
• Projection of fields (remove non-essential fields)
• Join ad id with campaign id from Redis
• Windowed count per campaign and output to Redis
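The steps above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not the benchmark's code and not Apex API: the field names `event_type`, `ad_id` and `event_time`, the rule of keeping only "view" events, the 10-second window and the function name are all assumptions, and a dictionary stands in for both Kafka input and the Redis lookup.

```python
import json
from collections import Counter

WINDOW_MS = 10_000  # assumed window length; the slides do not state one


def count_views_per_window(raw_events, ad_to_campaign, window_ms=WINDOW_MS):
    """Return a Counter keyed by (campaign_id, window_start_ms)."""
    counts = Counter()
    for raw in raw_events:                       # stands in for reading from Kafka
        event = json.loads(raw)                  # deserialize the JSON string
        if event.get("event_type") != "view":    # filter (assumed rule: keep views)
            continue
        # Projection: keep only the two fields the later steps need.
        ad_id = event["ad_id"]
        event_time = int(event["event_time"])
        # Join: a dict lookup stands in for the ad id -> campaign id data in Redis.
        campaign_id = ad_to_campaign.get(ad_id)
        if campaign_id is None:
            continue                             # unknown ad, nothing to count
        # Windowed count: bucket the event by the start of its window.
        window_start = event_time - event_time % window_ms
        counts[(campaign_id, window_start)] += 1
    return counts
```

Calling `count_views_per_window(events, {"a1": "c1"})` with a list of JSON strings returns the per-campaign, per-window counts that the real application would write back to Redis.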
8. Quick Primer on Locality
• CONTAINER_LOCAL
■ Deployed in the same process, different threads
■ No serialization
■ Queue between the operators
• THREAD_LOCAL
■ Same thread
■ No serialization
■ Use it only when operators do light work
Note: [New feature] Anti Affinity is not covered here.
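The difference between the two localities can be illustrated with a small, language-neutral sketch. It does not use the Apex API; the operators `double` and `add_one` and both runner functions are hypothetical. The first runner mimics THREAD_LOCAL (downstream is a direct call on the same thread), the second mimics CONTAINER_LOCAL (two threads in one process with a queue in between, objects handed over without serialization).

```python
import queue
import threading


def double(x):       # hypothetical upstream operator
    return x * 2


def add_one(x):      # hypothetical downstream operator
    return x + 1


def run_thread_local(tuples):
    """THREAD_LOCAL style: both operators run on the calling thread."""
    return [add_one(double(t)) for t in tuples]


_END = object()  # sentinel that marks the end of the stream


def run_container_local(tuples):
    """CONTAINER_LOCAL style: same process, one thread per operator, queue between."""
    q = queue.Queue(maxsize=1024)   # bounded queue between the operators
    results = []

    def upstream():
        for t in tuples:
            q.put(double(t))        # the object itself is handed over, not bytes
        q.put(_END)

    def downstream():
        while True:
            item = q.get()
            if item is _END:
                break
            results.append(add_one(item))

    threads = [threading.Thread(target=upstream),
               threading.Thread(target=downstream)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Both runners produce the same results. The queue and the thread hand-off are the extra cost of the second form, which is why the same-thread form suits operators that do only light work.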
9. Benchmarking Against Previous Releases
https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
Part of Release Certification
10. Application - With Kafka
https://github.com/sandeshh/streaming-benchmarks
12. Application - With Generator
https://github.com/sandeshh/streaming-benchmarks
Setup: Single Partition
13. State of the Art & Streaming
[Topology diagram; operators shown: Generator, Filter, Filter Fields, Redis Join, Redis Output]
What is our recommendation for querying the state?
An in-memory key-value store in the operators?
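To make the question concrete, here is a minimal sketch of what an in-memory key-value store inside an operator could look like. The class `WindowedCountOperator`, its methods and the 10-second window are hypothetical and not part of Apex. The state lives only in process memory, so the sketch by itself offers no durability.

```python
from collections import defaultdict


class WindowedCountOperator:
    """Hypothetical operator that keeps its counts in an in-memory map."""

    def __init__(self, window_ms=10_000):   # assumed window length
        self.window_ms = window_ms
        # Key: (campaign_id, window_start_ms); value: number of events seen.
        self._state = defaultdict(int)

    def process(self, campaign_id, event_time_ms):
        """Count one event in the window that contains its event time."""
        window_start = event_time_ms - event_time_ms % self.window_ms
        self._state[(campaign_id, window_start)] += 1

    def query(self, campaign_id):
        """Return {window_start: count} for one campaign, oldest window first."""
        return {
            window: count
            for (cid, window), count in sorted(self._state.items())
            if cid == campaign_id
        }
```

A query for a campaign then reads directly from the operator's map, for example `op.query("c1")` after a few `op.process("c1", t)` calls.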
14. Application - State Store & Query
[Topology diagram; operators shown: Generator, Filter, Filter Fields, Redis Join, Dimensional Computation, Store (HDHT), QueryResult]
1. Durable state (HDHT is a key-value store native to Hadoop) [4]
2. Single system that scales with your application
3. Easy integration with external consoles [7]
4. Low operational cost
5. Complex dimensional computation [5][6]