SlideShare a Scribd company logo
Nexmark with Beam
Evaluating Big Data systems
with Apache Beam
Etienne Chauchot, Ismaël Mejía. Talend
1
Who are we?
2
Agenda
1. Big Data Benchmarking
a. State of the art
b. NEXMark: A benchmark over continuous data streams
2. Nexmark on Apache Beam
a. Introducing Beam
b. Advantages of using Beam for benchmarking
c. Implementation
d. Nexmark + Beam: a win-win story
e. Neutral benchmarking: a difficult issue
f. Example: Running Nexmark on Spark
3. Current state and future work
3
Big Data benchmarking
4
Benchmarking
Why do we benchmark?
1. Performance
2. Correctness
Benchmark suites steps:
1. Generate data
2. Compute data
3. Measure performance
4. Validate results
Types of benchmarks
● Microbenchmarks
● Functional
● Business case
● Data Mining / Machine Learning
5
Issues of Benchmarking Suites for Big Data
● No standard suite: Terasort, TPCx-HS (Hadoop), HiBench, ...
● No common model/API: Strongly tied to each processing engine or SQL
● Too focused on Hadoop infrastructure
● Mixed Benchmarks for storage/processing
● Few Benchmarking suites support streaming
6
State of the art
Batch
● Terasoft: Sort random data
● TPCx-HS: Sort to measure Hadoop compatible distributions
● TPC-DS on Spark: TPC-DS business case with Spark SQL
● Berkeley Big Data Benchmark: SQL-like queries on Hive, Redshift, Impala
● HiBench* and BigBench
Streaming
● Yahoo Streaming Benchmark
* HiBench includes also some streaming / windowing benchmarks
7
Nexmark
Benchmark for queries over data streams
Business case: Online Auction System
Research paper draft 2004
Example:
Query 4: What is the average selling price for each auction category?
Query 8: Who has entered the system and created an auction in the last period?
Auction
Person
Seller
Person
Bidder
Person
Bidder
Bid
8
Item
Nexmark on Google Dataflow
● Port of SQL style queries described in the NEXMark research paper to
Google Cloud Dataflow by Mark Shields and others at Google.
● Enriched queries set with Google Cloud Dataflow client use cases
● Used as rich integration testing scenario on the Google Cloud Dataflow
9
Nexmark on Beam
10
Apache Beam
Beam Model: Fn Runners
Apache
Flink
Apache
Spark
Beam Model: Pipeline Construction
Other
LanguagesBeam Java
Beam
Python
Execution Execution
Cloud
Dataflow
Execution
1. The Beam Programming Model
2. SDKs for writing Beam pipelines (Java/Python)
3. Runners for existing distributed processing
backends
11
The Beam Model: What is Being Computed?
12
Event Time: timestamp when the event happened
Processing Time: wall clock absolute program time
The Beam Model: Where in Event Time?
Event Time
Processing
Time
12:0212:00 12:1012:0812:0612:04
12:0212:00 12:1012:0812:0612:04
Input
Output
13
● Split infinite data into finite chunks
The Beam Model: Where in Event Time?
14
Apache Beam Pipeline concepts
Data processing Pipeline
(executed via a Beam runner)
PTransform PTransform
Read
(Source)
Write
(Sink)
Input
PCollection
Window per min Count
Output
HDFS
15
GroupByKey
CoGroupByKey
Combine -> Reduce
Sum
Count
Min / Max
Mean
...
ParDo -> DoFn
MapElements
FlatMapElements
Filter
WithKeys
Keys
Values
Windowing/Triggers
Windows
FixedWindows
GlobalWindows
SlidingWindows
Sessions
Triggers
AfterWatermark
AfterProcessingTime
Repeatedly
Element-wise Grouping
Apache Beam - Programming Model
16
Nexmark on Apache Beam
● Nexmark was ported from Dataflow to Beam 0.2.0 as an integration test case
● Refactored to most recent Beam version
● Made code more generic to support all the Beam runners
● Changed some queries to use new APIs
● Validated queries in all the runners to test their support of the Beam model
17
Advantages of using Beam for benchmarking
● Rich model: all use cases that we had could be expressed using Beam API
● Can test both batch and streaming modes with exactly the same code
● Multiple runners: queries can be executed on Beam supported runners
(provided that the given runner supports the used features)
● Monitoring features (metrics)
18
Implementation
19
Components of Nexmark
● Generator:
○ generation of timestamped events (bids, persons, auctions)
correlated between each other
● NexmarkLauncher:
○ creates sources that use the generator
○ queries pipelines launching, monitoring
● Output metrics:
○ Each query includes ParDos to update metrics
○ execution time, processing event rate, number of results,
but also invalid auctions/bids, …
● Modes:
○ Batch mode: test data is finite and uses a BoundedSource
○ Streaming mode: test data is finite but uses an
UnboundedSource to trigger streaming mode in runners
20
Some of the queries
21
Query Description Use of Beam model
3 Who is selling in particular US states? Join, State, Timer
5 Which auctions have seen the most bids in the last period? Sliding Window, Combiners
6 What is the average selling price per seller for their last 10
closed auctions?
Global Window, Custom Combiner
7 What are the highest bids per period? Fixed Windows, Side Input
9 * Winning bids Custom Window
11 * How many bids did a user make in each session he was active? Session Window, Triggering
12 * How many bids does a user make within a fixed processing time
limit?
Global Window, working in
Processing Time
*: not in original NexMark paper
Query structure
22
1. Get PCollection<Event> as input
2. Apply ParDo + Filter to extract object of interest: Bids, Auctions, Person(s)
3. Apply transforms: Filter, Count, GroupByKey, Window, etc.
4. Apply ParDo to output the final PCollection: collection of AuctionPrice,
AuctionCount ...
Key point: When to compute data?
23
● Windows: divide data into event-time-based finite chunks.
○ Often required when doing aggregations over unbounded data
Key point: When to compute data?
● Triggers: condition to fire computation
● Default trigger: at the end of the window
● Required when working on unbouded data in Global Window
● Q11: trigger fires when 20 elements were received
24
Key point: When to compute data?
● Q12: Trigger fired when first element is received + delay
(works in processing in global window time to create a duration)
● Processing time: wall clock absolute program time
● Event time: timestamp in which the event occurred
25
Key point: How to make a join?
● CoGroupByKey (in Q3, Q9): groups values of PCollections<KV> that share the same key
○ Q3: Join Auctions and Persons by their person id and tag them
26
Key point: How to temporarily group events?
● Custom window function (in Q9)
○ As CoGroupByKey is per window, need
to put bids and auctions in the same
window before joining them.
27
Key point: How to deal with out of order events?
● State and Timer APIs in an incremental join (Q3):
○ Memorize person event waiting for corresponding auctions and clear at timer
○ Memorize auction events waiting for corresponding person event
28
Key point: How to tweak reduction phase?
Custom combiner (in Q6) to be
able to specify
1. how elements are added to
accumulators
2. how accumulators merge
3. how to extract final data
to calculate the average price of the
last 3 closed auctions
29
Conclusion on queries
● Wide coverage of the Beam API
○ Most of the API
○ Illustrates also working in processing time
● Realistic
○ Real use cases, valid queries for an end user auction system
○ Extra queries inspired by Google Cloud Dataflow client use cases
● Complex queries
○ Leverage all the runners capabilities
30
Beam + Nexmark = A win-win story
● Streaming test
● A/B testing of big data execution engines (regression and performance
comparison between 2 versions of the same engine or of the same runner, ...)
● Integration testing (SDK with runners, runners with engines, …)
● Validate Beam runners capability matrix
31
Benchmarking results
32
Neutral Benchmarking: A difficult issue
● Different levels of support of capabilities of the Beam model among runners
● All execution systems have different strengths: we would end up comparing
things that are not always comparable
○ Some runners were designed to be batch oriented, others streaming oriented
○ Some are designed towards sub-second latency, others support auto-scaling
● Runners / Systems can have multiple knobs to tweak the options
● Benchmarking on a distributed environment can be inconsistent.
● Even worse if you benchmark on the cloud (e.g. Noisy neighbors)
33
Nexmark - How to run
mvn exec:java -Dexec.mainClass=org.apache.beam.integration.nexmark.Main -Pflink-runner
"-Dexec.args=--runner=FlinkRunner --suite=SMOKE --streamTimeout=60 --streaming=true
--manageResources=false --monitorJobs=true --flinkMaster=tbd-bench"
mvn exec:java -Dexec.mainClass=org.apache.beam.integration.nexmark.Main -Pspark-runner
"-Dexec.args=--runner=SparkRunner --suite=SMOKE --streamTimeout=60 --streaming=false
--manageResources=false --monitorJobs=true --sparkMaster=local"
spark-submit --master yarn-client --class org.apache.beam.integration.nexmark.Main
--driver-memory 512m --executor-memory 512m --executor-cores 1
/home/imejia/beam-integration-java-nexmark-bundled-0.7.0-SNAPSHOT.jar --runner=SparkRunner
--query=5 --streamTimeout=60 --streaming=true --manageResources=false
34
Benchmark workload configuration
Events generation (defaults)
● 100 000 events generated
● 100 generator threads
● Event rate in SIN curve
● Initial event rate of 10 000
● Event rate step of 10 000
● 100 concurrent auctions
● 1000 concurrent persons bidding / creating auctions
Windows
● size 10s
● sliding period 5s
● watermark hold for 0s
Events Proportions:
● Hot Auctions = ½
● Hot Bidders =¼
● Hot Sellers=¼
Technical
● Artificial CPU load
● Artificial IO load
35
Nexmark Output - Spark Runner (Batch)
Conf Runtime(sec) Events(/sec) Results
0000 3.8 26267.4 100000
0001 3.5 28232.6 92000
0002 3.6 27964.2 713
0003 0 0.0 0
0004 10.0 10006.0 50
0005 5.8 17214.7 3
0006 9.4 10642.8 1631
0007 7.4 13539.1 1
0000 7.2 13861.9 6000
0009 9.5 10517.5 5243
0010 5.9 16877.6 1
0011 5.8 17388.3 1992
0012 5.5 18181.8 1992
36
Nexmark Output - Spark Runner (Streaming)
Conf Runtime(sec) Events(/sec) Results
0000 1.0 10256.1 100000
0001 1.3 7722.1 92000
0002 0.7 14705.8 713
0003 0 0.0 0
0004 17.3 5779.7 50
0005 16.6 6020.8 3
0006 26.5 3773.4 1631
0007 0 0.0 0
0008 12.3 8142.0 6000
0009 17.7 5650.0 5243
0010 13.1 768.8 1
0011 10.0 9962.1 1992
0012 10.2 9783.8 1992
37
Comparing different versions of the Spark engine
38
Current Status
Future work
39
Execution Matrix
40
Batch
Streaming
* Metrics support
** waitUntilFinish fix
Current state
41
● Manage Nexmark issues in a dedicated place
● 5 open issues / 46 closed issues
Current state
42
● Nexmark helped discover bugs and missing features in Beam
● 9 open issues / 6 closed issues on Beam upstream
● Nexmark PR is opened, expected to be merged in Beam master soon
Future work
● Resolve open Nexmark and Beam issues
● Add more queries to evaluate corner cases
● Validate new runners: Gearpump, Storm, JStorm
● Integrate Nexmark into the current set of tests of Beam
● Streaming SQL-based queries (using the ongoing work on Calcite DSL)
43
Contribute
You are welcome to contribute!
● 5 open Github issues and 9 Beam Jiras that need to be taken care of
● Improve documentation + more refactoring
● New ideas, more queries, support for IOs, etc
44
Greetings
● Mark Shields (Google): Contributing Nexmark + answering our questions.
● Thomas Groh, Kenneth Knowles (Google): Direct runner + State/Timer API.
● Amit Sela, Aviem Zur (Paypal): Spark Runner + Metrics.
● Aljoscha Krettek (data Artisans), Jinsong Lee (Ali Baba): Flink Runner.
● Jean-Baptiste Onofre, Abbass Marouni (Talend): comments and help to run
Nexmark in our YARN cluster.
● The rest of the Beam community in general for being awesome.
45
References
Apache Beam
NEXMark
BEAM-160
Nexmark on Beam Issues
Big Data Benchmarks
46
Questions?
47
Thanks
48
Addendum
49
Query 3: Who is selling in particular US states?
● Illustrates incremental join of the auctions and the persons collections
● uses global window and using per-key state and timer APIs
○ Apply global window to events with trigger repeatedly after at least nbEvents in pane =>
results will be materialized each time nbEvents are received.
○ input1: collection of auctions events filtered by category and keyed by seller id
○ input2: collection of persons events filtered by US state codes and keyed by person id
○ CoGroupByKey to group auctions and persons by personId/sellerId + tags to distinguish
persons and auctions
○ ParDo to do the incremental join: auctions and person events can arrive out of order
■ person element stored in persistent state in order to match future auctions by that
person. Set a timer to clear the person state after a TTL
■ auction elements stored in persistent state until we have seen the corresponding person
record. Then, it can be output and cleared
○ output NameCityStateId(person.name, person.city, person.state, auction.id) objects
50
Query 5: Which auctions have seen the most bids in
the last period?
● Illustrates sliding windows and combiners (i.e. reducers) to compare the
elements in auctions Collection
○ Input: (sliding) window (to have a result over 1h period updated every 1 min) collection of
bids events
○ ParDo to replace bid elements by their auction id
○ Count.PerElement to count the occurrences of each auction id
○ Combine.globally to select only the auctions with the maximum number of bids
■ BinaryCombineFn to compare one to one the elements of the collection (auction id
occurrences, i.e. number of bids)
■ Return KV(auction id, max occurrences)
○ output: AuctionCount(auction id, max occurrences) objects
51
Query 6: What is the average selling price per seller
for their last 10 closed auctions?
● Illustrates specialized combiner that allows specifying the combine process of
the bids
○ input: winning-bids keyed by seller id
○ GlobalWindow + trigerring at each element (to have a continuous flow of updates at each
new winning-bid)
○ Combine.perKey to calculate average price of last 10 winning bids for each seller.
■ create Arraylist accumulators for chunks of data
■ add all elements of the chunks to the accumulators, sort them by bid timeStamp then
price keeping last 10 elements
■ iteratively merge the accumulators until there is only one: just add all bids of all
accumulators to a final accumulator and sort by timeStamp then price keeping last 10
elements
■ extractOutput: sum all the prices of the bids and divide by accumulator size
○ output SellerPrice(sellerId, avgPrice) object 52
Query 7: What are the highest bids per period?
● Could have been implemented with a combiner like query5 but deliberately
implemented using Max(prices) as a side input and illustrate fanout.
● Fanout is a redistribution using an intermediate implicit combine step to
reduce the load in the final step of the Max transform
○ input: (fixed) windowed collection of 100 000 bids events
○ ParDo to replace bids by their price
○ Max.withFanout to get the max per window and use it as a side input for next step. Fanout is
useful if there are many events to be computed in a window using the Max transform.
○ ParDo on the bids with side input to output the bid if bid.price equals maxPrice (that comes
from side input)
53
Query 9 Winning-bids (not part of original Nexmark):
extract the most recent of the highest bids
● Illustrates custom window function to reconcile auctions and bids + join them
○ input: collection of events
○ Apply custom windowing function to temporarily reconcile auctions and bids events in the same
custom window (AuctionOrBidWindow)
■ assign auctions to window [auction.timestamp, auction.expiring]
■ assign bids to window [bid.timestamp, bid.timestamp + expectedAuctionDuration (generator
configuration parameter)]
■ merge all 'bid' windows into their corresponding 'auction' window, provided the auction has not
expired.
○ Filter + ParDos to extract auctions out of events and key them by auction id
○ Filter + ParDos to extract bids out of events and key them by auction id
○ CogroupByKey (groups values of PCollections<KV> that share the same key) to group auctions and
bids by auction id + tags to distinguish auctions and bids
○ ParDo to
■ determine best bid price: verification of valid bid, sort prices by price ASC then time DESC
and keep the max price
■ and output AuctionBid(auction, bestBid) objects
54
Query 11 (not part of original Nexmark): How many
bids did a user make in each session he was active?
● Illustrates session windows + triggering on the bids collection
○ input: collection of bids events
○ ParDo to replace bids with their bidder id
○ Apply session windows with gap duration = windowDuration (configuration item) and trigger
repeatedly after at least nbEvents in pane => each window (i.e. session) will contain bid ids
received since last windowDuration period of inactivity and materialized every nbEvents bids
○ Count.perElement to count bids per bidder id (number of occurrences of bidder id)
○ output idsPerSession(bidder, bidsCount) objects
55
Query 12 (not part of original Nexmark): How many
bids does a user make within a fixed processing time
limit?
● Illustrates working in processing time in the Global window to count
occurrences of bidder
○ input: collection of bid events
○ ParDo to replace bids by their bidder id
○ Apply global window with trigger repeatedly after processingTime pass the first element in
pane + windowDuration (configuration item) => each pane will contain elements processed
within windowDuration time
○ Count.perElement to count bids per bidder id (occurrences of bidder id)
○ output BidsPerWindow(bidder, bidsCount) objects
56

More Related Content

What's hot

Gelly in Apache Flink Bay Area Meetup
Gelly in Apache Flink Bay Area MeetupGelly in Apache Flink Bay Area Meetup
Gelly in Apache Flink Bay Area Meetup
Vasia Kalavri
 
Design and Implementation of the Security Graph Language
Design and Implementation of the Security Graph LanguageDesign and Implementation of the Security Graph Language
Design and Implementation of the Security Graph Language
Asankhaya Sharma
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
Flink Forward
 
Predictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache FlinkPredictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache Flink
Dongwon Kim
 
FlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache FlinkFlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache Flink
Theodoros Vasiloudis
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin Meetup
Márton Balassi
 
Flink Forward Berlin 2017: Boris Lublinsky, Stavros Kontopoulos - Introducing...
Flink Forward Berlin 2017: Boris Lublinsky, Stavros Kontopoulos - Introducing...Flink Forward Berlin 2017: Boris Lublinsky, Stavros Kontopoulos - Introducing...
Flink Forward Berlin 2017: Boris Lublinsky, Stavros Kontopoulos - Introducing...
Flink Forward
 
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Vasia Kalavri
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Somnath Mazumdar
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
ucelebi
 
Pulsar connector on flink 1.14
Pulsar connector on flink 1.14Pulsar connector on flink 1.14
Pulsar connector on flink 1.14
宇帆 盛
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
datamantra
 
Javantura v3 - Going Reactive with RxJava – Hrvoje Crnjak
Javantura v3 - Going Reactive with RxJava – Hrvoje CrnjakJavantura v3 - Going Reactive with RxJava – Hrvoje Crnjak
Javantura v3 - Going Reactive with RxJava – Hrvoje Crnjak
HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association
 
Real-time driving score service using Flink
Real-time driving score service using FlinkReal-time driving score service using Flink
Real-time driving score service using Flink
Dongwon Kim
 
Flink Connector Development Tips & Tricks
Flink Connector Development Tips & TricksFlink Connector Development Tips & Tricks
Flink Connector Development Tips & Tricks
Eron Wright
 
Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink Meetup
Stephan Ewen
 
Java 8 streams
Java 8 streams Java 8 streams
Java 8 streams
Srinivasan Raghvan
 
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Till Rohrmann
 
Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...
Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...
Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...
Edward Willink
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
datamantra
 

What's hot (20)

Gelly in Apache Flink Bay Area Meetup
Gelly in Apache Flink Bay Area MeetupGelly in Apache Flink Bay Area Meetup
Gelly in Apache Flink Bay Area Meetup
 
Design and Implementation of the Security Graph Language
Design and Implementation of the Security Graph LanguageDesign and Implementation of the Security Graph Language
Design and Implementation of the Security Graph Language
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Predictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache FlinkPredictive Maintenance with Deep Learning and Apache Flink
Predictive Maintenance with Deep Learning and Apache Flink
 
FlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache FlinkFlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache Flink
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin Meetup
 
Flink Forward Berlin 2017: Boris Lublinsky, Stavros Kontopoulos - Introducing...
Flink Forward Berlin 2017: Boris Lublinsky, Stavros Kontopoulos - Introducing...Flink Forward Berlin 2017: Boris Lublinsky, Stavros Kontopoulos - Introducing...
Flink Forward Berlin 2017: Boris Lublinsky, Stavros Kontopoulos - Introducing...
 
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
 
Pulsar connector on flink 1.14
Pulsar connector on flink 1.14Pulsar connector on flink 1.14
Pulsar connector on flink 1.14
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Javantura v3 - Going Reactive with RxJava – Hrvoje Crnjak
Javantura v3 - Going Reactive with RxJava – Hrvoje CrnjakJavantura v3 - Going Reactive with RxJava – Hrvoje Crnjak
Javantura v3 - Going Reactive with RxJava – Hrvoje Crnjak
 
Real-time driving score service using Flink
Real-time driving score service using FlinkReal-time driving score service using Flink
Real-time driving score service using Flink
 
Flink Connector Development Tips & Tricks
Flink Connector Development Tips & TricksFlink Connector Development Tips & Tricks
Flink Connector Development Tips & Tricks
 
Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink Meetup
 
Java 8 streams
Java 8 streams Java 8 streams
Java 8 streams
 
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
 
Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...
Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...
Local Optimizations in Eclipse QVTc and QVTr using the Micro-Mapping Model of...
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 

Similar to Nexmark with beam

Measurement .Net Performance with BenchmarkDotNet
Measurement .Net Performance with BenchmarkDotNetMeasurement .Net Performance with BenchmarkDotNet
Measurement .Net Performance with BenchmarkDotNet
Vasyl Senko
 
Kubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesKubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slides
Weaveworks
 
Continuous Performance Testing
Continuous Performance TestingContinuous Performance Testing
Continuous Performance Testing
C4Media
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3
LibbySchulze
 
Performance Modeling of Serverless Computing Platforms - CASCON2020 Workshop ...
Performance Modeling of Serverless Computing Platforms - CASCON2020 Workshop ...Performance Modeling of Serverless Computing Platforms - CASCON2020 Workshop ...
Performance Modeling of Serverless Computing Platforms - CASCON2020 Workshop ...
Nima Mahmoudi
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performance
confluent
 
Kubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the Datacenter
Kevin Lynch
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open Source
DataWorks Summit/Hadoop Summit
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Data Con LA
 
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6Temporal Performance Modelling of Serverless Computing Platforms - WoSC6
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6
Nima Mahmoudi
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
NETWAYS
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
Giridhar Addepalli
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuning
Yosuke Mizutani
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Apex
 
Revealing ALLSTOCKER
Revealing ALLSTOCKERRevealing ALLSTOCKER
Revealing ALLSTOCKER
Masashi Umezawa
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
Omid Vahdaty
 
Etienne chauchot spark structured streaming runner
Etienne chauchot spark structured streaming runnerEtienne chauchot spark structured streaming runner
Etienne chauchot spark structured streaming runner
Etienne Chauchot
 
Web automation in BDD
Web automation in BDDWeb automation in BDD
Web automation in BDD
Sandy Yu
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
Sub Szabolcs Feczak
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
Apache Apex
 

Similar to Nexmark with beam (20)

Measurement .Net Performance with BenchmarkDotNet
Measurement .Net Performance with BenchmarkDotNetMeasurement .Net Performance with BenchmarkDotNet
Measurement .Net Performance with BenchmarkDotNet
 
Kubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesKubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slides
 
Continuous Performance Testing
Continuous Performance TestingContinuous Performance Testing
Continuous Performance Testing
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3
 
Performance Modeling of Serverless Computing Platforms - CASCON2020 Workshop ...
Performance Modeling of Serverless Computing Platforms - CASCON2020 Workshop ...Performance Modeling of Serverless Computing Platforms - CASCON2020 Workshop ...
Performance Modeling of Serverless Computing Platforms - CASCON2020 Workshop ...
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performance
 
Kubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the Datacenter
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open Source
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6Temporal Performance Modelling of Serverless Computing Platforms - WoSC6
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuning
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
Revealing ALLSTOCKER
Revealing ALLSTOCKERRevealing ALLSTOCKER
Revealing ALLSTOCKER
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
 
Etienne chauchot spark structured streaming runner
Etienne chauchot spark structured streaming runnerEtienne chauchot spark structured streaming runner
Etienne chauchot spark structured streaming runner
 
Web automation in BDD
Web automation in BDDWeb automation in BDD
Web automation in BDD
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
 

Recently uploaded

Transformers design and coooling methods
Transformers design and coooling methodsTransformers design and coooling methods
Transformers design and coooling methods
Roger Rozario
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
ydzowc
 
Digital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptxDigital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptx
aryanpankaj78
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
nedcocy
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
Paris Salesforce Developer Group
 
OOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming languageOOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming language
PreethaV16
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
ijseajournal
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
ijaia
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
VANDANAMOHANGOUDA
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
ElakkiaU
 
SCALING OF MOS CIRCUITS m .pptx
SCALING OF MOS CIRCUITS m                 .pptxSCALING OF MOS CIRCUITS m                 .pptx
SCALING OF MOS CIRCUITS m .pptx
harshapolam10
 
Zener Diode and its V-I Characteristics and Applications
Zener Diode and its V-I Characteristics and ApplicationsZener Diode and its V-I Characteristics and Applications
Zener Diode and its V-I Characteristics and Applications
Shiny Christobel
 
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
MadhavJungKarki
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
DharmaBanothu
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
PreethaV16
 
Power Electronics- AC -AC Converters.pptx
Power Electronics- AC -AC Converters.pptxPower Electronics- AC -AC Converters.pptx
Power Electronics- AC -AC Converters.pptx
Poornima D
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
upoux
 
Ericsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.pptEricsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.ppt
wafawafa52
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
Divyanshu
 
Bituminous road construction project based learning report
Bituminous road construction project based learning reportBituminous road construction project based learning report
Bituminous road construction project based learning report
CE19KaushlendraKumar
 

Recently uploaded (20)

Transformers design and coooling methods
Transformers design and coooling methodsTransformers design and coooling methods
Transformers design and coooling methods
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
 
Digital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptxDigital Twins Computer Networking Paper Presentation.pptx
Digital Twins Computer Networking Paper Presentation.pptx
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
 
OOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming languageOOPS_Lab_Manual - programs using C++ programming language
OOPS_Lab_Manual - programs using C++ programming language
 
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
 
An Introduction to the Compiler Designss
An Introduction to the Compiler DesignssAn Introduction to the Compiler Designss
An Introduction to the Compiler Designss
 
SCALING OF MOS CIRCUITS m .pptx
SCALING OF MOS CIRCUITS m                 .pptxSCALING OF MOS CIRCUITS m                 .pptx
SCALING OF MOS CIRCUITS m .pptx
 
Zener Diode and its V-I Characteristics and Applications
Zener Diode and its V-I Characteristics and ApplicationsZener Diode and its V-I Characteristics and Applications
Zener Diode and its V-I Characteristics and Applications
 
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
1FIDIC-CONSTRUCTION-CONTRACT-2ND-ED-2017-RED-BOOK.pdf
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
 
Power Electronics- AC -AC Converters.pptx
Power Electronics- AC -AC Converters.pptxPower Electronics- AC -AC Converters.pptx
Power Electronics- AC -AC Converters.pptx
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
 
Ericsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.pptEricsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.ppt
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
 
Bituminous road construction project based learning report
Bituminous road construction project based learning reportBituminous road construction project based learning report
Bituminous road construction project based learning report
 

Nexmark with beam

  • 1. Nexmark with Beam Evaluating Big Data systems with Apache Beam Etienne Chauchot, Ismaël Mejía. Talend 1
  • 3. Agenda 1. Big Data Benchmarking a. State of the art b. NEXMark: A benchmark over continuous data streams 2. Nexmark on Apache Beam a. Introducing Beam b. Advantages of using Beam for benchmarking c. Implementation d. Nexmark + Beam: a win-win story e. Neutral benchmarking: a difficult issue f. Example: Running Nexmark on Spark 3. Current state and future work 3
  • 5. Benchmarking Why do we benchmark? 1. Performance 2. Correctness Benchmark suites steps: 1. Generate data 2. Compute data 3. Measure performance 4. Validate results Types of benchmarks ● Microbenchmarks ● Functional ● Business case ● Data Mining / Machine Learning 5
  • 6. Issues of Benchmarking Suites for Big Data ● No standard suite: Terasort, TPCx-HS (Hadoop), HiBench, ... ● No common model/API: Strongly tied to each processing engine or SQL ● Too focused on Hadoop infrastructure ● Mixed Benchmarks for storage/processing ● Few Benchmarking suites support streaming 6
  • 7. State of the art Batch ● Terasoft: Sort random data ● TPCx-HS: Sort to measure Hadoop compatible distributions ● TPC-DS on Spark: TPC-DS business case with Spark SQL ● Berkeley Big Data Benchmark: SQL-like queries on Hive, Redshift, Impala ● HiBench* and BigBench Streaming ● Yahoo Streaming Benchmark * HiBench includes also some streaming / windowing benchmarks 7
  • 8. Nexmark Benchmark for queries over data streams Business case: Online Auction System Research paper draft 2004 Example: Query 4: What is the average selling price for each auction category? Query 8: Who has entered the system and created an auction in the last period? Auction Person Seller Person Bidder Person Bidder Bid 8 Item
  • 9. Nexmark on Google Dataflow ● Port of SQL style queries described in the NEXMark research paper to Google Cloud Dataflow by Mark Shields and others at Google. ● Enriched queries set with Google Cloud Dataflow client use cases ● Used as rich integration testing scenario on the Google Cloud Dataflow 9
  • 11. Apache Beam Beam Model: Fn Runners Apache Flink Apache Spark Beam Model: Pipeline Construction Other LanguagesBeam Java Beam Python Execution Execution Cloud Dataflow Execution 1. The Beam Programming Model 2. SDKs for writing Beam pipelines (Java/Python) 3. Runners for existing distributed processing backends 11
  • 12. The Beam Model: What is Being Computed? 12 Event Time: timestamp when the event happened Processing Time: wall clock absolute program time
  • 13. The Beam Model: Where in Event Time? Event Time Processing Time 12:0212:00 12:1012:0812:0612:04 12:0212:00 12:1012:0812:0612:04 Input Output 13 ● Split infinite data into finite chunks
  • 14. The Beam Model: Where in Event Time? 14
  • 15. Apache Beam Pipeline concepts Data processing Pipeline (executed via a Beam runner) PTransform PTransform Read (Source) Write (Sink) Input PCollection Window per min Count Output HDFS 15
  • 16. GroupByKey CoGroupByKey Combine -> Reduce Sum Count Min / Max Mean ... ParDo -> DoFn MapElements FlatMapElements Filter WithKeys Keys Values Windowing/Triggers Windows FixedWindows GlobalWindows SlidingWindows Sessions Triggers AfterWatermark AfterProcessingTime Repeatedly Element-wise Grouping Apache Beam - Programming Model 16
  • 17. Nexmark on Apache Beam ● Nexmark was ported from Dataflow to Beam 0.2.0 as an integration test case ● Refactored to most recent Beam version ● Made code more generic to support all the Beam runners ● Changed some queries to use new APIs ● Validated queries in all the runners to test their support of the Beam model 17
  • 18. Advantages of using Beam for benchmarking ● Rich model: all use cases that we had could be expressed using Beam API ● Can test both batch and streaming modes with exactly the same code ● Multiple runners: queries can be executed on Beam supported runners (provided that the given runner supports the used features) ● Monitoring features (metrics) 18
  • 20. Components of Nexmark ● Generator: ○ generation of timestamped events (bids, persons, auctions) correlated between each other ● NexmarkLauncher: ○ creates sources that use the generator ○ queries pipelines launching, monitoring ● Output metrics: ○ Each query includes ParDos to update metrics ○ execution time, processing event rate, number of results, but also invalid auctions/bids, … ● Modes: ○ Batch mode: test data is finite and uses a BoundedSource ○ Streaming mode: test data is finite but uses an UnboundedSource to trigger streaming mode in runners 20
  • 21. Some of the queries 21 Query Description Use of Beam model 3 Who is selling in particular US states? Join, State, Timer 5 Which auctions have seen the most bids in the last period? Sliding Window, Combiners 6 What is the average selling price per seller for their last 10 closed auctions? Global Window, Custom Combiner 7 What are the highest bids per period? Fixed Windows, Side Input 9 * Winning bids Custom Window 11 * How many bids did a user make in each session he was active? Session Window, Triggering 12 * How many bids does a user make within a fixed processing time limit? Global Window, working in Processing Time *: not in original NexMark paper
  • 22. Query structure 22 1. Get PCollection<Event> as input 2. Apply ParDo + Filter to extract object of interest: Bids, Auctions, Person(s) 3. Apply transforms: Filter, Count, GroupByKey, Window, etc. 4. Apply ParDo to output the final PCollection: collection of AuctionPrice, AuctionCount ...
  • 23. Key point: When to compute data? 23 ● Windows: divide data into event-time-based finite chunks. ○ Often required when doing aggregations over unbounded data
  • 24. Key point: When to compute data? ● Triggers: condition to fire computation ● Default trigger: at the end of the window ● Required when working on unbouded data in Global Window ● Q11: trigger fires when 20 elements were received 24
  • 25. Key point: When to compute data? ● Q12: Trigger fired when first element is received + delay (works in processing in global window time to create a duration) ● Processing time: wall clock absolute program time ● Event time: timestamp in which the event occurred 25
  • 26. Key point: How to make a join? ● CoGroupByKey (in Q3, Q9): groups values of PCollections<KV> that share the same key ○ Q3: Join Auctions and Persons by their person id and tag them 26
  • 27. Key point: How to temporarily group events? ● Custom window function (in Q9) ○ As CoGroupByKey is per window, need to put bids and auctions in the same window before joining them. 27
  • 28. Key point: How to deal with out of order events? ● State and Timer APIs in an incremental join (Q3): ○ Memorize person event waiting for corresponding auctions and clear at timer ○ Memorize auction events waiting for corresponding person event 28
  • 29. Key point: How to tweak reduction phase? Custom combiner (in Q6) to be able to specify 1. how elements are added to accumulators 2. how accumulators merge 3. how to extract final data to calculate the average price of the last 3 closed auctions 29
  • 30. Conclusion on queries ● Wide coverage of the Beam API ○ Most of the API ○ Illustrates also working in processing time ● Realistic ○ Real use cases, valid queries for an end user auction system ○ Extra queries inspired by Google Cloud Dataflow client use cases ● Complex queries ○ Leverage all the runners capabilities 30
  • 31. Beam + Nexmark = A win-win story ● Streaming test ● A/B testing of big data execution engines (regression and performance comparison between 2 versions of the same engine or of the same runner, ...) ● Integration testing (SDK with runners, runners with engines, …) ● Validate Beam runners capability matrix 31
  • 33. Neutral Benchmarking: A difficult issue ● Different levels of support of capabilities of the Beam model among runners ● All execution systems have different strengths: we would end up comparing things that are not always comparable ○ Some runners were designed to be batch oriented, others streaming oriented ○ Some are designed towards sub-second latency, others support auto-scaling ● Runners / Systems can have multiple knobs to tweak the options ● Benchmarking on a distributed environment can be inconsistent. ● Even worse if you benchmark on the cloud (e.g. Noisy neighbors) 33
  • 34. Nexmark - How to run mvn exec:java -Dexec.mainClass=org.apache.beam.integration.nexmark.Main -Pflink-runner "-Dexec.args=--runner=FlinkRunner --suite=SMOKE --streamTimeout=60 --streaming=true --manageResources=false --monitorJobs=true --flinkMaster=tbd-bench" mvn exec:java -Dexec.mainClass=org.apache.beam.integration.nexmark.Main -Pspark-runner "-Dexec.args=--runner=SparkRunner --suite=SMOKE --streamTimeout=60 --streaming=false --manageResources=false --monitorJobs=true --sparkMaster=local" spark-submit --master yarn-client --class org.apache.beam.integration.nexmark.Main --driver-memory 512m --executor-memory 512m --executor-cores 1 /home/imejia/beam-integration-java-nexmark-bundled-0.7.0-SNAPSHOT.jar --runner=SparkRunner --query=5 --streamTimeout=60 --streaming=true --manageResources=false 34
  • 35. Benchmark workload configuration Events generation (defaults) ● 100 000 events generated ● 100 generator threads ● Event rate in SIN curve ● Initial event rate of 10 000 ● Event rate step of 10 000 ● 100 concurrent auctions ● 1000 concurrent persons bidding / creating auctions Windows ● size 10s ● sliding period 5s ● watermark hold for 0s Events Proportions: ● Hot Auctions = ½ ● Hot Bidders =¼ ● Hot Sellers=¼ Technical ● Artificial CPU load ● Artificial IO load 35
  • 36. Nexmark Output - Spark Runner (Batch) Conf Runtime(sec) Events(/sec) Results 0000 3.8 26267.4 100000 0001 3.5 28232.6 92000 0002 3.6 27964.2 713 0003 0 0.0 0 0004 10.0 10006.0 50 0005 5.8 17214.7 3 0006 9.4 10642.8 1631 0007 7.4 13539.1 1 0000 7.2 13861.9 6000 0009 9.5 10517.5 5243 0010 5.9 16877.6 1 0011 5.8 17388.3 1992 0012 5.5 18181.8 1992 36
  • 37. Nexmark Output - Spark Runner (Streaming) Conf Runtime(sec) Events(/sec) Results 0000 1.0 10256.1 100000 0001 1.3 7722.1 92000 0002 0.7 14705.8 713 0003 0 0.0 0 0004 17.3 5779.7 50 0005 16.6 6020.8 3 0006 26.5 3773.4 1631 0007 0 0.0 0 0008 12.3 8142.0 6000 0009 17.7 5650.0 5243 0010 13.1 768.8 1 0011 10.0 9962.1 1992 0012 10.2 9783.8 1992 37
  • 38. Comparing different versions of the Spark engine 38
  • 40. Execution Matrix 40 Batch Streaming * Metrics support ** waitUntilFinish fix
  • 41. Current state 41 ● Manage Nexmark issues in a dedicated place ● 5 open issues / 46 closed issues
  • 42. Current state 42 ● Nexmark helped discover bugs and missing features in Beam ● 9 open issues / 6 closed issues on Beam upstream ● Nexmark PR is opened, expected to be merged in Beam master soon
  • 43. Future work ● Resolve open Nexmark and Beam issues ● Add more queries to evaluate corner cases ● Validate new runners: Gearpump, Storm, JStorm ● Integrate Nexmark into the current set of tests of Beam ● Streaming SQL-based queries (using the ongoing work on Calcite DSL) 43
  • 44. Contribute You are welcome to contribute! ● 5 open Github issues and 9 Beam Jiras that need to be taken care of ● Improve documentation + more refactoring ● New ideas, more queries, support for IOs, etc 44
  • 45. Greetings ● Mark Shields (Google): Contributing Nexmark + answering our questions. ● Thomas Groh, Kenneth Knowles (Google): Direct runner + State/Timer API. ● Amit Sela, Aviem Zur (Paypal): Spark Runner + Metrics. ● Aljoscha Krettek (data Artisans), Jinsong Lee (Ali Baba): Flink Runner. ● Jean-Baptiste Onofre, Abbass Marouni (Talend): comments and help to run Nexmark in our YARN cluster. ● The rest of the Beam community in general for being awesome. 45
  • 46. References Apache Beam NEXMark BEAM-160 Nexmark on Beam Issues Big Data Benchmarks 46
  • 50. Query 3: Who is selling in particular US states? ● Illustrates incremental join of the auctions and the persons collections ● uses global window and using per-key state and timer APIs ○ Apply global window to events with trigger repeatedly after at least nbEvents in pane => results will be materialized each time nbEvents are received. ○ input1: collection of auctions events filtered by category and keyed by seller id ○ input2: collection of persons events filtered by US state codes and keyed by person id ○ CoGroupByKey to group auctions and persons by personId/sellerId + tags to distinguish persons and auctions ○ ParDo to do the incremental join: auctions and person events can arrive out of order ■ person element stored in persistent state in order to match future auctions by that person. Set a timer to clear the person state after a TTL ■ auction elements stored in persistent state until we have seen the corresponding person record. Then, it can be output and cleared ○ output NameCityStateId(person.name, person.city, person.state, auction.id) objects 50
  • 51. Query 5: Which auctions have seen the most bids in the last period? ● Illustrates sliding windows and combiners (i.e. reducers) to compare the elements in auctions Collection ○ Input: (sliding) window (to have a result over 1h period updated every 1 min) collection of bids events ○ ParDo to replace bid elements by their auction id ○ Count.PerElement to count the occurrences of each auction id ○ Combine.globally to select only the auctions with the maximum number of bids ■ BinaryCombineFn to compare one to one the elements of the collection (auction id occurrences, i.e. number of bids) ■ Return KV(auction id, max occurrences) ○ output: AuctionCount(auction id, max occurrences) objects 51
  • 52. Query 6: What is the average selling price per seller for their last 10 closed auctions? ● Illustrates specialized combiner that allows specifying the combine process of the bids ○ input: winning-bids keyed by seller id ○ GlobalWindow + trigerring at each element (to have a continuous flow of updates at each new winning-bid) ○ Combine.perKey to calculate average price of last 10 winning bids for each seller. ■ create Arraylist accumulators for chunks of data ■ add all elements of the chunks to the accumulators, sort them by bid timeStamp then price keeping last 10 elements ■ iteratively merge the accumulators until there is only one: just add all bids of all accumulators to a final accumulator and sort by timeStamp then price keeping last 10 elements ■ extractOutput: sum all the prices of the bids and divide by accumulator size ○ output SellerPrice(sellerId, avgPrice) object 52
  • 53. Query 7: What are the highest bids per period? ● Could have been implemented with a combiner like query5 but deliberately implemented using Max(prices) as a side input and illustrate fanout. ● Fanout is a redistribution using an intermediate implicit combine step to reduce the load in the final step of the Max transform ○ input: (fixed) windowed collection of 100 000 bids events ○ ParDo to replace bids by their price ○ Max.withFanout to get the max per window and use it as a side input for next step. Fanout is useful if there are many events to be computed in a window using the Max transform. ○ ParDo on the bids with side input to output the bid if bid.price equals maxPrice (that comes from side input) 53
  • 54. Query 9 Winning-bids (not part of original Nexmark): extract the most recent of the highest bids ● Illustrates custom window function to reconcile auctions and bids + join them ○ input: collection of events ○ Apply custom windowing function to temporarily reconcile auctions and bids events in the same custom window (AuctionOrBidWindow) ■ assign auctions to window [auction.timestamp, auction.expiring] ■ assign bids to window [bid.timestamp, bid.timestamp + expectedAuctionDuration (generator configuration parameter)] ■ merge all 'bid' windows into their corresponding 'auction' window, provided the auction has not expired. ○ Filter + ParDos to extract auctions out of events and key them by auction id ○ Filter + ParDos to extract bids out of events and key them by auction id ○ CogroupByKey (groups values of PCollections<KV> that share the same key) to group auctions and bids by auction id + tags to distinguish auctions and bids ○ ParDo to ■ determine best bid price: verification of valid bid, sort prices by price ASC then time DESC and keep the max price ■ and output AuctionBid(auction, bestBid) objects 54
  • 55. Query 11 (not part of original Nexmark): How many bids did a user make in each session he was active? ● Illustrates session windows + triggering on the bids collection ○ input: collection of bids events ○ ParDo to replace bids with their bidder id ○ Apply session windows with gap duration = windowDuration (configuration item) and trigger repeatedly after at least nbEvents in pane => each window (i.e. session) will contain bid ids received since last windowDuration period of inactivity and materialized every nbEvents bids ○ Count.perElement to count bids per bidder id (number of occurrences of bidder id) ○ output idsPerSession(bidder, bidsCount) objects 55
  • 56. Query 12 (not part of original Nexmark): How many bids does a user make within a fixed processing time limit? ● Illustrates working in processing time in the Global window to count occurrences of bidder ○ input: collection of bid events ○ ParDo to replace bids by their bidder id ○ Apply global window with trigger repeatedly after processingTime pass the first element in pane + windowDuration (configuration item) => each pane will contain elements processed within windowDuration time ○ Count.perElement to count bids per bidder id (occurrences of bidder id) ○ output BidsPerWindow(bidder, bidsCount) objects 56