9. April 2018
Exploratory Analysis of Spark Structured Streaming [Work-in-Progress]
Todor Ivanov todor@dbis.cs.uni-frankfurt.de
Jason Taaffe jasontaaffe@yahoo.de
Goethe University Frankfurt am Main, Germany
http://www.bigdata.uni-frankfurt.de/
Motivation
• Increasing popularity and variety of Stream Processing Engines such as Spark Streaming,
Flink Streaming, Samza, etc.
• Growing support for stream SQL features such as Spark Structured Streaming, Flink SQL,
Samza SQL, etc.
Research Objectives
• Evaluate the features of Spark Structured Streaming
• Prototype a benchmark for testing stream SQL based on BigBench
• Perform experiments validating the prototype benchmark and exploring the Spark
Structured Streaming capabilities/features
PABS 2018, Berlin, Germany, April 9
Spark Structured Streaming (introduced in Spark 2.0)
● Structured Streaming provides fast, scalable, fault-tolerant, end-to-end
exactly-once stream processing without the user having to reason about
streaming.
● built and executed on top of the Spark SQL engine
● use the Dataset/DataFrame API in Scala, Java, Python or R to express
streaming aggregations, event-time windows, stream-to-batch joins, etc
● Micro-batch processing mode
○ processes data streams as a series of small batch jobs thereby
achieving end-to-end latencies as low as 100 milliseconds and
exactly-once fault-tolerance guarantees
● Continuous processing mode (introduced in Spark 2.3)
○ achieve end-to-end latencies as low as 1 millisecond with
at-least-once guarantees (future work)
Source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Spark Structured Streaming (2)
“The key idea in Structured Streaming is to treat a live data stream as a table that is being
continuously appended. [...] Every data item that is arriving on the stream is like a row being
appended to the Input Table.”
[https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]
“... at any time, the output of the application is equivalent to executing a batch job on a prefix of
the data.” [https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html]
Trigger Execution Metrics
Structured Streaming Micro-Benchmark
● Popular streaming benchmarks such as HiBench, SparkBench and Yahoo Streaming
Benchmark
→ do not stress test the streaming SQL capabilities of engines.
● Idea: use BigBench [1,2] as a base for new streaming SQL micro-benchmark
● Simplified two-phase benchmark methodology
○ Phase 1: generate and prepare the data
○ Phase 2: execute the continuous workloads and validate the results
Executing the streaming SQL micro-benchmark
● Phase 1 - Data Preparation
○ Generate data for scale factor
○ Split data into file chunks simulating stream on disk
● Phase 2 - Workload Execution
○ Execute streaming queries
○ Store results and metrics
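The "split data into file chunks" step of Phase 1 could look roughly like the sketch below. It is written in plain Scala (no Spark dependency) and is not taken from the benchmark repository: the `ChunkSplitter` object, the `split` method, and the choice to chunk by line count instead of by byte size are assumptions made for illustration. The `web-logsN` naming follows the file names mentioned in the talk.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}
import scala.io.Source

// Hypothetical helper: splits a line-oriented file (for example one JSON
// record per line) into numbered chunk files that can later be fed to a
// streaming directory one by one.
object ChunkSplitter {

  /** Writes chunks of at most `linesPerChunk` lines into `outDir` and
    * returns the chunk paths in the order they were written. */
  def split(input: Path, outDir: Path, linesPerChunk: Int): Seq[Path] = {
    require(linesPerChunk > 0, "linesPerChunk must be positive")
    Files.createDirectories(outDir)
    val source = Source.fromFile(input.toFile, "UTF-8")
    try {
      source.getLines()
        .grouped(linesPerChunk)
        .zipWithIndex
        .map { case (lines, i) =>
          // Assumed naming scheme: web-logs1.json, web-logs2.json, ...
          val chunk = outDir.resolve(s"web-logs${i + 1}.json")
          Files.write(chunk, lines.mkString("", "\n", "\n").getBytes(UTF_8))
          chunk
        }
        .toList // force evaluation before the source is closed
    } finally source.close()
  }
}
```

Chunking by line count keeps every record intact; to target a chunk size in MB, the same loop could accumulate bytes per line instead.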
Workloads
● 5 queries were chosen for the evaluation
● 4 of them are from BigBench V2 and represent relevant streaming questions
○ Q05 : Find the 10 most browsed products in the last X seconds.
○ Q06 : Find the 5 most browsed products that were not purchased across all users
(or only a specific user) in the last X seconds.
○ Q16 : Find the top ten pages visited by all users (or a specific user) in the last X
seconds.
○ Q22 : Show the number of unique visitors in the last X seconds.
● 1 new query performing a simple monitoring operation:
○ Qmilk : Show the sold products (of a certain product or category) in the last X
seconds.
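To make the "last X seconds" semantics of Q05 concrete, here is a minimal sketch over an in-memory collection in plain Scala. It is not the Spark implementation from the benchmark repository: the `Click` record, the `topN` method, the half-open window boundaries, and the tie-breaking rule are all assumptions chosen for illustration.

```scala
// Hypothetical illustration of Q05 ("the N most browsed products in the
// last X seconds") on a plain Scala collection, without Spark.
object TopBrowsed {

  // Simplified click record; real BigBench V2 web logs carry more fields.
  final case class Click(itemId: Long, timestampSec: Long)

  /** Counts clicks per item inside the window (nowSec - windowSec, nowSec]
    * and returns up to `n` (itemId, count) pairs, highest count first.
    * Ties are broken by ascending itemId (an assumption, for determinism). */
  def topN(clicks: Seq[Click], nowSec: Long, windowSec: Long, n: Int): Seq[(Long, Int)] =
    clicks
      .filter(c => c.timestampSec > nowSec - windowSec && c.timestampSec <= nowSec)
      .groupBy(_.itemId)
      .map { case (id, cs) => (id, cs.size) }
      .toSeq
      .sortBy { case (id, count) => (-count, id) }
      .take(n)
}
```

Q16 follows the same pattern with the page name as the grouping key instead of the item id.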
Implementation and Limitations
● Q05, Q16 and Qmilk were successfully implemented.
○ Spark code: https://github.com/Taaffy/Structured-Streaming-Micro-Benchmark
● Q06 - performs join operations on two streaming datasets, which was not supported
○ however, the latest version supports stream-stream joins
(https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html)
● Q22 - uses a distinct operation, sorting operation (Order By) and Limit keyword, which
were not supported at the time of implementation
// Q05: number of views per product.
// Note: sorting a streaming aggregation requires the "complete" output mode.
var web_logs_05 = web_logsDF
  .groupBy("wl_item_id").count()
  .orderBy("count")
  .select("wl_item_id", "count")
  .where("wl_item_id IS NOT NULL")

// Q16: number of visits per web page.
var web_logs_16 = web_logsDF
  .groupBy("wl_webpage_name").count()
  .orderBy("count")
  .select("wl_webpage_name", "count")
  .where("wl_webpage_name IS NOT NULL")

// Qmilk: number of sales per product.
var web_sales_milk = web_salesDF
  .groupBy("ws_product_id").count()
  .orderBy("ws_product_id")
  .select("ws_product_id", "count")
  .where("ws_product_id IS NOT NULL")
Benchmarking Setup
● Hardware Setup:
○ Intel Core i5 CPU 760 @3.47GHz x 4
○ 8 GB RAM & 1 TB Hard Disk
● Software
○ Ubuntu 16.04 LTS OS
○ Java Version 1.8.0_131 & Scala Version 2.11.2
○ Apache Spark 2.3 - Standalone with Default parameters
Data Preparation and Execution
● Generate data with the BigBench V2 data generator for scale factor 1
○ web logs consisting of web clicks in JSON format (around 20GB)
○ web sales structured data (around 10MB)
● Web logs with variable file sizes ranging from 50MB to 2 GB were created.
● For every file size, 10 files were created with subsequent timestamps to simulate a stream
of 10 data files.
● Q05 and Q16 were tested with web log files of varying sizes.
● Qmilk was tested with the 10MB web sales file copied 10 times to ensure 10 files could be
streamed successfully.
Performed Experiments
• Query and File Size Combinations (total of 31 combinations)
Exploratory Analysis
● A warmup run preceded every experimental execution and was excluded from the
measurements.
● After the second execution the average query latency becomes constant in all 3 cases.
Average Processing Times
● 3 groups of queries:
○ single executions - Q05 and Q16
○ parallel pair executions - Q05, Q16 and Qmilk
○ triple parallel executions - Q05, Q16 and Qmilk
● In general, the more queries run in parallel and the larger the files, the longer the
queries take to complete and the more resources they consume.
Other Observations
● Average latency increased by more than 200% between the 50MB and 2GB file sizes.
● Latency increased 2- to 3-fold from single-query to triple-query execution.
● CPU utilization was the highest at the 50MB level and decreased every time the file size
was increased.
● The Java heap size grew with the file size, making memory the limiting factor when
using larger files.
Experimental Limitations
● File Creation Date Issue
○ It appeared that when all the streaming files were copied to the streaming directory
and had the same file creation or modification date, Spark considered these to be
the same file even if the files had different names (web-logs1, web-logs2, etc.).
● Files larger than 2GB and streams of more than 10 files should be investigated in future
research for potential bottlenecks and peak performance.
● Spark in Cluster mode should be tested and evaluated with the end-to-end BigBench
execution.
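One possible workaround for the file creation date issue, in line with the "subsequent timestamps" used during data preparation, is to give every chunk its own modification time before it is moved into the streaming directory. The sketch below uses only the Java NIO API from plain Scala. The `StreamFileStamper` object and its parameters are hypothetical, and the talk does not state how the timestamps were actually assigned.

```scala
import java.nio.file.{Files, Path}
import java.nio.file.attribute.FileTime

// Hypothetical helper: assigns distinct, strictly increasing modification
// times so that files copied in one go do not share the same timestamp.
object StreamFileStamper {

  /** Sets the modification time of the i-th file to
    * baseMillis + i * stepMillis (milliseconds since the epoch). */
  def stamp(files: Seq[Path], baseMillis: Long, stepMillis: Long): Unit = {
    require(stepMillis > 0, "stepMillis must be positive")
    files.zipWithIndex.foreach { case (file, i) =>
      Files.setLastModifiedTime(file, FileTime.fromMillis(baseMillis + i * stepMillis))
    }
  }
}
```

A step of at least one second (1000 ms) is a cautious choice, since some file systems store modification times with second granularity only.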
Lessons Learned
● Pros
○ Simple programming model and streaming API
○ Several built-in metrics
○ Several extraction and sink options for metrics
○ Possible to run queries in parallel
● Cons
○ Undocumented trigger process
○ Query limitations due to streaming API (Syntax)
○ Complicated metric extraction process
Future Work
● Extend the benchmark with more queries and implementations on other engines (Flink,
Samza, etc.).
● Build automated benchmark kit with all executable components (as part of ABench).
● Perform experiments with larger file sizes on multi-node Spark cluster.
Thank you for your attention!
References
[1] Ahmad Ghazal, Todor Ivanov, Pekka Kostamaa, Alain Crolotte, Ryan Voong, Mohammed
Al-Kateb, Waleed Ghazal, and Roberto V. Zicari. 2017. BigBench V2: The New and
Improved BigBench. In ICDE 2017, San Diego, CA, USA, April 19-22.
[2] Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte,
and Hans-Arno Jacobsen. 2013. BigBench: Towards An Industry Standard Benchmark for
Big Data Analytics. In SIGMOD 2013. 1197–1208.
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 

Exploratory Analysis of Spark Structured Streaming

  • 1. Exploratory Analysis of Spark Structured Streaming [Work-in-Progress]
    9 April 2018
    Todor Ivanov (todor@dbis.cs.uni-frankfurt.de), Jason Taaffe (jasontaaffe@yahoo.de)
    Goethe University Frankfurt am Main, Germany
    http://www.bigdata.uni-frankfurt.de/
  • 2. Motivation
    • Increasing popularity and variety of stream processing engines such as Spark Streaming, Flink Streaming, Samza, etc.
    • Growing support for stream SQL features such as Spark Structured Streaming, Flink SQL, Samza SQL, etc.
    Research Objectives
    • Evaluate the features of Spark Structured Streaming.
    • Prototype a benchmark for testing stream SQL based on BigBench.
    • Perform experiments validating the prototype benchmark and exploring the capabilities and features of Spark Structured Streaming.
    PABS 2018, Berlin, Germany, April 9
  • 3. Spark Structured Streaming (introduced in Spark 2.0)
    ● Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.
    ● It is built and executed on top of the Spark SQL engine.
    ● The Dataset/DataFrame API in Scala, Java, Python or R is used to express streaming aggregations, event-time windows, stream-to-batch joins, etc.
    ● Micro-batch processing mode
      ○ processes data streams as a series of small batch jobs, thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees
    ● Continuous processing mode (introduced in Spark 2.3)
      ○ achieves end-to-end latencies as low as 1 millisecond with at-least-once guarantees (future work)
    Source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
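The micro-batch mode described on this slide can be sketched as follows. This is a minimal illustration, not the authors' benchmark code: the object name, the input directory argument, the column types in the schema, and the 10-second trigger interval are all assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object MicroBatchSketch {
  // Starts a streaming count of page views per product over JSON files in inputDir.
  // inputDir and the schema below are hypothetical; adapt them to the real data.
  def startCounting(spark: SparkSession, inputDir: String): StreamingQuery = {
    // File sources need an explicit schema when used as a stream.
    val schema = new StructType()
      .add("wl_item_id", LongType)
      .add("wl_webpage_name", StringType)

    val webLogs = spark.readStream
      .schema(schema)
      .option("maxFilesPerTrigger", "1") // at most one new file per micro-batch
      .json(inputDir)

    val counts = webLogs
      .where("wl_item_id IS NOT NULL")
      .groupBy("wl_item_id")
      .count()

    counts.writeStream
      .outputMode("complete")                        // re-emit the full aggregate each trigger
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds")) // micro-batch interval (assumed value)
      .start()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("micro-batch-sketch")
      .master("local[*]")
      .getOrCreate()
    startCounting(spark, args.headOption.getOrElse("/tmp/stream-input")).awaitTermination()
  }
}
```

Each trigger runs one small batch job over the files that arrived since the previous trigger, which is the behaviour the micro-batch bullet describes.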
  • 4. Spark Structured Streaming (2)
    “The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended. [...] Every data item that is arriving on the stream is like a row being appended to the Input Table.” [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]
    “... at any time, the output of the application is equivalent to executing a batch job on a prefix of the data.” [https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html]
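The prefix-equivalence idea in the second quote can be illustrated without Spark. The following plain-Scala sketch (all names are hypothetical and not part of any Spark API) folds micro-batches into a running count and checks that, after every micro-batch, the incremental result equals a batch count over the prefix of the data seen so far.

```scala
object PrefixConsistencySketch {
  // Batch semantics: count the occurrences of each key over all rows seen so far.
  def batchCounts(rows: Seq[String]): Map[String, Long] =
    rows.groupBy(identity).map { case (key, group) => key -> group.size.toLong }

  // Streaming semantics: fold one micro-batch of newly appended rows into the running result.
  def updateCounts(state: Map[String, Long], newRows: Seq[String]): Map[String, Long] =
    newRows.foldLeft(state) { (acc, key) => acc.updated(key, acc.getOrElse(key, 0L) + 1L) }

  // True if, after each micro-batch, the incremental result matches a batch job on the prefix.
  def prefixConsistent(microBatches: Seq[Seq[String]]): Boolean = {
    var state = Map.empty[String, Long]
    var prefix = Vector.empty[String]
    microBatches.forall { batch =>
      state = updateCounts(state, batch)
      prefix = prefix ++ batch
      state == batchCounts(prefix)
    }
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("item1", "item2"), Seq("item1"), Seq("item3", "item1"))
    println(prefixConsistent(batches))
  }
}
```

The sketch only models an append-only input table with a simple count; it says nothing about how Spark implements incremental execution internally.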
  • 5. Trigger Execution Metrics
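One way to collect per-trigger metrics is a `StreamingQueryListener`. The sketch below is an assumption about how such metrics could be extracted, not the procedure used in the experiments; the object and class names and the comma-separated output format are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

object TriggerMetricsSketch {
  // Prints one comma-separated line per completed trigger.
  class CsvProgressListener extends StreamingQueryListener {
    override def onQueryStarted(event: QueryStartedEvent): Unit = ()

    override def onQueryProgress(event: QueryProgressEvent): Unit = {
      val p = event.progress
      // durationMs is a java.util.Map; the key may be absent, so guard against null.
      val triggerMs = Option(p.durationMs.get("triggerExecution")).map(_.toString).getOrElse("")
      println(s"${p.name},${p.batchId},${p.numInputRows},${p.processedRowsPerSecond},$triggerMs")
    }

    override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  }

  // Registers the listener for all streaming queries of this session and returns it,
  // so that the caller can remove it again later.
  def register(spark: SparkSession): StreamingQueryListener = {
    val listener = new CsvProgressListener
    spark.streams.addListener(listener)
    listener
  }
}
```

For ad-hoc inspection, `query.lastProgress` on a running `StreamingQuery` exposes the same progress object without registering a listener.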
  • 6. Structured Streaming Micro-Benchmark
    ● Popular streaming benchmarks such as HiBench, SparkBench and the Yahoo Streaming Benchmark do not stress-test the streaming SQL capabilities of engines.
    ● Idea: use BigBench [1,2] as the base for a new streaming SQL micro-benchmark.
    ● Simplified two-phase benchmark methodology
      ○ Phase 1: generate and prepare the data
      ○ Phase 2: execute the continuous workloads and validate the results
    Figure: Executing the streaming SQL micro-benchmark
      Phase 1 - Data Preparation: generate data for the scale factor; split the data into file chunks simulating a stream on disk
      Phase 2 - Workload Execution: execute the streaming queries; store results and metrics
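The chunking step of Phase 1 could look roughly like the following plain-Scala sketch. It is an illustration under assumptions: the object and function names, the line-based split, and the `.json` extension are not taken from the benchmark code; only the `web-logsN` naming pattern appears in the slides.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.io.Source

object ChunkSplitterSketch {
  // Splits a line-oriented file into chunks of at most linesPerChunk lines each
  // and writes them as web-logs1.json, web-logs2.json, ... into outDir.
  def splitIntoChunks(input: Path, outDir: Path, linesPerChunk: Int): Seq[Path] = {
    require(linesPerChunk > 0, "linesPerChunk must be positive")
    Files.createDirectories(outDir)
    val source = Source.fromFile(input.toFile, "UTF-8")
    try {
      source.getLines()
        .grouped(linesPerChunk)
        .zipWithIndex
        .map { case (lines, index) =>
          val target = outDir.resolve(s"web-logs${index + 1}.json")
          Files.write(target, lines.mkString("", "\n", "\n").getBytes(StandardCharsets.UTF_8))
          target
        }
        .toList // force the lazy iterator before the source is closed
    } finally {
      source.close()
    }
  }
}
```

Splitting on line boundaries keeps every JSON record intact, which matters when each line holds one web-log entry.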
  • 7. Workloads
    ● 5 queries were chosen for the evaluation.
    ● 4 of them are from BigBench V2 and represent relevant streaming questions:
      ○ Q05: Find the 10 most browsed products in the last X seconds.
      ○ Q06: Find the 5 most browsed products that were not purchased across all users (or only a specific user) in the last X seconds.
      ○ Q16: Find the top ten pages visited by all users (or a specific user) in the last X seconds.
      ○ Q22: Show the number of unique visitors in the last X seconds.
    ● 1 new query performs a simple monitoring operation:
      ○ Qmilk: Show the sold products (of a certain product or category) in the last X seconds.
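The "in the last X seconds" part of these queries maps naturally onto event-time windows. The sketch below shows one possible formulation for the counting step of Q05; it is not the implementation from the benchmark repository. The timestamp column name `wl_timestamp` (assumed to be of timestamp type) and the object and function names are assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

object WindowedBrowseCountsSketch {
  // Counts page views per product within tumbling event-time windows of windowSeconds.
  // Works on batch and streaming DataFrames alike; selecting the top 10 per window
  // is left to a later step or to the sink.
  def browseCountsPerWindow(webLogs: DataFrame, windowSeconds: Int): DataFrame =
    webLogs
      .where(col("wl_item_id").isNotNull)
      .groupBy(window(col("wl_timestamp"), s"$windowSeconds seconds"), col("wl_item_id"))
      .count()
}
```

Because the function takes a plain `DataFrame`, the same logic can be validated on static data before it is attached to a stream.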
  • 8. Implementation and Limitations
    ● Q05, Q16 and Qmilk were successfully implemented.
      ○ Spark code: https://github.com/Taaffy/Structured-Streaming-Micro-Benchmark
    ● Q06 performs join operations on two streaming datasets, which was not supported.
      ○ However, the latest version supports stream-stream joins (https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html).
    ● Q22 uses a distinct operation, a sorting operation (Order By) and the Limit keyword, which were not supported at the time of implementation.

    // Q05: number of views per product
    val web_logs_05 = web_logsDF
      .groupBy("wl_item_id").count()
      .orderBy("count")
      .select("wl_item_id", "count")
      .where("wl_item_id IS NOT NULL")

    // Q16: number of visits per web page
    val web_logs_16 = web_logsDF
      .groupBy("wl_webpage_name").count()
      .orderBy("count")
      .select("wl_webpage_name", "count")
      .where("wl_webpage_name IS NOT NULL")

    // Qmilk: number of sales per product, ordered by product id
    val web_sales_milk = web_salesDF
      .groupBy("ws_product_id").count()
      .orderBy("ws_product_id")
      .select("ws_product_id", "count")
      .where("ws_product_id IS NOT NULL")
  • 9. Benchmarking Setup
    ● Hardware
      ○ Intel Core i5 CPU 760 @ 3.47 GHz x 4
      ○ 8 GB RAM and 1 TB hard disk
    ● Software
      ○ Ubuntu 16.04 LTS
      ○ Java version 1.8.0_131 and Scala version 2.11.2
      ○ Apache Spark 2.3, standalone with default parameters
  • 10. Data Preparation and Execution
    ● Data was generated with the BigBench V2 data generator for scale factor 1:
      ○ web logs consisting of web clicks in JSON format (around 20 GB)
      ○ web sales structured data (around 10 MB)
    ● Web-log files with sizes ranging from 50 MB to 2 GB were created.
    ● For every file size, 10 files with successive timestamps were created to simulate a stream of 10 data files.
    ● Q05 and Q16 were tested with the web-log files of varying size.
    ● Qmilk was tested with the 10 MB web sales file, copied 10 times to ensure that 10 files could be streamed successfully.
  • 11. Performed Experiments
    • Query and file size combinations (31 combinations in total)
  • 12. Exploratory Analysis
    ● A warm-up run was performed for all experimental executions and excluded from the measurements.
    ● After the second execution, the average query latency becomes constant in all 3 cases.
  • 13. Average Processing Times
    ● 3 groups of queries:
      ○ single executions - Q05 and Q16
      ○ parallel pair executions - Q05, Q16 and Qmilk
      ○ triple parallel executions - Q05, Q16 and Qmilk
    ● In general, the more queries run in parallel and the larger the files, the longer the queries take to complete and the more resources they consume.
  • 14. Other Observations
    ● Average latency increased by more than 200% between the 50 MB and 2 GB file sizes.
    ● Latency increased 2- to 3-fold from single-query to triple-query execution.
    ● CPU utilization was highest at the 50 MB level and decreased every time the file size was increased.
    ● The Java heap size increased for larger file sizes, making memory the limiting factor when using larger files.
  • 15. Experimental Limitations
    ● File creation date issue
      ○ When all streaming files were copied to the streaming directory with the same file creation or modification date, Spark appeared to treat them as the same file, even if they had different names (web-logs1, web-logs2, etc.).
    ● Files larger than 2 GB and streams of more than 10 files should be investigated in future research for potential bottlenecks and peak performance.
    ● Spark in cluster mode should be tested and evaluated with the end-to-end BigBench execution.
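A possible workaround for the file date issue is to give every copied file its own modification time. The plain-Scala sketch below illustrates the idea; it has not been verified against the setup described here, and the object name, function name and parameters are hypothetical.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import java.nio.file.attribute.FileTime

object StaggeredCopySketch {
  // Copies each file into streamDir and sets its modification time to
  // baseMillis + i * stepMillis, so that no two files share a timestamp.
  // A step of at least 1000 ms is advisable on file systems with one-second resolution.
  def copyWithDistinctTimestamps(files: Seq[Path], streamDir: Path,
                                 baseMillis: Long, stepMillis: Long): Seq[Path] = {
    Files.createDirectories(streamDir)
    files.zipWithIndex.map { case (src, i) =>
      val target = streamDir.resolve(src.getFileName)
      Files.copy(src, target, StandardCopyOption.REPLACE_EXISTING)
      Files.setLastModifiedTime(target, FileTime.fromMillis(baseMillis + i * stepMillis))
      target
    }
  }
}
```

Whether distinct modification times alone are sufficient depends on how the file source tracks already processed files, so the result should be checked against the Spark version in use.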
  • 16. Lessons Learned
    ● Pros
      ○ Simple programming model and streaming API
      ○ Several built-in metrics
      ○ Several extraction and sink options for metrics
      ○ Possible to run queries in parallel
    ● Cons
      ○ Undocumented trigger process
      ○ Query limitations due to the streaming API (syntax)
      ○ Complicated metric extraction process
  • 17. Future Work
    ● Extend the benchmark with more queries and implementations on other engines (Flink, Samza, etc.).
    ● Build an automated benchmark kit with all executable components (as part of ABench).
    ● Perform experiments with larger file sizes on a multi-node Spark cluster.
  • 18. Thank you for your attention!
  • 19. References
    [1] Ahmad Ghazal, Todor Ivanov, Pekka Kostamaa, Alain Crolotte, Ryan Voong, Mohammed Al-Kateb, Waleed Ghazal, and Roberto V. Zicari. 2017. BigBench V2: The New and Improved BigBench. In ICDE 2017, San Diego, CA, USA, April 19-22.
    [2] Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, and Hans-Arno Jacobsen. 2013. BigBench: Towards An Industry Standard Benchmark for Big Data Analytics. In SIGMOD 2013. 1197–1208.