SlideShare a Scribd company logo
1 of 19
Download to read offline
9. April 2018
Exploratory Analysis of Spark
Structured Streaming
[Work-in-Progress]
Todor Ivanov todor@dbis.cs.uni-frankfurt.de
Jason Taaffe jasontaaffe@yahoo.de
Goethe University Frankfurt am Main, Germany
http://www.bigdata.uni-frankfurt.de/
9. April 2018
Motivation
• Increasing popularity and variety of Stream Processing Engines such as Spark Streaming,
Flink Streaming, Samza, etc.
• Growing support for stream SQL features such as Spark Structured Streaming, Flink SQL,
Samza SQL, etc.
Research Objectives
• Evaluate the features of Spark Structured Streaming
• Prototype a benchmark for testing stream SQL based on BigBench
• Perform experiments validating the prototype benchmark and exploring the Spark
Structured Streaming capabilities/features
2PABS 2018, Berlin, Germany, April 9
9. April 2018
Spark Structured Streaming (introduced in Spark 2.0)
● Structured Streaming provides fast, scalable, fault-tolerant, end-to-end
exactly-once stream processing without the user having to reason about
streaming.
● built and executed on top of the Spark SQL engine
● use the Dataset/DataFrame API in Scala, Java, Python or R to express
streaming aggregations, event-time windows, stream-to-batch joins, etc
● Micro-batch processing mode
○ processes data streams as a series of small batch jobs thereby
achieving end-to-end latencies as low as 100 milliseconds and
exactly-once fault-tolerance guarantees
● Continuous processing mode (introduced in Spark 2.3)
○ achieve end-to-end latencies as low as 1 millisecond with at-
least-once guarantees (future work)
3PABS 2018, Berlin, Germany, April 9
Source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
9. April 2018
Spark Structured Streaming (2)
“The key idea in Structured Streaming is to treat a live data stream as a table that is being
continuously appended. [...] Every data item that is arriving on the stream is like a row being
appended to the Input Table.” [https://spark.apache.org/docs/latest/structured-streaming-
programming-guide.html ]
“... at any time, the output of the application is equivalent to executing a batch job on a prefix of
the data.” [https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html ].
4PABS 2018, Berlin, Germany, April 9
9. April 2018
Trigger Execution Metrics
5PABS 2018, Berlin, Germany, April 9
9. April 2018
Structured Streaming Micro-Benchmark
● Popular streaming benchmarks such as HiBench, SparkBench and Yahoo Streaming
Benchmark
→ do not stress test the streaming SQL capabilities of engines.
● Idea: use BigBench [1,2] as a base for new streaming SQL micro-benchmark
● Simplified 2 phase benchmark methodology
○ Phase 1: generate and prepare the data
○ Phase 2: execute the continuous workloads and validate the results
6PABS 2018, Berlin, Germany, April 9
ExecutingthestreamingSQL
micro-benchmark
Phase 1 - Data Preparation
Phase 2 - Workload
Execution
Generate data for scale
factor
Split data into file chunks
simulating stream on disk
Execute streaming queries
Store results and metrics
9. April 2018
Workloads
● 5 queries were chosen for the evaluation
● 4 of them are from BigBench V2 and represent relevant streaming questions
○ Q05 : Find the 10 most browsed products in the last X seconds.
○ Q06 : Find the 5 most browsed products that were not purchased across all users
(or only specific user) in the last X seconds.
○ Q16 : Find the top ten pages visited by all users (or specific user) in the last X
seconds.
○ Q22 : Show the number of unique visitors in the X seconds.
● 1 new query performing simple monitoring operation:
○ Qmilk : Show the sold products (of a certain product or category) in the last X
seconds.
7PABS 2018, Berlin, Germany, April 9
9. April 2018
Implementation and Limitations
● Q05, Q16 and Qmilk were successfully implemented.
○ Spark code: https://github.com/Taaffy/Structured-Streaming-Micro-Benchmark
● Q06 - performs join operations on two streaming datasets, which was not supported
○ however, the latest version supports stream-stream joins
(https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-
spark-2-3.html )
● Q22 - uses a distinct operation, sorting operation (Order By) and Limit keyword, which
were not supported at the time of implementation
8
var web_logs_05 = web_logsDF
.groupBy ( "wl _ item_id" ).count( )
.orderBy ( "count" )
.select ( "wl _item_id" , "count" )
.where ( "wl _ item _ id ␣ IS ␣ NOT ␣ NULL " )
var web_logs_16 = web_logsDF
.groupBy ( " wl_webpage_name " ) . count ( )
.orderBy ( "count " )
.select ( " wl_webpage_name " , " count " )
.where ( " wl_webpage_name ␣ IS ␣ NOT ␣ NULL "
)
var web _ sales _ milk = web_salesDF
.groupBy ( "ws _ product _ id") . count ( )
.orderBy ( "ws _ product_ id")
.select ( "ws _ product _ id " , " count ")
.where ( "ws_product_ID ␣ IS ␣ NOT ␣ NULL “)
PABS 2018, Berlin, Germany, April 9
9. April 2018
Benchmarking Setup
● Hardware Setup:
○ Intel Core i5 CPU 760 @3.47GHz x 4
○ 8 GB RAM & 1 TB Hard Disk
● Software
○ Ubuntu 16.04 LTS OS
○ Java Version 1.8.0_131 & Scala Version 2.11.2
○ Apache Spark 2.3 - Standalone with Default parameters
9PABS 2018, Berlin, Germany, April 9
9. April 2018
Data Preparation and Execution
● Generate data with the BigBench V2 data generator for scale factor 1
○ web logs consisting of web clicks in JSON format (around 20GB)
○ web sales structured data (around 10MB)
● Web logs with variable file sizes ranging from 50MB to 2 GB were created.
● For every file size, 10 files were created with subsequent timestamps to simulate a stream
of 10 data files.
● Q05 and Q16 were tested with varied web logs files.
● Qmilk was tested 10MB web sales file copied 10 times to ensure 10 files could be
streamed successfully.
10PABS 2018, Berlin, Germany, April 9
9. April 2018
Performed Experiments
• Query and File Size Combinations (total of 31 combinations)
11PABS 2018, Berlin, Germany, April 9
9. April 2018
Exploratory Analysis
● Warmup run for all experimental executions taken into account. (excluded from
measures)
● After the second execution the average query latency becomes constant in all 3 cases.
12PABS 2018, Berlin, Germany, April 9
9. April 2018
Average Processing Times
● 3 groups of queries:
○ single executions - Q05 and Q16
○ parallel pair executions - Q05, Q16 and Qmilk
○ triple parallel executions - Q05, Q16 and Qmilk
● In general, it can be said that the more queries are run in parallel and the larger the file
sizes used, the longer these take, to be completed and the more resource intensive
they are.
13PABS 2018, Berlin, Germany, April 9
9. April 2018
Other Observations
● Average Latency increased by more than 200% between the 50MB and 2 GB file sizes.
● Comparing single query latency to triple query latency, it increased 2 to 3-fold.
● CPU utilization was the highest at the 50MB level and decreased every time the file size
was increased.
● The JAVA heap size increased for larger file sizes, making the memory size to be the
limiting factor when using larger files.
14PABS 2018, Berlin, Germany, April 9
9. April 2018
Experimental Limitations
● File Creation Date Issue
○ It appeared that when all the streaming files were copied to the streaming directory
and had the same file creation or modification date, Spark considered these to be
the same file even if the files had different names (web-logs1, web-logs2, etc.).
● Files larger than 2GB and with more than 10 file streams should be investigated in future
research for potential bottlenecks and peak performance.
● Spark in Cluster mode should be tested and evaluated with the end-to-end BigBench
execution.
15PABS 2018, Berlin, Germany, April 9
9. April 2018
Lessons Learned
● Pros
○ Simple programing model and streaming API
○ Several built-in metrics
○ Several extraction and sink options for metrics
○ Possible to run queries in parallel
● Cons
○ Undocumented trigger process
○ Query limitations due to streaming API (Syntax)
○ Complicated metric extraction process
16PABS 2018, Berlin, Germany, April 9
9. April 2018
Future Work
● Extend the benchmark with more queries and implementations on other engines (Flink,
Samza, etc.).
● Build automated benchmark kit with all executable components (as part of ABench).
● Perform experiments with larger file sizes on multi-node Spark cluster.
17PABS 2018, Berlin, Germany, April 9
9. April 2018
Thank you for your attention!
9. April 2018
References
[1] Ahmad Ghazal, Todor Ivanov, Pekka Kostamaa, Alain Crolotte, Ryan Voong, Mohammed
Al-Kateb, Waleed Ghazal, and Roberto V. Zicari. 2017. BigBench V2: The New and
Improved BigBench. In ICDE 2017, San Diego, CA, USA, April 19-22.
[2] Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte,
and Hans-Arno Jacobsen. 2013. BigBench: Towards An Industry Standard Benchmark for
Big Data Analytics. In SIGMOD 2013. 1197–1208.
19PABS 2018, Berlin, Germany, April 9

More Related Content

What's hot

BDE SC3.3 Workshop - BDE Platform: Technical overview
 BDE SC3.3 Workshop -  BDE Platform: Technical overview BDE SC3.3 Workshop -  BDE Platform: Technical overview
BDE SC3.3 Workshop - BDE Platform: Technical overviewBigData_Europe
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
Baymeetup-FlinkResearch
Baymeetup-FlinkResearchBaymeetup-FlinkResearch
Baymeetup-FlinkResearchFoo Sounds
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & MarquezJulien Le Dem
 
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...Databricks
 
Gobblin meetup-whats new in 0.7
Gobblin meetup-whats new in 0.7Gobblin meetup-whats new in 0.7
Gobblin meetup-whats new in 0.7Vasanth Rajamani
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data qualityLars Albertsson
 
Automatic Detection of Web Trackers by Vasia Kalavri
Automatic Detection of Web Trackers by Vasia KalavriAutomatic Detection of Web Trackers by Vasia Kalavri
Automatic Detection of Web Trackers by Vasia KalavriFlink Forward
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSDatabricks
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Julien Le Dem
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupRobert Metzger
 
Flink Case Study: Capital One
Flink Case Study: Capital OneFlink Case Study: Capital One
Flink Case Study: Capital OneFlink Forward
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataTrieu Nguyen
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeDatabricks
 

What's hot (20)

BDE SC3.3 Workshop - BDE Platform: Technical overview
 BDE SC3.3 Workshop -  BDE Platform: Technical overview BDE SC3.3 Workshop -  BDE Platform: Technical overview
BDE SC3.3 Workshop - BDE Platform: Technical overview
 
hlit resource roster
hlit resource rosterhlit resource roster
hlit resource roster
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Baymeetup-FlinkResearch
Baymeetup-FlinkResearchBaymeetup-FlinkResearch
Baymeetup-FlinkResearch
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
 
Gobblin meetup-whats new in 0.7
Gobblin meetup-whats new in 0.7Gobblin meetup-whats new in 0.7
Gobblin meetup-whats new in 0.7
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data quality
 
A deep dive into neuton
A deep dive into neutonA deep dive into neuton
A deep dive into neuton
 
Automatic Detection of Web Trackers by Vasia Kalavri
Automatic Detection of Web Trackers by Vasia KalavriAutomatic Detection of Web Trackers by Vasia Kalavri
Automatic Detection of Web Trackers by Vasia Kalavri
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya Meetup
 
Flink Case Study: Capital One
Flink Case Study: Capital OneFlink Case Study: Capital One
Flink Case Study: Capital One
 
IMDb Data Integration
IMDb Data IntegrationIMDb Data Integration
IMDb Data Integration
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big data
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
 
Prashant_Agrawal_CV
Prashant_Agrawal_CVPrashant_Agrawal_CV
Prashant_Agrawal_CV
 

Similar to Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe, ICPE 2018, 9-13/04/2018

ABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack BenchmarkABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack Benchmarkt_ivanov
 
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...t_ivanov
 
An elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache FlinkAn elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache FlinkDataWorks Summit
 
Adding Velocity to BigBench
Adding Velocity to BigBenchAdding Velocity to BigBench
Adding Velocity to BigBencht_ivanov
 
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...DataBench
 
Fluent 2018: Tracking Performance of the Web with HTTP Archive
Fluent 2018: Tracking Performance of the Web with HTTP ArchiveFluent 2018: Tracking Performance of the Web with HTTP Archive
Fluent 2018: Tracking Performance of the Web with HTTP ArchivePaul Calvano
 
Start with version control and experiments management in machine learning
Start with version control and experiments management in machine learningStart with version control and experiments management in machine learning
Start with version control and experiments management in machine learningMikhail Rozhkov
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Knoldus Inc.
 
Nikhil summer internship 2016
Nikhil   summer internship 2016Nikhil   summer internship 2016
Nikhil summer internship 2016Nikhil Shekhar
 
Airline Reservations and Routing: A Graph Use Case
Airline Reservations and Routing: A Graph Use CaseAirline Reservations and Routing: A Graph Use Case
Airline Reservations and Routing: A Graph Use CaseJason Plurad
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Itai Yaffe
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
Airline reservations and routing: a graph use case
Airline reservations and routing: a graph use caseAirline reservations and routing: a graph use case
Airline reservations and routing: a graph use caseDataWorks Summit
 
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...DataBench
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
Lightweight Collection and Storage of Software Repository Data with DataRover
Lightweight Collection and Storage of  Software Repository Data with DataRoverLightweight Collection and Storage of  Software Repository Data with DataRover
Lightweight Collection and Storage of Software Repository Data with DataRoverChristoph Matthies
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingDatabricks
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko GlobalLogic Ukraine
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionTorsten Steinbach
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
 

Similar to Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe, ICPE 2018, 9-13/04/2018 (20)

ABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack BenchmarkABench: Big Data Architecture Stack Benchmark
ABench: Big Data Architecture Stack Benchmark
 
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
 
An elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache FlinkAn elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache Flink
 
Adding Velocity to BigBench
Adding Velocity to BigBenchAdding Velocity to BigBench
Adding Velocity to BigBench
 
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
 
Fluent 2018: Tracking Performance of the Web with HTTP Archive
Fluent 2018: Tracking Performance of the Web with HTTP ArchiveFluent 2018: Tracking Performance of the Web with HTTP Archive
Fluent 2018: Tracking Performance of the Web with HTTP Archive
 
Start with version control and experiments management in machine learning
Start with version control and experiments management in machine learningStart with version control and experiments management in machine learning
Start with version control and experiments management in machine learning
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0
 
Nikhil summer internship 2016
Nikhil   summer internship 2016Nikhil   summer internship 2016
Nikhil summer internship 2016
 
Airline Reservations and Routing: A Graph Use Case
Airline Reservations and Routing: A Graph Use CaseAirline Reservations and Routing: A Graph Use Case
Airline Reservations and Routing: A Graph Use Case
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Airline reservations and routing: a graph use case
Airline reservations and routing: a graph use caseAirline reservations and routing: a graph use case
Airline reservations and routing: a graph use case
 
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Lightweight Collection and Storage of Software Repository Data with DataRover
Lightweight Collection and Storage of  Software Repository Data with DataRoverLightweight Collection and Storage of  Software Repository Data with DataRover
Lightweight Collection and Storage of Software Repository Data with DataRover
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query Introduction
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 

More from DataBench

Welcome to DataBench
Welcome to DataBenchWelcome to DataBench
Welcome to DataBenchDataBench
 
Session 1 - The Current Landscape of Big Data Benchmarks
Session 1 - The Current Landscape of Big Data BenchmarksSession 1 - The Current Landscape of Big Data Benchmarks
Session 1 - The Current Landscape of Big Data BenchmarksDataBench
 
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...DataBench
 
Session 3 - The DataBench Framework: A compelling offering to measure the Imp...
Session 3 - The DataBench Framework: A compelling offering to measure the Imp...Session 3 - The DataBench Framework: A compelling offering to measure the Imp...
Session 3 - The DataBench Framework: A compelling offering to measure the Imp...DataBench
 
Session 4 - A practical journey on how to use the DataBench Toolbox
Session 4 - A practical journey on how to use the DataBench ToolboxSession 4 - A practical journey on how to use the DataBench Toolbox
Session 4 - A practical journey on how to use the DataBench ToolboxDataBench
 
CoreBigBench: Benchmarking Big Data Core Operations
CoreBigBench: Benchmarking Big Data Core OperationsCoreBigBench: Benchmarking Big Data Core Operations
CoreBigBench: Benchmarking Big Data Core OperationsDataBench
 
DataBench Toolbox in a Nutshell
DataBench Toolbox in a NutshellDataBench Toolbox in a Nutshell
DataBench Toolbox in a NutshellDataBench
 
Success Stories on Big Data & Analytics
Success Stories on Big Data & AnalyticsSuccess Stories on Big Data & Analytics
Success Stories on Big Data & AnalyticsDataBench
 
DataBench Virtual BenchLearning "Success storie on Big Data & Analytics use c...
DataBench Virtual BenchLearning "Success storie on Big Data & Analytics use c...DataBench Virtual BenchLearning "Success storie on Big Data & Analytics use c...
DataBench Virtual BenchLearning "Success storie on Big Data & Analytics use c...DataBench
 
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...DataBench
 
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...DataBench
 
Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...
Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...
Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...DataBench
 
DataBench Toolbox Demo, Ivan Martinez, Tomas Pariente Lobo, BDV Meet-Up Riga,...
DataBench Toolbox Demo, Ivan Martinez, Tomas Pariente Lobo, BDV Meet-Up Riga,...DataBench Toolbox Demo, Ivan Martinez, Tomas Pariente Lobo, BDV Meet-Up Riga,...
DataBench Toolbox Demo, Ivan Martinez, Tomas Pariente Lobo, BDV Meet-Up Riga,...DataBench
 
DataBench session @ BDV Meet-Up Riga: The case of HOBBIT, 27/06/2019
DataBench session @ BDV Meet-Up Riga: The case of HOBBIT, 27/06/2019DataBench session @ BDV Meet-Up Riga: The case of HOBBIT, 27/06/2019
DataBench session @ BDV Meet-Up Riga: The case of HOBBIT, 27/06/2019DataBench
 
DataBench in a Nutshell - The market: Assessing Industrial Needs, Richard Ste...
DataBench in a Nutshell - The market: Assessing Industrial Needs, Richard Ste...DataBench in a Nutshell - The market: Assessing Industrial Needs, Richard Ste...
DataBench in a Nutshell - The market: Assessing Industrial Needs, Richard Ste...DataBench
 
Big Data Benchmarking, Tomas Pariente Lobo, Open Expo Europe, 20/06/2019
Big Data Benchmarking, Tomas Pariente Lobo, Open Expo Europe, 20/06/2019Big Data Benchmarking, Tomas Pariente Lobo, Open Expo Europe, 20/06/2019
Big Data Benchmarking, Tomas Pariente Lobo, Open Expo Europe, 20/06/2019DataBench
 
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...DataBench
 
Impacts of data-driven AI in business sectors, Richard Stevens, ICT 2018, 05/...
Impacts of data-driven AI in business sectors, Richard Stevens, ICT 2018, 05/...Impacts of data-driven AI in business sectors, Richard Stevens, ICT 2018, 05/...
Impacts of data-driven AI in business sectors, Richard Stevens, ICT 2018, 05/...DataBench
 
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...DataBench
 
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...DataBench
 

More from DataBench (20)

Welcome to DataBench
Welcome to DataBenchWelcome to DataBench
Welcome to DataBench
 
Session 1 - The Current Landscape of Big Data Benchmarks
Session 1 - The Current Landscape of Big Data BenchmarksSession 1 - The Current Landscape of Big Data Benchmarks
Session 1 - The Current Landscape of Big Data Benchmarks
 
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
 
Session 3 - The DataBench Framework: A compelling offering to measure the Imp...
Session 3 - The DataBench Framework: A compelling offering to measure the Imp...Session 3 - The DataBench Framework: A compelling offering to measure the Imp...
Session 3 - The DataBench Framework: A compelling offering to measure the Imp...
 
Session 4 - A practical journey on how to use the DataBench Toolbox
Session 4 - A practical journey on how to use the DataBench ToolboxSession 4 - A practical journey on how to use the DataBench Toolbox
Session 4 - A practical journey on how to use the DataBench Toolbox
 
CoreBigBench: Benchmarking Big Data Core Operations
CoreBigBench: Benchmarking Big Data Core OperationsCoreBigBench: Benchmarking Big Data Core Operations
CoreBigBench: Benchmarking Big Data Core Operations
 
DataBench Toolbox in a Nutshell
DataBench Toolbox in a NutshellDataBench Toolbox in a Nutshell
DataBench Toolbox in a Nutshell
 
Success Stories on Big Data & Analytics
Success Stories on Big Data & AnalyticsSuccess Stories on Big Data & Analytics
Success Stories on Big Data & Analytics
 
DataBench Virtual BenchLearning "Success storie on Big Data & Analytics use c...
DataBench Virtual BenchLearning "Success storie on Big Data & Analytics use c...DataBench Virtual BenchLearning "Success storie on Big Data & Analytics use c...
DataBench Virtual BenchLearning "Success storie on Big Data & Analytics use c...
 
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...
 
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B...
 
Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...
Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...
Building the DataBench Workflow and Architecture, Todor Ivanov, Bench 2019 - ...
 
DataBench Toolbox Demo, Ivan Martinez, Tomas Pariente Lobo, BDV Meet-Up Riga,...
DataBench Toolbox Demo, Ivan Martinez, Tomas Pariente Lobo, BDV Meet-Up Riga,...DataBench Toolbox Demo, Ivan Martinez, Tomas Pariente Lobo, BDV Meet-Up Riga,...
DataBench Toolbox Demo, Ivan Martinez, Tomas Pariente Lobo, BDV Meet-Up Riga,...
 
DataBench session @ BDV Meet-Up Riga: The case of HOBBIT, 27/06/2019
DataBench session @ BDV Meet-Up Riga: The case of HOBBIT, 27/06/2019DataBench session @ BDV Meet-Up Riga: The case of HOBBIT, 27/06/2019
DataBench session @ BDV Meet-Up Riga: The case of HOBBIT, 27/06/2019
 
DataBench in a Nutshell - The market: Assessing Industrial Needs, Richard Ste...
DataBench in a Nutshell - The market: Assessing Industrial Needs, Richard Ste...DataBench in a Nutshell - The market: Assessing Industrial Needs, Richard Ste...
DataBench in a Nutshell - The market: Assessing Industrial Needs, Richard Ste...
 
Big Data Benchmarking, Tomas Pariente Lobo, Open Expo Europe, 20/06/2019
Big Data Benchmarking, Tomas Pariente Lobo, Open Expo Europe, 20/06/2019Big Data Benchmarking, Tomas Pariente Lobo, Open Expo Europe, 20/06/2019
Big Data Benchmarking, Tomas Pariente Lobo, Open Expo Europe, 20/06/2019
 
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
 
Impacts of data-driven AI in business sectors, Richard Stevens, ICT 2018, 05/...
Impacts of data-driven AI in business sectors, Richard Stevens, ICT 2018, 05/...Impacts of data-driven AI in business sectors, Richard Stevens, ICT 2018, 05/...
Impacts of data-driven AI in business sectors, Richard Stevens, ICT 2018, 05/...
 
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
 
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
 

Recently uploaded

Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 

Recently uploaded (20)

Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 

Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe, ICPE 2018, 9-13/04/2018

  • 1. 9. April 2018 Exploratory Analysis of Spark Structured Streaming [Work-in-Progress] Todor Ivanov todor@dbis.cs.uni-frankfurt.de Jason Taaffe jasontaaffe@yahoo.de Goethe University Frankfurt am Main, Germany http://www.bigdata.uni-frankfurt.de/
  • 2. 9. April 2018 Motivation • Increasing popularity and variety of Stream Processing Engines such as Spark Streaming, Flink Streaming, Samza, etc. • Growing support for stream SQL features such as Spark Structured Streaming, Flink SQL, Samza SQL, etc. Research Objectives • Evaluate the features of Spark Structured Streaming • Prototype a benchmark for testing stream SQL based on BigBench • Perform experiments validating the prototype benchmark and exploring the Spark Structured Streaming capabilities/features 2PABS 2018, Berlin, Germany, April 9
  • 3. 9. April 2018 Spark Structured Streaming (introduced in Spark 2.0) ● Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming. ● built and executed on top of the Spark SQL engine ● use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc ● Micro-batch processing mode ○ processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees ● Continuous processing mode (introduced in Spark 2.3) ○ achieve end-to-end latencies as low as 1 millisecond with at- least-once guarantees (future work) 3PABS 2018, Berlin, Germany, April 9 Source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 4. 9. April 2018 Spark Structured Streaming (2) “The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended. [...] Every data item that is arriving on the stream is like a row being appended to the Input Table.” [https://spark.apache.org/docs/latest/structured-streaming- programming-guide.html ] “... at any time, the output of the application is equivalent to executing a batch job on a prefix of the data.” [https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html ]. 4PABS 2018, Berlin, Germany, April 9
  • 5. 9. April 2018 Trigger Execution Metrics 5PABS 2018, Berlin, Germany, April 9
  • 6. 9. April 2018 Structured Streaming Micro-Benchmark ● Popular streaming benchmarks such as HiBench, SparkBench and Yahoo Streaming Benchmark → do not stress test the streaming SQL capabilities of engines. ● Idea: use BigBench [1,2] as a base for new streaming SQL micro-benchmark ● Simplified 2 phase benchmark methodology ○ Phase 1: generate and prepare the data ○ Phase 2: execute the continuous workloads and validate the results 6PABS 2018, Berlin, Germany, April 9 ExecutingthestreamingSQL micro-benchmark Phase 1 - Data Preparation Phase 2 - Workload Execution Generate data for scale factor Split data into file chunks simulating stream on disk Execute streaming queries Store results and metrics
  • 7. 9. April 2018 Workloads ● 5 queries were chosen for the evaluation ● 4 of them are from BigBench V2 and represent relevant streaming questions ○ Q05 : Find the 10 most browsed products in the last X seconds. ○ Q06 : Find the 5 most browsed products that were not purchased across all users (or only specific user) in the last X seconds. ○ Q16 : Find the top ten pages visited by all users (or specific user) in the last X seconds. ○ Q22 : Show the number of unique visitors in the X seconds. ● 1 new query performing simple monitoring operation: ○ Qmilk : Show the sold products (of a certain product or category) in the last X seconds. 7PABS 2018, Berlin, Germany, April 9
  • 8. 9. April 2018 Implementation and Limitations ● Q05, Q16 and Qmilk were successfully implemented. ○ Spark code: https://github.com/Taaffy/Structured-Streaming-Micro-Benchmark ● Q06 - performs join operations on two streaming datasets, which was not supported ○ however, the latest version supports stream-stream joins (https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache- spark-2-3.html ) ● Q22 - uses a distinct operation, sorting operation (Order By) and Limit keyword, which were not supported at the time of implementation 8 var web_logs_05 = web_logsDF .groupBy ( "wl _ item_id" ).count( ) .orderBy ( "count" ) .select ( "wl _item_id" , "count" ) .where ( "wl _ item _ id ␣ IS ␣ NOT ␣ NULL " ) var web_logs_16 = web_logsDF .groupBy ( " wl_webpage_name " ) . count ( ) .orderBy ( "count " ) .select ( " wl_webpage_name " , " count " ) .where ( " wl_webpage_name ␣ IS ␣ NOT ␣ NULL " ) var web _ sales _ milk = web_salesDF .groupBy ( "ws _ product _ id") . count ( ) .orderBy ( "ws _ product_ id") .select ( "ws _ product _ id " , " count ") .where ( "ws_product_ID ␣ IS ␣ NOT ␣ NULL “) PABS 2018, Berlin, Germany, April 9
  • 9. 9. April 2018 Benchmarking Setup ● Hardware Setup: ○ Intel Core i5 CPU 760 @3.47GHz x 4 ○ 8 GB RAM & 1 TB Hard Disk ● Software ○ Ubuntu 16.04 LTS OS ○ Java Version 1.8.0_131 & Scala Version 2.11.2 ○ Apache Spark 2.3 - Standalone with Default parameters 9PABS 2018, Berlin, Germany, April 9
  • 10. 9. April 2018 Data Preparation and Execution ● Generate data with the BigBench V2 data generator for scale factor 1 ○ web logs consisting of web clicks in JSON format (around 20GB) ○ web sales structured data (around 10MB) ● Web logs with variable file sizes ranging from 50MB to 2 GB were created. ● For every file size, 10 files were created with subsequent timestamps to simulate a stream of 10 data files. ● Q05 and Q16 were tested with varied web logs files. ● Qmilk was tested 10MB web sales file copied 10 times to ensure 10 files could be streamed successfully. 10PABS 2018, Berlin, Germany, April 9
  • 11. 9. April 2018 Performed Experiments • Query and File Size Combinations (total of 31 combinations) 11PABS 2018, Berlin, Germany, April 9
  • 12. 9. April 2018 Exploratory Analysis ● Warmup run for all experimental executions taken into account. (excluded from measures) ● After the second execution the average query latency becomes constant in all 3 cases. 12PABS 2018, Berlin, Germany, April 9
  • 13. 9. April 2018 Average Processing Times ● 3 groups of queries: ○ single executions - Q05 and Q16 ○ parallel pair executions - Q05, Q16 and Qmilk ○ triple parallel executions - Q05, Q16 and Qmilk ● In general, it can be said that the more queries are run in parallel and the larger the file sizes used, the longer these take, to be completed and the more resource intensive they are. 13PABS 2018, Berlin, Germany, April 9
  • 14. 9. April 2018 Other Observations ● Average Latency increased by more than 200% between the 50MB and 2 GB file sizes. ● Comparing single query latency to triple query latency, it increased 2 to 3-fold. ● CPU utilization was the highest at the 50MB level and decreased every time the file size was increased. ● The JAVA heap size increased for larger file sizes, making the memory size to be the limiting factor when using larger files. 14PABS 2018, Berlin, Germany, April 9
  • 15. 9. April 2018 Experimental Limitations ● File Creation Date Issue ○ It appeared that when all the streaming files were copied to the streaming directory and had the same file creation or modification date, Spark considered these to be the same file even if the files had different names (web-logs1, web-logs2, etc.). ● Files larger than 2GB and with more than 10 file streams should be investigated in future research for potential bottlenecks and peak performance. ● Spark in Cluster mode should be tested and evaluated with the end-to-end BigBench execution. 15PABS 2018, Berlin, Germany, April 9
  • 16. 9. April 2018 Lessons Learned ● Pros ○ Simple programing model and streaming API ○ Several built-in metrics ○ Several extraction and sink options for metrics ○ Possible to run queries in parallel ● Cons ○ Undocumented trigger process ○ Query limitations due to streaming API (Syntax) ○ Complicated metric extraction process 16PABS 2018, Berlin, Germany, April 9
  • 17. 9. April 2018 Future Work ● Extend the benchmark with more queries and implementations on other engines (Flink, Samza, etc.). ● Build automated benchmark kit with all executable components (as part of ABench). ● Perform experiments with larger file sizes on multi-node Spark cluster. 17PABS 2018, Berlin, Germany, April 9
  • 18. 9. April 2018 Thank you for your attention!
  • 19. 9. April 2018 References [1] Ahmad Ghazal, Todor Ivanov, Pekka Kostamaa, Alain Crolotte, Ryan Voong, Mohammed Al-Kateb, Waleed Ghazal, and Roberto V. Zicari. 2017. BigBench V2: The New and Improved BigBench. In ICDE 2017, San Diego, CA, USA, April 19-22. [2] Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, and Hans-Arno Jacobsen. 2013. BigBench: Towards An Industry Standard Benchmark for Big Data Analytics. In SIGMOD 2013. 1197–1208. 19PABS 2018, Berlin, Germany, April 9