Computing aggregates over windows is at the core of virtually every stream processing job. Typical stream processing applications involve overlapping windows and, therefore, cause redundant computations. Several techniques prevent this redundancy by sharing partial aggregates among windows. However, these techniques do not support out-of-order processing and session windows. Out-of-order processing is a key requirement to deal with delayed tuples in case of source failures such as temporary sensor outages. Session windows are widely used to separate different periods of user activity from each other. Current versions of Apache Flink use Window Buckets to process stream aggregations with session windows and out-of-order tuples. This approach does not share partial aggregates among overlapping windows. In our talk, we present Scotty, a high-throughput operator for window discretization and aggregation in Apache Flink. Scotty splits streams into non-overlapping slices and computes partial aggregates per slice. These partial aggregates are shared among all overlapping windows, including session windows. Scotty introduces the first slicing technique which (1) enables stream slicing for session windows in addition to tumbling and sliding windows and (2) processes out-of-order tuples efficiently. Scotty was first published at ICDE 2018 (http://www.user.tu-berlin.de/powibol/assets/publications/traub-scotty-icde-2018.pdf).
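The slicing idea described above can be illustrated with a minimal sketch. It is not Scotty's implementation: the function name `sliced_sliding_sums` is hypothetical, the aggregation is fixed to a sum, timestamps are assumed to be non-negative integers, and the window size is assumed to be a multiple of the slide so that all slices have equal length.

```python
from collections import defaultdict


def sliced_sliding_sums(events, size, slide):
    """Sum per sliding window, computed from per-slice partial sums.

    events: iterable of (timestamp, value) pairs with integer timestamps >= 0.
    size, slide: window length and slide step. This sketch assumes that size
    is a multiple of slide, so each window covers size // slide slices.
    Returns a dict mapping (window_start, window_end) to the window's sum.
    """
    if size % slide != 0:
        raise ValueError("this sketch assumes size is a multiple of slide")

    # One partial aggregate per non-overlapping slice of length `slide`.
    # A tuple is added to its slice by timestamp, so arrival order is irrelevant.
    partial = defaultdict(int)
    for ts, value in events:
        partial[ts // slide] += value
    if not partial:
        return {}

    slices_per_window = size // slide
    first, last = min(partial), max(partial)
    results = {}
    # Window k starts at k * slide and covers slices k .. k + slices_per_window - 1.
    # Windows near the beginning may start before timestamp 0.
    for k in range(first - slices_per_window + 1, last + 1):
        covered = range(k, k + slices_per_window)
        if any(s in partial for s in covered):
            # Each slice's partial sum is reused by every window that covers it.
            results[(k * slide, k * slide + size)] = sum(
                partial.get(s, 0) for s in covered
            )
    return results
```

With a window size of 10 and a slide of 5, each tuple is added to exactly one slice, and each slice sum is then read by two windows instead of each window re-aggregating its tuples.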
code.talks 2019 - Scotty: Efficient Window Aggregation for your Stream Proces... | Jonas Traub
This presentation was held at code.talks 2019 in Hamburg.
A video is available at: https://www.youtube.com/watch?v=K1y5dJvP1jM
Window aggregation is a core operation in data stream processing.
Stream processing systems such as Flink or Storm implement general aggregation techniques that perform poorly under specific workloads (e.g., sliding windows).
To address this, we present Scotty, a new, highly efficient window operator.
Scotty exploits specific workload properties such as the type of aggregation functions (e.g., invertible, associative), window types (e.g., sliding, sessions), windowing measures (e.g., time- or count-based), and stream (dis)order. This allows Scotty to outperform systems like Flink by up to one order of magnitude.
The talk has three parts:
First, we give an introduction to the semantics and implementations of window aggregations in modern stream processing systems.
Second, we discuss the design of Scotty and show why Scotty is able to outperform the default window operators of many stream processing systems.
Third, we give a hands-on introduction to Scotty and demonstrate how it can be integrated into standard Flink, Storm, or Beam stream processing pipelines.
Scotty and its connectors are available as open source (https://github.com/TU-Berlin-DIMA/scotty-window-processor), and contributions are highly welcome.
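To make the role of aggregation function properties concrete, here is a small sketch contrasting an invertible function (sum) with a non-invertible one (max) on a count-based sliding window. The function names are hypothetical and the code is only an illustration of the property, not of how Scotty handles either case.

```python
from collections import deque


def sliding_sum(values, n):
    """Sum over the last n values, updated incrementally.

    Sum is invertible: an evicted value can simply be subtracted again,
    so each step costs O(1) regardless of the window length n.
    """
    window, total, out = deque(), 0, []
    for v in values:
        window.append(v)
        total += v
        if len(window) > n:
            total -= window.popleft()  # apply the inverse for the evicted value
        out.append(total)
    return out


def sliding_max(values, n):
    """Max over the last n values.

    Max has no inverse: evicting the current maximum reveals nothing about
    the next-largest value, so this naive version rescans the window.
    """
    window, out = deque(maxlen=n), []
    for v in values:
        window.append(v)  # deque(maxlen=n) drops the oldest value automatically
        out.append(max(window))
    return out
```

For example, `sliding_sum([1, 2, 3, 4, 5], 3)` yields `[1, 3, 6, 9, 12]` with one addition and at most one subtraction per value, while `sliding_max` has to look at up to `n` values per step.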
FlinkForward Berlin 2019 - Scotty: Efficient Window Aggregation with General ... | Jonas Traub
Window aggregation is a core operation in data stream processing. Existing aggregation techniques focus on reducing latency, eliminating redundant computations, and minimizing memory usage.
However, each technique operates under different assumptions with respect to workload characteristics such as properties of aggregation functions (e.g., invertible, associative), window types (e.g., sliding, sessions), windowing measures (e.g., time- or count-based), and stream (dis)order. Violating the assumptions of a technique can render it unusable or drastically reduce its performance.
In this talk, we present Scotty, an implementation of a general stream slicing technique for window aggregation. This technique automatically adapts to workload characteristics to improve performance without sacrificing its general applicability. Our experiments show that Scotty outperforms alternative implementations, like the default window operator in Flink, by up to one order of magnitude.
Furthermore, we present how to use Scotty as a library in Flink, Storm, or Beam without changing the underlying stream processing system.
General stream slicing was first published at EDBT 2019 (http://www.user.tu-berlin.de/powibol/assets/publications/traub-efficient-window-aggregation-with-general-stream-slicing-edbt-2019.pdf), where it received the Best Paper Award.
The Scotty library and its connectors are available as open source (https://github.com/TU-Berlin-DIMA/scotty-window-processor), and contributions are highly welcome.
Definitely not Java! A Hands-on Introduction to Efficient Functional Programm... | Jonas Traub
This talk was presented at code.talks 2022 in Hamburg.
Learn more about code.talks at: https://codetalks.de/
Download Slides at: https://codetalks.de/speakers#talk-1128?event=7
Video recording on YouTube: https://www.youtube.com/watch?v=GC-hKOvQBus
In this session, we will provide an introduction to REASON, also known as ReasonML. At SAP, we use REASON to build our Master Data Integration (MDI) Service as part of SAP's Business Technology Platform. MDI is a highly scalable, distributed, multi-tenant system powered by functional programming. REASON is our language of choice for MDI: it lets you write simple, fast, and type-safe code. With REASON, you rarely have to annotate types, but everything gets checked for you through REASON’s type inference system, reducing bugs and increasing code maintainability. REASON translates to OCaml or JavaScript, providing access to both worlds and enabling it to run on NodeJS. During the session, we will highlight the advantages of the syntax compared to Java and introduce you to the beauty of immutability, tail recursion, and other relevant concepts. Our session targets programmers who are not yet familiar with functional programming, as well as functional programmers who want to learn more about REASON.
Efficient Data Stream Processing in the Internet of Things - SoftwareCampus A... | Jonas Traub
This talk was presented for the SoftwareCampus Alumni e.V. on 07.12.2020. For more information about the program, see https://softwarecampus-alumni.de/ and https://softwarecampus.de/
Abstract: The Internet of Things (IoT) consists of billions of devices which form a cloud of network-connected sensor nodes. These sensor nodes supply a vast number of data streams with massive amounts of sensor data. Real-time sensor data enables diverse applications including traffic-aware navigation, machine monitoring, and home automation. In this talk, we will dive into recent research which optimizes real-time data gathering and data analysis in the IoT. The talk will provide an overview of available techniques which can be deployed on sensor nodes, intermediate network nodes, and central analysis systems. We will look into the state-of-the-art in practice and research and make you aware of important tradeoffs in real-time IoT data analysis.
CV: Jonas Traub is a postdoctoral researcher at the Database Systems and Information Management group at TU Berlin. His main research interests include stream processing, sensor data analysis, and data acquisition techniques. In his PhD, he studied efficient data gathering, processing, and transmission in the IoT. His research shows that one can save up to 87% in sensor reads and data transfers by applying smart data reduction techniques on sensor nodes. He further introduced a demand-based control layer which optimizes the data acquisition from thousands of sensors. With his Scotty framework, he contributed a general aggregation technique for streaming systems which outperforms alternative solutions by an order of magnitude in throughput. His work received a Best Paper Award at the 22nd International Conference on Extending Database Technology (EDBT). Prior to his work at TU Berlin, he studied at KTH Stockholm and DHBW Stuttgart and worked for several years at IBM in Germany and the USA. Jonas is an alumnus of Software Campus, where he worked with SAP as his industry partner.
Analyzing Efficient Stream Processing on Modern Hardware (VLDB 2019 Presentat... | Jonas Traub
This talk by Steffen Zeuch presents our VLDB 2019 paper "Analyzing Efficient Stream Processing on Modern Hardware".
The paper shows the potential of hardware-tailored code compilation and data ingestion at memory speed for a scale-up stream processing engine (SPE). We analyze state-of-the-art streaming systems and identify sources of inefficiency. We investigate the data-related and processing-related design space and derive design changes for streaming systems to exploit modern hardware more efficiently. In order to scale up efficiently, one should avoid managed runtimes, use a compilation-based approach to produce hardware-tailored code, avoid queues and use operator fusion, and use late merge instead of partitioning.
The full paper with all our findings is available online:
http://www.vldb.org/pvldb/vol12/p516-zeuch.pdf
Database Research at TU Berlin DIMA and DFKI IAM - USA Excursion Slides 2019 | Jonas Traub
In April 2019, we went on an excursion to the USA and presented selected publications of the TU Berlin DIMA and the DFKI IAM research groups. This slide set contains the four teaser talks which we presented on the tour:
1) Jonas Traub: Optimized On-Demand Data Streaming from Sensor Nodes
2) Sebastian Breß: Generating Custom Code for Efficient Query Execution on Heterogeneous Processors
3) Martin Kiefer: Estimating Join Selectivities using Bandwidth Optimized Kernel Density Models
4) Andreas Kunft: BlockJoin: Efficient Matrix Partitioning through Joins
Efficient Window Aggregation with General Stream Slicing (EDBT 2019, Best Paper) | Jonas Traub
The paper "Efficient Window Aggregation with General Stream Slicing" by Jonas Traub, Philipp M. Grulich, Alejandro Rodriguez Cuellar, Sebastian Breß, Tilmann Rabl, and Volker Markl was selected as best paper of the International Conference on Extending Database Technology (EDBT) 2019.
Abstract:
Window aggregation is a core operation in data stream processing. Existing aggregation techniques focus on reducing latency, eliminating redundant computations, and minimizing memory usage. However, each technique operates under different assumptions with respect to workload characteristics such as properties of aggregation functions (e.g., invertible, associative), window types (e.g., sliding, sessions), windowing measures (e.g., time- or count-based), and stream (dis)order. Violating the assumptions of a technique can render it unusable or drastically reduce its performance.
In this paper, we present the first general stream slicing technique for window aggregation. General stream slicing automatically adapts to workload characteristics to improve performance without sacrificing its general applicability. As a prerequisite, we identify workload characteristics which affect the performance and applicability of aggregation techniques. Our experiments show that general stream slicing outperforms alternative concepts by up to one order of magnitude.
Resense: Transparent Record and Replay of Sensor Data in the Internet of Thin... | Jonas Traub
Resense: Transparent Record and Replay of Sensor Data in the Internet of Things
Resense was presented as a demonstration at the 22nd International Conference on Extending Database Technology (EDBT) in Lisbon, Portugal, and received the Best Demonstration Award (http://edbticdt2019.inesc-id.pt/?awards_demo_edbt).
AUTHORS:
Dimitrios Giouroukis, Julius Hülsmann, Janis von Bleichert, Morgan Geldenhuys, Tim Stullich, Felipe Oliveira Gutierrez, Jonas Traub, Kaustubh Beedkar, Volker Markl
ABSTRACT
As the scientific interest in the Internet of Things (IoT) continues to grow, emulating IoT infrastructure involving a large number of heterogeneous sensors plays a crucial role. Existing research on emulating sensors is often tailored to specific hardware and/or software, which makes it difficult to reproduce and extend. In this paper, we show how to emulate different kinds of sensors in a unified way that makes the downstream application agnostic as to whether the sensor data is acquired from real sensors or is read from memory using emulated sensors. We propose the Resense framework, which allows for replaying sensor data using emulated sensors and provides easy-to-use software for setting up and executing IoT experiments involving a large number of heterogeneous sensors. We demonstrate various aspects of Resense in the context of a sports analytics application using real-world sensor data and a set of Raspberry Pis.
Scotty: Efficient Window Aggregation for Out-of-Order Stream Processing | Jonas Traub
This poster was presented at ICDE 2018.
Abstract: Computing aggregates over windows is at the core of virtually every stream processing job. Typical stream processing applications involve overlapping windows and, therefore, cause redundant computations. Several techniques prevent this redundancy by sharing partial aggregates among windows. However, these techniques do not support out-of-order processing and session windows. Out-of-order processing is a key requirement to deal with delayed tuples in case of source failures such as temporary sensor outages. Session windows are widely used to separate different periods of user activity from each other.
In this paper, we present Scotty, a high-throughput operator for window discretization and aggregation. Scotty splits streams into non-overlapping slices and computes partial aggregates per slice. These partial aggregates are shared among all concurrent queries with arbitrary combinations of tumbling, sliding, and session windows. Scotty introduces the first slicing technique which (1) enables stream slicing for session windows in addition to tumbling and sliding windows and (2) processes out-of-order tuples efficiently. Our technique is generally applicable to a broad group of dataflow systems which use a unified batch and stream processing model. Our experiments show that we achieve a throughput an order of magnitude higher than alternative state-of-the-art solutions.
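Why session windows and out-of-order tuples are harder to handle than fixed-size windows can be seen in a small sketch. The class below is hypothetical and is not taken from Scotty. It assumes a sum aggregation and defines two tuples as belonging to the same session when their timestamps are less than `gap` apart. A late tuple can then extend a session or bridge two sessions, whose partial aggregates are merged rather than recomputed.

```python
import bisect


class SessionSums:
    """Session windows with an inactivity gap, tolerant of out-of-order tuples.

    Each session is a list [first_ts, last_ts, partial_sum]. Sessions are kept
    sorted by first_ts and are always at least `gap` apart from each other.
    """

    def __init__(self, gap):
        self.gap = gap
        self.sessions = []

    def add(self, ts, value):
        # Index of the first session that starts after ts. Rebuilding the key
        # list is O(n); a real operator would use a better index structure.
        i = bisect.bisect_right([s[0] for s in self.sessions], ts)
        prev = self.sessions[i - 1] if i > 0 else None
        nxt = self.sessions[i] if i < len(self.sessions) else None

        joins_prev = prev is not None and ts - prev[1] < self.gap
        joins_next = nxt is not None and nxt[0] - ts < self.gap

        if joins_prev and joins_next:
            # The tuple closes the gap between two sessions: merge their
            # partial aggregates instead of re-aggregating all tuples.
            prev[1] = nxt[1]
            prev[2] += nxt[2] + value
            del self.sessions[i]
        elif joins_prev:
            prev[1] = max(prev[1], ts)
            prev[2] += value
        elif joins_next:
            nxt[0] = ts  # the session now starts earlier
            nxt[2] += value
        else:
            self.sessions.insert(i, [ts, ts, value])
```

With `gap=5`, adding tuples at timestamps 1, 3, 10, and 20 creates three sessions. A late tuple at timestamp 6 is within the gap of both the first and the second session and merges them into one session covering 1 to 10.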
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W... | Jonas Traub
Paper: Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive Windowing
Abstract: Machine learning techniques for data stream analysis suffer from concept drifts such as changed user preferences, varying weather conditions, or economic changes. These concept drifts cause wrong predictions and lead to incorrect business decisions. Concept drift detection methods such as adaptive windowing (Adwin) allow for adapting to concept drifts on the fly.
In this paper, we examine Adwin in detail and point out its throughput bottlenecks. We then introduce several parallelization alternatives to address these bottlenecks. Our optimizations lead to a speedup of two orders of magnitude over the original Adwin implementation. Thus, we explore parallel adaptive windowing to provide scalable concept drift detection for high-velocity data streams with millions of tuples per second.
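For readers unfamiliar with adaptive windowing, the following is a deliberately simplified, sequential sketch of the basic idea: keep a window of recent values and shrink it whenever an older and a newer part of the window differ significantly in their means. The function names are hypothetical, values are assumed to lie in [0, 1], and the threshold is the Hoeffding-style bound commonly given for Adwin. The sketch checks every split point for every tuple and drops the older part at the first significant split, which is a simplification of the original algorithm and says nothing about the parallelization alternatives studied in the paper.

```python
import math


def adwin_cut(window, delta=0.002):
    """Return the first split index with a significant mean difference, or None.

    The window is split into an older part window[:i] and a newer part
    window[i:]. A split counts as significant if the absolute difference of
    the two means reaches a Hoeffding-style threshold. Values must be in [0, 1].
    """
    n = len(window)
    total = sum(window)
    head = 0.0
    for i in range(1, n):
        head += window[i - 1]
        n0, n1 = i, n - i
        mean0, mean1 = head / n0, (total - head) / n1
        m = 1.0 / (1.0 / n0 + 1.0 / n1)  # harmonic-mean-style size term
        eps = math.sqrt(math.log(4.0 * n / delta) / (2.0 * m))
        if abs(mean0 - mean1) >= eps:
            return i
    return None


def detect_drifts(stream, delta=0.002):
    """Return the stream positions at which the window was shrunk."""
    window, drifts = [], []
    for t, x in enumerate(stream):
        window.append(x)
        cut = adwin_cut(window, delta)
        if cut is not None:
            drifts.append(t)
        while cut is not None:
            del window[:cut]  # forget the older, now outdated part
            cut = adwin_cut(window, delta)
    return drifts
```

On a stream of 100 zeros followed by 100 ones, this sketch reports a drift a few tuples after position 100, while a constant stream never triggers a cut. The per-tuple scan over all split points makes the sketch quadratic in the window length.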
Efficient SIMD Vectorization for Hashing in OpenCL | Jonas Traub
This poster was presented at the 21st International Conference on Extending Database Technology (EDBT), March 26-29, 2018.
Paper: Efficient SIMD Vectorization for Hashing in OpenCL
Abstract: Hashing is at the core of many efficient database operators such as hash-based joins and aggregations. Vectorization is a technique that uses Single Instruction Multiple Data (SIMD) instructions to process multiple data elements at once. Applying vectorization to hash tables results in promising speedups for build and probe operations. However, vectorization typically requires intrinsics, i.e., low-level APIs in which functions map to processor-specific SIMD instructions. Intrinsics are specific to a processor architecture and result in complex and difficult-to-maintain code. OpenCL is a parallel programming framework which provides a higher abstraction level than intrinsics and is portable to different processors. Thus, OpenCL avoids processor dependencies, which results in improved code maintainability. In this paper, we add efficient, vectorized hashing primitives to OpenCL. Our results show that OpenCL-based vectorization is competitive with intrinsics on CPUs but not on Xeon Phi coprocessors.
UZH Stream Reasoning Workshop 2018: Optimized On-Demand Data Streaming from S... | Jonas Traub
About the Workshop:
The Stream Reasoning Workshop took place from January 16th to 17th, 2018.
Processing, querying and reasoning over streaming data is studied in different communities such as KR&R, Semantic Web, Databases, Stream Processing, Complex Event Processing, etc., where researchers have different perspectives and face different challenges.
This workshop aims at advancing Stream Reasoning as research theme by bringing together these different views and goals. In addition to invited talks, the workshop will provide opportunities for all participants to engage in discussions on open problems and future directions.
(http://www.ifi.uzh.ch/en/ddis/events/streamreasoning2018.html)
About the Talk:
Real-time sensor data enables diverse applications such as smart metering, traffic monitoring, and sports analysis. In the Internet of Things, billions of sensor nodes form a sensor cloud and offer data streams to analysis systems. However, it is impossible to transfer all available data with maximal frequencies to all applications. Therefore, we need to tailor data streams to the demand of applications. We contribute a technique that optimizes communication costs while maintaining the desired accuracy. Our technique schedules reads across a large number of sensors based on the data demands of a large number of concurrent queries. We introduce user-defined sampling functions that define the data demand of queries and facilitate various adaptive sampling techniques, which decrease the amount of transferred data. Moreover, we share sensor reads and data transfers among queries. Our experiments with real-world data show that our approach saves up to 87% in data transmissions.
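The benefit of sharing sensor reads among queries can be illustrated with a small sketch. It rests on a strong assumption that is not stated in the abstract: each query's data demand for its next sample is reduced to a tolerated time interval `(earliest, latest)`. The function name is hypothetical, and the greedy strategy is the textbook algorithm for covering intervals with a minimum number of points, not the scheduler from the talk.

```python
def schedule_shared_reads(requests):
    """Pick as few read times as possible so that every request is served.

    requests: list of (earliest, latest) tolerated read times, one per query.
    Returns a sorted list of read times such that each interval contains at
    least one of them.
    """
    reads = []
    # Process requests by deadline. A new read is only needed if the most
    # recent read happened before the current request's interval starts.
    for earliest, latest in sorted(requests, key=lambda r: r[1]):
        if not reads or reads[-1] < earliest:
            # Reading as late as allowed maximizes the chance that the same
            # read also falls into the intervals of later requests.
            reads.append(latest)
    return reads
```

For the four requests `[(0, 4), (2, 6), (5, 9), (8, 10)]`, the sketch schedules only two reads, at times 4 and 9, instead of one read per query.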
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ... | Jonas Traub
This slide set was presented at UCSB on Sep. 30, 2017.
The talk covers an extended version of the slides from SoCC 2017 plus a quick overview of Apache Flink.
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ... | Jonas Traub
We present I², an interactive development environment for real-time analysis pipelines, which is based on Apache Flink and Apache Zeppelin. The sheer amount of available streaming data frequently makes it impossible to visualize all data points at the same time. I² coordinates running Flink jobs and corresponding visualizations such that only the currently depicted data points are processed in Flink and transferred to the front end. We show how Flink jobs can adapt to changed visualization properties at runtime to allow interactive data exploration on high-bandwidth data streams. Moreover, we present a data reduction technique which minimizes data transfer while providing loss-free time-series plots. We show I² in a live demonstration in which we replay recorded sensor data from a football match (ca. 12k events/s). I² was first presented at EDBT'17, where it received the best demonstration award. The demonstration is available as open source at github.com/TU-Berlin-DIMA/i2.
I²: Interactive Real-Time Visualization for Streaming Data | Jonas Traub
This is our poster for the demonstration "I²: Interactive Real-Time Visualization for Streaming Data" which was awarded as best demonstration at EDBT 2017.
The paper and the source code are available on GitHub: https://github.com/TU-Berlin-DIMA/i2
LWA 2015: The Apache Flink Platform (Poster) | Jonas Traub
This is our poster for the German paper "Die Apache Flink Plattform zur parallelen Analyse von Datenströmen und Stapeldaten", which was published in Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. Trier, Germany, 7.-9. October 2015. Link: http://ceur-ws.org/Vol-1458/H02_CRC79_Traub.pdf
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis | Jonas Traub
This is our presentation for the German paper "Die Apache Flink Plattform zur parallelen Analyse von Datenströmen und Stapeldaten", which was published in Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. Trier, Germany, 7.-9. October 2015. Link: http://ceur-ws.org/Vol-1458/H02_CRC79_Traub.pdf
Scotty: Efficient Window Aggregation for Out-of-Order Stream Processing - Jonas Traub
This poster was presented at ICDE 2018.
Abstract: Computing aggregates over windows is at the core of virtually every stream processing job. Typical stream processing applications involve overlapping windows and, therefore, cause redundant computations. Several techniques prevent this redundancy by sharing partial aggregates among windows. However, these techniques do not support out-of-order processing and session windows. Out-of-order processing is a key requirement to deal with delayed tuples in case of source failures such as temporary sensor outages. Session windows are widely used to separate different periods of user activity from each other.
In this paper, we present Scotty, a high-throughput operator for window discretization and aggregation. Scotty splits streams into non-overlapping slices and computes partial aggregates per slice. These partial aggregates are shared among all concurrent queries with arbitrary combinations of tumbling, sliding, and session windows. Scotty introduces the first slicing technique which (1) enables stream slicing for session windows in addition to tumbling and sliding windows and (2) processes out-of-order tuples efficiently. Our technique is generally applicable to a broad group of dataflow systems which use a unified batch and stream processing model. Our experiments show that we achieve a throughput that is an order of magnitude higher than that of alternative state-of-the-art solutions.
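The slicing idea in the abstract can be illustrated with a minimal Java sketch. It is not Scotty's implementation: the class `SliceSketch`, the fixed slice length, and the restriction to a sum aggregate are assumptions made for illustration. Each tuple updates only the one slice that covers its event time, so an out-of-order tuple needs no special handling here, and every window reuses the same per-slice partial sums. Window boundaries are assumed to align with slice boundaries.

```java
import java.util.TreeMap;

/** Illustrative sketch only: fixed-length slices, one partial sum per slice. */
final class SliceSketch {
    // Assumed fixed slice length, in the same unit as the event times.
    private final long sliceLength;
    // Slice start -> partial sum of all tuples that fall into that slice.
    private final TreeMap<Long, Long> partialSums = new TreeMap<>();

    SliceSketch(long sliceLength) {
        this.sliceLength = sliceLength;
    }

    /** Adds a tuple to the slice covering its event time; arrival order does not matter. */
    void add(long eventTime, long value) {
        long sliceStart = Math.floorDiv(eventTime, sliceLength) * sliceLength;
        partialSums.merge(sliceStart, value, Long::sum);
    }

    /** Combines the partial sums of all slices that start inside [start, end). */
    long windowSum(long start, long end) {
        long sum = 0;
        for (long partial : partialSums.subMap(start, end).values()) {
            sum += partial;
        }
        return sum;
    }

    public static void main(String[] args) {
        SliceSketch slices = new SliceSketch(10);
        slices.add(4, 3);
        slices.add(55, 6);
        slices.add(66, 1);
        slices.add(15, 6); // out-of-order tuple, lands in the slice starting at 10
        System.out.println(slices.windowSum(0, 60));  // window [0, 60)
        System.out.println(slices.windowSum(10, 70)); // window [10, 70)
    }
}
```

With slices of 10 time units, the two overlapping windows read the same stored partial sums: the sketch prints 15 for window [0, 60) and 13 for window [10, 70).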
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W... - Jonas Traub
Paper: Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive Windowing
Abstract: Machine learning techniques for data stream analysis suffer from concept drifts such as changed user preferences, varying weather conditions, or economic changes. These concept drifts cause wrong predictions and lead to incorrect business decisions. Concept drift detection methods such as adaptive windowing (Adwin) allow for adapting to concept drifts on the fly.
In this paper, we examine Adwin in detail and point out its throughput bottlenecks. We then introduce several parallelization alternatives to address these bottlenecks. Our optimizations lead to a speedup of two orders of magnitude over the original Adwin implementation. Thus, we explore parallel adaptive windowing to provide scalable concept detection for high-velocity data streams with millions of tuples per second.
Efficient SIMD Vectorization for Hashing in OpenCL - Jonas Traub
This poster was presented at the 21st International Conference on Extending Database Technology (EDBT), March 26-29, 2018.
Paper: Efficient SIMD Vectorization for Hashing in OpenCL
Abstract: Hashing is at the core of many efficient database operators such as hash-based joins and aggregations. Vectorization is a technique that uses Single Instruction Multiple Data (SIMD) instructions to process multiple data elements at once. Applying vectorization to hash tables results in promising speedups for build and probe operations. However, vectorization typically requires intrinsics, low-level APIs in which functions map to processor-specific SIMD instructions. Intrinsics are specific to a processor architecture and result in complex code that is difficult to maintain. OpenCL is a parallel programming framework which provides a higher abstraction level than intrinsics and is portable to different processors. Thus, OpenCL avoids processor dependencies, which results in improved code maintainability. In this paper, we add efficient, vectorized hashing primitives to OpenCL. Our results show that OpenCL-based vectorization is competitive with intrinsics on CPUs but not on Xeon Phi coprocessors.
UZH Stream Reasoning Workshop 2018: Optimized On-Demand Data Streaming from S... - Jonas Traub
About the Workshop:
The Stream Reasoning Workshop took place from January 16th to 17th, 2018.
Processing, querying and reasoning over streaming data is studied in different communities such as KR&R, Semantic Web, Databases, Stream Processing, Complex Event Processing, etc., where researchers have different perspectives and face different challenges.
This workshop aims at advancing Stream Reasoning as research theme by bringing together these different views and goals. In addition to invited talks, the workshop will provide opportunities for all participants to engage in discussions on open problems and future directions.
(http://www.ifi.uzh.ch/en/ddis/events/streamreasoning2018.html)
About the Talk:
Real-time sensor data enables diverse applications such as smart metering, traffic monitoring, and sports analysis. In the Internet of Things, billions of sensor nodes form a sensor cloud and offer data streams to analysis systems. However, it is impossible to transfer all available data at maximal frequency to all applications. Therefore, we need to tailor data streams to the demand of applications. We contribute a technique that optimizes communication costs while maintaining the desired accuracy. Our technique schedules reads across a huge number of sensors based on the data demands of a huge number of concurrent queries. We introduce user-defined sampling functions that define the data demand of queries and facilitate various adaptive sampling techniques, which decrease the amount of transferred data. Moreover, we share sensor reads and data transfers among queries. Our experiments with real-world data show that our approach saves up to 87% in data transmissions.
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ... - Jonas Traub
This slide set was presented at UCSB on Sep. 30, 2017.
The talk covers an extended version of the slides from SoCC 2017 plus a quick overview of Apache Flink.
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ... - Jonas Traub
We present I², an interactive development environment for real-time analysis pipelines, which is based on Apache Flink and Apache Zeppelin. The sheer amount of available streaming data frequently makes it impossible to visualize all data points at the same time. I² coordinates running Flink jobs and corresponding visualizations such that only the currently depicted data points are processed in Flink and transferred to the front end. We show how Flink jobs can adapt to changed visualization properties at runtime to allow interactive data exploration on high-bandwidth data streams. Moreover, we present a data reduction technique which minimizes data transfer while providing loss-free time-series plots. We show I² in a live demonstration in which we replay recorded sensor data from a football match (ca. 12k events/s). I² was first presented at EDBT'17, where it received the best demonstration award. The demonstration is available as open source at github.com/TU-Berlin-DIMA/i2.
Flink Forward 2018: Efficient Window Aggregation with Stream Slicing
1. Efficient Window Aggregation with Stream Slicing
Berlin, September 3-5, 2018
Philipp M. Grulich
Research Assistant (DFKI)
Jonas Traub
Research Associate (TU Berlin)
2. Jonas Traub (TU Berlin), Philipp M. Grulich (DFKI) - Efficient Window Aggregation with Stream Slicing
Stream Slicing Example
12. Stream Slicing Research
CIKM 2016
ICDE 2018
15. Flink Windowing Bottlenecks
Example Query:
.window(SlidingEventTimeWindows.of(Time.minutes(1), Time.seconds(10)))
.sum()
Processing with Buckets:
Events <event time, value> --> Buckets <start:end, sum>
<4,3> --> <0:60, 3>
<15,6> --> <0:60, 9>, <10:70, 6>
<55,6> --> <0:60, 15>, <10:70, 12>, ...
<66,1> --> <10:70, 13>, ..., <60:120, 1>
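The bucket processing above can be sketched in plain Java without any Flink dependency. The class `BucketSketch` and its methods are hypothetical; the loop is modeled on how a sliding window assigner maps a timestamp to every window that contains it, with window size and slide given in seconds. Unlike the slide, the sketch also creates the buckets that start before time 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Illustrative sketch only: one running sum per window (bucket), no sharing. */
final class BucketSketch {
    private final long size;  // window length in seconds
    private final long slide; // slide length in seconds
    // Window start -> running sum of that window's bucket.
    private final TreeMap<Long, Long> bucketSums = new TreeMap<>();

    BucketSketch(long size, long slide) {
        this.size = size;
        this.slide = slide;
    }

    /** Start times of all windows [start, start + size) that contain eventTime. */
    List<Long> windowStarts(long eventTime) {
        List<Long> starts = new ArrayList<>();
        long lastStart = Math.floorDiv(eventTime, slide) * slide;
        for (long start = lastStart; start > eventTime - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    /** Calls the aggregation once per bucket the event belongs to. */
    void add(long eventTime, long value) {
        for (long start : windowStarts(eventTime)) {
            bucketSums.merge(start, value, Long::sum);
        }
    }

    long bucketSum(long windowStart) {
        return bucketSums.getOrDefault(windowStart, 0L);
    }

    public static void main(String[] args) {
        BucketSketch buckets = new BucketSketch(60, 10);
        buckets.add(4, 3);
        buckets.add(15, 6);
        buckets.add(55, 6);
        buckets.add(66, 1);
        System.out.println("updates per event: " + buckets.windowStarts(66).size());
        System.out.println("<0:60, " + buckets.bucketSum(0) + ">");
        System.out.println("<10:70, " + buckets.bucketSum(10) + ">");
        System.out.println("<60:120, " + buckets.bucketSum(60) + ">");
    }
}
```

Every event triggers six bucket updates in this configuration, which is the redundancy that sharing partial aggregates among windows avoids.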
25. Flink Windowing Bottlenecks
Number of Buckets = Window Length / Slide Length
SlidingEventTimeWindows.of(Time.minutes(1), Time.seconds(10)) --> 6 Buckets
SlidingEventTimeWindows.of(Time.days(1), Time.seconds(10)) --> 8640 Buckets
Overlapping windows cause:
● Every event is assigned to many windows.
● Repeated aggregations --> aggregation function is called on every window
● High memory consumption --> especially for windows without incremental aggregation
● Check for merging windows
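The bucket formula above can be checked with a few lines of Java. `BucketCount.bucketsPerEvent` is a hypothetical helper, and it assumes that the window length is a multiple of the slide length.

```java
import java.time.Duration;

/** Illustrative helper: Number of Buckets = Window Length / Slide Length. */
final class BucketCount {
    static long bucketsPerEvent(Duration windowLength, Duration slideLength) {
        // Integer division; assumes windowLength is a multiple of slideLength.
        return windowLength.toMillis() / slideLength.toMillis();
    }

    public static void main(String[] args) {
        Duration slide = Duration.ofSeconds(10);
        System.out.println(bucketsPerEvent(Duration.ofMinutes(1), slide)); // one-minute window
        System.out.println(bucketsPerEvent(Duration.ofDays(1), slide));    // one-day window
    }
}
```

A one-day window contains 86,400 seconds, so a 10-second slide yields 8640 buckets that each event has to update.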
33. Architecture Overview
39. Session Window Aggregate Sharing
42. Out-of-Order Processing and Sessions
49. Stream Slicing Performance
51. Runtime-Dynamic Windows
.dynamicWindow(windowDefinitionStream)
.sum()
Inputs: Event Stream; Window Definition Stream: <WindowDefinition>
Operator: Dynamic Window Operator
Output Stream: <Window, Agg>
56. From Research to Production
● Implement complete fault-tolerance and state-management
● State migration
○ Hard limitation: Aggregated buckets in state snapshots cannot be migrated
● Sophisticated testing
How to expose multi-windows and dynamic-windows to users?
62. Wrap-Up
Scotty Features:
- stream slicing
- pre-aggregation
- aggregate sharing
- out-of-order processing
- session window support
- multi-window support
- runtime-dynamic window support
Let’s bring it to production!
JIRA: [FLINK-7001]
This talk is supported by the European Union Horizon 2020 Projects Proteus (687691), Streamline (688191), SAGE (671500), and E2Data (780245) and by the German Ministry for Education and Research as Berlin Big Data Center (01IS14013A) and Software Campus (01IS12056).