The paper "Efficient Window Aggregation with General Stream Slicing" by Jonas Traub, Philipp M. Grulich, Alejandro Rodriguez Cuellar, Sebastian Breß, Tilmann Rabl, and Volker Markl was selected as the best paper of the International Conference on Extending Database Technology (EDBT) 2019.
Abstract:
Window aggregation is a core operation in data stream processing. Existing aggregation techniques focus on reducing latency, eliminating redundant computations, and minimizing memory usage. However, each technique operates under different assumptions with respect to workload characteristics such as properties of aggregation functions (e.g., invertible, associative), window types (e.g., sliding, sessions), windowing measures (e.g., time- or count-based), and stream (dis)order. Violating the assumptions of a technique can render it unusable or drastically reduce its performance.
In this paper, we present the first general stream slicing technique for window aggregation. General stream slicing automatically adapts to workload characteristics to improve performance without sacrificing its general applicability. As a prerequisite, we identify workload characteristics which affect the performance and applicability of aggregation techniques. Our experiments show that general stream slicing outperforms alternative concepts by up to one order of magnitude.
Aggregate Sharing for User-Defined Data Stream Windows - Paris Carbone
Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows was extensively studied in the past through the use of aggregate sharing techniques such as Panes and Pairs, little to no work has been put into optimizing the aggregation of very common, non-periodic windows. Typical examples of non-periodic windows are punctuations and sessions, which can implement complex business logic and are often expressed as user-defined operators on platforms such as Google Dataflow or Apache Storm. The aggregation of such non-periodic or user-defined windows either falls back to expensive, best-effort aggregate sharing methods, or is not optimized at all.
In this paper we present a technique to perform efficient aggregate sharing for data stream windows that are declared as user-defined functions (UDFs) and can contain arbitrary business logic. To this end, we first introduce the concept of User-Defined Windows (UDWs), a simple, UDF-based programming abstraction that allows users to programmatically define custom windows. We then define semantics for UDWs, based on which we design Cutty, a low-cost aggregate sharing technique. Cutty improves and outperforms the state of the art for aggregate sharing on single and multiple queries. Moreover, it enables aggregate sharing for a broad class of non-periodic UDWs. We implemented our techniques on Apache Flink, an open source stream processing system, and performed experiments demonstrating orders of magnitude of reduction in aggregation costs compared to the state of the art.
Cloud-based dynamic distributed optimisation of integrated process planning a... - Piotr Dziurzanski
A presentation of the paper developed in the SAFIRE project titled "Cloud-based dynamic distributed optimisation of integrated process planning and scheduling in smart factories", delivered at the Genetic and Evolutionary Computation Conference (GECCO) in Prague, Czech Republic, in July 2019.
A Study on Process Improvement in the Assembly Line of Switch Manufacturing - ijceronline
The paper is about process improvement in the assembly line at a switch manufacturing company, focusing on the areas of process flow, time study, and rework minimization. These improvements are made using a cause-and-effect diagram, the critical path method, and root cause analysis. The analysis will help to reduce the amount of rework that occurs during the manufacturing of modular switches in the assembly line process.
This paper is highlighted as the agile tasking simulations marking of contingent outcomes, essential to notes of the financial plan and to assessing the means to report on it. We found the starting position for the cross-over and we learned how to foil the glass car.
Big data ET models & benchmarking with distributed OSGEO tools - Hirofumi Hayashi
FOSS4G2013 TOKYO keynote: Big data ET models & benchmarking with distributed OSGEO tools
Dr. Yann Chemin (OSGeo Charter Member, International Water Management Institute (IWMI))
The design series focused on simulations and specifically Part III: Confluent Scale. We began by highlighting the vector requirements for tasking in our current model and the need for lens facets for scale differentiation. Then in Part II define simulations reporting. In Part III we began to move past forms and mix to the affluent modeling of the Ark Science of Glass in the Advancing Age to come.
Cloud-based Integrated Process Planning and Scheduling Optimisation via Asyn... - Piotr Dziurzanski
A presentation of the paper developed in the SAFIRE project titled "Cloud-based Integrated Process Planning and Scheduling Optimisation via Asynchronous Islands", delivered at the 16th International Conference on Economics of Grids, Clouds, Systems & Services, 17-19 September 2019, Leeds, UK
Every release has some big features which get all the attention they deserve. Just like the previous years we're going to take a look at some of the smaller features. One of them might be the thing that you really need for a certain website or it may be the feature that your editors have missed for so long.
Comments on Simulations Project Parts I & II Marking Contingencies.pdf - Brij Consulting, LLC
This paper is highlighted as the agile tasking simulations marking of contingent outcomes, essential to notes of the financial plan and to assessing the means to report on it. We continue with Lift Data to crossover to the Energy Star. And after working on Benchmarking the System delivery with instances we will continue with Part III to foil the glass car with a wing for stabilization.
Talk about frontend performance on the web. Awesome figures, pillars of performance, some terminology, 12 quick wins and a couple of web sources to continue your endeavor.
Comparison of CESAR energy simulation results with real data and a private co... - Andrea Silvagni
The Combined Energy Simulation And Retrofitting (CESAR) tool developed by EMPA in Zurich simulates the energy demand of large clusters of buildings, with the aim of aiding policy interventions and retrofitting strategies.
In this project we analyze the accuracy and precision of the simulation results against real data.
Definitely not Java! A Hands-on Introduction to Efficient Functional Programm... - Jonas Traub
This talk was presented at code.talks 2022 in Hamburg.
Learn more about code.talks at: https://codetalks.de/
Download Slides at: https://codetalks.de/speakers#talk-1128?event=7
Video recording on Youtube: https://www.youtube.com/watch?v=GC-hKOvQBus
In this session, we will provide an introduction to REASON, also known as ReasonML. At SAP, we use REASON to build our Master Data Integration (MDI) Service as part of SAP's Business Technology Platform. MDI is a highly scalable, distributed, multi-tenant system powered by functional programming. REASON is our language of choice for MDI: it lets you write simple, fast, and type-safe code. With REASON, you rarely have to annotate types, but everything gets checked for you through REASON’s type inference system, reducing bugs and increasing code maintainability. REASON translates to OCaml or JavaScript, providing access to both worlds and enabling it to run on NodeJS. During the session, we will highlight the advantages of the syntax compared to Java and introduce you to the beauty of immutability, tail recursion, and other relevant concepts. Our session targets programmers who are not yet familiar with functional programming, as well as functional programmers who want to learn more about REASON.
Efficient Data Stream Processing in the Internet of Things - SoftwareCampus A... - Jonas Traub
This talk was presented for the SoftwareCampus Alumni e.V. on 07.12.2020. For more information about the program, see https://softwarecampus-alumni.de/ and https://softwarecampus.de/
Abstract: The Internet of Things (IoT) consists of billions of devices which form a cloud of network-connected sensor nodes. These sensor nodes supply a vast number of data streams with massive amounts of sensor data. Real-time sensor data enables diverse applications including traffic-aware navigation, machine monitoring, and home automation. In this talk, we will dive into recent research which optimizes real-time data gathering and data analysis in the IoT. The talk will provide an overview of available techniques which can be deployed on sensor nodes, intermediate network nodes, and central analysis systems. We will look into the state-of-the-art in practice and research and make you aware of important tradeoffs in real-time IoT data analysis.
CV: Jonas Traub is a postdoctoral researcher at the Database Systems and Information Management group at TU Berlin. His main research interests include stream processing, sensor data analysis, and data acquisition techniques. In his PhD, he studied efficient data gathering, processing, and transmission in the IoT. His research shows that one can save up to 87% in sensor reads and data transfers by applying smart data reduction techniques on sensor nodes. He further introduced a demand-based control layer which optimizes the data acquisition from thousands of sensors. With his Scotty framework, he contributed a general aggregation technique for streaming systems which outperforms alternative solutions by an order of magnitude in throughput. His work received a Best Paper Award at the 22nd International Conference on Extending Database Technology (EDBT). Prior to his work at TU Berlin, he studied at KTH Stockholm and DHBW Stuttgart and worked for several years at IBM in Germany and the USA. Jonas is an alumnus of Software Campus, where he worked with SAP as an industry partner.
code.talks 2019 - Scotty: Efficient Window Aggregation for your Stream Proces... - Jonas Traub
This presentation was held at Code.Talks 2019 in Hamburg.
A video is available at: https://www.youtube.com/watch?v=K1y5dJvP1jM
Window aggregation is a core operation in data stream processing.
Stream Processing Systems, like Flink or Storm, implement general aggregation techniques which perform poorly under specific workloads (e.g. Sliding Windows).
To this end, we present Scotty, a new highly-efficient window operator.
Scotty exploits specific workload properties such as the type of aggregation functions (e.g., invertible, associative), window types (e.g., sliding, sessions), windowing measures (e.g., time- or count-based), and stream (dis)order. This allows Scotty to outperform systems like Flink by up to one order of magnitude.
The structure of this talk is threefold:
First, we give an introduction to the semantics and implementations of window aggregations in modern Stream Processing Systems.
Second, we discuss the design of Scotty and show why Scotty is able to outperform the default window operators of many stream processing systems.
Third, we give a hands-on introduction to Scotty and demonstrate how it can be integrated into standard Flink, Storm, or Beam stream processing pipelines.
Scotty and its connectors are available as open-source (https://github.com/TU-Berlin-DIMA/scotty-window-processor) and contributions are highly welcome.
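To make the role of aggregation function properties concrete, here is a minimal Python sketch (an illustration only, not Scotty's implementation; the function names `sliding_sum` and `sliding_max` are hypothetical). It contrasts an invertible function (sum), where an expired value can simply be subtracted, with a non-invertible one (max), which needs extra bookkeeping over a count-based sliding window.

```python
from collections import deque


def sliding_sum(values, window):
    """Count-based sliding sum. Sum is invertible, so evicting the
    oldest value is a single subtraction."""
    buf, total, out = deque(), 0, []
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # apply the inverse of "+"
        if len(buf) == window:
            out.append(total)
    return out


def sliding_max(values, window):
    """Count-based sliding max. Max has no inverse, so this sketch keeps
    a deque of candidate indices whose values are in decreasing order."""
    candidates, out = deque(), []
    for i, v in enumerate(values):
        # Values that are not larger than the new one can never be the max again.
        while candidates and values[candidates[-1]] <= v:
            candidates.pop()
        candidates.append(i)
        # Drop the front candidate once it has left the window.
        if candidates[0] <= i - window:
            candidates.popleft()
        if i >= window - 1:
            out.append(values[candidates[0]])
    return out
```

For example, `sliding_sum([1, 2, 3, 4, 5], 3)` yields one sum per full window of three values, and `sliding_max` does the same for the maximum; `values` is assumed to be an indexable sequence such as a list.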
FlinkForward Berlin 2019 - Scotty: Efficient Window Aggregation with General ... - Jonas Traub
Window aggregation is a core operation in data stream processing. Existing aggregation techniques focus on reducing latency, eliminating redundant computations, and minimizing memory usage.
However, each technique operates under different assumptions with respect to workload characteristics such as properties of aggregation functions (e.g., invertible, associative), window types (e.g., sliding, sessions), windowing measures (e.g., time- or count-based), and stream (dis)order. Violating the assumptions of a technique can render it unusable or drastically reduce its performance.
In this talk, we present Scotty, an implementation of a general stream slicing technique for window aggregation. This technique automatically adapts to workload characteristics to improve performance without sacrificing its general applicability. Our experiments show that Scotty outperforms alternative implementations, like the default window operator in Flink, by up to one order of magnitude.
Furthermore, we present how to use Scotty as a library in Flink, Storm, or Beam without changing the underlying Stream Processing System.
General stream slicing was first published at EDBT 2019 (http://www.user.tu-berlin.de/powibol/assets/publications/traub-efficient-window-aggregation-with-general-stream-slicing-edbt-2019.pdf) where it received the Best Paper Award.
The Scotty library and its connectors are available as open-source (https://github.com/TU-Berlin-DIMA/scotty-window-processor) and contributions are highly welcome.
Analyzing Efficient Stream Processing on Modern Hardware (VLDB 2019 Presentat... - Jonas Traub
This talk by Steffen Zeuch presents our VLDB 2019 paper about "Analyzing Efficient Stream Processing on Modern Hardware".
This paper shows the potential of hardware-tailored code compilation and data ingestion at memory speed for a scale-up stream processing engine (SPE). We analyze state-of-the-art streaming systems and identify sources of inefficiency. We investigate the data-related and processing-related design space and derive design changes for streaming systems to exploit modern hardware more efficiently. In order to efficiently scale up, one should avoid managed runtimes, use a compilation-based approach to produce hardware-tailored code, avoid queues and use operator fusion, and use late merge instead of partitioning.
The full paper with all our findings is available online:
http://www.vldb.org/pvldb/vol12/p516-zeuch.pdf
Database Research at TU Berlin DIMA and DFKI IAM - USA Excursion Slides 2019 - Jonas Traub
In April 2019, we went on a USA excursion and presented selected publications of the TU Berlin DIMA and the DFKI IAM research groups. This slide set contains the four teaser talks which we presented on the tour:
1) Jonas Traub: Optimized On-Demand Data Streaming from Sensor Nodes
2) Sebastian Breß: Generating Custom Code for Efficient Query Execution on Heterogeneous Processors
3) Martin Kiefer: Estimating Join Selectivities using Bandwidth Optimized Kernel Density Models
4) Andreas Kunft: BlockJoin: Efficient Matrix Partitioning through Joins
Resense: Transparent Record and Replay of Sensor Data in the Internet of Thin... - Jonas Traub
Resense: Transparent Record and Replay of Sensor Data in the Internet of Things
Resense was presented as a demonstration at the 22nd International Conference on Extending Database Technology (EDBT) in Lisbon, Portugal, and received the Best Demonstration Award (http://edbticdt2019.inesc-id.pt/?awards_demo_edbt).
AUTHORS:
Dimitrios Giouroukis, Julius Hülsmann, Janis von Bleichert, Morgan Geldenhuys, Tim Stullich, Felipe Oliveira Gutierrez, Jonas Traub, Kaustubh Beedkar, Volker Markl
ABSTRACT
As the scientific interest in the Internet of Things (IoT) continues to grow, emulating IoT infrastructure involving a large number of heterogeneous sensors plays a crucial role. Existing research on emulating sensors is often tailored to specific hardware and/or software, which makes it difficult to reproduce and extend. In this paper we show how to emulate different kinds of sensors in a unified way that makes the downstream application agnostic as to whether the sensor data is acquired from real sensors or is read from memory using emulated sensors. We propose the Resense framework, which allows for replaying sensor data using emulated sensors and provides easy-to-use software for setting up and executing IoT experiments involving a large number of heterogeneous sensors. We demonstrate various aspects of Resense in the context of a sports analytics application using real-world sensor data and a set of Raspberry Pis.
Flink Forward 2018: Efficient Window Aggregation with Stream Slicing - Jonas Traub
Computing aggregates over windows is at the core of virtually every stream processing job. Typical stream processing applications involve overlapping windows and, therefore, cause redundant computations. Several techniques prevent this redundancy by sharing partial aggregates among windows. However, these techniques do not support out-of-order processing and session windows. Out-of-order processing is a key requirement to deal with delayed tuples in case of source failures such as temporary sensor outages. Session windows are widely used to separate different periods of user activity from each other. Current versions of Apache Flink use Window Buckets to process stream aggregations with session windows and out-of-order tuples. This approach does not share partial aggregates among overlapping windows. In our talk, we present Scotty, a high-throughput operator for window discretization and aggregation in Apache Flink. Scotty splits streams into non-overlapping slices and computes partial aggregates per slice. These partial aggregates are shared among all overlapping windows including session windows. Scotty introduces the first slicing technique which (1) enables stream slicing for session windows in addition to tumbling and sliding windows and (2) processes out-of-order tuples efficiently. Scotty was first published at ICDE 2018 (http://www.user.tu-berlin.de/powibol/assets/publications/traub-scotty-icde-2018.pdf).
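The slicing idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Scotty's actual operator: the function name `sliced_sliding_sums` is hypothetical, the aggregate is a plain sum, timestamps are non-negative integers, only time-based sliding windows are covered, and the window size is assumed to be a multiple of the slide. Each tuple is aggregated exactly once into its slice, and every overlapping window is then assembled from the shared slice aggregates.

```python
from collections import defaultdict


def sliced_sliding_sums(events, size, slide):
    """Sum per sliding window, built from shared per-slice partial sums.

    events: iterable of (timestamp, value) pairs in any arrival order;
            an out-of-order tuple simply updates the slice it belongs to.
    size, slide: window length and slide step (same unit as timestamps).
    Returns a dict mapping (window_start, window_end) to the window sum.
    """
    if size % slide != 0:
        raise ValueError("this sketch assumes size is a multiple of slide")

    # Step 1: slice the stream. Slice k covers [k * slide, (k + 1) * slide).
    partial = defaultdict(int)
    for ts, value in events:
        partial[ts // slide] += value
    if not partial:
        return {}

    # Step 2: assemble every window from the slices it overlaps.
    slices_per_window = size // slide
    first, last = min(partial), max(partial)
    results = {}
    for start_slice in range(max(0, first - slices_per_window + 1), last + 1):
        covered = range(start_slice, start_slice + slices_per_window)
        start = start_slice * slide
        results[(start, start + size)] = sum(partial.get(k, 0) for k in covered)
    return results
```

With `size=4` and `slide=2`, each tuple is added to one slice, while each slice is reused by two windows; without slicing, every tuple would be aggregated once per window that contains it.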
Scotty: Efficient Window Aggregation for Out-of-Order Stream Processing - Jonas Traub
This poster was presented at ICDE 2018.
Abstract: Computing aggregates over windows is at the core of virtually every stream processing job. Typical stream processing applications involve overlapping windows and, therefore, cause redundant computations. Several techniques prevent this redundancy by sharing partial aggregates among windows. However, these techniques do not support out-of-order processing and session windows. Out-of-order processing is a key requirement to deal with delayed tuples in case of source failures such as temporary sensor outages. Session windows are widely used to separate different periods of user activity from each other.
In this paper, we present Scotty, a high throughput operator for window discretization and aggregation. Scotty splits streams into non-overlapping slices and computes partial aggregates per slice. These partial aggregates are shared among all concurrent queries with arbitrary combinations of tumbling, sliding, and session windows. Scotty introduces the first slicing technique which (1) enables stream slicing for session windows in addition to tumbling and sliding windows and (2) processes out-of-order tuples efficiently. Our technique is generally applicable to a broad group of dataflow systems which use a unified batch and stream processing model. Our experiments show that we achieve a throughput an order of magnitude higher than alternative state-of-the-art solutions.
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W... - Jonas Traub
Paper: Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive Windowing
Abstract: Machine learning techniques for data stream analysis suffer from concept drifts such as changed user preferences, varying weather conditions, or economic changes. These concept drifts cause wrong predictions and lead to incorrect business decisions. Concept drift detection methods such as adaptive windowing (Adwin) allow for adapting to concept drifts on the fly.
In this paper, we examine Adwin in detail and point out its throughput bottlenecks. We then introduce several parallelization alternatives to address these bottlenecks. Our optimizations lead to a speedup of two orders of magnitude over the original Adwin implementation. Thus, we explore parallel adaptive windowing to provide scalable concept drift detection for high-velocity data streams with millions of tuples per second.
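For readers unfamiliar with adaptive windowing, the following Python sketch shows the basic idea in a deliberately naive, sequential form. It is neither the original Adwin implementation nor the parallel variants from the paper: the class name `NaiveAdwin` is hypothetical, inputs are assumed to lie in [0, 1], every split point is checked on every update (the original uses a compressed bucket structure instead), and one element is dropped at a time. The cut threshold follows the Hoeffding-style bound commonly given for Adwin, with confidence parameter `delta`.

```python
import math
from collections import deque


class NaiveAdwin:
    """Naive adaptive window: shrink while two sub-windows differ too much."""

    def __init__(self, delta=0.002):
        self.delta = delta     # confidence parameter (assumed default)
        self.window = deque()  # most recent values, oldest first

    def update(self, x):
        """Add x (assumed to be in [0, 1]); return True if drift was detected."""
        self.window.append(x)
        drift = False
        while self._shrink_once():
            drift = True
        return drift

    def _shrink_once(self):
        n = len(self.window)
        if n < 2:
            return False
        items = list(self.window)
        total = sum(items)
        delta_prime = self.delta / n
        head_sum = 0.0
        # Try every split into an older part (n0 items) and a newer part (n1 items).
        for n0 in range(1, n):
            head_sum += items[n0 - 1]
            n1 = n - n0
            mean0 = head_sum / n0
            mean1 = (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)  # harmonic-mean style sample size
            eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(mean0 - mean1) >= eps_cut:
                self.window.popleft()  # drop the oldest element and re-check
                return True
        return False
```

The per-update scan over all split points is exactly the kind of cost that makes a sequential implementation a throughput bottleneck on fast streams.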
Efficient SIMD Vectorization for Hashing in OpenCL - Jonas Traub
This poster was presented at the 21st International Conference on Extending Database Technology (EDBT), March 26-29, 2018.
Paper: Efficient SIMD Vectorization for Hashing in OpenCL
Abstract: Hashing is at the core of many efficient database operators such as hash-based joins and aggregations. Vectorization is a technique that uses Single Instruction Multiple Data (SIMD) instructions to process multiple data elements at once. Applying vectorization to hash tables results in promising speedups for build and probe operations. However, vectorization typically requires intrinsics, low-level APIs in which functions map to processor-specific SIMD instructions. Intrinsics are specific to a processor architecture and result in complex and difficult-to-maintain code. OpenCL is a parallel programming framework which provides a higher abstraction level than intrinsics and is portable to different processors. Thus, OpenCL avoids processor dependencies, which results in improved code maintainability. In this paper, we add efficient, vectorized hashing primitives to OpenCL. Our results show that OpenCL-based vectorization is competitive with intrinsics on CPUs but not on Xeon Phi coprocessors.
UZH Stream Reasoning Workshop 2018: Optimized On-Demand Data Streaming from S... - Jonas Traub
About the Workshop:
The Stream Reasoning Workshop took place from January 16th to 17th, 2018.
Processing, querying and reasoning over streaming data is studied in different communities such as KR&R, Semantic Web, Databases, Stream Processing, Complex Event Processing, etc., where researchers have different perspectives and face different challenges.
This workshop aims at advancing Stream Reasoning as research theme by bringing together these different views and goals. In addition to invited talks, the workshop will provide opportunities for all participants to engage in discussions on open problems and future directions.
(http://www.ifi.uzh.ch/en/ddis/events/streamreasoning2018.html)
About the Talk:
Real-time sensor data enables diverse applications such as smart metering, traffic monitoring, and sport analysis. In the Internet of Things, billions of sensor nodes form a sensor cloud and offer data streams to analysis systems. However, it is impossible to transfer all available data with maximal frequencies to all applications. Therefore, we need to tailor data streams to the demand of applications. We contribute a technique that optimizes communication costs while maintaining the desired accuracy. Our technique schedules reads across a huge number of sensors based on the data demands of a huge number of concurrent queries. We introduce user-defined sampling functions that define the data demand of queries and facilitate various adaptive sampling techniques, which decrease the amount of transferred data. Moreover, we share sensor reads and data transfers among queries. Our experiments with real-world data show that our approach saves up to 87% in data transmissions.
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...Jonas Traub
This slide set was presented at UCSB on Sep. 30, 2017.
The talk covers an extended version of the slides from SoCC 2017 plus a quick overview of Apache Flink.
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ... by Jonas Traub
We present I², an interactive development environment for real-time analysis pipelines, which is based on Apache Flink and Apache Zeppelin. The sheer amount of available streaming data frequently makes it impossible to visualize all data points at the same time. I² coordinates running Flink jobs and corresponding visualizations such that only the currently depicted data points are processed in Flink and transferred towards the front end. We show how Flink jobs can adapt to changed visualization properties at runtime to allow interactive data exploration on high-bandwidth data streams. Moreover, we present a data reduction technique which minimizes data transfer while providing loss-free time-series plots. We show I² in a live demonstration in which we replay recorded sensor data from a football match (ca. 12k events/s). I² was first presented at EDBT'17, where it received the best demonstration award. The demonstration is available as open source at github.com/TU-Berlin-DIMA/i2.
I²: Interactive Real-Time Visualization for Streaming Data by Jonas Traub
This is our poster for the demonstration "I²: Interactive Real-Time Visualization for Streaming Data", which received the best demonstration award at EDBT 2017.
The paper and the source code are available on GitHub: https://github.com/TU-Berlin-DIMA/i2
LWA 2015: The Apache Flink Platform (Poster) by Jonas Traub
This is our poster for the German paper "Die Apache Flink Plattform zur parallelen Analyse von Datenströmen und Stapeldaten" which was published in Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. Trier, Germany, 7.-9. October 2015. Link: http://ceur-ws.org/Vol-1458/H02_CRC79_Traub.pdf
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis by Jonas Traub
This is our presentation for the German paper "Die Apache Flink Plattform zur parallelen Analyse von Datenströmen und Stapeldaten" which was published in Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. Trier, Germany, 7.-9. October 2015. Link: http://ceur-ws.org/Vol-1458/H02_CRC79_Traub.pdf
Efficient Window Aggregation with General Stream Slicing (EDBT 2019, Best Paper)
1. Efficient Window Aggregation with General Stream Slicing
Jonas Traub, Philipp M. Grulich, Alejandro Rodríguez Cuéllar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl
22nd International Conference on Extending Database Technology
March 26-29, 2019, Lisbon, Portugal
2. Stream Processing Pipelines
27.03.2019 Efficient Window Aggregation with General Stream Slicing 2
A stream processing pipeline is a series of concurrently running operators.
5. Stream Processing Pipelines
[Figure: pipeline diagram; labeled operator: Window Aggregation]
15. Stream Slicing Example
We store partial aggregates instead of all tuples: small memory footprint.
17. Stream Slicing Example
We assign each tuple to exactly one slice: O(1) per-tuple complexity.
19. Stream Slicing Example
We require just a few computation steps to calculate final aggregates: low latency.
21. Stream Slicing Example
We share partial aggregations among all users and queries: efficiency by preventing redundancy.
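The four properties claimed on the stream slicing example slides can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `SlicedSum`, the fixed slice length `SLICE_LEN`, the use of sum as the aggregation function, and the sample tuples are all assumptions made for illustration.

```python
from collections import defaultdict

SLICE_LEN = 10  # hypothetical slice length in time units


class SlicedSum:
    """Keeps one partial sum per slice instead of the tuples themselves."""

    def __init__(self, slice_len=SLICE_LEN):
        self.slice_len = slice_len
        self.partials = defaultdict(int)  # slice index -> partial sum

    def insert(self, timestamp, value):
        # Each tuple is assigned to exactly one slice: O(1) work per tuple.
        self.partials[timestamp // self.slice_len] += value

    def window_sum(self, start, end):
        # Final aggregate for the window [start, end): combine slice partials.
        # Assumes start and end are aligned to slice boundaries.
        first, last = start // self.slice_len, end // self.slice_len
        return sum(self.partials.get(i, 0) for i in range(first, last))


store = SlicedSum()
for ts, v in [(1, 4), (7, 2), (12, 5), (25, 1), (33, 3)]:
    store.insert(ts, v)

# Two overlapping queries are answered from the same shared partials.
q1 = store.window_sum(0, 20)   # combines slices 0 and 1
q2 = store.window_sum(10, 40)  # combines slices 1, 2, and 3
print(q1, q2)
```

Five tuples are reduced to four partial aggregates here, and the slice covering time 10 to 20 is computed once but used by both queries.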
31. General Slicing Core
The General Slicing Core adapts to workload characteristics and provides extension points for user-defined window types and aggregation functions.
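One way to picture an extension point for user-defined aggregation functions is a small record of callbacks. The sketch below is an assumption made for illustration, not the interface of the General Slicing Core: the names `AggregateFunction`, `lift`, `combine`, `lower`, and `invert` do not appear on the slides.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class AggregateFunction:
    """Hypothetical extension point for a user-defined aggregation."""
    lift: Callable[[Any], Any]          # tuple value -> partial aggregate
    combine: Callable[[Any, Any], Any]  # merge two partial aggregates
    lower: Callable[[Any], Any]         # partial aggregate -> final result
    invert: Optional[Callable[[Any, Any], Any]] = None  # None: not invertible


SUM = AggregateFunction(
    lift=lambda v: v,
    combine=lambda a, b: a + b,
    lower=lambda p: p,
    invert=lambda a, b: a - b,
)
MAX = AggregateFunction(lift=lambda v: v, combine=max, lower=lambda p: p)
AVG = AggregateFunction(
    lift=lambda v: (v, 1),  # partial aggregate is a (sum, count) pair
    combine=lambda a, b: (a[0] + b[0], a[1] + b[1]),
    lower=lambda p: p[0] / p[1],
    invert=lambda a, b: (a[0] - b[0], a[1] - b[1]),
)


def aggregate(fn, values):
    """Fold a non-empty sequence of values through the given function."""
    partial = None
    for v in values:
        lifted = fn.lift(v)
        partial = lifted if partial is None else fn.combine(partial, lifted)
    return fn.lower(partial)


print(aggregate(SUM, [1, 2, 3]), aggregate(MAX, [1, 5, 2]), aggregate(AVG, [1, 2, 3, 6]))
```

Whether `invert` is present is the kind of property a slicing core could inspect when deciding how to handle removals; `MAX` leaves it unset because a maximum cannot be undone from the partial aggregate alone.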
42. General Stream Slicing Internals
Part 1: Three Fundamental Operations on Slices
Merge Slices, Split Slices, Update Slices
Part 2: Adapt to Workload Characteristics:
Do we need to store original tuples?
Do we potentially need to split slices?
Do we potentially need to remove tuples from slices?
General Stream Slicing adapts to current workload characteristics.
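The three slice operations can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual internals: the `Slice` class, the `keep_tuples` flag, and the choice of sum as the aggregation are hypothetical. The flag mirrors the first question on the slide: a slice that holds only a partial aggregate cannot be split, because the partial does not say which tuples fall on which side.

```python
class Slice:
    """A slice covering [start, end) with a running sum (illustrative only)."""

    def __init__(self, start, end, keep_tuples):
        self.start, self.end = start, end
        self.partial = 0
        # Original tuples are kept only if the workload may require them.
        self.tuples = [] if keep_tuples else None

    def update(self, ts, value):
        # Update: fold one tuple into the partial aggregate.
        self.partial += value
        if self.tuples is not None:
            self.tuples.append((ts, value))

    def merge(self, other):
        # Merge: absorb the slice that directly follows this one.
        self.end = other.end
        self.partial += other.partial
        if self.tuples is None or other.tuples is None:
            self.tuples = None
        else:
            self.tuples = self.tuples + other.tuples

    def split(self, ts):
        # Split: cut at ts; both partials are recomputed from stored tuples.
        if self.tuples is None:
            raise ValueError("split requires stored tuples")
        right = Slice(ts, self.end, keep_tuples=True)
        for t, v in self.tuples:
            if t >= ts:
                right.update(t, v)
        self.tuples = [(t, v) for t, v in self.tuples if t < ts]
        self.partial = sum(v for _, v in self.tuples)
        self.end = ts
        return right


s = Slice(0, 10, keep_tuples=True)
for ts, v in [(1, 4), (3, 2), (6, 5), (8, 1)]:
    s.update(ts, v)

right = s.split(5)
left_sum, right_sum = s.partial, right.partial
s.merge(right)
print(left_sum, right_sum, s.partial)
```

Dropping the tuple list when it is not needed keeps the memory footprint at one partial aggregate per slice; keeping it buys the ability to split.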
58. Impact of Workload Characteristics (Example)
Count-based tumbling window with a length of 5 tuples.
Tuple Value: 1 2 1 4 3 1 5 2 2 3 6 1 2 2 1
Tuple Count: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Event Time: 5 12 13 20 35 37 42 46 48 51 52 57 63 64 65
Window aggregates (sum): 11 13 12
What if the stream is out-of-order?
Out-of-order tuple: value 5, event time 49.
The tuple falls between event times 48 and 51, so it belongs to the second window. The last tuple of each affected window shifts into the next window:
Second window: 13 + 5 - 3
Third window: 12 + 3 - 1
What if the aggregation function is not invertible?
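The example can be replayed in Python using the values and event times from the slide. The sketch below is an illustration under assumptions, not the authors' code: the function names are hypothetical, and the repair strategy shown (add the late tuple, subtract the tuple pushed out of each window) is one way to exploit an invertible function such as sum. For a non-invertible function such as max, the sketch falls back to recomputing from stored tuples.

```python
import bisect

WINDOW = 5  # count-based tumbling window length
values = [1, 2, 1, 4, 3, 1, 5, 2, 2, 3, 6, 1, 2, 2, 1]
times = [5, 12, 13, 20, 35, 37, 42, 46, 48, 51, 52, 57, 63, 64, 65]


def window_aggs(vals, fn=sum):
    """Aggregate each complete window of WINDOW tuples from scratch."""
    full = len(vals) - len(vals) % WINDOW
    return [fn(vals[i:i + WINDOW]) for i in range(0, full, WINDOW)]


def insert_with_shift(sums, vals, ts_list, ts, value):
    """Repair window sums after a late tuple, using addition and subtraction.

    Assumes len(vals) is a multiple of WINDOW and ts_list is sorted.
    """
    pos = bisect.bisect_left(ts_list, ts)  # position by event time
    sums = list(sums)
    carry = value  # tuple entering the current window
    for w in range(pos // WINDOW, len(sums)):
        last = vals[(w + 1) * WINDOW - 1]  # tuple pushed out of window w
        sums[w] += carry - last
        carry = last  # it moves into the next window
    return sums, vals[:pos] + [value] + vals[pos:]


before = window_aggs(values)  # [11, 13, 12]
repaired, new_values = insert_with_shift(before, values, times, ts=49, value=5)
print(repaired)  # [11, 15, 14]

# A non-invertible function (max) cannot subtract the shifted tuple,
# so the affected windows are recomputed from the stored tuples.
recomputed_max = window_aggs(new_values, fn=max)
```

The repaired sums match a full recomputation over the reordered tuples, but the repair touches only two values per affected window, while the recomputation needs every original tuple to be kept.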
60. In-order Processing with Context-Free Windows
Slicing techniques scale to large numbers of concurrent windows.
62. Impact of Stream Order
Slicing techniques are robust against out-of-order tuples.
64. Impact of Aggregation Functions (20% out-of-order)
Stream Slicing performs well on many different kinds of aggregation functions.
69. Efficient Window Aggregation with General Stream Slicing
• We identify workload characteristics which impact applicability and performance of window aggregation techniques.
• We present a generally applicable and highly efficient solution for streaming window aggregation.
• We show that general stream slicing is generally applicable and offers better performance than alternative approaches.
Open Source Repository: tu-berlin-dima.github.io/scotty-window-processor