This document provides a tutorial on the role of event-time order in data streaming analysis. The agenda covers motivations and examples of data streaming and stream processing engines, causes of out-of-order data and solutions to enforce total ordering, pros and cons of total ordering, and relaxation of total ordering using watermarks. Enforcing total ordering through techniques like sorting tuples is computationally expensive but provides benefits like determinism and synchronization. However, it may be overkill for some applications and can increase latency.
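As an illustration of the trade-off described above, here is a minimal sketch of sorting tuples by event time with a watermark. It is not taken from the slides: the function name `reorder_with_watermark`, the `(timestamp, payload)` tuple layout and the `max_delay` bound on lateness are assumptions made for this example. A larger `max_delay` tolerates more disorder but holds tuples longer, which is the latency cost mentioned above.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def reorder_with_watermark(
    stream: Iterable[Tuple[int, str]], max_delay: int
) -> Iterator[Tuple[int, str]]:
    """Yield (timestamp, payload) tuples in non-decreasing timestamp order.

    Assumption: a tuple arrives at most `max_delay` time units after any
    tuple with a larger timestamp. Tuples that violate this are dropped.
    """
    buffer: List[Tuple[int, str]] = []  # min-heap ordered by timestamp
    watermark = float("-inf")           # no tuple older than this is expected
    for ts, payload in stream:
        if ts < watermark:
            continue  # too late: output has already moved past this timestamp
        heapq.heappush(buffer, (ts, payload))
        # The watermark trails the largest timestamp seen by max_delay.
        watermark = max(watermark, ts - max_delay)
        # Everything at or below the watermark can be emitted safely.
        while buffer and buffer[0][0] <= watermark:
            yield heapq.heappop(buffer)
    # End of stream: flush whatever is still buffered, in order.
    while buffer:
        yield heapq.heappop(buffer)


if __name__ == "__main__":
    disordered = [(1, "a"), (3, "c"), (2, "b"), (6, "f"), (4, "d"), (5, "e")]
    print(list(reorder_with_watermark(disordered, max_delay=2)))
```

Real engines usually propagate watermarks as special stream elements rather than deriving them inside a single operator; the sketch folds both roles into one function for brevity.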
The data streaming processing paradigm and its use in modern fog architectures - Vincenzo Gulisano
Invited lecture at the University of Trieste.
The lecture briefly covers the data streaming processing paradigm, research challenges related to distributed, parallel and deterministic streaming analysis, and the research of the DCS (Distributed Computing and Systems) group at Chalmers University of Technology.
Crash course on data streaming (with examples using Apache Flink) - Vincenzo Gulisano
These are the slides I used for a crash course (4 hours) on data streaming. They contain both theory and research aspects as well as examples based on Apache Flink (DataStream API).
Slides for my Associate Professor (oavlönad docent) lecture.
The lecture is about Data Streaming (its evolution and basic concepts) and also contains an overview of my research.
The benefits of fine-grained synchronization in deterministic and efficient ... - Vincenzo Gulisano
This talk, given by Vincenzo Gulisano and Yiannis Nikolakopoulos at Yahoo!, discusses some of their latest research results in the field of deterministic and efficient parallelization of data streaming operators. It also presents ScaleGate, the abstract data type at the core of their research, whose Java-based lock-free implementation is available at https://github.com/dcs-chalmers/ScaleGate_Java
ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join - Vincenzo Gulisano
This is the presentation of the paper "ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join", presented by Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou and Philippas Tsigas at the IEEE Big Data conference held in Santa Clara, 2015.
This document discusses techniques for mining data streams. It begins by defining different types of streaming data like time-series data and sequence data. It then discusses the characteristics of data streams like their huge volume, fast changing nature, and requirement for real-time processing. The key challenges in stream query processing are the unbounded memory requirements and need for approximate query answering. The document outlines several synopsis data structures and techniques used for mining data streams, including random sampling, histograms, sketches, and randomized algorithms. It also discusses architectures for stream query processing and classification of dynamic data streams.
Presentation for the Softskills Seminar course @ Telecom ParisTech. The topic is the paper by Domingos and Hulten, "Mining high speed data streams". Presented by me on 30/11/2017.
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink - Flink Forward
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries and an execution framework.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection and recommendations.
3) The Vertical Hoeffding Tree algorithm in SAMOA provides high parallelism and accuracy for streaming decision tree learning, outperforming native Apache Flink implementations on certain datasets while being faster on others.
Introduction to Data streaming - 05/12/2014 - Raja Chiky
Raja Chiky is an associate professor whose research interests include data stream mining, distributed architectures, and recommender systems. The document outlines data streaming concepts including what a data stream is, data stream management systems, and basic approximate algorithms used for processing massive, high-velocity data streams. It also discusses challenges in distributed systems and using semantic technologies for data streaming.
This dissertation defense presentation covered Peng Du's work on making one-sided dense linear algebra algorithms resilient to hard and soft errors. The presentation included:
1) Motivation for the work due to increasing failure rates in large HPC systems.
2) An overview of the dissertation goals to make LU, QR and solvers resilient to hard and soft errors.
3) Key contributions including efficient protection of the left factor from hard errors, recovery from hard errors using checkpointing, and detection and recovery from multiple soft errors.
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. In this lecture we give an overview of the mining of data streams.
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr... - Anis Nasir
This paper proposes a technique called Partial Key Grouping (PKG) to balance load in distributed stream processing engines. PKG splits keys between workers and assigns instances using the "power of two choices" algorithm. This achieves load balancing while maintaining low memory and aggregation overhead of O(1), unlike shuffle grouping which has O(W) overhead. Experiments on real datasets with Apache Storm show PKG improves throughput by 60% and reduces latency by 45% compared to key grouping and shuffle grouping.
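The routing idea behind Partial Key Grouping can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the class name `PartialKeyGrouper`, the SHA-256-based hash helper and the use of a locally tracked message count as the load estimate are all choices made for this example.

```python
import hashlib
from typing import List, Tuple


def _hash(key: str, seed: int, num_workers: int) -> int:
    """Deterministic hash of `key` into [0, num_workers), varied by `seed`."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


class PartialKeyGrouper:
    """Route each key to the less loaded of its two candidate workers."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers
        self.loads: List[int] = [0] * num_workers  # local load estimate

    def candidates(self, key: str) -> Tuple[int, int]:
        # Two independent hash functions give two candidate workers per key.
        return _hash(key, 1, self.num_workers), _hash(key, 2, self.num_workers)

    def route(self, key: str) -> int:
        a, b = self.candidates(key)
        worker = a if self.loads[a] <= self.loads[b] else b
        self.loads[worker] += 1
        return worker


if __name__ == "__main__":
    grouper = PartialKeyGrouper(num_workers=8)
    for _ in range(1000):
        grouper.route("hot-key")
    print(grouper.loads)
```

Because a key can only ever land on its two candidates, a downstream aggregation has to merge at most two partial results per key, which is where the constant overhead mentioned above comes from.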
When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Pro... - Anis Nasir
This document proposes two algorithms, D-Choices and W-Choices, to improve load balancing in distributed stream processing systems. The algorithms identify "heavy hitters" or frequent keys in the data stream and process them using more than two workers to better balance load. Evaluation shows the algorithms provide up to 150% higher throughput and 60% lower latency compared to traditional partitioning approaches.
Scalable Distributed Real-Time Clustering for Big Data Streams - Antonio Severien
Analyzing and applying machine learning algorithms to a possibly infinite flow of data is a challenging task. This presentation introduces the SAMOA framework, which allows the development of machine learning algorithms on top of any distributed stream processing engine. It also demonstrates the development and use of a distributed clustering algorithm based on CluStream using the Apache S4 platform.
This document describes research on implementing Curran's approximation algorithm for pricing Asian options using a dataflow architecture. The algorithm was implemented on a Maxeler dataflow engine (DFE) and compared to a CPU implementation. Different fixed-point precisions were tested on the DFE and 54-bit fixed-point provided the best balance of precision and resource usage. Implementing the algorithm across multiple DFEs provided speedups of 5-12x over a 48-core CPU. Further optimization of dynamic ranges allowed increasing the unrolling factor, improving performance and energy efficiency.
Mining big data streams with APACHE SAMOA by Albert Bifet - J On The Beach
In this talk, we present Apache SAMOA, an open-source platform for mining big data streams with Apache Flink, Storm and Samza. Real-time analytics is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. Apache SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering. It provides a pluggable architecture that allows it to run on Apache Flink, but also on several other distributed stream processing engines such as Storm and Samza.
Scalable Distributed Real-Time Clustering for Big Data Streams - Antonio Severien
This thesis presents a scalable distributed clustering algorithm for streaming big data. The author implemented a real-time distributed clustering algorithm and a classification algorithm using the Scalable Advanced Massive Online Analysis (SAMOA) framework. SAMOA is a platform-independent framework for distributed machine learning on data streams. It provides interfaces for algorithms to be run on distributed stream processing engines like Apache S4 and Twitter Storm. The author's algorithms were tested on these platforms using the SAMOA framework.
Mining Big Data Streams with APACHE SAMOA - Albert Bifet
In this talk, we present Apache SAMOA, an open-source platform for mining big data streams with Apache Flink, Storm and Samza. Real-time analytics is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. Apache SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering. It provides a pluggable architecture that allows it to run on Apache Flink, but also on several other distributed stream processing engines such as Storm and Samza.
Stream data mining & CluStream framework - Yueshen Xu
This document summarizes a framework for clustering evolving data streams. The framework uses micro-clusters to represent subsets of data points and clusters micro-clusters into macro-clusters over different time horizons. Micro-clusters are represented by cluster feature vectors (CFVs) and are updated when new data points arrive by joining, deleting, or merging micro-clusters. Macro-clusters are formed by applying a modified k-means algorithm that clusters micro-clusters represented by their CFVs over different time periods.
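To make the micro-cluster idea concrete, the following is a minimal sketch of an additive cluster feature vector. The field names (`n`, `ls`, `ss`, `lt`, `st`) and the class name `MicroCluster` are assumptions for this example, and the sketch covers only insertion, merging, centroid and radius, not the full framework (no time horizons, deletion policy or macro-clustering).

```python
import math
from typing import List, Sequence


class MicroCluster:
    """Additive summary of a set of timestamped points."""

    def __init__(self, dim: int) -> None:
        self.n = 0                  # number of points absorbed
        self.ls = [0.0] * dim       # per-dimension linear sum
        self.ss = [0.0] * dim       # per-dimension sum of squares
        self.lt = 0.0               # sum of timestamps
        self.st = 0.0               # sum of squared timestamps

    def insert(self, point: Sequence[float], t: float) -> None:
        """Absorb one point observed at time t."""
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x
        self.lt += t
        self.st += t * t

    def merge(self, other: "MicroCluster") -> None:
        """Merging is component-wise addition of the two summaries."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]
        self.lt += other.lt
        self.st += other.st

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def radius(self) -> float:
        """Root of the summed per-dimension variance around the centroid."""
        var = sum(
            ss / self.n - (ls / self.n) ** 2
            for ls, ss in zip(self.ls, self.ss)
        )
        return math.sqrt(max(var, 0.0))


if __name__ == "__main__":
    a, b = MicroCluster(2), MicroCluster(2)
    a.insert((0.0, 0.0), t=1.0)
    b.insert((2.0, 2.0), t=2.0)
    a.merge(b)
    print(a.centroid(), a.radius())
```

Because every field is a plain sum, a summary can be updated per point in constant time and two summaries can be merged without revisiting the original data.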
Self-adaptive container monitoring with performance-aware Load-Shedding policies, by Rolando Brondolin, PhD student in System Architecture at Politecnico di Milano
Big Graph Analytics Systems (Sigmod16 Tutorial) - Yuanyuan Tian
In recent years we have witnessed a surging interest in developing Big Graph processing systems. To date, tens of Big Graph systems have been proposed. This tutorial provides a timely and comprehensive review of existing Big Graph systems, and summarizes their pros and cons from various perspectives. We start from the existing vertex-centric systems, in which a programmer thinks intuitively like a vertex when developing parallel graph algorithms. We then introduce systems that adopt other computation paradigms and execution settings. The topics covered in this tutorial include programming models and algorithm design, computation models, communication mechanisms, out-of-core support, fault tolerance, dynamic graph support, and so on. We also highlight future research opportunities on Big Graph analytics.
Artificial intelligence and data stream mining - Albert Bifet
Big Data and Artificial Intelligence have the potential to fundamentally shift the way we interact with our surroundings. The challenge of deriving insights from data streams has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of artificial intelligence research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. In this talk, I will present an overview of data stream mining, industrial applications, open source tools, and current challenges of data stream mining.
[EUC2016] FFWD: latency-aware event stream processing via domain-specific loa... - Matteo Ferroni
Tools and applications for event stream processing and real-time analytics are attracting a lot of attention these days across a wide range of application scenarios, from the smallest Internet of Things (IoT) embedded sensor to the most popular social network feed. Unfortunately, dealing with this kind of input raises issues that can easily undermine the real-time analysis requirement when the system is unexpectedly overloaded; this happens because the processing time may strongly depend on the content of each event, while the event arrival rate may vary unpredictably over time. In this work, we propose Fast Forward With Degradation (FFWD), a latency-aware load shedding framework that exploits performance degradation techniques to adapt the throughput of the application to the size of the input, allowing the system to have a fast and reliable response time in case of overloading. Moreover, we show how different domain-specific policies can guarantee a reasonable accuracy of the aggregated output metrics.
Full paper: http://ieeexplore.ieee.org/document/7982234/
The document discusses Linux system capacity planning. It covers performance monitoring tools like Sysstat and Ganglia that can be used to collect time series performance data on metrics like CPU usage, memory usage, and network traffic. This data is useful for troubleshooting and basic forecasting but not for creating what-if scenarios or fully understanding application behavior. The document also discusses concepts in capacity planning like utilization, Little's Law, and queueing theory. It provides an example of using the PDQ modeling tool to create a simple queueing model of a web application with HTTP, application, and database servers.
These slides were designed for an Apache Hadoop + Apache Apex workshop (University program).
The audience was mainly third-year engineering students from the Computer, IT, Electronics and Telecom disciplines.
I tried to keep it simple for beginners to understand. Some of the examples use context from India, but in general this would be a good starting point for beginners.
Advanced users/experts may not find this relevant.
Summary of the article: "Band selection for dimension in hyper spectral image using integrated information gain and principal component analysis technique"
An Introduction to Distributed Data Streaming - Paris Carbone
A lecture on distributed data streaming, introducing all basic abstractions such as windowing, synopses (state), partitioning and parallelism, and applying them to an example pipeline for detecting fires. It also offers a brief introduction to, and motivation for, reliability guarantees and the need for repeatable sources and application-level fault tolerance and consistency.
Queuing theory and traffic analysis in depth - IdcIdk1
This document provides a summary of concepts in queuing theory and network traffic analysis. It discusses queuing theory concepts like Little's Law, M/M/1 queues, and Kendall's notation. It then covers an empirical study of router delay that models delays using a fluid queue and reports on busy period metrics. Finally, it discusses the concept of network traffic self-similarity found in measurements of Ethernet LAN traffic.
Impatience is a Virtue: Revisiting Disorder in High-Performance Log Analytics - Badrish Chandramouli
There is a growing interest in processing real-time queries over out-of-order streams in this big data era. This paper presents a comprehensive solution to meet this requirement. Our solution is based on Impatience sort, an online sorting technique that is based on an old technique called Patience sort. Impatience sort is tailored for incrementally sorting streaming datasets that present themselves as almost sorted, usually due to network delays and machine failures. With several optimizations, our solution can adapt to both input streams and query logic. Further, we develop a new Impatience framework that leverages Impatience sort to reduce the latency and memory usage of query execution, and supports a range of user latency requirements, without compromising on query completeness and throughput, while leveraging existing efficient in-order streaming engines and operators. We evaluate our proposed solution in Trill, a high-performance streaming engine, and demonstrate that our techniques significantly improve sorting performance and reduce memory usage – in some cases, by over an order of magnitude.
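For readers unfamiliar with the older technique the abstract refers to, here is a minimal sketch of a run-based variant of classic Patience sort. It is not the paper's Impatience sort, which is an online, incremental algorithm with further optimizations; the function names `build_runs` and `patience_sort` are assumptions for this example. It shows why almost-sorted input is cheap to handle: such input splits into very few sorted runs.

```python
import heapq
from typing import List, Sequence


def build_runs(items: Sequence[int]) -> List[List[int]]:
    """Partition `items` into non-decreasing runs.

    Each item is appended to the first run whose last element is <= the
    item; if no such run exists, a new run is started. The run tails stay
    in decreasing order, so a binary search could replace the linear scan
    used here for clarity.
    """
    runs: List[List[int]] = []
    for x in items:
        for run in runs:
            if run[-1] <= x:
                run.append(x)
                break
        else:
            runs.append([x])  # x is smaller than every run's tail
    return runs


def patience_sort(items: Sequence[int]) -> List[int]:
    """Sort by building sorted runs and then k-way merging them."""
    return list(heapq.merge(*build_runs(items)))


if __name__ == "__main__":
    almost_sorted = [1, 2, 4, 3, 5, 7, 6, 8]
    print(len(build_runs(almost_sorted)), patience_sort(almost_sorted))
```

On the almost-sorted example the input collapses into two runs, so the merge step is nearly a linear pass; fully random input would produce many more runs and lose that advantage.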
This document presents the πRT-calculus, a calculus for modeling mobile real-time processes. It extends the π-calculus with a timeout operator to model real-time aspects. The document covers the syntax and semantics of the π-calculus and πRT-calculus. It also discusses design choices like having a global clock and discrete time. An example of a mobile video streaming system is used to illustrate the πRT-calculus. The document concludes by discussing future work, like developing timed bisimulation and extending to continuous time.
Development of a Distributed Stream Processing System (DSPS) in node.js and ZeroMQ and demonstration of an application of trending topics with a dataset from Twitter.
This document provides an overview of stream processing. It discusses how stream processing systems are used to process large volumes of real-time data continuously and produce actionable information. Examples of applications discussed include traffic monitoring, network monitoring, smart grids, and sensor networks. Key concepts of stream processing covered include data streams, operators, windows, programming models, fault tolerance, and platforms like Storm and Spark Streaming.
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018Seunghyun Lee
Pinot is a real-time OLAP data store that can support multiple analytics use cases like interactive dashboards, site facing queries, and anomaly detection in a single system. It achieves this through features like configurable indexes, dynamic query planning and execution, smart data partitioning and routing, and pre-materialized indexes like star-trees that optimize for latency and throughput across different workloads. The document discusses Pinot's architecture and optimizations that enable it to meet the performance requirements of these different use cases.
Inside LoLA - Experiences from building a state space tool for place transiti...Universität Rostock
LoLA is a state space tool for analyzing place/transition nets that was developed starting in 1998. It uses various reduction techniques like stubborn sets, symmetries, and linear algebra to combat state space explosion. LoLA has been applied to problems in areas like model checking, business process verification, and distributed systems. Its core data structures and algorithms keep processing costs low during operations like firing transitions and state space traversal.
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Flink Forward
Apache Beam is Flink’s sibling in the Apache family of streaming processing frameworks. The Beam and Flink teams work closely together on advancing what is possible in streaming processing, including Streaming SQL extensions and code interoperability on both platforms.
Beam was originally developed at Google as the amalgamation of its internal batch and streaming frameworks to power the exabyte-scale data processing for Gmail, YouTube and Ads. It now powers a fully-managed, serverless service Google Cloud Dataflow, as well as is available to run in other Public Clouds and on-premises when deployed in portability mode on Apache Flink, Spark, Samza and other runners. Users regularly run distributed data processing jobs on Beam spanning tens of thousands of CPU cores and processing millions of events per second.
In this session, Sergei Sokolenko, Cloud Dataflow product manager, and Reuven Lax, the founding member of the Dataflow and Beam team, will share Google’s learnings from building and operating a global streaming processing infrastructure shared by thousands of customers, including:
safe deployment to dozens of geographic locations,
resource autoscaling to minimize processing costs,
separating compute and state storage for better scaling behavior,
dynamic work rebalancing of work items away from overutilized worker nodes,
offering a throughput-optimized batch processing capability with the same API as streaming,
grouping and joining of 100s of Terabytes in a hybrid in-memory/on-desk file system,
integrating with the Google Cloud security ecosystem, and other lessons.
Customers benefit from these advances through faster execution of jobs, resource savings, and a fully managed data processing environment that runs in the Cloud and removes the need to manage infrastructure.
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...Lionel Briand
This document discusses testing dynamic behavior in executable software models for cyber-physical systems. It presents challenges for model-in-the-loop (MiL) testing due to large input spaces, expensive simulations, and lack of simple oracles. The document proposes using search-based testing to generate critical test cases by formulating it as a multi-objective optimization problem. It demonstrates the approach on an advanced driver assistance system and discusses improving performance with surrogate modeling.
Chronix is a domain specific time series database designed for anomaly detection in operational data. It is optimized for the needs of anomaly detection by supporting domain specific data types, analysis algorithms, data models, and query languages. It aims to address limitations of general purpose time series databases by exploiting characteristics of operational data through features like optional pre-computation of extras, timestamp compression, domain specific records and compression techniques, and multi-dimensional storage. An evaluation using data from five industry projects found that Chronix has significantly smaller memory and storage footprints and faster data retrieval and analysis times compared to other time series databases.
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
Apache Apex is a next gen big data analytics platform. Originally developed at DataTorrent it comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn about the Apex architecture, including its unique features for scalability, fault tolerance and processing guarantees, programming model and use cases.
http://apachebigdata2016.sched.org/event/6M0L/next-gen-big-data-analytics-with-apache-apex-thomas-weise-datatorrent
1) Pipeline processing increases computational speed by dividing tasks into sequential steps and allowing multiple tasks to progress through the steps simultaneously.
2) Arithmetic pipelines are used for fixed-point and floating-point operations by dividing the operations, like multiplication, into stages like generating partial products or adding carry bits.
3) Vector and array processors further improve parallelism by performing the same operations on multiple data elements simultaneously using multiple processing units.
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
Opensample: A Low-latency, Sampling-based Measurement Platform for Software D...Junho Suh
In this paper we propose, implement and evaluate OpenSample: a low-latency, sampling-based network measure- ment platform targeted at building faster control loops for software-defined networks. OpenSample leverages sFlow packet sampling to provide near–real-time measurements of both net- work load and individual flows. While OpenSample is useful in any context, it is particularly useful in an SDN environment where a network controller can quickly take action based on the data it provides. Using sampling for network monitoring allows OpenSample to have a 100 millisecond control loop rather than the 1–5 second control loop of prior polling-based approaches. We implement OpenSample in the Floodlight OpenFlow controller and evaluate it both in simulation and on a testbed comprised of commodity switches. When used to inform traffic engineering, OpenSample provides up to a 150% throughput improvement over both static equal-cost multi-path routing and a polling-based solution with a one second control loop.
This document discusses data stream management and streaming data warehouses. It defines key concepts like data streams, data stream management systems (DSMS), and streaming data warehouses (SDW). A DSMS processes continuous queries over data streams in real-time with low latency. An SDW integrates recent streaming data with historical data for analysis, using asynchronous and lightweight ETL processes. The document outlines components of a DSMS and SDW and algorithms for query processing, optimization, and load shedding in these systems.
Next Gen Big Data Analytics with Apache Apex discusses Apache Apex, an open source stream processing framework. It provides an overview of Apache Apex's capabilities for processing continuous, real-time data streams at scale. Specifically, it describes how Apache Apex allows for in-memory, distributed stream processing using a programming model of operators in a directed acyclic graph. It also covers Apache Apex's features for fault tolerance, dynamic scaling, and integration with Hadoop and YARN.
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters and to Google Cloud Dataflow services to give the audience a feel for the portability of Beam, a new portable Big Data processing framework recently submitted by Google to the Apache foundation. This talk will look at how the programming model handles late arriving data in a stream with event time, windows, and triggers.
Similar to Tutorial: The Role of Event-Time Analysis Order in Data Streaming (20)
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
1. Tutorial:
The Role of Event-Time Order
in Data Streaming Analysis
Vincenzo Gulisano
Chalmers University of Technology
Gothenburg, Sweden
vincenzo.gulisano@chalmers.se
Dimitris Palyvos-Giannas
Chalmers University of Technology
Gothenburg, Sweden
palyvos@chalmers.se
Bastian Havers
Chalmers University of Technology & Volvo Cars
Gothenburg, Sweden
havers@chalmers.se
Marina Papatriantafilou
Chalmers University of Technology
Gothenburg, Sweden
ptrianta@chalmers.se
3. Agenda
• Motivation, preliminaries and examples about
data streaming and Stream Processing Engines
• Causes of out-of-order data and solutions enforcing total-ordering
• Pros/Cons of total-ordering
• Relaxation of total-ordering and the watermarks
https://github.com/vincenzo-gulisano/debs2020_tutorial_event_time
6. Where does Big Data Originate?
• 1 trillion sensors by 2030 in the Internet of Things (IoT)
• 500 hours of video uploaded to YouTube every minute
• 2.32 billion Facebook users
• 219 billion photos uploaded to Facebook
• 1 online interaction every 18 seconds by 2025
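7. Database Management Systems (DBMSs) vs.
Stream Processing Engines (SPEs)
[Figure: in a DBMS, (1) data is stored on disk, (2) a query is issued, and (3) query results are returned, with query processing in main memory. In an SPE, data flows through a continuous query that is processed in main memory and produces query results.]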
9. Flink-related images / code snippets in the following are taken from: https://flink.apache.org/
10. Data Stream:
unbounded sequence of tuples
sharing the same schema
Example: vehicles’ speed reports
Field          Type
vehicle id     text
time (secs)    text
speed (Km/h)   double
X coordinate   double
Y coordinate   double
Let’s assume each source (e.g., vehicle) produces and delivers a timestamp-sorted stream:
A 8:00 55.5 X1 Y1
A 8:03 70.3 X2 Y2
A 8:07 34.3 X3 Y3
11. Continuous query (or simply query):
Directed Acyclic Graph (DAG) of streams and operators
• source op (1+ out streams)
• op (1+ in, 1+ out streams)
• sink op (1+ in streams)
[Figure: a DAG of operators (OP) connected by streams.]
12. Data Streaming Operators
Two main types:
• Stateless operators
• do not maintain any state
• one-by-one processing
• if they maintain some state, such state does not evolve depending on the tuples being processed
• Stateful operators
• maintain a state that evolves depending on the tuples being processed
• produce output tuples that depend on multiple input tuples
13. Stateless Operators
• Filter: filter / route tuples based on one (or more) conditions
• Map: transform each tuple
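The two stateless operators can be sketched in a few lines of Python. This is only an illustration, not the API of any SPE: the names SpeedReport, stream_filter and stream_map are hypothetical, and times are expressed as seconds since 8:00 for convenience.

```python
from typing import Callable, Iterable, Iterator, NamedTuple, Tuple


class SpeedReport(NamedTuple):
    """Hypothetical tuple type following the vehicle speed-report schema."""
    vehicle_id: str
    time: int      # event time in seconds (assumed to be an integer here)
    speed: float   # Km/h
    x: float
    y: float


def stream_filter(stream: Iterable[SpeedReport],
                  predicate: Callable[[SpeedReport], bool]) -> Iterator[SpeedReport]:
    # Stateless: each tuple is kept or dropped on its own.
    for t in stream:
        if predicate(t):
            yield t


def stream_map(stream: Iterable[SpeedReport],
               fn: Callable[[SpeedReport], Tuple]) -> Iterator[Tuple]:
    # Stateless: each output depends on exactly one input tuple.
    for t in stream:
        yield fn(t)


reports = [
    SpeedReport("A", 0, 55.5, 1.0, 1.0),
    SpeedReport("A", 180, 70.3, 2.0, 2.0),
    SpeedReport("A", 420, 34.3, 3.0, 3.0),
]
fast = stream_filter(reports, lambda t: t.speed > 50.0)
projected = list(stream_map(fast, lambda t: (t.vehicle_id, t.time, t.speed)))
```

Neither operator keeps anything between tuples, which is why such operators can be replicated freely.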
15. Stateful Operators
• Aggregate: aggregate information from multiple tuples (e.g., max, min, sum, ...)
• Join: join tuples coming from 2 streams given a certain predicate
16. Windows and Stateful Analysis
Stateful operations are done over windows:
• Time-based (e.g., tuples in the last 10 minutes)
• Tuple-based (e.g., the last 50 tuples)
Example of time-based window of size 1 hour and advance 20 minutes: [8:00,9:00), [8:20,9:20), [8:40,9:40)
How many tuples are in a window?
Which time period does a window span?
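A minimal sketch of how a timestamp maps to time-based windows, assuming (this is an assumption, not stated in the slides) that window starts are aligned to multiples of the advance and that time is an integer number of minutes since midnight. The function name windows_for is hypothetical.

```python
from typing import List, Tuple


def windows_for(ts: int, size: int, advance: int) -> List[Tuple[int, int]]:
    """Return all [start, end) windows of the given size and advance that contain ts."""
    start = (ts // advance) * advance   # latest window start that is <= ts
    starts = []
    while start > ts - size:            # the window still covers ts
        starts.append(start)
        start -= advance
    return [(s, s + size) for s in reversed(starts)]


# 8:30 expressed in minutes, with size 1 hour and advance 20 minutes.
print(windows_for(8 * 60 + 30, 60, 20))
```

With these parameters every tuple falls in size / advance = 3 windows; for 8:30 these are [7:40,8:40), [8:00,9:00) and [8:20,9:20).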
20. Basic operators and user-defined operators
Besides a set of basic operators, SPEs usually
allow the user to define ad-hoc operators
(e.g., when existing aggregations are not enough)
21. SPEs and operators' variants
• Each SPE might define its own variants of certain streaming operators
[Figure: a join of streams R and S (tuples t1 to t4 on each) over sliding windows WR and WS (window size WS), with predicate P.]
22. Sample Query
"every five minutes, of all vehicles that braked significantly, find the one that braked the hardest"
A 8:00 55.5 X1 Y1 ...
B 8:03 70.3 X2 Y2 ...
B 8:07 34.3 X3 Y3 ...
23. Sample Query
Input fields: vehicle id, time (secs), speed (Km/h), X coordinate, Y coordinate, ...
1. Map: remove unused fields. Output fields: vehicle id, time (secs), speed (Km/h)
2. Aggregate: every minute, compute average speed of each vehicle during the last 2.5 minutes. Output fields: vehicle id, time (secs), avg speed (Km/h)
3. Join: join on vehicle id in last minute. Output fields: vehicle id, time (secs), avg speed (Km/h), speed (Km/h)
4. Filter: high average speed and slow current speed? Output fields: vehicle id, time (secs), braking factor
5. Aggregate: every 5 minutes, produce vehicle that braked the hardest during last 5 minutes
25. From an abstract query
… to streaming application run by an SPE
[Figure: the abstract query (Source, OP1 to OP6, Sink) and its deployment by an SPE, in which some operators (e.g., OP3) appear more than once.]
31. Data sources that produce out-of-order data
• Discussed in many related works (e.g., Babu, Shivnath, Utkarsh Srivastava, and Jennifer Widom. "Exploiting k-constraints to reduce memory overhead in continuous queries over data streams." ACM Transactions on Database Systems (TODS) 29.3 (2004): 545-580.)
• Battery-operated devices, unreliable wireless networks, …
• 1 trillion sensors by 2030 in the Internet of Things (IoT)
32. Causes of out-of-order data
• Sources themselves
• Asynchronous, distributed, parallel executions
33. The 3-step procedure
(sequential stream join)
For each incoming tuple t:
1. compare t with all tuples in opposite window given predicate P
2. add t to its window
3. remove stale tuples from t’s window
We assume each producer delivers tuples in timestamp order
[Figure: producer Prod R adds tuples to R and producer Prod S adds tuples to S; the PU runs the join; the consumer Cons consumes the results.]
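The three steps can be sketched as follows. This is a minimal single-threaded illustration under stated assumptions: the input is already one timestamp-sorted sequence, a tuple is considered stale once it is at least ws time units older than the incoming tuple, and that staleness check is also applied during the comparison in step 1. The names StreamTuple and three_step_join are hypothetical.

```python
from collections import deque
from typing import Callable, Iterable, List, NamedTuple, Tuple


class StreamTuple(NamedTuple):
    stream: str   # "R" or "S"
    ts: int       # event time
    value: int


def three_step_join(tuples: Iterable[StreamTuple], ws: int,
                    predicate: Callable[[StreamTuple, StreamTuple], bool]
                    ) -> List[Tuple[StreamTuple, StreamTuple]]:
    windows = {"R": deque(), "S": deque()}
    results = []
    for t in tuples:                      # assumed sorted by timestamp
        own = windows[t.stream]
        opposite = windows["S" if t.stream == "R" else "R"]
        # Step 1: compare t with the tuples in the opposite window.
        for u in opposite:
            if t.ts - u.ts < ws and predicate(t, u):
                results.append((t, u) if t.stream == "R" else (u, t))
        # Step 2: add t to its own window.
        own.append(t)
        # Step 3: remove stale tuples from t's own window.
        while own and own[0].ts <= t.ts - ws:
            own.popleft()
    return results


r1 = StreamTuple("R", 1, 1)
s2 = StreamTuple("S", 2, 1)
r3 = StreamTuple("R", 3, 2)
s12 = StreamTuple("S", 12, 1)
matches = three_step_join([r1, s2, r3, s12], ws=10,
                          predicate=lambda a, b: a.value == b.value)
```

Here s12 has the same value as r1 but arrives more than ws time units later, so only the pair (r1, s2) is produced.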
34. The 3-step procedure, is it enough?
[Figure: two snapshots of windows WR and WS over streams R and S, each holding tuples t1 and t2, with tuples t3 and t4 arriving.]
35. Causes of out-of-order data
• Asynchronous, distributed, parallel executions
Any operator fed data from multiple logical / physical streams can potentially observe out-of-order data
41. Parallel execution
• Stateful operators: Semantic awareness
• Aggregate: count within last hour, group-by vehicle id
[Figure: tuples from the previous subcluster pass through R and M components to the parallel instances Agg1, Agg2 and Agg3; vehicle A is highlighted.]
42. Parallel execution
• Depending on the stateful operator semantics:
• Partition input stream into keys
• Each key is processed by 1 thread
• # keys >> # threads/nodes
44. Parallel execution
[Figure: the keys domain (keys A to F) is partitioned among Agg1, Agg2 and Agg3.]
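One common way to partition the keys domain is to hash each key to an instance. The sketch below assumes hash partitioning (the slides do not prescribe a particular scheme) and uses a CRC32 checksum so that the mapping is stable across runs; instance_for and partition are hypothetical names.

```python
import zlib
from typing import Dict, Iterable, List


def instance_for(key: str, n_instances: int) -> int:
    """Map a key to one of n_instances parallel instances, always the same one."""
    return zlib.crc32(key.encode("utf-8")) % n_instances


def partition(keys: Iterable[str], n_instances: int) -> Dict[int, List[str]]:
    """Group keys by the instance responsible for them."""
    assignment: Dict[int, List[str]] = {i: [] for i in range(n_instances)}
    for key in keys:
        assignment[instance_for(key, n_instances)].append(key)
    return assignment


vehicle_ids = ["A", "B", "C", "D", "E", "F"]
assignment = partition(vehicle_ids, 3)   # e.g., instances Agg1, Agg2, Agg3
```

Since all tuples of a given vehicle reach the same instance, each instance can maintain the state of its keys without coordinating with the others.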
46. Parallel execution
• Stateful operators: Semantic awareness
• Aggregate: count within last hour, group-by vehicle id
[Figure: the tuples of vehicle A (A 8:00 55.5 X1 Y1, A 8:03 70.3 X2 Y2, A 8:07 34.3 X3 Y3) are sent round-robin to two stateless Map instances and then, through R and M components, to Agg1, Agg2 and Agg3.]
48. Inherent disorder
[Figure: the sample query (Map, Aggregate, Join, Filter, Aggregate) followed by a print-to-file operator P, and a version in which the Map is run by four instances (Map0 to Map3): disorder from parallelism.]
49. how to merge several timestamp-sorted streams...
...into one timestamp-sorted stream?
[Figure: several streams entering a merge component M.]
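One possible answer, sketched below under stated assumptions: keep one queue per input stream and forward the tuple with the smallest timestamp only while every queue is non-empty, because an empty queue could still deliver an earlier tuple. Ties are broken by source index. The class SortedMerger is hypothetical and is not the implementation used by the systems cited in the following slides.

```python
from collections import deque
from typing import Any, List, Tuple


class SortedMerger:
    """Merge timestamp-sorted input streams into one timestamp-sorted output."""

    def __init__(self, n_sources: int) -> None:
        self.queues = [deque() for _ in range(n_sources)]

    def add(self, source: int, ts: int, payload: Any) -> List[Tuple[int, int, Any]]:
        """Buffer a tuple and return the tuples that are now safe to forward."""
        self.queues[source].append((ts, source, payload))
        ready = []
        # A tuple is safe to forward only if all sources have something buffered.
        while all(self.queues):
            earliest = min(self.queues, key=lambda q: q[0])
            ready.append(earliest.popleft())
        return ready


merger = SortedMerger(2)
out = []
out.append(merger.add(0, 1, "a"))   # source 1 has sent nothing yet
out.append(merger.add(0, 4, "b"))
out.append(merger.add(1, 3, "c"))   # ts 1 and ts 3 become safe to forward
out.append(merger.add(1, 5, "d"))   # ts 4 becomes safe to forward
```

Tuples from source 0 wait until source 1 delivers something, which illustrates why the latency of sorting depends on the slowest source.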
50. Gulisano, Vincenzo, et al. "StreamCloud: An elastic and scalable data streaming system." IEEE Transactions on Parallel and Distributed Systems 23.12 (2012): 2351-2365.
51. Balazinska, Magdalena, et al. "Fault-tolerance in the Borealis distributed stream processing system." Proceedings of the 2005 ACM SIGMOD international conference on Management of data. 2005.
52. Gulisano, Vincenzo, et al. "ScaleJoin: A deterministic, disjoint-parallel and skew-resilient stream join." IEEE Transactions on Big Data (2016).
53. Which one to choose?
• Different options have different costs
• might require different types of data structures
• might require “special tuples” (more on this in the following part of the tutorial)
• When the cost of merge-sorting is paid by whoever provides the tuples rather than whoever receives them, the system can scale better
58. Pros/Cons of total-ordering
• Cons:
• Expensive (computation- and latency-wise)
• An “overkill” for certain applications (more on this in the following slides)
• Pros:
• Determinism
• Synchronization
• Eager purging of stale state
60. Cost
• We need to temporarily maintain tuples
• Linear in the number of tuples we receive, which depends on the streams’ rate
• We need to sort tuples… O(n log(n))
• (n is number of sources or tuples, depending on the case)
• We need data from all sources, so the processing latency depends on the slowest source.
• The latency overhead can be estimated based on the sources’ rates [1]
[1] Gulisano, Vincenzo, et al. "Performance modeling of stream joins." Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems. 2017.
61. Estimating the latency overhead
[1] Gulisano, Vincenzo, et al. "Performance modeling of stream joins." Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems. 2017.
63. Determinism
[Figure: the abstract query (Source, OP1 to OP6, Sink) and its deployment by an SPE, as in slide 25.]
Balazinska, Magdalena, et al. "Fault-tolerance in the Borealis distributed stream processing system." Proceedings of the 2005 ACM SIGMOD international conference on Management of data. 2005.
Gulisano, Vincenzo. StreamCloud: an elastic parallel-distributed stream processing engine. Diss. 2012.
64. Determinism
[Figure: tuples t1 to t9 from streams S1 and S2 are merged into the single sequence t1 t2 t3 t4 t5 t6 t7 t8 t9 …]
Tuple-based window, size: 4 / advance: 1
Hwang, Jeong-Hyon, Ugur Cetintemel, and Stan Zdonik. "Fast and reliable stream processing over wide area networks." 2007 IEEE 23rd International Conference on Data Engineering Workshop. IEEE, 2007.
Gulisano, Vincenzo, Yiannis Nikolakopoulos, Marina Papatriantafilou, and Philippas Tsigas. "ScaleJoin: A deterministic, disjoint-parallel and skew-resilient stream join." IEEE Transactions on Big Data (2016).
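Why ordering matters for determinism can be seen with a tuple-based window of size 4 and advance 1. The sketch below uses made-up integer timestamps for two streams and two hypothetical arrival interleavings; tuple_windows is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def tuple_windows(stream: Sequence[int], size: int, advance: int) -> List[Tuple[int, ...]]:
    """Return the contents of every full tuple-based window over the sequence."""
    return [tuple(stream[i:i + size])
            for i in range(0, len(stream) - size + 1, advance)]


# Timestamps of S1 are 1, 2, 5, 6 and of S2 are 3, 4, 7, 8 (made-up values).
# Both arrival orders respect the order of each individual stream.
arrival_a = [1, 3, 2, 4, 5, 7, 6, 8]
arrival_b = [3, 1, 4, 2, 7, 5, 8, 6]

unsorted_a = tuple_windows(arrival_a, 4, 1)
unsorted_b = tuple_windows(arrival_b, 4, 1)
sorted_a = tuple_windows(sorted(arrival_a), 4, 1)
sorted_b = tuple_windows(sorted(arrival_b), 4, 1)
```

Without sorting, the window contents depend on the interleaving of the two streams; after merge-sorting by timestamp, both runs see the same windows.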
67. Synchronization
[Figure: four snapshots of a join over streams R and S with window WR and tuples t1 to t4.]
Gulisano, Vincenzo, Yiannis Nikolakopoulos, Marina Papatriantafilou, and Philippas Tsigas. "ScaleJoin: A deterministic, disjoint-parallel and skew-resilient stream join." IEEE Transactions on Big Data (2016).
68. Synchronization
[Figure: tuples t1 to t9 from streams R and S.]
• Act as a barrier
• Carry the reconfiguration to be applied
• Change the parallelism degree
• Change other configuration
t3 t4 t5 t6… …
: Wait for , make sure
are ready and stop
: Wait for , make sure
are ready and stop
Najdataei, Hannaneh, et al. "Stretch: Scalable and elastic deterministic streaming analysis with virtual shared-nothing parallelism." Proceedings of the 13th ACM International Conference on Distributed and Event-based Systems. 2019.
73. Relaxing Correctness
Trade-off: Correctness vs. Performance / Flexibility
• ✓ In-order input streams, ✓ Order in each window, ✓ Correct window assignment
• ✗ In-order input streams, ✓ Order in each window, ✓ Correct window assignment
• ✗ In-order input streams, ✗ Order in each window, ✓ Correct window assignment
74. Disorder + Correct Window Assignment
1. When to create each window?
2. When to close each window (and produce result)?
[Figure: event-time axis with marks at 0, 10 and 20; tuples with ts: 7, ts: 16, ts: 18, ts: 3, ts: 21.]
76. Closing Windows
Operator needs some guarantee it will not receive tuples with ts < W
Safely close all windows where right_boundary ≤ W
[Figure: event-time axis with marks at 0, 10 and 20; tuples with ts: 7, ts: 18, ts: 9, ts: 21, ts: 17; one window is marked CLOSED.]
77. Watermarks
Assume we can compute a monotonic function
F: O → E
that returns W ∈ E, the earliest event time of any tuple that can arrive at operator O in the future
We call the value of this monotonic function F* the (low) watermark of operator O!
Monotonicity ➔ no tuples with ts < W will arrive in the future.
Solves problem of safely closing windows!
* The watermark of operator O is a function of O and all its upstream peers, but we omit the latter for brevity.
[Figure: event-time axis with marks at 0, 10 and 20; tuples with ts: 7, ts: 18, ts: 9, ts: 21, ts: 17; as the watermark W advances, a window is marked CLOSED.]
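A minimal sketch of how a watermark lets an operator close windows despite disorder. It assumes non-overlapping windows of size 10 aligned at 0 (matching the marks 0, 10, 20 of the figure) and a count aggregate; the class TumblingCount and its method names are hypothetical.

```python
from typing import Dict, List, Tuple


class TumblingCount:
    """Count tuples per window; close windows only when the watermark allows it."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.open: Dict[int, int] = {}   # window start -> tuples counted so far

    def on_tuple(self, ts: int) -> None:
        # Windows are created on demand, whatever the arrival order.
        start = (ts // self.size) * self.size
        self.open[start] = self.open.get(start, 0) + 1

    def on_watermark(self, w: int) -> List[Tuple[int, int, int]]:
        # Close every window whose right boundary is <= W.
        closed = []
        for start in sorted(self.open):
            if start + self.size <= w:
                closed.append((start, start + self.size, self.open.pop(start)))
        return closed


op = TumblingCount(10)
for ts in (7, 18, 9):
    op.on_tuple(ts)
first = op.on_watermark(10)    # no tuple with ts < 10 can still arrive
for ts in (21, 17):
    op.on_tuple(ts)
second = op.on_watermark(20)   # window [20, 30) stays open
```

The tuple with ts 9 arrives after the one with ts 18, yet it is still counted in [0, 10) because that window is closed only once the watermark reaches 10.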
78. Watermarks in Practice
Watermarks are generated at the sources.
They (conceptually) flow through the pipeline.
They propagate regardless of data filtering ➔ all operators have up-to-date view of time
[Figure: source S and operators O1, O2 and O3, with watermarks WS, WO1, WO2, WO3 and output watermarks WO1,OUT, WO2,OUT, WO3,OUT.]
Input Watermark: Earliest ts that O can receive.
Output Watermark: Earliest ts that O can emit.
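For an operator with several inputs, a common approach (an assumption here, the slides do not spell it out) is to take the minimum of the latest watermarks received on each input as the operator's input watermark, and to forward it only when it advances. The class WatermarkTracker is a hypothetical sketch of that rule.

```python
from typing import List, Optional


class WatermarkTracker:
    """Track the input watermark of an operator fed by several channels."""

    def __init__(self, n_inputs: int) -> None:
        self.latest: List[float] = [float("-inf")] * n_inputs
        self.current: float = float("-inf")

    def on_watermark(self, channel: int, w: float) -> Optional[float]:
        """Record a watermark; return the new input watermark if it advanced."""
        self.latest[channel] = max(self.latest[channel], w)
        candidate = min(self.latest)   # earliest ts any channel can still deliver
        if candidate > self.current:
            self.current = candidate
            return candidate
        return None


tracker = WatermarkTracker(2)
steps = [
    tracker.on_watermark(0, 10),   # channel 1 unknown: nothing to forward
    tracker.on_watermark(1, 7),    # input watermark becomes 7
    tracker.on_watermark(1, 15),   # limited by channel 0: becomes 10
    tracker.on_watermark(0, 12),   # becomes 12
]
```

Taking the minimum keeps the forwarded value monotonic and never ahead of the slowest input.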
79. Input & Output Timestamps
• Stateless: tsOUT = tsIN
• Stateful: tsOUT ≠ tsIN
1. Which windows are complete?
2. How is the timestamp set for window results?
[Figure: a stateful operator whose state spans event time, with input timestamps tsIN and output timestamps tsOUT.]
81. Flink Example
[Figure: the query Map, Aggregate, Join (four instances, Join0 to Join3), Filter, Aggregate, followed by P.]
1. Result Correctness from correct window assignment
2. Watermark propagation (output watermark)
82. Generating Watermarks
• Perfect Watermarks
  • Sorted data or very predictable data sources.
  • Determinism without sorting for order-independent window functions.
  • Disorder inside windows is still possible!
  • Sorting is possible if needed, but not imposed.
• Heuristic Watermarks
  • When it is impossible to perfectly predict the data (e.g., distributed sources).
  • Best-effort prediction of event-time progress.
  • Possibility of late data.
  • More knowledge about the internals of the sources → fewer mispredictions (less late data).
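A simple heuristic watermark can be sketched by assuming that disorder is bounded: the watermark trails the largest timestamp seen so far by a fixed delay, and any tuple that arrives below the current watermark counts as late. The class name, the `max_delay` parameter and the bound of 5 are assumptions made for illustration, not part of the slides.

```python
class BoundedDisorderWatermark:
    """Heuristic watermark: W = (largest ts seen so far) - max_delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_ts = None

    def observe(self, ts):
        """Process one tuple and return (current watermark, tuple_is_late)."""
        # A tuple is late if it falls below the watermark promised so far.
        late = self.max_ts is not None and ts < self.max_ts - self.max_delay
        if self.max_ts is None or ts > self.max_ts:
            self.max_ts = ts
        # max_ts never decreases, so the watermark is monotonic.
        return self.max_ts - self.max_delay, late


gen = BoundedDisorderWatermark(max_delay=5)
trace = [gen.observe(ts) for ts in [7, 18, 9, 21, 17]]
```

In this run the tuple with ts 9 arrives after the watermark has reached 13, so it is flagged as late. A larger `max_delay` would avoid that misprediction at the cost of a slower watermark.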
83. Fast and Slow Watermarks
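[Figure: window lifetime along the event-time axis, comparing the point where the window is complete with the point where the watermark passes]
• Perfect Watermark: passes exactly when the window is complete.
• Slow Watermark: passes after the window is complete. ✗ Performance (Latency)
• Fast Watermark: passes before the window is complete. ✗ Correctness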
84. Triggering
[Figure: event-time axis with marks at 0h, 12h, 24h and 36h]
Latency of 24 hours! Can we do better? Emit a (partial) result every 1h.
Completeness Trigger: emit the result when the window is complete.
Repeated Update Trigger: periodically emit a (partial) result (e.g., for every tuple, every hour, etc.).
Trade-off: Correctness, Performance, Flexibility
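The two triggers can be contrasted with a minimal sketch of a counting window. The event encoding (`("tuple", ts)` and `("watermark", w)` pairs), the function name `run_window` and the choice of firing the repeated trigger every N tuples are assumptions made for illustration.

```python
def run_window(events, start, end, repeated_every=None):
    """Count tuples in window [start, end) and return the emitted results.

    Without repeated_every, only the completeness trigger fires.
    With repeated_every=N, a partial result is also emitted every N tuples.
    """
    count = 0
    out = []
    for kind, t in events:
        if kind == "tuple" and start <= t < end:
            count += 1
            if repeated_every and count % repeated_every == 0:
                out.append(("partial", count))  # repeated update trigger
        elif kind == "watermark" and t >= end:
            out.append(("final", count))        # completeness trigger
            break
    return out


# Hypothetical 24h window with timestamps expressed in hours.
events = [("tuple", 1), ("tuple", 5), ("tuple", 13), ("tuple", 20), ("watermark", 24)]
only_final = run_window(events, 0, 24)
with_partials = run_window(events, 0, 24, repeated_every=2)
```

With the completeness trigger alone, nothing is visible until the watermark reaches 24h. With the repeated update trigger, partial counts appear earlier and the final result is unchanged.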
86. Putting It All Together
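[Figure: window lifetime along the event-time axis, comparing the point where the window is complete with the point where the watermark passes, and marking early, on-time and late results]
• Perfect Watermark: passes exactly when the window is complete.
• Slow Watermark: passes after the window is complete. ✗ Performance (Latency)
• Fast Watermark: passes before the window is complete. ✗ Correctness
• Repeated Trigger, early results: ✓ Performance (Latency)
• Repeated Trigger, late results: ✓ Correctness
• Watermark Trigger: on-time result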
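How the three kinds of results combine can be sketched as follows, assuming a per-tuple repeated trigger and a heuristic (possibly fast) watermark. The function name `classify_results` and the event encoding are hypothetical.

```python
def classify_results(events, start, end):
    """Count tuples in window [start, end) and label every emitted result.

    Results before the watermark passes `end` are early, the one emitted when
    it passes is on-time, and updates caused by later tuples are late.
    """
    count = 0
    watermark_passed = False
    out = []
    for kind, t in events:
        if kind == "watermark":
            if not watermark_passed and t >= end:
                watermark_passed = True
                out.append(("on-time", count))
        elif start <= t < end:
            count += 1
            out.append(("late" if watermark_passed else "early", count))
    return out


# The watermark reaches 10 before the tuple with ts 9 arrives (fast watermark).
events = [("tuple", 7), ("tuple", 3), ("watermark", 10), ("tuple", 9), ("tuple", 12)]
results = classify_results(events, 0, 10)
```

The early results reduce latency, while the late result repairs the count that the fast watermark finalized too soon.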
87. … the light at the end of the tunnel …
• Motivation, preliminaries and examples about data streaming and Stream Processing Engines
• Causes of out-of-order data and solutions enforcing total-ordering
• Pros/Cons of total-ordering
• Relaxation of total-ordering and the watermarks
88. To summarize:
Event time advances based on tuples themselves:
Pros
• Determinism, for order-sensitive / order-insensitive functions
Cons
• Costly merge-sorting
• Coupled processing / output of tuples

Event time advances based on watermarks:
Pros
• Decoupled processing / output of tuples
• Propagation of time passing even in the absence of tuples
Cons
• Would require special support for order-sensitive functions
• Latency depends on frequency of watermarks
89. Tutorial:
The Role of Event-Time Order
in Data Streaming Analysis
Vincenzo Gulisano
Chalmers University of Technology
Gothenburg, Sweden
vincenzo.gulisano@chalmers.se
Dimitris Palyvos-Giannas
Chalmers University of Technology
Gothenburg, Sweden
palyvos@chalmers.se
Bastian Havers
Chalmers University of Technology & Volvo Cars
Gothenburg, Sweden
havers@chalmers.se
Marina Papatriantafilou
Chalmers University of Technology
Gothenburg, Sweden
ptrianta@chalmers.se