Invited lecture at the University of Trieste.
The lecture covers (briefly) the data streaming processing paradigm, research challenges related to distributed, parallel and deterministic streaming analysis and the research of the DCS (Distributed Computing and Systems) groups at Chalmers University of Technology.
Tutorial: The Role of Event-Time Analysis Order in Data StreamingVincenzo Gulisano
This document provides a tutorial on the role of event-time order in data streaming analysis. The agenda covers motivations and examples of data streaming and stream processing engines, causes of out-of-order data and solutions to enforce total ordering, pros and cons of total ordering, and relaxation of total ordering using watermarks. Enforcing total ordering through techniques like sorting tuples is computationally expensive but provides benefits like determinism and synchronization. However, it may be an overkill for some applications and increase latency.
Crash course on data streaming (with examples using Apache Flink)Vincenzo Gulisano
These are the slides I used for a crash course (4 hours) on data streaming. It contains both theory / research aspects as well as examples based on Apache Flink (DataStream API)
Slides for my Associate Professor (oavlönad docent) lecture.
The lecture is about Data Streaming (its evolution and basic concepts) and also contains an overview of my research.
The benefits of fine-grained synchronization in deterministic and efficient ...Vincenzo Gulisano
This talk, given by Vincenzo Gulisano and Yiannis Nikolakopoulos at Yahoo! discusses some of their latest research results in the field of deterministic and efficient parallelization of data streaming operators. It also present ScaleGate, the abstract data type at the core of their research and whose java-based lock-free implementation is available at https://github.com/dcs-chalmers/ScaleGate_Java
ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream JoinVincenzo Gulisano
This is the presentation of the paper "ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join", presented by Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou and Philippas Tsigas at the IEEE Big Data conference held in Santa Clara, 2015.
This document discusses techniques for mining data streams. It begins by defining different types of streaming data like time-series data and sequence data. It then discusses the characteristics of data streams like their huge volume, fast changing nature, and requirement for real-time processing. The key challenges in stream query processing are the unbounded memory requirements and need for approximate query answering. The document outlines several synopsis data structures and techniques used for mining data streams, including random sampling, histograms, sketches, and randomized algorithms. It also discusses architectures for stream query processing and classification of dynamic data streams.
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkFlink Forward
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries and an execution framework.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection and recommendations.
3) The Vertical Hoeffding Tree algorithm in SAMOA provides high parallelism and accuracy for streaming decision tree learning, outperforming native Apache Flink implementations on certain datasets while being faster on others.
Tutorial: The Role of Event-Time Analysis Order in Data StreamingVincenzo Gulisano
This document provides a tutorial on the role of event-time order in data streaming analysis. The agenda covers motivations and examples of data streaming and stream processing engines, causes of out-of-order data and solutions to enforce total ordering, pros and cons of total ordering, and relaxation of total ordering using watermarks. Enforcing total ordering through techniques like sorting tuples is computationally expensive but provides benefits like determinism and synchronization. However, it may be an overkill for some applications and increase latency.
Crash course on data streaming (with examples using Apache Flink)Vincenzo Gulisano
These are the slides I used for a crash course (4 hours) on data streaming. It contains both theory / research aspects as well as examples based on Apache Flink (DataStream API)
Slides for my Associate Professor (oavlönad docent) lecture.
The lecture is about Data Streaming (its evolution and basic concepts) and also contains an overview of my research.
The benefits of fine-grained synchronization in deterministic and efficient ...Vincenzo Gulisano
This talk, given by Vincenzo Gulisano and Yiannis Nikolakopoulos at Yahoo! discusses some of their latest research results in the field of deterministic and efficient parallelization of data streaming operators. It also present ScaleGate, the abstract data type at the core of their research and whose java-based lock-free implementation is available at https://github.com/dcs-chalmers/ScaleGate_Java
ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream JoinVincenzo Gulisano
This is the presentation of the paper "ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join", presented by Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou and Philippas Tsigas at the IEEE Big Data conference held in Santa Clara, 2015.
This document discusses techniques for mining data streams. It begins by defining different types of streaming data like time-series data and sequence data. It then discusses the characteristics of data streams like their huge volume, fast changing nature, and requirement for real-time processing. The key challenges in stream query processing are the unbounded memory requirements and need for approximate query answering. The document outlines several synopsis data structures and techniques used for mining data streams, including random sampling, histograms, sketches, and randomized algorithms. It also discusses architectures for stream query processing and classification of dynamic data streams.
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkFlink Forward
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries and an execution framework.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection and recommendations.
3) The Vertical Hoeffding Tree algorithm in SAMOA provides high parallelism and accuracy for streaming decision tree learning, outperforming native Apache Flink implementations on certain datasets while being faster on others.
When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Pro...Anis Nasir
This document proposes two algorithms, D-Choices and W-Choices, to improve load balancing in distributed stream processing systems. The algorithms identify "heavy hitters" or frequent keys in the data stream and process them using more than two workers to better balance load. Evaluation shows the algorithms provide up to 150% higher throughput and 60% lower latency compared to traditional partitioning approaches.
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...Anis Nasir
This paper proposes a technique called Partial Key Grouping (PKG) to balance load in distributed stream processing engines. PKG splits keys between workers and assigns instances using the "power of two choices" algorithm. This achieves load balancing while maintaining low memory and aggregation overhead of O(1), unlike shuffle grouping which has O(W) overhead. Experiments on real datasets with Apache Storm show PKG improves throughput by 60% and reduces latency by 45% compared to key grouping and shuffle grouping.
This dissertation defense presentation covered Peng Du's work on making one-sided dense linear algebra algorithms resilient to hard and soft errors. The presentation included:
1) Motivation for the work due to increasing failure rates in large HPC systems.
2) An overview of the dissertation goals to make LU, QR and solvers resilient to hard and soft errors.
3) Key contributions including efficient protection of the left factor from hard errors, recovery from hard errors using checkpointing, and detection and recovery from multiple soft errors.
An overview of streaming algorithms: what they are, what the general principles regarding them are, and how they fit into a big data architecture. Also four specific examples of streaming algorithms and use-cases.
Scalable Distributed Real-Time Clustering for Big Data StreamsAntonio Severien
This thesis presents a scalable distributed clustering algorithm for streaming big data. The author implemented a real-time distributed clustering algorithm and a classification algorithm using the Scalable Advanced Massive Online Analysis (SAMOA) framework. SAMOA is a platform-independent framework for distributed machine learning on data streams. It provides interfaces for algorithms to be run on distributed stream processing engines like Apache S4 and Twitter Storm. The author's algorithms were tested on these platforms using the SAMOA framework.
Scalable Distributed Real-Time Clustering for Big Data StreamsAntonio Severien
Analyzing and applying machine learning algorithms to a possibly infinite flow of data is a challenging task. This presentation presents the SAMOA framework, which allows the development of machine learning algorithms on top of any distributed stream processing engine. It also demonstrates the development and use of a distributed clustering algorithm based on CluStream using the Apache S4 platform.
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. In in this lecture we overview the mining of data streams
Presentation for the Softskills Seminar course @ Telecom ParisTech. Topic is the paper by Domings Hulten "Mining high speed data streams". Presented by me the 30/11/2017
Mining Big Data Streams with APACHE SAMOAAlbert Bifet
In this talk, we present Apache SAMOA, an open-source platform for
mining big data streams with Apache Flink, Storm and Samza. Real time analytics is
becoming the fastest and most efficient way to obtain useful knowledge
from what is happening now, allowing organizations to react quickly
when problems appear or to detect new trends helping to improve their
performance. Apache SAMOA includes algorithms for the most common
machine learning tasks such as classification and clustering. It
provides a pluggable architecture that allows it to run on Apache
Flink, but also with other several distributed stream processing
engines such as Storm and Samza.
Mining big data streams with APACHE SAMOA by Albert BifetJ On The Beach
In this talk, we present Apache SAMOA, an open-source platform for mining big data streams with Apache Flink, Storm and Samza. Real time analytics is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends helping to improve their performance. Apache SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering. It provides a pluggable architecture that allows it to run on Apache Flink, but also with other several distributed stream processing engines such as Storm and Samza.
Introduction to Data streaming - 05/12/2014Raja Chiky
Raja Chiky is an associate professor whose research interests include data stream mining, distributed architectures, and recommender systems. The document outlines data streaming concepts including what a data stream is, data stream management systems, and basic approximate algorithms used for processing massive, high-velocity data streams. It also discusses challenges in distributed systems and using semantic technologies for data streaming.
This document describes research on implementing Curran's approximation algorithm for pricing Asian options using a dataflow architecture. The algorithm was implemented on a Maxeler dataflow engine (DFE) and compared to a CPU implementation. Different fixed-point precisions were tested on the DFE and 54-bit fixed-point provided the best balance of precision and resource usage. Implementing the algorithm across multiple DFEs provided speedups of 5-12x over a 48-core CPU. Further optimization of dynamic ranges allowed increasing the unrolling factor, improving performance and energy efficiency.
Stream data mining & CluStream frameworkYueshen Xu
This document summarizes a framework for clustering evolving data streams. The framework uses micro-clusters to represent subsets of data points and clusters micro-clusters into macro-clusters over different time horizons. Micro-clusters are represented by cluster feature vectors (CFVs) and are updated when new data points arrive by joining, deleting, or merging micro-clusters. Macro-clusters are formed by applying a modified k-means algorithm that clusters micro-clusters represented by their CFVs over different time periods.
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...Srinath Perera
DEBS Grand Challenge is a yearly, real-life data based event processing challenge posted by Distributed Event Based Systems conference. The 2015 challenge uses a taxi trip data set from New York city that includes 173 million events col- lected over the year 2013. This paper presents how we used WSO2 CEP, an open source, commercially available Complex Event Processing Engine, to solve the problem. With a 8-core commodity physical machine, our solution did about 350,000 events/second throughput with mean latency less than one millisecond for both queries. With a VirtualBox VM, our solution did about 50,000 events/second throughput with mean latency of 1 and 6 milliseconds for the first and second queries respectively. The paper will outline the solution, present results, and discuss how we optimized the solution for maximum performance.
Self-adaptive container monitoring with performance-aware Load-Shedding policies, by Rolando Brondolin, PhD student in System Architecture at Politecnico di Milano
Artificial intelligence and data stream miningAlbert Bifet
Big Data and Artificial Intelligence have the potential to
fundamentally shift the way we interact with our surroundings. The
challenge of deriving insights from data streams has been recognized
as one of the most exciting and key opportunities for both academia
and industry. Advanced analysis of big data streams from sensors and
devices is bound to become a key area of artificial intelligence
research as the number of applications requiring such processing
increases. Dealing with the evolution over time of such data streams,
i.e., with concepts that drift or change completely, is one of the
core issues in stream mining. In this talk, I will present an overview
of data stream mining, industrial applications, open source tools, and
current challenges of data stream mining.
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
The document discusses Samsara, a domain specific language for distributed machine learning. It provides an algebraic expression language for linear algebra operations and optimizes distributed computations. An example of linear regression on a cereals dataset is presented to demonstrate how Samsara can be used to estimate regression coefficients in a distributed fashion. Key steps include loading data as a distributed row matrix, extracting feature and target matrices, computing the normal equations, and solving the linear system to estimate coefficients.
Big Graph Analytics Systems (Sigmod16 Tutorial)Yuanyuan Tian
In recent years we have witnessed a surging interest in developing Big Graph processing systems. To date, tens of Big Graph systems have been proposed. This tutorial provides a timely and comprehensive review of existing Big Graph systems, and summarizes their pros and cons from various perspectives. We start from the existing vertex-centric systems, which which a programmer thinks intuitively like a vertex when developing parallel graph algorithms. We then introduce systems that adopt other computation paradigms and execution settings. The topics covered in this tutorial include programming models and algorithm design, computation models, communication mechanisms, out-of-core support, fault tolerance, dynamic graph support, and so on. We also highlight future research opportunities on Big Graph analytics.
The document discusses machine learning techniques for graphs and graph-parallel computing. It describes how graphs can model real-world data with entities as vertices and relationships as edges. Common machine learning tasks on graphs include identifying influential entities, finding communities, modeling dependencies, and predicting user behavior. The document introduces the concept of graph-parallel programming models that allow algorithms to be expressed by having each vertex perform computations based on its local neighborhood. It presents examples of graph algorithms like PageRank, product recommendations, and identifying leaders that can be implemented in a graph-parallel manner. Finally, it discusses challenges of analyzing large real-world graphs and how systems like GraphLab address these challenges through techniques like vertex-cuts and asynchronous execution.
Summary of the article: "Band selection for dimension in hyper spectral image using integrated information gain and principal component analysis technique"
The document discusses data streaming in IoT and big data analytics. It begins with an introduction to data streaming and the need for streaming techniques due to the complexity of analyzing large volumes of IoT data. It then covers the data streaming processing paradigm, including continuous queries, stateless and stateful operators, and windows. Challenges and research questions in data streaming are also discussed, such as distributed deployment, parallelism, and fault tolerance. The document concludes that data streaming is well-suited for real-time analysis of IoT data due to its ability to perform online, parallel and distributed processing.
The document describes analyzing clickstream data using Spark clustering algorithms. It discusses parsing raw clickstream data containing user agent strings to extract useful fields like device type, OS, and timezone. It then applies distributed K-modes clustering in Spark to group users into subsets exhibiting similar patterns. The results show 10 clusters with different proportions of country, timezone, device and other fields, revealing distinct user types within the dataset.
When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Pro...Anis Nasir
This document proposes two algorithms, D-Choices and W-Choices, to improve load balancing in distributed stream processing systems. The algorithms identify "heavy hitters" or frequent keys in the data stream and process them using more than two workers to better balance load. Evaluation shows the algorithms provide up to 150% higher throughput and 60% lower latency compared to traditional partitioning approaches.
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...Anis Nasir
This paper proposes a technique called Partial Key Grouping (PKG) to balance load in distributed stream processing engines. PKG splits keys between workers and assigns instances using the "power of two choices" algorithm. This achieves load balancing while maintaining low memory and aggregation overhead of O(1), unlike shuffle grouping which has O(W) overhead. Experiments on real datasets with Apache Storm show PKG improves throughput by 60% and reduces latency by 45% compared to key grouping and shuffle grouping.
This dissertation defense presentation covered Peng Du's work on making one-sided dense linear algebra algorithms resilient to hard and soft errors. The presentation included:
1) Motivation for the work due to increasing failure rates in large HPC systems.
2) An overview of the dissertation goals to make LU, QR and solvers resilient to hard and soft errors.
3) Key contributions including efficient protection of the left factor from hard errors, recovery from hard errors using checkpointing, and detection and recovery from multiple soft errors.
An overview of streaming algorithms: what they are, what the general principles regarding them are, and how they fit into a big data architecture. Also four specific examples of streaming algorithms and use-cases.
Scalable Distributed Real-Time Clustering for Big Data StreamsAntonio Severien
This thesis presents a scalable distributed clustering algorithm for streaming big data. The author implemented a real-time distributed clustering algorithm and a classification algorithm using the Scalable Advanced Massive Online Analysis (SAMOA) framework. SAMOA is a platform-independent framework for distributed machine learning on data streams. It provides interfaces for algorithms to be run on distributed stream processing engines like Apache S4 and Twitter Storm. The author's algorithms were tested on these platforms using the SAMOA framework.
Scalable Distributed Real-Time Clustering for Big Data StreamsAntonio Severien
Analyzing and applying machine learning algorithms to a possibly infinite flow of data is a challenging task. This presentation presents the SAMOA framework, which allows the development of machine learning algorithms on top of any distributed stream processing engine. It also demonstrates the development and use of a distributed clustering algorithm based on CluStream using the Apache S4 platform.
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. In in this lecture we overview the mining of data streams
Presentation for the Softskills Seminar course @ Telecom ParisTech. Topic is the paper by Domings Hulten "Mining high speed data streams". Presented by me the 30/11/2017
Mining Big Data Streams with APACHE SAMOAAlbert Bifet
In this talk, we present Apache SAMOA, an open-source platform for
mining big data streams with Apache Flink, Storm and Samza. Real time analytics is
becoming the fastest and most efficient way to obtain useful knowledge
from what is happening now, allowing organizations to react quickly
when problems appear or to detect new trends helping to improve their
performance. Apache SAMOA includes algorithms for the most common
machine learning tasks such as classification and clustering. It
provides a pluggable architecture that allows it to run on Apache
Flink, but also with other several distributed stream processing
engines such as Storm and Samza.
Mining big data streams with APACHE SAMOA by Albert BifetJ On The Beach
In this talk, we present Apache SAMOA, an open-source platform for mining big data streams with Apache Flink, Storm and Samza. Real time analytics is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends helping to improve their performance. Apache SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering. It provides a pluggable architecture that allows it to run on Apache Flink, but also with other several distributed stream processing engines such as Storm and Samza.
Introduction to Data streaming - 05/12/2014Raja Chiky
Raja Chiky is an associate professor whose research interests include data stream mining, distributed architectures, and recommender systems. The document outlines data streaming concepts including what a data stream is, data stream management systems, and basic approximate algorithms used for processing massive, high-velocity data streams. It also discusses challenges in distributed systems and using semantic technologies for data streaming.
This document describes research on implementing Curran's approximation algorithm for pricing Asian options using a dataflow architecture. The algorithm was implemented on a Maxeler dataflow engine (DFE) and compared to a CPU implementation. Different fixed-point precisions were tested on the DFE and 54-bit fixed-point provided the best balance of precision and resource usage. Implementing the algorithm across multiple DFEs provided speedups of 5-12x over a 48-core CPU. Further optimization of dynamic ranges allowed increasing the unrolling factor, improving performance and energy efficiency.
Stream data mining & CluStream frameworkYueshen Xu
This document summarizes a framework for clustering evolving data streams. The framework uses micro-clusters to represent subsets of data points and clusters micro-clusters into macro-clusters over different time horizons. Micro-clusters are represented by cluster feature vectors (CFVs) and are updated when new data points arrive by joining, deleting, or merging micro-clusters. Macro-clusters are formed by applying a modified k-means algorithm that clusters micro-clusters represented by their CFVs over different time periods.
ACM DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams wit...Srinath Perera
DEBS Grand Challenge is a yearly, real-life data based event processing challenge posted by Distributed Event Based Systems conference. The 2015 challenge uses a taxi trip data set from New York city that includes 173 million events col- lected over the year 2013. This paper presents how we used WSO2 CEP, an open source, commercially available Complex Event Processing Engine, to solve the problem. With a 8-core commodity physical machine, our solution did about 350,000 events/second throughput with mean latency less than one millisecond for both queries. With a VirtualBox VM, our solution did about 50,000 events/second throughput with mean latency of 1 and 6 milliseconds for the first and second queries respectively. The paper will outline the solution, present results, and discuss how we optimized the solution for maximum performance.
Self-adaptive container monitoring with performance-aware Load-Shedding policies, by Rolando Brondolin, PhD student in System Architecture at Politecnico di Milano
Artificial intelligence and data stream miningAlbert Bifet
Big Data and Artificial Intelligence have the potential to
fundamentally shift the way we interact with our surroundings. The
challenge of deriving insights from data streams has been recognized
as one of the most exciting and key opportunities for both academia
and industry. Advanced analysis of big data streams from sensors and
devices is bound to become a key area of artificial intelligence
research as the number of applications requiring such processing
increases. Dealing with the evolution over time of such data streams,
i.e., with concepts that drift or change completely, is one of the
core issues in stream mining. In this talk, I will present an overview
of data stream mining, industrial applications, open source tools, and
current challenges of data stream mining.
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
The document discusses Samsara, a domain specific language for distributed machine learning. It provides an algebraic expression language for linear algebra operations and optimizes distributed computations. An example of linear regression on a cereals dataset is presented to demonstrate how Samsara can be used to estimate regression coefficients in a distributed fashion. Key steps include loading data as a distributed row matrix, extracting feature and target matrices, computing the normal equations, and solving the linear system to estimate coefficients.
Big Graph Analytics Systems (Sigmod16 Tutorial)Yuanyuan Tian
In recent years we have witnessed a surging interest in developing Big Graph processing systems. To date, tens of Big Graph systems have been proposed. This tutorial provides a timely and comprehensive review of existing Big Graph systems, and summarizes their pros and cons from various perspectives. We start from the existing vertex-centric systems, which which a programmer thinks intuitively like a vertex when developing parallel graph algorithms. We then introduce systems that adopt other computation paradigms and execution settings. The topics covered in this tutorial include programming models and algorithm design, computation models, communication mechanisms, out-of-core support, fault tolerance, dynamic graph support, and so on. We also highlight future research opportunities on Big Graph analytics.
The document discusses machine learning techniques for graphs and graph-parallel computing. It describes how graphs can model real-world data with entities as vertices and relationships as edges. Common machine learning tasks on graphs include identifying influential entities, finding communities, modeling dependencies, and predicting user behavior. The document introduces the concept of graph-parallel programming models that allow algorithms to be expressed by having each vertex perform computations based on its local neighborhood. It presents examples of graph algorithms like PageRank, product recommendations, and identifying leaders that can be implemented in a graph-parallel manner. Finally, it discusses challenges of analyzing large real-world graphs and how systems like GraphLab address these challenges through techniques like vertex-cuts and asynchronous execution.
Summary of the article: "Band selection for dimension in hyper spectral image using integrated information gain and principal component analysis technique"
The document discusses data streaming in IoT and big data analytics. It begins with an introduction to data streaming and the need for streaming techniques due to the complexity of analyzing large volumes of IoT data. It then covers the data streaming processing paradigm, including continuous queries, stateless and stateful operators, and windows. Challenges and research questions in data streaming are also discussed, such as distributed deployment, parallelism, and fault tolerance. The document concludes that data streaming is well-suited for real-time analysis of IoT data due to its ability to perform online, parallel and distributed processing.
The document describes analyzing clickstream data using Spark clustering algorithms. It discusses parsing raw clickstream data containing user agent strings to extract useful fields like device type, OS, and timezone. It then applies distributed K-modes clustering in Spark to group users into subsets exhibiting similar patterns. The results show 10 clusters with different proportions of country, timezone, device and other fields, revealing distinct user types within the dataset.
Presentation by Steffen Zeuch, Researcher at German Research Center for Artificial Intelligence (DFKI) and Post-Doc at TU Berlin (Germany), at the FogGuru Boot Camp training in September 2018.
From complex Systems to Networks: Discovering and Modeling the Correct Network"diannepatricia
This document discusses representing complex systems as higher-order networks (HON) to more accurately model dependencies. Conventionally, networks represent single entities at nodes, but HON breaks nodes into higher-order components carrying different relationship types. This captures dependencies beyond first order in a scalable way. The document presents applications of HON, including more accurately clustering global shipping patterns and ranking web pages based on clickstreams. HON provides a general framework for network analysis tasks like ranking, clustering and link prediction across domains involving complex trajectories, information flow, and disease spread.
distributed system lab materials about admilkesa13
The document discusses various topics related to distributed systems including:
1. An agenda covering evolution of computational technology, parallel computing, cluster computing, grid computing, utility computing, virtualization, service-oriented architecture, cloud computing, and internet of things.
2. Definitions of distributed systems and reasons why typical definitions are unsatisfactory.
3. A proposed working definition of distributed systems.
4. Computing paradigms including centralized, parallel, and distributed computing.
5. Challenges of distributed systems such as failure recovery, scalability, asynchrony, and security.
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreHPCC Systems
Data Centric Approach: Our platform is built on the premise of absorbing data from multiple data sources and transforming them to a highly intelligent social network graphs that can be processed to non-obvious relationships.
The Nevada Department of Transportation manages and maintains roads and freeways in Nevada. It collects data from hundreds of sensors and cameras to monitor traffic conditions. This data is stored in the Nevada Data Exchange and shared with agencies and a 511 traveler information system. NDOT uses this data for real-time traffic management and is building a fiber optic network to connect its devices. It is working to implement advanced technologies like connected vehicles to further share information and improve transportation.
"The DEBS Grand Challenge 2017" as presented in the The 11th ACM International Conference on Distributed and Event-Based Systems, 19 - 23 June, 2017 held in Barcelona, Spain
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
Using Graph Analysis and Fraud Detection in the Fintech IndustryStanka Dalekova
Paysafe provides simple and secure payment solutions to businesses of all sizes around the world, processing billions of payment dollars a year. This, combined with the focus of flawless customer experience and real-time money transfer, makes it a candidate for the “dark side” of the payments industry: fraudsters, money launderers, etc. With traditional data storage techniques such as relational technologies, it is almost impossible to see beyond individual accounts to the connections between them. In this session see how Paysafe implemented the property graph technologies in Oracle Spatial and Graph and Oracle Database, including its fast, built-in, in-memory graph analytics, to perform fast graph queries that identify patterns of fraud.
Using Graph Analysis and Fraud Detection in the Fintech IndustryStanka Dalekova
Paysafe provides simple and secure payment solutions to businesses of all sizes around the world, processing billions of payment dollars a year. This, combined with the focus of flawless customer experience and real-time money transfer, makes it a candidate for the “dark side” of the payments industry: fraudsters, money launderers, etc. With traditional data storage techniques such as relational technologies, it is almost impossible to see beyond individual accounts to the connections between them. In this session see how Paysafe implemented the property graph technologies in Oracle Spatial and Graph and Oracle Database, including its fast, built-in, in-memory graph analytics, to perform fast graph queries that identify patterns of fraud.
This document discusses how to download and play the mobile game Subway Surfers on a personal computer. It describes using BlueStacks, an Android emulator, to install and run the game normally played on phones and tablets. BlueStacks allows users to access Google Play to download Subway Surfers and other Android apps. Once installed through BlueStacks, the game can be played offline on a PC like a mobile game, allowing users to enjoy Subway Surfers on a larger screen without being limited to a phone.
introduction to advanced distributed systemmilkesa13
This document provides an overview of distributed systems and computing paradigms. It begins with an agenda that covers topics like parallel computing, cluster computing, grid computing, utility computing, and cloud computing. Examples of distributed systems are provided. Definitions of distributed systems emphasize that they are collections of independent computers that appear as a single system to users. Computing paradigms like centralized, parallel, and distributed computing are described. Challenges of distributed systems like failures, scalability, and asynchrony are listed. The operational layers of distributed systems from the application to network layers are outlined. Models and enabling technologies are mentioned.
This document provides a summary of practical machine learning on big data platforms. It begins with an introduction and agenda, then provides a quick brief on the machine learning process. It discusses the current landscape of open source tools, including evolutionary drivers and examples. It covers case studies from Twitter and their experience. Finally, it discusses architectural forces like Moore's Law and Kryder's Law that are shaping the field. The document aims to present a unified approach for machine learning on big data platforms and discuss how industry leaders are implementing these techniques.
This document provides an overview of stream processing. It discusses how stream processing systems are used to process large volumes of real-time data continuously and produce actionable information. Examples of applications discussed include traffic monitoring, network monitoring, smart grids, and sensor networks. Key concepts of stream processing covered include data streams, operators, windows, programming models, fault tolerance, and platforms like Storm and Spark Streaming.
Understanding the qualities of Web robot traffic is essential to build mechanisms for mitigating the impact of their traffic on Web systems. This project presents an updated characterization of the navigational and session patterns of Web robot traffic across three Web servers in the United States, Europe, and the Middle East under 30 different features. The results indicate that some features may be fitted to the same heavy-tailed model across the Web servers, but the best fitting models for other features depend on the Web server. Due to some different tasks of Web robots and security policies set by website administrators, there are thus some features of Web robot traffic that cannot be universally modeled. The paper titled “Some (Non-)Universal Features of Web Robot Traffic” which presents the report of this project has been accepted at 52th Annual Conference on Information Sciences and Systems (CISS).
Get Started with Data Science by Analyzing Traffic Data from California HighwaysAerospike, Inc.
This document summarizes an effort to analyze traffic data from California highways to better understand data science techniques. The researchers searched for an open dataset, eventually finding sensor data from California highways. They analyzed the data format and values to understand it. To detect traffic incidents, they framed it as a classification problem and prepared training data by labeling sensor records near incidents as positive examples. They trained classifiers on this data but initial results were poor. After refining the features and balancing the training data, the classifiers showed more promising results.
This document provides an agenda for a presentation on data analytics. It includes an introduction to data analytics concepts and examples of applications in various industries. A demo of Instant Insights, a cloud-based analytics tool, is presented. The document emphasizes that building analytics solutions through testing and iteration is better than just discussing ideas. Users should test solutions in the real world and measure user behavior to make data-driven decisions.
Introduction of streaming data, difference between batch processing and stream processing, Research issues in streaming data processing, Performance evaluation metrics , tools for stream processing.
This document provides an update on perfSONAR network measurement tools, the IRIS and DyGIR projects, the Archipelago measurement platform, network services on TransPAC3 and ACE, and the Data Logistics Toolkit. Key points include:
- perfSONAR and OSCARS software will be used to provide monitoring and dynamic circuit services on TransPAC3 and ACE.
- The IRIS and DyGIR projects will develop monitoring and dynamic circuit software packages for international research networks.
- The Archipelago platform conducts large-scale IPv4 topology measurements from over 50 probes worldwide.
- TransPAC3 and ACE will provide high-performance connectivity between regions and dedicated infrastructure for data movement using the
Provenance for Data Munging EnvironmentsPaul Groth
Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e. its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses problems of efficient and fine grained capture. I also describe our work on scalable provence tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from adhoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
Similar to The data streaming processing paradigm and its use in modern fog architectures (20)
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
The data streaming processing paradigm and its use in modern fog architectures
1. The data streaming processing paradigm and
its use in modern fog architectures
Vincenzo Gulisano, Ph.D.
2. Agenda
• Who we are (my group)
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
2
3. Agenda
• Who we are (my group)
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
3
4. Who we are
Chalmers university
Computer Science and Engineering
department
Distributed Computing and Systems
research group
4
5. Who we are
• ~15 PhD degrees awarded
• ~12 PhD students, 5 Postdocs and 6 faculty
• Acknowledged internationally as leading
group in practical multicore synchronization
algorithms and programming (results
adopted by Java JDK, C++, Intel, NVIDIA)
• Extensive network of academic and
industrial collaborations and has been
continuously supported by national and
international projects
5
6. Distributed Computing and Systems Research Group
Department of Computer Science and engineering
Chalmers University of Technology
At our research team:
Research expertise & projects
Cyber
Security
Efficient
parallel &
stream
computing
Distributed
systems
IoT & Sensor
Networks
6
7. Agenda
• Who we are
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
7
8. • IoT
• Edge
• Fog
• Cloud
• Cyber-Physical System
• Big Data
...
BUZZ WORDS
8
11. IoT enables for increased awareness, security, power-efficiency, ...
large IoT systems are complex
traditional data analysis techniques alone are not adequate!
11
13. AMIs VNs
large IoT systems are complex
Characteristics [15]:
1. edge location
2. location awareness
3. low latency
4. geographical distribution
5. large-scale
6. support for mobility
7. real-time interactions
8. predominance of wireless
9. heterogeneous
10. interoperability / federation
11. interaction with the cloud
13
14. traditional data analysis techniques alone are not adequate! [13,14]
1. does the infrastructure allow for billions of
readings per day to be transferred continuously?
2. the latency incurred while transferring data, does
that undermine the utility of the analysis?
3. is it secure to concentrate all the data in a single
place? [11]
4. is it smart to give away fine-grained data? [12]
14
15. Agenda
• Who we are
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
15
16. Main Memory
Motivation
DBMS vs. DSMS
Disk
1 Data
Query Processing
3 Query
results
2 Query
Main Memory
Query Processing
Continuous
Query
Data
Query
results
16
17. Before we start... about data streaming and Stream Processing Engines (SPEs)
An incomplete list of SPEs (cf. related work in [18]):
time
Borealis
The Aurora Project
STanfordstREamdatAManager
NiagaraCQ
COUGAR
StreamCloud
Covering all of them / discussing which use cases are best for each one out of scope...
the following show connection between what is being presented and a certain SPE
17
18. data stream: unbounded sequence of tuples sharing the same schema
Example: vehicles’ speed reports
time
Field Field
vehicle id text
time (secs) text
speed (Km/h) double
X coordinate double
Y coordinate double
A 8:00 55.5 X1 Y1
Let’s assume each source
(e.g., vehicle) produces
and delivers a timestamp
sorted stream
A 8:07 34.3 X3 Y3
A 8:03 70.3 X2 Y2
18
19. continuous query (or simply query): Directed Acyclic Graph (DAG) of
streams and operators
OP
OP
OP
OP OP
OP OP
source op
(1+ out streams)
sink op
(1+ in streams)
stream
op
(1+ in, 1+ out streams)
19
20. data streaming operators
Two main types:
• Stateless operators
• do not maintain any state
• one-by-one processing
• if they maintain some state, such state does not evolve depending
on the tuples being processed
• Stateful operators
• maintain a state that evolves depending on the tuples being
processed
• produce output tuples that depend on multiple input tuples
OP
OP
20
25. stateful operators
Aggregate information from multiple tuples
(e.g., max, min, sum, ...)
Join tuples coming from 2 streams given a certain predicate
Aggregate
Join
25
27. Wait a moment!
if streams are unbounded, how can we aggregate or join?
27
28. windows and stateful analysis [18]
Stateful operations are done over windows:
• Time-based (e.g., tuples in the last 10 minutes)
• Tuple-based (e.g., given the last 50 tuples)
time
[8:00,9:00)
[8:20,9:20)
[8:40,9:40)
Usually applications rely on time-based sliding windows
28
29. time-based sliding window aggregation (count)
Counter: 4
time
[8:00,9:00)
8:05 8:15 8:22 8:45 9:05
Output: 4
Counter: 1
Counter: 2
Counter: 3
Counter: 3
time
8:05 8:15 8:22 8:45 9:05
[8:20,9:20)
we assumed each source
produces and delivers a
timestamp sorted stream!
What happens if this is not
the case?
29
31. basic operators and user-defined operators
Besides a set of basic operators, SPEs usually allow the user to define
ad-hoc operators (e.g., when existing aggregation are not enough)
31
32. sample query
For each vehicle, raise an alert if the speed of the latest report is more
than 2 times higher than its average speed in the last 30 days.
time
A 8:00 55.5 X1 Y1 A 8:07 34.3 X3 Y3
A 8:03 70.3 X2 Y2
32
33. Remove
unused fields
Map
Field
vehicle id
time (secs)
speed (Km/h)
X coordinate
Y coordinate
Field
vehicle id
time (secs)
speed (Km/h)
Compute average
speed for each
vehicle during the
last 30 days
Aggregate
Field
vehicle id
time (secs)
avg speed (Km/h)
Join
Check
condition
Filter
Field
vehicle id
time (secs)
speed (Km/h)
Join on
vehicle id
Field
vehicle id
time (secs)
avg speed (Km/h)
speed (Km/h)
sample query
33
34. M A J F
sample query
Notice:
• the same semantics can be defined in several ways (using different
operators and composing them in different ways)
• Using many basic building blocks can ease the task of distributing
and parallelizing the analysis (more in the following...)
34
39. Agenda
• Who we are
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
39
40. 1. Distributed deployment
2. Parallel deployment
3. Ordering and determinism
4. Shared-nothing vs shared-memory parallelism
5. Load balancing
6. Elasticity
7. Fault tolerance
8. Data sharing
Challenges and research questions
40
41. Picture source: Modern OS, by A. Tanenbaum
Data analysts
Stream Processing Engine
OS / Hardware
How can research
support these layers?
Why is it challenging?
41
42. A minimal example
(... the data analyst)
tell me the average speed
(per week) of each car
Road Side Unit (RSU)
42
43. Parallelization challenges
compute average speed
(per week) of each car
compute average speed
(per week) of each car
Does this work?
What if a car travels close to
different RSUs (and sends data
to both)?
43
45. Parallelization challenges
compute average speed
(per week) of each car
Odd plate number
Even plate numberSend to other RSU
compute average speed
(per week) of each car
Even plate number
Odd plate numberSend to other RSU
They are now communicating!
45
46. Balancing challenges
compute average speed
(per week) of each car
Odd plate
number
Even plate
number
Send to other RSU
compute average speed
(per week) of each car
Even plate
number
Odd plate
number
Send to other RSU
Too much pollution, we will
introduce the alternate
license plates policy!
Thanks! Now I am
using only 50% of my
resources...
46
47. Balancing challenges
compute average speed
(per week) of each car
Dynamic
condition TRUE
Dynamic
condition FALSE
Send to other RSU
compute average speed
(per week) of each car
Dynamic
condition FALSE
Dynamic
condition TRUE
Send to other RSU
They are now communicating!
There must be monitoring!
We need state
transfer protocols!
47
48. Balancing challenges
Time
responsible for responsible for
dynamic condition change
(in between the week!)
What about the data already sent to ?
tell me the average speed
(per week) of each car
48
49. Adapting challenges
compute average speed
(per week) of each car
Dynamic
condition TRUE
Dynamic
condition FALSE
Send to other RSU
compute average speed
(per week) of each car
Dynamic
condition FALSE
Dynamic
condition TRUE
Send to other RSU
What about the data
already sent to ?
49
50. Balancing challenges
compute average speed
(per week) of each car
Dynamic
condition TRUE
Dynamic
condition FALSE
Send to other RSU
compute average speed
(per week) of each car
Dynamic
condition FALSE
Dynamic
condition TRUE
Send to other RSU
They are now communicating!
There must be monitoring!
We need state
transfer protocols!
We need backups!
What about the data sent
between and ?
Cars need to
buffer data too!
Cars need to
buffer data too! 50
51. Faulttolerance
Elasticity
Loadbalancing
Determinism
Parallel execution of streaming operators
51
Parallel execution of streaming applications
DDoSdetection
andmitigation
Intrusiondetection
Datavalidation
Differentially
privateaggregation
Vehicularnetworks
analysis
SmartGrids/AMI
analysis
Security and
privacy
IoT
Cyber-Physical
Systems
Provenance and custom scheduling
52. Synchronization / Data structures
Faulttolerance
Elasticity
Loadbalancing
Determinism
Parallel execution of streaming operators
52
Parallel execution of streaming applications
Many-core systems / FPGAs
DDoSdetection
andmitigation
Intrusiondetection
Datavalidation
Differentially
privateaggregation
Vehicularnetworks
analysis
SmartGrids/AMI
analysis
Parallel joins
Parallel
aggregates
Joins
modeling
Security and
privacy
IoT
Cyber-Physical
Systems
Provenance and custom scheduling
53. • Palyvos-Giannas, D., Gulisano, V., & Papatriantafilou, M. (2019). GeneaLog: Fine-
grained data streaming provenance in cyber-physical systems. Parallel Computing
• Najdataei, H., Nikolakopoulos, Y., Papatriantafilou, M., Tsigas, P., & Gulisano, V.
(2019). STRETCH: Scalable and Elastic Deterministic Streaming Analysis with Virtual
Shared-Nothing Parallelism. Proceedings of the 13th ACM International Conference
on Distributed and Event-based Systems, DEBS 2019
• Palyvos-Giannas, D., Gulisano, V., & Papatriantafilou, M. (2019). Haren: A Framework
for Ad-Hoc Thread Scheduling Policies for Data Streaming Applications. Proceedings
of the 13th ACM International Conference on Distributed and Event-based Systems,
DEBS 2019
• Havers, B., Duvignau, R., Najdataei, H., Gulisano, V., Koppisetty, A. C., &
Papatriantafilou, M. (2019). DRIVEN: a Framework for Efficient Data Retrieval and
Clustering in Vehicular Networks. 35th IEEE International Conference on Data
Engineering, ICDE 2019
• Duvignau, R., Gulisano, V., Papatriantafilou, M., & Savic, V. (2019). Streaming
piecewise linear approximation for efficient data management in edge computing.
Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, SAC 2019
• Walulya, I., Palyvos-Giannas, D., Nikolakopoulos, Y., Gulisano, V., Papatriantafilou,
M., & Tsigas, P. (2018). Viper: A module for communication-layer determinism and
scaling in low-latency stream processing. Future Generation Comp. Syst., 88
• Rooij, J. van, Gulisano, V., & Papatriantafilou, M. (2018). LoCoVolt: Distributed
Detection of Broken Meters in Smart Grids through Stream Processing. Proceedings
of the 12th ACM International Conference on Distributed and Event-based Systems,
DEBS 2018
Provenance Determinism
Vehicular networks
AMI
Parallelism Determinism Elasticity
53
Custom scheduling
Vehicular networks
Vehicular networks
DeterminismParallelism
Smart Grids / AMIs
54. • Najdataei, H., Nikolakopoulos, Y., Gulisano, V., & Papatriantafilou, M.
(2018). Continuous and Parallel LiDAR Point-Cloud Clustering. 38th IEEE
International Conference on Distributed Computing Systems, ICDCS 2018
• Gulisano, V., Nikolakopoulos, Y., Cederman, D., Papatriantafilou, M., &
Tsigas, P. (2017). Efficient Data Streaming Multiway Aggregation through
Concurrent Algorithmic Designs and New Abstract Data Types. TOPC
• Zacheilas, N., Kalogeraki, V., Nikolakopoulos, Y., Gulisano, V.,
Papatriantafilou, M., & Tsigas, P. (2017). Maximizing Determinism in
Stream Processing Under Latency Constraints. BT - Proceedings of the
11th ACM International Conference on Distributed and Event-based
Systems, DEBS 2017
• Gulisano, V., Papadopoulos, A. V., Nikolakopoulos, Y., Papatriantafilou, M.,
& Tsigas, P. (2017). Performance Modeling of Stream Joins. Proceedings
of the 11th ACM International Conference on Distributed and Event-
based Systems, DEBS 2017
• Geethakumari, P. R., Gulisano, V., Svensson, B. J., Trancoso, P., & Sourdis,
I. (2017). Single window stream aggregation using reconfigurable
hardware. International Conference on Field Programmable Technology,
ICFPT 2017
• Gulisano, V., Tudor, V., Almgren, M., & Papatriantafilou, M. (2016). BES:
Differentially Private and Distributed Event Aggregation in Advanced
Metering Infrastructures. Proceedings of the 2nd ACM International
Workshop on Cyber-Physical System Security, CPSS@AsiaCCS
• Gulisano, V., Callau-Zori, M., Fu, Z., Jiménez-Peris, R., Papatriantafilou, M.,
& Patiño-Martínez, M. (2015). STONE: A streaming DDoS defense
framework. Expert Syst. Appl., 42
54
Parallelism Clustering
Parallelism DeterminismStream aggregates
Parallelism Determinism
Synchronization / Data structures
ModelingStream joins
FPGAs DeterminismStream aggregates
Differentially
private aggregation
DDoS detection
and mitigation
55. • Gulisano, V., Nikolakopoulos, Y., Papatriantafilou, M., & Tsigas, P. (2015).
Scalejoin: A deterministic, disjoint-parallel and skew-resilient stream join. 2015
IEEE International Conference on Big Data, Big Data 2015
• Gulisano, V., Nikolakopoulos, Y., Walulya, I., Papatriantafilou, M., & Tsigas, P.
(2015). Deterministic real-time analytics of geospatial data streams through
ScaleGate objects. Proceedings of the 9th ACM International Conference on
Distributed Event-Based Systems, DEBS ’15
• Gulisano, V., Almgren, M., & Papatriantafilou, M. (2014). METIS: A Two-Tier
Intrusion Detection System for Advanced Metering Infrastructures.
International Conference on Security and Privacy in Communication Networks -
10th International ICST Conference, SecureComm 2014
• Gulisano, V., Jiménez-Peris, R., Patiño-Martínez, M., Soriente, C., & Valduriez, P.
(2012). StreamCloud: An Elastic and Scalable Data Streaming System. IEEE
Trans. Parallel Distrib. Syst., 23
• Gulisano, V., Jiménez-Peris, R., Patiño-Martínez, M., & Valduriez, P. (2010).
StreamCloud: A Large Scale Data Streaming System. BT - 2010 International
Conference on Distributed Computing Systems, ICDCS 2010
55
Parallelism DeterminismStream joins
Parallelism Determinism
Synchronization / Data structures
Intrusion detection
Parallelism Determinism
Parallelism Determinism
56. Agenda
• Who we are
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
56
57. Millions of years
of evolution
Millions of
sensors
• Store information
• Iterate multiple times over data
• Think, do not rush through decisions
• ”Hard-wired” routines
• Real-time decisions
• High-throughput / low-latency
Should I (really)
have an extra
piece of cake?
Danger!!!
Run!!!
Humans
57
58. Years / Decades
of evolution
Millions of
sensors
What traffic
congestion
patterns can I
observe
frequently?
Don’t take
over, car in
opposite lane!
• Store information
• Iterate multiple times over data
• Think, do not rush through decisions
Databases, data mining
techniques...
Data streaming, distributed
and parallel analysis
• Continuous analysis
• Real-time decisions
• High-throughput / low-latency
Computers
(cyber-physical / IoT systems)
58
59. Agenda
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
59
60. Bibliography
1. Zhou, Jiazhen, Rose Qingyang Hu, and Yi Qian. "Scalable distributed communication architectures to support advanced
metering infrastructure in smart grid." IEEE Transactions on Parallel and Distributed Systems 23.9 (2012): 1632-1642.
2. Gulisano, Vincenzo, et al. "BES: Differentially Private and Distributed Event Aggregation in Advanced Metering Infrastructures."
Proceedings of the 2nd ACM International Workshop on Cyber-Physical System Security. ACM, 2016.
3. Gulisano, Vincenzo, Magnus Almgren, and Marina Papatriantafilou. "Online and scalable data validation in advanced metering
infrastructures." IEEE PES Innovative Smart Grid Technologies, Europe. IEEE, 2014.
4. Gulisano, Vincenzo, Magnus Almgren, and Marina Papatriantafilou. "METIS: a two-tier intrusion detection system for advanced
metering infrastructures." International Conference on Security and Privacy in Communication Systems. Springer International
Publishing, 2014.
5. Yousefi, Saleh, Mahmoud Siadat Mousavi, and Mahmood Fathy. "Vehicular ad hoc networks (VANETs): challenges and
perspectives." 2006 6th International Conference on ITS Telecommunications. IEEE, 2006.
6. El Zarki, Magda, et al. "Security issues in a future vehicular network." European Wireless. Vol. 2. 2002.
7. Georgiadis, Giorgos, and Marina Papatriantafilou. "Dealing with storage without forecasts in smart grids: Problem
transformation and online scheduling algorithm." Proceedings of the 29th Annual ACM Symposium on Applied Computing.
ACM, 2014.
8. Fu, Zhang, et al. "Online temporal-spatial analysis for detection of critical events in Cyber-Physical Systems." Big Data (Big
Data), 2014 IEEE International Conference on. IEEE, 2014.
60
61. Bibliography
9. Arasu, Arvind, et al. "Linear road: a stream data management benchmark." Proceedings of the Thirtieth
international conference on Very large data bases-Volume 30. VLDB Endowment, 2004.
10. Lv, Yisheng, et al. "Traffic flow prediction with big data: a deep learning approach." IEEE Transactions on
Intelligent Transportation Systems 16.2 (2015): 865-873.
11. Grochocki, David, et al. "AMI threats, intrusion detection requirements and deployment
recommendations." Smart Grid Communications (SmartGridComm), 2012 IEEE Third International
Conference on. IEEE, 2012.
12. Molina-Markham, Andrés, et al. "Private memoirs of a smart meter." Proceedings of the 2nd ACM
workshop on embedded sensing systems for energy-efficiency in building. ACM, 2010.
13. Gulisano, Vincenzo, et al. "Streamcloud: A large scale data streaming system." Distributed Computing
Systems (ICDCS), 2010 IEEE 30th International Conference on. IEEE, 2010.
14. Stonebraker, Michael, Uǧur Çetintemel, and Stan Zdonik. "The 8 requirements of real-time stream
processing." ACM SIGMOD Record 34.4 (2005): 42-47.
15. Bonomi, Flavio, et al. "Fog computing and its role in the internet of things." Proceedings of the first
edition of the MCC workshop on Mobile cloud computing. ACM, 2012.
16. Himmelsbach, Michael, et al. "LIDAR-based 3D object perception." Proceedings of 1st international
workshop on cognition for technical systems. Vol. 1. 2008.
61
62. Bibliography
17. Geiger, Andreas, et al. "Vision meets robotics: The KITTI dataset." The International Journal of Robotics Research (2013):
0278364913491297.
18. Gulisano, Vincenzo Massimiliano. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. Diss. Informatica,
2012.
19. Cardellini, Valeria, et al. "Optimal operator placement for distributed stream processing applications." Proceedings of the 10th
ACM International Conference on Distributed and Event-based Systems. ACM, 2016.
20. Costache, Stefania, et al. "Understanding the Data-Processing Challenges in Intelligent Vehicular Systems." Proceedings of the
2016 IEEE Intelligent Vehicles Symposium (IV16).
21. Cormode, Graham. "The continuous distributed monitoring model." ACM SIGMOD Record 42.1 (2013): 5-14.
22. Giatrakos, Nikos, Antonios Deligiannakis, and Minos Garofalakis. "Scalable Approximate Query Tracking over Highly Distributed
Data Streams." Proceedings of the 2016 International Conference on Management of Data. ACM, 2016.
23. Gulisano, Vincenzo, et al. "Streamcloud: An elastic and scalable data streaming system." IEEE Transactions on Parallel and
Distributed Systems 23.12 (2012): 2351-2365.
24. Shah, Mehul A., et al. "Flux: An adaptive partitioning operator for continuous query systems." Data Engineering, 2003.
Proceedings. 19th International Conference on. IEEE, 2003.
62
63. Bibliography
25. Cederman, Daniel, et al. "Brief announcement: concurrent data structures for efficient streaming aggregation." Proceedings of
the 26th ACM symposium on Parallelism in algorithms and architectures. ACM, 2014.
26. Ji, Yuanzhen, et al. "Quality-driven processing of sliding window aggregates over out-of-order data streams." Proceedings of
the 9th ACM International Conference on Distributed Event-Based Systems. ACM, 2015.
27. Ji, Yuanzhen, et al. "Quality-driven disorder handling for concurrent windowed stream queries with shared operators."
Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems. ACM, 2016.
28. Gulisano, Vincenzo, et al. "Scalejoin: A deterministic, disjoint-parallel and skew-resilient stream join." Big Data (Big Data), 2015
IEEE International Conference on. IEEE, 2015.
29. Ottenwälder, Beate, et al. "MigCEP: operator migration for mobility driven distributed complex event processing." Proceedings
of the 7th ACM international conference on Distributed event-based systems. ACM, 2013.
30. De Matteis, Tiziano, and Gabriele Mencagli. "Keep calm and react with foresight: strategies for low-latency and energy-efficient
elastic data stream processing." Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming. ACM, 2016.
31. Balazinska, Magdalena, et al. "Fault-tolerance in the Borealis distributed stream processing system." ACM Transactions on
Database Systems (TODS) 33.1 (2008): 3.
32. Castro Fernandez, Raul, et al. "Integrating scale out and fault tolerance in stream processing using operator state
management." Proceedings of the 2013 ACM SIGMOD international conference on Management of data. ACM, 2013.
63
64. Bibliography
33. Dwork, Cynthia. "Differential privacy: A survey of results." International Conference on Theory and Applications of
Models of Computation. Springer Berlin Heidelberg, 2008.
34. Dwork, Cynthia, et al. "Differential privacy under continual observation." Proceedings of the forty-second ACM
symposium on Theory of computing. ACM, 2010.
35. Kargl, Frank, Arik Friedman, and Roksana Boreli. "Differential privacy in intelligent transportation systems." Proceedings
of the sixth ACM conference on Security and privacy in wireless and mobile networks. ACM, 2013.
64
Editor's Notes
Before we start... questions and please notice
A like to be hands-on...
say these are some examples...
say these are some examples...
say these are some examples...
About the ordering we will come back later
Need to explain a bit what is the approach I am following here
maybe say the DB part is a bit of an oversimplifaction