Machine Learning with Apache Flink at Stockholm Machine Learning Group - Till Rohrmann
This presentation covers Apache Flink's approach to scalable machine learning: composable machine learning pipelines, consisting of transformers and learners, and distributed linear algebra.
The presentation was held at the Machine Learning Stockholm group on the 23rd of March 2015.
Flink Forward Berlin 2017: Matt Zimmer - Custom, Complex Windows at Scale Usi... - Flink Forward
The windowing capabilities offered by most stream processing engines are limited to aligned windows of a fixed duration. However, many real-world event processing use cases don’t fit this rigid structure, resulting in awkward processing pipelines. There haven’t been good alternatives, until recently that is. Apache Flink offers a rich Window API that supports implementing unaligned windows of varying duration. In this talk, Matt Zimmer will discuss using this API at Netflix to aggregate events into windows customized along varying definitions of a session. He will talk about implementation details such as: * Handling out-of-order events * Limiting state build-up while aggregating a subset of events from an event stream * Periodically emitting early results * Creating windows bounded by a type of event Attendees will leave this talk with practical techniques and knowledge to implement their own custom windows in Apache Flink.
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo Perini - Flink Forward
The world's toughest and most interesting analysis tasks lie at the intersection of graph data (inter-dependencies in data) and deep learning (inter-dependencies in the model). Classical graph embedding techniques have for years occupied research groups seeking how complex graphs can be encoded into a low-dimensional latent space. Recently, deep learning has dominated the space of embedding generation due to its ability to automatically generate embeddings given any static graph.
Grapharis is a project that revitalizes the concept of graph embeddings, yet it does so in a real setting where graphs are not static but keep changing over time (think of user interactions in social networks). More specifically, we explored how a system like Flink can be used both to simplify the process of training a graph embedding model incrementally and to make complex inferences and predictions in real time using graph-structured data streams. To our knowledge, Grapharis is the first complete data pipeline using Flink and TensorFlow for real-time deep graph learning. This talk will cover how we can train, store and generate embeddings continuously and accurately as data evolves over time without the need to re-train the underlying model.
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F... - Flink Forward
SK telecom shares our experience of using Flink in building a solution for Predictive Maintenance (PdM). Our PdM solution named metatron PdM consists of (1) a Deep Neural Network (DNN)-based prediction model for precise prediction, and (2) a Flink-based runtime system which applies the model to a sliding window on sensor data streams. Efficient handling of multi-sensor streaming data for real-time prediction of equipment condition is a critical component of our product. In this talk, we first show why we choose Flink as a core engine for our streaming use case in which we generate real-time predictions using DNNs trained with Keras on top of TensorFlow and Theano. In addition, we present a comparative study of methods to exploit learning models on JVM such as directly using Python libraries on CPython embedded in JVM, using TensorFlow Java API (including Flink TensorFlow), and making RPC calls to TensorFlow Serving. We then explain how we implement the runtime system using Flink DataStream API, especially with event time, various window mechanisms, timestamp and watermark, custom source and sink, and checkpointing. Lastly, we present how we use the official Flink Docker image for solution delivery and the Flink metric system for monitoring and management of our solution. We hope our use case sets a good example of building a DNN-based streaming solution using Flink.
This document provides an overview of Apache Flink internals. It begins with an introduction and recap of Flink programming concepts. It then discusses how Flink programs are compiled into execution plans and executed in a pipelined fashion, as opposed to being executed eagerly like regular code. The document outlines Flink's architecture including the optimizer, runtime environment, and data storage integrations. It also covers iterative processing and how Flink handles iterations both by unrolling loops and with native iterative datasets.
The document provides concise summaries of key points:
1) After submitting a Flink job, the client creates and submits the job graph to the JobManager, which then creates an execution graph and deploys tasks across TaskManagers for parallel execution.
2) The batch optimizer chooses optimal execution plans by evaluating physical execution strategies like join algorithms and data shipping approaches to minimize data shuffling and network usage.
3) Flink iterations are optimized by having the runtime directly handle caching, state maintenance, and pushing work out of loops to avoid scheduling overhead between iterations. Delta iterations further improve efficiency by only updating changed elements in each iteration.
Ufuk Celebi – Stream & Batch Processing in one System - Flink Forward
The document describes the architecture and execution model of Apache Flink. Flink uses a distributed dataflow model where a job is represented as a directed acyclic graph of operators. The client submits this graph to the JobManager, which schedules tasks across TaskManagers. Tasks communicate asynchronously through data channels to process bounded and unbounded data in a pipelined fashion.
The document introduces Gelly, Flink's graph processing API. It discusses why graph processing with Flink, provides an overview of Gelly and its key features like iterative graph processing. It describes Gelly's native iteration support and both vertex-centric and gather-sum-apply models. Examples demonstrate basic graph operations and algorithms like connected components, shortest paths. The summary concludes by mentioning upcoming Gelly features and encouraging readers to try it out.
Design and Implementation of the Security Graph Language - Asankhaya Sharma
Today software is built in fundamentally different ways from how it was a decade ago. It is increasingly common for applications to be assembled out of open-source components, resulting in the use of large amounts of third-party code. This third-party code is a means for vulnerabilities to make their way downstream into applications. Recent vulnerabilities such as Heartbleed, FREAK SSL/TLS, GHOST, and the Equifax data breach (due to a flaw in Apache Struts) were ultimately caused by third-party components. We argue that an automated way to audit the open-source ecosystem, catalog existing vulnerabilities, and discover new flaws is essential to using open-source safely. To this end, we describe the Security Graph Language (SGL), a domain-specific language for analysing graph-structured datasets of open-source code and cataloguing vulnerabilities. SGL allows users to express complex queries on relations between libraries and vulnerabilities in the style of a program analysis language. SGL queries double as an executable representation for vulnerabilities, allowing vulnerabilities to be automatically checked against a database and deduplicated using a canonical representation. We outline a novel optimisation for SGL queries based on regular path query containment, improving query performance up to 3 orders of magnitude. We also demonstrate the effectiveness of SGL in practice to find zero-day vulnerabilities by identifying sever
This document provides an overview of Apache Flink, an open-source stream processing framework. It discusses Flink's capabilities in supporting streaming, batch, and iterative processing natively through a streaming dataflow model. It also describes Flink's architecture including the client, job manager, task managers, and various execution setups like local, remote, YARN, and embedded. Finally, it compares Flink to other stream and batch processing systems in terms of their APIs, fault tolerance guarantees, and strengths.
Predictive Maintenance with Deep Learning and Apache Flink - Dongwon Kim
Flink can be used to build a predictive maintenance system using deep learning models on time-series sensor data. A Flink data stream processing pipeline is designed to handle joining streams, applying convolutional LSTM models through an ensemble, and monitoring outputs. Docker and Prometheus are used to package and monitor the solution.
My talk from SICS Data Science Day, describing FlinkML, the Machine Learning library for Apache Flink.
I talk about our approach to large-scale machine learning and how we utilize state-of-the-art algorithms to ensure FlinkML is a truly scalable library.
You can watch a video of the talk here: https://youtu.be/k29qoCm4c_k
This document provides an overview of streaming systems and Flink streaming. It discusses key concepts in streaming including stream processing, windowing, and fault tolerance. The document also includes examples of using Flink's streaming API, such as reading from multiple inputs, window aggregations, and joining data streams. It summarizes Flink's programming model, roadmap, and performance capabilities. Flink is presented as a next-generation stream processing system that combines a true streaming runtime with expressive APIs and competitive performance.
Flink Forward Berlin 2017: Boris Lublinsky, Stavros Kontopoulos - Introducing... - Flink Forward
The Machine Learning (ML) library for Flink is a new effort to bring scalable ML tools to the Flink community in order to make Flink a premier platform for ML. This effort covers a wide area of ML related activities including data analysis, model training and model serving. In this talk we will cover approaches to using Flink for model serving. We will present a framework and actual code for serving and updating models in real time. Although presented examples cover only two model types - Tensorflow and PMML, the framework is applicable for any type of externalizable models. We will also show how this implementation fits into overall ML learning implementation and demonstrate integration with SparkML and Tensorflow/Keras and specifically show how models can be exported from these systems. This initial implementation can be further enhanced to support additional model representations, for example PFA and introduction of model serving specific semantics into Flink framework.
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15 - Vasia Kalavri
Apache Flink is a general-purpose platform for batch and streaming distributed data processing. This talk describes how Flink’s powerful APIs, iterative operators and other unique features make it a competitive alternative for large-scale graph processing as well. We take a close look at how one can elegantly express graph analysis tasks, using common Flink operators and how different graph processing models, like vertex-centric, can be easily mapped to Flink dataflows. Next, we get a sneak preview into Flink's upcoming Graph API, Gelly, which further simplifies graph application development in Flink. Finally, we show how to perform end-to-end data analysis, mixing common Flink operators and Gelly, without having to build complex pipelines and combine different systems. We go through a step-by-step example, demonstrating how to perform loading, transformation, filtering, graph creation and analysis, with a single Flink program.
Brief introduction on Hadoop, Dremel, Pig, FlumeJava and Cassandra - Somnath Mazumdar
This document provides an overview of several big data technologies including MapReduce, Pig, Flume, Cascading, and Dremel. It describes what each technology is used for, how it works, and example applications. MapReduce is a programming model for processing large datasets in a distributed environment, while Pig, Flume, and Cascading build upon MapReduce to provide higher-level abstractions. Dremel is an interactive query system for nested and complex datasets that uses a column-oriented data storage format.
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin... - ucelebi
Ufuk Celebi presented on the architecture and execution of Apache Flink's streaming data flow engine. Flink allows for both stream and batch processing using a common runtime. It translates APIs into a directed acyclic graph (DAG) called a JobGraph. The JobGraph is distributed across TaskManagers which execute parallel tasks. Communication between components like the JobManager and TaskManagers uses an actor system to coordinate scheduling, checkpointing, and monitoring of distributed streaming data flows.
This document discusses the Pulsar connector for Apache Flink 1.14. It provides an overview of StreamNative, which offers both stream storage with Apache Pulsar and stream processing with Flink. It then covers the timeline of contributions to the Pulsar connector for Flink and how it has evolved. Finally, it describes the design of the new Pulsar source connector for Flink that uses the FLIP-27 source interface, including how it handles Pulsar subscription modes and implements split enumeration, reading, and processing in a way that supports both batch and streaming workloads.
This document discusses best practices for productionalizing machine learning models built with Spark ML. It covers key stages like data preparation, model training, and operationalization. For data preparation, it recommends handling null values, missing data, and data types as custom Spark ML stages within a pipeline. For training, it suggests sampling data for testing and caching only required columns to improve efficiency. For operationalization, it discusses persisting models, validating prediction schemas, and extracting feature names from pipelines. The goal is to build robust, scalable and efficient ML workflows with Spark ML.
RxJava allows building reactive applications by treating everything as a stream of messages. Observables represent message producers and observers consume messages. Observables provide asynchronous and parallel execution via operators like subscribeOn and observeOn. This makes applications resilient to failure, scalable under varying workloads, and responsive to clients. RxJava also promotes message-driven architectures, functional programming, and handling errors as regular messages to improve these characteristics. Developers must also unsubscribe to prevent leaking resources and ensure observables only run when needed.
Real-time driving score service using Flink - Dongwon Kim
Dongwon Kim presented on migrating T map's driving score service from a batch processing architecture to a real-time streaming architecture using Apache Flink. The new system calculates driving scores for each session in real-time as GPS data is received, allowing users to see their scores sooner. It utilizes Flink's event time processing, windowing, and a custom trigger to handle out-of-order data. Metrics are collected using Prometheus to monitor performance and latency.
Flink Connector Development Tips & Tricks - Eron Wright
A look at some of the challenges and techniques for developing a connector for Apache Flink, covering the different types of connectors, lifecycle, metrics, event-time support, and fault tolerance.
Presentation video: https://www.youtube.com/watch?v=ZkbYO5S4z18
This document provides an overview of Apache Flink and streaming analytics. It discusses key concepts in streaming such as event time vs processing time, watermarks, windows, and fault tolerance using checkpoints and savepoints. It provides examples of time-windowed and session-windowed aggregations as well as pattern detection using state. The document also covers mixing event time and processing time, window triggers, and reprocessing data from savepoints in streaming jobs.
This document discusses Java 8 streams and how they are implemented under the hood. Streams use a pipeline concept where the output of one unit becomes the input of the next. Streams are built on functional interfaces and lambdas introduced in Java 8. Spliterators play an important role in enabling parallel processing of streams by splitting data sources into multiple chunks that can be processed independently in parallel. The reference pipeline implementation uses a linked list of processing stages to represent the stream operations and evaluates whether operations can be parallelized to take advantage of multiple threads or cores.
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015 - Till Rohrmann
Apache Flink is an open source platform for distributed stream and batch data processing. It provides APIs called DataStream for unbounded streaming data and DataSet for bounded batch data. Flink runs streaming topologies that allow for windowing, aggregation and other stream processing functions. It supports exactly-once processing semantics through distributed snapshots and checkpoints. The system is optimized for low latency and high throughput streaming applications.
The DAGScheduler is responsible for computing the DAG of stages for a Spark job and submitting them to the TaskScheduler. The TaskScheduler then submits individual tasks from each stage for execution and works with the DAGScheduler to handle failures through task and stage retries. Together, the DAGScheduler and TaskScheduler coordinate the execution of jobs by breaking them into independent stages of parallel tasks across executor nodes.
Measurement .Net Performance with BenchmarkDotNet - Vasyl Senko
This document discusses using BenchmarkDotNet to measure .NET performance. It provides an overview of profiling vs benchmarking vs microbenchmarking and best practices for benchmarking. It then demonstrates how to install BenchmarkDotNet, design a benchmark, analyze results, and configure columns, jobs, and diagnosers. Key aspects of how BenchmarkDotNet works and useful statistics and tools it provides are also summarized.
Join this workshop and accelerate your journey to production-ready Kubernetes by learning the practical techniques for reliably operating your software lifecycle using the GitOps pattern. The Weaveworks team will be running a full-day workshop, sharing their expertise as users and contributors of Kubernetes and Prometheus, as well as followers of GitOps (operations by pull request) practices.
Using a combination of instructor led demonstrations and hands-on exercises, the workshop will enable the attendee to go into detail on the following topics:
• Developing and operating your Kubernetes microservices at scale
• DevOps best practices and the movement towards a “GitOps” approach
• Building with Kubernetes in production: caring for your apps, implementing CI/CD best practices, and utilizing the right metrics, monitoring tools, and automated alerts
• Operating Kubernetes in production: Upgrading and managing Kubernetes, managing incident response, and adhering to security best practices for Kubernetes
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2qoUklo.
Mark Price talks about techniques for making performance testing a first-class citizen in a Continuous Delivery pipeline. He covers a number of war stories experienced by the team building one of the world's most advanced trading exchanges. Filmed at qconlondon.com.
Mark Price is a Senior Performance Engineer at Improbable.io, working on optimizing and scaling reality-scale simulations. Previously, he worked as Lead Performance Engineer at LMAX Exchange, where he helped to optimize the platform to become one of the world's fastest FX exchanges.
Scaling Monitoring At Databricks From Prometheus to M3 - Libby Schulze
M3 has been successfully deployed at Databricks to replace their Prometheus monitoring system. Some key lessons learned include monitoring important M3 metrics like memory and disk usage, having automated deployment processes, and planning for capacity needs and spikes in metrics. Updates to M3 have gone smoothly, and future plans include using new M3 features like downsampling and separate namespaces.
This presentation is an overview of the "Performance Modelling of Serverless Computing Platforms" paper published in IEEE Transactions on Cloud Computing (TCC). These slides were presented at CASCON 2020 Workshop on Cloud Computing.
Authors: Nima Mahmoudi and Hamzeh Khazaei
Paper: https://doi.org/10.1109/TCC.2020.3033373
Presentation: https://youtu.be/R3-gEd_0TF8
Citi Tech Talk: Monitoring and Performance - Confluent
The objective of the engagement is for Citi to have an understanding of, and a path forward for, monitoring their Confluent Platform, covering:
- Platform Monitoring
- Maintenance and Upgrade
Kubernetes @ Squarespace: Kubernetes in the Datacenter - Kevin Lynch
The document discusses Kubernetes adoption at Squarespace as their engineering organization grew. It describes the challenges of a monolithic architecture and how microservices addressed these challenges. It then discusses how Kubernetes helped solve operational challenges of provisioning and scaling microservices. Key Kubernetes concepts like pods, deployments, services and namespaces are explained. Monitoring, networking and security with Kubernetes are also covered.
The document discusses Apache Beam, a solution for next generation data processing. It provides a unified programming model for both batch and streaming data processing. Beam allows data pipelines to be written once and run on multiple execution engines. The presentation covers common challenges with historical data processing approaches, how Beam addresses these issues, a demo of running a Beam pipeline on different engines, and how to get involved with the Apache Beam community.
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w... - Data Con LA
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters and to Google Cloud Dataflow services to give the audience a feel for the portability of Beam, a new portable Big Data processing framework recently submitted by Google to the Apache foundation. This talk will look at how the programming model handles late arriving data in a stream with event time, windows, and triggers.
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6 - Nima Mahmoudi
This presentation is an overview of the "Temporal Performance Modeling of Serverless Computing Platforms" paper published in Sixth International Workshop on Serverless Computing (WoSC6) 2020 as part of IEEE Middleware conference.
Authors: Nima Mahmoudi and Hamzeh Khazaei
Paper: https://www.serverlesscomputing.org/wosc6/#p1
Preprint and Artifacts: https://research.nima-dev.com/publication/mahmoudi-2020-tempperf/
Full Presentation: https://youtu.be/9r3j_1B5t8c
Lightning Talk (1 min): https://youtu.be/E5KigIq0Z1E
PACS Lab: https://pacs.eecs.yorku.ca/
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by... - NETWAYS
Open source is at the heart of what we do at Grafana Labs, and there is so much happening! The intent of this talk is to update everyone on the latest developments in Grafana, Pyroscope, Faro, Loki, Mimir, Tempo and more. Everyone has at least heard about Grafana, but maybe some of the other projects mentioned above are new to you? Welcome to this talk 😉 Besides the update on what is new, we will also quickly introduce each project during this talk.
This document provides an overview of performance tuning best practices for Scala applications. It discusses motivations for performance tuning such as resolving issues or reducing infrastructure costs. Some common bottlenecks are identified as databases, asynchronous/thread operations, and I/O. Best practices covered include measuring metrics, identifying bottlenecks, and avoiding premature optimization. Microbenchmarks and optimization examples using Scala collections are also presented.
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex - Apache Apex
Apache Apex is a next gen big data analytics platform. Originally developed at DataTorrent it comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn about the Apex architecture, including its unique features for scalability, fault tolerance and processing guarantees, programming model and use cases.
http://apachebigdata2016.sched.org/event/6M0L/next-gen-big-data-analytics-with-apache-apex-thomas-weise-datatorrent
4-year chronicles of ALLSTOCKER (a trading platform for used construction equipment and machinery). We describe how the system has evolved incrementally using Pharo Smalltalk.
Feedback on creating an Apache Beam runner based on the Spark Structured Streaming framework: a piece of software that translates pipelines written with Apache Beam into native pipelines that use the Apache Spark Structured Streaming framework.
This document discusses using SpecFlow and WatiN for web automation testing with a behavior-driven development (BDD) approach. It provides an overview of BDD, demonstrates SpecFlow features like step arguments and scenario outlines, and recommends BDD patterns like page object model and driver pattern. The framework emphasizes clear specifications over tools, and supports collaboration between technical and non-technical teams through its use of plain language to describe features and scenarios.
Google Cloud Dataflow is a next generation managed big data service based on the Apache Beam programming model. It provides a unified model for batch and streaming data processing, with an optimized execution engine that automatically scales based on workload. Customers report being able to build complex data pipelines more quickly using Cloud Dataflow compared to other technologies like Spark, and with improved performance and reduced operational overhead.
Architectural Comparison of Apache Apex and Spark Streaming - Apache Apex
This presentation discusses architectural differences between Apache Apex and Spark Streaming. It discusses how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion.
Also, it will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. Further, it will discuss how these features affect time to market and total cost of ownership.
Digital Twins Computer Networking Paper Presentation.pptx - aryanpankaj78
A Digital Twin in computer networking is a virtual representation of a physical network, used to simulate, analyze, and optimize network performance and reliability. It leverages real-time data to enhance network management, predict issues, and improve decision-making processes.
Build the Next Generation of Apps with the Einstein 1 Platform.
Join Philippe Ozil for a workshop session that will guide you through the details of the Einstein 1 platform, the importance of data for building artificial intelligence applications, and the various tools and technologies that Salesforce offers to bring you all the benefits of AI.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL - ijaia
As digital technology becomes more deeply embedded in power systems, protecting the communication networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3) represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities. Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) algorithms. We employed a recent intrusion detection dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to train and test our model. The results of our experiments show that our CNN-LSTM method is much better at finding smart grid intrusions than other deep learning algorithms used for classification. In addition, our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection accuracy rate of 99.50%.
This study Examines the Effectiveness of Talent Procurement through the Imple... - Dharma Banothu
In a world of high technology and a fast-forward mindset, recruiters are showing interest in E-Recruitment. At present, the HRs of many companies are choosing E-Recruitment as the best choice for recruitment. E-Recruitment is done through many online platforms like LinkedIn, Naukri, Instagram, Facebook, etc. Now, with advanced technology, E-Recruitment has gone to the next level by using Artificial Intelligence too.
Key Words: Talent Management, Talent Acquisition, E-Recruitment, Artificial Intelligence
Introduction: Effectiveness of Talent Acquisition through E-Recruitment. In this topic we will discuss four important and interlinked topics, which are
Null Bangalore | Pentesters Approach to AWS IAM - Divyanshu
#Abstract:
- Learn more about the real-world methods for auditing AWS IAM (Identity and Access Management) as a pentester. So let us proceed with a brief discussion of IAM as well as some typical misconfigurations and their potential exploits in order to reinforce the understanding of IAM security best practices.
- Gain actionable insights into AWS IAM policies and roles, using a hands-on approach.
#Prerequisites:
- Basic understanding of AWS services and architecture
- Familiarity with cloud security concepts
- Experience using the AWS Management Console or AWS CLI.
- For hands on lab create account on [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
# Scenario Covered:
- Basics of IAM in AWS
- Implementing IAM Policies with Least Privilege to Manage S3 Bucket
- Objective: Create an S3 bucket with least privilege IAM policy and validate access.
- Steps:
- Create S3 bucket.
- Attach least privilege policy to IAM user.
- Validate access.
- Exploiting IAM PassRole Misconfiguration
- Allows a user to pass a specific IAM role to an AWS service (EC2), typically used for service access delegation. Then exploit the PassRole misconfiguration, granting unauthorized access to sensitive resources.
- Objective: Demonstrate how a PassRole misconfiguration can grant unauthorized access.
- Steps:
- Allow user to pass IAM role to EC2.
- Exploit misconfiguration for unauthorized access.
- Access sensitive resources.
- Exploiting IAM AssumeRole Misconfiguration with Overly Permissive Role
- An overly permissive IAM role configuration can lead to privilege escalation: create a role with administrative privileges and allow a user to assume this role.
- Objective: Show how overly permissive IAM roles can lead to privilege escalation.
- Steps:
- Create role with administrative privileges.
- Allow user to assume the role.
- Perform administrative actions.
- Differentiation between PassRole vs AssumeRole
Try at [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
Road construction is not as easy as it seems; it includes various steps, starting with design and structure, including traffic volume considerations. The base layer is laid by bulldozers and levelers, and after that the base surface coating has to be done. To give the road a smooth surface with flexibility, asphalt concrete is used. Asphalt requires an aggregate sub-base material layer, and then a base layer, to be put in place first. Asphalt road construction is formulated to support heavy traffic loads and climatic conditions. It is 100% recyclable, saving non-renewable natural resources.
With the advancement of technology, asphalt technology gives assurance of a good drainage system, and with skid resistance it can be used where safety is necessary, such as outside schools.
The largest use of asphalt is for making asphalt concrete for road surfaces. It is widely used in airports around the world; due to its sturdiness and ability to be repaired quickly, it is widely used for runways dedicated to aircraft landing and taking off. Asphalt is normally stored and transported at a temperature of 150°C or 300°F.
3. Agenda
1. Big Data Benchmarking
a. State of the art
b. NEXMark: A benchmark over continuous data streams
2. Nexmark on Apache Beam
a. Introducing Beam
b. Advantages of using Beam for benchmarking
c. Implementation
d. Nexmark + Beam: a win-win story
e. Neutral benchmarking: a difficult issue
f. Example: Running Nexmark on Spark
3. Current state and future work
5. Benchmarking
Why do we benchmark?
1. Performance
2. Correctness
Benchmark suite steps:
1. Generate data
2. Compute data
3. Measure performance
4. Validate results
Types of benchmarks
● Microbenchmarks
● Functional
● Business case
● Data Mining / Machine Learning
6. Issues of Benchmarking Suites for Big Data
● No standard suite: Terasort, TPCx-HS (Hadoop), HiBench, ...
● No common model/API: Strongly tied to each processing engine or SQL
● Too focused on Hadoop infrastructure
● Mixed Benchmarks for storage/processing
● Few Benchmarking suites support streaming
7. State of the art
Batch
● Terasort: Sort random data
● TPCx-HS: Sort to measure Hadoop compatible distributions
● TPC-DS on Spark: TPC-DS business case with Spark SQL
● Berkeley Big Data Benchmark: SQL-like queries on Hive, Redshift, Impala
● HiBench* and BigBench
Streaming
● Yahoo Streaming Benchmark
* HiBench also includes some streaming / windowing benchmarks
8. Nexmark
Benchmark for queries over data streams
Business case: Online Auction System
Research paper draft 2004
Example:
Query 4: What is the average selling price for each auction category?
Query 8: Who has entered the system and created an auction in the last period?
[Entity diagram: a Person can act as Seller or Bidder; a Seller creates an Auction for an Item; Bidders place Bids on the Auction]
9. Nexmark on Google Dataflow
● Port of the SQL-style queries described in the NEXMark research paper to Google Cloud Dataflow by Mark Shields and others at Google.
● Query set enriched with Google Cloud Dataflow client use cases
● Used as a rich integration-testing scenario on Google Cloud Dataflow
11. Apache Beam
[Architecture diagram: Beam Java, Beam Python and other language SDKs handle pipeline construction against the Beam Model; Fn runners translate pipelines for execution on Apache Flink, Apache Spark, and Cloud Dataflow]
1. The Beam Programming Model
2. SDKs for writing Beam pipelines (Java/Python)
3. Runners for existing distributed processing backends
12. The Beam Model: What is Being Computed?
Event Time: timestamp when the event happened
Processing Time: wall clock absolute program time
13. The Beam Model: Where in Event Time?
[Chart: input and output elements plotted by event time vs processing time, over the range 12:00 to 12:10]
● Split infinite data into finite chunks
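To make this concrete, here is a minimal sketch in the Beam Java SDK of splitting an unbounded stream into fixed event-time windows; the Bid element type and the two-minute window size are illustrative assumptions, not part of the original deck.

import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Split an unbounded stream of bids into 2-minute event-time windows;
// downstream aggregations (Count, Combine, ...) then run per window.
PCollection<Bid> windowedBids = bids.apply(
    Window.<Bid>into(FixedWindows.of(Duration.standardMinutes(2))));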
15. Apache Beam Pipeline concepts
[Pipeline diagram: Read (Source) -> input PCollection -> PTransforms (Window per min, Count) -> output PCollection -> Write (Sink) to HDFS; the data processing pipeline is executed via a Beam runner]
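As a concrete companion to the diagram, here is a small self-contained sketch of such a pipeline with the Beam Java SDK; the HDFS paths, the one-minute window and the MinuteCounts class name are illustrative assumptions.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class MinuteCounts {
  public static void main(String[] args) {
    // The runner (Flink, Spark, Dataflow, ...) is selected via pipeline options.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.read().from("hdfs://namenode/input/*"))                        // Read (Source)
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))) // Window per min
        .apply(Count.perElement())                                                // Count per window
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))    // format output
        .apply(TextIO.write().to("hdfs://namenode/output/counts")
            .withWindowedWrites().withNumShards(1));                              // Write (Sink)

    p.run().waitUntilFinish();
  }
}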
16. Apache Beam - Programming Model
Element-wise: ParDo -> DoFn, MapElements, FlatMapElements, Filter, WithKeys, Keys, Values
Grouping: GroupByKey, CoGroupByKey, Combine -> Reduce (Sum, Count, Min / Max, Mean, ...)
Windowing/Triggers: Windows (FixedWindows, GlobalWindows, SlidingWindows, Sessions); Triggers (AfterWatermark, AfterProcessingTime, Repeatedly)
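As a small illustration of the element-wise primitive from the list above, here is a ParDo wrapping a hand-written DoFn; the input collection lines (a PCollection<String>) and the word-splitting logic are only an example.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Element-wise processing: emit zero or more outputs per input element.
PCollection<String> words = lines.apply("SplitWords",
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(@Element String line, OutputReceiver<String> out) {
        for (String word : line.split("\\s+")) {
          out.output(word);
        }
      }
    }));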
17. Nexmark on Apache Beam
● Nexmark was ported from Dataflow to Beam 0.2.0 as an integration test case
● Refactored to most recent Beam version
● Made code more generic to support all the Beam runners
● Changed some queries to use new APIs
● Validated queries in all the runners to test their support of the Beam model
18. Advantages of using Beam for benchmarking
● Rich model: all use cases that we had could be expressed using Beam API
● Can test both batch and streaming modes with exactly the same code
● Multiple runners: queries can be executed on Beam supported runners (provided that the given runner supports the used features)
● Monitoring features (metrics)
20. Components of Nexmark
● Generator:
○ generates timestamped events (bids, persons, auctions) correlated with each other
● NexmarkLauncher:
○ creates sources that use the generator
○ launches and monitors the query pipelines
● Output metrics:
○ each query includes ParDos to update metrics
○ execution time, processing event rate, number of results, but also invalid auctions/bids, …
● Modes:
○ Batch mode: test data is finite and uses a BoundedSource
○ Streaming mode: test data is finite but uses an UnboundedSource to trigger streaming mode in runners
21. Some of the queries
Query | Description | Use of Beam model
3 | Who is selling in particular US states? | Join, State, Timer
5 | Which auctions have seen the most bids in the last period? | Sliding Window, Combiners
6 | What is the average selling price per seller for their last 10 closed auctions? | Global Window, Custom Combiner
7 | What are the highest bids per period? | Fixed Windows, Side Input
9 * | Winning bids | Custom Window
11 * | How many bids did a user make in each session he was active? | Session Window, Triggering
12 * | How many bids does a user make within a fixed processing-time limit? | Global Window, working in Processing Time
*: not in the original NexMark paper
22. Query structure
1. Get PCollection<Event> as input
2. Apply ParDo + Filter to extract the objects of interest: Bids, Auctions, Person(s)
3. Apply transforms: Filter, Count, GroupByKey, Window, etc.
4. Apply ParDo to output the final PCollection: a collection of AuctionPrice, AuctionCount, ... (skeleton below)
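A skeleton of those four steps, assuming the usual org.apache.beam.sdk imports and Nexmark-like model classes (Event, Bid, AuctionCount); the one-minute window is an arbitrary illustrative choice:

```java
PCollection<AuctionCount> query(PCollection<Event> events) {
  // 1. The PCollection<Event> input is the method parameter.

  // 2. Extract the objects of interest: keep only bid events.
  PCollection<Long> auctionIds = events
      .apply("JustBids", Filter.by((Event e) -> e.bid != null))
      .apply("BidToAuctionId", MapElements.into(TypeDescriptors.longs())
          .via((Event e) -> e.bid.auction));

  // 3. Apply transforms: window, then count bids per auction.
  PCollection<KV<Long, Long>> counts = auctionIds
      .apply(Window.<Long>into(FixedWindows.of(Duration.standardMinutes(1))))
      .apply(Count.perElement());

  // 4. Output the final PCollection of result objects.
  return counts.apply("ToAuctionCount",
      MapElements.into(TypeDescriptor.of(AuctionCount.class))
          .via((KV<Long, Long> kv) -> new AuctionCount(kv.getKey(), kv.getValue())));
}
```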
23. Key point: When to compute data?
● Windows: divide data into event-time-based finite chunks.
○ Often required when doing aggregations over unbounded data
24. Key point: When to compute data?
● Triggers: conditions to fire the computation
● Default trigger: fires at the end of the window
● Required when working on unbounded data in the Global Window
● Q11: the trigger fires once at least 20 elements have been received (sketched below)
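A sketch of that Q11-style trigger, assuming a PCollection<Bid> named bids and the usual windowing imports; 20 stands in for the nbEvents configuration item:

```java
// Fire repeatedly, each time at least 20 elements have arrived in the pane.
PCollection<Bid> triggered = bids.apply("EveryTwentyBids",
    Window.<Bid>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(20)))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.ZERO));
```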
25. Key point: When to compute data?
● Q12: the trigger fires when the first element is received plus a delay (works in processing time in the Global Window to create a duration); see the sketch below
● Processing time: the absolute wall-clock time of the processing program
● Event time: the timestamp at which the event occurred
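A sketch of that Q12-style trigger, assuming a PCollection<Long> of bidder ids; the ten-second delay is an arbitrary stand-in for the windowDuration configuration item:

```java
// In the global window, fire a pane once processing time has advanced a
// fixed delay past the first element seen in that pane.
PCollection<Long> panes = bidderIds.apply("ProcessingTimePanes",
    Window.<Long>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(10))))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.ZERO));
```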
26. Key point: How to make a join?
● CoGroupByKey (in Q3, Q9): groups values of PCollections<KV> that share the same key
○ Q3: Join Auctions and Persons by their person id and tag them
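A sketch of the Q3-style join, assuming auctions already keyed by seller id and persons by person id (Auction and Person stand in for the Nexmark model classes):

```java
final TupleTag<Auction> auctionTag = new TupleTag<>();
final TupleTag<Person> personTag = new TupleTag<>();

// Group both collections by their shared key (person id == seller id).
PCollection<KV<Long, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(auctionTag, auctionsBySellerId)
        .and(personTag, personsById)
        .apply(CoGroupByKey.create());

joined.apply("FormatJoin", ParDo.of(
    new DoFn<KV<Long, CoGbkResult>, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // The tags tell the two sides of the join apart.
        Iterable<Auction> auctions = c.element().getValue().getAll(auctionTag);
        Iterable<Person> persons = c.element().getValue().getAll(personTag);
        for (Person person : persons) {
          for (Auction auction : auctions) {
            c.output(person.name + " sells auction " + auction.id);
          }
        }
      }
    }));
```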
27. Key point: How to temporarily group events?
● Custom window function (in Q9)
○ As CoGroupByKey is per window, we need to put bids and auctions in the same window before joining them.
28. Key point: How to deal with out-of-order events?
● State and Timer APIs in an incremental join (Q3), as sketched below:
○ memorize the person event while waiting for corresponding auctions, and clear it when the timer fires
○ memorize auction events while waiting for the corresponding person event
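A hedged sketch of that incremental join as a stateful DoFn with an event-time timer. The input is assumed keyed by person/seller id, the ten-minute TTL is an arbitrary stand-in, and Event/Person/Auction/NameCityStateId mirror the Nexmark model classes; the real Query3 implementation is more elaborate:

```java
import org.apache.beam.sdk.state.*;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

class IncrementalJoinFn extends DoFn<KV<Long, Event>, NameCityStateId> {

  @StateId("person")
  private final StateSpec<ValueState<Person>> personSpec = StateSpecs.value();

  @StateId("pendingAuctions")
  private final StateSpec<BagState<Auction>> auctionsSpec = StateSpecs.bag();

  @TimerId("personTtl")
  private final TimerSpec ttlSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void processElement(ProcessContext c,
      @StateId("person") ValueState<Person> personState,
      @StateId("pendingAuctions") BagState<Auction> pendingAuctions,
      @TimerId("personTtl") Timer ttl) {
    Event event = c.element().getValue();
    if (event.newPerson != null) {
      personState.write(event.newPerson);
      // Flush any auctions that arrived before their seller's person event.
      for (Auction auction : pendingAuctions.read()) {
        c.output(join(event.newPerson, auction));
      }
      pendingAuctions.clear();
      ttl.set(c.timestamp().plus(Duration.standardMinutes(10)));  // TTL
    } else if (event.newAuction != null) {
      Person person = personState.read();
      if (person != null) {
        c.output(join(person, event.newAuction));
      } else {
        pendingAuctions.add(event.newAuction);  // wait for the person event
      }
    }
  }

  @OnTimer("personTtl")
  public void onPersonTtl(OnTimerContext c,
      @StateId("person") ValueState<Person> personState) {
    personState.clear();  // TTL reached: stop matching auctions to this person
  }

  private static NameCityStateId join(Person p, Auction a) {
    return new NameCityStateId(p.name, p.city, p.state, a.id);
  }
}
```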
29. Key point: How to tweak the reduction phase?
A custom combiner (in Q6) makes it possible to specify:
1. how elements are added to accumulators
2. how accumulators merge
3. how to extract the final data
It is used to calculate the average selling price of the last 10 closed auctions (see the sketch below).
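A minimal Combine.CombineFn sketch showing the three hooks listed above. It computes a plain average price; the real Q6 combiner additionally sorts bids and keeps only the last 10 closed auctions per seller:

```java
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.values.KV;

// Accumulator: KV<sum of prices, count of prices>.
class AveragePriceFn extends Combine.CombineFn<Long, KV<Long, Long>, Double> {

  @Override
  public KV<Long, Long> createAccumulator() {
    return KV.of(0L, 0L);
  }

  @Override
  public KV<Long, Long> addInput(KV<Long, Long> acc, Long price) {
    // 1. How elements are added to accumulators.
    return KV.of(acc.getKey() + price, acc.getValue() + 1);
  }

  @Override
  public KV<Long, Long> mergeAccumulators(Iterable<KV<Long, Long>> accs) {
    // 2. How accumulators merge.
    long sum = 0, count = 0;
    for (KV<Long, Long> acc : accs) {
      sum += acc.getKey();
      count += acc.getValue();
    }
    return KV.of(sum, count);
  }

  @Override
  public Double extractOutput(KV<Long, Long> acc) {
    // 3. How to extract the final data.
    return acc.getValue() == 0 ? 0.0 : ((double) acc.getKey()) / acc.getValue();
  }
}
// Usage: winningBidsBySeller.apply(Combine.perKey(new AveragePriceFn()));
```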
30. Conclusion on queries
● Wide coverage of the Beam API
○ Most of the API
○ Also illustrates working in processing time
● Realistic
○ Real use cases, valid queries for an end-user auction system
○ Extra queries inspired by Google Cloud Dataflow client use cases
● Complex queries
○ Leverage all the runners' capabilities
31. Beam + Nexmark = A win-win story
● Streaming test
● A/B testing of big data execution engines (regression and performance comparison between 2 versions of the same engine or of the same runner, ...)
● Integration testing (SDK with runners, runners with engines, …)
● Validate Beam runners capability matrix
33. Neutral Benchmarking: A difficult issue
● Different levels of support of capabilities of the Beam model among runners
● All execution systems have different strengths: we would end up comparing things that are not always comparable
○ Some runners were designed to be batch oriented, others streaming oriented
○ Some are designed for sub-second latency, others support auto-scaling
● Runners/systems have multiple knobs to tweak their behavior
● Benchmarking in a distributed environment can be inconsistent
● It is even worse if you benchmark in the cloud (e.g. noisy neighbors)
42. Current state
● Nexmark helped discover bugs and missing features in Beam
● 9 open issues / 6 closed issues on Beam upstream
● The Nexmark PR is open and is expected to be merged into Beam master soon
43. Future work
● Resolve open Nexmark and Beam issues
● Add more queries to evaluate corner cases
● Validate new runners: Gearpump, Storm, JStorm
● Integrate Nexmark into the current set of tests of Beam
● Streaming SQL-based queries (using the ongoing work on Calcite DSL)
44. Contribute
You are welcome to contribute!
● 5 open GitHub issues and 9 Beam JIRAs that need to be taken care of
● Improve documentation + more refactoring
● New ideas, more queries, support for IOs, etc.
45. Greetings
● Mark Shields (Google): contributing Nexmark and answering our questions.
● Thomas Groh, Kenneth Knowles (Google): Direct runner + State/Timer API.
● Amit Sela, Aviem Zur (PayPal): Spark runner + Metrics.
● Aljoscha Krettek (data Artisans), Jinsong Lee (Alibaba): Flink runner.
● Jean-Baptiste Onofré, Abbass Marouni (Talend): comments and help running Nexmark on our YARN cluster.
● The rest of the Beam community in general for being awesome.
50. Query 3: Who is selling in particular US states?
● Illustrates an incremental join of the auctions and persons collections
● Uses the global window and the per-key State and Timer APIs
○ Apply the global window to events with a trigger that fires repeatedly after at least nbEvents in the pane => results are materialized each time nbEvents are received
○ input1: collection of auction events filtered by category and keyed by seller id
○ input2: collection of person events filtered by US state codes and keyed by person id
○ CoGroupByKey to group auctions and persons by personId/sellerId, plus tags to distinguish persons and auctions
○ ParDo to do the incremental join: auction and person events can arrive out of order
■ person elements are stored in persistent state in order to match future auctions by that person; a timer clears the person state after a TTL
■ auction elements are stored in persistent state until the corresponding person record has been seen, then they can be output and cleared
○ output NameCityStateId(person.name, person.city, person.state, auction.id) objects
51. Query 5: Which auctions have seen the most bids in the last period?
● Illustrates sliding windows and combiners (i.e. reducers) to compare the elements of the auctions collection (windowing sketched below)
○ input: collection of bid events in a sliding window (to have a result over a 1-hour period, updated every 1 minute)
○ ParDo to replace bid elements by their auction id
○ Count.perElement to count the occurrences of each auction id
○ Combine.globally to select only the auctions with the maximum number of bids
■ BinaryCombineFn to compare the elements of the collection one to one (auction id occurrences, i.e. number of bids)
■ returns KV(auction id, max occurrences)
○ output: AuctionCount(auction id, max occurrences) objects
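A sketch of the windowing-plus-count part of Q5, assuming a PCollection<Long> of auction ids extracted from the bids:

```java
// One-hour window sliding every minute; count occurrences of each auction
// id (i.e. the number of bids per auction) within it.
PCollection<KV<Long, Long>> bidsPerAuction = auctionIds
    .apply("HourEveryMinute",
        Window.<Long>into(SlidingWindows.of(Duration.standardHours(1))
            .every(Duration.standardMinutes(1))))
    .apply(Count.perElement());
```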
52. Query 6: What is the average selling price per seller for their last 10 closed auctions?
● Illustrates a specialized combiner that allows specifying how the bids are combined
○ input: winning bids keyed by seller id
○ GlobalWindow + triggering at each element (to have a continuous flow of updates at each new winning bid)
○ Combine.perKey to calculate the average price of the last 10 winning bids for each seller
■ create ArrayList accumulators for chunks of data
■ add all elements of the chunks to the accumulators, sort them by bid timestamp then price, keeping the last 10 elements
■ iteratively merge the accumulators until there is only one: add all bids of all accumulators to a final accumulator and sort by timestamp then price, keeping the last 10 elements
■ extractOutput: sum all the bid prices and divide by the accumulator size
○ output SellerPrice(sellerId, avgPrice) objects
53. Query 7: What are the highest bids per period?
● Could have been implemented with a combiner like query 5, but deliberately implemented using Max(prices) as a side input, to illustrate fanout (sketched below)
● Fanout is a redistribution that uses an intermediate implicit combine step to reduce the load on the final step of the Max transform
○ input: (fixed) windowed collection of 100,000 bid events
○ ParDo to replace bids by their price
○ Max.withFanout to get the max per window and use it as a side input for the next step; fanout is useful when many events in a window have to be combined by the Max transform
○ ParDo on the bids with the side input, outputting the bid if bid.price equals the maxPrice that comes from the side input
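A sketch of the fanout-plus-side-input pattern of Q7, assuming prices and bids are already in the same fixed windows; the fanout of 16 is an arbitrary illustrative value:

```java
// Combine with fanout: an intermediate combine step spreads the load
// before the final Max step; publish the per-window max as a singleton view.
PCollectionView<Long> maxPriceView = prices
    .apply("MaxWithFanout", Max.longsGlobally().withFanout(16).asSingletonView());

// Keep only the bids whose price equals the window's max price.
PCollection<Bid> highestBids = bids.apply("BidsAtMaxPrice",
    ParDo.of(new DoFn<Bid, Bid>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        if (c.element().price == c.sideInput(maxPriceView)) {
          c.output(c.element());
        }
      }
    }).withSideInputs(maxPriceView));
```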
54. Query 9 (not part of original Nexmark): Winning bids - extract the most recent of the highest bids
● Illustrates a custom window function to temporarily reconcile auctions and bids, and then join them
○ input: collection of events
○ apply the custom windowing function to put auction and bid events in the same custom window (AuctionOrBidWindow)
■ assign auctions to the window [auction.timestamp, auction.expiring]
■ assign bids to the window [bid.timestamp, bid.timestamp + expectedAuctionDuration (a generator configuration parameter)]
■ merge all 'bid' windows into their corresponding 'auction' window, provided the auction has not expired
○ Filter + ParDos to extract auctions out of events and key them by auction id
○ Filter + ParDos to extract bids out of events and key them by auction id
○ CoGroupByKey (groups values of PCollections<KV> that share the same key) to group auctions and bids by auction id, plus tags to distinguish auctions and bids
○ ParDo to
■ determine the best bid price: verify the bid is valid, sort prices by price ASC then time DESC, and keep the max price
■ output AuctionBid(auction, bestBid) objects
55. Query 11 (not part of original Nexmark): How many bids did a user make in each session he was active?
● Illustrates session windows + triggering on the bids collection (sketched below)
○ input: collection of bid events
○ ParDo to replace bids with their bidder id
○ apply session windows with gap duration = windowDuration (a configuration item) and a trigger that fires repeatedly after at least nbEvents in the pane => each window (i.e. session) contains the bid ids received since the last windowDuration period of inactivity, materialized every nbEvents bids
○ Count.perElement to count bids per bidder id (number of occurrences of the bidder id)
○ output idsPerSession(bidder, bidsCount) objects
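A sketch of the Q11 session windowing, with ten minutes standing in for windowDuration and 20 for nbEvents; bidderIds is the PCollection<Long> produced by the ParDo above:

```java
// Session windows per bidder with a 10-minute gap, materialized every
// 20 bids, then a per-element count of the bidder ids.
PCollection<KV<Long, Long>> bidsPerSession = bidderIds
    .apply("BidderSessions",
        Window.<Long>into(Sessions.withGapDuration(Duration.standardMinutes(10)))
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(20)))
            .discardingFiredPanes()
            .withAllowedLateness(Duration.ZERO))
    .apply(Count.perElement());
```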
56. Query 12 (not part of original Nexmark): How many bids does a user make within a fixed processing-time limit?
● Illustrates working in processing time in the Global Window to count occurrences of bidders
○ input: collection of bid events
○ ParDo to replace bids by their bidder id
○ apply the global window with a trigger that fires repeatedly once processing time passes the first element in the pane + windowDuration (a configuration item) => each pane contains the elements processed within windowDuration
○ Count.perElement to count bids per bidder id (occurrences of the bidder id)
○ output BidsPerWindow(bidder, bidsCount) objects