Dongwon Kim presented on migrating T map's driving score service from a batch processing architecture to a real-time streaming architecture using Apache Flink. The new system calculates driving scores for each session in real-time as GPS data is received, allowing users to see their scores sooner. It utilizes Flink's event time processing, windowing, and a custom trigger to handle out-of-order data. Metrics are collected using Prometheus to monitor performance and latency.
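The session-scoring idea can be sketched in plain Python. This is an illustrative toy, not T map's actual scoring logic: the `GpsPoint` type, the speed-limit rule, and the scoring formula are all assumptions; the point is that sorting buffered points by their device (event-time) timestamp lets late, out-of-order GPS fixes land in the right place before the session window fires.

```python
from dataclasses import dataclass

@dataclass
class GpsPoint:
    event_time: int   # epoch seconds from the device, not arrival time
    speed_kmh: float

def score_session(points, speed_limit_kmh=100.0):
    """Toy per-session driving score: percentage of points at or
    under the speed limit. Sorting by event_time mimics event-time
    processing: out-of-order arrivals are ordered correctly once
    the session window is evaluated."""
    ordered = sorted(points, key=lambda p: p.event_time)
    if not ordered:
        return 0.0
    ok = sum(1 for p in ordered if p.speed_kmh <= speed_limit_kmh)
    return round(100.0 * ok / len(ordered), 1)

# Points arrive out of order (arrival order != event_time order).
session = [
    GpsPoint(event_time=30, speed_kmh=95.0),
    GpsPoint(event_time=10, speed_kmh=80.0),   # late arrival
    GpsPoint(event_time=20, speed_kmh=120.0),
]
print(score_session(session))  # 66.7
```

In the real system, Flink's event-time windows and a custom trigger decide when a session is complete; this sketch only shows why event timestamps, not arrival order, must drive the computation.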
Predictive Maintenance with Deep Learning and Apache Flink – Dongwon Kim

Flink can be used to build a predictive maintenance system using deep learning models on time-series sensor data. A Flink data stream processing pipeline is designed to handle joining streams, applying convolutional LSTM models through an ensemble, and monitoring outputs. Docker and Prometheus are used to package and monitor the solution.
Why is building a big data platform hard? What are the key aspects involved in providing a "Serverless" experience for data folks? And how does Databricks solve the infrastructure problems to provide that "Serverless" experience?
- Spark Streaming allows processing of live data streams using Spark's batch processing engine by dividing streams into micro-batches.
- A Spark Streaming application consists of input streams, transformations on those streams such as maps and filters, and output operations. The application runs continuously processing each micro-batch.
- Key aspects of operationalizing Spark Streaming jobs include checkpointing to ensure fault tolerance, optimizing throughput by increasing parallelism, and debugging using Spark UI.
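The micro-batch model in the bullets above can be sketched in a few lines of Python. This is a conceptual stand-in, not Spark's API: an unbounded stream is chopped into fixed-size batches, and each batch is processed with ordinary map/filter logic, the way a DStream transformation runs on each micro-batch RDD.

```python
def micro_batches(stream, batch_size):
    """Chop an (unbounded) iterator into fixed-size micro-batches,
    the way Spark Streaming discretizes a live stream."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

def process(batch):
    # A map + filter over one micro-batch, as in a DStream transformation:
    # keep even records and double them.
    return [x * 2 for x in batch if x % 2 == 0]

results = [process(b) for b in micro_batches(range(7), batch_size=3)]
print(results)  # [[0, 4], [8], [12]]
```

In Spark the batching is driven by a wall-clock interval rather than a record count, but the shape of the computation is the same.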
Video and slides synchronized, mp3 and slide download available at http://bit.ly/1yyaHb8.
The authors discuss Netflix's new stream processing system that supports a reactive programming model, allows auto scaling, and is capable of processing millions of messages per second. Filmed at qconsf.com.
Danny Yuan is an architect and software developer in Netflix’s Platform Engineering team. Justin Becker is Senior Software Engineer at Netflix.
Chris Hillman – Beyond MapReduce: Scientific Data Processing in Real-time (Flink Forward)
This document discusses processing scientific mass spectrometry data in real-time using parallel and distributed computing techniques. It describes how a mass spectrometry experiment produces terabytes of data that currently takes over 24 hours to fully process. The document proposes using MapReduce and Apache Flink to parallelize the data processing across clusters to help speed it up towards real-time analysis. Initial tests show Flink can process the data 2-3 times faster than traditional Hadoop MapReduce. Finally, it discusses simulating real-time streaming of the data using Kafka and Flink Streaming to enable processing results within 10 seconds of the experiment completing.
Flink Forward Berlin 2017: Robert Metzger – Keep it going – How to reliably a... (Flink Forward)
Let’s be honest: running a distributed stateful stream processor that handles terabytes of state and tens of gigabytes of data per second, while staying highly available and correct (in an exactly-once sense), does not work without any planning, configuration, and monitoring. While the Flink developer community tries to make everything as simple as possible, it is still important to be aware of all the requirements and implications. In this talk, we will provide some insights into the greatest operations mysteries of Flink from a high-level perspective:
- Capacity and resource planning: understand the theoretical limits.
- Memory and CPU configuration: distribute resources according to your needs.
- Setting up High Availability: plan for failures.
- Checkpointing and State Backends: ensure correctness and fast recovery.
For each of the listed topics, we will introduce the relevant Flink concepts and provide some best practices we have learned over the past years supporting Flink users in production.
Sebastian Schelter – Distributed Machine Learning with the Samsara DSL (Flink Forward)
The document discusses Samsara, a domain specific language for distributed machine learning. It provides an algebraic expression language for linear algebra operations and optimizes distributed computations. An example of linear regression on a cereals dataset is presented to demonstrate how Samsara can be used to estimate regression coefficients in a distributed fashion. Key steps include loading data as a distributed row matrix, extracting feature and target matrices, computing the normal equations, and solving the linear system to estimate coefficients.
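The normal-equations approach from the cereals example can be shown concretely. The sketch below is plain Python rather than Samsara's DSL, and the data is made up; what it mirrors is the key step the talk describes: forming X^T X and X^T y (the products Samsara computes in a distributed fashion over the row matrix) and then solving the small linear system locally to estimate the coefficients.

```python
def linear_regression(xs, ys):
    """Estimate intercept and slope by solving the normal equations
    (X^T X) beta = X^T y for one feature plus a bias column.
    The X^T X and X^T y aggregates are exactly what a system like
    Samsara would compute distributedly before a local solve."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]],  X^T y = [sy, sxy]
    det = n * sxx - sx * sx
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Data generated from y = 1 + 2x, so the solver should recover (1, 2).
print(linear_regression([1, 2, 3, 4], [3, 5, 7, 9]))  # (1.0, 2.0)
```

With many features the 2x2 closed form becomes a general linear solve, but the division of labor is the same: heavy aggregation distributed, tiny solve on the driver.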
This document provides an overview of Apache Flink internals. It begins with an introduction and recap of Flink programming concepts. It then discusses how Flink programs are compiled into execution plans and executed in a pipelined fashion, as opposed to being executed eagerly like regular code. The document outlines Flink's architecture including the optimizer, runtime environment, and data storage integrations. It also covers iterative processing and how Flink handles iterations both by unrolling loops and with native iterative datasets.
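The lazy, plan-based execution described above can be illustrated with a minimal sketch (not Flink's API; the `Dataset` class and its methods are invented for illustration): transformations only record operators into a plan, and nothing runs until a terminal action triggers execution, loosely like Flink compiling a program into an execution plan before running it.

```python
class Dataset:
    """Minimal lazy dataset: map/filter only build a plan; work
    happens when collect() triggers execution of the whole plan."""
    def __init__(self, source, plan=None):
        self.source = source
        self.plan = plan or []

    def map(self, fn):
        return Dataset(self.source, self.plan + [("map", fn)])

    def filter(self, pred):
        return Dataset(self.source, self.plan + [("filter", pred)])

    def collect(self):
        out = list(self.source)
        for kind, fn in self.plan:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

ds = Dataset(range(5)).map(lambda x: x * 10).filter(lambda x: x >= 20)
# No work has happened yet; ds.plan is just the two recorded operators.
print(ds.collect())  # [20, 30, 40]
```

A real optimizer would rewrite the recorded plan (e.g. push the filter before the map when safe) before execution, which is exactly why deferring execution is worth the indirection.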
Spark Streaming allows processing of live data streams using Spark. It works by receiving data streams, chopping them into batches, and processing the batches using Spark. This presentation covered Spark Streaming concepts like the lifecycle of a streaming application, best practices for aggregations, operationalization through checkpointing, and achieving high throughput. It also discussed debugging streaming jobs and the benefits of combining streaming with batch, machine learning, and SQL processing.
Deep Stream Dynamic Graph Analytics with Grapharis – Massimo Perini (Flink Forward)
The world's toughest and most interesting analysis tasks lie at the intersection of graph data (inter-dependencies in the data) and deep learning (inter-dependencies in the model). Classical graph embedding techniques have for years occupied research groups seeking ways to encode complex graphs into a low-dimensional latent space. Recently, deep learning has come to dominate embedding generation thanks to its ability to automatically generate embeddings for any static graph.
Grapharis is a project that revitalizes the concept of graph embeddings, but does so in a realistic setting where graphs are not static and keep changing over time (think of user interactions in social networks). More specifically, we explored how a system like Flink can be used both to simplify the process of training a graph embedding model incrementally and to make complex inferences and predictions in real time over graph-structured data streams. To our knowledge, Grapharis is the first complete data pipeline using Flink and TensorFlow for real-time deep graph learning. This talk will cover how we can train, store, and generate embeddings continuously and accurately as data evolves over time, without the need to re-train the underlying model.
QConSF 2014 talk on Netflix Mantis, a stream processing system – Danny Yuan
Justin and I gave this talk at QCon SF 2014 about Mantis, a stream processing system that features a reactive programming API, auto scaling, and stream locality.
Distributed Real-Time Stream Processing: Why and How 2.0 – Petr Zapletal
The demand for stream processing is increasing rapidly these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. Stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.
In this talk we are going to discuss various state-of-the-art open-source distributed streaming frameworks: their similarities and differences, implementation trade-offs, and their intended use-cases. Apart from that, I’m going to speak about Fast Data, the theory of streaming, framework evaluation, and so on. My goal is to provide a comprehensive overview of modern streaming frameworks and to help fellow developers pick the best possible one for their particular use-case.
Self-managed and automatically reconfigurable stream processing – Vasia Kalavri
With its superior state management and savepoint mechanism, Apache Flink is unique among modern stream processors in supporting minimal-effort job reconfiguration. Savepoints are being extensively used to enable dynamic scaling, bug fixing, upgrades, and numerous other reconfiguration use-cases, all while preserving exactly-once semantics. However, when it comes to dynamic scaling, the burden of reconfiguration decisions -when and how much to scale- is currently placed on the user.
In this talk, I share our recent work at ETH Zurich on providing support for self-managed and automatically reconfigurable stream processing. I present SnailTrail (NSDI’18), an online critical path analysis module that detects bottlenecks and provides insights on streaming application performance, and DS2 (OSDI’18), an automatic scaling controller which identifies optimal backpressure-free configurations and operates reactively online. Both SnailTrail and DS2 are integrated with Apache Flink and publicly available. I conclude with evaluation results, ongoing work, and future challenges in this area.
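The core of DS2's scaling rule can be stated as back-of-the-envelope arithmetic. This sketch is a simplification of the controller described in the talk, and the rates below are made-up numbers: give each operator just enough parallel instances for its true (useful-work) processing rate to keep up with the input rate, avoiding backpressure without overprovisioning.

```python
import math

def optimal_parallelism(input_rate, true_rate_per_instance):
    """Simplified DS2-style estimate: minimum number of parallel
    instances whose combined true processing rate covers the
    operator's input rate (rates in records/sec)."""
    return math.ceil(input_rate / true_rate_per_instance)

# An operator whose single instance usefully processes 2,500 rec/s,
# facing 10,000 rec/s of input, needs 4 instances.
print(optimal_parallelism(10_000, 2_500))  # 4
print(optimal_parallelism(10_001, 2_500))  # 5
```

The hard part DS2 actually solves is measuring the *true* rate (time spent doing useful work, excluding waiting) online and propagating target rates through the dataflow graph; the arithmetic above is only the final step.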
Scalable Realtime Analytics with declarative SQL like Complex Event Processin... – Srinath Perera
This document discusses real-time analytics and introduces WSO2 Complex Event Processing (CEP) as a SQL-like language for real-time analytics. It describes how CEP can be used to define filtering, aggregation, pattern matching, and other common operations on streaming data. It also discusses how CEP queries can be scaled out across multiple nodes by partitioning streams and queries. CEP provides an easy to use yet powerful way to perform real-time analytics on streaming data at scale.
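A flavor of the pattern-matching operation CEP engines provide can be sketched without an engine. This toy (not WSO2 CEP's actual syntax; event shapes and names are invented) implements the classic pattern "A followed by B within N time units" over a time-ordered event stream:

```python
def followed_by(events, first, second, within):
    """Tiny CEP-style matcher: return (t_first, t_second) pairs where
    a `second` event follows a `first` event within `within` time
    units. Events are (timestamp, type) tuples in timestamp order."""
    matches = []
    pending = []  # timestamps of not-yet-matched `first` events
    for ts, etype in events:
        if etype == first:
            pending.append(ts)
        elif etype == second:
            # match every earlier `first` still inside the window
            matches.extend((ta, ts) for ta in pending if ts - ta <= within)
            pending = [ta for ta in pending if ts - ta > within]
    return matches

stream = [(1, "login"), (3, "purchase"), (10, "login"), (30, "purchase")]
print(followed_by(stream, "login", "purchase", within=5))  # [(1, 3)]
```

A real CEP engine expresses the same thing declaratively (a SQL-like query over streams) and handles partitioning, windows, and state for you, which is the scaling story the talk describes.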
Building large-scale analytics platform with Storm, Kafka and Cassandra – NYC... – Alexey Kharlamov
At Integral, we process heavy volumes of click-stream traffic: 50K QPS of ad impressions at peak and close to 200K QPS of all browser calls. We build analytics on these streams of data. Two applications require quite significant computational effort: 'sessionization' and fraud detection.
Sessionization implies linking a series of requests from the same browser into a single record. There can be 5 or more requests spread over 15-30 minutes which we need to link to each other.
Fraud detection is a process that looks at various signals in browser requests, together with substantial historical evidence data, and classifies each ad impression as either legitimate or fraudulent.
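Sessionization as described above reduces to gap-based grouping. The sketch below is an illustration, not Integral's implementation (the 30-minute gap is an assumption consistent with the 15-30 minute spans mentioned): a new session starts whenever the silence since the previous request from the same browser exceeds the gap.

```python
def sessionize(request_times, gap=30 * 60):
    """Group one browser's request timestamps (epoch seconds) into
    sessions: a new session starts when the gap since the previous
    request exceeds `gap` seconds."""
    sessions = []
    current = []
    last_ts = None
    for ts in sorted(request_times):
        if last_ts is not None and ts - last_ts > gap:
            sessions.append(current)
            current = []
        current.append(ts)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

# Five requests over ~20 minutes form one session; a request over an
# hour later starts a new one.
times = [0, 300, 600, 900, 1200, 5000]
print([len(s) for s in sessionize(times)])  # [5, 1]
```

In a streaming system the open question is when a session can be emitted (you never know for sure the browser is done), which is exactly why batch sessionization was easy and streaming sessionization needs timeout state.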
We've been doing both (as well as all other analytics) in batch mode once an hour at best. Both processes, and, in particular, fraud detection, are time sensitive and much more meaningful if done in near-real-time.
This talk is about our experience migrating a once-per-day offline batch processing of impression data using Hadoop to in-memory stream processing using Kafka, Storm and Cassandra. We will touch upon our choices and our reasoning for selecting the products used for this solution.
Hadoop is no longer the only or always preferred option in Big Data space. In-memory stream processing may be more effective for time series data preparation and aggregation. Ability to scale at a significantly lower cost means more customers, better accuracy and better business practices: since only in-stream processing allows for low-latency data and insight delivery it opens entirely new opportunities. However, transitioning of non-trivial data pipelines raises a number of questions hidden previously within the offline nature of batch processing. How will you join several data feeds? How will you implement failure recovery? In addition to handling terabytes of data per day our streaming system has to be guided by the following considerations:
• Recovery time
• Time relativity and continuity
• Geographical distribution of data sources
• Limit on data loss
• Maintainability
The system produces complex cross-correlational analysis of several data feeds and aggregation for client analytics with input feed frequency of up to 100K msg/sec.
This presentation will benefit anyone interested in an alternative approach to big data analytics, especially the process of joining multiple streams in memory using Cassandra. It will also highlight certain optimization patterns that can be useful in similar situations.
This document discusses different frameworks for big data processing at ResearchGate, including Hive, MapReduce, and Flink. It provides an example of using Hive to find the top 5 coauthors for each author based on publication data. Code snippets in Hive SQL and Java are included to implement the top k coauthors user defined aggregate function (UDAF) in Hive. The document evaluates different frameworks based on criteria like features, performance, and usability.
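The top-k-coauthors computation described above can be expressed compactly in Python. This is a pure-Python stand-in for the Hive UDAF, on made-up data: count how often each pair of authors shares a paper, then keep the k most frequent coauthors per author.

```python
from collections import Counter
from itertools import permutations

def top_coauthors(papers, k=5):
    """papers: list of author lists. Returns author -> up to k most
    frequent coauthors, mirroring the top-k coauthors UDAF."""
    counts = {}  # author -> Counter of coauthors
    for authors in papers:
        # each ordered pair on a paper counts once for the first author
        for a, b in permutations(set(authors), 2):
            counts.setdefault(a, Counter())[b] += 1
    return {a: [co for co, _ in c.most_common(k)] for a, c in counts.items()}

papers = [["ann", "bob"], ["ann", "bob"], ["ann", "carl"]]
result = top_coauthors(papers, k=5)
print(result["ann"])  # ['bob', 'carl']
```

In Hive the grouping by author is done by `GROUP BY` and the per-group top-k ranking is what the custom UDAF contributes; the sketch collapses both steps into one pass.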
Some notes about Spark Streaming's positioning given the current players: Beam, Flink, Storm et al. Helpful if you have to choose a streaming engine for your project.
Distributed Stream Processing – Spark Summit East 2017 – Petr Zapletal
The document discusses distributed stream processing frameworks. It provides an overview of frameworks like Storm, Spark Streaming, Samza, Flink, and Kafka Streams. It compares aspects of different frameworks like programming models, delivery guarantees, fault tolerance, and state management. General guidelines are given for choosing a framework based on needs like latency requirements and state needs. Storm and Trident are recommended for low latency tasks while Spark Streaming and Flink are more full-featured but have higher latency. The document provides code examples for word count in different frameworks.
Ufuk Celebi – Stream & Batch Processing in One System (Flink Forward)
The document describes the architecture and execution model of Apache Flink. Flink uses a distributed dataflow model where a job is represented as a directed acyclic graph of operators. The client submits this graph to the JobManager, which schedules tasks across TaskManagers. Tasks communicate asynchronously through data channels to process bounded and unbounded data in a pipelined fashion.
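The scheduling constraint implied by the DAG model can be shown with a small sketch (invented operator names, not Flink's scheduler): a scheduler must deploy tasks in an order where every operator's upstream producers are scheduled before it, i.e. a topological order of the job graph.

```python
def schedule(dag):
    """Topologically order a job graph given as
    operator -> list of downstream operators (Kahn's algorithm)."""
    indegree = {op: 0 for op in dag}
    for downstream in dag.values():
        for op in downstream:
            indegree[op] += 1
    ready = sorted(op for op, d in indegree.items() if d == 0)
    order = []
    while ready:
        op = ready.pop(0)
        order.append(op)
        for nxt in dag[op]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order

job = {"source": ["map"], "map": ["window"], "window": ["sink"], "sink": []}
print(schedule(job))  # ['source', 'map', 'window', 'sink']
```

Flink's actual scheduler does more (it deploys pipelined regions together so tasks can exchange data while running), but the acyclic-graph ordering is the structural invariant both rely on.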
In our data-driven world, the need for speed has never been greater. The advent of Flink has no doubt paved the way for faster and more efficient data delivery solutions; however, it is not without its costs. The amount of time, talent and resources required to effectively manipulate streams and conduct analysis at scale is far from trivial, and it can be especially daunting to the uninitiated or technically challenged. In an effort to make scalable stream processing more readily accessible to the world, Cogility Software created Cogynt: a zero-coding analytics platform for the masses. Cogynt enables engineers and non-engineers alike to manipulate and analyze streams at an abstracted level, while leveraging the power of Flink and Kafka under the hood to declaratively build complex Flink jobs. Shielding the analyst from low-level system configuration and programming APIs creates an environment where analysts can focus on what's most important to their businesses – the data. This session will demonstrate, from a data science perspective, how Cogynt can easily do almost anything Flink users can do with code, and more!
Always On: Building Highly Available Applications on Cassandra – Robbie Strickland
Cassandra was built from the ground up to enable linearly scalable, always-on applications. But the path to high availability has many land mines that can mean failure for the inexperienced user. In this talk, I will offer practical advice on how to achieve 100% uptime on millions of transactions per second. I'll address all aspects of the topic, including deployment, configuration, application design, and operations.
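One of the "land mines" behind Cassandra availability is consistency-level arithmetic, which is simple enough to state as code. This is a generic illustration of the standard quorum rule, not tied to any specific configuration from the talk: with replication factor N, reads contacting R replicas and writes contacting W replicas overlap on at least one up-to-date replica exactly when R + W > N.

```python
def read_your_writes(n, r, w):
    """Quorum overlap rule: with replication factor n, a read of r
    replicas is guaranteed to see the latest acknowledged write of
    w replicas iff the read and write sets must intersect."""
    return r + w > n

# QUORUM reads and writes on RF=3: 2 + 2 > 3, so reads see the
# latest acknowledged write. ONE/ONE (1 + 1 = 2) gives no such
# guarantee, trading consistency for availability and latency.
print(read_your_writes(3, 2, 2))  # True
print(read_your_writes(3, 1, 1))  # False
```

Choosing R and W per query is how an always-on application tunes the consistency/availability trade-off without changing the data model.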
This document provides an overview of resource aware scheduling in Apache Storm. It discusses the challenges of scheduling Storm topologies at Yahoo scale, including increasing heterogeneous clusters, low cluster utilization, and unbalanced resource usage. It then introduces the Resource Aware Scheduler (RAS) built for Storm, which allows fine-grained resource control and isolation for topologies through APIs and cgroups. Key features of RAS include pluggable scheduling strategies, per user resource guarantees, and topology priorities. Experimental results from Yahoo Storm clusters show significant improvements to throughput and resource utilization with RAS. Future work may include improved scheduling strategies and real-time resource monitoring.
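The fine-grained placement problem RAS solves can be illustrated with a first-fit sketch. This is a deliberately naive stand-in (RAS's actual strategies are pluggable and smarter; the node and executor numbers are invented): place each executor on the first node with enough free CPU and memory, and report executors that cannot be placed.

```python
def assign(executors, nodes):
    """First-fit resource-aware placement.
    executors: list of (name, cpu, mem) demands.
    nodes: node name -> [free_cpu, free_mem] (mutated as we place).
    Returns name -> node, or None if nothing fits."""
    placement = {}
    for name, cpu, mem in executors:
        placement[name] = None
        for node, free in nodes.items():
            if free[0] >= cpu and free[1] >= mem:
                free[0] -= cpu
                free[1] -= mem
                placement[name] = node
                break
    return placement

nodes = {"n1": [2.0, 4096], "n2": [4.0, 8192]}
executors = [("spout", 1.0, 2048), ("bolt-a", 2.0, 4096), ("bolt-b", 3.0, 2048)]
print(assign(executors, nodes))
# {'spout': 'n1', 'bolt-a': 'n2', 'bolt-b': None}
```

The unplaceable `bolt-b` shows why per-topology resource declarations matter: without them a scheduler packs by slot count alone and only discovers the CPU shortfall at runtime.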
The document discusses the evolution of Ceilometer, an OpenStack project that collects measurements from deployed clouds and persists the data for later retrieval and analysis. It describes how Ceilometer has scaled out its data collection capabilities over time by adding agents, partitioning workloads, and integrating with Gnocchi to provide more efficient time-series storage. The document also provides best practices for Ceilometer deployment and configuration to optimize data collection, storage and querying.
Netflix Keystone – How Netflix Handles Data Streams up to 11M Events/Sec – Peter Bakas
Talk on Netflix Keystone by Peter Bakas at SF Data Engineering Meetup on 2/23/2016.
Topics covered:
- Architectural design and principles for Keystone
- Technologies that Keystone is leveraging
- Best practices
http://www.meetup.com/SF-Data-Engineering/events/228293610/
Who: Karthik Ramasamy (@karthikz)
Date: September 20, 2016
Event: #TwitterRealTime
This slide deck consists of presentations from various teams about Twitter's real time infrastructure, the components it uses, and how they function. It includes presentations from David Rusek (@davidrusek), Maosong Fu (@Louis_Fumaosong), Sandy Strong (@st5are), and Yimin Tan (@YiminTan_Kevin).
Spark Streaming allows processing of live data streams using Spark. It works by receiving data streams, chopping them into batches, and processing the batches using Spark. This presentation covered Spark Streaming concepts like the lifecycle of a streaming application, best practices for aggregations, operationalization through checkpointing, and achieving high throughput. It also discussed debugging streaming jobs and the benefits of combining streaming with batch, machine learning, and SQL processing.
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo PeriniFlink Forward
World's toughest and most interesting analysis tasks lie at the intersection of graph data (inter-dependencies in data) and deep learning (inter-dependencies in the model). Classical graph embedding techniques have for years occupied research groups seeking how complex graphs can be encoded into a low-dimensional latent space. Recently, deep learning has dominated the space of embeddings generation due to its ability to automatically generate embeddings given any static graph.
Grapharis is a project that revitalizes the concept of graph embeddings, yet it does so in a real setting were graphs are not static but keep changing over time (think of user interactions in social networks). More specifically, we explored how a system like Flink can be used to simplify both the process of training a graph embedding model incrementally but also make complex inferences and predictions in real time using graph structured data streams. To our knowledge, Grapharis is the first complete data pipeline using Flink and Tensorflow for real-time deep graph learning. This talk will cover how we can train, store and generate embeddings continuously and accurately as data evolves over time without the need to re-train the underlying model.
QConSF 2014 talk on Netflix Mantis, a stream processing systemDanny Yuan
Justin and I gave this talk in QCon SF 2014 about the Mantis, a stream processing system that features a reactive programming API, auto scaling, and stream locality
Distributed Real-Time Stream Processing: Why and How 2.0Petr Zapletal
The demand for stream processing is increasing a lot these day. Immense amounts of data has to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, Internet of things, system monitoring, and many other examples.
In this talk we are going to discuss various state of the art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs and their intended use-cases. Apart of that, I’m going to speak about Fast Data, theory of streaming, framework evaluation and so on. My goal is to provide comprehensive overview about modern streaming frameworks and to help fellow developers with picking the best possible for their particular use-case.
Self-managed and automatically reconfigurable stream processingVasia Kalavri
With its superior state management and savepoint mechanism, Apache Flink is unique among modern stream processors in supporting minimal-effort job reconfiguration. Savepoints are being extensively used to enable dynamic scaling, bug fixing, upgrades, and numerous other reconfiguration use-cases, all while preserving exactly-once semantics. However, when it comes to dynamic scaling, the burden of reconfiguration decisions -when and how much to scale- is currently placed on the user.
In this talk, I share our recent work at ETH Zurich on providing support for self-managed and automatically reconfigurable stream processing. I present SnailTrail (NSDI’18), an online critical path analysis module that detects bottlenecks and provides insights on streaming application performance, and DS2 (OSDI’18), an automatic scaling controller which identifies optimal backpressure-free configurations and operates reactively online. Both SnailTrail and DS2 are integrated with Apache Flink and publicly available. I conclude with evaluation results, ongoing work, and and future challenges in this area.
Scalable Realtime Analytics with declarative SQL like Complex Event Processin...Srinath Perera
This document discusses real-time analytics and introduces WSO2 Complex Event Processing (CEP) as a SQL-like language for real-time analytics. It describes how CEP can be used to define filtering, aggregation, pattern matching, and other common operations on streaming data. It also discusses how CEP queries can be scaled out across multiple nodes by partitioning streams and queries. CEP provides an easy to use yet powerful way to perform real-time analytics on streaming data at scale.
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Alexey Kharlamov
At Integral, we process heavy volumes of click-stream traffic. 50K QPS of ad impressions at peak and close to 200K QPS of all browser calls. We build analytics on this streams of data. There are two applications which require quite significant computational effort: 'sessionization' and fraud detection.
Sessionization implies linking a series of requests from same browser into single record. There can be 5 or more total requests spread over 15-30 minutes which we need to link to each other.
Fraud detection is a process looking at various signals in browser requests and at substantial historical evidence data classifying ad impression either as legitimate or as fraudulent.
We've been doing both (as well as all other analytics) in batch mode once an hour at best. Both processes, and, in particular, fraud detection, are time sensitive and much more meaningful if done in near-real-time.
This talk would be about our experience migrating a once-per-day offline batch processing of impression data using hadoop to in-memory stream processing using Kafka, Storm and Cassandra. We will touch upon our choices and our reasoning for selecting the products used for this solution.
Hadoop is no longer the only or always preferred option in Big Data space. In-memory stream processing may be more effective for time series data preparation and aggregation. Ability to scale at a significantly lower cost means more customers, better accuracy and better business practices: since only in-stream processing allows for low-latency data and insight delivery it opens entirely new opportunities. However, transitioning of non-trivial data pipelines raises a number of questions hidden previously within the offline nature of batch processing. How will you join several data feeds? How will you implement failure recovery? In addition to handling terabytes of data per day our streaming system has to be guided by the following considerations:
• Recovery time
• Time relativity and continuity
• Geographical distribution of data sources
• Limit on data loss
• Maintainability
The system produces complex cross-correlational analysis of several data feeds and aggregation for client analytics with input feed frequency of up to 100K msg/sec.
This presentation will benefit anyone interested in learning an alternate approach for big data analytics, especially the process of joining multiple streams in memory using Cassandra. Presentation will also highlight certain optimization patterns used those can be useful in similar situations.
This document discusses different frameworks for big data processing at ResearchGate, including Hive, MapReduce, and Flink. It provides an example of using Hive to find the top 5 coauthors for each author based on publication data. Code snippets in Hive SQL and Java are included to implement the top k coauthors user defined aggregate function (UDAF) in Hive. The document evaluates different frameworks based on criteria like features, performance, and usability.
Some notes about spark streming positioning give the current players: Beam, Flink, Storm et al. Helpful if you have to choose an Streaming engine for your project.
Distributed Stream Processing - Spark Summit East 2017Petr Zapletal
The document discusses distributed stream processing frameworks. It provides an overview of frameworks like Storm, Spark Streaming, Samza, Flink, and Kafka Streams. It compares aspects of different frameworks like programming models, delivery guarantees, fault tolerance, and state management. General guidelines are given for choosing a framework based on needs like latency requirements and state needs. Storm and Trident are recommended for low latency tasks while Spark Streaming and Flink are more full-featured but have higher latency. The document provides code examples for word count in different frameworks.
Ufuc Celebi – Stream & Batch Processing in one SystemFlink Forward
The document describes the architecture and execution model of Apache Flink. Flink uses a distributed dataflow model where a job is represented as a directed acyclic graph of operators. The client submits this graph to the JobManager, which schedules tasks across TaskManagers. Tasks communicate asynchronously through data channels to process bounded and unbounded data in a pipelined fashion.
In our data-driven world, the need for speed has never been greater. The advent of Flink has no doubt paved the way for faster and more efficient data delivery solutions; however, it is not without its costs. The amount of time, talent and resources required to effectively manipulate streams and conduct analysis at scale is far from trivial, and it can be especially daunting to the uninitiated or technically challenged. In an effort to make scalable stream processing more readily accessible to the world, Cogility Software created Cogynt: a zero-coding analytics platform for the masses. Cogynt enables engineers and non-engineers alike to manipulate and analyze streams on an abstracted level, while leveraging the power of Flink and Kafka under the hood to declaratively build complex Flink jobs. Shielding the analyst from low-level system configuration and programming API’s lends itself to creating an environment where analysts can focus on what’s most important to their businesses – the data. This session will demonstrate, from a data science perspective, how Cogynt can easily do almost anything Flink users can do with code, and more!
Always On: Building Highly Available Applications on Cassandra – Robbie Strickland
Cassandra was built from the ground up to enable linearly scalable, always-on applications. But the path to high availability has many land mines that can mean failure for the inexperienced user. In this talk, I will offer practical advice on how to achieve 100% uptime on millions of transactions per second. I'll address all aspects of the topic, including deployment, configuration, application design, and operations.
This document provides an overview of resource aware scheduling in Apache Storm. It discusses the challenges of scheduling Storm topologies at Yahoo scale, including increasing heterogeneous clusters, low cluster utilization, and unbalanced resource usage. It then introduces the Resource Aware Scheduler (RAS) built for Storm, which allows fine-grained resource control and isolation for topologies through APIs and cgroups. Key features of RAS include pluggable scheduling strategies, per user resource guarantees, and topology priorities. Experimental results from Yahoo Storm clusters show significant improvements to throughput and resource utilization with RAS. Future work may include improved scheduling strategies and real-time resource monitoring.
The document discusses the evolution of Ceilometer, an OpenStack project that collects measurements from deployed clouds and persists the data for later retrieval and analysis. It describes how Ceilometer has scaled out its data collection capabilities over time by adding agents, partitioning workloads, and integrating with Gnocchi to provide more efficient time-series storage. The document also provides best practices for Ceilometer deployment and configuration to optimize data collection, storage and querying.
Netflix Keystone – How Netflix Handles Data Streams up to 11M Events/Sec – Peter Bakas
Talk on Netflix Keystone by Peter Bakas at SF Data Engineering Meetup on 2/23/2016.
Topics covered:
- Architectural design and principles for Keystone
- Technologies that Keystone is leveraging
- Best practices
http://www.meetup.com/SF-Data-Engineering/events/228293610/
Who: Karthik Ramasamy (@karthikz)
Date: September 20, 2016
Event: #TwitterRealTime
This slide deck consists of presentations from various teams about Twitter's real time infrastructure, the components it uses, and how they function. It includes presentations from David Rusek (@davidrusek), Maosong Fu (@Louis_Fumaosong), Sandy Strong (@st5are), and Yimin Tan (@YiminTan_Kevin).
Flink Forward Berlin 2018: Shriya Arora – "Taming large-state to join dataset..." – Flink Forward
Streaming engines like Apache Flink are redefining ETL and data processing. Data can be extracted, transformed, filtered and written out in real-time with an ease matching that of batch processing. However, the real challenge in matching the prowess of batch ETL lies in doing joins, maintaining state, and letting data be paused or rested dynamically. Netflix has a microservices architecture. Different microservices serve and record different kinds of user interactions with the product. Some of these live services generate millions of events per second, all carrying meaningful but often partial information. Things start to get exciting when we want to combine the events coming from one high-traffic microservice with another's. Joining these raw events generates rich datasets that are used to train the machine learning models that serve Netflix recommendations. Historically we have joined these large-volume datasets in batch. But we asked ourselves: if the data is being generated in real time, why not process it downstream in real time? Why wait a full day for information from an event that was generated a few minutes ago? In this talk, we will share how we solved a complex join of two high-volume event streams using Flink. We will talk about maintaining large state, fault tolerance of a stateful application, and strategies for failure recovery.
The document discusses Netflix's distributed impression store that captures user impressions and recommendations at scale. Billions of entries are stored daily to track what content users have seen. An exponential moving average model is used to represent impression counts over multiple time windows in a compact way. The data is stored in a distributed cache-like system that uses memory and external storage for scale and persistence. The system replicates data across regions for availability.
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale – Seunghyun Lee
Pinot is an open source distributed OLAP data store designed for low latency analytics on large datasets. It is used at LinkedIn for various real-time analytics applications requiring sub-second latency on billions of events daily. Pinot uses a columnar data format, inverted indexes, encoding, and star tree indexes to enable fast filtering and aggregation. It also supports both batch and real-time ingestion from streaming data sources like Kafka.
Story of migrating event pipeline from batch to streaming – lohitvijayarenu
The document summarizes Twitter's migration of its 4 trillion event log pipeline from batch to streaming processing using Apache technologies. Key aspects include:
1. Twitter aggregated 10PB of event logs across millions of clients into categories stored hourly on HDFS.
2. They designed a log pipeline in Google Cloud Platform using PubSub for storage, Dataflow jobs to stream to destinations like BigQuery and GCS, and a client library for uniform event publishing.
3. The pipeline supports streaming 4+ trillion events per day between Twitter datacenters and Google Cloud at sub-second latency while ensuring data integrity.
Keystone event processing pipeline on a dockerized microservices architecture – Zhenzhong Xu
The document provides an overview of Keystone, Netflix's event processing pipeline. Some key points:
- Keystone is a collection of microservices and components that form a single, self-contained logical service for processing over 500 billion events generated daily at Netflix.
- It acts as a self-scaling, multi-tenant event processing pipeline that embraces continuous integration/continuous delivery to be self-healing and cloud failure tolerant.
- The routing infrastructure uses Zookeeper for instance assignment and checkpoints to clusters stored in S3 for at-least-once delivery semantics under failure conditions.
- The control plane handles container resource allocation, scheduling, and cluster orchestration and deployments.
- Current
Evolution of Real-time User Engagement Event Consumption at Pinterest – Hosted by Confluent
"We will discuss how we at Pinterest transformed real time user engagement event consumption.
Every day, we log hundreds of billions of user engagement events across different domains to a few common Kafka topics which are consumed by hundreds of real time applications. These real time applications were built upon diverged frameworks (e.g. Spark Streaming, Storm, Flink, and internally developed frameworks using Kafka Consumer API) without standardization on processing logics. It led to repeated processing of similar logic, multiple codebases to maintain, low data quality, and inconsistency with offline datasets. These negatively impact scalability, reliability, efficiency and data accuracy of these applications and eventually affect the real-time content recommendation quality and user experience.
To address these challenges, we unified the way of consuming events in our real time applications by consolidating the compute engines to Flink, splitting events in those common topics by engagement types, generating cleansed events with standardized processing to align on business concepts. Throughout these efforts, we achieved multi-million dollar infrastructure savings and double-digit engagement gain after applications adopted those cleansed events.
Moving forward, we are implementing frameworks for better tracking and governing the Kafka events and real time use cases."
Mirko Damiani – An Embedded soft real time distributed system in Go – linuxlab_conf
An embedded system usually involves low level languages like C and highly customized hardware. In this talk we will see a use case of a soft real time system which was developed taking a very different approach, written in Go. We will see what are the advantages of this choice, along with its limits.
BDA403: How Netflix Monitors Applications in Real-time with Amazon Kinesis – Amazon Web Services
Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this talk, we’ll first discuss why Netflix chose Amazon Kinesis Streams over other streaming data solutions like Kafka to address these challenges at scale. We’ll then dive deep into how Netflix uses Amazon Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we will cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this talk, you’ll take away techniques and processes that you can apply to your large-scale networks and derive real-time, actionable insights.
Flink Forward San Francisco 2019: Building Financial Identity Platform using ... – Flink Forward
Building Financial Identity Platform using Apache Flink
To power financial prosperity around the world, Intuit needs to create personalized product experience and new data centric products. Some of these use cases include Enabling 360 Customer View for Personalization and Targeting, building Ecosystem for Data Exchange between internal and 3rd party and personalize financial offerings, creating platform for Personalized security experience based on risk factors of people and devices.
Unlike workflow centric products (for example, tax processing, accounting transactions), these use cases are often information-intensive and require real-time access to a large amount of connected data associated with people, organizations and things they own.
To achieve this, we have created a platform called Unified Profile Service utilizing Flink. This platform is intended to provide the strategic data asset of a trusted, real-time, unified and connected view of people, organization and things they own. We have abstracted re-usable components such as sources, sinks, transformations etc and created a template. Utilizing this template our Product teams are able to rapidly test domain specific transformations and computations by creating and deploying Flink Jobs. This platform is running in production on AWS EMR, powering multiple use cases, ingesting and processing billions of events per day.
In this talk, we will be discussing the design details of this Platform built leveraging Flink and Flink APIs as well as challenges faced along the way. We will begin by talking about the various components of the pipeline such as Identity Stitching, Entity Resolution, Reconciliation and Data Persistence. We will then dig in to the technical details of how we abstracted away these common components and created a template. We will also talk about how we update Consumer’s Financial Identity Graph in real-time through custom built AWS Dynamodb and Neptune Sink using Flink’s Connector API.
Finally we will touch on lessons learnt along the way as we deployed the platform in production and offer advice on things to avoid as well as how to take things to the next level.
The need for gleaning answers from data in real-time is moving from nicety to a necessity. There are few options to analyze the never-ending stream of unbounded data at scale. Let’s compare and contrast the core principles and technologies the different open source solutions available to help with this endeavor, and where in the future processing engines need to evolve to solve processing needs at scale. These findings are based on the experience of continuing to build a scalable solution in the cloud to process over 700 billion events at Netflix, and how we are embarking on the next journey to evolve unbounded data processing engines.
Pinot: Realtime OLAP for 530 Million Users – SIGMOD 2018 – Seunghyun Lee
Pinot is a real-time OLAP data store that can support multiple analytics use cases like interactive dashboards, site facing queries, and anomaly detection in a single system. It achieves this through features like configurable indexes, dynamic query planning and execution, smart data partitioning and routing, and pre-materialized indexes like star-trees that optimize for latency and throughput across different workloads. The document discusses Pinot's architecture and optimizations that enable it to meet the performance requirements of these different use cases.
Running Flink in Production: The good, The bad and The in Between – Lakshmi ... – Flink Forward
The streaming platform team at Lyft has been running Flink jobs in production for more than a year now, powering critical use cases like improving pickup ETA accuracy, dynamic pricing, generating machine learning features for fraud detection, real-time analytics among many others. Broadly, the jobs fall into two abstraction layers: applications (Flink jobs that run on the native platform) and analytics (that leverage Dryft, Lyft’s fully managed data processing engine). This talk will give an overview of the platform architecture, deployment model and user experience. The talk will also dive deeper into some of the challenges and the lessons that were learnt, running Flink jobs at scale, specifically around scaling Flink connectors, dealing with event time skew (source synchronization) and highlight common patterns of problems observed across several Flink jobs. Finally, the talk will give insights into how we are re-architecting the streaming platform @ Lyft using a Kubernetes based deployment.
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana... – Big Data Spain
This document discusses Apache Flink for IoT event-time stream processing. It begins by introducing streaming architectures and Flink. It then discusses how IoT data has important properties like continuous data production and event timestamps that require event-time based processing. Examples are provided of companies like King and Bouygues Telecom using Flink for billions of events per day with challenges like out-of-order data and flexible windowing. Event-time processing in Flink is able to handle these challenges through features like watermarks.
A Practical Deep Dive into Observability of Streaming Applications with Kosta... – Hosted by Confluent
This document provides an overview of observability of streaming applications using Kafka. It discusses the three pillars of observability - logging, metrics, and tracing. It describes how to expose Kafka client-side metrics using interceptors, metric reporters, and the Spring Boot framework. It demonstrates calculating consumer lag from broker and client-side metrics. It introduces OpenTelemetry for collecting telemetry data across applications and exporting to various backends. Finally, it wraps up with lessons on monitoring consumer lag trends and selecting the right metrics to ship.
Flink Forward Berlin 2017: Dongwon Kim – Predictive Maintenance with Apache F... – Flink Forward
SK telecom shares our experience of using Flink in building a solution for Predictive Maintenance (PdM). Our PdM solution named metatron PdM consists of (1) a Deep Neural Network (DNN)-based prediction model for precise prediction, and (2) a Flink-based runtime system which applies the model to a sliding window on sensor data streams. Efficient handling of multi-sensor streaming data for real-time prediction of equipment condition is a critical component of our product. In this talk, we first show why we choose Flink as a core engine for our streaming use case in which we generate real-time predictions using DNNs trained with Keras on top of TensorFlow and Theano. In addition, we present a comparative study of methods to exploit learning models on JVM such as directly using Python libraries on CPython embedded in JVM, using TensorFlow Java API (including Flink TensorFlow), and making RPC calls to TensorFlow Serving. We then explain how we implement the runtime system using Flink DataStream API, especially with event time, various window mechanisms, timestamp and watermark, custom source and sink, and checkpointing. Lastly, we present how we use the official Flink Docker image for solution delivery and the Flink metric system for monitoring and management of our solution. We hope our use case sets a good example of building a DNN-based streaming solution using Flink.
2. My talks @FlinkForward
Flink Forward 2015: A Comparative Performance Evaluation of Flink
Flink Forward 2017: Predictive Maintenance with Deep Learning and Flink
Flink Forward 2018: Real-time driving score service using Flink
3. T map, a mobile navigation app by SK telecom (≈ Waze, Google Maps)
Choose a destination from frequent locations, or enter an address or a place name.
4. T map, a mobile navigation app by SK telecom
(screens: multiple route options, driving mode, arriving at destination)
5. Driving score service by T map
"I scored 83 out of 100! Yay!"
Car insurance discount for safe drivers: if you drive safely with T map, automobile insurance premiums go down, e.g. a 10% discount from KB Insurance or DB Insurance.
6. Driving score is based on three factors: speeding, rapid acceleration, and rapid deceleration.
(App screen: my driving score, rank (970k), good/great ratings per factor, and a monthly chart from Apr to Aug.)
7. The three factors are calculated for each driving session
• 6/29 (Fri.) – to SKT Network Operation Center: speeding 0, rapid acc. 0, rapid decel. 0; to Yanghyeon Village: speeding 1, rapid acc. 1, rapid decel. 0
• 6/28 (Thu.) – to SKT Network Operation Center: speeding 1, rapid acc. 1, rapid decel. 0; to Yanghyeon Village: speeding 1, rapid acc. 1, rapid decel. 1
8. The three factors are calculated for each session
• Speeding: driving above the speed limit over a 0.2 km stretch (e.g. my speed 90 km/h where the speed limit is 70 km/h)
• Rapid accel.: a sharp speed increase within 3 sec
• Rapid decel.: a sharp speed decrease within 3 sec
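The detection rules above can be sketched as plain functions over per-second speed samples. This is an illustrative Python model, not SK telecom's production code: the 3-second window and the 0.2 km stretch come from the slide, while the 20 km/h change threshold is an assumed value.

```python
# Illustrative model of the three factors. Assumed: a "rapid" change is a
# speed delta of at least 20 km/h within the slide's 3-second window.

def count_speeding(samples, min_stretch_km=0.2):
    """Count stretches where speed exceeds the limit for at least 0.2 km.
    samples: list of (timestamp_sec, speed_kmh, limit_kmh), one per second."""
    events, stretch_km = 0, 0.0
    for _ts, speed, limit in samples:
        if speed > limit:
            stretch_km += speed / 3600.0  # km driven during this one second
        else:
            if stretch_km >= min_stretch_km:
                events += 1
            stretch_km = 0.0
    return events + (1 if stretch_km >= min_stretch_km else 0)

def count_rapid(speeds_kmh, delta_kmh=20.0, window_sec=3, decel=False):
    """Count speed changes of at least delta_kmh within window_sec seconds."""
    events, i = 0, 0
    while i + window_sec < len(speeds_kmh):
        before, after = speeds_kmh[i], speeds_kmh[i + window_sec]
        change = before - after if decel else after - before
        if change >= delta_kmh:
            events += 1
            i += window_sec  # count each burst only once
        else:
            i += 1
    return events
```

In production these per-session counts are what the session window emits for each drive.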
9. Current client-server architecture
A GPS trajectory is generated for each driving session: a sequence of GPS coordinates (latitude, longitude, altitude) at T1, T2, ..., TN. T map uploads the trajectory to the T map service server, and batch ETL jobs are executed twice a day to calculate the three factors from trajectories, so the driving score comes back one day later (+1 day).
The main drawback: users cannot see today's driving scores until tomorrow. (Example: an 11-min drive to SKT Network Operation Center: speeding 1, rapid acc. 1, rapid decel. 1.)
10. Migration from batch ETL to streaming processing
Millions of users feed the Service DB through the pipeline; batch processing is replaced by real-time streaming processing.
Goal: let users know driving scores ASAP.
11. Why did we choose Flink?
https://flink.apache.org/introduction.html#features-why-flink
• Exactly-once semantics for stateful computations
• Stream processing and windowing with event-time semantics
• Flexible windowing
• Lightweight fault tolerance
• High throughput and low latency
12. Contents
• Dataflow design and trigger customization: Source → JSON parser → BoundedOutOfOrdernessTimestampExtractor (BOOTE) → key-based exchange on user ID + destination → session window with a custom trigger → Sink, with Kafka on both ends (User → Kafka → Flink → Kafka → Service DB)
• Instrumentation with Prometheus: define metrics, collect metrics, plot metrics
13. A 12-minute driving session yields 720 GPS coordinates: T map generates a GPS coordinate every second and sends them to the T map service server.
14. T map sends 4 messages to the service server
• Init message
• 1st periodic message (300 coordinates for the first 5 mins)
• 2nd periodic message (300 coordinates for the next 5 mins)
• End message (120 coordinates for the last 2 mins)
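The batching scheme above is easy to model. A minimal Python sketch, assuming a made-up dict layout for messages (the Init/Periodic/End names come from the slides):

```python
# Chunk a per-second trajectory into the 4-message protocol described above:
# an Init message up front, periodic messages of 300 coordinates (5 minutes)
# each, and an End message carrying the remainder.

def batch_trajectory(coords, period=300):
    msgs = [{"type": "Init", "coords": []}]
    for i in range(0, len(coords), period):
        chunk = coords[i:i + period]
        is_last = i + period >= len(coords)
        msgs.append({"type": "End" if is_last else "Periodic", "coords": chunk})
    return msgs
```

For the 12-minute drive (720 coordinates) this yields exactly the four messages on the slide: Init, two Periodic messages of 300 coordinates, and an End message with the final 120.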
15. Return scores right after receiving End messages
Messages from an 11-minute drive to SKT Network Operation Center: a Init (7:08), b Periodic (7:13), c Periodic (7:18), d End (7:20). The service server returns the driving score (speeding 1, rapid acc. 1, rapid decel. 1) to T map right at 7:20.
16. Real-time driving score dataflow using Flink
Logical dataflow: User → Kafka → Source → JSON parser → BoundedOutOfOrdernessTimestampExtractor (BOOTE) → key-based exchange on user ID + destination → session window with a custom trigger (session gap : 1 hour) → Sink (at-least-once Kafka producer) → Kafka → Service DB
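The session window's merge rule can be illustrated stand-alone: events closer together than the gap belong to one session. This pure-Python sketch mirrors what Flink's event-time session windows with a 1-hour gap do per key; it is a model of the grouping rule, not the Flink API.

```python
# Group one user's event times (seconds) into sessions: consecutive events
# less than gap_sec apart stay in the same session, otherwise a new session
# starts. Flink's EventTimeSessionWindows.withGap performs the equivalent
# window merging on the cluster.

def sessionize(event_times, gap_sec=3600):
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] < gap_sec:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions
```

Two drives separated by more than an hour thus land in two separate windows, each scored independently.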
17. Physical dataflow
• Source side: 20 Kafka partitions (p0–p19) are read by 20 source tasks, chained with 20 JSON parser / BOOTE tasks.
• The stream is repartitioned by key (user ID + destination) to 256 session-window tasks with the custom trigger, covering several million users.
• Sink side: 20 sink tasks write to 20 Kafka partitions (p0–p19), from which the Service DB is fed.
18. Session window (gap : 1 hour) with different triggers: the default EventTimeTrigger vs. our EarlyResultEventTimeTrigger, compared on a timeline in the next slides.
19. With the default EventTimeTrigger, the session {a Init (7:08), b Periodic (7:13), c Periodic (7:18), d End (7:20)} fires only when the watermark passes the end of the session window at 8:20, a full session gap (1 hour) after the last message, before emitting speeding 1, rapid acc. 1, rapid decel. 1.
20. EarlyResultEventTimeTrigger registers an early timer when the End message d arrives at 7:20 and fires early, instead of waiting for the watermark to reach 8:20. It does NOT fire again at 8:20 when nothing new has arrived after the early fire; the watermark-time firing is kept only as a fallback, necessary in case of out-of-order messages.
21. Out-of-order messages
One Kafka partition can be slow for some reason, so messages belonging to the same driving session (Dongwon's iPhone to SKT NOC: a Init, b Periodic, c Periodic, d End) can reach the session window out of order; here d (End) overtakes c. BOOTE is configured with maxOutOfOrderness (maxOoO) of 1 sec, and the session window uses EarlyResultEventTimeTrigger with a session gap of 1 hour.
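BOOTE's behavior can be sketched independently of Flink: the watermark trails the maximum event time seen so far by maxOutOfOrderness, so an element like c stays on time as long as it arrives within 1 second of the newest timestamp. A minimal Python model:

```python
class BoundedOutOfOrderness:
    """Model of BoundedOutOfOrdernessTimestampExtractor's watermark logic."""

    def __init__(self, max_out_of_orderness_ms=1000):
        self.bound = max_out_of_orderness_ms
        self.max_ts = float("-inf")

    def on_event(self, event_ts_ms):
        # The watermark always trails the highest timestamp seen so far.
        self.max_ts = max(self.max_ts, event_ts_ms)
        return self.max_ts - self.bound  # current watermark

    def is_late(self, event_ts_ms):
        return event_ts_ms < self.max_ts - self.bound
```

For example, after d (End) arrives with timestamp 10 500 ms the watermark is 9 500 ms, so an out-of-order c stamped 9 600 ms is still on time.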
22. How EarlyResultEventTimeTrigger deals with out-of-order messages
• [Case 1] c arrives before the early timer expires
• [Case 2] c arrives after the early timer expires
23. [Case 1] c arrives before the early timer expires: the early fire includes {a Init, b Periodic, d End, c Periodic} and emits a perfect result (speeding 1, rapid acc. 1, rapid decel. 1). The trigger does NOT fire again at the window end, since no messages were added after the last fire.
24. [Case 2] c arrives after the early timer expires: the early fire emits an incomplete result from {a, b, d} (speeding 0, rapid acc. 1, rapid decel. 1). When c arrives late, a second fire at the watermark emits the perfect result from {a, b, c, d} (speeding 1, rapid acc. 1, rapid decel. 1).
25. EarlyResultEventTimeTrigger
https://github.com/eastcirclek/flink-examples/blob/master/src/main/scala/com/github/eastcirclek/flink/trigger/EarlyResultEventTimeTrigger.scala
• [Constructor] Gets an evaluator that determines early firing
• [onElement] Registers an early timer if the evaluator returns true (e.g. when the End message comes in)
• [onEventTime] Fires if the early timer expires
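The callbacks above form a small state machine. This is a stand-alone Python sketch of the logic described on this slide, not the linked Scala implementation; the "FIRE"/"CONTINUE" strings stand in for Flink's TriggerResult values, and the boolean timer flag stands in for Flink's timer service.

```python
class EarlyResultTrigger:
    """Model of EarlyResultEventTimeTrigger: fire early on the End message,
    and fire again at the watermark only if new elements arrived late."""

    def __init__(self, is_end):
        self.is_end = is_end            # evaluator deciding early firing
        self.early_timer_set = False
        self.new_since_last_fire = False

    def on_element(self, element):
        self.new_since_last_fire = True
        if self.is_end(element):
            self.early_timer_set = True  # register an early timer
        return "CONTINUE"

    def on_event_time(self, is_window_end_timer):
        if is_window_end_timer:
            # Watermark passed the window end: fire only if something
            # new (e.g. a late element) arrived after the last fire.
            if self.new_since_last_fire:
                self.new_since_last_fire = False
                return "FIRE"
            return "CONTINUE"
        if self.early_timer_set:         # the early timer expired
            self.early_timer_set = False
            self.new_since_last_fire = False
            return "FIRE"
        return "CONTINUE"
```

Replaying Case 1 and Case 2 from the previous slides against this model reproduces the early fire, the suppressed final fire, and the second fire for a late c.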
26. Contents
• Dataflow design and trigger customization
• Instrumentation with Prometheus: define metrics, collect metrics, plot metrics
28. Individual message statistics
Volume: 1K messages per second, 100M messages per day; 10s of MB per second, 2 TB per day.
An N:1 "message stats extractor" and a "message stats sink" are attached to the logical dataflow; meters and histograms record message rate and message size on both the source and sink sides.
29. Jitter (ingestion time minus event time)
For each message we measure the gap between its event time and its ingestion time at the source. The observed jitter stays within about 1 sec; based on this observation, we use 1 sec for maxOutOfOrderness.
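Picking the bound from observed jitter can be done with a simple percentile. The 99th-percentile choice below is an illustrative assumption, not a method stated on the slide; the principle is just that the bound should cover almost all observed jitter.

```python
def pick_max_out_of_orderness(jitters_ms, percentile=0.99):
    """Return a watermark bound (ms) covering `percentile` of observed
    jitter values, i.e. ingestion time minus event time per message."""
    ordered = sorted(jitters_ms)
    idx = min(len(ordered) - 1, int(len(ordered) * percentile))
    return ordered[idx]
```

With jitter observations that all fall under one second, this yields a bound of about 1 sec, matching the value chosen on the slide.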
32. Our own definition of latency
Latency = processing time of the resultant session output at the "session output stats extractor", minus the ingestion time of the corresponding End message. A "session output stats extractor" and "session output stats sink" are attached after the session window. Considering maxOutOfOrderness is 1 second, Flink takes at most 250 milliseconds.
33. How to expose metrics to Prometheus?
The N:1 message stats extractor/sink and the N:1 session output stats extractor/sink sit alongside the logical dataflow (User → Kafka → Source → JSON parser → BOOTE → key-based session window → Sink → Kafka → Service DB); their meters and histograms need to be exposed to Prometheus.
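One answer (the one shipped with Flink itself) is the built-in PrometheusReporter, which exposes all registered metrics, including user-defined meters and histograms, over HTTP for Prometheus to scrape. A minimal flink-conf.yaml fragment; the port range is an example, one port per TaskManager on a host:

```yaml
# flink-conf.yaml: expose Flink metrics to Prometheus
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249-9260
```

Prometheus is then configured to scrape these endpoints, and the collected metrics can be plotted (e.g. in Grafana) as described in the Contents slide.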