Slides for the Stream Processing Meetup (7/19/2018): https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/251481797/
This presentation introduces the newly developed Samza Runner for Apache Beam. You will see the capabilities of the Samza Runner and how it supports key Beam features. You will also see a few use cases and our future roadmap.
The need for gleaning answers from data in real time is moving from a nicety to a necessity. There are few options for analyzing the never-ending stream of unbounded data at scale. Let's compare and contrast the core principles and technologies of the different open source solutions available to help with this endeavor, and consider where processing engines need to evolve to solve processing needs at scale. These findings are based on the experience of continuing to build a scalable solution in the cloud to process over 700 billion events at Netflix, and on how we are embarking on the next journey to evolve unbounded data processing engines.
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015 (Monal Daxini)
Keystone - Processing over half a trillion events per day, with peaks of 8 million events and 17 GB per second, and at-least-once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state and where the pipeline is headed next: offering a self-service stream processing infrastructure atop the Kafka-based pipeline and supporting Spark Streaming.
Slides from a presentation by Monal Daxini at Disney, Glendale, CA about Netflix Open Source Software, cloud data persistence, and Cassandra best practices.
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec (Peter Bakas)
Talk on Netflix Keystone by Peter Bakas at SF Data Engineering Meetup on 2/23/2016.
Topics covered:
- Architectural design and principles for Keystone
- Technologies that Keystone is leveraging
- Best practices
http://www.meetup.com/SF-Data-Engineering/events/228293610/
Netflix Keystone Pipeline at Samza Meetup 10-13-2015 (Monal Daxini)
The Netflix Keystone Pipeline processes 600 billion events a day; a detailed look at how Samza was modified and used for real-time routing of events, including Docker.
The DAGScheduler is responsible for computing the DAG of stages for a Spark job and submitting them to the TaskScheduler. The TaskScheduler then submits individual tasks from each stage for execution and works with the DAGScheduler to handle failures through task and stage retries. Together, the DAGScheduler and TaskScheduler coordinate the execution of jobs by breaking them into independent stages of parallel tasks across executor nodes.
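As a toy illustration of that stage-cutting logic (a conceptual sketch with hypothetical names, not Spark's actual code), a plan can be split into stages wherever a wide, shuffle-inducing operation occurs, while consecutive narrow operations are pipelined into a single stage:

```python
# Toy sketch: split a linear plan of operations into stages at shuffle
# boundaries, the way the DAGScheduler groups narrow transformations
# into one stage of pipelined tasks.

NARROW = {"map", "filter", "flatMap"}          # pipelined within a stage
WIDE = {"groupByKey", "reduceByKey", "join"}   # force a stage boundary

def split_into_stages(ops):
    """Group consecutive narrow ops; a wide op closes the current stage."""
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in WIDE:           # shuffle needed: end the stage here
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

plan = ["map", "filter", "reduceByKey", "map", "groupByKey", "map"]
print(split_into_stages(plan))
# [['map', 'filter', 'reduceByKey'], ['map', 'groupByKey'], ['map']]
```

Each resulting stage would then be submitted as a set of parallel tasks, with a retry of the stage if a task's shuffle output is lost.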
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019 (Confluent)
Data stream processing is built on the core concept of time. However, understanding time semantics and reasoning about time is not simple, especially if deterministic processing is expected. In this talk, we explain the difference between processing, ingestion, and event time and what their impact is on data stream processing. Furthermore, we explain how Kafka clusters and stream processing applications must be configured to achieve specific time semantics. Finally, we deep dive into the time semantics of the Kafka Streams DSL and KSQL operators and explain in detail how the runtime handles time. Apache Kafka offers many ways to handle time on the storage layer, i.e., the brokers, allowing users to build applications with different semantics. Time semantics in the processing layer, i.e., Kafka Streams and KSQL, are even richer and more powerful, but also more complicated. Hence, it is paramount for developers to understand the different time semantics and to know how to configure Kafka to achieve them. This talk therefore enables developers to design applications with their desired time semantics, helps them reason about runtime behavior with regard to time, and allows them to understand processing/query results.
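The practical difference between event time and processing (arrival) time can be sketched in a few lines of plain Python (a hypothetical toy model, not Kafka Streams code): with out-of-order data, the same records land in different tumbling windows depending on which timestamp is used for grouping.

```python
# Toy model: count records per tumbling window, grouping either by
# event time or by processing (arrival) time. Field names are made up.

def window_counts(records, key, size=10):
    """Count records per tumbling window of `size` time units."""
    counts = {}
    for rec in records:
        win = (rec[key] // size) * size   # start of the window
        counts[win] = counts.get(win, 0) + 1
    return counts

records = [
    {"event": 3,  "arrival": 4},
    {"event": 12, "arrival": 13},
    {"event": 8,  "arrival": 15},   # out of order: old event, late arrival
]
print(window_counts(records, "event"))    # {0: 2, 10: 1}
print(window_counts(records, "arrival"))  # {0: 1, 10: 2}
```

The late record belongs to window [0, 10) by event time but to window [10, 20) by processing time, which is exactly why the two semantics give different, and differently deterministic, results.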
Netflix keystone streaming data pipeline @scale in the cloud - dbtb-2016 (Monal Daxini)
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. We will explore in detail how we leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. We will also share our plans on offering Stream Processing as a Service for all of Netflix to use.
Stream processing in Python with Apache Samza and Beam (Hai Lu)
Apache Samza is the streaming engine used at LinkedIn, processing around 2 trillion messages daily. A while back we announced Samza's integration with Apache Beam, a great success which led to our Samza Beam API. Now comes an upgrade of our APIs: we support stream processing in Python! This work has made stream processing more accessible and enabled many interesting use cases, particularly in the area of machine learning. The Python API is based on our work on the Samza runner for Apache Beam. In this talk, we will quickly review our work on the Samza runner, and then how we extended it to support portability in Beam (Python specifically). In addition to technical and architectural details, we will also talk about how we bridged the Python and Java ecosystems at LinkedIn with the Python API, together with different use cases.
The document provides an overview of Apache Samza, including its key differentiators and future plans. It discusses Samza's performance advantages from using local state instead of remote databases. Samza allows stateful stream processing and incremental checkpointing for applications with terabytes of state. It supports a variety of input sources, processing as a service on YARN or embedded as a library. Upcoming features include a high-level API, support for event time windows, pipelines, and exactly-once processing while auto-scaling local state.
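The local-state idea above can be sketched in a few lines of plain Python (all names hypothetical; Samza actually uses RocksDB and changelog-based incremental checkpoints): keyed state lives next to the task instead of behind a remote database call, and the input offset is recorded with each checkpoint so state and position can be restored together after a failure.

```python
# Toy sketch of a stateful stream task with local keyed state and a
# checkpoint that couples the state snapshot with the input offset.

class StatefulTask:
    def __init__(self):
        self.state = {}     # local keyed store (no remote round trip)
        self.offset = 0     # last processed input offset

    def process(self, key, value):
        # Local read-modify-write; a remote DB would need a network call here.
        self.state[key] = self.state.get(key, 0) + value
        self.offset += 1

    def checkpoint(self):
        # Samza checkpoints incrementally; we snapshot fully for brevity.
        return {"state": dict(self.state), "offset": self.offset}

task = StatefulTask()
for key, value in [("a", 1), ("b", 2), ("a", 3)]:
    task.process(key, value)
cp = task.checkpoint()
print(cp)  # {'state': {'a': 4, 'b': 2}, 'offset': 3}
```

On recovery, a task restored from `cp` would resume reading at offset 3 with the matching state, which is what makes local state safe despite machine failures.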
Unified Stream Processing at Scale with Apache Samza - BDS2017 (Jacob Maes)
The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day. Many of these are new applications, but there have also been more migrations from existing online and offline applications. To support the influx of new use cases, we have improved the flexibility, efficiency and reliability of Apache Samza.
In this talk, we will take a brief look at the broader streaming ecosystem at LinkedIn, then we will zoom in on a few representative use cases and explain how they are powered by recent advancements to Apache Samza including a unified high level API, flexible deployment model, batch processing, and more.
Function Mesh for Apache Pulsar, the Way for Simple Streaming Solutions (StreamNative)
Pulsar Function is a succinct computing abstraction that Apache Pulsar provides to express simple ETL and streaming tasks. The simplicity is twofold: a simple interface and simple deployment. As it has been adopted, we realized that the ability to run natively on the cloud and to integrate multiple functions into one unit are key to user success. We developed a new feature -- Function Mesh -- to support these new requirements.
This talk aims to provide a thorough walkthrough of the new Function Mesh feature, including its design, implementation, use cases, and examples, to help people seeking simple streaming solutions understand this newly created, powerful tool in Apache Pulsar.
Strata Singapore: Gearpump - Real-time DAG Processing with Akka at Scale (Sean Zhong)
Gearpump is an Akka-based real-time streaming engine that uses the Actor model for everything. It offers excellent performance and flexibility: 18,000,000 messages/second with 8 ms latency on a cluster of 4 machines.
Apache Samza is a distributed stream processing framework that uses Kafka for messaging and YARN to provide fault tolerance, processor isolation, security, and resource management.
Why @Loggly Loves Apache Kafka, and How We Use Its Unbreakable Messaging for ... (SolarWinds Loggly)
Agenda for this Presentation
• The challenges of Log Management at scale
• Overview of Loggly’s processing pipeline
• Alternative technologies considered
• Why we love Apache Kafka
• How Kafka has added flexibility to our pipeline

The Challenges of Log Management at Scale
• Big data
– >750 billion events logged to date
– Sustained bursts of 100,000+ events per second
– Data space measured in petabytes
• Need for high fault tolerance
• Near real-time indexing requirements
• Time-series index management
This document discusses building stream processing as a service (SPaaS) using Apache Flink. It introduces Flink's stream processing capabilities and describes how to build a SPaaS offering with different levels of complexity and ease of use. It also covers the Keystone router for simple stream routing, building custom Flink jobs, and techniques for recovering from failures using backfill from Hive or rewinding the Flink job.
Hai Lu presented on the Samza Portable Runner for Apache Beam. The key points are:
1) The Samza Portable Runner allows stream processing to be done in multiple languages like Python by translating Beam pipelines into the Samza execution engine.
2) It provides a high-level Python SDK for building streaming applications on top of Beam's portability framework. Pipelines are translated from Python into the language-independent Beam representation.
3) Performance is improved through batching/bundling messages between the Python and Java processes to reduce round trips. Initial tests showed throughput increasing with larger bundle sizes.
4) Example use cases demonstrated near real-time image OCR, model training,
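The bundling optimization from point 3 can be illustrated with a toy sketch (hypothetical names, not Beam's actual runner code): instead of one cross-process call per message between the Java runner and the Python worker, messages are grouped into fixed-size bundles so each round trip is amortized over many elements.

```python
# Toy sketch: chunk a message stream into bundles so that cross-process
# round trips grow with the number of bundles, not the number of messages.

def bundles(messages, bundle_size):
    """Yield messages in chunks of at most bundle_size."""
    for i in range(0, len(messages), bundle_size):
        yield messages[i:i + bundle_size]

messages = list(range(10))
round_trips = sum(1 for _ in bundles(messages, 4))
print(round_trips)  # 3 round trips instead of 10
```

This matches the reported behavior that throughput rises as the bundle size grows: per-call overhead is paid once per bundle rather than once per message.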
M|18 Choosing the Right High Availability Strategy for You (MariaDB plc)
This document discusses MariaDB high availability strategies including replication, failover, and clustering. It defines key HA terminology and describes different replication topologies like asynchronous, semi-synchronous, and synchronous replication using Galera cluster. Use cases provided show how geographically distributed and production control systems benefit from MariaDB HA features.
Administrative techniques to reduce Kafka costs | Anna Kepler, Viasat (HostedbyConfluent)
As your Kafka clusters grow, so does the cost associated with them. As administrators, we have to ensure that the service we support operates in the most reliable way to satisfy customers. However, for our business it is just as important that the same service is cost-efficient. There are two ways to optimize the cost of the service: tuning broker machines and tuning the data transfers. Minimizing data transfer is the largest return on investment, since that is what accounts for the most spend. Using Kafka administrative tools and metrics, we can find multiple ways to reduce the data transfers in the clusters.
The presentation will cover various techniques administrators of a Kafka service can employ to reduce data transfers and save operational costs: reducing cross-AZ traffic, optimizing batching with the DumpLogSegment script, utilizing Kafka metrics to shut down unused data streams, and more.
With the objective of making our Kafka deployment as cost-effective as possible, we have collected money-saving tricks, and we would love to share them with the community.
Reactive mistakes - ScalaDays Chicago 2017 (Petr Zapletal)
Reactive applications are becoming a de facto industry standard and, if employed correctly, toolkits like the Lightbend Reactive Platform make implementation easier than ever. But the design of these systems can be challenging, as it requires a particular mindset shift to tackle problems we may not be used to. In this talk we're going to discuss the most common issues I've seen in the field that prevented applications from working as expected. I'd like to talk about typical pitfalls that might cause trouble, trade-offs that might not be fully understood, and important choices that might be overlooked, including persistent actor pitfalls, tackling network partitions, proper implementation of graceful shutdown or distributed transactions, trade-offs of microservices or actors, and more.
This talk should be interesting for anyone who is thinking about, implementing, or has already deployed a reactive application. My goal is to provide a comprehensive explanation of common problems, to be sure they won't be repeated by fellow developers. The talk is a little more focused on the Lightbend platform, but understanding the concepts we are going to discuss should be beneficial for everyone interested in this field.
Understanding time in structured streaming (datamantra)
This document discusses time abstractions in Structured Streaming. It introduces processing time, event time, and ingestion time. It explains how to use the window API to apply windows over these different time abstractions. It also discusses handling late events using watermarks and implementing non-time-based windows using custom state management and sessionization.
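The watermark mechanic for late events can be sketched in plain Python (a hypothetical toy model, not Spark's implementation): track the maximum event time seen so far, and treat anything older than that maximum minus an allowed delay as too late to update its window.

```python
# Toy watermark: events older than (max event time seen - delay) are
# considered late and dropped, since their window is treated as closed.

def filter_late(event_times, delay):
    max_event_time = 0
    kept, dropped = [], []
    for t in event_times:
        max_event_time = max(max_event_time, t)
        if t >= max_event_time - delay:   # still within the watermark
            kept.append(t)
        else:
            dropped.append(t)             # too late for its window
    return kept, dropped

kept, dropped = filter_late([5, 20, 12, 25, 8], delay=10)
print(kept)     # [5, 20, 12, 25]
print(dropped)  # [8]
```

The trade-off the talk describes follows directly: a larger delay keeps more late data but forces window state to stay open (and in memory) longer.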
Flink Forward San Francisco 2018: Steven Wu - "Scaling Flink in Cloud" (Flink Forward)
Over 109 million subscribers are enjoying more than 125 million hours of TV shows and movies per day on Netflix. This leads to a massive amount of data flowing through our data ingestion pipeline to improve service and user experience. It powers various data analytics cases like personalization, operational insight, and fraud detection. At the heart of this massive data ingestion pipeline is a self-serve stream processing platform that processes 3 trillion events and 12 PB of data every day. We have recently migrated this stream processing platform from Samza to Flink. In this talk, we will share the challenges and issues we ran into when running Flink at scale in the cloud. We will dive deep into the troubleshooting techniques and lessons learned.
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a... (Flink Forward)
Let's be honest: running a distributed stateful stream processor that is able to handle terabytes of state and tens of gigabytes of data per second, while being highly available and correct (in an exactly-once sense), does not work without any planning, configuration, and monitoring. While the Flink developer community tries to make everything as simple as possible, it is still important to be aware of all the requirements and implications. In this talk, we will provide some insights into the greatest operations mysteries of Flink from a high-level perspective:
- Capacity and resource planning: understand the theoretical limits.
- Memory and CPU configuration: distribute resources according to your needs.
- Setting up high availability: planning for failures.
- Checkpointing and state backends: ensure correctness and fast recovery.
For each of the listed topics, we will introduce the concepts of Flink and provide some best practices we have learned over the past years supporting Flink users in production.
Maheedhar Gunturu presented on connecting Kafka message systems with Scylla. He discussed the benefits of message queues like Kafka, including centralized infrastructure, buffering capabilities, and streaming data transformations. He then explained Kafka Connect, which provides a standardized framework for building distributed and scalable connectors. Scylla and Cassandra connectors are available today, with a Scylla shard-aware connector in development.
This document discusses the Pulsar connector for Apache Flink 1.14. It provides an overview of StreamNative, which offers both stream storage with Apache Pulsar and stream processing with Flink. It then covers the timeline of contributions to the Pulsar connector for Flink and how it has evolved. Finally, it describes the design of the new Pulsar source connector for Flink that uses the FLIP-27 source interface, including how it handles Pulsar subscription modes and implements split enumeration, reading, and processing in a way that supports both batch and streaming workloads.
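The split-enumeration pattern mentioned above can be sketched in plain Python (hypothetical names; the real FLIP-27 interface is a Java API with enumerators, readers, and split assignments): a central enumerator discovers the partitions ("splits") and distributes them across a fixed set of parallel readers, which then consume independently.

```python
# Toy sketch of split enumeration: round-robin partition splits across
# a fixed number of parallel source readers.

def assign_splits(splits, num_readers):
    """Assign each split to a reader in round-robin order."""
    assignment = {r: [] for r in range(num_readers)}
    for i, split in enumerate(splits):
        assignment[i % num_readers].append(split)
    return assignment

partitions = ["topic-0", "topic-1", "topic-2", "topic-3", "topic-4"]
print(assign_splits(partitions, 2))
# {0: ['topic-0', 'topic-2', 'topic-4'], 1: ['topic-1', 'topic-3']}
```

Separating enumeration from reading is what lets the same source serve both modes: in batch mode the splits are finite and readers stop when they are exhausted; in streaming mode the enumerator can keep discovering and assigning new splits.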
Using Apache Pulsar as a Modern, Scalable, High Performing JMS Platform - Pus... (StreamNative)
JMS, as the first widely supported enterprise messaging API, has been in the market for close to 20 years and still plays a critical role in many enterprises today. Many mission-critical business applications that follow the JMS (2.0) specification are still running in production on various JMS platforms like ActiveMQ, Tibco EMS, etc.
However, modern business activities have raised new challenges that JMS can't answer very well, such as cross-region message replication, real-time complex event processing, and seamless horizontal scalability. To address these challenges, newer enterprise messaging/streaming technologies like Apache Pulsar are needed.
In this presentation, I will do a deep-dive investigation of how Apache Pulsar can be used as the next-generation unified enterprise messaging/streaming platform that can serve existing JMS applications with minimal code changes. I will also demonstrate JMS-to-Pulsar migration with several concrete use cases and examples.
Streaming in Practice - Putting Apache Kafka in Production (confluent)
This presentation focuses on how to integrate these components into an enterprise environment and what you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei... (Flink Forward)
Pravega is a stream storage system that we designed and built from the ground up for modern day stream processors such as Flink. Its storage layer is tiered and designed to provide low latency for writing and reading, while being able to store an unbounded amount of stream data that eventually becomes cold. We rely on a high-throughput component to store cold stream data, which is critical to enable applications to rely on Pravega alone for storing stream data. Pravega’s API enables applications to manipulate streams with a set of desirable features such as avoiding duplication and writing data transactionally. Both features are important for applications that require exactly-once semantics. This talk goes into the details of Pravega’s architecture and establishes the need for such a storage system.
A walk through the current state of stream processing, the key differentiators that make Samza stand out in the crowd, what's new in Samza, and what's coming next.
Using Apache Pulsar as a Modern, Scalable, High Performing JMS Platform - Pus...StreamNative
JMS, as the first widely-supported enterprise messaging API, has been in the market for close to 20 years and still plays a critical role in many enterprises today. Many mission-critical business applications following the JMS (2.0) specification are still running in production on various JMS platforms like ActiveMQ, TIBCO EMS, etc.
However, modern business activities have raised new challenges that JMS can't answer very well, such as cross-region message replication, real-time complex event processing, and seamless horizontal scalability. To address these challenges, newer enterprise messaging/streaming technologies like Apache Pulsar are needed.
In this presentation, I will do a deep dive investigation on how Apache Pulsar can be used as the next generation unified enterprise messaging/streaming platform that can serve existing JMS applications with very minimum code changes. I will also demonstrate JMS to Pulsar migration with several concrete use cases and examples.
Streaming in Practice - Putting Apache Kafka in Productionconfluent
This presentation focuses on how to integrate all these components into an enterprise environment and what things you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...Flink Forward
Pravega is a stream storage system that we designed and built from the ground up for modern day stream processors such as Flink. Its storage layer is tiered and designed to provide low latency for writing and reading, while being able to store an unbounded amount of stream data that eventually becomes cold. We rely on a high-throughput component to store cold stream data, which is critical to enable applications to rely on Pravega alone for storing stream data. Pravega’s API enables applications to manipulate streams with a set of desirable features such as avoiding duplication and writing data transactionally. Both features are important for applications that require exactly-once semantics. This talk goes into the details of Pravega’s architecture and establishes the need for such a storage system.
A walk through the current state of stream processing, the key differentiators which make Samza stand out in the crowd, what's new in Samza, and what's coming next.
This document discusses data streaming and stream processing using Kafka. It defines data streaming as continuously generated data from many sources sent simultaneously in small sizes. Stream processing applies continuous processing to data streams to produce instant analytics or trigger events. Kafka is presented as a streaming framework that can reliably process streaming data at large scales through its producers, consumers, and topics. Kafka streams adds stream processing capabilities through a convenient domain-specific language to perform stateless and stateful transformations on streams of data.
Stream processing involves processing unbounded streams of data in near real-time to produce derived data outputs. Samza is a distributed stream processing framework that allows processing of streams at large scale. At LinkedIn, Samza is used to process over 1 trillion events per day across many jobs and clusters for applications like tracking, analytics, and data standardization. Upcoming Samza features include improvements to local state storage, dynamic configuration, easier deployment of standalone jobs, and a high-level query language.
Spark Streaming Recipes and "Exactly Once" Semantics RevisedMichael Spector
This document discusses stream processing with Apache Spark. It begins with an overview of Spark Streaming and its advantages over other frameworks like low latency and rich APIs. It then covers core Spark Streaming concepts like windowing and achieving "exactly once" semantics through checkpointing and write ahead logs. The document presents two examples of using Spark Streaming for analytics and aggregation with transactional and snapshotted approaches. It concludes with notes on deployment with Mesos/Marathon and performance tuning Spark Streaming jobs.
This document summarizes a presentation about near real-time analytics platforms at Uber and LinkedIn. It discusses use cases for streaming analytics, challenges with scalability and operations, and new platforms developed using Apache Samza and SQL. Key points include how Samza is used to build streaming applications with SQL queries, operators, and support for multi-stage workflows. The platforms aim to simplify deployment and management of streaming jobs through interfaces like AthenaX.
Porting a Streaming Pipeline from Scala to RustEvan Chan
How we at Conviva ported a streaming data pipeline in months from Scala to Rust. What are the important human and technical factors in our port, and what did we learn?
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry confluent
Apache Beam (unified Batch and strEAM processing!) is a new Apache incubator project. Originally based on years of experience developing Big Data infrastructure within Google (such as MapReduce, FlumeJava, and MillWheel), it has now been donated to the OSS community at large.
Come learn about the fundamentals of out-of-order stream processing, and how Beam’s powerful tools for reasoning about time greatly simplify this complex task. Beam provides a model that allows developers to focus on the four important questions that must be answered by any stream processing pipeline:
What results are being calculated?
Where in event time are they calculated?
When in processing time are they materialized?
How do refinements of results relate?
Furthermore, by cleanly separating these questions from runtime characteristics, Beam programs become portable across multiple runtime environments, both proprietary (e.g., Google Cloud Dataflow) and open-source (e.g., Flink, Spark, et al).
The need for gleaning answers from unbounded data streams is moving from nicety to a necessity. Netflix is a data driven company, and has a need to process over 1 trillion events a day amounting to 3 PB of data to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so the user can focus on data analysis. I'll share our experience using Flink to help build the platform.
This document summarizes TenMax's data pipeline experience over the years from 2015 to 2017. It describes three versions of the data pipeline used to generate reports from raw event data. Version 1 used MongoDB but had poor write performance. Version 2 used Cassandra and had better write performance using LSM trees, but was costly to operate. Version 3 uses Kafka, Fluentd, Azure Blob storage and Spark to provide a scalable, cost-effective solution that can handle high throughput and complex aggregations. The document also discusses lessons learned around balancing features, costs and technologies like Spark, streaming and serverless models.
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
Spark Structured Streaming vs. Kafka Streams was compared. Spark Structured Streaming runs on a Spark cluster and allows reuse of Spark investments, while Kafka Streams is a Java library that provides low latency continuous processing. Both platforms support stateful operations like windows, aggregations and joins. Spark Structured Streaming supports multiple languages but has higher latency due to micro-batching, while Kafka Streams currently only supports Java but provides lower latency continuous processing.
This document discusses event time windowing in streaming data pipelines using the Glazier library. It begins with an example use case of gathering lowest latencies per session within 10 second windows. It then demonstrates how to implement this using Glazier to perform event time windowing rather than processing time windowing. The document explains key aspects of Glazier's API and how it uses Akka Streams under the hood to partition streams by key, apply tumbling windows based on event timestamps, and emit reduced results when windows close.
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
"This is a technical architect's case study of how Loggly has employed the latest social-media-scale technologies as the backbone ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. This presentation describes design details of how we built a second-generation system fully leveraging AWS services including Amazon Route 53 DNS with heartbeat and latency-based routing, multi-region VPCs, Elastic Load Balancing, Amazon Relational Database Service, and a number of pro-active and re-active approaches to scaling computational and indexing capacity.
The talk includes lessons learned in our first generation release, validated by thousands of customers; speed bumps and the mistakes we made along the way; various data models and architectures previously considered; and success at scale: speeds, feeds, and an unmeltable log processing engine."
Flink at netflix paypal speaker seriesMonal Daxini
Monal Daxini presented on Netflix's use of Apache Flink for stream processing. Netflix introduced Flink two years ago and has driven its adoption within the company. Key aspects of Netflix's Flink usage include around 2,000 routing jobs processing around 3 trillion events per day across around 10,000 containers.
This document discusses end-to-end processing of 3.7 million telemetry events per second using a lambda architecture at Symantec. It provides an overview of Symantec's security data lake infrastructure, the telemetry data processing architecture using Kafka, Storm and HBase, tuning targets for the infrastructure components, and performance benchmarks for Kafka, Storm and Hive.
Netflix Open Source Meetup Season 4 Episode 2aspyker
In this episode, we will take a close look at 2 different approaches to high-throughput/low-latency data stores, developed by Netflix.
The first, EVCache, is a battle-tested distributed memcached-backed data store, optimized for the cloud. You will also hear about the road ahead for EVCache it evolves into an L1/L2 cache over RAM and SSDs.
The second, Dynomite, is a framework to make any non-distributed data-store, distributed. Netflix's first implementation of Dynomite is based on Redis.
Come learn about the products' features and hear from Thomson and Reuters, Diego Pacheco from Ilegra and other third party speakers, internal and external to Netflix, on how these products fit in their stack and roadmap.
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
Spark Streaming and Kafka Streams are two popular stream processing platforms. Spark Streaming uses micro-batching and allows for code reuse between batch and streaming jobs. Kafka Streams is embedded directly into Apache Kafka and leverages Kafka as its internal messaging layer. Both platforms support stateful stream processing operations like windowing, aggregations, and joins through distributed state stores. A demo application is shown that detects dangerous driving by joining truck position data with driver data using different streaming techniques.
What no one tells you about writing a streaming apphadooparchbook
This document discusses 5 things that are often not addressed when writing streaming applications:
1. Managing and monitoring long-running streaming jobs can be challenging as frameworks were not originally designed for streaming workloads. Options include using cluster mode to ensure jobs continue if clients disconnect and leveraging monitoring tools to track metrics.
2. Preventing data loss requires different approaches depending on the data source. File and receiver-based sources benefit from checkpointing while Kafka's commit log ensures data is not lost.
3. Spark Streaming is well-suited for tasks involving windowing, aggregations, and machine learning but may not be needed for all streaming use cases.
4. Achieving exactly-once semantics requires techniques
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
So you know you want to write a streaming app but any non-trivial streaming app developer would have to think about these questions:
How do I manage offsets?
How do I manage state?
How do I make my spark streaming job resilient to failures? Can I avoid some failures?
How do I gracefully shutdown my streaming job?
How do I monitor and manage (e.g. re-try logic) streaming job?
How can I better manage the DAG in my streaming job?
When to use checkpointing and for what? When not to use checkpointing?
Do I need a WAL when using streaming data source? Why? When don’t I need one?
In this talk, we’ll share practices that no one talks about when you start writing your streaming app, but you’ll inevitably need to learn along the way.
Python is popular amongst data scientists and engineers for data processing tasks. The big data ecosystem has traditionally been rather JVM centric. Often Java (or Scala) are the only viable option to implement data processing pipelines. That sometimes poses an adoption barrier for organizations that have already invested in other language ecosystems. The Apache Beam project provides a unified programming model for data processing and its ongoing portability effort aims to enable multiple language SDKs (currently Java, Python and Go) on a common set of runners. The combination of Python streaming on the Apache Flink runner is one example. Let’s take a look how the Flink runner translates the Beam model into the native DataStream (or DataSet) API, how the runner is changing to support portable pipelines, how Python user code execution is coordinated with gRPC based services and how a sample pipeline runs on Flink.
3. Apache Beam Overview
Apache Beam is an advanced unified programming model designed to provide efficient and portable data processing pipelines.
● Unified - A single programming model for both batch and streaming
● Advanced - Strong consistency via event time: windowing, triggering, late-arrival handling, accumulation, etc.
● Portable - Execute pipelines written in multiple language SDKs, including Java, Python, and Go
● Efficient - Write and share SDKs, IO connectors, and transformation libraries
https://beam.apache.org/
4. Beam Model
● A Pipeline encapsulates your entire data processing task, from start to finish
● IO transforms are the endpoints for data input and output
● A PCollection represents an immutable distributed data set that your Beam pipeline operates on
● A PTransform represents a data processing operation, or a step, in your pipeline
[Diagram: a Pipeline in which IO.read produces a PCollection, a PTransform turns it into another PCollection, and IO.write emits the result]
6. Beam Event Time
[Diagram: eight events plotted by processing time (12:00-12:03) against their event times; a 1-min fixed window over processing time groups events by when they arrive, regardless of their event timestamps]
7. Beam Event Time
[Diagram: the same events assigned to 1-min fixed windows over event time; a watermark line advances across processing time]
- Watermark: a timestamp asserting that all events with earlier timestamps have arrived.
- Data that arrives with a timestamp behind the watermark is considered late data.
- Example using a simple watermark at event timestamp 12:01.
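The fixed-window and watermark behavior can be sketched in a few lines of plain Python (an illustration of the model only, not the Beam API; all names here are made up):

```python
# Assign events to 1-minute fixed windows by event time, and use a
# simple watermark to classify late arrivals.

WINDOW_MS = 60_000  # 1 minute in milliseconds

def window_for(event_ts_ms):
    # Fixed windows tile the event-time axis: [0, 60s), [60s, 120s), ...
    start = (event_ts_ms // WINDOW_MS) * WINDOW_MS
    return (start, start + WINDOW_MS)

def split_late(event_timestamps, watermark_ms):
    # The watermark asserts all events with earlier timestamps arrived;
    # anything arriving behind it is late data.
    on_time = [t for t in event_timestamps if t >= watermark_ms]
    late = [t for t in event_timestamps if t < watermark_ms]
    return on_time, late
```

Note that lateness depends only on the event timestamp relative to the watermark, not on when the element physically arrives.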
8. Beam Event Time
[Diagram: continuation of the example; events 1-6 have arrived, and the watermark at event timestamp 12:01 has advanced, closing the first event-time window]
9. Beam Event Time
[Diagram: events 7 and 8 arrive in processing time after the watermark has passed their event timestamps (before 12:01), so they are treated as late data]
10. Beam Windowing
Windowing divides data into event-time-based finite chunks.
Often required when doing aggregations over unbounded data.
[Diagram: fixed windows slice the timeline into equal, non-overlapping chunks; sliding windows overlap; session windows group each key's events into bursts separated by gaps of inactivity]
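The three window shapes can be sketched as simple assignment functions (timestamps are integers in an arbitrary unit; the helper names are illustrative, not Beam's API):

```python
def fixed_window(ts, size):
    # The single non-overlapping window of length `size` containing ts.
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts, size, period):
    # Every window of length `size`, starting each `period`, containing ts.
    windows = []
    start = (ts // period) * period
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)

def session_windows(timestamps, gap):
    # Merge one key's timestamps into sessions separated by >= gap.
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1] = (sessions[-1][0], ts)  # extend current session
        else:
            sessions.append((ts, ts))             # start a new session
    return sessions
```

A sliding window with `size > period` assigns each element to several overlapping windows, which is what makes it suitable for rolling aggregations.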
11. Beam Stateful Processing
[Diagram: page-view events (news, msg, jobs, network) plotted in event time across 1-min windows from 12:00 to 12:03]
● Beam provides several state abstractions, e.g. ValueState, BagState, MapState, CombiningState
● State is kept on a per-key-and-window basis
State for counting per PageKey:

PageKey   12:00-12:01   12:01-12:02   12:02-12:03
news           1             0             1
msg            0             3             0
network        0             0             1
jobs           0             1             0
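Per-key-and-window state can be sketched as a counter cell addressed by the (key, window) pair; the sample events below are assumed timestamps chosen to reproduce the counts in the slide's table:

```python
from collections import defaultdict

# Each (key, window) pair owns an independent state cell; here a
# ValueState-like counter per page key per 1-minute window.

WINDOW = 60  # 1-minute windows; timestamps in seconds for brevity

state = defaultdict(int)  # (page_key, window_start) -> count

events = [("news", 0), ("msg", 70), ("msg", 80), ("msg", 90),
          ("jobs", 95), ("network", 130), ("news", 140)]

for page_key, ts in events:
    window_start = (ts // WINDOW) * WINDOW
    state[(page_key, window_start)] += 1  # update this key+window's cell
```

Because the cell is scoped to both key and window, counts for "news" in 12:00-12:01 and 12:02-12:03 never interfere with each other.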
14. The Goal of Samza Runner
Bring the easy-to-use but powerful model of Beam to Samza users for state-of-the-art stream and batch data processing, with portability across a variety of programming languages.
15. Samza Overview
● The runner combines the large-scale stream processing capabilities of Samza with the advanced programming model of Beam
● First-class support for local state (with a RocksDB store)
● Fault tolerance with support for incremental checkpointing of state instead of full snapshots
● A fully asynchronous processing engine that makes remote calls efficient
● Flexible deployment models, e.g. Yarn and standalone with Zookeeper
17. How the Samza Runner Works
● A Beam runner translates the Beam API into its native API plus runtime logic, and executes it in a distributed data processing system.
● The Samza Runner translates the Beam API into the Samza high-level API and executes the logic in a distributed manner, e.g. on Yarn or standalone.
● The Samza Runner contains the logic to support Beam features:
  - Beam IO    - Event time/Watermark    - GroupByKey
  - Keyed State    - Triggering Timers    - Side Input
18. Unbounded/Bounded IO
● UnboundedSourceSystem adapts any unbounded IO.Read into a Samza SystemConsumer. It will 1) split the sources according to the parallelism needed; 2) generate IncomingMessageEnvelopes of either events or watermarks
● BoundedSourceSystem adapts any bounded IO.Read into a Samza SystemConsumer, producing events and end-of-stream markers
● Direct translation is also supported for Samza native data connectors, e.g. translating KafkaIO.Read directly into KafkaSystemConsumer
[Diagram: KafkaIO.Read (KafkaUnboundedSource) feeds events/watermarks, and TextIO.Read (TextSource) feeds events/end-of-stream, into a Samza StreamProcessor]
19. Watermark
● A watermark is injected at a fixed interval from unbounded sources
● Watermarks are propagated through each downstream operator and aggregated using the following logic:
InputWatermark(op) = max(CurrentInputWatermark(op), min(OutputWatermark(op') | op' is upstream of op))
OutputWatermark(op) = max(CurrentOutputWatermark(op), InputWatermark(op))
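The two aggregation rules can be written out as plain functions (a sketch of the propagation logic, not the runner's actual code):

```python
def input_watermark(current_input_wm, upstream_output_wms):
    # Never move backwards; bounded by the slowest upstream output.
    return max(current_input_wm, min(upstream_output_wms))

def output_watermark(current_output_wm, input_wm):
    # An operator's output watermark follows its input watermark,
    # again never moving backwards.
    return max(current_output_wm, input_wm)
```

The `min` over upstream outputs is what keeps a multi-input operator honest: it cannot declare time t complete until every upstream has.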
21. GroupByKey
● The runner automatically inserts a partitionBy before the reduce
● Intermediate aggregation results are stored in Samza key-value stores (RocksDB by default)
● The output is triggered by watermarks by default
[Diagram: KafkaIO.Read → FlatMap → partitionBy on KV<key, value> → ReduceFn running against state]
22. State Support
● Beam states are provided by SamzaStoreStateInternals
● The key for each state cell is (element key, window id, address)
● Samza also provides a readIterator() interface for large states that won't fit in memory
[Diagram: ValueState, BagState, SetState, MapState, CombiningState, and WatermarkState are all backed by SamzaStoreStateInternals on top of RocksDB]
23. Timer Support
● Beam timers are provided by SamzaTimerInternalsFactory
● Supports both event-time and processing-time timers
● Event-time timers are managed in a sorted set ordered by timestamp
● Processing-time timers are managed by the Samza TimerRegistry via the TimerFunction API
● All timers are keyed by TimerKey (id, namespace, element key)
[Diagram: setTimer adds keyed timers to SamzaTimerInternals; event-time timers fire into GroupByKey as the watermark advances, while processing-time timers are registered with the Samza SystemTimerScheduler]
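The sorted-set handling of event-time timers can be sketched with a min-heap: when the watermark advances, every timer at or before it fires (an illustration of the idea, not the runner's implementation):

```python
import heapq

# Event-time timers kept in a min-heap ordered by timestamp.

timers = []  # heap of (timestamp, key)

def set_timer(ts, key):
    heapq.heappush(timers, (ts, key))

def on_watermark(watermark):
    # Fire every timer whose timestamp is at or before the watermark,
    # in timestamp order.
    fired = []
    while timers and timers[0][0] <= watermark:
        fired.append(heapq.heappop(timers))
    return fired

set_timer(10, "k1")
set_timer(5, "k2")
set_timer(20, "k3")
fired = on_watermark(12)  # fires the timers at 5 and 10, keeps 20
```

Keeping timers ordered by timestamp makes each watermark advance a cheap prefix scan rather than a full pass over all pending timers.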
24. View/Side Input
● Beam views: SingletonView, IterableView, ListView, MapView, MultimapView
● Beam views are materialized into a physical stream and broadcast to all tasks using the Samza broadcast operator
● ParDo consumes the broadcast view as a side input
[Diagram: TextIO.Read → ParDo → Combine.GloballyAsSingletonView is materialized into a broadcast stream, consumed as side input by the ParDo instances running over all partitions (0-3) of the main KafkaIO.Read input]
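The side-input pattern can be sketched as computing one small value and handing the same copy to every parallel worker (function names here are illustrative, not Beam's API):

```python
# A broadcast side input: a small view is computed once, and every
# parallel ParDo instance reads the same value alongside its own
# partition of the main input.

def combine_globally(view_elements):
    # e.g. a global sum materialized as a single broadcast value
    return sum(view_elements)

def par_do(partition, side_input):
    # each main-input element can consult the broadcast view
    return [(element, side_input) for element in partition]

view = combine_globally([1, 2, 3])
partitions = [["a", "b"], ["c"]]
outputs = [par_do(p, view) for p in partitions]
```

The key property is that the view is small and read-only: it can be safely replicated to every task instead of being co-partitioned with the main input.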
25. Deployment
Local (single JVM)
● Default mode: no config required
● LocalApplicationRunner
● PassthroughJobCoordinator
● All tasks grouped into one container

Yarn
● RemoteApplicationRunner
● YarnJobFactory
● Configure containers using job.container.count
[Diagram: the Yarn RM schedules multiple JVM processes across the Yarn cluster]

Standalone (Zookeeper)
● LocalApplicationRunner
● ZkJobCoordinator
● Configure the zk connection via job.coordinator.zk.connect
[Diagram: multiple StreamProcessors, each a Samza container plus a job coordinator, coordinating through Zookeeper]
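The Yarn and standalone settings named above might look like the following in a job's properties file (a sketch; the values are illustrative, not recommendations):

```properties
# Yarn deployment
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.container.count=4

# Standalone (Zookeeper) deployment instead configures:
# job.coordinator.zk.connect=zk-host-1:2181,zk-host-2:2181
```

In local single-JVM mode neither setting is needed, which is what makes it the zero-config default.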
27. Use Case 1: Fixed-window Join to Track Location
Suppose you own a Star Trek fleet and want to track the location of your Starships. The location data are gathered through Starship on-board transmitters as well as your radar monitors. Now let's track their locations in event time using a 10-min window.
[Pipeline diagram: the onboard location transmitter stream and the radar monitor stream are each keyed by ship ID (WithKey) and assigned 10-min fixed windows, then joined with CoGroupByKey, post-processed with a ParDo, and written out via KafkaIO.Write and DbIO.Write to a location info DB.
Example: transmitter event (T1, Enterprise, SF, 1) and radar event (R1, Enterprise, SV, 9) become Enterprise, (SF, 1) and Enterprise, (SV, 9) in Window(0:9); the join yields Enterprise, (SF, 1), (SV, 9) in Window(0:9), and the latest location Enterprise, SV is emitted.]
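The fixed-window CoGroupByKey join can be sketched as keying both streams by ship ID, assigning 10-minute event-time windows, and co-grouping per (key, window); the sample events mirror the slide (a model of the dataflow, not the Beam API):

```python
from collections import defaultdict

def key_and_window(stream, size=10):
    # Key each (ship, location, minute) record by (ship, window_start).
    grouped = defaultdict(list)
    for ship, location, minute in stream:
        window = (minute // size) * size  # minutes 1 and 9 both -> window 0
        grouped[(ship, window)].append((location, minute))
    return grouped

transmitter = [("Enterprise", "SF", 1)]   # from (T1, Enterprise, SF, 1)
radar = [("Enterprise", "SV", 9)]         # from (R1, Enterprise, SV, 9)

left, right = key_and_window(transmitter), key_and_window(radar)
# CoGroupByKey: pair up both sides' values for each (key, window).
joined = {k: (left.get(k, []), right.get(k, []))
          for k in set(left) | set(right)}
```

Because both events fall in the same 10-minute window, the join sees the transmitter and radar readings together and can emit the latest location per ship.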
28. Use Case 2: Session-window Join to Gather Activities
Suppose we are heading out to Disneyland and would like to know the activity count for each person. Here we use a session-window join to gather the activities done per person.
[Pipeline diagram: Ticket Purchase, Membership Purchase, and Activity events each pass through a SessionWindow (4 hour), are joined by id via CoGroupByKey, then counted with Count.perKey.
Example: ticket purchase Xinyu: G00001 and membership purchase Boris: M00001 join with activities G00001: Space Mountain, M00001: Harry Potter, M00001: Small World to yield Xinyu: 1, Boris: 2.]
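The join-and-count step can be sketched as follows, assuming all sample events land in one 4-hour session (the session-window assignment itself is elided; this models only the CoGroupByKey + Count.perKey part):

```python
from collections import defaultdict

# Purchases map a member/ticket id to a person; activities arrive per
# id. Within one session window, join by id and count per person.

purchases = {"G00001": "Xinyu", "M00001": "Boris"}  # id -> person
activities = [("G00001", "Space Mountain"),
              ("M00001", "Harry Potter"),
              ("M00001", "Small World")]

counts = defaultdict(int)
for member_id, _activity in activities:   # CoGroupByKey + Count.perKey
    counts[purchases[member_id]] += 1
```

With real session windows, a gap of over 4 hours between a person's events would split their activities into separate sessions with separate counts.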
29. Use Case 3: Sliding-window Aggr. for Feature Generation
Calculate features such as count, top N, and sum for a particular key over PageView events, using a 1-day sliding window with a 1-min update interval.
[Pipeline diagram: PageView Event → SlidingWindow (1 day, every min) → Count.perKey / Top.largestPerKey(n) / Filter.by / Sum.globally]

Alternatively, using SQL:

Schema pageViewSchema = RowSqlTypes
    .builder()
    .withVarcharField("pageKey")
    .withTimestampField("timestamp")
    .build();

PCollection<Row> pageViewsRows = pageViews
    .apply(MapElements
        .into(TypeDescriptor.of(Row.class))
        .via((PageViewEvent pv) ->
            Row.withSchema(pageViewSchema)
                .addValues(pv.pageKey.toString(),
                    new DateTime(pv.time)).build()))
    .setCoder(pageViewSchema.getRowCoder());

PCollection<KV<String, Long>> counts = pageViewsRows
    .apply(BeamSql.query(
        "SELECT COUNT(*) AS `count` FROM pageView "
        + "GROUP BY pageKey, "
        + "HOP(timestamp, INTERVAL '1' MINUTE, INTERVAL '1' DAY)"));
31. Future Work
● Python!
● Async Support
● Table API
# A sample word count (Python 2-era Beam SDK: note `unicode` and
# tuple-unpacking lambdas).
p = Pipeline(options=pipeline_options)

# Read the text file[pattern] into a PCollection.
lines = p | 'read' >> ReadFromText(known_args.input)

# Count the occurrences of each word.
counts = (lines
          | 'split' >> (ParDo(WordExtractingDoFn())
                        .with_output_types(unicode))
          | 'pair_with_one' >> Map(lambda x: (x, 1))
          | 'group' >> GroupByKey()
          | 'count' >> Map(lambda (word, ones): (word, sum(ones))))

# Format the counts into a PCollection of strings.
output = (counts
          | 'format' >> Map(lambda (word, c): '%s: %s' % (word, c)))

# Write the output using a "Write" transform that has side effects.
# pylint: disable=expression-not-assigned
output | 'write' >> WriteToText(known_args.output)

result = p.run()
result.wait_until_finish()
// Use CompletionStage for asynchronous processing
input.apply(ParDo.of(
    new DoFn<InputT, OutputT>() {
      @ProcessElement
      public void process(
          @Element CompletionStage<InputT> element, ...) {
        element.thenApply(...);
      }
    }));
// PTable is the Table abstraction
PTable<KV<String, User>> userTable =
    pipeline.apply(
        EspressoTable.readWrite()
            .withDb("dbname")
            .withTable("user"));

pageView
    .apply(TableParDo.of(
        new DoFn<KV<String, PageViewEvent>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c,
              @TableContext.Inject TableContext tc) {
            String id = c.element().getKey();
            // table lookup
            Table<String, User> users = tc.getTable(userTable);
            User user = users.get(id);
            c.output(id + ":" + user.getName().toString());
          }
        })
        .withTables(userTable));

// Convenient helper class to do the same thing
PCollection<String> result = PCollectionTableJoin
    .of(pageView, userTable)
    .into(TypeDescriptors.strings())
    .via((pv, user) ->
        pv.getKey() + ":" + user.getName().toString());
32. Thank you!
Special Thanks to Our Early Adopters:
Yingkai Hu, Froila Dsouza, Zhongen Tao, Nithin Reddy, Bruce Su
https://beam.apache.org/documentation/runners/samza/
Editor's Notes
Talk about the key features that are not available in current Samza.
How the Samza runner works: what a runner is, and what we need to support Beam features.
When propagating watermarks across stages (connected by intermediate streams), the partitionBy operator sends its watermarks to a single downstream task, which aggregates the watermarks and then broadcasts the aggregated watermark to all peer tasks.
As far as we know, most of the existing Beam runners don't support larger-than-memory state.
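The watermark aggregation described in the note above can be sketched in plain Java (a simplified model, not the runner's actual code): the aggregating task tracks the latest watermark reported by each upstream task and broadcasts the minimum, since a stage cannot safely advance past its slowest input.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class WatermarkAggregator {
    // Latest watermark reported by each upstream task.
    private final Map<Integer, Long> upstreamWatermarks = new HashMap<>();

    // Record a watermark from one upstream task and return the aggregated
    // watermark to broadcast: the minimum across all upstream tasks seen.
    long onWatermark(int upstreamTaskId, long watermark) {
        upstreamWatermarks.put(upstreamTaskId, watermark);
        return Collections.min(upstreamWatermarks.values());
    }

    public static void main(String[] args) {
        WatermarkAggregator agg = new WatermarkAggregator();
        System.out.println(agg.onWatermark(0, 100));  // only task 0 seen → 100
        System.out.println(agg.onWatermark(1, 50));   // min(100, 50)  → 50
        System.out.println(agg.onWatermark(1, 120));  // min(100, 120) → 100
    }
}
```

In a real runner the aggregator would also wait until every upstream task has reported at least once before emitting; this sketch omits that for brevity.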