Robust Stream Processing With Apache Flink

•Download as PPTX, PDF•

2 likes•588 views

My talk from Reactive Summit 2016 - Not terribly interesting without the actual talk since most of it was live coding and demo.

Robust
Stream Processing
with
Apache Flink
Jamie Grier
@jamiegrier
jamie@data-artisans.com

Who am I?
• Director of Applications Engineering at data
Artisans
• Previously working on streaming computation at
Twitter, Gnip and Boulder Imaging
• Involved in various kinds of stream processing for
about a decade
• High-speed video, social media streaming, general
frameworks for stream processing

Overview
• What is Apache Flink?
• What is Stateful Stream Processing?
• Windowed computation over streams
• Robust Time Handling (Event Time vs Processing Time)
• Robust Failure Handling
• Robust Planned Downtime Handling
• Robust Reprocessing

What is
Apache Flink?
s an open source platform for distributed stream and batch da

Stream Processing
Your
Code
Data Stream Data Stream

Stateful
Stream Processing
Your
Code
Data Stream Data Stream
State

More Complex
Example
Rabbit
MQ
Files
Kafka
Filter
Map
Join /
Sum
Influx
DB
C*

Distributed and Parallel
Deployment
MapR
Stream
s
Files
Kafka
Filter
Pars
e
Join /
Sum
Influx
DB
C*

Benchmarking on
HPC Cluster
0 18,000,000 36,000,000 54,000,000 72,000,000 90,000,000
Throughput: msgs/sec
10 Machines with 40 GigE
72 Million msgs/sec

Robust Stream Processing
with Apache Flink

Amplifier Function
Amplifier
Amplified Stream
State*
*State: Amplification factors for each key

Windowing in Processing
Time
0 1 2 34 56 7 8 9 0 1 2 3 4 5 6 7 8 9
Processing Time
Event Time

Windowing in Event
Time
0 1 2 34 56 7 8 9 0 1 2 3 4 5 6 7 8 9
Event Time

We’re Hiring!
http://data-artisans.com/careers

http://flink-forward.org/kb_sessions/robust-stream-processing-with-apache-flink/ In this hands on talk and demonstration I’ll give a very short introduction to stream processing and then dive into writing code and demonstrating the features in Apache Flink that make truly robust stream processing possible. We’ll focus on correctness and robustness in stream processing. During this live demo we’ll be developing a realtime analytics application and modifying it on the fly based on the topics we’re working though. We’ll exercise Flink’s unique features, demonstrate fault-recovery, clearly explain and demonstrate why Event Time is such an important concept in robust stateful stream processing and talk about and demonstrate the features you need in a stream processor in production. Some of the topics covered will be: – Stateful Stream Processing – Event Time vs. Processing Time – Fault tolerance – State management in the face of faults – Savepoints – Data re-processing – Planned downtime and upgrades

Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...

Flink Forward

http://flink-forward.org/kb_sessions/keynote-tba-2/ The past 12 months saw the data streaming ecosystem mature and grow tremendously with new open source projects and products being offered in the market, and more large-scale production applications of streaming data. It is now understood that streaming data is not a fad, but a growing industry that is here to stay. Apache Flink was one of the pioneering communities advocating that stream processing is a great fit for the continuous nature of data production, and that batch processing can be seen and efficiently performed as a special case of stream processing. Flink saw tremendous growth since the last Flink Forward conference, with the project boasting now more than 200 contributors from several companies, several production installations and broad adoption. In this talk, we discuss several large-scale stream processing use cases that we see at data Artisans. Additionally, we discuss what this accelerated growth means for Flink, how we can sustain this growth moving forward, as well as a vision for the next big directions in Flink.

Flink 1.0-slides

Jamie Grier

Stream processing with Apache Flink @ OfferUp

Bowen Li

Apache Flink Community Updates November 2016 @ Berlin Meetup

Robert Metzger

http://flink-forward.org/kb_sessions/flink-and-beam-current-state-roadmap/ It is no secret that the Dataflow model, which evolved from Google’s MapReduce, Flume, and MillWheel, has been a major influence to Apache Flink’s streaming API. The essentials of this model are captured in Apache Beam. Beam provides the Dataflow API with the option to deploy to various backends (e.g. Flink, Spark). In this talk we will examine the current state of the Flink Runner. Beam’s Runners manage the translation of the Beam API into the backend API. The Beam project itself has made an effort to summarize the capabilities of each Runner to provide an overview of the supported API concepts. From all open sources backends, Flink is currently the Runner which supports the most features. We will look at the supported Beam features and their counterpart in Flink. Further, we will look at potential improvements and upcoming features of the Flink Runner.

Achieving end-to-end visibility into complex event-sourcing transactions usin...

HostedbyConfluent

Event-sourcing systems usage like Kafka is growing rapidly among Node.js applications. Building systems around an event-driven architecture simplifies horizontal scalability in distributed computing models and makes them more resilient to failure. With these advantages, we face new challenges - how to get visibility into these complex processes. Event-driven architecture is async by nature. Tracking the communication between different components is both extremely difficult and important when debugging or figuring out bottlenecks in the system. In this talk, I will present ways to achieve end-to-end and granular visibility into complex event-sourcing transactions using distributed tracing. I will use open-source tools like OpenTelemetry, Jaeger, and Zipkin to showcase a complex Node.js system using Kafka.

Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...

Flink Forward

Many use cases in the telecommunication industry require producing counters, quality metrics, and alarms in a streaming fashion with very low latency. Most of this metrics are only valuable when they’re made available as soon as the associated events happened. In our company we are looking for a system able to produce this kind of real-time indicator, which must handle massive amounts of data (400,000 eps) with often peak loads (like New Year’s Eve) or out-of-order events like massive network disorder. Low latency and flexible window management with specific watermark emission are also a must-haves. Heterogeneous format, multiple flow correlation, and the possibility of late data arrival are other challenges. Flink being already widely used at Bouygues Telecom for real-time data integration, its features made it the evident candidate for the future System. In this talk, we'll present a real use case of streaming analytics using Flink, Kafka & HBase along with other legacy systems.

Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...

Flink Forward

http://flink-forward.org/kb_sessions/flink-in-zalandos-world-of-microservices/ In this talk we present Zalando’s microservices architecture, introduce Saiki – our next generation data integration and distribution platform on AWS and show how we employ stream processing with Apache Flink for near-real time business intelligence. Zalando is one of the largest online fashion retailers in Europe. In order to secure our future growth and remain competitive in this dynamic market, we are transitioning from a monolithic to a microservices architecture and from a hierarchical to an agile organization. We first have a look at how business intelligence processes have been working inside Zalando for the last years and present our current approach – Saiki. It is a scalable, cloud-based data integration and distribution infrastructure that makes data from our many microservices readily available for analytical teams. We no longer live in a world of static data sets, but are instead confronted with endless streams of events that constantly inform us about relevant happenings from all over the enterprise. The processing of these event streams enables us to do near-real time business intelligence. In this context we have evaluated Apache Flink vs. Apache Spark in order to choose the right stream processing framework. Given our requirements, we decided to use Flink as part of our technology stack, alongside with Kafka and Elasticsearch. With these technologies we are currently working on two use cases: a near real-time business process monitoring solution and streaming ETL. Monitoring our business processes enables us to check if technically the Zalando platform works. It also helps us analyze data streams on the fly, e.g. order velocities, delivery velocities and to control service level agreements. On the other hand, streaming ETL is used to relinquish resources from our relational data warehouse, as it struggles with increasingly high loads. In addition to that, it also reduces the latency and facilitates the platform scalability. Finally, we have an outlook on our future use cases, e.g. near-real time sales and price monitoring. Another aspect to be addressed is to lower the entry barrier of stream processing for our colleagues coming from a relational database background.

data Artisans Product Announcement

Flink Forward

Abstractions for managed stream processing platform (Arya Ketan - Flipkart)

KafkaZone

Flink Forward SF 2017: Scott Kidder - Building a Real-Time Anomaly-Detection ...

Flink Forward

Mux uses Apache Flink to identify anomalies in the distribution & playback of digital video for major video streaming websites. Scott Kidder will describe the Apache Flink deployment at Mux leveraging Docker, AWS Kinesis, Zookeeper, HDFS, and InfluxDB. Deploying a Flink application in a zero-downtime production environment can be tricky, so unit- & behavioral-testing, application packaging, upgrade, and monitoring strategies will be covered as well.

Real-Time Dynamic Data Export Using the Kafka Ecosystem

confluent

(Preston Thompson, Braze) Kafka Summit SF 2018 If you collect billions of data points every day and create billions more sending and tracking messages, then you know you need to get your infrastructure right. Our clients use Braze to engage their users over their lifecycle via push notifications, emails, in-app messages and more. Using our Currents product, clients can enable multiple configurable integrations to export this event data in real time to a variety of third-party systems, allowing them to tightly integrate with the rest of their operations and understand the impacts of their engagement strategy. We use Kafka and the Kafka ecosystem to power this high volume real-time export. As you’d expect in a big data environment, we take data collected from a variety of sources—our SDKs, email partner APIs, our own systems—and produce it to Kafka, with topics for each type of event (about 30 types). Kafka Streams filters and transforms this data according to the configurations set by our clients. Clients can choose which types of events should be sent to which third-party systems. Kafka Connect helps to export the data to third-party systems in real time using custom developed connectors. We run a connector instance for each integration for each customer that consumes from the integration-specific topic. On top of it all, we built a service to manage the pipeline. The service provides configurations to the Streams application and also creates topics for new integrations and uses the Connect REST API to create and manage connectors. In this talk, I will discuss: -How we started our journey in designing this large-scale streaming architecture -Why streaming technologies were necessary to solve our technology and business issues -The lessons we learned along the way that can help you with your Kafka-based architecture

Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...

Flink Forward

Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data. This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.

Stream Processing with Apache Flink

C4Media

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1VhSzmy. Robert Metzger provides an overview of the Apache Flink internals and its streaming-first philosophy, as well as the programming APIs. Filmed at qconlondon.com. Robert Metzger is a PMC member at the Apache Flink project and a cofounder and software engineer at data Artisans. He is the author of many Flink components including the Kafka and YARN connectors.

Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...

Flink Forward

Elastic Data Processing with Apache Flink and Apache Pulsar More and more applications are using Flink for low-latency data processing. Flink unifies batch and stream processing using one computation engine. However in reality, in order to really unify batch and stream processing, it requires a data system offers one unified data representation for both batch and streaming data. Nowadays, streaming data is typically stored in a log storage or messaging system, while batch data is stored in distributed filesystem and object stores. That means that data scientists still need write two different computing jobs to access same data stored in different data systems. Apache Pulsar is the next generation messaging and streaming data system. It was originally built at Yahoo, and has graduated from Apache Incubator and become a Top-Level-Project. Pulsar separates messaging serving and data storage into two layers. Such layered architecture provides high throughput and low-latency while ensuring high availability and scalability. Pulsar’s segment centric storage design along with layered architecture makes Pulsar a perfect unbounded streaming data system, which can well fit into Flink’s computation model. In this talk, Sijie Guo from Apache Pulsar PMC, will introduce Pulsar and its layered architecture and segment-centric storage, detailing how this architecture can well integrate with Flink to provide elastic unified batch and stream processing.

Flink Forward San Francisco 2018 keynote: Srikanth Satya - "Stream Processin...

Flink Forward

Stream Processing in conjunction with a Consistent, Durable, Reliable stream storage is kicking the revolution up a notch in Big Data processing. This modern paradigm is enabling a new generation of data middleware that delivers on the streaming promise of a simplified and unified programming model. From data ingest, transformation, and messaging to search, time series and more, a robust streaming data ecosystem means we’ll all be able to more quickly build applications that solve problems we could not solve before.

Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...

Flink Forward

AirStream is a realtime stream computation framework that supports Flink as one of its processing engines. It allows engineers and data scientists at Airbnb to easily leverage Flink to build real time data pipelines and feedback loops. Multiple mission critical applications have been built on top of it. In this talk, we will start with an overview of AirStream, and describe how we have designed Airstream to leverage SQL support in Flink to allow users to easily build real time data pipelines. We will go over a few production use cases such as building a user activity profiler and building user identity mapping in realtime. We will also cover how we have integrated Airstream into the data infrastructure ecosystem at Airbnb through easily configurable connectors such as Kafka and Hive that allow users to easily leverage these components in their pipelines.

Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...

Flink Forward

http://flink-forward.org/kb_sessions/declarative-stream-processing-with-streamsql-and-cep/ Complex event processing (CEP) and stream analytics are commonly treated as distinct classes of stream processing applications. While CEP workloads identify patterns from event streams in near real-time, stream analytics queries ingest and aggregate high-volume streams. Both types of use cases have very different requirements which resulted in diverging system designs. CEP systems excel at low-latency processing whereas engines for stream analytics achieve high throughput. Recent advances in open source stream processing yielded systems that can process several millions of events per second at sub-second latency. Systems like Apache Flink enable applications that include typical CEP features as well as heavy aggregations. In this talk we will show how Apache Flink unifies CEP and stream analytics workloads. Guided by examples, we introduce Flink’s CEP-enriched StreamSQL interface and discuss how queries are compiled, optimized, and executed on Flink.

Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...

Flink Forward

Apache Beam was open sourced by the big data team at Google in 2016, and has become an active community with participants from all over. Beam is a framework to define data processing workflows and run them on various runners (Flink included). In this talk, I will talk about some cool things you can do with Beam + Flink such as running pipelines written in Go and Python; then I’ll mention some cool tools in the Beam ecosystem. Finally, we’ll wrap up with some cool things we expect to be able to do soon - and how you can get involved.

Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber

confluent

Speaker: Yupeng Fu, Staff Engineer, Uber High availability and reliability are important requirements to Uber services, and the services shall tolerate datacenter failures in a region and fail over to another region. In this talk, we will present the active-active Apache Kafka® at Uber and how it facilitates disaster discovery across regions for Uber services. In particular, we will highlight the key components including topic replication, topic aggregation, offsets sync and then walk through several use cases of their disaster recovery strategy using active-active Kafka. Lastly, we will present several interesting challenges and the future work planned. Yupeng Fu is a staff engineer in Uber Data Org leading the streaming data platform. Previously, he worked at Alluxio and Palantir, building distributed data analysis and storage platforms. Yupeng holds a B.S. and an M.S. from Tsinghua University and did his Ph.D. research on databases at UCSD.

A stream: Ad-hoc Shared Stream Processing - Jeyhun Karimov, DFKI GmbH

Flink Forward

In the last decade, many distributed stream processing engines (SPEs) were developed to perform continuous queries on massive online data. The central design principle of these engines is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. In many real applications, streams are not only processed with long-running queries, but also thousands of short-running ad-hoc queries. To support this efficiently, it is essential to share resources and computation for stream ad-hoc queries in a multi-user environment. The goal of this talk is to bridge the gap between stream processing and ad-hoc queries in SPEs by sharing computation and resources. We define three main requirements for ad-hoc shared stream processing: (1) Integration: Ad-hoc query processing should be a composable layer which can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: Ad-hoc query creation and deletion must be performed in a consistent manner and ensure exactly-once semantics and correctness; (3) Performance: In contrast to state-of-the-art SPEs, ad-hoc SPE should not only maximize data throughput but also query throughout via incremental computation and resource sharing. Based on these requirements, we have developed AStream, an ad-hoc, shared computation stream processing framework. To the best of our knowledge, AStream is the first system that supports distributed ad-hoc stream processing. AStream is built on top of Apache Flink. Our experiments show that AStream shows comparable results to Flink for single query deployments and outperforms it in orders of magnitude with multiple queries.

Flink Forward Berlin 2017: Zohar Mizrahi - Python Streaming API

Flink Forward

In this talk, we describe the design and implementation of the Python Streaming API support that has been submitted for inclusion in mainline Flink. Python is one of the most popular programming languages for data analysis. Its readability emphasizes development productivity and as a scripting language, it does not require a compilation nor complex development environment setup. Flink already has support for Python APIs for batch programming and unfortunately, the mechanism used to support batch programs (i.e., DataSet APIs) do does not work for Streaming API. We describe the limitations with the batch implementation and provide insights into how we solved this using Jython. We will walk through some of the examples programs using the new Python API and compare programmability and performance with the Java and Scala streaming APIs.

Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

confluent

Siphon is a highly available and reliable distributed pub/sub system built using Apache Kafka. It is used to publish, discover and subscribe to near real-time data streams for operational and product intelligence. Siphon is used as a “Databus” by a variety of producers and subscribers in Microsoft, and is compliant with security and privacy requirements. It has a built-in Auditing and Quality control. This session will provide an overview of the use of Kafka at Microsoft, and then deep dive into Siphon. We will describe an important business scenario and talk about the technical details of the system in the context of that scenario. We will also cover the design and implementation of the service, the scale, and real world production experiences from operating the service in the Microsoft cloud environment.

Presentación libro capilla la magdalena parral 27 dic 2007

Alfonso Pacheco Alvarez

TranscriptionDr Muhammad Mustansar

What's hot

Apache Flink @ Alibaba - Seattle Apache Flink Meetup

Bowen Li

Maximilian Michels - Flink and Beam

Flink Forward

Achieving end-to-end visibility into complex event-sourcing transactions usin...

HostedbyConfluent

Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...

Flink Forward

Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...

Flink Forward

data Artisans Product Announcement

Flink Forward

Abstractions for managed stream processing platform (Arya Ketan - Flipkart)

KafkaZone

Flink Forward SF 2017: Scott Kidder - Building a Real-Time Anomaly-Detection ...

Flink Forward

Real-Time Dynamic Data Export Using the Kafka Ecosystem

confluent

Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...

Flink Forward

Stream Processing with Apache Flink

C4Media

Flink Forward San Francisco 2019: Elastic Data Processing with Apache Flink a...

Flink Forward

Flink Forward San Francisco 2018 keynote: Srikanth Satya - "Stream Processin...

Flink Forward

Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...

Flink Forward

Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...

Flink Forward

Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...

Flink Forward

Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber

confluent

A stream: Ad-hoc Shared Stream Processing - Jeyhun Karimov, DFKI GmbH

Flink Forward

Flink Forward Berlin 2017: Zohar Mizrahi - Python Streaming API

Flink Forward

Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar

confluent

What's hot (20)