Logs are one of the most important sources to monitor and can reveal significant events of interest. In this presentation, we introduce an implementation of a log stream processing architecture based on Apache Flink. Different kinds of emitted logs are collected with fluentd and sent to Kafka. After the logs have been processed by Flink, we build a dashboard with Elasticsearch and Kibana for visualization.
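A minimal sketch of such a pipeline, assuming the legacy FlinkKafkaConsumer connector and a hypothetical `logs` topic; the Elasticsearch sink is left as a placeholder since its builder API differs across Flink versions.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class LogPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka topic populated by fluentd (topic name and brokers are assumptions)
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "log-processor");

        DataStream<String> rawLogs = env.addSource(
                new FlinkKafkaConsumer<>("logs", new SimpleStringSchema(), props));

        // Example processing step: keep only error-level log lines
        DataStream<String> errors = rawLogs.filter(line -> line.contains("ERROR"));

        // In the described architecture an Elasticsearch sink would follow here,
        // feeding the Kibana dashboard; print() stands in for it in this sketch.
        errors.print();

        env.execute("log-stream-processing-sketch");
    }
}
```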
Flink Connector Development Tips & Tricks (Eron Wright)
A look at some of the challenges and techniques for developing a connector for Apache Flink, covering the different types of connectors, lifecycle, metrics, event-time support, and fault tolerance.
Presentation video: https://www.youtube.com/watch?v=ZkbYO5S4z18
Virtual Flink Forward 2020: Build your next-generation stream platform based ... (Flink Forward)
As organizations are getting better at capturing streaming data and the data velocity and volume are ever-increasing, traditional messaging queues or log storage systems suffer from scalability, operational, and maintenance problems. Apache Pulsar is a multi-tenant, high-performance distributed pub-sub messaging system. Pulsar includes multiple features, such as native support for multiple clusters in a Pulsar instance, seamless geo-replication of messages across clusters, very low publishing and end-to-end latency, seamless scalability to over a million topics, and guaranteed message delivery with persistent message storage provided by Apache BookKeeper. In this talk, I will use one of the most popular stream processing engines, Apache Flink, as an example, to share our experience in building a stream processing and storage stack. Some of the topics covered are: * How to ensure end-to-end exactly-once semantics based on Pulsar's durable and replayable storage as well as Pulsar transactions. * How to implement Pulsar topics as infinite tables based on Pulsar's schema. * How to efficiently store stream states in Pulsar based on Pulsar's layered storage API. * A usage scenario that chains all of these functionalities together in the streaming platform.
As more and more organizations and individual users turn to Apache Flink for their streaming workloads, there is a bigger demand for additional functionality out of the box. On one hand, there is demand for more low-level APIs that allow for more control, while on the other, users ask for more high-level additions that make the common cases easier to express. This talk will present the new concepts added to the DataStream API in Flink 1.2 and the upcoming Flink 1.3 release that aim to reconcile the aforementioned goals. We will talk, among others, about the ProcessFunction, a new low-level stream processing primitive that gives the user full control over how each event is processed and can register and react to timers, changes in the windowing logic that allow for more flexible windowing strategies, side outputs, and new features concerning the Flink connectors.
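As an illustration of the primitive described above, here is a minimal sketch of a keyed process function that registers an event-time timer per element and reacts to it; it uses the later KeyedProcessFunction variant for brevity, and the event type and 60-second timeout are assumptions.

```java
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits a "timeout" marker if no further timer fires for a key within 60s of an event (event time).
public class TimeoutMarker extends KeyedProcessFunction<String, String, String> {

    private static final long TIMEOUT_MS = 60_000L;

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        // Forward the event and register a timer relative to its timestamp.
        out.collect(value);
        Long ts = ctx.timestamp();
        if (ts != null) {
            ctx.timerService().registerEventTimeTimer(ts + TIMEOUT_MS);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // Fires when the watermark passes the registered timer.
        out.collect("timeout for key " + ctx.getCurrentKey() + " at " + timestamp);
    }
}
```

It would be applied on a keyed stream, e.g. `stream.keyBy(...).process(new TimeoutMarker())`.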
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco... (Flink Forward)
Stateful stream processing with exactly-once guarantees is one of Apache Flink's distinctive features, and we can observe that the scale of state managed by Flink in production grows constantly. This leads to a couple of interesting challenges for state handling in Flink. In this talk, we present current and future developments to improve the handling of large state and recovery in Apache Flink. We show how to keep snapshots of large state swift and how to minimize negative effects on job performance through incremental and asynchronous checkpointing. Furthermore, we discuss how to greatly accelerate recovery under failures and for rescaling. In this context, we go into details about improved execution graph recovery, caching state on task managers, and considering new features of modern storage architectures for our state backends.
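As a rough illustration of the incremental-checkpointing feature mentioned above, this sketch enables the RocksDB state backend with incremental snapshots using the older, pre-1.13 API; the checkpoint URI and interval are assumptions, and newer Flink versions would use EmbeddedRocksDBStateBackend plus a checkpoint storage setting instead.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every minute.
        env.enableCheckpointing(60_000L);

        // RocksDB keeps state out of the JVM heap; 'true' turns on incremental checkpoints,
        // so only changed SST files are uploaded instead of the full state on every snapshot.
        env.setStateBackend(
                new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // ... build and execute the job as usual ...
    }
}
```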
Flink Forward SF 2017: Jamie Grier - Apache Flink - The latest and greatest (Flink Forward)
Jamie Grier outlines the latest important features in Apache Flink and walks you through building a working demo to show these features off. Topics may include queryable state, dynamic scaling, streaming SQL, very large state support, and whatever is the latest and greatest in March 2017.
Flink Forward Berlin 2017: Dominik Bruhn - Deploying Flink Jobs as Docker Con... (Flink Forward)
This talk will focus on how to package, distribute and deploy Flink jobs by leveraging existing Docker technology: previously, deploying Flink jobs has been a manual task that easily leads to errors. In this talk, we present an approach which works well in a CI/CD environment by automating most steps: from the code of a Flink job in a repository to a running job on a YARN cluster.
Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ... (Flink Forward)
Witnessing the rise of stream processing from the driving seat, we see Apache Flink® and associated technologies used for a wide variety of business applications, from routing data through systems, serving as a backbone for real-time analytics on live data using SQL, detecting credit card fraud, to implementing complete end-to-end social networks. Such applications enable modern data-driven businesses where decisions and actions happen in real-time, and transform traditional businesses to become more data-driven. Observing the variety of these applications implemented using Flink, it becomes apparent that the traditional dividing line between analytics and operational applications is becoming more and more blurry. Historically, operational applications were built using transactional databases, and analytics were done offline. In contrast, Flink's state, checkpoints, and time management are the core building blocks for both operational applications with strong data consistency needs, and for real-time analytics with correctness guarantees. With these shared building blocks, developers start building what is arguably a new class of data-driven applications: applications that are operational in that they serve live systems and at the same time analytical in that they perform complex data analysis. Following application architectures like CQRS and using new features like Flink's queryable state, streaming analytics and online applications move even closer to each other. In this talk, guided by real-world use cases, we present how the unique core concepts behind Flink simplify the development, deployment, and management of data-driven applications, and we conclude with a vision for the future for Flink and stream processing.
Francesco Versaci - Flink in genomics - efficient and scalable processing of ... (Flink Forward)
http://flink-forward.org/kb_sessions/flink-in-genomics-efficient-and-scalable-processing-of-raw-illumina-bcl-data/
A single run in genome sequencing can easily produce several terabytes of data, which subsequently feed a complex pipeline of tools. Typically, the first step in this workflow is a rearrangement of data, roughly equivalent to a matrix transposition, to reconstruct the original DNA fragments from the raw BCL data, where the fragments are sliced and scattered over multiple files. This step is followed by the sorting of the fragments by a specific identifying tag sequence, which is attached during the preparation of the sample. In this talk we will present a parallel program which performs these essential operations. Our BCL converter is shown to have comparable performance to the shared-memory Illumina bcl2fastq tool, while also enabling easy and scalable distributed-memory parallelization. We will describe the techniques we have used to achieve high performance and discuss the features of Flink which we have particularly appreciated as well as the ones which we think are still missing.
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large... (Flink Forward)
This talk shares experiences from deploying and tuning Flink stream processing applications for very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job as particularly demanding, show how to configure and tune a large-scale Flink job, and outline what the Flink community is working on to make the out-of-the-box experience as smooth as possible. We will, for example, dive into - analyzing and tuning checkpointing - selecting and configuring state backends - understanding common bottlenecks - understanding and configuring network parameters
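To make the checkpoint-tuning points above concrete, here is a small sketch of the knobs typically adjusted for demanding jobs; the interval, timeout, and pause values are placeholders, not recommendations.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 2 minutes with exactly-once guarantees.
        env.enableCheckpointing(120_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig cc = env.getCheckpointConfig();
        // Leave the job room to make progress between checkpoints.
        cc.setMinPauseBetweenCheckpoints(30_000L);
        // Abort a checkpoint attempt that takes longer than 10 minutes.
        cc.setCheckpointTimeout(600_000L);
        // Do not pile up concurrent checkpoints on a slow backend.
        cc.setMaxConcurrentCheckpoints(1);
    }
}
```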
Stream Loops on Flink - Reinventing the wheel for the streaming era (Paris Carbone)
This document discusses adding iterative processing capabilities to stream processing systems like Apache Flink. It proposes programming model extensions that treat iterative computations as structured loops over windows. Progress would be tracked using progress timestamps rather than watermarks to allow for arbitrary loop structures. Challenges include managing state and cyclic flow control to avoid deadlocks while encouraging iteration completion.
Keystone Data Pipeline manages several thousand Flink pipelines, with variable workloads. These pipelines are simple routers which consume from Kafka and write to one of three sinks. In order to alleviate our operational overhead, we've implemented autoscaling for our routers. Autoscaling has reduced our resource usage by 25% - 45% (varying by region and time), and has reduced our on-call burden. This talk will take an in-depth look at the mathematics, algorithms, and infrastructure details for implementing autoscaling of simple pipelines at scale. It will also discuss future work for autoscaling complex pipelines.
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo... (Ververica)
Learn how the combination of Apache Kafka and Apache Flink is making stateful stream processing even more expressive and flexible to support applications in streaming that were previously not considered streamable.
The new world of applications and fast data architectures has broken up the database: Raw data persistence comes in the form of event logs, and the state of the world is computed by a stream processor. Apache Kafka provides a strong solution for the event log, while Apache Flink forms a powerful foundation for the computation over the event streams.
In this talk we discuss how Flink’s abstraction and management of application state have evolved over time and how Flink’s snapshot persistence model and Kafka’s log work together to form a base to build ‘versioned applications’. We will also show how end-to-end exactly-once processing works through a smart integration of Kafka’s transactions and Flink’s checkpointing mechanism.
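A minimal sketch of the sink side of that exactly-once integration, assuming the (now legacy) FlinkKafkaProducer with transactional semantics; the topic name, brokers, and timeout are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExactlyOnceKafkaSinkSketch {
    public static FlinkKafkaProducer<String> createSink() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        // The transaction timeout must comfortably exceed the checkpoint interval.
        props.setProperty("transaction.timeout.ms", "900000");

        KafkaSerializationSchema<String> schema = (element, timestamp) ->
                new ProducerRecord<>("output-topic", element.getBytes(StandardCharsets.UTF_8));

        // EXACTLY_ONCE ties Kafka transactions to Flink checkpoints: records become
        // visible to read-committed consumers only when the enclosing checkpoint completes.
        return new FlinkKafkaProducer<>(
                "output-topic", schema, props, FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
    }
}
```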
Flink Forward Berlin 2017: Matt Zimmer - Custom, Complex Windows at Scale Usi... (Flink Forward)
The windowing capabilities offered by most stream processing engines are limited to aligned windows of a fixed duration. However, many real-world event processing use cases don’t fit this rigid structure, resulting in awkward processing pipelines. There haven’t been good alternatives, until recently that is. Apache Flink offers a rich Window API that supports implementing unaligned windows of varying duration. In this talk, Matt Zimmer will discuss using this API at Netflix to aggregate events into windows customized along varying definitions of a session. He will talk about implementation details such as: * Handling out-of-order events * Limiting state build-up while aggregating a subset of events from an event stream * Periodically emitting early results * Creating windows bounded by a type of event Attendees will leave this talk with practical techniques and knowledge to implement their own custom windows in Apache Flink.
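A small sketch of one of the patterns discussed: event-time session windows with a gap-based session definition, here counting events per user with an assumed 30-minute inactivity gap and an assumed (userId, count) input shape.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionWindowSketch {
    // Input: (userId, 1L) pairs with event-time timestamps already assigned upstream.
    public static DataStream<Tuple2<String, Long>> eventsPerSession(
            DataStream<Tuple2<String, Long>> events) {
        return events
                .keyBy(e -> e.f0)
                // A session ends after 30 minutes without activity for that key.
                .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
                // Sum the counts within each session window.
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
    }
}
```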
In this talk, we describe the design and implementation of the Python Streaming API support that has been submitted for inclusion in mainline Flink. Python is one of the most popular programming languages for data analysis. Its readability emphasizes development productivity, and as a scripting language it requires neither compilation nor a complex development environment setup. Flink already has support for Python APIs for batch programming; unfortunately, the mechanism used to support batch programs (i.e., the DataSet API) does not work for the Streaming API. We describe the limitations of the batch implementation and provide insights into how we solved this using Jython. We will walk through some example programs using the new Python API and compare programmability and performance with the Java and Scala streaming APIs.
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'... (Ververica)
Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays.
The fact that stream processing is gaining rapid adoption is also due to more powerful and maturing technology (much of it open source at the ASF) that has solved many of the hard technical challenges.
We discuss Apache Flink's approach to high performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also take a sneak peek at the next steps for Flink.
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ... (Flink Forward)
Apache Flink's DataStream API is very expressive and gives users precise control over time and state. However, many applications do not require this level of expressiveness and can be implemented more concisely and easily with a domain-specific API. SQL is undoubtedly the most widely used language for data processing but usually applied in the domain of batch processing. Apache Flink features two relational APIs for unified stream and batch processing, the Table API, a language-integrated relational query API for Scala and Java, and SQL. A Table API or SQL query computes the same result regardless of whether it is evaluated on a static file or on a Kafka topic. While Flink evaluates queries on batch input like a conventional query engine, queries on streaming input are continuously processed and their results constantly updated and refined. In this talk we present Flink's unified relational APIs, show how streaming SQL queries are processed, and discuss exciting new use-cases.
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink (Ververica)
As Apache Flink continues to push the boundaries of stateful stream processing as an integral part of its past releases, increasing numbers of users are starting to realize the potential of stateful stream processing as a promising paradigm for robust and reactive data analytics as well as event-driven applications.
This talk aims at covering the general idea and motivations of stateful stream processing, and how Flink enables it with its powerful set of state management features and programming APIs. In addition to that, we will also take a look at the recent advancements related to Flink's state management and large state handling that were driven by our team at data Artisans in the latest version 1.3 (expected release by end of May / early June).
This document discusses the C100K problem of handling 100,000 concurrent network connections efficiently and describes how the Go programming language solves this problem. It explains that Go uses lightweight goroutines instead of OS threads, has a fast scheduler, and uses non-blocking I/O with epoll to efficiently handle a large number of clients with a small memory footprint on each CPU core. An example TCP/HTTP server is shown to demonstrate how Go implements networking.
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland (Flink Forward)
Apache Flink, a powerful distributed stateful stream processing framework, is an especially good fit for deployment on a containerization platform: its storage requirement is primarily external (e.g. HDFS or S3), clusters often share the lifetime of the jobs that run on them, and the flexibility of allocating resources on such a platform allows for scaling jobs up and down as necessary. In this talk I will give a brief introduction to Apache Flink, then describe the journey to making it a first-class citizen of the container world. I will cover my experience preparing to publish the “official repository” of Flink images on Docker Hub, the challenges of fitting a Flink deployment in a Kubernetes-shaped box, and the rough edges of Flink itself that were exposed by this process.
This document provides an overview of Apache Flink and stream processing. It discusses how stream processing has changed data infrastructure by enabling real-time analysis with low latency. Traditional batch processing had limitations like high latency of hours. Flink allows analyzing streaming data with sub-second latency using mechanisms like windows, state handling, and fault tolerance through distributed snapshots. The document benchmarks Flink performance against other frameworks on a Yahoo! production use case, finding Flink can achieve over 15 million messages/second throughput.
What's new in 1.9.0 blink planner - Kurt Young, Alibaba (Flink Forward)
Flink 1.9.0 added the ability to support multiple SQL planners under the same API. With this, we successfully merged many features that come from Alibaba's internal Flink version, called Blink. In this talk, I will give an introduction to the architecture of the Blink planner and share the functionality and performance enhancements we added.
Flink Forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil... (Flink Forward)
We present a new design pattern for data streaming applications, using Apache Flink and Apache Kafka: Building applications directly on top of the stream processor, rather than on top of key/value databases populated by data streams. Unlike classical setups that use stream processors or libraries to pre-process/aggregate events and update a database with the results, this setup simply gives the role of the database to the stream processor (here Apache Flink), routing queries to its workers who directly answer them from their internal state computed over the log of events (Apache Kafka). This talk will cover both the high-level introduction to the architecture, the techniques in Flink/Kafka that make this approach possible, as well as a live demo.
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami... (Flink Forward)
Apache Flink provides powerful stream processing capabilities which can allow organizations to move directly from batch to real time analytics, skipping the lambda architecture entirely. However, getting to production is not always as simple as rewriting your job in a new API, but requires rethinking your application design with a stream first mindset. This talk will cover MediaMath’s journey in rebuilding its reporting infrastructure using Apache Flink. We will discuss high level architectural designs when building an extensible reporting platform as well as deep dive into specific technical hurdles. Topics will include managing a Flink cluster on EC2 spot instances, reconciling Flink’s consistency model with S3’s, handling massive data skew as well as tools and techniques for building performant, fault tolerant streaming applications.
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes... (Flink Forward)
This document discusses providing an R dataframe abstraction for efficient distributed computation on Apache Flink. The goals are to provide a natural API for R and achieve performance comparable to Flink's native dataflow. The approach represents R dataframes as Flink data sets and compiles R functions into the native execution plan where possible. For user-defined R functions, they are evaluated within worker tasks using a just-in-time compiler. This allows executing R code within the same Java virtual machine as Flink for good performance, even on a single node. Results show it can achieve native Flink performance even for functions containing R code.
Flink 1.0 includes major features such as out-of-core state using RocksDB, savepoints for upgrading jobs and versions, and a CEP library for pattern detection. Savepoints allow taking snapshots of streaming jobs for code upgrades, Flink version upgrades, and testing through time travel. Flink 1.0 introduces backwards compatibility and pushes production streaming further through these features.
These are the slides that supported the presentation on Apache Flink at ApacheCon Budapest.
Apache Flink is a platform for efficient, distributed, general-purpose data processing.
Delivering User Behavior Analytics at Apache Hadoop Scale: A new perspective... (Cloudera, Inc.)
Learn how to:
* Detect threats automatically and accurately
* Reduce threat response times from 7 days to 4 hours
* Ingest and process 100+TB per day for automated machine learning and behavior-based detection
This document provides an overview of Apache Flink, an open-source platform for distributed stream and batch data processing. Flink allows for unified batch and stream processing with a simple yet powerful programming model. It features native stream processing, exactly-once fault tolerance based on consistent snapshots, and high performance optimized for streaming workloads. The document outlines Flink's APIs, state management, fault tolerance approach, and roadmap for continued improvements in 2015.
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time (Flink Forward)
This document discusses processing scientific mass spectrometry data in real-time using parallel and distributed computing techniques. It describes how a mass spectrometry experiment produces terabytes of data that currently takes over 24 hours to fully process. The document proposes using MapReduce and Apache Flink to parallelize the data processing across clusters to help speed it up towards real-time analysis. Initial tests show Flink can process the data 2-3 times faster than traditional Hadoop MapReduce. Finally, it discusses simulating real-time streaming of the data using Kafka and Flink Streaming to enable processing results within 10 seconds of the experiment completing.
Apache Flink Training: DataStream API Part 2 Advanced (Flink Forward)
Flink can handle many data types and provides a type system to identify types for serialization and comparisons. Composite types like Tuples and POJOs can be used and fields within them can define keys. Windows provide a way to perform aggregations over finite slices of infinite streams. Connected streams allow correlating and joining multiple streams. Stateful functions have access to local and partitioned state for stateful stream processing. Kafka integration allows consuming from and producing to Kafka topics.
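To make the "stateful functions with partitioned state" point concrete, here is a minimal sketch of keyed ValueState in a RichFlatMapFunction; the (userId, amount) input shape is an assumption.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keeps a running sum per key; must be applied on a stream keyed by the user id.
public class RunningSum extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(
                new ValueStateDescriptor<>("running-sum", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> value, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = sum.value();          // state is scoped to the current key
        long updated = (current == null ? 0L : current) + value.f1;
        sum.update(updated);
        out.collect(Tuple2.of(value.f0, updated));
    }
}
```

It would be used as `input.keyBy(t -> t.f0).flatMap(new RunningSum())`.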
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink (Flink Forward)
Suneel Marthi gave a talk about BigPetStore, a blueprint for Apache Flink applications that uses synthetic data generators. BigPetStore includes data generators, examples using tools like MapReduce, Spark and Flink to process the generated data, and tests for integration. It is used for templates, education, testing, demos and benchmarking. The talk outlined the history and components of BigPetStore and described upcoming work to expand it for Flink, including batch and table API examples and machine learning algorithms.
Capital One is a large consumer and commercial bank that wanted to improve its real-time monitoring of customer activity data to detect and resolve issues quickly. Its legacy solution was expensive, proprietary, and lacked real-time and advanced analytics capabilities. Capital One implemented a new solution using Apache Flink for its real-time stream processing abilities. Flink provided cost-effective, real-time event processing and advanced analytics on data streams to help meet Capital One's goals. It also aligned with the company's technology strategy of using open source solutions.
Timing is Everything: Understanding Event-Time Processing in Flink SQL (HostedbyConfluent)
"In the stream processing context, event-time processing means the events are processed based on when the events occurred, rather than when the events are observed (processing-time) in the system. Apache Flink has a powerful framework for event-time processing, which plays a pivotal role in ensuring temporal order and result accuracy.
In this talk, we will introduce Flink event-time semantics and demonstrate how watermarks as a means of handling late-arriving events are generated, propagated, and triggered using Flink SQL. We will explore operators such as window and join that are often used with event time processing, and how different configurations can impact the processing speed, cost and correctness.
Join us for this exploration where event-time theory meets practical SQL implementation, providing you with the tools to make informed decisions for making optimal trade-offs."
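A small sketch of the kind of query the talk discusses, assuming Flink SQL's WATERMARK DDL clause and the TUMBLE group window; the table, columns, and the datagen connector are made up for illustration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class EventTimeSqlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Declare an event-time attribute and a watermark that tolerates 5s of out-of-orderness.
        tEnv.executeSql(
                "CREATE TABLE clicks (" +
                "  user_id STRING," +
                "  ts TIMESTAMP(3)," +
                "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
                ") WITH (" +
                "  'connector' = 'datagen'" +
                ")");

        // Count clicks per user in 1-minute tumbling event-time windows.
        tEnv.executeSql(
                "SELECT user_id, TUMBLE_END(ts, INTERVAL '1' MINUTE) AS w_end, COUNT(*) AS c " +
                "FROM clicks " +
                "GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)").print();
    }
}
```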
This document provides an overview of Apache Flink, an open-source framework for distributed stream and batch data processing. It discusses key aspects of Flink including that it executes everything as data streams, supports iterative and cyclic data flows, allows mutable state in operators, and provides high availability and checkpointing of operator state. It also provides examples of using Flink's DataStream API to perform operations like hourly and daily tweet impression counts on a continuous stream of tweet data from Kafka.
Stream processing with Apache Flink - Maximilian Michels, Data Artisans (Evention)
Apache Flink is an open source platform for distributed stream and batch data processing. At its core, Flink is a streaming dataflow engine which provides data distribution, communication, and fault tolerance for distributed computations over data streams. On top of this core, APIs make it easy to develop distributed data analysis programs. Libraries for graph processing or machine learning provide convenient abstractions for solving large-scale problems. Apache Flink integrates with a multitude of other open source systems like Hadoop, databases, or message queues. Its streaming capabilities make it a perfect fit for traditional batch processing as well as state of the art stream processing.
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ... (Ververica)
This document discusses Apache Flink and how it enables accurate analytics for Internet of Things (IoT) applications through stateful event-time stream processing. It begins by defining IoT and event-time stream processing, explaining that IoT data is continuously generated and has timestamps. It then discusses challenges like time mismatches between event time and processing time. The document also covers Flink's capabilities for stateful stream processing including failure handling through checkpoints, updating applications using savepoints, and high availability of the JobManager. It positions Flink as a stateful stream processor well-suited for IoT use cases.
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana... (Big Data Spain)
This document discusses Apache Flink for IoT event-time stream processing. It begins by introducing streaming architectures and Flink. It then discusses how IoT data has important properties like continuous data production and event timestamps that require event-time based processing. Examples are provided of companies like King and Bouygues Telecom using Flink for billions of events per day with challenges like out-of-order data and flexible windowing. Event-time processing in Flink is able to handle these challenges through features like watermarks.
Understanding time in structured streaming (datamantra)
This document discusses time abstractions in structured streaming. It introduces process time, event time, and ingestion time. It explains how to use the window API to apply windows over these different time abstractions. It also discusses handling late events using watermarks and implementing non-time based windows using custom state management and sessionization.
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...) (Timo Walther)
Apache Flink is a distributed, stateful stream processor. It features exactly-once state consistency, sophisticated event-time support, high throughput and low latency processing, and APIs at different levels of abstraction (Java, Scala, SQL). In my talk, I'll give an introduction to Apache Flink, its features and discuss the use cases it solves. I'll explain why batch is just a special case of stream processing, how its community evolves Flink into a truly unified stream and batch processor and what this means for its users.
https://www.meetup.com/de-DE/Bangalore-Apache-Kafka-Group/events/265285812/
https://www.youtube.com/watch?v=Ych5bbmDIoA&list=PLvkUPePDi9sa27SG9eGNXH25cfUeo_WY9&index=2
Running Flink in Production: The good, The bad and The in Between - Lakshmi ... (Flink Forward)
The streaming platform team at Lyft has been running Flink jobs in production for more than a year now, powering critical use cases like improving pickup ETA accuracy, dynamic pricing, generating machine learning features for fraud detection, real-time analytics among many others. Broadly, the jobs fall into two abstraction layers: applications (Flink jobs that run on the native platform) and analytics (that leverage Dryft, Lyft’s fully managed data processing engine). This talk will give an overview of the platform architecture, deployment model and user experience. The talk will also dive deeper into some of the challenges and the lessons that were learnt, running Flink jobs at scale, specifically around scaling Flink connectors, dealing with event time skew (source synchronization) and highlight common patterns of problems observed across several Flink jobs. Finally, the talk will give insights into how we are re-architecting the streaming platform @ Lyft using a Kubernetes based deployment.
Introduction to Stateful Stream Processing with Apache Flink (Konstantinos Kloudas)
Kostas Kloudas presented on stateful stream processing with Apache Flink. He discussed how Flink handles state management, fault tolerance, and time semantics to allow for continuous and accurate processing of streaming data. Flink embeds local state with keyed streams, takes consistent snapshots of distributed state, and uses watermarks to process events in event time to produce correct results even for out-of-order data. This allows Flink to provide a robust stream processing engine that scales to large deployments.
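As a concrete illustration of the watermark mechanism mentioned above, here is a minimal sketch using the WatermarkStrategy API available since Flink 1.11; the event shape and the 5-second out-of-orderness bound are assumptions.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

public class WatermarkSketch {
    // Input: (sensorId, epochMillis) events that may arrive up to 5 seconds out of order.
    public static DataStream<Tuple2<String, Long>> withEventTime(
            DataStream<Tuple2<String, Long>> events) {
        WatermarkStrategy<Tuple2<String, Long>> strategy =
                WatermarkStrategy
                        .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        // Extract the event timestamp from the second tuple field.
                        .withTimestampAssigner((event, previousTimestamp) -> event.f1);
        return events.assignTimestampsAndWatermarks(strategy);
    }
}
```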
When Streaming Needs Batch With Konstantin Knauf | Current 2022 (HostedbyConfluent)
A streaming application is started once and then continuously ingests endless, fairly steady streams of events. That's as far as the theory goes.
Unfortunately, reality is more complicated. Over time your application's ability to process large historical data sets robustly, efficiently and correctly will be critical:
- for exploratory data analysis during development
- for bootstrapping the initial state of an application
- for back-filling following an outage or bugfix
- for keeping up with bursty input streams
These scenarios call for batch processing techniques. Apache Flink is as streaming-first as it gets. Yet over the last releases, the community has invested significant resources into unifying stream and batch processing on all layers of the stack, from the scheduler to the APIs.
In this talk, I'll introduce Apache Flink's approach to unified stream and batch processing and discuss - by example - how these scenarios can already be addressed today and what might be possible in the future.
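To make the unification tangible: since Flink 1.12 the same DataStream program can run in batch execution mode, e.g. for back-filling from bounded input. A minimal sketch, with a toy bounded source standing in for real input:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackfillSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // BATCH mode uses blocking shuffles and runs the bounded job stage by stage;
        // the same pipeline code runs unchanged in STREAMING mode on live input.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements(3, 1, 4, 1, 5)
           .keyBy(x -> x % 2)      // group by odd/even as a stand-in for a real key
           .reduce(Integer::sum)   // aggregate per key
           .print();

        env.execute("backfill-sketch");
    }
}
```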
Getting Data In and Out of Flink - Understanding Flink and Its Connector Ecos... (HostedbyConfluent)
This document discusses Flink's connector ecosystem and how to get data in and out of Flink. It describes Flink's layered APIs, including Flink SQL, Table API, DataStream API, and ProcessFunction. It also covers Flink's source and sink interfaces, including the unified source and sink interfaces, hybrid sources, and watermark alignment. The document provides guidance on when to write custom connectors versus leveraging Flink's async I/O capabilities. It notes that Source- and SinkFunction are being deprecated and encourages contributing to existing connectors or building new ones.
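A brief sketch of the unified Source interface the talk points to (the FLIP-27-based KafkaSource builder that supersedes the SourceFunction-based consumer); brokers, topic, and group id are placeholders.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedKafkaSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events")
                .setGroupId("demo")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // fromSource() replaces addSource(SourceFunction) for the new unified connectors.
        DataStream<String> events =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events");

        events.print();
        env.execute("unified-source-sketch");
    }
}
```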
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1VhSzmy.
Robert Metzger provides an overview of the Apache Flink internals and its streaming-first philosophy, as well as the programming APIs. Filmed at qconlondon.com.
Robert Metzger is a PMC member at the Apache Flink project and a cofounder and software engineer at data Artisans. He is the author of many Flink components including the Kafka and YARN connectors.
On September 21st, we had the pleasure of hosting at our offices a Meetup given by our colleague Paco Guerrero on the Apache Flink platform.
"Apache Flink es una plataforma open source de procesamiento en tiempo real, que está en auge al ofrecer características de las que otras tecnologías con las que compite no disponen, sin impacto en su rendimiento. En esta formación introduciremos la filosofía y motor de procesamiento que hace a Flink tan especial y potente. También recorreremos los pilares básicos que confirman a Flink como la plataforma de streaming más prometedora actualmente"
The upcoming Apache Flink 0.10 release will include features such as high availability of the JobManager through Zookeeper, live monitoring of accumulators and metrics, improved event-time and windowing capabilities using watermarks, and exactly-once fault tolerance through distributed snapshots. A demo will also show how fault tolerance works to ensure state consistency during failures. More improvements are still being worked on for this release.
The need for gleaning answers from data in real time is moving from a nicety to a necessity. There are few options for analyzing the never-ending stream of unbounded data at scale. Let's compare and contrast the core principles and technologies of the different open source solutions available to help with this endeavor, and where processing engines need to evolve to solve processing needs at scale. These findings are based on the experience of continuing to build a scalable solution in the cloud to process over 700 billion events at Netflix, and how we are embarking on the next journey to evolve unbounded data processing engines.
Data Stream Processing with Apache FlinkFabian Hueske
This talk is an introduction to Stream Processing with Apache Flink. I gave this talk at the Madrid Apache Flink Meetup on February 25th, 2016.
The talk discusses Flink's features, shows its DataStream API, and explains the benefits of event-time stream processing. It gives an outlook on some features that will be added after the 1.0 release.
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...Ververica
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Analytics
Thursday 17th, from 18:00 to 18:40, Theatre 19 - Keynote
In this talk I’ll give a very short introduction to stream processing in general and then dive into event-time based stream processing. I will outline how this is important for IoT applications and also why it is such a challenging topic. Afterwards we’ll look at some real-world IoT use cases that are enabled by the support for robust event-time based stream processing provided by Apache Flink™. We will especially focus on ease of use and on correctness of results in the face of errors.
In the first half of the talk we’ll cover the basics of stream processing. We will look at the differences between event-time based and processing-time based processing, and at stateful stream processing. Along the way, we’ll also highlight how the combination of these features is essential for robust stream processing in an IoT setting.
In the second part, we will look at how Flink solves some of the challenges that arise in event-time based processing and how that enables novel applications in the IoT space. We will do the latter by looking at a collection of real-world IoT use cases.
Some of the topics covered will be:
- Apache Flink
- Stateful Stream Processing
- Event Time vs. Processing Time Windowing
- Processing of out-of-order events
- IoT use cases
In this talk about Apache Flink we will touch on three main things: an introductory look at Flink, a look under the hood, and a demo.
* In the introduction we will briefly look at the history of Flink and then go on to the API and different use cases. Here we will also see how it can be deployed in practice and what some of the pitfalls in a cluster setting can be.
* In the second section we will look at the streaming execution engine that lies at the heart of Flink. Here we will see what makes it tick and also what distinguishes it from other approaches, such as the mini-batch execution model.
* In the final section we will see a live demo of a fault-tolerant streaming job that performs analysis of the Wikipedia edit stream.
Ufuk Celebi - PMC member at Apache Flink and co-founder and software engineer at data Artisans
5. 00 This session will be about ...
● Flink’s notion of time in streaming jobs
● How Watermarks support Event-Time Processing
● Flink’s fault-tolerant, exactly-once streaming semantics
● Flink’s distributed snapshot checkpointing
● Out-of-core streaming state backends
7. 01 Different Kinds of “Time”
● Processing Time:
○ The timestamp at which a system processes an event
○ “Wall Time”
● Ingestion Time:
○ The timestamp at which a system receives an event
○ “Wall Time”
● Event Time:
○ The timestamp at which an event is generated
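As a rough illustration (not from the slides), this is how a job opts into event-time semantics in the Flink 1.x Scala API of the time; note that setStreamTimeCharacteristic has since been deprecated in newer Flink versions:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// interpret timestamps embedded in the events, not the machine's wall clock
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)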
9. 02 Why Wall Time is Incorrect
● Think of a Twitter hash-tag count every 5 minutes
○ We would want the result to reflect the number of tweets actually tweeted in each 5-minute window
○ Not the number of tweet events the stream processor receives within 5 minutes
10. 02 Why Wall Time is Incorrect
● Think of replaying a Kafka topic on a windowed streaming application …
○ If you’re replaying a queue, windows are definitely wrong if you use a wall clock
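A minimal sketch of the hash-tag example in the Scala DataStream API (the input records and stream names below are made up for illustration); because the windows are driven by event time, replaying the same records yields the same counts:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// (hashtag, event timestamp in ms) -- a stand-in for a real tweet source
val tweets = env.fromElements(("#flink", 1000L), ("#flink", 2000L), ("#kafka", 3000L))

val counts = tweets
  .assignAscendingTimestamps(_._2)                       // use the embedded event timestamp
  .map(t => (t._1, 1))
  .keyBy(0)                                              // partition by hashtag
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))  // 5-minute event-time windows
  .sum(1)                                                // tweets per hashtag per window

counts.print()
env.execute("hashtag-counts")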
11. 03 Watermarks & Event-Time
● Watermarks are Flink’s way of monitoring the progress of event time
● A watermark is essentially a record that flows within the data stream
● Watermarks carry a timestamp t; when a task receives a watermark with timestamp t, it knows that there will be no more events with timestamp t’ ≤ t
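For streams that arrive out of order, a common pattern (sketched below with a made-up SensorReading type and a hypothetical withEventTime helper) is a bounded-out-of-orderness extractor: the emitted watermark trails the largest timestamp seen so far by a fixed delay:

import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class SensorReading(id: String, timestamp: Long, value: Double)

// `readings` is assumed to come from some timestamped source
def withEventTime(readings: DataStream[SensorReading]): DataStream[SensorReading] =
  readings.assignTimestampsAndWatermarks(
    // watermark = max timestamp seen - 10 s, so events may be up to 10 s late
    new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(10)) {
      override def extractTimestamp(r: SensorReading): Long = r.timestamp
    })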
16. 07 Stateful Streaming
● Any non-trivial streaming application is stateful
● To draw insights from a stream you usually need to look beyond a single record
● Any kind of aggregation is stateful (e.g. windows)
17. 08 What “state” looks like in Flink
● Any Flink task can be stateful
● State is partitioned with the streams that are read by stateful tasks
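As a hedged sketch of keyed state (the CountPerKey name is made up): a RichFlatMapFunction keeps a running count per key in a ValueState, which Flink partitions along with the keyed stream and includes in every checkpoint:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// emits (key, running count) and keeps the count in Flink-managed keyed state
class CountPerKey extends RichFlatMapFunction[(String, Int), (String, Long)] {
  private var count: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("count", classOf[java.lang.Long]))
  }

  override def flatMap(in: (String, Int), out: Collector[(String, Long)]): Unit = {
    val updated = Option(count.value()).map(_.longValue()).getOrElse(0L) + 1L
    count.update(updated)            // state is scoped to the current key automatically
    out.collect((in._1, updated))
  }
}

// usage on a keyed stream: stream.keyBy(0).flatMap(new CountPerKey)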
18. 09 Distributed Snapshots
● On each checkpoint trigger, TaskManagers tell all stateful tasks that they manage to snapshot their own state
● When complete, each task sends a checkpoint acknowledgement to the JobManager
● Based on the Chandy-Lamport distributed snapshot algorithm
19. 09 Distributed Snapshots
● On a checkpoint trigger by the JobManager, a checkpoint barrier is injected into the stream
20. 10 Distributed Snapshots
● When a task receives a checkpoint barrier, its state is checkpointed to a state backend
● A pointer to the stored state is stored in the distributed snapshot
21. 11 Distributed Snapshots
● After all stateful tasks acknowledge, the distributed snapshot is completed
● Only fully completed snapshots are used for restore on failure
22. 12 Checkpointing API
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(100) // trigger a checkpoint every 100 ms
env.setStateBackend(new RocksDBStateBackend(...)) // constructor arguments (checkpoint URI, etc.) elided on the slide
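Building on the slide's snippet, a hypothetical tuning (not from the deck) using the same checkpointing API:

import org.apache.flink.streaming.api.CheckpointingMode

env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE) // checkpoint every 60 s with exactly-once barriers
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(30000)   // leave at least 30 s between checkpoints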
23. 13 Flink Streaming Savepoints
● Basically, a checkpoint that is persisted in the state backend
● Allows for stream progress “versioning”
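Operationally, this versioning is driven from Flink's command-line client; a minimal sketch, where the job id, savepoint path, and jar name are placeholders:

# take a savepoint of the running job
bin/flink savepoint <jobId>
# later: resume a (possibly fixed or rescaled) version of the job from it
bin/flink run -s <savepointPath> my-streaming-job.jar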
24. 14 Power of Savepoints
● No stateless point in time
25. 14 Power of Savepoints
● Reprocessing as batch
26. 14 Power of Savepoints
● Reprocessing as batch (corrupt state)
27. 14 Power of Savepoints
● Reprocessing as streaming, starting from savepoint
28. 15 Power of Savepoints
● Reprocessing as streaming, starting from savepoint