Aljoscha Krettek – Notions of Time

This document discusses how Apache Flink handles time and windows in streaming data. It explains that streaming data never stops arriving, so windows are used to bucket incoming elements. Windows can be defined based on event time (the timestamp of when events occurred) or processing time (when the system processed the events). Event time is more accurate but processing time is easier to implement. Flink allows for windows based on event time by using watermarks to track the progress of event times and ensure windows have all elements. The document provides an example of how to define event time and processing time windows using the Flink API.

Notions of Time
Aljoscha Krettek
aljoscha@apache.org
@aljoscha
How Apache Flink™ Handles Time and Windows

6
In Streaming:
Arriving data never stops!

7
Solution:
Put elements into buckets,
these are called windows

8
Window (5 min)
Count #Hashtags
Just saw #Trump on
#CNN, super cool. :D
Trump: 2394
Cheese: 12984
Money: 42
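The counting done inside one bucket can be sketched in plain Java, without Flink. This is only an illustration: the class name HashtagCount, the regular expression, and the second sample tweet are assumptions, not taken from the slides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts the hashtags of all tweets that fell into one window.
final class HashtagCount {
    // Assumption: a hashtag is '#' followed by one or more word characters.
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    static Map<String, Integer> count(List<String> tweetsInWindow) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tweet : tweetsInWindow) {
            Matcher m = HASHTAG.matcher(tweet);
            while (m.find()) {
                // Increment the counter for this hashtag, starting at 1.
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> window = Arrays.asList(
            "Just saw #Trump on #CNN, super cool. :D",
            "More #Cheese please");
        System.out.println(count(window)); // prints {CNN=1, Cheese=1, Trump=1}
    }
}
```

Because the stream never ends, a count like this only makes sense per window: each bucket gets its own map, and the result is emitted when the bucket closes.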

9
What I didn’t mention
• tweets have a timestamp,
their event time
• tweets from across the globe
arrive with delay
=> tweets with different
timestamps arrive out-of-order

10
Window (5 min)
Count #Hashtags
12:34 (13.10.2015):
Just saw #Trump on
#CNN, super cool. :D
Trump: 2394
Cheese: 12984
Money: 42
These arrive with
3 minutes slack
Form windows based
on processing time
of the machine.
Processing Time != Event Time
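A small plain-Java sketch of why this matters, using the tweet from the slide (event time 12:34, arriving with 3 minutes slack). The class and method names are hypothetical, and windows are assumed to be aligned to full 5-minute marks.

```java
import java.time.LocalDateTime;

// Hypothetical illustration: the same tweet lands in different 5-minute
// windows depending on which notion of time is used.
final class TimeNotions {
    static final int WINDOW_MINUTES = 5;

    // Start of the 5-minute window containing t (assumed aligned to :00, :05, :10, ...).
    static LocalDateTime windowStart(LocalDateTime t) {
        int minute = t.getMinute() - (t.getMinute() % WINDOW_MINUTES);
        return t.withMinute(minute).withSecond(0).withNano(0);
    }

    public static void main(String[] args) {
        LocalDateTime eventTime = LocalDateTime.of(2015, 10, 13, 12, 34);
        LocalDateTime arrivalTime = eventTime.plusMinutes(3); // 3 minutes slack

        System.out.println("event-time window starts at      " + windowStart(eventTime));
        System.out.println("processing-time window starts at " + windowStart(arrivalTime));
    }
}
```

By event time the tweet belongs to the window starting at 12:30; by processing time it is counted in the window starting at 12:35, so the two counts disagree.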

11
Why do people use this?
• easy to implement
• low latency
• this is what systems give you
(Spark Streaming, Apex,
Samza, Storm)*
*not Google Cloud Dataflow

13
Window (5 min)
Correlate Tweets
and News
something...
These still have 3 min slack.
These have 8 min slack.
12:33 (13.10.2015):
Donald Trump speaks
at Cheese conference.
Processing Time != Event Time

Processing Time != Event Time
=> Mismatch in the
timespace continuum

15
Use cases
• out-of-order elements
• sources with delay
• recovery/fault-tolerance
• “catching up” with a stream
Who does it?
• Google Cloud Dataflow
• Apache Flink

17
We need a
Global Clock
that runs on
event time
instead of
processing time.

18
This is a source
This is our window operator
This is the current event time
This is a watermark.
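How a window operator might use watermarks can be sketched in plain Java. This is a simplified, hypothetical model and not Flink's implementation: the class name is made up, a watermark W is assumed to mean "no element with a timestamp below W will arrive anymore", and late elements are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified event-time window operator (not Flink code).
final class WatermarkWindowOperator {
    private final long windowSize;
    // window start -> number of buffered elements, ordered by window start
    private final TreeMap<Long, Integer> open = new TreeMap<>();
    // the operator's event-time clock; it only moves forward
    private long currentWatermark = Long.MIN_VALUE;

    WatermarkWindowOperator(long windowSize) {
        this.windowSize = windowSize;
    }

    // Buffers an element in the window that contains its event timestamp.
    void onElement(long timestamp) {
        long start = timestamp - Math.floorMod(timestamp, windowSize);
        open.merge(start, 1, Integer::sum);
    }

    // Advances event time and returns "start:count" for every window
    // whose end is not after the watermark, i.e. that is now complete.
    List<String> onWatermark(long watermark) {
        currentWatermark = Math.max(currentWatermark, watermark);
        List<String> fired = new ArrayList<>();
        while (!open.isEmpty() && open.firstKey() + windowSize <= currentWatermark) {
            Map.Entry<Long, Integer> e = open.pollFirstEntry();
            fired.add(e.getKey() + ":" + e.getValue());
        }
        return fired;
    }
}
```

With a window size of 4, feeding the elements 1, 5, 2 (out of order) and then the watermark 3 fires nothing; the watermark 4 fires window 0 with two elements, and the watermark 8 fires window 4 with one element. The operator never looks at the wall clock, which is the point of the global event-time clock.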

20
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// Windows are evaluated against the wall clock of the processing machine.
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
DataStream<Tweet> text = env.addSource(new TwitterSrc());
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new ExtractHashtags())
    .keyBy("name")
    // 5-minute windows per hashtag
    .timeWindow(Time.of(5, TimeUnit.MINUTES))
    .apply(new HashtagCounter());
Processing Time

21
Event Time
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// Windows are evaluated against the timestamps carried by the tweets.
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
DataStream<Tweet> text = env.addSource(new TwitterSrc());
// Extract each tweet's event timestamp so that watermarks can be generated.
text = text.assignTimestamps(new MyTimestampExtractor());
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new ExtractHashtags())
    .keyBy("name")
    .timeWindow(Time.of(5, TimeUnit.MINUTES))
    .apply(new HashtagCounter());

22
TL;DL*
• stream data is infinite
• windows are helpful
• event-time != processing time
• watermarks to the rescue
• Flink can do it
*too long, didn’t listen

24
Tumbling Windows of 4 Seconds
[Figure: elements on a timeline, grouped by timestamp into the 4-second windows 0-3, 4-7, 8-11, 20-23, 24-27 and 32-35]
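Assigning elements to tumbling windows of 4 seconds can be sketched as follows. The class name and the sample timestamps are illustrative assumptions, not values read from the slide; timestamps are taken to be non-negative seconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of tumbling-window assignment by event timestamp.
final class TumblingWindows {
    // Groups timestamps into windows of the given size, keyed by window start.
    static Map<Long, List<Long>> assign(long[] timestamps, long size) {
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long ts : timestamps) {
            long start = ts - (ts % size); // assumes ts >= 0
            windows.computeIfAbsent(start, k -> new ArrayList<>()).add(ts);
        }
        return windows;
    }

    public static void main(String[] args) {
        long size = 4;
        long[] outOfOrder = {1, 2, 3, 4, 9, 5, 20, 22, 21, 26, 35, 33};
        // Print each window as "start-end: [elements]" with an inclusive end.
        assign(outOfOrder, size).forEach((start, elems) ->
            System.out.println(start + "-" + (start + size - 1) + ": " + elems));
    }
}
```

The window an element belongs to depends only on its timestamp, not on the order or the moment of its arrival, so out-of-order elements still end up in the right bucket.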


Flink SQL on Pulsar made easy

Flink Forward

Flink Forward San Francisco 2022. Based on the new Flink-Pulsar connector, we implemented Flink's TableAPI and Catalog to help users to interact with the Pulsar cluster via Flink SQL easily. We would like to go through the design and implementation of the SQL connector in the following aspects: 1. Two different modes of use Pulsar as a metadata store 2. Data format transformation and management 3. SQL semantics support within Pulsar context by Sijie Guo & Neng Lu

Dynamic Rule-based Real-time Market Data Alerts

Flink Forward

Flink Forward San Francisco 2022. At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey. by Ajay Vyasapeetam & Madhuri Jain

Exactly-Once Financial Data Processing at Scale with Flink and Pinot

Flink Forward

Flink Forward San Francisco 2022. At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing. by Xiang Zhang & Pratyush Sharma & Xiaoman Dong

Processing Semantically-Ordered Streams in Financial Services

Flink Forward

Flink Forward San Francisco 2022. What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases. by Patrick Lucas

Tame the small files problem and optimize data layout for streaming ingestion...

Flink Forward

Flink Forward San Francisco 2022. In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to iceberg tables can suffer by two problems (1) small files problem that can hurt read performance (2) poor data clustering that can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partition. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in details and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling. by Gang Ye & Steven Wu

Batch Processing at Scale with Flink & Iceberg

Flink Forward

Flink Forward San Francisco 2022. Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features. by Andreas Hailu

More from Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...

Evening out the uneven: dealing with skew in Flink

“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...

Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...

Introducing the Apache Flink Kubernetes Operator

Autoscaling Flink with Reactive Mode

Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...

One sink to rule them all: Introducing the new Async Sink

Tuning Apache Kafka Connectors for Flink.pptx

Flink powered stream processing platform at Pinterest

Apache Flink in the Cloud-Native Era

Where is my bottleneck? Performance troubleshooting in Flink

Using the New Apache Flink Kubernetes Operator in a Production Deployment

The Current State of Table API in 2022

Flink SQL on Pulsar made easy

Dynamic Rule-based Real-time Market Data Alerts

Exactly-Once Financial Data Processing at Scale with Flink and Pinot

Processing Semantically-Ordered Streams in Financial Services

Tame the small files problem and optimize data layout for streaming ingestion...

Batch Processing at Scale with Flink & Iceberg


Aljoscha Krettek – Notions of Time

1. Notions of Time Aljoscha Krettek aljoscha@apache.org @aljoscha How Apache Flink™ Handles Time and Windows

2. Adventures in Timespace

3. Why Windows*? *not Microsoft Windows…

4. That’s why…

5. Streaming / Batch

6. In Streaming: Arriving data never stops!

7. Solution: Put elements into buckets, these are called windows

8. Window (5 min) Count #Hashtags Just saw #Trump on #CNN, super cool. :D Trump: 2394 Cheese: 12984 Money: 42

9. What I didn’t mention • tweets have a timestamp, their event time • tweets from across the globe arrive with delay => tweets with different timestamps arrive out-of-order

10. Window (5 min) Count #Hashtags 12:34 (13.10.2015): Just saw #Trump on #CNN, super cool. :D Trump: 2394 Cheese: 12984 Money: 42. These arrive with 3 minutes slack. Form windows based on processing time of the machine. Processing Time != Event Time

11. Why do people use this? • easy to implement • low latency • this is what systems give you (Spark Streaming, Apex, Samza, Storm)* *not Google Cloud Dataflow

12. Let's look at a more complex example.

13. Window (5 min) Correlate Tweets and News something... These still have 3 min slack. These have 8 min slack. 12:33 (13.10.2015): Donald Trump speaks at Cheese conference. Processing Time != Event Time

14. Processing Time != Event Time => Mismatch in the timespace continuum

15. Use cases • out-of-order elements • sources with delay • recovery/fault-tolerance • “catching up” with a stream Who does it? • Google Cloud Dataflow • Apache Flink

16. How can we do this?

17. We need a Global Clock that runs on event time instead of processing time.

18. [Diagram: timestamped elements and watermarks flowing from a source to a window operator] This is a source. This is our window operator. This is the current event time. This is a watermark.

19. Now, show me the API!

20. Processing Time
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Windows are evaluated against the wall clock of the processing machine.
    env.setStreamTimeCharacteristic(ProcessingTime);
    DataStream<Tweet> text = env.addSource(new TwitterSrc());
    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new ExtractHashtags())
        .keyBy("name")
        .timeWindow(Time.of(5, MINUTES))
        .apply(new HashtagCounter());

21. Event Time
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Windows are evaluated against the timestamps carried by the elements.
    env.setStreamTimeCharacteristic(EventTime);
    DataStream<Tweet> text = env.addSource(new TwitterSrc());
    text = text.assignTimestamps(new MyTimestampExtractor());
    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new ExtractHashtags())
        .keyBy("name")
        .timeWindow(Time.of(5, MINUTES))
        .apply(new HashtagCounter());

22. TL;DL* • stream data is infinite • windows are helpful • event-time != processing time • watermarks to the rescue • Flink can do it *too long, didn’t listen

23. flink.apache.org @ApacheFlink

24. Tumbling Windows of 4 Seconds [Figure: elements placed by timestamp into the windows 0-3, 4-7, 8-11, 20-23, 24-27 and 32-35]
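
The tumbling windows on the last slide can be sketched in plain Java, without any Flink dependency. This is only an illustration of the idea: the class name `TumblingWindows`, its methods, and the sample timestamps are assumptions made up for this sketch, not part of the talk or of the Flink API. It assumes non-negative timestamps in seconds and computes, for each element, the start of the 4-second window it falls into.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not a Flink class.
class TumblingWindows {

    // Start of the tumbling window that contains the given timestamp.
    // Assumes timestamp >= 0 and size > 0.
    static long windowStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    // Groups timestamps by the start of their window, ordered by window start.
    static Map<Long, List<Long>> assign(long[] timestamps, long size) {
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long ts : timestamps) {
            windows.computeIfAbsent(windowStart(ts, size), k -> new ArrayList<>()).add(ts);
        }
        return windows;
    }

    public static void main(String[] args) {
        long size = 4;
        // Made-up event timestamps in seconds.
        long[] timestamps = {1, 2, 3, 4, 5, 9, 20, 22, 26, 33};
        assign(timestamps, size).forEach((start, elements) ->
            System.out.println(start + "-" + (start + size - 1) + ": " + elements));
    }
}
```

With these sample timestamps the elements land in the windows 0-3, 4-7, 8-11, 20-23, 24-27 and 32-35. Because the window is derived from the element's own timestamp, the result does not depend on the order in which the elements arrive.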

Editor's Notes

Slack is the amount of time by which elements arrive late.
Catching up, for example with elements in Kafka, you would still want correct windows based on timestamp in elements.
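
The relationship between slack, watermarks and event-time windows can be sketched as follows. This is a simplified model under stated assumptions, not how Flink implements it: the class `WatermarkSketch` and its methods are hypothetical, watermark generation and the window operator are folded into one object, and the watermark is assumed to trail the highest timestamp seen so far by a fixed slack. A window fires once the watermark has passed its last timestamp; elements for a window that has already fired are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model for illustration only; not a Flink class.
class WatermarkSketch {
    private final long windowSize;
    private final long slack;
    // Event-time clock: no element older than this is expected any more.
    private long watermark = Long.MIN_VALUE;
    // Open windows: window start -> number of elements counted so far.
    private final TreeMap<Long, Integer> counts = new TreeMap<>();

    WatermarkSketch(long windowSize, long slack) {
        this.windowSize = windowSize;
        this.slack = slack;
    }

    // Processes one element (timestamp >= 0) and returns the windows that
    // fire as a result, each formatted as "windowStart:count".
    List<String> onElement(long timestamp) {
        List<String> fired = new ArrayList<>();
        long start = timestamp - (timestamp % windowSize);
        if (start + windowSize - 1 <= watermark) {
            return fired; // too late: this window has already fired
        }
        counts.merge(start, 1, Integer::sum);
        // Assumption: the watermark trails the highest timestamp by the slack.
        watermark = Math.max(watermark, timestamp - slack);
        // Fire every window whose last timestamp the watermark has passed.
        while (!counts.isEmpty() && counts.firstKey() + windowSize - 1 <= watermark) {
            Map.Entry<Long, Integer> window = counts.pollFirstEntry();
            fired.add(window.getKey() + ":" + window.getValue());
        }
        return fired;
    }

    public static void main(String[] args) {
        WatermarkSketch sketch = new WatermarkSketch(4, 3);
        // Made-up timestamps; 3 arrives out of order, after 5.
        for (long ts : new long[]{1, 2, 5, 3, 7, 9}) {
            System.out.println(ts + " -> fired " + sketch.onElement(ts));
        }
    }
}
```

In this run the element with timestamp 3 arrives after the one with timestamp 5, yet it is still counted in window 0-3, because that window only fires when the element with timestamp 7 pushes the watermark to 4. The same logic gives correct windows when catching up with a backlog, since nothing depends on the wall clock.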
