Stream Processing with
Apache Flink®
Flink Overview 101
Agenda
01 Intro: The world has changed… some news
02 Understanding the importance of stream processing
03 Why Apache Flink is becoming the de facto standard
04 Enhancing Apache Flink as a cloud-native service
05 Questions
Intro to the Confluent community
Some things you should know about us
Our landing page: https://discover.confluent.io/santander-and-confluent-better-together
There's more! Coming soon:
GenAI & Confluent
Webinar, 7 May 2024
Sign up now!
A quick update from Kafka Summit London
Apache Flink on Confluent Cloud reaches General Availability: We announced the general availability of our fully managed Confluent Cloud service for Apache Flink®, giving customers a truly multicloud solution for running event stream processing on any of the three major CSPs where their data and applications reside. You can find more details in the blog post.
Introducing Tableflow: We presented our vision for Tableflow, a new capability of the Confluent Cloud event streaming platform that lets users turn Apache Kafka topics and their associated schemas into Apache Iceberg® tables with a single click, to better feed data lakes and data warehouses. Tableflow will soon be available in private early access.
Kora, the Confluent Cloud engine, is faster than ever: We revealed that Kora, the engine powering Confluent Cloud's cloud-native Kafka service, is now 16x faster than open-source Kafka.
Connector improvements: New enhancements to Confluent's fully managed connector portfolio were introduced, including DNS forwarding and private egress access points. An improved 99.99% SLA was also announced.
Stream Governance improvements: The Stream Governance service in Confluent keeps evolving and is now enabled by default in all environments, with a 99.99% uptime SLA for Schema Registry. We also announced that regional coverage for Stream Governance will expand to all Confluent Cloud regions, providing smoother access to its exclusive governance capabilities.
Why…
What if we could unify both worlds?
Understanding the importance of
stream processing
Share: Enable frictionless access to up-to-date, trustworthy data products
Stream: Reimagine data streaming everywhere, on-prem and in every major public cloud
Govern: Make data in motion self-service, secure, compliant, and trustworthy
Process: Drive greater data reuse with always-on stream processing
Connect: Make it easy to on-ramp and off-ramp data from existing systems and apps
Stream processing is a critical part of data streaming
[Diagram: data in motion vs. data at rest. In motion: streaming applications (application layer), Apache Flink (compute layer), Apache Kafka (storage layer). At rest: web applications, traditional databases, file systems.]
Stream processing acts as the compute layer to
Kafka, powering real-time applications & pipelines
[Diagram: databases, custom apps, 3rd-party apps, and SaaS apps feed Kafka; each downstream destination (database, data warehouse, queries, analytics, interactions) runs its own processing step.]
Processing downstream of Kafka increases latency, adds costs and redundancy, and inhibits data reuse:
● Increased complexity from redundant processing
● Data systems & applications built on stale data
● Expensive & inefficient to clean and enrich data multiple times
[Diagram: the same sources feed Kafka (storage) with Flink (compute) processing at ingest, upstream of the database, data warehouse, queries, analytics, and interactions.]
Processing data at ingest improves latency, data portability, and cost effectiveness:
● Maximized data reusability & consistency
● Improved cost-efficiency from cleaning & enriching data once
● Real-time apps & data systems reflect current state
Stream Processing
Process your data once, process your data right
[Diagram: many example streams and systems around a central stream-processing hub. Sources include transactions, payments, clickstreams, IoT data, telemetry, inventory, weather, mainframe data, change logs, and customer data; consumers include fraud & SIEM systems, AI/ML engines, recommendation and personalization engines, notification and alerting engines, visualization apps, CRM, payroll, ITSM, incident management, and supply chain systems.]
Stream processing enables users to filter, join, and
enrich streams on-the-fly to drive greater data reuse
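As a small Flink SQL sketch of this idea (all table and column names here are illustrative, not from the deck), one query can filter, join, and enrich a stream in a single pass:

```sql
-- Illustrative: filter, join, and enrich an orders stream on-the-fly.
INSERT INTO enriched_orders
SELECT
  o.order_id,
  o.amount,
  c.segment,                          -- enrichment from a customer table
  o.amount * 1.21 AS amount_with_tax  -- on-the-fly derived value
FROM orders AS o
JOIN customers FOR SYSTEM_TIME AS OF o.order_time AS c
  ON o.customer_id = c.customer_id
WHERE o.amount > 0;                   -- filter out invalid orders
```

The enriched stream lands in a new topic, so every downstream consumer reuses the same cleaned data instead of repeating the work.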
Why Apache Flink is becoming
the de facto standard
[Chart: monthly unique users of two Apache projects born a few years apart. Flink (2020-2022) mirrors the earlier growth curve of Kafka (2016-2018), from 0 to roughly 150,000.]
Flink growth has mirrored the growth of Kafka, the de facto standard for streaming data:
● >75% of the Fortune 500 estimated to be using Kafka
● >100,000 orgs using Kafka
● >41,000 Kafka meetup attendees
● >750 Kafka Improvement Proposals
● >12,000 Jiras for Apache Kafka
Innovative companies have adopted both Kafka & Flink
Digital natives leverage Flink to disrupt markets and
gain competitive advantage
UBER: Real-time Pricing NETFLIX: Personalized Recs STRIPE: Real-time Fraud Detection
Let's have some fun!
Who is who? 463M · 30B · Gaming
WHY
● SQL drives 90% of use cases with less than 50% of the TTM
● SQL makes stream analytics accessible to all data scientists, enabling insights connected to revenue
● Keep and play with state to deep-dive on business questions (e.g., revenue per level, real-time game user counts)
● Extremely auto-scalable
● Low latency
● Temporal control of events
● Wide variety of window handling
● Fault tolerance via checkpoints & savepoints
Flink is a top 5 Apache project and boasts a robust developer community
Scalability and Performance: Flink is capable of supporting stream processing workloads at tremendous scale
Fault Tolerance: Flink's fault tolerance mechanisms ensure it can handle failures effectively and provide high availability
Language Flexibility: Flink supports Java, Python, & SQL with 150+ built-in functions, enabling devs to work in their language of choice
Unified Processing: Flink supports stream processing, batch processing, and ad-hoc analytics through one technology
Customers choose Flink because of its performance and rich feature set
Why are we even doing this?
“Shift Left” with Stream Processing
Expensive & Inefficient: The same data is cleaned, transformed, and enriched in multiple locations (often using legacy technologies).
Varying Degree of Staleness: Every downstream system receives and processes the same data at different times and intervals and provides slightly different semantics. This leads to inconsistencies and a degraded customer experience downstream.
Complex & Error Prone: The same business logic needs to be maintained in multiple places, and multiple processing engines need to be maintained and operated.
[Diagram: databases, custom apps, and SaaS feed Kafka; separate processing steps sit in front of the database, DWH, and data lake serving queries, analytics, and interactions.]
“Shift Left” with Stream Processing
[The same diagram, repeated with a “Shift Left” arrow: processing moves upstream, out of the individual downstream systems and into the stream.]
“Shift Left” with Stream Processing
Cost Efficient: Data is processed only once, in a single place, and it is processed continuously, spreading the work over time.
Fresh Data Everywhere: All applications are supplied with equally fresh data and reflect the current state of your business.
Reusable & Consistent: Data is processed continuously and meets the latency requirements of the most demanding consumers, which further increases reusability.
[Diagram: databases, custom apps, and SaaS feed Kafka; a single processing step in the stream serves the database, DWH, and data lake for queries, analytics, and interactions.]
What can we achieve?
What can I do with Flink?
Data Exploration: Engineers and analysts both need to be able to simply read and understand the event streams stored in Kafka
● Metadata discovery
● Throughput analysis
● Data sampling
● Interactive query
Data Pipelines: Data pipelines are used to enrich, curate, and transform event streams, creating new derived event streams
● Filtering
● Joins
● Projections
● Aggregations
● Flattening
● Enrichment
Real-time Apps: Whole ecosystems of apps feed on event streams, automating action in real time
● Account360
● Next best call
● Quality of service
● Fraud detection
● Intelligent routing
● Abandoned shopping cart
Build streaming
data pipelines to
inform real-time
decision making
Create new enriched and curated
streams of higher value using:
● Data transformations
● Streaming joins, temporal joins,
lookup joins, and versioned joins
● Fan out queries, multi-cluster queries
STREAMING DATA PIPELINES
[Example: an Orders stream (t1: 21.5 USD, t3: 55 EUR, t5: 35.3 EUR) is joined with a Currency rate stream (t0: EUR:USD=1.00, t2: EUR:USD=1.05, t4: EUR:USD=1.10) to produce t1: 21.5 USD, t3: 57.75 USD, t5: 38.83 USD.]
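The currency-conversion example above maps to a temporal (versioned) join in Flink SQL. A sketch, assuming illustrative `orders` and `currency_rates` tables, with `currency_rates` declared as a versioned table keyed by currency:

```sql
-- Join each order against the exchange rate valid at the order's event time.
SELECT
  o.order_time,
  o.amount * r.eur_usd AS amount_usd
FROM orders AS o
JOIN currency_rates FOR SYSTEM_TIME AS OF o.order_time AS r
  ON o.currency = r.currency;
```

`FOR SYSTEM_TIME AS OF` picks the version of the rate row that was current at each order's timestamp, which is why the order at t3 converts with the t2 rate rather than a later one.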
Recognize patterns and react to events in a timely manner
[Chart: a price series over period & volume tracing a “double bottom” pattern, with falling segments where price < lag(price) and rising segments where price > lag(price).]
Develop applications using fine-grained
control over how time progresses and
data is grouped together using:
● Hopping, tumbling, session windows
● OVER aggregations
● Pattern matching with
MATCH_RECOGNIZE
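Hopping, tumbling, and session windows are expressed in Flink SQL with windowing table-valued functions. A sketch of a tumbling window (the `bid` table and its `bidtime` and `price` columns are illustrative):

```sql
-- Sum prices in non-overlapping 10-minute windows over a `bid` stream.
SELECT window_start, window_end, SUM(price) AS total_price
FROM TABLE(
  TUMBLE(TABLE bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
```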
Pattern Detections
SELECT *
FROM Ticker
    MATCH_RECOGNIZE (
        PARTITION BY symbol
        ORDER BY rowtime
        MEASURES
            START_ROW.rowtime AS start_tstamp,
            LAST(PRICE_DOWN.rowtime) AS bottom_tstamp,
            LAST(PRICE_UP.rowtime) AS end_tstamp
        ONE ROW PER MATCH
        AFTER MATCH SKIP TO LAST PRICE_UP
        PATTERN (START_ROW PRICE_DOWN+ PRICE_UP)
        DEFINE
            PRICE_DOWN AS
                (LAST(PRICE_DOWN.price, 1) IS NULL
                 AND PRICE_DOWN.price < START_ROW.price) OR
                PRICE_DOWN.price < LAST(PRICE_DOWN.price, 1),
            PRICE_UP AS
                PRICE_UP.price > LAST(PRICE_DOWN.price, 1)
    ) MR;
Analyze real-time
data streams to
generate
important
business insights
Get up-to-date results to power
dashboards or applications requiring
continuous updates using:
● Materialized views
● Temporal analytic functions
● Interactive queries
[Example: a stream of transactions over time (Account A +$10, Account B +$12, Account C +$5, Account B -$10, Account C +$10, Account A -$5, Account A +$10) is continuously aggregated into an up-to-date balance table: A $15, B $2, C $15.]
REAL-TIME ANALYTICS
SELECT *
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS row_num
    FROM ShopSales)
WHERE row_num <= 5
Why Flink?
Flink’s powerful runtime offers limitless scalability
[Diagram: a client submits a job to the Job Manager, which deploys, stops, and cancels tasks and triggers checkpoints across many task slots; data streams flow through the tasks and results return to the client.]
Applications are parallelized into possibly thousands of tasks that are distributed and concurrently executed in a cluster
Leverage in-memory performance
[Diagram: tasks read input, apply logic against local state (in-memory or on-disk), and emit output; periodic, asynchronous, incremental snapshots are written to durable storage.]
Stateful Flink applications are optimized for fast access to local state by maintaining task state in in-memory or on-disk data structures, resulting in low-latency processing.
Flink checkpoints and savepoints enable fault tolerance and stateful processing
CHECKPOINTS: automatic snapshots created by Flink periodically
● Used to recover from failures
● Optimized for quick recovery
● Automatically created and managed by Flink
SAVEPOINTS: user-triggered snapshots at a specific point in time
● Enable manual operational tasks, such as upgrades
● Optimized for operational flexibility
● Created and managed by the user
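For self-managed Flink, checkpointing cadence is a standard configuration setting; a sketch as it would appear in the Flink SQL client (the values shown are illustrative, and Confluent Cloud manages this for you):

```sql
-- Enable periodic checkpoints every 10 seconds with exactly-once semantics.
SET 'execution.checkpointing.interval' = '10 s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
```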
Flink recovers from failures in a timely and efficient
manner
[Diagram: the same Job Manager / task-slot cluster as above, with one task slot marked as failed.]
If a task manager fails, the job manager will detect the failure and arrange for the job to be restarted from the most recent state snapshot
Flink offers layered APIs at different levels of abstraction to handle both common and specialized use cases
[Diagram: the API stack by level of abstraction and how the code is organized. Flink SQL and the Table API sit on an optimizer/planner; the DataStream API and ProcessFunction sit on the low-level stream operator API of the Apache Flink runtime.]
Flink SQL
High-level, declarative API that allows you to write SQL
queries to process data streams and batch data as dynamic
tables
Table API
Programmatic equivalent of Flink SQL, allowing you to
define your business logic in either Java or Python, or
combine it with SQL
DataStream API
Low-level, expressive API that exposes the building blocks
for stream processing, giving you direct access to things like
state and timers
ProcessFunction
The lowest-level API, allowing fine-grained processing of individual elements for complex event-driven processing logic and state management
Flink SQL is an ANSI-compliant SQL
engine that can define both simple
and complex queries, making it
well-suited for most stream processing
use cases, particularly building
real-time data products and pipelines.
[Diagram: an events stream is filtered (WHERE color <> orange), grouped (GROUP BY color), and counted (COUNT), continuously emitting updated per-color results such as 4 and 3.]
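The diagram above corresponds to a continuous query along these lines (a sketch; the `events` table and `color` column are illustrative):

```sql
-- Continuously count events per color, excluding orange.
SELECT color, COUNT(*) AS cnt
FROM events
WHERE color <> 'orange'
GROUP BY color;
```

As new events arrive, the per-color counts are updated in place rather than recomputed from scratch.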
Streaming made
simple.
INSERT INTO enriched_reviews
SELECT id,
       review,
       invoke_openai(prompt, review) AS score
FROM product_reviews;
[Before/after: product reviews from Kate (4 hours ago: “This was the worst decision ever.”), Nikola (1 day ago: “Not bad. Could have been cheaper.”), and Brian (3 days ago: “Amazing! Game Changer!”) are enriched with star ratings.]
The Prompt
“Score the following text on a scale of 1 to 5, where 1 is negative and 5 is positive, returning only the number”
DATA STREAMING PLATFORM
Enrich real-time data streams with Generative AI
directly from Flink SQL
COMING SOON
Flink supports unified stream and batch processing
Streaming:
● The entire pipeline must always be running
● Input must be processed as it arrives
● Results are reported as they become ready
● Failure recovery resumes from a recent snapshot
● Flink guarantees effectively exactly-once results despite out-of-order data and restarts due to failures, etc.
Batch:
● Execution proceeds in stages, running as needed
● Input may be pre-sorted by time and key
● Results are reported at the end of the job
● Failure recovery does a reset and full restart
● Effectively exactly-once guarantees are more straightforward
Flink SQL operators work across both stream and batch processing modes
STREAMING AND BATCH
• SELECT FROM [WHERE]
• GROUP BY [HAVING] (includes time-based windowing)
• OVER aggregations (including Top-N and deduplication queries)
• INNER + OUTER JOINs
• MATCH_RECOGNIZE (pattern matching)
• Set operations
• User-defined functions
• Statement sets
STREAMING ONLY
• ORDER BY time (ascending only)
• INNER JOIN with a temporal (versioned) table or an external lookup table
BATCH ONLY
• ORDER BY anything
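Switching a Flink SQL session between the two execution modes is a one-line setting (standard Flink configuration keys, shown as they would appear in the SQL client):

```sql
-- Run queries over bounded input as a batch job…
SET 'execution.runtime-mode' = 'batch';
-- …or over unbounded input as a streaming job (the default).
SET 'execution.runtime-mode' = 'streaming';
```

The same query text works in both modes; only the execution strategy and the timing of results change.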
Enhancing Apache Flink as a
cloud-native service
However, operating Flink on your own (along with Kafka) is difficult
Deployment Complexity: Setting up Flink requires a deep understanding of resource allocation and management
Management & Monitoring: Identifying relevant metrics can be overwhelming for DevOps teams
Incomplete Ecosystem: OSS Flink lacks pre-built integrations with observability, metadata management, data governance, and security tooling
Cost & Risk: Self-supporting Flink incurs significant costs & resources in terms of infra footprint and Dev & Ops FTEs
Go from zero to production in minutes versus months
Open Source Apache Flink (months): in-house development and maintenance without support
Cloud-hosted Flink services (weeks): manual Day 2 operations with basic tooling and/or support
Apache Flink on Confluent Cloud (minutes): fully managed, elastic, and automated product capabilities with zero overhead
Real-time processing
Power low-latency applications and pipelines that react to
real-time events and provide timely insights
Data reusability
Share consistent and reusable data streams widely with
downstream applications and systems
Data enrichment
Curate, filter, and augment data on-the-fly with additional
context to improve completeness, accuracy, & compliance
Efficiency
Improve resource utilization and cost-effectiveness by
avoiding redundant processing across silos
Effortlessly filter, join, and enrich your data streams with Apache Flink
“With Confluent’s fully managed Flink offering, we can access, aggregate, and enrich data from IoT sensors, smart
cameras, and Wi-Fi analytics, to swiftly take action on potential threats in real time, such as intrusion detection. This
enables us to process sensor data as soon as the events occur, allowing for faster detection and response to security
incidents without any added operational burden.”
"When used in combination, Apache Flink & Apache Kafka can enable data reusability and avoid redundant downstream
processing. The delivery of Flink & Kafka as fully managed services delivers stream processing without the complexities of
infrastructure management, enabling teams to focus on building real-time streaming applications & pipelines that
differentiate the business."
Enterprise-grade security
Secure stream processing with built-in identity and access
management, RBAC, and audit logs
Stream governance
Enforce data policies and avoid metadata duplication
leveraging native integration with Stream Governance
Monitoring
Ensure the health and uptime of your Flink queries in the
Confluent UI or via 3rd party monitoring services
Connectors
Stream data into and out of Flink using Confluent's portfolio of fully managed connectors
Experience Kafka and Flink seamlessly integrated as a unified platform
Fully managed
Easily develop Flink applications with a serverless, SaaS-based experience, instantly available & without ops burden
Elastic scalability
Automatically scale up or down to meet the demands of the
most complex workloads without overprovisioning
Usage-based billing
Pay only for resources used instead of infrastructure
provisioned, with scale-to-zero pricing
Continuous, no touch updates
Build using an always up-to-date platform with declarative,
versionless APIs and interfaces
Enable high-performance and efficient stream processing at any scale
[Chart: throughput over time, with provisioned capacity tracking demand.]
Different teams with different skills and needs can access stream processing using the interface of their choice:
● SQL client in the Confluent Cloud CLI
● Rich SQL editing user interface
Tap into a next-generation, serverless SQL experience…
1. Select region(s) to create a compute pool
2. Role bindings automatically created for you
3. Start processing in Flink
…automatically provisioned and instantly available
Santander Stream Processing with Apache Flink
Jax, FL Admin Community Group 05.14.2024 Combined Deck
 
IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024
 
Microsoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdfMicrosoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdf
 
Salesforce Introduced Zero Copy Partner Network to Simplify the Process of In...
Salesforce Introduced Zero Copy Partner Network to Simplify the Process of In...Salesforce Introduced Zero Copy Partner Network to Simplify the Process of In...
Salesforce Introduced Zero Copy Partner Network to Simplify the Process of In...
 
Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024
 
A Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdfA Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdf
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
 
Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024
Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024
Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024
 
The Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion ProductionThe Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion Production
 
Workforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdfWorkforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdf
 
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCAOpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
 
Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024
 
The Strategic Impact of Buying vs Building in Test Automation
The Strategic Impact of Buying vs Building in Test AutomationThe Strategic Impact of Buying vs Building in Test Automation
The Strategic Impact of Buying vs Building in Test Automation
 
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024
 
architecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdfarchitecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdf
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
 
Malaysia E-Invoice digital signature docpptx
Malaysia E-Invoice digital signature docpptxMalaysia E-Invoice digital signature docpptx
Malaysia E-Invoice digital signature docpptx
 

Santander Stream Processing with Apache Flink

  • 1. Stream Processing with Apache Flink® Flink Overview 101 +
  • 2. Agenda: 01 Intro: the world has changed... some news · 02 Understanding the importance of stream processing · 03 Why Apache Flink is becoming the de facto standard · 04 Enhancing Apache Flink as a cloud-native service · 05 Questions
  • 4. Some things you should know about us: https://discover.confluent.io/santander-and-confluent-better-together (our landing page). And there's more! Coming soon: Gen AI & Confluent webinar, 7 May 2024. Sign up now!
  • 5. A quick update from Kafka Summit London
  • 6. Apache Flink on Confluent Cloud is now Generally Available: we announced the general availability of our fully managed Confluent Cloud service for Apache Flink®, giving customers a true multicloud solution for deploying event stream processing on any of the three major CSPs where their data and applications reside. See the blog post for details. Introducing Tableflow: we presented our vision for Tableflow, a new capability of the Confluent Cloud event streaming platform that lets users turn Apache Kafka topics and their associated schemas into Apache Iceberg® tables with a single click, to better feed data lakes and data warehouses. Tableflow will soon be available in private early access. Kora, the Confluent Cloud engine, is faster than ever: we revealed that Kora, the engine powering Confluent Cloud's cloud-native Kafka service, is now 16x faster than open-source Kafka. Connector improvements: new enhancements to Confluent's fully managed connector portfolio were presented, including DNS forwarding and private egress access points, along with an improved 99.99% SLA. Stream Governance improvements: the Stream Governance service in Confluent keeps evolving and is now enabled by default in all environments, with a 99.99% uptime SLA for Schema Registry. We also announced that regional coverage for Stream Governance will expand to all Confluent Cloud regions, providing smoother access to its exclusive governance features.
  • 7. Why? And why not unify both worlds?
  • 8. Understanding the importance of stream processing
  • 9. Enable frictionless access to up-to-date trustworthy data products Share Reimagine data streaming everywhere, on-prem and in every major public cloud Stream Make data in motion self-service, secure, compliant and trustworthy Govern Drive greater data reuse with always-on stream processing Process Make it easy to on-ramp and off-ramp data from existing systems and apps Connect Stream processing is a critical part of data streaming
  • 10. DATA IN MOTION Streaming Applications Apache Flink Apache Kafka DATA AT REST Application Layer Compute Layer Storage Layer Traditional Databases File Systems Web Applications Stream processing acts as the compute layer to Kafka, powering real-time applications & pipelines
  • 11. Processing Kafka Custom apps 3rd party apps Databases Database Data Warehouse SaaS app Queries Analytics Interactions Processing Processing Processing downstream of Kafka increases latency, adds costs and redundancy, and inhibits data reuse. Increased complexity from redundant processing. Data systems & applications built on stale data. Expensive & inefficient to clean and enrich data multiple times.
  • 12. Custom apps 3rd party apps Databases Database Data Warehouse SaaS app Queries Analytics Interactions Processing data at ingest improves latency, data portability, and cost effectiveness Maximized data reusability & consistency Improved cost-efficiency from cleaning & enriching data once Real-time apps & data systems reflect current state Kafka Storage Flink Compute Stream Processing Process your data once, process your data right
  • 13. Heatmap service Payment service Supply chain systems Watch lists Profile mgmt Incident mgmt Customer profile data ITSM systems Central log systems Fraud & SIEM systems Alerting systems AI/ML engines Visualization apps Threat vector Transactions Payments Mainframe data Inventory Weather Telemetry IoT data Notification engine Payroll systems CRM systems Mobile application Personalization Web application Clickstreams Customer loyalty Change logs Customer data Recommendation engine Stream processing enables users to filter, join, and enrich streams on-the-fly to drive greater data reuse
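The filter/join/enrich idea above can be sketched outside Flink with plain Python generators. This is an illustrative toy, not the Flink DataStream API; all names (`filter_stream`, `enrich_stream`, the sample events and profiles) are invented for the example:

```python
# Toy sketch of on-the-fly filter + enrichment (not the Flink API).
def filter_stream(events, predicate):
    """Keep only events matching the predicate (e.g. drop probe traffic)."""
    for e in events:
        if predicate(e):
            yield e

def enrich_stream(events, profiles):
    """Join each event with reference data (here, a customer profile map)."""
    for e in events:
        yield {**e, **profiles.get(e["customer_id"], {})}

clicks = [
    {"customer_id": 1, "page": "/checkout"},
    {"customer_id": 2, "page": "/health"},   # internal probe, filtered out
    {"customer_id": 1, "page": "/orders"},
]
profiles = {1: {"tier": "gold"}, 2: {"tier": "silver"}}

enriched = list(enrich_stream(
    filter_stream(clicks, lambda e: not e["page"].startswith("/health")),
    profiles,
))
for event in enriched:
    print(event)
```

The point of the sketch: cleaning and enrichment happen once, as events flow, and every downstream consumer sees the already-enriched stream.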
  • 14. Why Apache Flink is becoming the de facto standard
  • 15. 0 50,000 100,000 150,000 2020 2021 2022 2016 2017 2018 Flink Kafka Two Apache Projects, Born a Few Years Apart Monthly Unique Users Flink growth has mirrored the growth of Kafka, the de facto standard for streaming data >75% of the Fortune 500 estimated to be using Kafka >100,000+ orgs using Kafka >41,000 Kafka meetup attendees >750 Kafka Improvement Proposals >12,000 Jiras for Apache Kafka
  • 16. Innovative companies have adopted both Kafka & Flink
  • 17. Digital natives leverage Flink to disrupt markets and gain competitive advantage UBER: Real-time Pricing NETFLIX: Personalized Recs STRIPE: Real-time Fraud Detection
  • 18. Let's have some fun!!! Who is who? 463M 30B Gaming ● WHY ● SQL drives 90% of use cases with less than 50% of the time to market ● SQL makes stream analytics accessible to all data scientists, enabling insights connected to revenue ● Keep and play with states to deep dive on business questions ● E.g.: revenue per level ● Real-time game user counts ● Extremely auto-scalable ● Low latency ● Temporal event control ● Wide variety of window handling ● Fault tolerance via checkpoints & savepoints
  • 19. Flink is a top 5 Apache project and boasts a robust developer community. Scalability and Performance: Flink is capable of supporting stream processing workloads at tremendous scale. Fault Tolerance: Flink's fault tolerance mechanisms ensure it can handle failures effectively and provide high availability. Language Flexibility: Flink supports Java, Python, & SQL with 150+ built-in functions, enabling devs to work in their language of choice. Unified Processing: Flink supports stream processing, batch processing, and ad-hoc analytics through one technology. So, basically: customers choose Flink because of its performance and rich feature set.
  • 21. Why are we even doing this?
  • 22. “Shift Left” with Stream Processing. Expensive & Inefficient: the same data is cleaned, transformed, and enriched in multiple locations (often using legacy technologies). Varying Degrees of Staleness: every downstream system receives and processes the same data at different times and intervals, with slightly different semantics, leading to inconsistencies and a degraded customer experience downstream. Complex & Error Prone: the same business logic needs to be maintained in multiple places, and multiple processing engines need to be maintained and operated. Databases Custom Apps SaaS Processing Processing Database DWH Data Lake Queries Analytics Interactions
  • 23. “Shift Left” with Stream Processing. Expensive & Inefficient: the same data is cleaned, transformed, and enriched in multiple locations (often using legacy technologies). Varying Degrees of Staleness: every downstream system receives and processes the same data at different times and intervals, with slightly different semantics, leading to inconsistencies and a degraded customer experience downstream. Complex & Error Prone: the same business logic needs to be maintained in multiple places, and multiple processing engines need to be maintained and operated. Databases Custom Apps SaaS Processing Processing Database DWH Data Lake Queries Analytics Interactions Shift Left Processing
  • 24. “Shift Left” with Stream Processing. Cost Efficient: data is processed only once, in a single place, and it is processed continuously, spreading the work over time. Fresh Data Everywhere: all applications are supplied with equally fresh data and reflect the current state of your business. Reusable & Consistent: data is processed continuously and meets the latency requirements of the most demanding consumers, which further increases reusability. Databases Custom Apps SaaS Database DWH Data Lake Queries Analytics Interactions Processing
  • 25. What can we achieve?
  • 26. What can I do with Flink? 26 Data Exploration Data Pipelines Real-time Apps Engineers and Analysts both need to be able to simply read and understand the event streams stored in Kafka ● Metadata discovery ● Throughput analysis ● Data sampling ● Interactive query Data pipelines are used to enrich, curate, and transform events streams, creating new derived event streams ● Filtering ● Joins ● Projections ● Aggregations ● Flattening ● Enrichment Whole ecosystems of apps feed on event streams automating action in real-time ● Account360 ● Next Best Call ● Quality of Service ● Fraud detection ● Intelligent routing ● Abandoned Shopping Cart
  • 27. Build streaming data pipelines to inform real-time decision making Create new enriched and curated streams of higher value using: ● Data transformations ● Streaming joins, temporal joins, lookup joins, and versioned joins ● Fan out queries, multi-cluster queries 27 t1, 21.5 USD t3, 55 EUR t5, 35.3 EUR t0, EUR:USD=1.00 t2, EUR:USD=1.05 t4: EUR:USD=1.10 t1, 21.5 USD t3, 57.75 USD t5, 38.83 USD Currency rate Orders STREAMING DATA PIPELINES
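The currency example on this slide (each order converted using the exchange rate in effect at the order's timestamp) can be sketched as a toy temporal join in plain Python. This is illustrative only, not Flink's versioned-join implementation:

```python
# Toy temporal join: convert each order using the exchange rate that was
# valid at the order's event time (mirrors the slide's EUR/USD example).
def temporal_join(orders, rates):
    """orders: (ts, amount, currency); rates: (ts, eur_usd), sorted by ts."""
    out = []
    for ts, amount, currency in orders:
        if currency == "USD":
            out.append((ts, round(amount, 2)))
            continue
        # pick the latest rate whose timestamp is <= the order's timestamp
        rate = max((r for r in rates if r[0] <= ts), key=lambda r: r[0])[1]
        out.append((ts, round(amount * rate, 2)))
    return out

rates = [(0, 1.00), (2, 1.05), (4, 1.10)]
orders = [(1, 21.5, "USD"), (3, 55.0, "EUR"), (5, 35.3, "EUR")]
converted = temporal_join(orders, rates)
print(converted)  # [(1, 21.5), (3, 57.75), (5, 38.83)], as on the slide
```

In Flink SQL this corresponds to a `FOR SYSTEM_TIME AS OF` join against a versioned table; the sketch just makes the "rate valid at the order's time" semantics concrete.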
  • 28. Recognize patterns and react to events in a timely manner C price>lag(price) D price<lag(price) C price>lag(price) B price<lag(price) A Double Bottom Period & Volume Price Develop applications using fine-grained control over how time progresses and data is grouped together using: ● Hopping, tumbling, session windows ● OVER aggregations ● Pattern matching with MATCH_RECOGNIZE Pattern Detections SELECT * FROM Ticker MATCH_RECOGNIZE ( PARTITION BY symbol ORDER BY rowtime MEASURES START_ROW.rowtime AS start_tstamp, LAST(PRICE_DOWN.rowtime) AS bottom_tstamp, LAST(PRICE_UP.rowtime) AS end_tstamp ONE ROW PER MATCH AFTER MATCH SKIP TO LAST PRICE_UP PATTERN (START_ROW PRICE_DOWN+ PRICE_UP) DEFINE PRICE_DOWN AS (LAST(PRICE_DOWN.price, 1) IS NULL AND PRICE_DOWN.price < START_ROW.price) OR PRICE_DOWN.price < LAST(PRICE_DOWN.price, 1), PRICE_UP AS PRICE_UP.price > LAST(PRICE_DOWN.price, 1) ) MR;
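As a rough illustration of what the MATCH_RECOGNIZE query above detects (a start row, one or more falling prices, then a rise), here is a toy Python scan for the same shape. It is a simplification for intuition, not a faithful re-implementation of the SQL pattern engine:

```python
# Toy pattern matcher: START_ROW, PRICE_DOWN+ (one or more falls), PRICE_UP.
def find_v_patterns(prices):
    """Return (start, bottom, end) index triples for down-then-up runs."""
    matches, i = [], 0
    while i < len(prices) - 2:
        j = i + 1
        while j < len(prices) and prices[j] < prices[j - 1]:
            j += 1                         # PRICE_DOWN+ : consume falling prices
        if j > i + 1 and j < len(prices) and prices[j] > prices[j - 1]:
            matches.append((i, j - 1, j))  # start row, bottom, first rise
            i = j                          # AFTER MATCH SKIP TO LAST PRICE_UP
        else:
            i += 1
    return matches

ticker = [12, 11, 10, 11, 12, 11, 13]
matches = find_v_patterns(ticker)
print(matches)  # [(0, 2, 3), (4, 5, 6)]: two V-shaped dips
```

A "double bottom" as on the slide would then be two such V patterns in succession, which is exactly what this ticker produces.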
  • 29. Analyze real-time data streams to generate important business insights Get up-to-date results to power dashboards or applications requiring continuous updates using: ● Materialized views ● Temporal analytic functions ● Interactive queries Account Balance A $15 B $2 C $15 Account A, +$10 Account B, +$12 Account C, +$5 Account B, -$10 Account C, +$10 Account A, -$5 Account A, +$10 Time REAL-TIME ANALYTICS SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS row_num FROM ShopSales) WHERE row_num <= 5
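The account-balance materialized view on this slide can be sketched in a few lines of plain Python: each change event incrementally updates the view rather than recomputing it from scratch, which is the essence of a continuously updated result. A toy model, not Flink's implementation:

```python
# Toy materialized view: a running balance updated per change event.
from collections import defaultdict

def materialize_balances(changes):
    balances = defaultdict(int)
    for account, delta in changes:
        balances[account] += delta    # incremental update, no full recompute
    return dict(balances)

changes = [("A", 10), ("B", 12), ("C", 5), ("B", -10),
           ("C", 10), ("A", -5), ("A", 10)]
balances = materialize_balances(changes)
print(balances)  # {'A': 15, 'B': 2, 'C': 15}, matching the slide
```

A dashboard reading this view always sees the current state, without waiting for a periodic batch job.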
  • 31. Flink is a top 5 Apache project and boasts a robust developer community. Scalability and Performance: Flink is capable of supporting stream processing workloads at tremendous scale. Fault Tolerance: Flink's fault tolerance mechanisms ensure it can handle failures effectively and provide high availability. Language Flexibility: Flink supports Java, Python, & SQL with 150+ built-in functions, enabling devs to work in their language of choice. Unified Processing: Flink supports stream processing, batch processing, and ad-hoc analytics through one technology. Developers choose Flink because of its performance and rich feature set.
  • 32. Flink’s powerful runtime offers limitless scalability Job Manager Client . . . . . . Task Slot . . . . . . Task Slot . . . . . . Task Slot . . . . . . Task Slot Data Streams Deploy, Stop, Cancel Tasks Trigger Checkpoints Submit Job Results Applications are parallelized into possibly thousands of tasks that are distributed and concurrently executed in a cluster
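Key-based partitioning is the mechanism that lets one logical stream be split across many parallel task slots. The sketch below is an invented helper for illustration (integer keys are used so the hash is deterministic), not Flink's actual key-group assignment:

```python
# Toy sketch of keyed partitioning: hash each event's key to pick a task
# slot, so all events for the same key are processed by the same task.
def partition(events, parallelism):
    slots = [[] for _ in range(parallelism)]
    for event in events:
        slots[hash(event["key"]) % parallelism].append(event)
    return slots

events = [{"key": k, "v": v} for k, v in [(1, "a"), (2, "b"), (3, "c"), (1, "d")]]
slots = partition(events, 2)
print([len(s) for s in slots])
```

Because routing depends only on the key, keyed state for a given key lives in exactly one task, which is what makes per-key aggregations scale horizontally.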
  • 33. Leverage in-memory performance . . . Durable Storage Logic State Logic State Logic State Input Tasks Output In-Memory or On-Disk State Local State Access Periodic, Asynchronous, Incremental Snapshots Stateful Flink applications are optimized for fast access to local state by maintaining task state in memory or on-disk data structures, resulting in low latency processing.
  • 34. Flink is a top 5 Apache project and boasts a robust developer community. Scalability and Performance: Flink is capable of supporting stream processing workloads at tremendous scale. Fault Tolerance: Flink's fault tolerance mechanisms ensure it can handle failures effectively and provide high availability. Language Flexibility: Flink supports Java, Python, & SQL with 150+ built-in functions, enabling devs to work in their language of choice. Unified Processing: Flink supports stream processing, batch processing, and ad-hoc analytics through one technology. Developers choose Flink because of its performance and rich feature set.
  • 35. Flink checkpoints and savepoints enable fault tolerance and stateful processing CHECKPOINTS SAVEPOINTS Automatic snapshot created by Flink periodically ● Used to recover from failures ● Optimized for quick recovery ● Automatically created and managed by Flink User-triggered snapshot at a specific point in time ● Enables manual operational tasks, such as upgrades ● Optimized for operational flexibility ● Created and managed by the user
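A minimal sketch of the checkpoint idea: a stateful operator periodically snapshots its state, and after a failure it restores from the last snapshot (in real Flink, events that arrived after the checkpoint are then replayed from Kafka to catch up). The class and method names here are invented for illustration:

```python
# Toy checkpointing: snapshot keyed state, crash, restore from snapshot.
import copy

class CountingOperator:
    def __init__(self):
        self.state = {}       # in-memory keyed state
        self.snapshot = {}    # last completed checkpoint

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1

    def checkpoint(self):
        self.snapshot = copy.deepcopy(self.state)

    def recover(self):
        self.state = copy.deepcopy(self.snapshot)

op = CountingOperator()
for key in ["a", "b", "a"]:
    op.process(key)
op.checkpoint()               # snapshot now holds {'a': 2, 'b': 1}
op.process("a")               # progress since the checkpoint...
op.state.clear()              # ...lost in a simulated failure
op.recover()                  # restart from the last checkpoint
print(op.state)
```

A savepoint works the same way mechanically; the difference on the slide is who triggers it (the user) and what it is used for (upgrades and other planned operations).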
  • 36. Flink recovers from failures in a timely and efficient manner Job Manager Client . . . . . . Task Slot . . . . . . Task Slot . . . . . . Task Slot . . . . . . Task Slot Data Streams Deploy, Stop, Cancel Tasks Trigger Checkpoints Submit Job Results If a task manager fails, the job manager will detect the failure and arrange for the job to be restarted from the most recent state snapshot
  • 37. Flink is a top 5 Apache project and boasts a robust developer community. Scalability and Performance: Flink is capable of supporting stream processing workloads at tremendous scale. Fault Tolerance: Flink's fault tolerance mechanisms ensure it can handle failures effectively and provide high availability. Language Flexibility: Flink supports Java, Python, & SQL with 150+ built-in functions, enabling devs to work in their language of choice. Unified Processing: Flink supports stream processing, batch processing, and ad-hoc analytics through one technology. Developers choose Flink because of its performance and rich feature set.
  • 38. Flink offers layered APIs at different levels of abstraction to handle both common and specialized use cases Flink SQL Table API DataStream API ProcessFunction Apache Flink Runtime Low-level Stream Operator API DataStream API ProcessFunction Table / SQL API Optimizer / Planner Level of Abstraction How the Code is Organized. Flink SQL: high-level, declarative API that allows you to write SQL queries to process data streams and batch data as dynamic tables. Table API: programmatic equivalent of Flink SQL, allowing you to define your business logic in either Java or Python, or combine it with SQL. DataStream API: low-level, expressive API that exposes the building blocks for stream processing, giving you direct access to things like state and timers. ProcessFunction: the lowest-level API, allowing fine-grained processing of individual elements for complex event-driven processing logic and state management.
  • 39. Flink SQL is an ANSI-compliant SQL engine that can define both simple and complex queries, making it well-suited for most stream processing use cases, particularly building real-time data products and pipelines. GROUP BY color events results COUNT WHERE color <> orange 4 3 Streaming made simple.
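The picture on this slide (COUNT per color, filtered with WHERE color <> orange, grouped by color) can be sketched as an incremental computation in plain Python: each arriving event updates its group's count and emits a new result, which is how a streaming GROUP BY behaves. Illustrative only:

```python
# Toy streaming GROUP BY: COUNT(*) per color WHERE color <> 'orange'.
from collections import Counter

def streaming_count(events, excluded="orange"):
    counts = Counter()
    updates = []                   # the changelog a downstream consumer sees
    for color in events:
        if color == excluded:      # WHERE color <> 'orange'
            continue
        counts[color] += 1
        updates.append((color, counts[color]))
    return counts, updates

counts, updates = streaming_count(
    ["blue", "orange", "green", "blue", "green", "blue", "green", "blue"])
print(dict(counts))  # {'blue': 4, 'green': 3}
```

Unlike a batch query, the result is never "done": every event appends a retraction/update pair to the changelog, and `counts` always reflects the stream so far.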
  • 40. INSERT INTO enriched_reviews SELECT id , review , invoke_openai(prompt,review) as score FROM product_reviews ; K N B Kate 4 hours ago This was the worst decision ever. Nikola 1 day ago Not bad. Could have been cheaper. Brian 3 days ago Amazing! Game Changer! K N B Kate ★★★★★ 4 hours ago This was the worst decision ever. Nikola ★★★★★ 1 day ago Not bad. Could have been cheaper. Brian ★★★★★ 3 days ago Amazing! Game Changer! The Prompt “Score the following text on a scale of 1 and 5 where 1 is negative and 5 is positive returning only the number” DATA STREAMING PLATFORM Enrich real-time data streams with Generative AI directly from Flink SQL COMING SOON
  • 41. Flink is a top 5 Apache project and boasts a robust developer community. Scalability and Performance: Flink is capable of supporting stream processing workloads at tremendous scale. Fault Tolerance: Flink's fault tolerance mechanisms ensure it can handle failures effectively and provide high availability. Language Flexibility: Flink supports Java, Python, & SQL with 150+ built-in functions, enabling devs to work in their language of choice. Unified Processing: Flink supports stream processing, batch processing, and ad-hoc analytics through one technology. Developers choose Flink because of its performance and rich feature set.
  • 42. Flink supports unified stream and batch processing. Streaming: ● Entire pipeline must always be running ● Input must be processed as it arrives ● Results are reported as they become ready ● Failure recovery resumes from a recent snapshot ● Flink guarantees effectively exactly-once results despite out-of-order data and restarts due to failures, etc. Batch: ● Execution proceeds in stages, running as needed ● Input may be pre-sorted by time and key ● Results are reported at the end of the job ● Failure recovery does a reset and full restart ● Effectively exactly-once guarantees are more straightforward
  • 43. Flink SQL operators work across both stream and batch processing modes. STREAMING AND BATCH: • SELECT FROM [WHERE] • GROUP BY [HAVING] (includes time-based windowing) • OVER aggregations (including Top-N and Deduplication queries) • INNER + OUTER JOINs • MATCH_RECOGNIZE (pattern matching) • Set Operations • User-Defined Functions • Statement Sets. STREAMING ONLY: • ORDER BY time ascending only • INNER JOIN with Temporal (versioned) table / External lookup table. BATCH ONLY: • ORDER BY anything
  • 44. Enhancing Apache Flink as a cloud-native service
  • 45. Deployment Complexity Setting up Flink requires a deep understanding of resource allocation and management Management & Monitoring Identifying relevant metrics can be overwhelming for DevOps teams Incomplete Ecosystem OSS Flink lacks pre-built integrations with observability, metadata management, data governance, and security tooling Cost & Risk Self-supporting Flink incurs significant costs & resources in terms of infra footprint and Dev & Ops FTEs However, operating Flink on your own (along with Kafka) is difficult
  • 46. Months Minutes Weeks Open Source Apache Flink In-house development and maintenance without support Cloud-hosted Flink services Manual Day 2 operations with basic tooling and/or support Apache Flink on Confluent Cloud Fully managed, elastic, and automated product capabilities with zero overhead Go from zero to production in minutes versus months
  • 47. Real-time processing Power low-latency applications and pipelines that react to real-time events and provide timely insights Data reusability Share consistent and reusable data streams widely with downstream applications and systems Data enrichment Curate, filter, and augment data on-the-fly with additional context to improve completeness, accuracy, & compliance Efficiency Improve resource utilization and cost-effectiveness by avoiding redundant processing across silos Effortlessly filter, join, and enrich your data streams with Apache Flink “With Confluent’s fully managed Flink offering, we can access, aggregate, and enrich data from IoT sensors, smart cameras, and Wi-Fi analytics, to swiftly take action on potential threats in real time, such as intrusion detection. This enables us to process sensor data as soon as the events occur, allowing for faster detection and response to security incidents without any added operational burden.”
  • 48. "When used in combination, Apache Flink & Apache Kafka can enable data reusability and avoid redundant downstream processing. The delivery of Flink & Kafka as fully managed services delivers stream processing without the complexities of infrastructure management, enabling teams to focus on building real-time streaming applications & pipelines that differentiate the business." Enterprise-grade security: Secure stream processing with built-in identity and access management, RBAC, and audit logs. Stream governance: Enforce data policies and avoid metadata duplication leveraging native integration with Stream Governance. Monitoring: Ensure the health and uptime of your Flink queries in the Confluent UI or via 3rd party monitoring services. Connectors: Move data in and out of your streams with Confluent's portfolio of fully managed connectors. Experience Kafka and Flink seamlessly integrated as a unified platform Monitoring Connectors Enterprise-grade Security Stream Governance
  • 49. Fully managed Easily develop Flink applications with a serverless, SaaS- based experience instantly available & without ops burden Elastic scalability Automatically scale up or down to meet the demands of the most complex workloads without overprovisioning Usage-based billing Pay only for resources used instead of infrastructure provisioned, with scale-to-zero pricing Continuous, no touch updates Build using an always up-to-date platform with declarative, versionless APIs and interfaces Enable high-performance and efficient stream processing at any scale Throughput Over Time Capacity Demand "When used in combination, Apache Flink & Apache Kafka can enable data reusability and avoid redundant downstream processing. The delivery of Flink & Kafka as fully managed services delivers stream processing without the complexities of infrastructure management, enabling teams to focus on building real-time streaming applications & pipelines that differentiate the business."
  • 50. SQL client in Confluent Cloud CLI Different teams with different skills and needs can access stream processing using the interface of their choice Rich SQL editing user interface Tap into a next-generation, serverless SQL experience …
  • 51. Select region(s) to create a compute pool 1 Role bindings automatically created for you 2 Start processing in Flink 3 …automatically provisioned and instantly available
  • 52. Flink