Stream Processing with Apache Flink in the Cloud
and Stream Sharing
Kai Waehner
Field CTO, Confluent
Data Streaming is part of our everyday lives
Trending Shows >
Recommendations >
Popular TV >
……
Personalization
Popularity score
Pattern detection
Categorization
Curated features & virality
Streaming Data
Pipelines
Data
Sharing
Real-time
Analytics
Cybersecurity
IoT &
Telematics
ML & AI
Customer
360
Stream Processing
…
Core Kafka
Real-Time Applications
Streaming Apps
and Pipelines
Compute
Storage
Data Streaming with Confluent
Governance
Connectors
Platform Security Networking Observability
Kafka Streams
ksqlDB
Kafka Consumer and Producer
Stream Designer
Stream Processing for EVERYONE
Flexibility Simplicity
Stream Processing with Apache Flink
Serverless Flink as part of Confluent Cloud
• Stream data
• Process streams
Two Apache projects, born a few years apart
Immerok acquisition:
Accelerates our efforts to bring a cloud-native Flink service to our customers
● Also building a cloud-native Flink service
● Employs leading PMC members & committers
for Apache Flink
● Tackling some of the hardest problems in
cloud data infrastructure
Seamlessly process your data everywhere it resides with a Flink service that spans the three major cloud providers
Cloud-Native Complete Everywhere
Our Flink service will employ the same product principles
we’ve followed for Kafka
Deployment flexibility
Integrated platform
Leverage Flink fully integrated with
Confluent’s complete feature set,
enabling developers to build stream
processing applications quickly,
reliably, and securely
+
Serverless experience
Eliminate the operational burden
of managing Flink with a fully
managed, cloud-native service
that is simple, secure, and scalable
Why Stream Processing with Apache Flink?
Stream processing use cases
Data Exploration Data Pipelines Real-time Apps
Engineers and Analysts
both need to be able to
simply read and
understand the event
streams stored in Kafka
● Metadata discovery
● Throughput analysis
● Data sampling
● Interactive query
Data pipelines are used to
enrich, curate, and transform
event streams, creating new
derived event streams
● Filtering
● Joins
● Projections
● Aggregations
● Flattening
● Enrichment
Whole ecosystems of apps feed
on event streams automating
action in real-time
● Threat detection
● Quality of Service
● Fraud detection
● Intelligent routing
● Alerting
Data Exploration
SELECT * FROM input WHERE eventType='A'
Aggregates and Rich Temporal Functions
[Figure: event timeline from 00:00 to 01:20 with events of types A and B in tumbling 20-second windows]
WINDOWS:
SELECT EventType, COUNT(*)
FROM TUMBLE(..., INTERVAL '20' SECONDS)
GROUP BY EventType

TEMPORAL ANALYTICS FUNCTIONS:
COUNT(*) OVER (
  PARTITION BY EventType
  ORDER BY order_time
  RANGE BETWEEN INTERVAL '20' SECONDS PRECEDING AND CURRENT ROW
)
Windowing and temporal analytics functions offer a rich set of constructs for real-time processing and scenarios such as fraud detection.
● Windows (tumbling, hopping, etc.): results are produced at regular intervals
● Temporal analytics functions: results are produced immediately, per event
● Full composability of operators (windows of windows, aggregates of aggregates, etc.)
[Figure: per-event running counts A,1 A,2 A,3 A,4 B,1 B,2 versus per-window counts A,3 A,1 B,2]
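The contrast between the two result styles can be sketched in plain Python (the timestamps, event types, and function names here are illustrative, not Flink APIs):

```python
from collections import defaultdict

def tumbling_counts(events, size):
    """One result per window: count events per type in fixed,
    non-overlapping windows of `size` seconds."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, etype in events:
        windows[ts // size][etype] += 1
    return {w: dict(c) for w, c in windows.items()}

def over_range_counts(events, range_s):
    """One result per event: for each event, count events of the same
    type within the preceding `range_s` seconds, mirroring
    RANGE BETWEEN ... PRECEDING AND CURRENT ROW."""
    ordered = sorted(events)
    out = []
    for i, (ts, etype) in enumerate(ordered):
        n = sum(1 for t2, e2 in ordered[:i + 1]
                if e2 == etype and ts - t2 <= range_s)
        out.append((etype, n))
    return out

# Events matching the slide: four A events, then two B events
events = [(5, "A"), (12, "A"), (18, "A"), (25, "A"), (45, "B"), (50, "B")]
print(tumbling_counts(events, 20))    # {0: {'A': 3}, 1: {'A': 1}, 2: {'B': 2}}
print(over_range_counts(events, 20))  # [('A', 1), ('A', 2), ('A', 3), ('A', 4), ('B', 1), ('B', 2)]
```

The tumbling counts reproduce the per-window results (A,3 A,1 B,2), while the range-bounded over-counts emit an updated result for every incoming event (A,1 A,2 A,3 A,4 B,1 B,2).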
Data Enrichment
Orders            Currency rate          Enriched output
t1, 21.5 USD      t0, EUR:USD=1.01       t1, 21.5 USD
t3, 55 EUR        t2, EUR:USD=1.05       t3, 57.75 USD
t5, 35.3 EUR      t4, EUR:USD=1.10       t5, 38.83 USD
SELECT
  order_id,
  price,
  currency,
  conversion_rate,
  order_time
FROM orders
LEFT JOIN currency_rates FOR SYSTEM_TIME AS OF orders.order_time
ON orders.currency = currency_rates.currency;
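The FOR SYSTEM_TIME AS OF semantics (each order is joined against the exchange rate that was valid at the order's event time) can be sketched in plain Python; the tuple layouts and function name are assumptions for illustration:

```python
from bisect import bisect_right

def temporal_join(orders, rates):
    """For every order, look up the latest rate whose timestamp is
    <= the order's timestamp (an event-time "as of" join).
    Assumes a rate exists at or before every non-USD order."""
    rates = sorted(rates)                 # (time, EUR:USD rate)
    times = [t for t, _ in rates]
    enriched = []
    for ts, amount, currency in orders:
        if currency == "USD":             # already in the target currency
            enriched.append((ts, amount))
            continue
        i = bisect_right(times, ts) - 1   # latest rate at or before ts
        enriched.append((ts, round(amount * rates[i][1], 2)))
    return enriched

# The orders and rates from the slide
orders = [(1, 21.5, "USD"), (3, 55.0, "EUR"), (5, 35.3, "EUR")]
rates = [(0, 1.01), (2, 1.05), (4, 1.10)]
print(temporal_join(orders, rates))  # [(1, 21.5), (3, 57.75), (5, 38.83)]
```

Note that the order at t3 uses the rate from t2 (1.05), not the later rate from t4: the join is versioned by event time, so late-arriving orders still pick up the historically correct rate.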
Complex Event Processing (CEP) - Pattern Detection
[Figure: W-shaped price chart. A marks the start; B and D are downward runs (price<lag(price)); C and E are upward runs (price>lag(price))]
MATCH_RECOGNIZE (
  PARTITION BY stock_ticker
  MEASURES
    FIRST(A.price) AS firstvalue,
    LAST(E.price) AS lastvalue
  PATTERN (A B+ C+ D+ E+)
  DEFINE
    B AS price < LAST(price),
    C AS price > LAST(price),
    D AS price < LAST(price),
    E AS price > LAST(price) AND price > LAST(C.price)
)
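A simplified plain-Python sketch of the same W-shaped detection: it checks only the down/up run structure of A B+ C+ D+ E+ and omits the extra price > LAST(C.price) condition on E (the function name and input format are illustrative):

```python
def matches_w_pattern(prices):
    """Check whether a price series forms a W shape: a start value,
    then a down run, up run, down run, up run (each at least one step)."""
    if len(prices) < 5:                  # need at least one step per run
        return False
    # classify each step as "down" or "up" relative to the previous price
    moves = ["down" if b < a else "up" for a, b in zip(prices, prices[1:])]
    # collapse consecutive identical moves into runs (B+, C+, D+, E+)
    runs = []
    for m in moves:
        if not runs or runs[-1] != m:
            runs.append(m)
    return runs == ["down", "up", "down", "up"]

print(matches_w_pattern([10, 8, 9, 7, 11]))   # True: down, up, down, up
print(matches_w_pattern([10, 11, 9, 12, 8]))  # False: starts with an up move
```

A production CEP engine additionally handles partitioning per ticker, event-time ordering, and emitting measures such as the first and last matched price, which is what MATCH_RECOGNIZE provides out of the box.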
Read once, Write many
Fan-out queries using Flink SQL
INSERT INTO cluster1.topicA
SELECT * FROM input WHERE eventType='A'
INSERT INTO cluster1.topicB
SELECT * FROM input WHERE eventType='B'
…
[Figure: one Input topic fanned out to topicA, topicB, topicC, and topicD]
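The read-once, write-many idea behind these fan-out queries can be sketched in plain Python: a single pass over the input routes each event to every matching output topic (the topic names and predicates here are illustrative):

```python
from collections import defaultdict

def fan_out(input_events, routes):
    """Read the input stream once and append each event to every
    output topic whose predicate it satisfies."""
    topics = defaultdict(list)
    for event in input_events:
        for topic, predicate in routes.items():
            if predicate(event):
                topics[topic].append(event)
    return dict(topics)

events = [{"eventType": "A", "v": 1},
          {"eventType": "B", "v": 2},
          {"eventType": "A", "v": 3}]
routes = {
    "topicA": lambda e: e["eventType"] == "A",
    "topicB": lambda e: e["eventType"] == "B",
}
print(fan_out(events, routes))
```

Reading the source once and writing many outputs avoids re-consuming the input topic per derived stream, which is the point of grouping the INSERT INTO statements into one job.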
Support for multi-clusters
[Figure: Flink processing topics spread across Cluster1, Cluster2, Cluster3, and Cluster4]
Cross-cluster processing: Flink vs. ksqlDB vs. Kafka Streams
Serverless Apache Flink in Confluent Cloud
Product roadmap
We are planning to GA our Flink service in Q4 2023.
1. Early Access (Spring 2023). Feature highlights: new SQL capabilities, metadata integration, cross-cluster queries, Admin UI, CLI
2. Public Preview (Late Summer 2023). Feature highlights: OAuth and RBAC, metered usage, Cloud UX, notebook querying
3. Limited Availability (Late Fall 2023). Feature highlights: 99.99% uptime SLA, private networking, pricing & packaging
4. General Availability (Winter 2023/24). Feature highlights: GA in all three clouds, cluster autoscaling
Integration across Kafka Clusters
Data Exploration
Flink SQL Query against a Kafka Topic (ANSI SQL)
Complex Event Processing (Pattern Matching)
TL;DR - Serverless Flink in Confluent Cloud
Confluent Stream Sharing
Secure, trusted, real-time data sharing (available now: GA)
● Easy collaboration on live data with partners, customers, and vendors
  ✓ One-click sharing
  ✓ Trusted and governed
  ✓ Secure, granular, auditable
● Single, org-wide portal to discover data streams from trusted sources
Confluent Stream Sharing in a Decentralized Data Mesh
Internal and external data sharing in real-time
Faster time to market, better customer experience, and new business models
Generate an AsyncAPI Specification
Confluent Stream Sharing
Who is Stream Sharing for?
● Mainly for developers and architects in companies with medium-to-high Kafka maturity (Phases 3+ of the streaming maturity model) who have a use case for sharing their Kafka topics with external parties (e.g., vendors, partners, customers) or other internal teams (e.g., from different lines of business)
What pain points is it trying to solve?
● Out-of-sync data: most existing solutions dump data from Kafka to a sink in a batch process and then copy it onward to an external destination, which turns real-time data into stale data
● Operational complexities: setting up, maintaining, and scaling these sharing pipelines to meet security and privacy requirements requires complex integration work and is operationally taxing
● Vendor lock-in: most sharing solutions require both the Data Provider and Data Recipient to be on the
same platform, resulting in contractual complexity and vendor lock-in
Confluent Stream Sharing
What are the differences between Stream Sharing and Cluster Linking in their sharing capabilities? And which one should we recommend to customers?
Stream Sharing is our default data-sharing solution
There are three major differences between Cluster Linking and Stream Sharing
● 1) With Stream Sharing, Data Recipients can consume directly from Data Provider’s Kafka cluster without the
need to copy the data, saving cluster infra and provisioning efforts. Cluster Linking requires a destination
cluster ready for byte-by-byte replication
○ Sharing grants recipients access to the shared topic and Schema Registry subjects. 1 topic + n shared
Schema Registry subjects are included in the same share.
● 2) Data Recipients can use any platform to consume from Stream Sharing, whether it’s CC, CP, OSS Kafka,
MSK or Aiven. Cluster Linking requires the Data Recipient to be on CC Dedicated Cluster or CP
● 3) Only an email address is needed for the Data Provider to share via Stream Sharing, whereas Cluster Linking requires both parties' cluster IDs and API credentials across multiple setup and provisioning steps
