1. © 2019 Ververica
Seth Wiesman
Senior Solutions Architect @ Ververica
Committer, Apache Flink
Unified Data Processing with Apache Flink and Apache Pulsar
2. © 2019 Ververica2
About Ververica (the company formerly known as “data Artisans”)
Original Creators of Apache Flink®
Enterprise Stream Processing with Ververica Platform
Subsidiary of Alibaba Group
5. © 2019 Ververica5
Apache Flink at "Singles Day" (11/11/2019)
containers: 2M
data size: 985 PB
throughput: 2.5 B events / sec
latency: sub-second
state size: 100TB
12. © 2019 Ververica12
Stream Processing changes how Applications and Data interact
Batch Proc. or Req/resp.: Application / Business Logic sends request/trigger to (Datalake, Database) and gets result/response
Stream Processing: event stream → Stream Processor (Application / Business Logic) → event stream
events are the data
events act as triggers
application logic triggered by events/changes
13. © 2019 Ververica13
What is Stream Processing for?
Batch Proc. or Req/resp. (data changes slowly, query/logic changes fast): Ad-hoc queries, data exploration, ML model training
Continuous Streaming (data changes fast, query/logic changes slowly): Most business logic
14. © 2019 Ververica14
The Spectrum of Streaming Data Use Cases
Axis: more lag time ↔ more real time
data warehousing
OLAP / BI / reporting
continuous monitoring (position, risk, …)
real-time ML model training/evaluation
distributed OLTP-style apps
continuous ETL
real-time behavior modeling (recommenders, pricing, ..)
machine learning model training
unified offline/real-time analytics
real-time alerts (fraud, security, …)
17. © 2019 Ververica17
Everything is a Stream
Stream of Requests/Responses to/from Services
Service
DB
→ event sourcing architecture
GET /a/b POST /b/c PUT /e/f 200 404 200 200 403
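A minimal sketch of the event sourcing idea in plain Scala (no Flink dependency): if every request is kept as an event, the service state can be rebuilt as a fold over the stream. The `Put` and `Delete` event types and the `replay` function are hypothetical names chosen for illustration, not part of the talk.

```scala
object EventSourcingSketch {
  // Hypothetical event types; a real service would define its own.
  sealed trait Event
  final case class Put(path: String, body: String) extends Event
  final case class Delete(path: String) extends Event

  type State = Map[String, String]

  // Applying one event to the current state yields the next state.
  def applyEvent(state: State, event: Event): State = event match {
    case Put(path, body) => state.updated(path, body)
    case Delete(path)    => state - path
  }

  // The state at any point is a left fold over the events seen so far.
  def replay(events: Seq[Event]): State =
    events.foldLeft(Map.empty[String, String])(applyEvent)
}
```

Replaying a prefix of the event stream gives the state as of that point, which is what makes the stream, rather than the database, the source of truth in this sketch.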
18. © 2019 Ververica18
Everything is a Stream
Stream of Rows in a Table or in Files
[Figure: a timeline of rows with hourly timestamps, from 2016-3-1 12:00 am to 2016-3-12 3:00 am, continuing (…)]
19. © 2019 Ververica19
Everything is a Stream
Stream of Rows in a Table or in Files
[Figure: the same timeline of hourly rows, 2016-3-1 12:00 am to 2016-3-12 3:00 am, with one section of it labeled "a batch"]
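One way to read this slide, sketched in plain Scala under stated assumptions: a batch is a bounded slice of a stream, so the same logic can run over either. The `Row` type, the generated values, and the function names are hypothetical and only serve the illustration.

```scala
object BatchAsBoundedStream {
  final case class Row(hour: Int, value: Double)

  // An unbounded stream of rows, one per hour, generated lazily.
  def rows: Iterator[Row] = Iterator.from(0).map(h => Row(h, h * 1.5))

  // The same logic works on any stream of rows, bounded or not.
  def runningTotal(in: Iterator[Row]): Iterator[Double] =
    in.scanLeft(0.0)(_ + _.value).drop(1)

  // A batch: the rows whose hour falls in [from, until).
  def batch(from: Int, until: Int): Iterator[Row] =
    rows.dropWhile(_.hour < from).takeWhile(_.hour < until)
}
```

Calling `runningTotal(batch(0, 3))` terminates because the input is bounded, while `runningTotal(rows)` keeps producing results for as long as it is consumed.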
20. © 2019 Ververica20
Everything is a Stream
Streams may span storage systems
[Figure: a timeline of hourly rows, 2016-3-1 12:00 am to 2016-3-11 11:00pm, continuing (…), split across two storage systems]
more distant past: Parquet files (e.g., compressed files in DFS/Object Store)
recent past: Avro records (e.g., events in MQ/Log)
23. © 2019 Ververica23
Components of a Streaming Data Architecture
Event producers (applications, servers, databases, sensors)
Log / Stream Storage (Pulsar)
Stream Processing (Apache Flink)
Results (Views) (K/V stores, databases)
Triggered Applications
24. © 2019 Ververica24
Apache Flink: Analytics and Applications on Streaming Data
Stateful Stream Processing: Streams, State, Time
Event-driven Applications: Stateful Functions
Streaming Analytics: SQL and Tables
Flink Runtime: Stateful Computations over Data Streams
27. © 2019 Ververica27
Stateful Stream Processing
[Figure: a dataflow graph in which Source (Stream) and Source (Static) feed Transformation and Computation operators, several of which hold State, ending in two Sinks]
28. © 2019 Ververica28
Example Use Cases
• Real-time search and recommendation models (e.g., Alibaba)
• Build a real-time session behavior profile of users (e.g., Netflix)
• Real-time trade settlement dashboard (e.g., UBS)
• Real-time revenue accounting (various AdTechs)
• Machine Learning-based anomaly/fraud detection (e.g., ING, Microsoft)
• Real-time data refinement and data pipelines (many)
29. © 2019 Ververica29
DataStream API
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Source
val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer011(…))
// Transformation
val events: DataStream[Event] = lines.map((line) => parse(line))
// Windowed Transformation: key by sensor, aggregate over 5-second windows
val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .aggregate(new MyAggregationFunction())
// Sink
stats.addSink(new RollingSink(path))
Streaming Dataflow: Source → Transform → Window (state read/write) → Sink
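To make the keyed, windowed aggregation concrete without a cluster, here is a small sketch in plain Scala. It is not Flink code: it recomputes results over a finite sequence instead of maintaining state incrementally, it assumes non-negative timestamps, and the `Event` and `Statistic` fields are assumptions made for this example.

```scala
object TumblingWindowSketch {
  final case class Event(sensor: String, timestampMs: Long, value: Double)
  final case class Statistic(sensor: String, windowStartMs: Long, sum: Double)

  // Start of the fixed-size window that contains the given timestamp.
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - (timestampMs % sizeMs)

  // Group by (key, window) and sum each group, mirroring keyBy + window + aggregate.
  def aggregate(events: Seq[Event], sizeMs: Long): Seq[Statistic] =
    events
      .groupBy(e => (e.sensor, windowStart(e.timestampMs, sizeMs)))
      .map { case ((sensor, start), es) => Statistic(sensor, start, es.map(_.value).sum) }
      .toSeq
      .sortBy(s => (s.windowStartMs, s.sensor))
}
```

With a 5000 ms window, events at 1000 ms and 4000 ms for the same sensor land in the window starting at 0, while an event at 6000 ms starts a new window at 5000.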
33. © 2019 Ververica33
Example Use Cases
• Real-time Analytics Platforms (e.g., Alibaba, Uber, Lyft, Yelp!, Tencent)
• Materializing Views (dashboards, data marts)
• ETL - batch and continuous
• Machine Learning Training (Alibaba, new ML library)
34. © 2019 Ververica34
SQL / Table API – Batch Queries
SQL Query → Batch Query Execution
SELECT
  room,
  TUMBLE_END(rowtime, INTERVAL '1' HOUR),
  AVG(temperature)
FROM
  sensors
GROUP BY
  TUMBLE(rowtime, INTERVAL '1' HOUR), room
Full TPC-DS support in Flink 1.10
36. © 2019 Ververica36
SQL / Table API – Streaming Data Case
SELECT
  room,
  TUMBLE_END(rowtime, INTERVAL '1' HOUR),
  AVG(temperature)
FROM
  sensors
GROUP BY
  TUMBLE(rowtime, INTERVAL '1' HOUR), room
SQL Query → Interpret Stream as Table → Incremental Query Execution → output result changes as stream / update database with changes
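The incremental execution idea can be sketched in plain Scala: keep a small state per group and, for every incoming row, emit the changed result row. This is an illustration only, not how Flink's planner works; it emits an update per input row, whereas an engine may instead emit a single final row per window once the window closes. The `Reading` and `Change` types and the pre-computed `hour` field are assumptions.

```scala
object IncrementalAvgSketch {
  final case class Reading(room: String, hour: Long, temperature: Double)
  final case class Change(room: String, hour: Long, avg: Double)

  // Per-group state kept between rows: (sum, count).
  type State = Map[(String, Long), (Double, Long)]

  // Consume one row, update the state, and emit the changed result row.
  def step(state: State, r: Reading): (State, Change) = {
    val key = (r.room, r.hour)
    val (sum, count) = state.getOrElse(key, (0.0, 0L))
    val next = (sum + r.temperature, count + 1)
    (state.updated(key, next), Change(r.room, r.hour, next._1 / next._2))
  }

  // Run the query over a sequence of rows, collecting the emitted change stream.
  def run(rows: Seq[Reading]): Seq[Change] = {
    val init: (State, Vector[Change]) = (Map.empty, Vector.empty)
    rows.foldLeft(init) { case ((state, out), r) =>
      val (s, c) = step(state, r)
      (s, out :+ c)
    }._2
  }
}
```

A downstream database that upserts each `Change` by `(room, hour)` would always hold the current average per group.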
37. © 2019 Ververica37
FLIP-72
Add Pulsar connectors and Catalog to Apache Flink
> CREATE CATALOG my_pulsar (
    'type' = 'pulsar',
    'adminUrl' = 'localhost:9092'
  );
> USE my_pulsar;
> INSERT INTO aggregations
  SELECT
    room,
    TUMBLE_END(rowtime, INTERVAL '1' HOUR),
    AVG(temperature)
  FROM
    sensors
  GROUP BY
    TUMBLE(rowtime, INTERVAL '1' HOUR), room;
38. © 2019 Ververica38
Materialized Views Example
[Figure: CDC feeds a log; three Continuous SQL Queries read the log and maintain two Materialized Views and an Archive]
39. © 2019 Ververica39
Materialized Views Example
[Figure: CDC feeds a log; a Continuous SQL Query maintains Materialized Views, which serve a Dashboard]
View Materialization (streaming)
Dashboard: Many short queries (batch)
40. © 2019 Ververica40
Many handy SQL features: Temporal Joins, Pattern Matching, …
SELECT
  tf.time,
  tf.price * rh.rate AS conv_fare
FROM taxiFare AS tf,
  LATERAL TABLE (Rates(tf.time)) AS rh
WHERE tf.currency = rh.currency;
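What the temporal join computes can be sketched in plain Scala: each fare is joined with the version of the rate that was valid at the fare's own time, not with the latest rate. The `Fare` and `Rate` types, the `validFrom` field, and the linear scan are assumptions for illustration; they say nothing about how Flink implements `Rates(tf.time)`.

```scala
object TemporalJoinSketch {
  final case class Fare(time: Long, currency: String, price: Double)
  final case class Rate(validFrom: Long, currency: String, rate: Double)

  // The rate for a currency as of a given time: the latest version not after that time.
  def rateAsOf(history: Seq[Rate], currency: String, time: Long): Option[Rate] =
    history
      .filter(r => r.currency == currency && r.validFrom <= time)
      .sortBy(_.validFrom)
      .lastOption

  // Join each fare with the rate version valid at the fare's own time.
  def convert(fares: Seq[Fare], history: Seq[Rate]): Seq[(Long, Double)] =
    for {
      tf <- fares
      rh <- rateAsOf(history, tf.currency, tf.time).toSeq
    } yield (tf.time, tf.price * rh.rate)
}
```

Fares with no matching rate version are dropped, which matches the inner-join behavior of the `WHERE` clause in the query.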
45. © 2019 Ververica45
Consistency in Database Applications
[Figure: three Apps; one call fails (X)]
After a failure in any call, it becomes hard to reason about which effects have already happened and which have not.
48. © 2019 Ververica48
Stream Processing F-a-a-S
simplicity / generality
state management
composability
lightweight resources
performance
event-driven
Can we combine some of these properties?
49. © 2019 Ververica49
Stateful Functions
[Figure: a set of f(a,b) function instances between event ingress and event egress; state snapshots go to mass storage (S3, GCF, ECS, HDFS, …)]
50. © 2019 Ververica50
Stateful Functions compared to Stream Proc. & Apache Flink
Apache Flink: DataStream/Table
Stateful Functions: f(a,b) functions on a Pool of Resources (Apache Flink Cluster)
Solves two major challenges:
• Arbitrary Function-to-Function messaging. Not restricted to a DAG.
• Functions are multiplexed and share resources. Makes it possible to run many very small jobs.
51. © 2019 Ververica51
Example: Ride Sharing App
[Figure labels, grouped]
Events: Driver status updates, Passenger ride requests, Ride status update
Functions: Driver, Ride, Passenger, Geo-index
Messages: update, create, bill, inform / book, bid, lookup, update cell
States: seeking, confirmed, riding, free, bidding, booked
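A function in this style can be pictured as a small state machine that reacts to messages. The sketch below, in plain Scala, models a driver moving between the free, bidding, and booked states. It does not use the Stateful Functions SDK; the message names and the transitions are hypothetical, since the slide does not spell them out.

```scala
object DriverFunctionSketch {
  sealed trait DriverState
  case object Free extends DriverState
  case object Bidding extends DriverState
  case object Booked extends DriverState

  // Hypothetical messages a driver function might receive.
  sealed trait Message
  case object BidRequested extends Message
  case object BidAccepted extends Message
  case object BidRejected extends Message
  case object RideEnded extends Message

  // One invocation: current state plus a message yields the next state.
  // Messages that do not apply in the current state leave it unchanged.
  def onMessage(state: DriverState, msg: Message): DriverState = (state, msg) match {
    case (Free, BidRequested)   => Bidding
    case (Bidding, BidAccepted) => Booked
    case (Bidding, BidRejected) => Free
    case (Booked, RideEnded)    => Free
    case _                      => state
  }

  // Per-driver state keyed by driver id, a stand-in for persisted function state.
  def dispatch(states: Map[String, DriverState], driverId: String, msg: Message): Map[String, DriverState] =
    states.updated(driverId, onMessage(states.getOrElse(driverId, Free), msg))
}
```

Each driver id addresses its own state, so many such logical functions can share the same pool of resources.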
52. © 2019 Ververica52
Stream Processing / Streaming SQL (event/stream-centric):
• data preparation
• combining knowledge/information
• filtering, enriching, aggregating, joining events
Stateful Functions (state-centric):
• coordination, (interacting) state machines
• complex event/state interactions
F-a-a-S (stateless / compute-centric):
• “occasional” actions or spiky loads
• compute-intensive or blocking
53. © 2019 Ververica53
Putting it all together
FaaS: render map/route image, create a receipt PDF, send email
Stateful Functions: ride life-cycle, driver-to-ride matching
Stream Processing: traffic models, demand forecast & pricing
Billing
Passenger updates
Driver position updates
Driver status updates