18. Timeline
Papers on the timeline:
• NiagaraCQ: A Scalable Continuous Query System for Internet Databases.
• Aurora: A New Model and Architecture for Data Stream Management.
• TelegraphCQ: Continuous Dataflow Processing.
• Summingbird: A Framework for Integrating Batch and Online MapReduce Computations.
• Trill: A High-Performance Incremental Query Processor for Diverse Analytics.
• S. Murthy et al. Pulsar: Real-Time Analytics at Scale. Technical report, eBay, 2015.
21. A (Query) Language Perspective
It addresses the problem of manipulating and managing
data streams. It stresses the operations that are
necessary to build applications; some assumptions are
made about the underlying system(s).

A System Perspective
It addresses the problem of dealing with unbounded
data. It discusses the system primitives that are
necessary to guarantee low latency and fault
tolerance in the presence of *uncertainty*, e.g., late
arrivals.

A New Hybrid Hope Perspective
It addresses the problems in between: how do we map
the abstractions of a high-level language onto the
underlying system?
22. Goals
CQL:
• Exploit lessons learnt in RDBMS theory.
• A simple, clear, yet powerful language.
• Efficiently combine streams and relations.
DSM:
• Low latency is paramount.
• Take data distribution and ordering into account.
• Avoid implicit operator semantics.
DFM:
• Correctness first, with support for sessions.
• Latency requirements vs. resource cost should dictate the system choice.
• A single processing model for bounded and unbounded data.
23. Time
CQL:
• DSMS Clock Time, i.e., the timestamp assigned by the DSMS.
• Source Time, i.e., the timestamp assigned by the source.
• Source Heartbeats are used to declare that no element with a lower timestamp will be sent.
DSM:
• Logical Order, induced by the timestamp assigned by the data source.
• Physical Order, induced by the timestamp assigned by the system at ingestion.
DFM:
• Event Time is the time at which the event itself actually occurred.
• Processing Time is the time at which an event is observed during processing.
• Watermark, i.e., a global progress metric that tracks the skew between event time and processing time.
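The distinction between event time and processing time can be made concrete with a toy low-watermark tracker. This is a minimal sketch, not any system's actual implementation; the names `Event`, `WatermarkTracker`, and `allowed_lateness` are all hypothetical, and the watermark heuristic (max observed event time minus an allowed lateness) is just one simple policy:

```python
from dataclasses import dataclass

@dataclass
class Event:
    key: str
    event_time: int       # when the event actually occurred
    processing_time: int  # when the system observes it

class WatermarkTracker:
    """Toy low-watermark: estimates the oldest event time that may still
    arrive as (max event time seen so far) minus an allowed lateness."""

    def __init__(self, allowed_lateness: int):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None

    def observe(self, e: Event) -> None:
        # advance progress based on event time, not processing time
        if self.max_event_time is None or e.event_time > self.max_event_time:
            self.max_event_time = e.event_time

    def watermark(self) -> float:
        if self.max_event_time is None:
            return float("-inf")  # no progress information yet
        return self.max_event_time - self.allowed_lateness

    def is_late(self, e: Event) -> bool:
        # an arrival behind the watermark is a late arrival
        return e.event_time < self.watermark()
```

A real watermark is a global metric computed across all sources; this sketch only shows the local bookkeeping that makes "late arrival" a well-defined notion.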
26. CQL in 3 Slides
A Stream S is a possibly infinite multi-set of elements <s,t>
where s is a tuple belonging to the schema of S and t is a
timestamp.
Relation R is a set of tuples (d1, d2, ..., dn), where each
element dj is a member of Dj, a data domain¹.
¹ A data domain refers to all the values which a data element may contain.
27. CQL in 3 4 Slides
A Stream S is a possibly infinite multi-set of elements <s,t>
where s is a tuple belonging to the schema of S and t is a
timestamp.
Relation R is a set of tuples (d1, d2, ..., dn), where each
element dj is a member of Dj, a data domain¹. ✗
(On the slide, this relation definition is crossed out: it does not account for time.)
¹ A data domain refers to all the values which a data element may contain.
28. Ok, CQL in 4 5 Slides
A Stream S is a possibly infinite multi-set of elements <s,t>
where s is a tuple belonging to the schema of S and t is a
timestamp.
Relation R is a mapping from each time instant in T to a
finite but unbounded bag of tuples belonging to the
schema of R.
29. CQL in 5 Slides
A Stream S is a possibly infinite multi-set of elements <s,t>
where s is a tuple belonging to the schema of S and t is a
timestamp.
Relation R is a mapping from each time instant in T to a
finite but unbounded bag of tuples belonging to the
schema of R.
30. CQL in 5 Slides
Streams vs. Relations:
• A Stream is an infinite, unbounded sequence of elements ⟨s, τ⟩.
• A Relation R(t) is a finite bag of tuples at each time instant, i.e., a mapping T → R.
Operators between the two worlds:
• stream-to-relation: sliding windows
• relation-to-relation: relational algebra (almost)
• relation-to-stream: stream operators (Rstream, Istream, Dstream)
31. CQL in 5 6 Slides
Stream-to-Relation Operators:
• Sliding Window:
FROM S [RANGE 5 Minutes]
• Parametric Sliding Window:
FROM S [RANGE 5 Minutes SLIDE 1 Minute]
• Partitioned Window:
FROM S [PARTITION BY A1,...,An ROWS m]
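The time-based sliding window above can be sketched in a few lines. This is a toy illustration of the semantics of `[RANGE 5 Minutes]`, not CQL's implementation; the class name `RangeWindow` and the assumption of in-timestamp-order arrival are mine:

```python
from collections import deque

class RangeWindow:
    """Time-based sliding window: at query time `now`, the window contains
    every tuple whose timestamp lies in (now - range_, now]."""

    def __init__(self, range_: int):
        self.range_ = range_
        self.buf = deque()  # (timestamp, tuple), assumed inserted in order

    def insert(self, ts, tup):
        self.buf.append((ts, tup))

    def contents(self, now):
        # evict tuples that have slid out of the range (in-order assumption)
        while self.buf and self.buf[0][0] <= now - self.range_:
            self.buf.popleft()
        return [tup for ts, tup in self.buf if ts <= now]
```

For example, with a range of 5 (minutes), tuples stamped 1, 3, and 7 queried at time 7 yield only those stamped 3 and 7, because the tuple at time 1 has slid out of the interval (2, 7].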
33. CQL in 6 Slides
Relation-to-Stream Operators:
• Rstream: at each instant, streams out all tuples in the relation.
• Istream: streams out tuples in the last instant that were not in the
previous one, i.e., streams out what is new.
• Dstream: streams out tuples in the previous instant that are not in the
last one, i.e., streams out what was deleted.
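The three operators reduce to bag arithmetic between two consecutive instants of the relation. A minimal sketch, representing the bags R(t-1) and R(t) as multisets (`collections.Counter`); the function names mirror the operators, everything else is illustrative:

```python
from collections import Counter

def rstream(r_prev, r_now):
    # Rstream: emit every tuple in the current instant R(t)
    return Counter(r_now)

def istream(r_prev, r_now):
    # Istream: tuples in R(t) but not in R(t-1), i.e., what is new
    return Counter(r_now) - Counter(r_prev)

def dstream(r_prev, r_now):
    # Dstream: tuples in R(t-1) but not in R(t), i.e., what was deleted
    return Counter(r_prev) - Counter(r_now)
```

For instance, moving from R(t-1) = {a, b} to R(t) = {b, c}, Istream emits {c} and Dstream emits {a}, while Rstream re-emits the whole of R(t).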
34. Reasoning about time
• What results are being computed?
• Where in event time are they being computed?
• When in processing time are they being materialised?
• How do earlier results relate to later refinements?
DFM in X < 42* Slides
*overestimation
35. DFM in X Slides
(WHAT) Processing Model
A Stream is represented as a possibly unbounded collection
of key-value pairs, i.e., a PCollection<K,V>.
PTransforms are PCollection-to-PCollection operations.
39. DFM in X Slides
(WHERE) Windowing Model
• The DFM windowing model requires richer primitives than CQL's.
• Window Assignment, i.e., each element is assigned to a corresponding window.
• Window Merge, implemented as the following sequence:
  • Drop Timestamp: only the window interval is relevant from here on.
  • GroupByKey: to enable parallel execution.
  • Window Merge: the merge logic depends on the windowing strategy.
  • GroupByWindow: to ensure elements are processed in sequence, window-wise.
  • ExpandToElements: assigns a valid timestamp to the elements.
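The assignment and merge steps are easiest to see for session windows, the canonical unaligned window. This is a toy sketch of the idea, not DFM's implementation: each element is assigned a proto-window `[ts, ts + gap)`, and overlapping proto-windows are then merged into sessions (the function names are mine):

```python
def assign_session(ts, gap):
    """Window assignment: each element gets a proto-session [ts, ts + gap)."""
    return (ts, ts + gap)

def merge_windows(windows):
    """Window merge for the session strategy: coalesce overlapping
    intervals; non-overlapping intervals stay separate sessions."""
    merged = []
    for start, end in sorted(windows):
        if merged and start < merged[-1][1]:  # overlaps the previous session
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With a gap of 10, elements at timestamps 1, 5, and 30 produce proto-windows (1, 11), (5, 15), (30, 40), which merge into two sessions: (1, 15) and (30, 40).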
40. DFM in X Slides
(WHEN) Triggering Model
• In order to build unaligned event-time windows (windows
that apply across subsets of the data), DFM is forced to
decouple the reporting of window contents from the watermark.
• To do so, DFM introduces an orthogonal model that allows
users to signal when a window is ready to be processed.
41. DFM in X Slides
(WHEN) Triggering Model, continued
• Triggers solve this issue by allowing DFM users to write
arbitrary logic to signal the completion of a window.
42. DFM in X Slides
(HOW) Triggering Model (part 2)
In addition to the triggering semantics, DFM introduces different
refinement models to deal with late arrivals:
• Discarding, i.e., upon triggering, window contents are discarded
and later results bear no relation to previous results.
• Accumulating, i.e., upon triggering, window contents are left intact
in persistent state, and later results become a refinement of
previous results.
• Accumulating & Retracting, i.e., it extends the accumulating
semantics with a retraction of the previous value.
49. A Conceptual View of Kafka
• Producers send messages to topics
• Consumers read messages from topics
• Messages are key-value pairs
• Topics are streams of messages
• The Kafka cluster manages topics
courtesy of Emanuele Della Valle - http://emanueledellavalle.org
50. A Logical View of Kafka
• Brokers are the main
storage and messaging
components of the Kafka
cluster
courtesy of Emanuele Della Valle - http://emanueledellavalle.org
51. Reconciling the two views of Kafka
• Topics are partitioned across brokers
• Producers shard messages over the partitions of a certain topic
• Typically, the message key determines which Partition a message is assigned to
courtesy of Emanuele Della Valle - http://emanueledellavalle.org
52. Topic partitioning invites distributed consumption
• Different Consumers can read data from the same Topic
  • By default, each Consumer will receive all the messages in the Topic
• Multiple Consumers can be combined into a Consumer Group
  • Consumer Groups provide scaling capabilities
  • Each Consumer is assigned a subset of Partitions for consumption
courtesy of Emanuele Della Valle - http://emanueledellavalle.org
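Key-based sharding and consumer-group assignment can be sketched in a few lines. This is a simplified stand-in, not Kafka's code: the real default partitioner uses murmur2 (CRC32 here for a stable stdlib hash), and real group assignment is negotiated via rebalancing protocols (a plain round-robin here):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # shard a message over a topic's partitions by hashing its key,
    # so that the same key always lands in the same partition
    return zlib.crc32(key.encode()) % num_partitions

def assign_partitions(partitions, consumers):
    """Round-robin assignment: each Consumer in the group gets a
    subset of the topic's partitions for exclusive consumption."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

Because the key determines the partition, all messages for one key are totally ordered within that partition, which is what makes per-key distributed consumption well defined.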
54. Dual Stream Model
The intuition
https://www.michael-noll.com/blog/2018/04/05/of-stream-and-tables-in-kafka-and-stream-processing-part1/
56. Dual Stream Model
The Truth about Streams
A Stream is an unbounded and ordered sequence of key-value pairs. Two kinds:
• Change-log Stream: a stream whose records are updates to a table;
records are identified by a primary key.
• Record Stream: a stream whose records are facts;
records are not identified by a primary key.
57. Dual Stream Model
The Truth about Tables
A table is a collection of table versions; one version for each
point in time, using the timestamp as a version number.
T(1) = {T5 = {⟨A,7.2⟩}}
T(2) = {T5 = {⟨A,7.2⟩}, T6 = {⟨B,14.7⟩}}
T(3) = {T5 = {⟨A,7.2⟩}, T6 = {⟨A,8.9⟩, ⟨B,14.7⟩}}
T(4) = {T3 = {⟨B,12.1⟩}, T5 = {⟨A,7.2⟩, ⟨B,12.1⟩}, T6 = {⟨A,8.9⟩, ⟨B,14.7⟩}}
T(5) = {T3 = {⟨B,12.1⟩}, T5 = {⟨A,7.2⟩, ⟨B,12.1⟩}, T6 = {⟨A,8.9⟩, ⟨B,14.7⟩}, T8 = {⟨A,8.9⟩, ⟨B,16.7⟩}}
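The stream/table duality behind these versions can be sketched by replaying a change-log stream into table snapshots. A minimal illustration (the function name `materialize` and the tuple layout are mine), in which each upsert by primary key yields the next table version:

```python
def materialize(changelog):
    """Replay a change-log stream of (timestamp, key, value) upserts
    into table versions: one snapshot of the table per update."""
    table, versions = {}, []
    for ts, key, value in changelog:
        table[key] = value              # update identified by primary key
        versions.append((ts, dict(table)))  # copy = table version at ts
    return versions
```

For example, replaying ⟨A,7.2⟩ at t=5, ⟨B,14.7⟩ at t=6, then the update ⟨A,8.9⟩ at t=6 yields the snapshots {A:7.2}, {A:7.2, B:14.7}, and {A:8.9, B:14.7}, mirroring how each table version refines the previous one.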
60. Dual Stream Model
The Complete Picture
https://www.michael-noll.com/blog/2018/04/05/of-stream-and-tables-in-kafka-and-stream-processing-part1/
61. Dual Stream Model
The DSM simplifies reasoning about the transformations, but it
does not solve the unboundedness problem: we still need
infinite memory to process an infinite stream.
DSM introduces the retention time to make the trade-off explicit:
Result Correctness vs. Runtime Cost
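The retention trade-off can be made concrete with a toy eviction rule over the table versions from slide 57. This is an illustrative sketch, not Kafka Streams' retention mechanism; the function name and parameters are mine:

```python
def enforce_retention(versions, now, retention):
    """Keep only table versions newer than (now - retention).
    Bounded memory is gained; late updates addressed to an evicted
    version can no longer be applied, so correctness may suffer."""
    return [(ts, snap) for ts, snap in versions if ts > now - retention]
```

A longer retention keeps more versions available for late arrivals (better correctness) at a higher storage cost; a shorter one does the opposite, which is exactly the explicit trade-off the slide names.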