SlideShare a Scribd company logo
1 of 121
Download to read offline
Information Flow Processing
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology
(Tehran Polytechnic)
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 1 / 83
Motivation
Many applications must process large streams of live data and pro-
vide results in real-time.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 2 / 83
Motivation
Many applications must process large streams of live data and pro-
vide results in real-time.
• Wireless sensor networks
• Traffic management applications
• Stock marketing
• Environmental monitoring applications
• Fraud detection tools
• ...
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 2 / 83
Motivation
Processing information as it flows, without storing them persistently.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 3 / 83
Motivation
Processing information as it flows, without storing them persistently.
Traditional DBMSs:
• Store and index data before processing it.
• Process data only when explicitly asked by the users.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 3 / 83
Motivation
Processing information as it flows, without storing them persistently.
Traditional DBMSs:
• Store and index data before processing it.
• Process data only when explicitly asked by the users.
• Both aspects contrast with our requirements.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 3 / 83
One Name, Different Technologies
Several research communities are contribut-
ing in this area:
• Each brings its own expertise
• Point of view
• Vocabulary: event, data, stream, ...
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 4 / 83
One Name, Different Technologies
Several research communities are contribut-
ing in this area:
• Each brings its own expertise
• Point of view
• Vocabulary: event, data, stream, ...
Tower of Babel Syndrome!
Come on! Let’s go down and confuse them by making them speak different languages,
then they won’t be able to understand each other.
Genesis 11:7
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 4 / 83
Information Flow Processing (IFP)
Source: produces the incoming information flows
Sink: consumes the results of processing
IFP engine: processes incoming flows
Processing rules: how to process the incoming flows
Rule manager: adds/removes processing rules
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 5 / 83
IFP Competing Models
Data Stream Management Systems (DSMS)
Complex Event Processing (CEP)
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 6 / 83
IFP Competing Models
Data Stream Management Systems (DSMS)
Complex Event Processing (CEP)
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 6 / 83
Data Stream Management Systems (DSMS)
An evolution of traditional data processing, as supported by DBMSs.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 7 / 83
DBMS vs. DSMS (1/3)
DBMS: persistent data where updates are relatively infrequent.
DSMS: transient data that is continuously updated.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 8 / 83
DBMS vs. DSMS (2/3)
DBMS: runs queries just once to return a complete answer.
DSMS: executes standing queries, which run continuously and pro-
vide updated answers as new data arrives.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 9 / 83
DBMS vs. DSMS (3/3)
Despite these differences, DSMSs resemble DBMSs: both process
incoming data through a sequence of transformations based on SQL
operators, e.g., selections, aggregates, joins.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 10 / 83
Out of Scope of DSMS
DSMSs focus on producing query answers.
Detection and notification of complex patterns of elements are usu-
ally out of the scope of DSMSs:
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 11 / 83
Out of Scope of DSMS
DSMSs focus on producing query answers.
Detection and notification of complex patterns of elements are usu-
ally out of the scope of DSMSs: Complex Event Processing
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 11 / 83
IFP Competing Models
Data Stream Management Systems (DSMS)
Complex Event Processing (CEP)
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 12 / 83
Complex Event Processing (CEP)
DSMSs limitation: detecting complex patterns of incoming items,
involving sequencing and ordering relationships.
CEP models flowing information items as notifications of events
happening in the external world.
• They have to be filtered and combined to understand what is
happening in terms of higher-level events.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 13 / 83
CEP vs. Publish/Subscribe Systems
CEP systems can be seen as an extension to traditional pub-
lish/subscribe systems.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 14 / 83
CEP vs. Publish/Subscribe Systems
CEP systems can be seen as an extension to traditional pub-
lish/subscribe systems.
Traditional publish/subscribe systems consider each event separately
from the others, and filter them based on their topic or content.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 14 / 83
CEP vs. Publish/Subscribe Systems
CEP systems can be seen as an extension to traditional pub-
lish/subscribe systems.
Traditional publish/subscribe systems consider each event separately
from the others, and filter them based on their topic or content.
CEPs extend this functionality by increasing the expressive power of
the subscription language to consider complex event patterns that
involve the occurrence of multiple related events.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 14 / 83
IFP Modeling
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 15 / 83
One Model, Several Models
Different models to capture different viewpoints.
• Functional model
• Processing model
• Time model
• Data model
• Rule model
• Language model
• Interaction model
• Deployment model
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 16 / 83
Functional Model
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 17 / 83
Functional Model
An abstract architecture of the main functional components of an
IFP engine.
[G. Cugolap et al., Processing Flows of Information: From Data Stream to Complex Event Processing, 2012]
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 18 / 83
Receiver and Clock
Receiver manages the channels connecting the sources with the IFP
engine.
Clock models periodic processing of their inputs.
[G. Cugolap et al., Processing Flows of Information: From Data Stream to Complex Event Processing, 2012]
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 19 / 83
Rules, Decider and Producer
We assume rules can be (logically) decomposed in two parts:
• C → A
• C is the condition
• A is the action
Example (in CQL):
Select IStream(Count(*)) (action)
From F1 [Range 1 Minute] Where F1.A > 0 (condition)
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 20 / 83
Rules, Decider and Producer
We assume rules can be (logically) decomposed in two parts:
• C → A
• C is the condition
• A is the action
Example (in CQL):
Select IStream(Count(*)) (action)
From F1 [Range 1 Minute] Where F1.A > 0 (condition)
This way we can split processing in two phases:
• Decider: determines the items that trigger the rule.
• Producer: use those items to produce the output of the rule.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 20 / 83
Detection-Production Cycle (1/2)
The Detector evaluates all the rules in the Rules store to find those
whose condition part is true.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 21 / 83
Detection-Production Cycle (1/2)
The Detector evaluates all the rules in the Rules store to find those
whose condition part is true.
With the newly arrived information, the Detector may also use the
information present in the Knowledge Base.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 21 / 83
Detection-Production Cycle (1/2)
The Detector evaluates all the rules in the Rules store to find those
whose condition part is true.
With the newly arrived information, the Detector may also use the
information present in the Knowledge Base.
At the end of this phase we have a set of rules that have to be
executed.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 21 / 83
Detection-Production Cycle (2/2)
The Producer takes the information and executes each triggered rule
(i.e., its action part).
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 22 / 83
Detection-Production Cycle (2/2)
The Producer takes the information and executes each triggered rule
(i.e., its action part).
In executing rules, the Producer may combine the items that trig-
gered the rule.
• Received from the Decider together with the information present in
the Knowledge Base.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 22 / 83
Detection-Production Cycle (2/2)
The Producer takes the information and executes each triggered rule
(i.e., its action part).
In executing rules, the Producer may combine the items that trig-
gered the rule.
• Received from the Decider together with the information present in
the Knowledge Base.
Usually, these new items are sent to sinks, through the Forwarder,
or sent internally to be processed again.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 22 / 83
Processing Model
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 23 / 83
Processing Model
Three policies affect the behavior of the system:
• The selection policy
• The consumption policy
• The load shedding policy
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 24 / 83
Selection Policy
Determines if a rule fires once or multiple times and the items ac-
tually selected from the History.
[G. Cugolap et al., Processing Flows of Information: From Data Stream to Complex Event Processing, 2012]
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 25 / 83
Consumption Policy
Determines how the history changes after firing of a rule: what
happens when new items enter the Decider
[G. Cugolap et al., Processing Flows of Information: From Data Stream to Complex Event Processing, 2012]
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 26 / 83
Load Shedding Policy
A technique adopted by some IFP systems to deal with burst inputs.
It can be described as an automatic drop of information items when
the input rate becomes too high for the processing capabilities of
the engine.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 27 / 83
Time Model
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 28 / 83
Time Model
The relationship between the information items flowing into the IFP
engine and the passing of time.
Ability of an IFP system to associate some kind of happened-before
(ordering) relationship to information items.
Four classes:
1 Stream-only
2 Causal
3 Absolute
4 Interval
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 29 / 83
Stream-Only Time Model
Not any special meaning to time.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 30 / 83
Stream-Only Time Model
Not any special meaning to time.
Timestamps are used mainly to order items at the frontier of the
engine (i.e., within the receiver).
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 30 / 83
Stream-Only Time Model
Not any special meaning to time.
Timestamps are used mainly to order items at the frontier of the
engine (i.e., within the receiver).
They are lost during processing.
• The ordering and timestamps of the output stream are conceptually
separate from the ordering and timestamps of the input streams.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 30 / 83
Stream-Only Time Model
Not any special meaning to time.
Timestamps are used mainly to order items at the frontier of the
engine (i.e., within the receiver).
They are lost during processing.
• The ordering and timestamps of the output stream are conceptually
separate from the ordering and timestamps of the input streams.
Example: CQL/Stream
Select DStream(*)
From F1[Rows 5], F2[Range 1 Minute]
Where F1.A = F2.A
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 30 / 83
Causal Time Model
Each item has a label reflecting some kind of causal relationship.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 31 / 83
Causal Time Model
Each item has a label reflecting some kind of causal relationship.
Partial order
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 31 / 83
Causal Time Model
Each item has a label reflecting some kind of causal relationship.
Partial order
Example: Gigascope
Select count(*)
From A, B
Where A.a-1 <= B.b and A.a+1 > B.b
A.a and B.b monotonically increase
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 31 / 83
Absolute Time Model
Information items have an associated timestamp.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 32 / 83
Absolute Time Model
Information items have an associated timestamp.
Defining a single point in time, wrt a (logically) unique clock.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 32 / 83
Absolute Time Model
Information items have an associated timestamp.
Defining a single point in time, wrt a (logically) unique clock.
Information items can be timestamped at source or entering the
engine
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 32 / 83
Absolute Time Model
Information items have an associated timestamp.
Defining a single point in time, wrt a (logically) unique clock.
Information items can be timestamped at source or entering the
engine
Total order
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 32 / 83
Absolute Time Model
Information items have an associated timestamp.
Defining a single point in time, wrt a (logically) unique clock.
Information items can be timestamped at source or entering the
engine
Total order
Example: TESLA/T-Rex
Define Fire(area: string, measuredTemp: double)
From Smoke(area=$a) and last Temp(area=$a and
value>45) within 5 min. from Smoke
Where area=Smoke.area and measuredTemp=Temp.value
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 32 / 83
Interval Time Model
Associate items with an interval, i.e., two timestamps taken from a
global time.
Usually representing: the time when the related event started, the
time when it ended.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 33 / 83
Data Model
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 34 / 83
Data Model
Studies how the different systems
• Represent single data items
• Organize them into data flows
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 35 / 83
Nature of Items
The meaning we associate to information items
• Generic data
• Event notifications
Influences other aspects of an IFP
system
• Time model
• Rule language
• Semantics of processing
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 36 / 83
Format of Items
How information is represented
Influences the way items are processed,
e.g., relational model requires tuples
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 37 / 83
Support for Uncertainty
Ability to associate a degree of
uncertainty to information items.
When present, probabilistic information
is usually exploited in rules during
processing.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 38 / 83
Data Flows
Homogeneous
• Each flow contains data with the same format and kind.
• E.g., a sequence of unbounded tuples generated continuously in
time: · · · (a1, a2, · · · , an, t − 1)(a1, a2, · · · , an, t)(a1, a2, · · · , an, t + 1) · · ·,
where ai denotes an attribute.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 39 / 83
Data Flows
Homogeneous
• Each flow contains data with the same format and kind.
• E.g., a sequence of unbounded tuples generated continuously in
time: · · · (a1, a2, · · · , an, t − 1)(a1, a2, · · · , an, t)(a1, a2, · · · , an, t + 1) · · ·,
where ai denotes an attribute.
Heterogeneous
• Information flows are seen as channels
connecting sources, processors, and sinks
• Each channel may transport items with
different kind and format.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 39 / 83
Rule Model
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 40 / 83
Rule Model
Rules are classified into two macro classes:
• Transforming rules
• Detecting rules
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 41 / 83
Transforming Rules (1/2)
No explicit distinction between detection and production.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 42 / 83
Transforming Rules (1/2)
No explicit distinction between detection and production.
Execution plan of primitive operators (processing elements (PE)).
• A logical network of PEs connected in a DAG.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 42 / 83
Transforming Rules (1/2)
No explicit distinction between detection and production.
Execution plan of primitive operators (processing elements (PE)).
• A logical network of PEs connected in a DAG.
Each PE transforms one or more input flows into one or more output
flows.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 42 / 83
Transforming Rules (2/2)
PEs execute independently and in parallel
Not synchronized
Communicate through messaging
Upstream node vs. downstream node
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 43 / 83
Detecting Rules
An explicit distinction between detection and production.
Usually, the detection is based on a logical predicate that captures
patterns of interest in the history of received items.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 44 / 83
Language Model
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 45 / 83
Language Model
Following the rule model, we define two classes of languages:
• Transforming languages: declarative languages and imperative
languages
• Detecting languages: patternbased
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 46 / 83
Language Model
Following the rule model, we define two classes of languages:
• Transforming languages: declarative languages and imperative
languages
• Detecting languages: patternbased
Specify operations to filter, join, aggregate, ...
Input flows
To produce one or more output flows
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 46 / 83
Declarative Languages
Specify the expected result rather than the desired execution flow.
Usually derive from relational languages, e.g., SQL
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 47 / 83
Declarative Languages
Specify the expected result rather than the desired execution flow.
Usually derive from relational languages, e.g., SQL
Example CQL/Stream:
Select IStream(*)
From F1[Rows 5], F2[Rows 10]
Where F1.A = F2.A
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 47 / 83
Imperative Languages
Specify the desired execution flow
Starting from primitive operators, i.e., PEs
Usually adopt a graphical notation
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 48 / 83
Pattern-Based Languages
Specify a firing condition as a pattern
Select a portion of incoming flows
The action uses selected items to produce new knowledge
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 49 / 83
Pattern-Based Languages
Specify a firing condition as a pattern
Select a portion of incoming flows
The action uses selected items to produce new knowledge
Example, TESLA / T-Rex
Define Fire(area: string, measuredTemp: double)
From Smoke(area=$a) and last
Temp(area=$a and value>45)
within 5 min. from Smoke
Where area=Smoke.area and measuredTemp=Temp.value
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 49 / 83
Interaction Model
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 50 / 83
Interaction Model
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 51 / 83
Interaction Model
Processing Element (PE): a processing unit in a IFP system.
How do PEs interact?
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 51 / 83
Interaction Model
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 52 / 83
Deployment Model
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 53 / 83
Deployment Model
IFP applications may include a large number of sources, sinks, and
PEs.
Possibly dispersed over a wide geographical area.
How the components of the functional model can be distributed to
achieve scalability.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 54 / 83
Deployment Model
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 55 / 83
Deployment Model
Given a network of PEs.
How to map it onto the physical network of nodes
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 56 / 83
Scaling Mechanism
How to scale with increasing the number queries and the rate of
incoming events?
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 57 / 83
Scaling Mechanism
How to scale with increasing the number queries and the rate of
incoming events?
Two main solutions:
• Data partitioning: a reasonable data partitioning and merging scheme
as well as mechanisms to detect points for parallelization.
• Query Partitioning: to distribute the load across available hosts and
to achieve a load balance between these machines.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 57 / 83
Data Partitioning (1/3)
Early approaches parallelize operators by introducing a splitter and
merge operator.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 58 / 83
Data Partitioning (1/3)
Early approaches parallelize operators by introducing a splitter and
merge operator.
The merge operator: union or sort
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 58 / 83
Data Partitioning (1/3)
Early approaches parallelize operators by introducing a splitter and
merge operator.
The merge operator: union or sort
E.g., parallelize a filter operation using a round robin scheme.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 58 / 83
Data Partitioning (1/3)
Early approaches parallelize operators by introducing a splitter and
merge operator.
The merge operator: union or sort
E.g., parallelize a filter operation using a round robin scheme.
E.g., parallelize an aggregation using a hash scheme.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 58 / 83
Data Partitioning (2/3)
In more recent systems:
E.g., in Storm a user can express data parallelism by defining the
number of parallel tasks per operator.
E.g., S4 creates a PE for each new key in the data stream.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 59 / 83
Data Partitioning (2/3)
In more recent systems:
E.g., in Storm a user can express data parallelism by defining the
number of parallel tasks per operator.
E.g., S4 creates a PE for each new key in the data stream.
In both approaches the user needs to understand the data parallelism
and explicitly enforce in its code the sequential ordering.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 59 / 83
Data Partitioning (3/3)
An auto-parallelization approach has recently be proposed.
A combination of compiler and runtime:
• The compiler detects regions for parallelization.
• The system runtime guarantees that output tuples follow the same
order as for a sequential execution.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 60 / 83
Query Partitioning
Operator placement problem: the problem of assigning a set of
operators to a set of available hosts.
A single PE can be running in parallel on different nodes.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 61 / 83
Query Partitioning
Operator placement problem: the problem of assigning a set of
operators to a set of available hosts.
A single PE can be running in parallel on different nodes.
E.g., SEEP can dynamically vary the number of processing nodes
within the system based on the workload.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 61 / 83
Fault Tolerance Mechanism
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 62 / 83
Recovery Methods
The recovery methods of streaming frameworks must take:
• Correctness, e.g., data loss and duplicates
• Performance, e.g., low latency
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 63 / 83
Basic Idea
Each processing node has an associated backup node.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 64 / 83
Basic Idea
Each processing node has an associated backup node.
The backup node’s stream processing engine is identical to the pri-
mary one.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 64 / 83
Basic Idea
Each processing node has an associated backup node.
The backup node’s stream processing engine is identical to the pri-
mary one.
But the state of the backup node is not necessarily the same as that
of the primary.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 64 / 83
Basic Idea
Each processing node has an associated backup node.
The backup node’s stream processing engine is identical to the pri-
mary one.
But the state of the backup node is not necessarily the same as that
of the primary.
If a primary node fails, its backup node takes over the operation of
the failed node.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 64 / 83
Recovery Methods
GAP recovery
Rollback recovery
Precise recovery
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 65 / 83
GAP Recovery
The weakest recovery guarantee
A new task takes over the operations of the failed task.
The new task starts from an empty state.
Tuples can be lost during the recovery phase.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 66 / 83
Rollback Recovery
The information loss is avoided, but the output may contain dupli-
cate tuples.
Three types of rollback recovery:
• Active backup
• Passive backup
• Upstream backup
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 67 / 83
Rollback Recovery - Active Backup
Both primary and backup nodes are given the same input.
The output tuples of the backup node are logged at the output
queues and they are not sent downstream.
If the primary fails, the backup takes over by sending the logged tu-
ples to all downstream neighbors and then continuing its processing.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 68 / 83
Rollback Recovery - Passive Backup
Periodically check-points processing state to a shared storage.
The backup node takes over from the latest checkpoint when the
primary fails.
The backup node is always equal or behind the primary.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 69 / 83
Rollback Recovery - Upstream Backup
Upstream nodes store the tuples until the downstream nodes ac-
knowledge them.
If a node fails, an empty node rebuilds the latest state of the failed
primary from the logs kept at the upstream server.
There is no backup node in this model.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 70 / 83
Precise Recovery
Post-failure output is exactly the same as the output without failure.
Can be achieved by modifying the algorithms for rollback recovery.
• For example, in passive backup, after a failure occurs the backup
node can ask the downstream nodes for the latest tuples they
received and trim the output queues accordingly to prevent the
duplicates.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 71 / 83
Brief History of IFP Systems
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 72 / 83
First Generation
A stand-alone prototypes or as extensions of existing database en-
gines.
They were developed with a specific use case in mind and are very
limited regarding the supported operator types as well as available
functionalities.
Niagara, Telegraph, Aurora, ...
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 73 / 83
First Generation Example - Aurora
A single site stream-processing engine (centralized).
DAG based processing model for streams.
Push-based strategy.
The first Aurora did not support fault tolerance.
Stream Query Algebra (SQuAl), i.e., SQL with additional features,
e.g., windowed queries.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 74 / 83
Second Generation
Systems extended the ideas of data stream processing with advanced
features such as fault tolerance, adaptive query processing, as well
as an enhanced operator expressiveness.
Borealis, CEDR, System S, CAPE, ...
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 75 / 83
Second Generation Example - Borealis
Distributed version of Aurora.
Advanced functionalities on top of Aurora:
• Dynamic revision of query results: correct errors in previously
reported data.
• Dynamic query modifications: change certain attributes of the
query at runtime.
Pull-based strategy.
Rollback recovery with active backup.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 76 / 83
Third Generation
Driven by the trend towards cloud computing: highly scalable and
robust towards faults.
Storm, Apache S4, D-Streams, SEEP, StreamCloud, ...
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 77 / 83
Third Generation Example - Storm (1/2)
Stream processing is guaranteed: a message cannot be lost due to
node failures.
DAG based processing:
• the DAG is called Topology
• the PEs are called Bolts
• the stream sources are called Spouts
It does not have an explicit programming
paradigm.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 78 / 83
Third Generation Example - Storm (2/2)
Pull-based strategy.
Rollback recovery with upstream backup.
Three sets of nodes:
• Nimbus: distributes the code among the worker nodes, and keeps
track of the progress of the worker nodes
• Supervisor: the set of worker nodes
• Zookeeper: coordination between supervisor nodes and the Nimbus
Built by twitter
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 79 / 83
Third Generation Example - S4 (1/2)
S4: Simple Scalable Streaming System.
Constructing a DAG structure of PEs at runtime.
• A PE is instantiated for each value of the key attribute.
The processing model is inspired by MapReduce.
Events are dispatched to nodes according to their key.
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 80 / 83
Third Generation Example - S4 (2/2)
Push-based strategy
GAP recovery
Communication layer: coordination between the processing nodes
and the messaging between nodes.
• Uses Zookeeper
Built by yahoo
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 81 / 83
Summary
IFP: DSMS and CEP
IFP modeling: functional, processing, time, data, rule, language,
interaction, deployment
Recovering models: GAP, Rollback, and Precise
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 82 / 83
Questions?
Acknowledgements
Some slides and pictures were derived from G. Cugola slides
(Politecnico di Milano).
Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 83 / 83

More Related Content

Viewers also liked

Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module ProgrammingAmir Payberah
 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and SpannerAmir Payberah
 
Graph processing - Graphlab
Graph processing - GraphlabGraph processing - Graphlab
Graph processing - GraphlabAmir Payberah
 
Process Management - Part1
Process Management - Part1Process Management - Part1
Process Management - Part1Amir Payberah
 
Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Amir Payberah
 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Amir Payberah
 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2Amir Payberah
 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2Amir Payberah
 
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformThe Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformAmir Payberah
 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2Amir Payberah
 
Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1Amir Payberah
 
Data Intensive Computing Frameworks
Data Intensive Computing FrameworksData Intensive Computing Frameworks
Data Intensive Computing FrameworksAmir Payberah
 

Viewers also liked (20)

Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module Programming
 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and Spanner
 
Graph processing - Graphlab
Graph processing - GraphlabGraph processing - Graphlab
Graph processing - Graphlab
 
Process Management - Part1
Process Management - Part1Process Management - Part1
Process Management - Part1
 
Threads
ThreadsThreads
Threads
 
Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3
 
Mesos and YARN
Mesos and YARNMesos and YARN
Mesos and YARN
 
Main Memory - Part2
Main Memory - Part2Main Memory - Part2
Main Memory - Part2
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Protection
ProtectionProtection
Protection
 
Security
SecuritySecurity
Security
 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2
 
IO Systems
IO SystemsIO Systems
IO Systems
 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2
 
Storage
StorageStorage
Storage
 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2
 
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformThe Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform
 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2
 
Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1
 
Data Intensive Computing Frameworks
Data Intensive Computing FrameworksData Intensive Computing Frameworks
Data Intensive Computing Frameworks
 

Similar to Introduction to stream processing

Introduction to cloud computing and big data - part1
Introduction to cloud computing and big data - part1Introduction to cloud computing and big data - part1
Introduction to cloud computing and big data - part1Amir Payberah
 
Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Amir Payberah
 
SA UNIT I STREAMING ANALYTICS.pdf
SA UNIT I STREAMING ANALYTICS.pdfSA UNIT I STREAMING ANALYTICS.pdf
SA UNIT I STREAMING ANALYTICS.pdfManjuAppukuttan2
 
Parallel machines flinkforward2017
Parallel machines flinkforward2017Parallel machines flinkforward2017
Parallel machines flinkforward2017Nisha Talagala
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunk
 
Introduction to cloud computing and big data - part2
Introduction to cloud computing and big data - part2Introduction to cloud computing and big data - part2
Introduction to cloud computing and big data - part2Amir Payberah
 
Workflow Process Management and Enterprise Application Integration in Healthcare
Workflow Process Management and Enterprise Application Integration in HealthcareWorkflow Process Management and Enterprise Application Integration in Healthcare
Workflow Process Management and Enterprise Application Integration in HealthcareAmit Sheth
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunk
 
Smart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend AnalysisSmart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend AnalysisIRJET Journal
 
Event-driven BPM the JBoss way
Event-driven BPM the JBoss wayEvent-driven BPM the JBoss way
Event-driven BPM the JBoss wayKris Verlaenen
 
IRJET-Concurrency Control Model for Distributed Database
IRJET-Concurrency Control Model for Distributed DatabaseIRJET-Concurrency Control Model for Distributed Database
IRJET-Concurrency Control Model for Distributed DatabaseIRJET Journal
 
A New Data Stream Mining Algorithm for Interestingness-rich Association Rules
A New Data Stream Mining Algorithm for Interestingness-rich Association RulesA New Data Stream Mining Algorithm for Interestingness-rich Association Rules
A New Data Stream Mining Algorithm for Interestingness-rich Association RulesVenu Madhav
 
Alan Weber from Cimetrix talks about Multi Source Data Collection
Alan Weber from Cimetrix talks about Multi Source Data CollectionAlan Weber from Cimetrix talks about Multi Source Data Collection
Alan Weber from Cimetrix talks about Multi Source Data CollectionKimberly Daich
 
The Role of Models in Semiconductor Smart Manufacturing
The Role of Models in Semiconductor Smart ManufacturingThe Role of Models in Semiconductor Smart Manufacturing
The Role of Models in Semiconductor Smart ManufacturingKimberly Daich
 
ESM Best Practices: Trends
ESM Best Practices: TrendsESM Best Practices: Trends
ESM Best Practices: TrendsProtect724
 

Similar to Introduction to stream processing (20)

Introduction to cloud computing and big data - part1
Introduction to cloud computing and big data - part1Introduction to cloud computing and big data - part1
Introduction to cloud computing and big data - part1
 
Bigtable
BigtableBigtable
Bigtable
 
Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1
 
SA UNIT I STREAMING ANALYTICS.pdf
SA UNIT I STREAMING ANALYTICS.pdfSA UNIT I STREAMING ANALYTICS.pdf
SA UNIT I STREAMING ANALYTICS.pdf
 
Uint-4 Mining Data Stream.pdf
Uint-4 Mining Data Stream.pdfUint-4 Mining Data Stream.pdf
Uint-4 Mining Data Stream.pdf
 
Uint-4 Mining Data Stream.pdf
Uint-4 Mining Data Stream.pdfUint-4 Mining Data Stream.pdf
Uint-4 Mining Data Stream.pdf
 
Parallel machines flinkforward2017
Parallel machines flinkforward2017Parallel machines flinkforward2017
Parallel machines flinkforward2017
 
Chubby
ChubbyChubby
Chubby
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding Overview
 
Introduction to cloud computing and big data - part2
Introduction to cloud computing and big data - part2Introduction to cloud computing and big data - part2
Introduction to cloud computing and big data - part2
 
Workflow Process Management and Enterprise Application Integration in Healthcare
Workflow Process Management and Enterprise Application Integration in HealthcareWorkflow Process Management and Enterprise Application Integration in Healthcare
Workflow Process Management and Enterprise Application Integration in Healthcare
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
 
Smart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend AnalysisSmart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend Analysis
 
Event-driven BPM the JBoss way
Event-driven BPM the JBoss wayEvent-driven BPM the JBoss way
Event-driven BPM the JBoss way
 
IRJET-Concurrency Control Model for Distributed Database
IRJET-Concurrency Control Model for Distributed DatabaseIRJET-Concurrency Control Model for Distributed Database
IRJET-Concurrency Control Model for Distributed Database
 
A New Data Stream Mining Algorithm for Interestingness-rich Association Rules
A New Data Stream Mining Algorithm for Interestingness-rich Association RulesA New Data Stream Mining Algorithm for Interestingness-rich Association Rules
A New Data Stream Mining Algorithm for Interestingness-rich Association Rules
 
Alan Weber from Cimetrix talks about Multi Source Data Collection
Alan Weber from Cimetrix talks about Multi Source Data CollectionAlan Weber from Cimetrix talks about Multi Source Data Collection
Alan Weber from Cimetrix talks about Multi Source Data Collection
 
The Role of Models in Semiconductor Smart Manufacturing
The Role of Models in Semiconductor Smart ManufacturingThe Role of Models in Semiconductor Smart Manufacturing
The Role of Models in Semiconductor Smart Manufacturing
 
Deadlocks
DeadlocksDeadlocks
Deadlocks
 
ESM Best Practices: Trends
ESM Best Practices: TrendsESM Best Practices: Trends
ESM Best Practices: Trends
 

More from Amir Payberah

P2P Content Distribution Network
P2P Content Distribution NetworkP2P Content Distribution Network
P2P Content Distribution NetworkAmir Payberah
 
The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformAmir Payberah
 
File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2Amir Payberah
 
File System Implementation - Part1
File System Implementation - Part1File System Implementation - Part1
File System Implementation - Part1Amir Payberah
 
File System Interface
File System InterfaceFile System Interface
File System InterfaceAmir Payberah
 
Virtual Memory - Part2
Virtual Memory - Part2Virtual Memory - Part2
Virtual Memory - Part2Amir Payberah
 
Virtual Memory - Part1
Virtual Memory - Part1Virtual Memory - Part1
Virtual Memory - Part1Amir Payberah
 
CPU Scheduling - Part1
CPU Scheduling - Part1CPU Scheduling - Part1
CPU Scheduling - Part1Amir Payberah
 

More from Amir Payberah (9)

P2P Content Distribution Network
P2P Content Distribution NetworkP2P Content Distribution Network
P2P Content Distribution Network
 
The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics Platform
 
File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2
 
File System Implementation - Part1
File System Implementation - Part1File System Implementation - Part1
File System Implementation - Part1
 
File System Interface
File System InterfaceFile System Interface
File System Interface
 
Virtual Memory - Part2
Virtual Memory - Part2Virtual Memory - Part2
Virtual Memory - Part2
 
Virtual Memory - Part1
Virtual Memory - Part1Virtual Memory - Part1
Virtual Memory - Part1
 
Main Memory - Part1
Main Memory - Part1Main Memory - Part1
Main Memory - Part1
 
CPU Scheduling - Part1
CPU Scheduling - Part1CPU Scheduling - Part1
CPU Scheduling - Part1
 

Recently uploaded

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Recently uploaded (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Introduction to stream processing

  • 1. Information Flow Processing Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 1 / 83
  • 2. Motivation Many applications must process large streams of live data and pro- vide results in real-time. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 2 / 83
  • 3. Motivation Many applications must process large streams of live data and pro- vide results in real-time. • Wireless sensor networks • Traffic management applications • Stock marketing • Environmental monitoring applications • Fraud detection tools • ... Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 2 / 83
  • 4. Motivation Processing information as it flows, without storing them persistently. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 3 / 83
  • 5. Motivation Processing information as it flows, without storing them persistently. Traditional DBMSs: • Store and index data before processing it. • Process data only when explicitly asked by the users. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 3 / 83
  • 6. Motivation Processing information as it flows, without storing them persistently. Traditional DBMSs: • Store and index data before processing it. • Process data only when explicitly asked by the users. • Both aspects contrast with our requirements. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 3 / 83
  • 7. One Name, Different Technologies Several research communities are contribut- ing in this area: • Each brings its own expertise • Point of view • Vocabulary: event, data, stream, ... Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 4 / 83
  • 8. One Name, Different Technologies Several research communities are contribut- ing in this area: • Each brings its own expertise • Point of view • Vocabulary: event, data, stream, ... Tower of Babel Syndrome! Come on! Let’s go down and confuse them by making them speak different languages, then they won’t be able to understand each other. Genesis 11:7 Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 4 / 83
  • 9. Information Flow Processing (IFP) Source: produces the incoming information flows Sink: consumes the results of processing IFP engine: processes incoming flows Processing rules: how to process the incoming flows Rule manager: adds/removes processing rules Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 5 / 83
  • 10. IFP Competing Models Data Stream Management Systems (DSMS) Complex Event Processing (CEP) Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 6 / 83
  • 11. IFP Competing Models Data Stream Management Systems (DSMS) Complex Event Processing (CEP) Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 6 / 83
  • 12. Data Stream Management Systems (DSMS) An evolution of traditional data processing, as supported by DBMSs. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 7 / 83
  • 13. DBMS vs. DSMS (1/3) DBMS: persistent data where updates are relatively infrequent. DSMS: transient data that is continuously updated. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 8 / 83
  • 14. DBMS vs. DSMS (2/3) DBMS: runs queries just once to return a complete answer. DSMS: executes standing queries, which run continuously and pro- vide updated answers as new data arrives. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 9 / 83
  • 15. DBMS vs. DSMS (3/3) Despite these differences, DSMSs resemble DBMSs: both process incoming data through a sequence of transformations based on SQL operators, e.g., selections, aggregates, joins. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 10 / 83
  • 16. Out of Scope of DSMS DSMSs focus on producing query answers. Detection and notification of complex patterns of elements are usu- ally out of the scope of DSMSs: Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 11 / 83
  • 17. Out of Scope of DSMS DSMSs focus on producing query answers. Detection and notification of complex patterns of elements are usu- ally out of the scope of DSMSs: Complex Event Processing Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 11 / 83
  • 18. IFP Competing Models Data Stream Management Systems (DSMS) Complex Event Processing (CEP) Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 12 / 83
  • 19. Complex Event Processing (CEP) DSMSs limitation: detecting complex patterns of incoming items, involving sequencing and ordering relationships. CEP models flowing information items as notifications of events happening in the external world. • They have to be filtered and combined to understand what is happening in terms of higher-level events. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 13 / 83
  • 20. CEP vs. Publish/Subscribe Systems CEP systems can be seen as an extension to traditional pub- lish/subscribe systems. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 14 / 83
  • 21. CEP vs. Publish/Subscribe Systems CEP systems can be seen as an extension to traditional pub- lish/subscribe systems. Traditional publish/subscribe systems consider each event separately from the others, and filter them based on their topic or content. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 14 / 83
  • 22. CEP vs. Publish/Subscribe Systems CEP systems can be seen as an extension to traditional pub- lish/subscribe systems. Traditional publish/subscribe systems consider each event separately from the others, and filter them based on their topic or content. CEPs extend this functionality by increasing the expressive power of the subscription language to consider complex event patterns that involve the occurrence of multiple related events. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 14 / 83
  • 23. IFP Modeling Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 15 / 83
  • 24. One Model, Several Models Different models to capture different viewpoints. • Functional model • Processing model • Time model • Data model • Rule model • Language model • Interaction model • Deployment model Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 16 / 83
  • 25. Functional Model Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 17 / 83
  • 26. Functional Model An abstract architecture of the main functional components of an IFP engine. [G. Cugolap et al., Processing Flows of Information: From Data Stream to Complex Event Processing, 2012] Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 18 / 83
  • 27. Receiver and Clock Receiver manages the channels connecting the sources with the IFP engine. Clock models periodic processing of their inputs. [G. Cugolap et al., Processing Flows of Information: From Data Stream to Complex Event Processing, 2012] Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 19 / 83
  • 28. Rules, Decider and Producer We assume rules can be (logically) decomposed in two parts: • C → A • C is the condition • A is the action Example (in CQL): Select IStream(Count(*)) (action) From F1 [Range 1 Minute] Where F1.A > 0 (condition) Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 20 / 83
  • 29. Rules, Decider and Producer We assume rules can be (logically) decomposed in two parts: • C → A • C is the condition • A is the action Example (in CQL): Select IStream(Count(*)) (action) From F1 [Range 1 Minute] Where F1.A > 0 (condition) This way we can split processing in two phases: • Decider: determines the items that trigger the rule. • Producer: use those items to produce the output of the rule. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 20 / 83
  • 30. Detection-Production Cycle (1/2) The Detector evaluates all the rules in the Rules store to find those whose condition part is true. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 21 / 83
  • 31. Detection-Production Cycle (1/2) The Detector evaluates all the rules in the Rules store to find those whose condition part is true. With the newly arrived information, the Detector may also use the information present in the Knowledge Base. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 21 / 83
  • 32. Detection-Production Cycle (1/2) The Detector evaluates all the rules in the Rules store to find those whose condition part is true. With the newly arrived information, the Detector may also use the information present in the Knowledge Base. At the end of this phase we have a set of rules that have to be executed. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 21 / 83
  • 33. Detection-Production Cycle (2/2) The Producer takes the information and executes each triggered rule (i.e., its action part). Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 22 / 83
  • 34. Detection-Production Cycle (2/2) The Producer takes the information and executes each triggered rule (i.e., its action part). In executing rules, the Producer may combine the items that trig- gered the rule. • Received from the Decider together with the information present in the Knowledge Base. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 22 / 83
  • 35. Detection-Production Cycle (2/2) The Producer takes the information and executes each triggered rule (i.e., its action part). In executing rules, the Producer may combine the items that trig- gered the rule. • Received from the Decider together with the information present in the Knowledge Base. Usually, these new items are sent to sinks, through the Forwarder, or sent internally to be processed again. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 22 / 83
  • 36. Processing Model Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 23 / 83
  • 37. Processing Model Three policies affect the behavior of the system: • The selection policy • The consumption policy • The load shedding policy Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 24 / 83
  • 38. Selection Policy Determines if a rule fires once or multiple times and the items ac- tually selected from the History. [G. Cugolap et al., Processing Flows of Information: From Data Stream to Complex Event Processing, 2012] Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 25 / 83
  • 39. Consumption Policy Determines how the history changes after firing of a rule: what happens when new items enter the Decider [G. Cugolap et al., Processing Flows of Information: From Data Stream to Complex Event Processing, 2012] Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 26 / 83
  • 40. Load Shedding Policy A technique adopted by some IFP systems to deal with burst inputs. It can be described as an automatic drop of information items when the input rate becomes too high for the processing capabilities of the engine. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 27 / 83
  • 41. Time Model Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 28 / 83
  • 42. Time Model The relationship between the information items flowing into the IFP engine and the passing of time. Ability of an IFP system to associate some kind of happened-before (ordering) relationship to information items. Four classes: 1 Stream-only 2 Causal 3 Absolute 4 Interval Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 29 / 83
  • 43. Stream-Only Time Model Not any special meaning to time. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 30 / 83
  • 44. Stream-Only Time Model Not any special meaning to time. Timestamps are used mainly to order items at the frontier of the engine (i.e., within the receiver). Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 30 / 83
  • 45. Stream-Only Time Model Not any special meaning to time. Timestamps are used mainly to order items at the frontier of the engine (i.e., within the receiver). They are lost during processing. • The ordering and timestamps of the output stream are conceptually separate from the ordering and timestamps of the input streams. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 30 / 83
  • 46. Stream-Only Time Model Not any special meaning to time. Timestamps are used mainly to order items at the frontier of the engine (i.e., within the receiver). They are lost during processing. • The ordering and timestamps of the output stream are conceptually separate from the ordering and timestamps of the input streams. Example: CQL/Stream Select DStream(*) From F1[Rows 5], F2[Range 1 Minute] Where F1.A = F2.A Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 30 / 83
  • 47. Causal Time Model Each item has a label reflecting some kind of causal relationship. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 31 / 83
  • 48. Causal Time Model Each item has a label reflecting some kind of causal relationship. Partial order Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 31 / 83
  • 49. Causal Time Model Each item has a label reflecting some kind of causal relationship. Partial order Example: Gigascope Select count(*) From A, B Where A.a-1 <= B.b and A.a+1 > B.b A.a and B.b monotonically increase Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 31 / 83
  • 50. Absolute Time Model Information items have an associated timestamp. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 32 / 83
  • 51. Absolute Time Model Information items have an associated timestamp. Defining a single point in time, wrt a (logically) unique clock. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 32 / 83
  • 52. Absolute Time Model Information items have an associated timestamp. Defining a single point in time, wrt a (logically) unique clock. Information items can be timestamped at source or entering the engine Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 32 / 83
  • 53. Absolute Time Model Information items have an associated timestamp. Defining a single point in time, wrt a (logically) unique clock. Information items can be timestamped at source or entering the engine Total order Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 32 / 83
  • 54. Absolute Time Model Information items have an associated timestamp. Defining a single point in time, wrt a (logically) unique clock. Information items can be timestamped at source or entering the engine Total order Example: TESLA/T-Rex Define Fire(area: string, measuredTemp: double) From Smoke(area=$a) and last Temp(area=$a and value>45) within 5 min. from Smoke Where area=Smoke.area and measuredTemp=Temp.value Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 32 / 83
  • 55. Interval Time Model Associate items with an interval, i.e., two timestamps taken from a global time. Usually representing: the time when the related event started, the time when it ended. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 33 / 83
  • 56. Data Model Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 34 / 83
  • 57. Data Model Studies how the different systems • Represent single data items • Organize them into data flows Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 35 / 83
  • 58. Nature of Items The meaning we associate to information items • Generic data • Event notifications Influences other aspects of an IFP system • Time model • Rule language • Semantics of processing Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 36 / 83
  • 59. Format of Items How information is represented Influences the way items are processed, e.g., relational model requires tuples Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 37 / 83
  • 60. Support for Uncertainty Ability to associate a degree of uncertainty to information items. When present, probabilistic information is usually exploited in rules during processing. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 38 / 83
  • 61. Data Flows Homogeneous • Each flow contains data with the same format and kind. • E.g., a sequence of unbounded tuples generated continuously in time: · · · (a1, a2, · · · , an, t − 1)(a1, a2, · · · , an, t)(a1, a2, · · · , an, t + 1) · · ·, where ai denotes an attribute. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 39 / 83
  • 62. Data Flows Homogeneous • Each flow contains data with the same format and kind. • E.g., a sequence of unbounded tuples generated continuously in time: · · · (a1, a2, · · · , an, t − 1)(a1, a2, · · · , an, t)(a1, a2, · · · , an, t + 1) · · ·, where ai denotes an attribute. Heterogeneous • Information flows are seen as channels connecting sources, processors, and sinks • Each channel may transport items with different kind and format. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 39 / 83
  • 63. Rule Model Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 40 / 83
  • 64. Rule Model Rules are classified into two macro classes: • Transforming rules • Detecting rules Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 41 / 83
  • 65. Transforming Rules (1/2) No explicit distinction between detection and production. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 42 / 83
  • 66. Transforming Rules (1/2) No explicit distinction between detection and production. Execution plan of primitive operators (processing elements (PE)). • A logical network of PEs connected in a DAG. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 42 / 83
  • 67. Transforming Rules (1/2) No explicit distinction between detection and production. Execution plan of primitive operators (processing elements (PE)). • A logical network of PEs connected in a DAG. Each PE transforms one or more input flows into one or more output flows. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 42 / 83
  • 68. Transforming Rules (2/2) PEs execute independently and in parallel Not synchronized Communicate through messaging Upstream node vs. downstream node Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 43 / 83
  • 69. Detecting Rules An explicit distinction between detection and production. Usually, the detection is based on a logical predicate that captures patterns of interest in the history of received items. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 44 / 83
  • 70. Language Model Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 45 / 83
  • 71. Language Model Following the rule model, we define two classes of languages: • Transforming languages: declarative languages and imperative languages • Detecting languages: patternbased Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 46 / 83
  • 72. Language Model Following the rule model, we define two classes of languages: • Transforming languages: declarative languages and imperative languages • Detecting languages: patternbased Specify operations to filter, join, aggregate, ... Input flows To produce one or more output flows Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 46 / 83
  • 73. Declarative Languages Specify the expected result rather than the desired execution flow. Usually derive from relational languages, e.g., SQL Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 47 / 83
  • 74. Declarative Languages Specify the expected result rather than the desired execution flow. Usually derive from relational languages, e.g., SQL Example CQL/Stream: Select IStream(*) From F1[Rows 5], F2[Rows 10] Where F1.A = F2.A Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 47 / 83
  • 75. Imperative Languages Specify the desired execution flow Starting from primitive operators, i.e., PEs Usually adopt a graphical notation Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 48 / 83
  • 76. Pattern-Based Languages Specify a firing condition as a pattern Select a portion of incoming flows The action uses selected items to produce new knowledge Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 49 / 83
  • 77. Pattern-Based Languages Specify a firing condition as a pattern Select a portion of incoming flows The action uses selected items to produce new knowledge Example, TESLA / T-Rex Define Fire(area: string, measuredTemp: double) From Smoke(area=$a) and last Temp(area=$a and value>45) within 5 min. from Smoke Where area=Smoke.area and measuredTemp=Temp.value Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 49 / 83
  • 78. Interaction Model Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 50 / 83
  • 79. Interaction Model Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 51 / 83
  • 80. Interaction Model Processing Element (PE): a processing unit in a IFP system. How do PEs interact? Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 51 / 83
  • 81. Interaction Model Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 52 / 83
  • 82. Deployment Model Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 53 / 83
  • 83. Deployment Model IFP applications may include a large number of sources, sinks, and PEs. Possibly dispersed over a wide geographical area. How the components of the functional model can be distributed to achieve scalability. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 54 / 83
  • 84. Deployment Model Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 55 / 83
  • 85. Deployment Model Given a network of PEs. How to map it onto the physical network of nodes Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 56 / 83
  • 86. Scaling Mechanism How to scale with increasing the number queries and the rate of incoming events? Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 57 / 83
  • 87. Scaling Mechanism How to scale with increasing the number queries and the rate of incoming events? Two main solutions: • Data partitioning: a reasonable data partitioning and merging scheme as well as mechanisms to detect points for parallelization. • Query Partitioning: to distribute the load across available hosts and to achieve a load balance between these machines. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 57 / 83
  • 88. Data Partitioning (1/3) Early approaches parallelize operators by introducing a splitter and merge operator. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 58 / 83
  • 89. Data Partitioning (1/3) Early approaches parallelize operators by introducing a splitter and merge operator. The merge operator: union or sort Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 58 / 83
  • 90. Data Partitioning (1/3) Early approaches parallelize operators by introducing a splitter and merge operator. The merge operator: union or sort E.g., parallelize a filter operation using a round robin scheme. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 58 / 83
  • 91. Data Partitioning (1/3) Early approaches parallelize operators by introducing a splitter and merge operator. The merge operator: union or sort E.g., parallelize a filter operation using a round robin scheme. E.g., parallelize an aggregation using a hash scheme. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 58 / 83
  • 92. Data Partitioning (2/3) In more recent systems: E.g., in Storm a user can express data parallelism by defining the number of parallel tasks per operator. E.g., S4 creates a PE for each new key in the data stream. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 59 / 83
  • 93. Data Partitioning (2/3) In more recent systems: E.g., in Storm a user can express data parallelism by defining the number of parallel tasks per operator. E.g., S4 creates a PE for each new key in the data stream. In both approaches the user needs to understand the data parallelism and explicitly enforce in its code the sequential ordering. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 59 / 83
  • 94. Data Partitioning (3/3) An auto-parallelization approach has recently be proposed. A combination of compiler and runtime: • The compiler detects regions for parallelization. • The system runtime guarantees that output tuples follow the same order as for a sequential execution. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 60 / 83
  • 95. Query Partitioning Operator placement problem: the problem of assigning a set of operators to a set of available hosts. A single PE can be running in parallel on different nodes. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 61 / 83
  • 96. Query Partitioning Operator placement problem: the problem of assigning a set of operators to a set of available hosts. A single PE can be running in parallel on different nodes. E.g., SEEP can dynamically vary the number of processing nodes within the system based on the workload. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 61 / 83
  • 97. Fault Tolerance Mechanism Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 62 / 83
  • 98. Recovery Methods The recovery methods of streaming frameworks must take: • Correctness, e.g., data loss and duplicates • Performance, e.g., low latency Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 63 / 83
  • 99. Basic Idea Each processing node has an associated backup node. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 64 / 83
  • 100. Basic Idea Each processing node has an associated backup node. The backup node’s stream processing engine is identical to the pri- mary one. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 64 / 83
  • 101. Basic Idea Each processing node has an associated backup node. The backup node’s stream processing engine is identical to the pri- mary one. But the state of the backup node is not necessarily the same as that of the primary. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 64 / 83
  • 102. Basic Idea Each processing node has an associated backup node. The backup node’s stream processing engine is identical to the pri- mary one. But the state of the backup node is not necessarily the same as that of the primary. If a primary node fails, its backup node takes over the operation of the failed node. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 64 / 83
  • 103. Recovery Methods GAP recovery Rollback recovery Precise recovery Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 65 / 83
  • 104. GAP Recovery The weakest recovery guarantee A new task takes over the operations of the failed task. The new task starts from an empty state. Tuples can be lost during the recovery phase. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 66 / 83
  • 105. Rollback Recovery The information loss is avoided, but the output may contain dupli- cate tuples. Three types of rollback recovery: • Active backup • Passive backup • Upstream backup Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 67 / 83
  • 106. Rollback Recovery - Active Backup Both primary and backup nodes are given the same input. The output tuples of the backup node are logged at the output queues and they are not sent downstream. If the primary fails, the backup takes over by sending the logged tu- ples to all downstream neighbors and then continuing its processing. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 68 / 83
  • 107. Rollback Recovery - Passive Backup Periodically check-points processing state to a shared storage. The backup node takes over from the latest checkpoint when the primary fails. The backup node is always equal or behind the primary. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 69 / 83
  • 108. Rollback Recovery - Upstream Backup Upstream nodes store the tuples until the downstream nodes ac- knowledge them. If a node fails, an empty node rebuilds the latest state of the failed primary from the logs kept at the upstream server. There is no backup node in this model. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 70 / 83
  • 109. Precise Recovery Post-failure output is exactly the same as the output without failure. Can be achieved by modifying the algorithms for rollback recovery. • For example, in passive backup, after a failure occurs the backup node can ask the downstream nodes for the latest tuples they received and trim the output queues accordingly to prevent the duplicates. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 71 / 83
  • 110. Brief History of IFP Systems Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 72 / 83
  • 111. First Generation A stand-alone prototypes or as extensions of existing database en- gines. They were developed with a specific use case in mind and are very limited regarding the supported operator types as well as available functionalities. Niagara, Telegraph, Aurora, ... Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 73 / 83
  • 112. First Generation Example - Aurora A single site stream-processing engine (centralized). DAG based processing model for streams. Push-based strategy. The first Aurora did not support fault tolerance. Stream Query Algebra (SQuAl), i.e., SQL with additional features, e.g., windowed queries. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 74 / 83
  • 113. Second Generation Systems extended the ideas of data stream processing with advanced features such as fault tolerance, adaptive query processing, as well as an enhanced operator expressiveness. Borealis, CEDR, System S, CAPE, ... Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 75 / 83
  • 114. Second Generation Example - Borealis Distributed version of Aurora. Advanced functionalities on top of Aurora: • Dynamic revision of query results: correct errors in previously reported data. • Dynamic query modifications: change certain attributes of the query at runtime. Pull-based strategy. Rollback recovery with active backup. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 76 / 83
  • 115. Third Generation Driven by the trend towards cloud computing: highly scalable and robust towards faults. Storm, Apache S4, D-Streams, SEEP, StreamCloud, ... Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 77 / 83
  • 116. Third Generation Example - Storm (1/2) Stream processing is guaranteed: a message cannot be lost due to node failures. DAG based processing: • the DAG is called Topology • the PEs are called Bolts • the stream sources are called Spouts It does not have an explicit programming paradigm. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 78 / 83
  • 117. Third Generation Example - Storm (2/2) Pull-based strategy. Rollback recovery with upstream backup. Three sets of nodes: • Nimbus: distributes the code among the worker nodes, and keeps track of the progress of the worker nodes • Supervisor: the set of worker nodes • Zookeeper: coordination between supervisor nodes and the Nimbus Built by twitter Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 79 / 83
  • 118. Third Generation Example - S4 (1/2) S4: Simple Scalable Streaming System. Constructing a DAG structure of PEs at runtime. • A PE is instantiated for each value of the key attribute. The processing model is inspired by MapReduce. Events are dispatched to nodes according to their key. Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 80 / 83
  • 119. Third Generation Example - S4 (2/2) Push-based strategy GAP recovery Communication layer: coordination between the processing nodes and the messaging between nodes. • Uses Zookeeper Built by yahoo Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 81 / 83
  • 120. Summary IFP: DSMS and CEP IFP modeling: functional, processing, time, data, rule, language, interaction, deployment Recovering models: GAP, Rollback, and Precise Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 82 / 83
  • 121. Questions? Acknowledgements Some slides and pictures were derived from G. Cugola slides (Politecnico di Milano). Amir H. Payberah (Tehran Polytechnic) Information Flow Processing 1393/8/24 83 / 83