2. Outline
• Data Stream Management System
• Streaming Operators & their implementation
• Query Processing
• Load Shedding & Query Approximation
3. Data Stream Management
• Stream data - Produced incrementally over time, rather than being available in full before its
processing begins.
• Examples:
Definition:
Data for almost any large-scale data-management task is continuously collected over a wide area, and at a
much greater rate than ever before.
4. Typically, data streams exhibit the following characteristics:
• infinite length
• continuous data arrival
• high data rates
• requirements for low-latency, real-time query processing
• data that are usually time-stamped and generally arrive in either temporal order or
close to it
Characteristics of Data Stream Management
Data Stream Management Languages
SQL-based language called StreaQuel
XML-QL based language NiagaraCQ
SQL-based CQL (Continuous Query Language)
5. Query Operators & Implementation
Query Operators
• The blocking/nonblocking properties of operators independent of the language in which
they are expressed, and
• The abstract properties of stream functions expressible by blocking/nonblocking
operators.
Blocking Query Operator: A blocking query operator is a query operator that is unable to
produce the first tuple of the output until it has seen the entire input.
Nonblocking Query Operator: A nonblocking query operator is one that produces all the
tuples of the output before it has detected the end of the input.
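The two definitions above can be illustrated with a minimal Python sketch. The operator names `select_even` and `sort_all`, and the `trace` helper that records the interleaving of inputs and outputs, are hypothetical and chosen only for this illustration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def select_even(stream: Iterable[int]) -> Iterator[int]:
    """Nonblocking: each qualifying tuple is emitted as soon as it arrives."""
    for x in stream:
        if x % 2 == 0:
            yield x


def sort_all(stream: Iterable[int]) -> Iterator[int]:
    """Blocking: no output is possible until the entire input has been seen."""
    buffered = sorted(stream)  # consumes the whole input first
    yield from buffered


def trace(op: Callable[[Iterable[int]], Iterator[int]],
          data: List[int]) -> List[Tuple[str, int]]:
    """Record the order in which inputs are read and outputs are produced."""
    log: List[Tuple[str, int]] = []

    def logged() -> Iterator[int]:
        for x in data:
            log.append(("in", x))
            yield x

    for y in op(logged()):
        log.append(("out", y))
    return log
```

For the selection, `trace` shows inputs and outputs interleaved. For the sort, every input is read before the first output appears, which is why sorting cannot be applied directly to an unbounded stream.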
6. • Consider operators that take sequences (streams) as input and return sequences
(streams) as output. For instance consider an operator G that takes a sequence S as
input and produces a sequence G(S) as output:
Streaming Operator Functionality
S → G → G(S)
• G operates as an incremental transducer, which for each new input tuple in S, adds
zero, one, or several tuples to the output.
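As a rough sketch of such an incremental transducer, the hypothetical class below plays the role of G: each call to `push` consumes one input tuple (here, a line of text) and appends zero, one, or several tuples (its words) to the output. The class name and the word-splitting behavior are assumptions made for illustration only.

```python
from typing import List


class SplitWords:
    """Hypothetical transducer G: each input line appends its words to G(S)."""

    def __init__(self) -> None:
        self.output: List[str] = []  # G(S) for the input prefix seen so far

    def push(self, line: str) -> List[str]:
        new = line.split()        # zero, one, or several output tuples
        self.output.extend(new)   # earlier output is never revised
        return new
```

Because the output only grows, the result for any prefix of S is a prefix of the result for S itself.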
7. • Join operators are problematic on streams
• May need to join stream tuples that are arbitrarily far apart
• Operations on implicit / explicit windows
• Selections and (duplicate-preserving) projections are straightforward
• Local, per-element operators
• Duplicate eliminating projection is like grouping
• Projection needs to include ordering attribute
• No restriction for position ordered streams
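One common way to bound a stream join is to restrict it to a window. The following is a minimal sketch of a symmetric time-based sliding-window equi-join, assuming events arrive in timestamp order. The event layout `(stream, timestamp, key)`, the function name `window_join`, and the `width` parameter are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int, str]  # (stream name "R" or "S", timestamp, join key)


def window_join(events: Iterable[Event],
                width: int) -> Iterator[Tuple[int, int, str]]:
    """Yield (r_timestamp, s_timestamp, key) for matches within `width`."""
    windows: Dict[str, Deque[Tuple[int, str]]] = {"R": deque(), "S": deque()}
    for side, ts, key in events:
        # Expire tuples that have slid out of the window on both sides.
        for w in windows.values():
            while w and w[0][0] < ts - width:
                w.popleft()
        # Probe the opposite window, then store the new tuple in its own.
        other = "S" if side == "R" else "R"
        for other_ts, other_key in windows[other]:
            if other_key == key:
                yield (ts, other_ts, key) if side == "R" else (other_ts, ts, key)
        windows[side].append((ts, key))
```

Without the expiry step, the operator would have to keep every tuple ever seen, since a match could arrive arbitrarily late.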
8. Query Processing
• Declarative queries -> Logical query plan -> Physical plan
• Directed Acyclic Graphs (nodes -> operators, edges -> data flow)
• Queries sharing memory/streams are combined into a single plan
• Scheduling
o FIFO, Round Robin – simple, not efficient
o Prioritizing operators with higher throughput – low latency
o Prioritizing operators with minimal processing time & selectivity – smaller queues
• Heartbeats & Punctuations
o Typically issued by sources
o Reduce the amount of state needed by operators
o Prevent operators from doing unnecessary work
o Query plans can also issue heartbeats to avoid
pipeline stalls and delayed results
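The effect of punctuations on operator state can be sketched as follows. This is a minimal illustration, assuming a punctuation `("punct", t)` promises that no later tuple carries a timestamp less than or equal to `t`. The item layout, the function name `windowed_sum`, and the tumbling-window size are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def windowed_sum(stream: Iterable[tuple], size: int) -> Iterator[Tuple[int, int]]:
    """Sum values per tumbling window; emit a window once a punctuation closes it.

    Items are ("data", timestamp, value) or ("punct", t).
    """
    open_windows: Dict[int, int] = {}  # window start -> running sum
    for item in stream:
        if item[0] == "data":
            _, ts, value = item
            start = ts - ts % size
            open_windows[start] = open_windows.get(start, 0) + value
        else:
            _, t = item
            # Every window ending at or before t is complete: emit and free it.
            for start in sorted(open_windows):
                if start + size - 1 <= t:
                    yield (start, open_windows.pop(start))
```

Without punctuations the aggregate would block, because out-of-order arrivals could always add to an old window. With them, results are released early and the state for closed windows is discarded.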
9. o Random sampling
o Semantic load shedding to drop less important tuples
o Objective is to minimize the drop in accuracy
o Challenging for complex query plans with multiple streams
and operators
Load Shedding & Approximation
• Applications require real-time, or near real-time, response and are characterized by
high-speed arrival rates.
Two types of approximation that have been suggested are:
1. Max-Subset results
The objective is to maximize the size of the resulting join.
2. Sampled results
The objective is to provide a fair random sample of the join result
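Random-sampling load shedding, the simplest strategy named on this slide, can be sketched as dropping each tuple independently with a fixed probability. The function names `random_shed` and `estimate_count` and the `keep_prob` parameter are hypothetical, and the scaling step assumes a simple COUNT aggregate over the shed stream.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def random_shed(stream: Iterable[T], keep_prob: float,
                rng: random.Random) -> Iterator[T]:
    """Keep each tuple independently with probability keep_prob."""
    for t in stream:
        if rng.random() < keep_prob:
            yield t


def estimate_count(kept: int, keep_prob: float) -> float:
    """Scale the count of surviving tuples back up to estimate the true count."""
    return kept / keep_prob
```

Note that sampling the inputs of a join uniformly does not in general give a uniform sample of the join result, which is one reason the two objectives above are treated separately.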
10. • The productivity of a tuple determines its contribution to the multi-way join.
• For simplicity, we denote by T_{Wi={t}} the set of join tuples of the n windows when
the i-th window contains only t.
Two Priority Measures:
• Maximum Subset: To provide a maximum subset of the true result, we should shed the tuple
with the least productivity in order to minimize the loss caused by load shedding.
• Random Sampling: To provide a random sample of the true result, one may control the fraction
of the result tuples produced by each input tuple.
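The maximum-subset measure can be sketched for the simple case of an equi-join on a single key shared by all windows. Under that assumption, the number of join tuples a tuple t in window i contributes is the product of the number of matches for its key in every other window. The function names and the list-of-keys window representation are hypothetical, and productivity is computed exactly here rather than estimated.

```python
from typing import List


def productivity(key: str, i: int, windows: List[List[str]]) -> int:
    """Join tuples produced when window i contains only a tuple with this key."""
    p = 1
    for j, w in enumerate(windows):
        if j != i:
            p *= w.count(key)  # matches for this key in window j
    return p


def shed_least_productive(i: int, windows: List[List[str]]) -> str:
    """Remove and return the tuple of window i that contributes least."""
    victim = min(windows[i], key=lambda k: productivity(k, i, windows))
    windows[i].remove(victim)
    return victim
```

Computing productivity exactly requires scanning the other windows for every candidate, which motivates estimating it instead.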
Estimating Productivity:
• Sketching techniques can be used to approximate the answers to complex queries. The class of
queries considered is of the form:
SELECT AGG FROM R1, ..., Rr WHERE θ
where AGG is an arbitrary aggregate operator such as COUNT, SUM and θ represents the conjunction
of equi-join conditions.
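For the two-relation COUNT case, the idea behind sketch-based estimation can be illustrated as follows. Each stream is summarized by a single counter X = Σ f(v)·ξ(v), where f(v) is the frequency of value v and ξ(v) is a random ±1 sign. The product of the two counters is then an unbiased estimate of the join size Σ f1(v)·f2(v). This is a simplified sketch in the style of AMS sketches, not the exact method the slide refers to: practical sketches derive ξ from four-wise independent hash functions, whereas this illustration stores the random signs in a dictionary, which gives up the space savings. All function names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean
from typing import Dict, Hashable, Sequence


def exact_join_count(r1: Sequence[Hashable], r2: Sequence[Hashable]) -> int:
    """SELECT COUNT(*) FROM R1, R2 WHERE R1.a = R2.a, computed exactly."""
    f1, f2 = Counter(r1), Counter(r2)
    return sum(f1[v] * f2[v] for v in f1)


def sketch_join_count(r1: Sequence[Hashable], r2: Sequence[Hashable],
                      copies: int, seed: int) -> float:
    """Average `copies` independent sign-sketch estimates of the join size."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(copies):
        signs: Dict[Hashable, int] = {}  # stand-in for a hash-based sign family

        def xi(v: Hashable) -> int:
            if v not in signs:
                signs[v] = rng.choice((-1, 1))
            return signs[v]

        x1 = sum(xi(v) for v in r1)  # one counter per stream
        x2 = sum(xi(v) for v in r2)
        estimates.append(x1 * x2)
    return mean(estimates)
```

Averaging more copies reduces the variance of the estimate. With only one distinct join value the signs cancel and the estimate is exact.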