Data Stream Management

Distributed Database
SUBMITTED TO: Sir Hammad
SUBMITTED BY: Syed Umair Raza 8th Semester
THE UNIVERSITY OF LAHORE

Outline
• Data Stream Management System
• Streaming Operators & their implementation
• Query Processing
• Load Shedding & Query Approximation

Data Stream Management
• Stream data: produced incrementally over time, rather than being available in full before
processing begins.
Definition:
Data for almost any large-scale data-management task is continuously collected over a wide area, and at a
much greater rate than ever before.

Characteristics of Data Stream Management
Typically, data streams exhibit the following characteristics:
• infinite length
• continuous data arrival
• high data rates
• requirements for low-latency, real-time query processing
• data that are usually time-stamped and generally arrive in either temporal order or close to it

Data Stream Management Languages
• SQL-based language called StreaQuel
• XML-QL-based language of NiagaraCQ
• SQL-based CQL (Continuous Query Language)

Query Operators & Implementation
Query Operators
• The blocking/nonblocking properties of operators independent of the language in which
they are expressed, and
• The abstract properties of stream functions expressible by blocking/nonblocking
operators.
Blocking Query Operator: A blocking query operator is a query operator that is unable to
produce the first tuple of the output until it has seen the entire input.
Nonblocking Query Operator: A nonblocking query operator is one that produces all the
tuples of the output before it has detected the end of the input.
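
The difference between the two definitions can be illustrated with a minimal Python sketch. The operator names (select_gt, sort_all) and the logged helper are hypothetical and chosen only for illustration: a selection emits its first result after reading a single input tuple, while a sort must read the whole input first.

```python
from typing import Iterable, Iterator, List


def select_gt(stream: Iterable[int], threshold: int) -> Iterator[int]:
    """Nonblocking: each qualifying tuple is emitted as soon as it arrives."""
    for x in stream:
        if x > threshold:
            yield x


def sort_all(stream: Iterable[int]) -> Iterator[int]:
    """Blocking: nothing can be emitted until the whole input has been read."""
    buffered = sorted(stream)  # consumes the entire input first
    yield from buffered


def logged(source: Iterable[int], log: List[int]) -> Iterator[int]:
    """Hypothetical helper: records each tuple as it is pulled from the source."""
    for x in source:
        log.append(x)
        yield x


# Ask each operator for its first output tuple and see how much input it read.
log_nb: List[int] = []
first_nb = next(select_gt(logged([5, 1, 9, 3], log_nb), 2))

log_b: List[int] = []
first_b = next(sort_all(logged([5, 1, 9, 3], log_b)))
```

After this runs, log_nb holds only the first input tuple, whereas log_b holds all four, which is why a sort cannot be applied directly to an unbounded stream.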

Streaming Operator Functionality
• Consider operators that take sequences (streams) as input and return sequences
(streams) as output. For instance, consider an operator G that takes a sequence S as
input and produces a sequence G(S) as output:
S −→ G −→ G(S)
• G operates as an incremental transducer, which for each new input tuple in S, adds
zero, one, or several tuples to the output.
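
As a rough illustration of an incremental transducer, the sketch below assumes a stream of text lines and an operator G that splits each line into words. The choice of word splitting is an assumption made only for this example; any per-tuple function that yields a variable number of outputs would do.

```python
from typing import Iterable, Iterator


def G(S: Iterable[str]) -> Iterator[str]:
    """For each input line, append zero, one, or several words to the output."""
    for line in S:
        for word in line.split():
            yield word


# An empty line adds nothing, a one-word line adds one tuple,
# and a longer line adds several tuples to the output.
out = list(G(["", "stream", "data stream management"]))
```

Because G is a generator, each output tuple becomes available as soon as the input tuple that produced it has been read.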

• Join operators are problematic on streams
• May need to join stream tuples that are arbitrarily far apart
• Operations on implicit / explicit windows
• Selections and (duplicate-preserving) projections are straightforward
• Local, per-element operators
• Duplicate-eliminating projection is like grouping
• Projection needs to include the ordering attribute
• No restriction for position-ordered streams
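
One common way to bound the join problem is to restrict it to a time window. The following is a minimal sketch of a symmetric window join, assuming two streams named "L" and "R", integer timestamps, and in-order arrival; the Event layout and the function name are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int, str]  # (side "L" or "R", timestamp, join key)
Row = Tuple[int, str]         # (timestamp, join key)


def window_join(events: Iterable[Event], window: int) -> Iterator[Tuple[Row, Row]]:
    """Join L and R tuples with equal keys whose timestamps differ by at most `window`.

    Assumes events arrive in timestamp order.
    """
    buffers: Dict[str, Deque[Row]] = {"L": deque(), "R": deque()}
    for side, ts, key in events:
        # Expire tuples that are too old to join with this or any later tuple.
        for buf in buffers.values():
            while buf and buf[0][0] < ts - window:
                buf.popleft()
        # Probe the opposite window, then store the new tuple in its own window.
        other = "R" if side == "L" else "L"
        for o_ts, o_key in buffers[other]:
            if o_key == key:
                mine, theirs = (ts, key), (o_ts, o_key)
                yield (mine, theirs) if side == "L" else (theirs, mine)
        buffers[side].append((ts, key))


events = [("L", 1, "a"), ("R", 2, "a"), ("R", 10, "a"), ("L", 11, "a"), ("L", 12, "b")]
pairs = list(window_join(events, window=3))
```

Each result pair lists the L tuple first. The state kept per stream is limited to the tuples inside the window, so the operator stays nonblocking and its memory use does not grow with the length of the stream.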

Query Processing
• Declarative queries -> Logical query plan -> Physical plan
• Directed Acyclic Graphs (nodes -> operators, edges -> data flow)
• Queries sharing memory/streams are combined into a single plan
• Scheduling
o FIFO, Round Robin – simple, not efficient
o Prioritize operators with higher throughput – low latency
o Prioritize operators with min processing time & selectivity – smaller queues
• Heartbeats & Punctuations
o Typically issued by sources
o Reduce the amount of state needed by operators
o Prevent operators from doing unnecessary work
o Query plans can also issue heartbeats to avoid pipeline stalls and delayed results
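
How a punctuation lets an operator release state can be sketched as follows. The Punctuation class and the tumbling_count operator are hypothetical names for this example; the sketch assumes integer timestamps and a punctuation meaning "no later tuple will carry a timestamp at or below ts".

```python
from typing import Dict, Iterable, Iterator, Tuple, Union


class Punctuation:
    """Hypothetical marker: no later tuple will carry a timestamp <= ts."""

    def __init__(self, ts: int) -> None:
        self.ts = ts


Item = Union[Tuple[int, str], Punctuation]  # a data tuple is (timestamp, key)


def tumbling_count(stream: Iterable[Item], width: int) -> Iterator[Tuple[int, int]]:
    """Count tuples per window [start, start + width) and emit (start, count)."""
    open_windows: Dict[int, int] = {}
    for item in stream:
        if isinstance(item, Punctuation):
            # A window whose last timestamp is <= item.ts can receive no more input,
            # so its result is emitted and its state is discarded.
            closed = sorted(w for w in open_windows if w + width - 1 <= item.ts)
            for start in closed:
                yield (start, open_windows.pop(start))
        else:
            ts, _key = item
            start = (ts // width) * width
            open_windows[start] = open_windows.get(start, 0) + 1


stream = [(1, "a"), (3, "b"), (12, "a"), Punctuation(9), (14, "c"), Punctuation(19)]
results = list(tumbling_count(stream, width=10))
```

Without the punctuations the operator could never be sure a window was complete and would have to hold every count until the end of the input, which on an unbounded stream never arrives.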

Load Shedding & Approximation
• Applications require real-time, or near real-time, response and are characterized by high-speed
arrival rates.
• Load shedding
o Random sampling
o Semantic load shedding to drop less important tuples
o Objective is to minimize the drop in accuracy
o Challenging for complex query plans with multiple streams and operators
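
Random load shedding can be sketched in a few lines. The example below assumes a COUNT query and a fixed keep probability; the function names are hypothetical, and scaling the count by 1 / keep_prob is one simple way to compensate for the dropped tuples.

```python
import random
from typing import Iterable, List


def shed_randomly(stream: Iterable[int], keep_prob: float, rng: random.Random) -> List[int]:
    """Keep each tuple independently with probability keep_prob and drop the rest."""
    return [t for t in stream if rng.random() < keep_prob]


def estimate_count(kept: List[int], keep_prob: float) -> float:
    """Scale the count of surviving tuples to estimate the count over the full stream."""
    return len(kept) / keep_prob


rng = random.Random(0)  # fixed seed so the run is repeatable
kept_half = shed_randomly(range(1000), 0.5, rng)
approx = estimate_count(kept_half, 0.5)
```

The estimate is unbiased but not exact. A lower keep_prob reduces the processing load at the cost of a larger expected error, which is the accuracy trade-off that load shedding tries to control.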
Two types of approximation that have been suggested are:
1. Max-Subset results
The objective is to maximize the size of the resulting join.
2. Sampled results
The objective is to provide a fair random sample of the join result.

• The productivity of a tuple determines its contribution to the multi-way join.
• For simplicity, we denote by T_{W_i={t}} the set of join tuples of the n windows when the i-th window
contains only t.
Two Priority Measures:
• Maximum Subset: To provide a maximum subset of the true result, we should shed the tuple
with the least productivity in order to minimize the loss caused by load shedding.
• Random Sampling: To provide a random sample of the true result, one may control the fraction
of the join tuples produced by each tuple.
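
The maximum-subset measure can be illustrated for the simplest case, a two-way equi-join. In this sketch, windows are lists of join keys and the productivity of a tuple is taken to be the number of matching tuples currently in the other window; this is a simplifying assumption, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def join_size(window: List[str], other_window: List[str]) -> int:
    """Number of result tuples of the equi-join between the two windows."""
    counts = Counter(other_window)
    return sum(counts[k] for k in window)


def shed_least_productive(window: List[str], other_window: List[str], capacity: int) -> List[str]:
    """Keep the `capacity` tuples that join with the most tuples of the other window."""
    counts = Counter(other_window)
    ranked = sorted(window, key=lambda k: counts[k], reverse=True)
    return ranked[:capacity]


w1 = ["a", "b", "c", "a"]
w2 = ["a", "a", "c"]
kept = shed_least_productive(w1, w2, capacity=3)
naive = w1[:3]  # shedding by arrival order, for comparison
```

Here the shed tuple is "b", which joins with nothing, so the join result keeps its full size, while dropping the last arrival instead would lose two result tuples.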
Estimating Productivity:
• Sketching techniques have been proposed for approximating complex query answers. The class of queries
considered is of the form:
SELECT AGG FROM R1, ..., Rr WHERE θ
where AGG is an arbitrary aggregate operator such as COUNT or SUM, and θ represents the conjunction
of equi-join conditions.
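
For the special case of COUNT over a two-way equi-join, one well-known sketching approach is the AMS ("tug-of-war") sketch. The code below is a minimal illustration of that idea rather than the method of any particular paper. It assumes string join keys, uses a SHA-256 based sign function as a stand-in for a 4-wise independent hash family, and takes a plain average over the copies where published schemes typically combine means and medians.

```python
import hashlib
from typing import Iterable, List


def sign(key: str, seed: int) -> int:
    """Pseudo-random +1 or -1 for (key, seed); a stand-in for a 4-wise independent hash."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return 1 if digest[0] & 1 else -1


def sketch(stream: Iterable[str], copies: int) -> List[int]:
    """One counter per copy: the sum of sign(key, copy) over all tuples seen so far."""
    counters = [0] * copies
    for key in stream:
        for j in range(copies):
            counters[j] += sign(key, j)
    return counters


def estimate_join_count(sk_r: List[int], sk_s: List[int]) -> float:
    """The average of the per-copy products estimates COUNT(*) of R joined with S."""
    return sum(x * y for x, y in zip(sk_r, sk_s)) / len(sk_r)


# Both streams carry a single key, so every copy yields the exact join size 3 * 4.
sk_r = sketch(["a"] * 3, copies=8)
sk_s = sketch(["a"] * 4, copies=8)
estimate = estimate_join_count(sk_r, sk_s)
```

Each sketch is a small fixed-size array that is updated per arriving tuple, so neither stream has to be stored. With many distinct keys the estimate is only approximate, and its variance shrinks as the number of copies grows.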

References
• https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-39940-9_137
• http://eecs.wsu.edu/~yinghui/mat/courses/spring%202016/Reading/chp5-data%20stream%20management.pdf
• https://www.microsoft.com/en-us/research/publication/189-theory-stream-queries/
