Distributed Database
SUBMITTED TO: Sir Hammad
SUBMITTED BY: Syed Umair Raza 8th Semester
THE UNIVERSITY OF LAHORE
Outline
• Data Stream Management System
• Streaming Operators & their implementation
• Query Processing
• Load Shedding & Query Approximation
Data Stream Management
• Stream data - Produced incrementally over time, rather than being available in full before its
processing begins.
• Examples:
Definition:
Data for almost any large-scale data-management task is continuously collected over a wide area, and at a
much greater rate than ever before.
Characteristics of Data Stream Management
Typically, data streams exhibit the following characteristics:
• infinite length
• continuous data arrival
• high data rates
• requirements for low-latency, real-time query processing
• data that are usually time-stamped and generally arrive in either temporal order or close to it
Data Stream Management Languages
• StreaQuel, an SQL-based language
• The XML-QL-based language of NiagaraCQ
• CQL (Continuous Query Language), an SQL-based language
Query Operators & Implementation
Query Operators
• The blocking/nonblocking properties of operators independent of the language in which
they are expressed, and
• The abstract properties of stream functions expressible by blocking/nonblocking
operators.
Blocking Query Operator: A blocking query operator is a query operator that is unable to
produce the first tuple of the output until it has seen the entire input.
Nonblocking Query Operator: A nonblocking query operator is one that produces all the
tuples of the output before it has detected the end of the input.
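The difference between the two definitions can be illustrated with a minimal Python sketch. The operator names (select_nonblocking, sort_blocking) and the traced helper are hypothetical and exist only for illustration; generators stand in for streams.

```python
def select_nonblocking(stream, predicate):
    """Selection: emits each qualifying tuple as soon as it is read."""
    for t in stream:
        if predicate(t):
            yield t


def sort_blocking(stream):
    """Sorting: cannot emit its first tuple before the input has ended."""
    buffered = sorted(stream)  # consumes the entire input first
    yield from buffered


def traced(stream, log):
    """Helper that records what the downstream operator has read so far."""
    for t in stream:
        log.append(("read", t))
        yield t
    log.append(("end", None))


select_log = []
first_selected = next(select_nonblocking(traced([3, 1, 2], select_log), lambda t: t >= 2))

sort_log = []
first_sorted = next(sort_blocking(traced([3, 1, 2], sort_log)))
```

The selection returns its first result after reading a single input tuple, while the sort returns nothing until the trace shows the end of the input. On an unbounded stream the sort would therefore never produce output.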
Streaming Operator Functionality
• Consider operators that take sequences (streams) as input and return sequences (streams) as output. For instance, consider an operator G that takes a sequence S as input and produces a sequence G(S) as output:
S −→ G −→ G(S)
• G operates as an incremental transducer, which for each new input tuple in S, adds
zero, one, or several tuples to the output.
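One way to picture an incremental transducer is as a step function applied to each arriving tuple. The sketch below is an illustration only: transduce, new_max_step and split_step are hypothetical names, and the step signature (state, tuple) -> (new state, list of outputs) is an assumption made for this example.

```python
def transduce(stream, step, state=None):
    """Apply `step` to every input tuple and append its outputs to the result."""
    for t in stream:
        state, outputs = step(state, t)
        yield from outputs  # zero, one, or several output tuples


def new_max_step(state, t):
    """Emit a tuple only when it exceeds every tuple seen so far (0 or 1 outputs)."""
    if state is None or t > state:
        return t, [t]
    return state, []


def split_step(state, line):
    """Emit one output tuple per word in the input line (0..n outputs)."""
    return state, line.split()


maxima = list(transduce([3, 1, 5, 5, 7], new_max_step))
words = list(transduce(["a b", "", "c"], split_step))
```

Both operators are nonblocking because each output depends only on the input seen so far.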
• Join operators problematic on streams
• May need to join arbitrarily far apart stream
tuples
• Operations on implicit / explicit windows
• Selections, (duplicate preserving) projections
are straightforward
• Local, per-element operators
• Duplicate eliminating projection is like grouping
• Projection needs to include ordering attribute
• No restriction for position ordered streams
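Restricting a join to a window keeps its state bounded. The following is a minimal sketch of a symmetric sliding-window equi-join, assuming integer timestamps, a single merged input in timestamp order, and events of the hypothetical form (timestamp, side, key); none of these choices come from the slides.

```python
from collections import deque


def window_join(events, window):
    """Join 'L' and 'R' tuples with equal keys that are at most `window` apart.

    Yields (key, left_timestamp, right_timestamp).
    """
    buffers = {"L": deque(), "R": deque()}
    for ts, side, key in events:
        other = "R" if side == "L" else "L"
        # Expire tuples that have fallen out of the window on both sides.
        for buf in buffers.values():
            while buf and buf[0][0] < ts - window:
                buf.popleft()
        # Probe the opposite window, then remember the new tuple.
        for other_ts, other_key in buffers[other]:
            if other_key == key:
                yield (key, ts, other_ts) if side == "L" else (key, other_ts, ts)
        buffers[side].append((ts, key))


events = [(1, "L", "a"), (2, "R", "a"), (3, "R", "b"), (10, "L", "b"), (11, "R", "a")]
narrow = list(window_join(events, window=5))
wide = list(window_join(events, window=10))
```

With the narrow window the tuples that are far apart in time are never joined, which is exactly the information a window gives up in exchange for bounded memory.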
Query Processing
• Declarative queries -> logical query plan -> physical plan
• Directed acyclic graphs (nodes -> operators, edges -> data flow)
• Queries sharing memory/streams are combined into a single plan
• Scheduling
o FIFO, round robin – simple, not efficient
o Operators with higher throughput – low latency
o Operators with min processing & selectivity – smaller queues
• Heartbeats & Punctuations
o Typically issued by sources
o Reduce the amount of state needed by operators
o Prevent operators doing unnecessary tasks
o Query plans can also issue heartbeats to avoid
pipeline stalls and delayed results
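The effect of punctuations on operator state can be sketched with a windowed count. This is a minimal illustration under stated assumptions: the stream carries hypothetical ("tuple", ts) and ("punct", ts) items with integer timestamps, and a punctuation with timestamp ts promises that no later tuple has a timestamp of ts or less.

```python
def tumbling_count(stream, width):
    """Count tuples per tumbling window, emitting a window once it is closed."""
    counts = {}  # window start -> number of tuples seen so far
    for kind, ts in stream:
        if kind == "tuple":
            start = (ts // width) * width
            counts[start] = counts.get(start, 0) + 1
        else:
            # Punctuation: every window ending at or before ts is complete,
            # so it can be emitted and its state discarded.
            closed = sorted(s for s in counts if s + width - 1 <= ts)
            for s in closed:
                yield (s, counts.pop(s))


stream = [("tuple", 1), ("tuple", 3), ("tuple", 12), ("punct", 9),
          ("tuple", 14), ("punct", 19)]
results = list(tumbling_count(stream, width=10))
```

Without the punctuations the operator would have to keep every window open indefinitely, so it would behave like a blocking operator on an unbounded stream.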
Load Shedding & Approximation
• Streaming applications require real-time or near-real-time responses and are characterized by high arrival rates.
• Load shedding drops tuples when the arrival rate exceeds the processing capacity
o Random sampling
o Semantic load shedding to drop less important tuples
o Objective is to minimize the drop in accuracy
o Challenging for complex query plans with multiple streams and operators
Two types of approximation have been suggested:
1. Max-subset results: the objective is to maximize the size of the resulting join.
2. Sampled results: the objective is to provide a fair random sample of the join result.
• The productivity of a tuple determines its contribution to the multi-way join.
• For simplicity, let T_{Wi={t}} denote the set of join tuples over the n windows when the i-th window contains only the tuple t.
Two Priority Measures:
• Maximum Subset: To provide a maximum subset of the true result, we should shed the tuple with the least productivity in order to minimize the loss caused by load shedding.
• Random Sampling: To provide a random sample of the true result, one may control the fraction of the result tuples produced by each input tuple.
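The maximum-subset measure can be sketched for the simplest case of a two-way equi-join. This is an illustration, not the method of the cited work: it assumes that the productivity of a tuple is the number of join results it produces against the other window, and all function names are hypothetical.

```python
from collections import Counter


def productivity_in_two_way_join(other_window):
    """Return a function giving the number of matches of a key in the other window."""
    matches = Counter(other_window)
    return lambda key: matches[key]


def shed_least_productive(window, productivity, capacity):
    """Keep the `capacity` tuples with the highest productivity, shed the rest."""
    ranked = sorted(window, key=productivity, reverse=True)
    return ranked[:capacity]


def join_size(window_r, window_s):
    """Exact size of the equi-join of two windows of join keys."""
    matches = Counter(window_s)
    return sum(matches[key] for key in window_r)


window_r = ["a", "b", "c"]
window_s = ["a", "a", "c"]
kept = shed_least_productive(window_r, productivity_in_two_way_join(window_s), capacity=2)
size_before = join_size(window_r, window_s)
size_after = join_size(kept, window_s)
```

Here the shed tuple "b" has productivity zero, so the join result is unchanged. In practice the productivity is not known exactly and has to be estimated.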
Estimating Productivity:
• Sketching techniques can be used to approximate the answers to complex queries. The class of queries considered has the form:
SELECT AGG FROM R1, ..., Rr WHERE θ
where AGG is an arbitrary aggregate operator such as COUNT or SUM, and θ represents the conjunction of equi-join conditions.
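The slides do not name a specific sketch. As one illustration, the block below estimates COUNT over a two-way equi-join with a random-sign (AMS-style) sketch, which is an assumption made for this example. The names SignFamily, build_sketch and estimate_join_count are hypothetical, and the sign tables are stored explicitly for clarity, whereas a real sketch would use a compact hash family instead.

```python
import random


class SignFamily:
    """Shared random +1/-1 assignments, one independent table per sketch copy."""

    def __init__(self, copies, seed=0):
        self.copies = copies
        self._rng = random.Random(seed)
        self._tables = [{} for _ in range(copies)]

    def sign(self, copy, value):
        table = self._tables[copy]
        if value not in table:
            table[value] = self._rng.choice((-1, 1))
        return table[value]


def build_sketch(stream, family):
    """One counter per copy: the sum of the signs of all join keys seen."""
    counters = [0] * family.copies
    for value in stream:
        for j in range(family.copies):
            counters[j] += family.sign(j, value)
    return counters


def estimate_join_count(sketch_r, sketch_s):
    """Average the per-copy products; each product is an unbiased estimate."""
    products = [a * b for a, b in zip(sketch_r, sketch_s)]
    return sum(products) / len(products)


family = SignFamily(copies=2000, seed=0)
r_keys = [1] * 50 + [2] * 30  # join keys of stream R
s_keys = [1] * 40 + [3] * 10  # join keys of stream S
estimate = estimate_join_count(build_sketch(r_keys, family), build_sketch(s_keys, family))
exact = sum(1 for r in r_keys for s in s_keys if r == s)
```

Both streams must be sketched with the same sign family, otherwise the products carry no information about matching keys. Averaging more copies reduces the variance of the estimate at the cost of more counters.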
References
• https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-39940-9_137
• http://eecs.wsu.edu/~yinghui/mat/courses/spring%202016/Reading/chp5-data%20stream%20management.pdf
• https://www.microsoft.com/en-us/research/publication/189-theory-stream-queries/
