2. Outline
• Introduction
o Motivation
o Problem Statement
o Definitions
• Data Stream Management System (DSMS)
• Streaming Data Warehouse (SDW)
• Discussion
3. Introduction
• Stream data - Produced incrementally over time, rather than
being available in full before its processing begins
• Examples:
o Transaction data streams: credit card purchases, telecommunications, web accesses
o Log streams: climate data, GPS tracking, sensor networks, IP networks
• Applications:
o Sensor Networks - E.g. TinyDB
o Network Traffic Analysis - E.g. traffic statistics and critical condition detection
o Financial Tickers - On-line analysis of stock prices to discover correlations and identify trends
o Transaction Log Analysis - E.g. Web click streams and telephone calls
4. Motivation
• Massive data sets:
o Huge numbers of users, e.g.,
• AT&T long-distance: ~ 300M calls/day
• AT&T IP backbone: ~ 10B IP flows/day
o Highly detailed measurements, e.g.,
• NOAA: satellite-based measurements of earth geodetics
o Huge number of measurement points, e.g.,
• Sensor networks with huge number of sensors
• Near real-time analysis
o ISP: controlling service levels
o NOAA: tornado detection using weather radar
o Hospital: Patient monitoring
• Traditional data feeds
o Simple queries (e.g., value lookup) needed in real-time
o Complex queries (e.g., trend analyses) performed off-line
5. Problem Statement
                    | DBMS                 | DSMS
Data                | Persistent relations | Streams, time windows
Data access         | Random               | Sequential, one-pass
Updates             | Arbitrary            | Append-only
Update rates        | Relatively low       | High, bursty
Processing model    | Query-driven         | Data-driven
Queries             | One-time             | Continuous
Query plans         | Fixed                | Adaptive
Query optimization  | One query            | Multi-query
Query answers       | Exact                | Exact or approximate
Latency             | Relatively high      | Low

                    | Data Warehouse       | SDW
Data                | Historical           | Recent and historical
Update frequency    | Low                  | High
Update propagation  | Synchronous          | Asynchronous
ETL process         | Complex              | Fast, lightweight

Fig : Comparison of Data Stream Management Systems and Streaming Data Warehouses with traditional database and warehouse systems
6. Definitions
• Non-blocking execution : Query operator Q does not require its entire input before producing results
• Monotonicity : All previously produced results are preserved
o Q(τ) ⊆ Q(τ′) for query operator Q and all times τ <= τ′
o Q is monotonic only if it is non-blocking
• Delta : For operators that do not satisfy monotonicity, the update to the result produced at time τ, expressed as a positive or negative delta
• Punctuation : Special tuple containing a predicate that is guaranteed to be satisfied by the remainder of the data stream
• Heartbeat : Punctuation that governs the timestamps of future tuples
• Average slowdown = tuple response time / shortest processing time
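A minimal sketch of how a punctuation can unblock an otherwise blocking operator, here a grouped count. The `Punctuation` class and its `closed_key` field are hypothetical names chosen for illustration; the predicate is simplified to "no later tuple carries this key".

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple, Union


@dataclass(frozen=True)
class Punctuation:
    """Hypothetical punctuation: no later tuple will carry `closed_key`."""
    closed_key: str


def grouped_count(stream: Iterable[Union[str, Punctuation]]) -> Iterator[Tuple[str, int]]:
    """Count tuples per key and emit a key's final count once it is punctuated."""
    counts: Dict[str, int] = defaultdict(int)
    for item in stream:
        if isinstance(item, Punctuation):
            # The predicate guarantees the key is finished, so its state can be
            # emitted and freed without waiting for the end of the stream.
            if item.closed_key in counts:
                yield item.closed_key, counts.pop(item.closed_key)
        else:
            counts[item] += 1


events = ["a", "b", "a", Punctuation("a"), "b", "c", Punctuation("b")]
results = list(grouped_count(events))
print(results)
```

The key "c" is never punctuated in this input, so its count stays in operator state and is not emitted.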
7. Outline
• Introduction
• Data Stream Management System (DSMS)
o Stream Data Models
o Query Language & Semantics
o Query Processing
o Query Optimization
• Streaming Data Warehouse (SDW)
• Discussion
8. DSMS
• Input Buffer/Monitor
o Captures streaming inputs
o May collect statistics on streams
o Random sampling
• Working storage
o Stores recent stream data
o Used for query processing
• Local Storage
o Used for metadata
o Foreign key mapping
o Naming translation
• Query Processor
o Convert queries into execution plans
o Change plans for different workloads /
input rates
o Contains buffers, operator queues
o Deploys scheduling methods
• Continuous Query Repository
• Results
o May be delivered to users or to other applications
o Stored in an SDW for further analysis
Fig : i) Abstract reference architecture of a DSMS & ii) A traditional DBMS
9. Stream Data Models
• Base Streams – Produced by sources, append only
• Derived streams – produced by continuous queries
• Streams have fixed schema
o <timestamp, source IP Addr, source port, destination IP Addr, destination port, size>
• Data Stream Models
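o Describe underlying signals S : [1 ... N] -> R
o Aggregate model – each stream item carries the range value for a signal
o Cash Register model – each stream item carries a partial, non-negative increment to the range value
o Turnstile model – each stream item carries a partial increment to the range value, which may be negative
o Reset model – each stream item carries a range value that resets the previous value of the signal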
• Stream Windows – important to user and query points of view
o Fixed window
o Sliding window
o Landmark window
o Jumping window – update every k-ticks or k-arrivals
o Tumbling window - update every k-ticks or k-arrivals , k = window size
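The difference between sliding and tumbling windows can be sketched as follows. This is an illustrative sketch, not a DSMS implementation: it assumes time-based windows over `(timestamp, value)` tuples that arrive in non-decreasing timestamp order, and the function names are made up for the example.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Optional, Tuple

Item = Tuple[int, float]  # (timestamp, value)


def sliding_window(stream: Iterable[Item], size: int) -> Iterator[List[Item]]:
    """After each arrival at time t, yield the tuples with timestamp in (t - size, t]."""
    window: Deque[Item] = deque()
    for ts, value in stream:
        window.append((ts, value))
        # Drop tuples that have fallen out of the window.
        while window[0][0] <= ts - size:
            window.popleft()
        yield list(window)


def tumbling_window(stream: Iterable[Item], size: int) -> Iterator[List[Item]]:
    """Yield disjoint windows [k*size, (k+1)*size); empty windows are skipped."""
    current: List[Item] = []
    current_id: Optional[int] = None
    for ts, value in stream:
        window_id = ts // size
        if current_id is not None and window_id != current_id:
            yield current  # the previous window is complete
            current = []
        current_id = window_id
        current.append((ts, value))
    if current:
        yield current  # flush the last window at end of input


data = [(1, 10.0), (2, 20.0), (4, 30.0), (7, 40.0)]
sliding = list(sliding_window(data, size=3))
tumbling = list(tumbling_window(data, size=3))
print(sliding)
print(tumbling)
```

The sliding window is re-evaluated on every arrival, while the tumbling window produces one result per window of length k.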
10. Query Language & Semantics
• Query Algebra
o Stream-to-stream
o Mixed Algebra
• Query Operators – Similar syntax to DBMS, very different semantics
• Relation-like query operators
o Selection, projection, union – stateless operators
o Join – window joins
o Aggregate operators
• DSMS exclusive operators
o Buffered sort operator
o Random sampling operator
o User defined aggregate functions (UDAF)
• Query Languages
o GSQL
o CQL
o ESL
11. Query Operators
• Selections and (duplicate-preserving) projections are straightforward
o Local, per-element operators
o Duplicate eliminating projection is like
grouping
o Projection needs to include ordering
attribute
o No restriction for position ordered streams
• Aggregate expressions:
o distributive: sum, count, min, max
o algebraic: average
o holistic: count-distinct, median
Fig: Simple continuous query operators: i) - Selection, ii) Count, iii) Negation
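The three classes of aggregates differ in what has to be kept per partition (or per sub-window) to combine partial results. The sketch below uses two small hand-picked partitions; the variable and function names are illustrative only.

```python
from statistics import median
from typing import List, Tuple

part1 = [1, 2, 9]
part2 = [3, 4]

# Distributive: the aggregate of the partial aggregates is the overall aggregate.
total = sum([sum(part1), sum(part2)])
maximum = max(max(part1), max(part2))


# Algebraic: a fixed-size state (sum, count) per partition is enough.
def merge_avg(states: List[Tuple[float, int]]) -> float:
    """Combine per-partition (sum, count) pairs into one average."""
    overall_sum = sum(s for s, _ in states)
    overall_count = sum(n for _, n in states)
    return overall_sum / overall_count


average = merge_avg([(sum(part1), len(part1)), (sum(part2), len(part2))])

# Holistic: no fixed-size partial state suffices in general.
# Combining per-partition medians does not give the true median here.
true_median = median(part1 + part2)
median_of_medians = median([median(part1), median(part2)])

print(total, maximum, average, true_median, median_of_medians)
```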
12. Query Operators
• Join operators problematic on
streams
o May need to join arbitrarily far apart
stream tuples
o Operations on implicit / explicit windows
• SELECT * FROM S1, S2
WHERE S1.attr = S2.attr
GROUP BY S1.timestamp/60 AS minute
• SELECT * FROM S1, S2
WHERE S1.attr = S2.attr
AND |S1.timestamp - S2.timestamp| <= w
• SELECT * FROM S1 [RANGE w] , S2 [RANGE w]
WHERE S1.attr = S2.attr
Fig: Simple continuous query operators: i) Join, ii) Sliding window join with state
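A sliding window join with per-stream state can be sketched as below, for the predicate S1.attr = S2.attr with timestamps at most w apart. The sketch assumes the two streams are merged into one sequence in global timestamp order and probes the opposite state with a linear scan; a real operator would typically use a hash index per stream. The event layout is an assumption made for the example.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int, str]  # (stream name "S1" or "S2", timestamp, attr)


def window_join(events: Iterable[Event], w: int) -> Iterator[Tuple[str, int, int]]:
    """Yield (attr, S1 timestamp, S2 timestamp) for matches at most w apart."""
    state: Dict[str, Deque[Tuple[int, str]]] = {"S1": deque(), "S2": deque()}
    for name, ts, attr in events:
        other = "S2" if name == "S1" else "S1"
        other_state = state[other]
        # Expire tuples of the other stream that can no longer match.
        while other_state and other_state[0][0] < ts - w:
            other_state.popleft()
        # Probe the other stream's state.
        for other_ts, other_attr in other_state:
            if other_attr == attr:
                yield (attr, ts, other_ts) if name == "S1" else (attr, other_ts, ts)
        # Insert the new tuple into its own stream's state.
        state[name].append((ts, attr))


events = [
    ("S1", 1, "x"),
    ("S2", 2, "x"),
    ("S2", 3, "y"),
    ("S1", 10, "x"),
    ("S2", 11, "x"),
]
matches = list(window_join(events, w=5))
print(matches)
```

Each arrival triggers the three steps expire, probe, insert, which is why the operator needs to keep state for both inputs.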
13. Query Processing
• Declarative queries ->Logical query plan -> Physical Plan
o Directed Acyclic Graphs (nodes->operators, edges -> data flow)
• Queries sharing memory/streams combined to a single plan
Fig: a) Query plan for two queries: i) a join of streams S1 and S2 with a selection predicate on S1, and
ii) an aggregate on S2. b) A continuous query with selection and tumbling window aggregation
• Scheduling
o FIFO, Round Robin – simple, not efficient
o Prioritizing operators with higher throughput – lowers latency
o Prioritizing operators with low processing cost and selectivity – keeps queues smaller
• Heartbeats & Punctuations
o Typically issued by sources
o Reduce amount of states needed by
operators
o Prevent operators doing unnecessary tasks
o Query plans can also issue heartbeats to
avoid pipeline stalls and delayed results
SELECT minute, SUM(size) FROM s
WHERE destination_port <= 80
GROUP BY timestamp/60 AS minute
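The per-minute aggregation query can be sketched as a selection followed by a tumbling window sum. The example illustrates why heartbeats help: the selection may drop every packet for a while, and without a heartbeat the aggregate would not learn that the minute has ended. The `Packet` and `Heartbeat` types are hypothetical, and timestamps are assumed to be in seconds and non-decreasing.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple, Union


class Packet(NamedTuple):
    timestamp: int
    destination_port: int
    size: int


class Heartbeat(NamedTuple):
    timestamp: int  # promise: all later tuples have a timestamp >= this value


def per_minute_sum(stream: Iterable[Union[Packet, Heartbeat]]) -> Iterator[Tuple[int, int]]:
    """Emit (minute, SUM(size)) for packets with destination_port <= 80."""
    minute: Optional[int] = None
    total = 0
    for item in stream:
        if isinstance(item, Packet) and item.destination_port > 80:
            continue  # selection: the aggregate never sees this packet
        item_minute = item.timestamp // 60
        if minute is not None and item_minute > minute:
            yield minute, total  # the open window is now known to be complete
            minute, total = None, 0
        if isinstance(item, Packet):
            if minute is None:
                minute = item_minute
            total += item.size


stream = [
    Packet(5, 80, 100),
    Packet(30, 443, 999),  # filtered out by the selection
    Packet(50, 22, 50),
    Heartbeat(60),         # closes minute 0 even though no packet has arrived yet
    Packet(125, 80, 7),
]
per_minute = list(per_minute_sum(stream))
print(per_minute)
```

Minute 2 stays open at the end of this input because nothing has arrived that proves it is complete.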
14. Query Processing Cont..
• Queries as views & Negative tuples
o Negative tuples are implemented as signed tuples generated by explicit windows
o Explicit windows are time-based or count-based
o Generated negative tuples are processed by the downstream (cascading) operators
o Negative tuples on aggregate operators
• Count – easy to maintain
• Max/Min – memory intensive
o Twice as many tuples have to be processed
• Can be avoided for monotonic operators
• Tag tuples with expiration times instead
• Such operators are known as weak non-monotonic
Fig: a) Maintaining a view over a sliding window join using negative
tuples b) Finding the maximum element in a sliding window
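A minimal sketch of the negative tuple approach, assuming a time-based window over `(timestamp, value)` tuples in timestamp order: the window operator emits a +1 tuple on arrival and a -1 tuple on expiry, and a downstream count only adds up the signs. Function names are illustrative.

```python
from collections import deque
from typing import Any, Deque, Iterable, Iterator, Tuple

Signed = Tuple[int, int, Any]  # (sign, timestamp, value)


def with_negative_tuples(stream: Iterable[Tuple[int, Any]], size: int) -> Iterator[Signed]:
    """Emit (+1, ts, value) on arrival and (-1, ts, value) when a tuple expires."""
    window: Deque[Tuple[int, Any]] = deque()
    for ts, value in stream:
        # Tuples older than the window are retracted before the new one is added.
        while window and window[0][0] <= ts - size:
            yield (-1, *window.popleft())
        window.append((ts, value))
        yield (+1, ts, value)


def running_count(signed: Iterable[Signed]) -> Iterator[int]:
    """Maintain COUNT(*) over the window from signed tuples only."""
    count = 0
    for sign, _ts, _value in signed:
        count += sign
        yield count


data = [(1, "a"), (2, "b"), (5, "c")]
signed = list(with_negative_tuples(data, size=3))
counts = list(running_count(signed))
print(signed)
print(counts)
```

Three input tuples produce five signed tuples here; once every tuple has expired, the downstream operator has processed each input twice.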
15. Query Optimization
• Finds efficient query plans
• A DBMS focuses on minimizing I/O, while a DSMS tries to reduce cost per unit time
• Static Analysis and Query Rewriting
o Ensures the query can be evaluated in a non-blocking fashion with limited memory
• S(A,B,C), T(D,E)
• π_A(σ_{A=D ∧ A>10 ∧ D<20}(S × T)): yes
• π_A(σ_{A=D}(S × T)): no
• π_A(σ_{B<D ∧ A>10 ∧ D<20}(S × T)): yes, if no duplicates
o Common Rules
• Evaluate inexpensive predicates before
complex ones
o Performing selections before joins
o Rules for continuous query operators only
• Selections and explicit time-based windows
commute
• Selections and explicit count-based windows
don’t commute
o Rewrite based on input(s) constraints
• Join of unbounded streams if matching
tuples arrive at most t time units apart
• Multi Query Optimization
Fig : Separate and shared query plans for Q1 and Q2
16. Operator Optimization
• Join
o Expired tuples need to be removed from the join state
o Expiration at every time tick is costly
o Periodic removal reduces the expiration cost but increases the join processing cost
o Probe streams with fewer matches
• Aggregation
o Synopses allow efficient re-computations
o Prefix synopses
• Suitable for subtractable aggregates
• For example: Sum, Count
o Interval synopses
• Suitable for distributive aggregates
• For ex: Min, Max
• Need to access log b intervals
• Basic interval synopses require b accesses
o Holistic aggregates require additional info in synopses
o Algebraic aggregates computed from derived info
• Avg = Sum / Count
Fig : i) Prefix synopses, ii) Interval synopses, iii) Basic interval synopses
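The two simplest synopses can be sketched over a fixed list of values. The prefix synopsis answers a window sum with two lookups because sum is subtractable; the basic interval synopsis stores one max per disjoint interval and combines the b intervals that cover the window. The data, the interval width of 2, and the restriction to windows aligned with interval boundaries are assumptions made to keep the sketch short.

```python
from itertools import accumulate
from typing import List

values = [4, 1, 7, 3, 9, 2, 6, 5]

# Prefix synopsis: prefix[i] holds the sum of the first i values.
prefix: List[int] = [0] + list(accumulate(values))


def window_sum(end: int, length: int) -> int:
    """Sum of values[end-length:end], computed from two synopsis entries."""
    return prefix[end] - prefix[end - length]


# Basic interval synopsis: one max per disjoint interval of `width` values.
width = 2
interval_max: List[int] = [max(values[i:i + width]) for i in range(0, len(values), width)]


def window_max(end: int, length: int) -> int:
    """Max of values[end-length:end]; window bounds must be multiples of `width`."""
    first, last = (end - length) // width, end // width
    return max(interval_max[first:last])  # touches one entry per covered interval


print(window_sum(8, 4), window_max(8, 4))
```

Storing intervals at several granularities, as in the interval synopses on the slide, is what reduces the number of accesses from b to about log b; that hierarchy is not shown here.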
17. Query Optimization
• Load Shedding & Approximation
o Random sampling
o Semantic load shedding to drop less important tuples
o Objective is to minimize the drop in accuracy
• Challenging for complex query plan with multiple streams and operators
• Load Balancing
o Write part of stream if possible
• Adaptive Query Optimization
o Query cost-per-unit time may change
o Query plan dynamically re-ordered on speed, selectivity and queue length
o Trade-off between resulting adaptivity and overhead of dynamic routing
• Distributed Query Optimization
o Parallelizing and distributing the system itself
• Split query plan across nodes
• Partition the streams
o Shifting partial computation to the sources
• In-network processing reduces the communication overhead
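Load shedding by random sampling can be sketched as below, assuming the downstream query is a SUM and that every tuple is kept independently with the same probability. Dividing the sampled sum by the keep probability gives an unbiased estimate of the true sum. The function name and the fixed seed are illustrative choices.

```python
import random
from typing import Iterable, Tuple


def shed_and_sum(sizes: Iterable[float], keep_probability: float, seed: int = 0) -> Tuple[float, int]:
    """Keep each tuple with `keep_probability`; return (estimated sum, tuples kept)."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    kept = [s for s in sizes if rng.random() < keep_probability]
    # Scale up so that the expected value equals the sum over all tuples.
    return sum(kept) / keep_probability, len(kept)


sizes = [1.0] * 10_000
exact, kept_all = shed_and_sum(sizes, keep_probability=1.0)
estimate, kept_half = shed_and_sum(sizes, keep_probability=0.5)
print(exact, kept_all)
print(estimate, kept_half)
```

Semantic load shedding would replace the uniform coin flip with a rule that prefers dropping tuples that matter less to the query result.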
18. Outline
• Introduction
• Data Stream Management System (DSMS)
• Streaming Data Warehouse (SDW)
o Data ETL
o Update Propagation
o Data Expiration
o Update Scheduling
o Query Processing on SDW
• Discussion
19. SDW
• Data streams/feeds arrive periodically
• ETL process - data cleaning, standardization and so on
• Table types
o Base tables – Sourced directly from raw files
o Derived tables – Materialized view over base or other derived table
• Update scheduler selects the update order
o Based on dependencies and workloads
Fig : Abstract reference architecture of a SDW
20. ETL
• Simple tasks – un-compression, standardization
• Complex tasks
o Joining new data with relations that hold descriptive attributes
• Relations R are disk-based
• Incoming data is buffered in main memory
• Mesh Join
o Blocks of R are accessed in sequential order
o A tuple is removed from the buffer once it has been joined with all blocks of R
o Loading data into tables
• Tables are partitioned into timestamp ranges
• Loads affect a small number of recent partitions
Fig : Partitioning a table on a timestamp attribute
21. Update Propagation
• Goals
o Propagate changes across layers of derived
tables
o Avoid recomputing an entire derived table
o Efficiently identify partition dependency
• Partition dependencies may not be
obvious from the SQL specification
Fig : Updating a partitioned derived table
Fig : Partition dependency
22. Data Expiration
• Tuples may have variable lifetime
• Tables can be partitioned on insertion and expiration timestamps
o Partitions may not have equal size
• One solution is to assign updates in round robin fashion
Fig : Partitioning a table on two attributes: insertion and expiration timestamp
23. Update Scheduling
• External sources push new data
• Many data feeds and derived tables compete for resources
• A scheduler controls resource usage
• Goal: minimize data staleness
• Use a priority-weighted staleness metric and update the tables that reduce it the most
Fig : plot of the staleness of a SDW table over time
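One possible reading of a priority-weighted staleness scheduler is sketched below. The exact metric is not given on the slide, so the definitions here are assumptions: staleness is the time elapsed since the newest data already loaded into a table, and the scheduler greedily picks the table whose update removes the most priority-weighted staleness. The `Table` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Table:
    name: str
    priority: float
    freshness: int       # timestamp of the newest data already loaded
    newest_arrived: int  # timestamp of the newest data waiting to be loaded


def weighted_staleness(table: Table, now: int) -> float:
    """Assumed metric: priority times the age of the loaded data."""
    return table.priority * (now - table.freshness)


def pick_next(tables: List[Table]) -> Optional[Table]:
    """Pick the table whose update removes the most weighted staleness."""
    candidates = [t for t in tables if t.newest_arrived > t.freshness]
    return max(
        candidates,
        key=lambda t: t.priority * (t.newest_arrived - t.freshness),
        default=None,  # nothing to do when no table has pending data
    )


tables = [
    Table("flows", priority=1.0, freshness=100, newest_arrived=160),
    Table("alerts", priority=5.0, freshness=140, newest_arrived=160),
    Table("dimensions", priority=2.0, freshness=160, newest_arrived=160),
]
chosen = pick_next(tables)
print(chosen.name if chosen else None)
```

A fuller scheduler would also account for how long each update takes and for dependencies between base and derived tables.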
24. Query Processing
• Overhead of partitioned tables
o Too small partitions are difficult to manage
o Too big ones need to be recomputed as new data arrives
o Solution : Bigger partitions as data become old
• Data Availability and Concurrency control
o Tables are updated frequently
o Queries should not be blocked and output consistent data
o Solution : Multi-version concurrency control at partition level
25. Discussion
• End-to-end data stream management
• DSMS allows relational like queries as well as pattern matching
and event processing queries
• Query semantics are different than traditional ones
• SDW research problems introduced recently
• Didn’t cover data mining techniques, fault tolerance and
distributed processing in the lecture
26. References
1. Data Stream Management. Lukasz Golab & M. Tamer Özsu
2. Data stream management system – introduction, concepts and issues. Morten Lindeberg, University of Oslo
Editor's Notes
Partition dependencies : Change in raw files / tables could be mapped to partitions of data
Reset model : router cpu measurement
Cash register : internet packet stream
Knowing the stream arrival rates and the selectivity of each operator allows us to estimate each operator's output rate.
If a punctuation arrives on one of the input streams with the predicate attr != a1 AND attr != a2, we can immediately remove tuples with those attr-values from both hash tables.