Understand the main factors that affect the Distributed Stream model.
How to manage “One-shot queries” in Distributed Streams?
Handling “message loss” in Distributed Streams.
Introduce “decentralized computation techniques” in Distributed Streams.
A brief overview of “Aggregate Computation” and its main problems.
Overview of Minos Garofalakis’s view [2] on the future of Distributed Stream Systems
1/36
Agenda
Data Stream Model
Distributed Stream Model
Requirements for stream synopses
Distributed Stream Querying Space
Querying Model
Class of Queries
Communication Model
Distributed Stream Processing
One-shot distributed-stream querying
Tree-based aggregation
Loss
Decentralized computation and gossiping
Continuous distributed-stream
Polling
Tolerance
Distributed Data Streams System
Revisiting objectives
Future Work
References
2/36
Data Stream Model 3/36
Distributed Data Stream Model 4/36
Requirements for stream synopses
Single Pass:
Each record is examined at most once
Small Space:
Memory used and data transmitted should be as small as possible
Small time:
Low per-record processing time
5/36
Distributed Stream Querying Space 6/36
Distributed Stream Querying Space (cont’d)
One-shot queries:
Pull the query answer from the network on demand
Continuous queries:
Track/monitor answer at query site at all times
Detect anomalous behavior in (near) real-time, i.e., “Distributed triggers”
Main challenge is to minimize communication
May use one-shot algorithms as subroutines
7/36
Distributed Stream Querying Space (cont’d)
Minimizing communication often needs approximation
Example:
Continuously monitoring an average value: an exact answer requires sending every change, whereas an approximate answer only needs the ‘significant’ changes.
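The ‘significant changes only’ idea can be sketched as follows. This is a minimal illustration, not a method taken from the slides: the class `RemoteSite`, the function `monitor_average`, and the per-site `slack` parameter are hypothetical names, and the rule "send only when the local value has moved by more than `slack` since the last message" is an assumed filtering policy.

```python
class RemoteSite:
    """Hypothetical remote site that reports only significant changes."""

    def __init__(self, slack):
        self.slack = slack        # assumed per-site tolerance
        self.last_sent = None     # value the coordinator currently holds
        self.messages = 0         # number of messages sent so far

    def observe(self, value):
        """Return the value if it must be sent to the coordinator, else None."""
        if self.last_sent is None or abs(value - self.last_sent) > self.slack:
            self.last_sent = value
            self.messages += 1
            return value
        return None


def monitor_average(streams, slack):
    """Track the average of the latest value at each site.

    `streams` is a list of equally long value lists, one per site.
    Returns (estimated average, total messages sent).
    """
    sites = [RemoteSite(slack) for _ in streams]
    known = [0.0] * len(streams)  # coordinator's view of each site
    for step in zip(*streams):
        for i, value in enumerate(step):
            sent = sites[i].observe(value)
            if sent is not None:
                known[i] = sent
    estimate = sum(known) / len(known)
    return estimate, sum(site.messages for site in sites)


streams = [[10, 10.2, 10.4, 12], [20, 20.1, 19.9, 20.3]]
estimate, messages = monitor_average(streams, slack=0.5)
print(estimate, messages)
```

Because every site's last reported value is within `slack` of its true value, the estimated average is also within `slack` of the true average, while far fewer than the 8 raw updates are sent.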
8/36
Distributed Stream Querying Space (cont’d)
Simple algebraic vs. holistic aggregates [3]
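Algebraic aggregates (e.g., sum, count, avg) can be computed from small, fixed-size partial summaries.
Holistic aggregates need the whole input to compute the query result (no summary suffices). E.g., count distinct: all distinct items must be remembered to tell whether a new item is distinct or not.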
Duplicate-sensitive vs. duplicate-insensitive
Duplicate sensitivity indicates whether the result of the aggregate is affected by a duplicated reading.
E.g., binary AND is duplicate-insensitive, whereas SUM is duplicate-sensitive
Complex queries
E.g., distributed joins
9/36
Distributed Stream Querying Space (cont’d)
Topology:
Routing Schemes:
10/36
Distributed Stream Querying Space (cont’d)
Other network characteristics:
Node failures
Message loss
…
11/36
One-shot distributed-stream querying
Tree-based aggregation
Loss
Decentralized computation
Gossiping
12/36
Tree-based aggregation
Goal is for root to compute a function of data at leaves
Trivial solution: push all data up the tree and compute at the base station
Bottleneck at nodes near the root
A lot of communication
13/36
Tree-based aggregation (cont’d)
We can do much better with “in-network” query processing
Example: computing max
Instead of sending all data to the root and having the root compute the max, each node hears from all its children, computes the max of those values and its own, and sends only that value to its parent.
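A minimal sketch of in-network max computation over an aggregation tree. The dictionary-based tree representation and the name `in_network_max` are assumptions made for illustration only.

```python
def in_network_max(tree, readings, node):
    """Return the max over the subtree rooted at `node`.

    `tree` maps a node to the list of its children; `readings` maps a node
    to its local value. Each node forwards a single number to its parent
    instead of forwarding all the data below it.
    """
    best = readings[node]
    for child in tree.get(node, []):
        best = max(best, in_network_max(tree, readings, child))
    return best


# Hypothetical five-node tree: root -> a, b and a -> c, d
tree = {"root": ["a", "b"], "a": ["c", "d"]}
readings = {"root": 3, "a": 7, "b": 2, "c": 9, "d": 4}
print(in_network_max(tree, readings, "root"))
```

Every edge of the tree carries exactly one value, so nodes near the root no longer become a bottleneck.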
14/36
Tree-based aggregation (cont’d)
Aggregates of interest
SQL Primitives:
min, max, sum, count, avg
More complex:
count distinct, range queries
Data mining:
association rules, clusterings
15/36
Tree-based aggregation (cont’d)
Formal framework for in-network aggregation (GFE model: “generate / fuse / evaluate”)
Define functions:
Generate, g(i):
take input, produce summary (at leaves)
Fusion, f(x,y):
merge two summaries (at internal nodes)
Evaluate, e(x):
output result (at root)
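As an illustration of the three functions, here is a sketch for the average aggregate, assuming the summary is a (sum, count) pair. The function names follow the slide; the choice of summary is an assumption for this example.

```python
from functools import reduce


def generate(reading):
    """g(i): turn one leaf reading into a (sum, count) summary."""
    return (reading, 1)


def fuse(x, y):
    """f(x, y): merge two summaries at an internal node."""
    return (x[0] + y[0], x[1] + y[1])


def evaluate(summary):
    """e(x): produce the final average at the root."""
    total, count = summary
    return total / count


leaf_readings = [2, 4, 6, 8]
root_summary = reduce(fuse, (generate(r) for r in leaf_readings))
print(evaluate(root_summary))
```

The summary stays at two numbers no matter how many leaves contribute, which is why avg can be aggregated in-network even though averages themselves cannot simply be averaged.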
16/36
Loss
Unreliability
The tree aggregation techniques above assume a reliable network:
No node failures
No message loss
Failure can dramatically affect the computation
E.g., sum – if a node near the root fails, then a whole subtree may be lost
17/36
Loss (cont’d)
Unreliability
Failure detection
Message Loss:
Timeout
Node Failure
Keep-Alive Messages
Failure correction
Message Loss:
resending the message
Node Failure:
Rebuild the whole tree (if needed)
Rerun the protocol
18/36
Loss (cont’d)
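Order and duplicate insensitivity (ODI)
A summary is ODI if merging inputs in a different order, or merging the same input more than once, does not change the result.
e.g., SUM is not ODI
e.g., MIN and MAX are ODI
How to make ODI summaries for other aggregates?
Example: transform “count distinct” into an ODI summary using the FM sketch (Flajolet, Martin ’85)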
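A small sketch of why ODI matters under unreliable delivery: if a message is retransmitted and arrives twice, or messages arrive in a different order, MAX is unaffected while SUM changes. The helper `fuse_all` and the sample data are hypothetical.

```python
from functools import reduce


def fuse_all(fuse, messages):
    """Merge a sequence of partial results with the given fuse function."""
    return reduce(fuse, messages)


def fuse_sum(a, b):
    return a + b


def fuse_max(a, b):
    return max(a, b)


readings = [3, 5, 2]
# Same readings, reordered, with the value 5 delivered twice.
delivered = [2, 5, 3, 5]

print(fuse_all(fuse_sum, readings), fuse_all(fuse_sum, delivered))
print(fuse_all(fuse_max, readings), fuse_all(fuse_max, delivered))
```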
19/36
Loss (cont’d)
The Flajolet-Martin sketch [4]
Target:
Estimates number of distinct inputs ( count distinct )
Given:
A sequence of inputs x_i
Steps:
Use a hash function h that maps each input item to position i with probability 2^-i:
Pr[ h(x) = 1 ] = 1/2, Pr[ h(x) = 2 ] = 1/4, Pr[ h(x) = 3 ] = 1/8, and so on.
Construct an FM bitmap of length log N, where N is the number of inputs
20/36
The Flajolet-Martin sketch (cont’d)
For each incoming value x, set FM[h(x)] = 1
The position of the least significant 0 in the bitmap approximates the base-2 logarithm of the number of distinct items seen.
Accuracy:
Repeating the sketch with several randomly chosen hash functions and combining the results improves the accuracy
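A minimal sketch of an FM-style bitmap. Several choices here are assumptions and not taken from the slides: the hash is derived from SHA-256, the bitmap is a 32-bit Python integer, and the estimate uses the commonly cited correction factor 0.77351. The names `fm_position`, `fm_insert`, `fm_merge`, `fm_sketch`, and `fm_estimate` are hypothetical.

```python
import hashlib

BITMAP_BITS = 32   # assumed bitmap length
PHI = 0.77351      # commonly cited Flajolet-Martin correction factor


def fm_position(item, seed=0):
    """Map an item to position i >= 1 with probability about 2^-i.

    The position is found by counting trailing zero bits of a hash value;
    positions are capped at BITMAP_BITS.
    """
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    position = 1
    while value & 1 == 0 and position < BITMAP_BITS:
        value >>= 1
        position += 1
    return position


def fm_insert(bitmap, item, seed=0):
    """Set FM[h(x)] = 1; inserting the same item again changes nothing."""
    return bitmap | (1 << (fm_position(item, seed) - 1))


def fm_merge(a, b):
    """Fuse two sketches with bitwise OR (order and duplicate insensitive)."""
    return a | b


def fm_sketch(items, seed=0):
    """Build a sketch from an iterable of items."""
    bitmap = 0
    for item in items:
        bitmap = fm_insert(bitmap, item, seed)
    return bitmap


def fm_estimate(bitmap):
    """Estimate the distinct count from the least significant 0 bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI


items = [f"item-{i}" for i in range(1000)]
print(fm_estimate(fm_sketch(items)))
```

A single bitmap gives only a coarse estimate (a power of two divided by `PHI`); passing different `seed` values gives independent sketches whose results can be combined.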
Loss (cont’d) 21/36
Decentralized Computation
Concepts:
All participate in computation.
All get the result.
Anyone can talk to anyone else directly.
22/36
Gossiping
At each round, everyone who knows the data sends it to one of the n participants, chosen at random.
After O(log n) rounds, all n participants know the information with high probability.
23/36
Aggregate Computation via Gossip
Gossiping to exchange n secrets.
If we have an ODI summary, we can gossip with this:
When new summary received, merge with current summary
ODI properties ensure repeated merging stays accurate
After O(log n) rounds everyone knows the merged summary.
O(n log n) messages in total
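A small simulation sketch of gossiping an ODI summary, using max as the summary. Assumptions for illustration: rounds are synchronous, every participant pushes its current summary to one uniformly random participant per round, and the simulation stops by peeking at the global answer (a real protocol would instead run for a fixed number of rounds). The name `gossip_max` is hypothetical.

```python
import random


def gossip_max(values, seed=0, max_rounds=1000):
    """Simulate push gossip of a max summary.

    Returns (final state of all participants, number of rounds used).
    """
    rng = random.Random(seed)
    n = len(values)
    state = list(values)
    target = max(values)   # used only to decide when the simulation stops
    rounds = 0
    while any(v != target for v in state) and rounds < max_rounds:
        snapshot = list(state)   # summaries as they were at the start of the round
        for sender in range(n):
            receiver = rng.randrange(n)
            # Merging is ODI: receiving the same summary repeatedly is harmless.
            state[receiver] = max(state[receiver], snapshot[sender])
        rounds += 1
    return state, rounds


state, rounds = gossip_max(list(range(64)))
print(rounds, set(state))
```

Each round costs n messages, so convergence in O(log n) rounds matches the O(n log n) total on the slide.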
24/36
Continuous Distributed Model
A naïve approach must continuously centralize all data.
Enormous communication overhead!
25/36
Continuous Distributed Model (cont’d)
Continuous
Real-time tracking, rather than one-shot query/response
Distributed
Each remote site only observes part of the global stream(s)
Communication constraints:
must minimize monitoring burden
Streaming
Each site sees a high-speed local data stream
26/36
Continuous Distributed Model (cont’d)
Naïve solution
continuously centralize all data.
Enormous communication overhead!
So what about polling?
29/36
Continuous Distributed Model (cont’d)
Polling
Sometimes periodic polling suffices for simple tasks
Must balance polling frequency against communication:
Very frequent polling causes high communication
Infrequent polling means delays in observing events
Exact answers are not available at all times, so approximate answers must be acceptable.
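The trade-off can be made concrete with a back-of-the-envelope sketch. The cost model is an assumption: each poll costs one request and one reply per site, and an event that happens just after a poll is only seen at the next poll. The name `polling_cost` is hypothetical.

```python
def polling_cost(duration, interval, num_sites):
    """Return (total messages, worst-case detection delay).

    `duration` and `interval` are in the same time unit (e.g. seconds).
    """
    polls = duration // interval
    messages = 2 * polls * num_sites   # assumed: request + reply per site
    worst_case_delay = interval
    return messages, worst_case_delay


# One hour of monitoring over 5 sites, polled every 10 s versus every 10 min
print(polling_cost(3600, 10, 5))
print(polling_cost(3600, 600, 5))
```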
30/36
Continuous Distributed Model (cont’d)
Because we allow approximation, there is slack.
The tolerance for error between computed answer and truth is:
Absolute: |Y – Y’| <= e
Where e: slack
Y: computed answer
Y’: exact answer
Relative: (1 - e) <= Y’/Y <= (1 + e), i.e., |Y - Y’| <= eY
Where eY: slack
Y: computed answer
Y’: exact answer
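The two tolerance notions can be checked as follows. This sketch reads the relative bound as |Y - Y’| <= eY, which is consistent with the stated slack of eY; the function names are hypothetical.

```python
def within_absolute(computed, exact, e):
    """Absolute tolerance: |Y - Y'| <= e."""
    return abs(computed - exact) <= e


def within_relative(computed, exact, e):
    """Relative tolerance: |Y - Y'| <= e * Y, with Y the computed answer."""
    return abs(computed - exact) <= e * abs(computed)


print(within_absolute(10.2, 10, 0.5), within_absolute(11, 10, 0.5))
print(within_relative(105, 100, 0.05), within_relative(120, 100, 0.05))
```

With an absolute bound the allowed slack is the same for all answers, while with a relative bound it grows with the size of the answer.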
31/36
Distributed Data Streams System
Overview of Minos Garofalakis’s view [2] on the future of DSS
Main algorithmic idea:
Trade-off between communication and approximation
“Unfortunately, approximate query processing tools are still not widely adopted in current Stream Processing Engines” [5]
“More complex tools for approximate in-network data processing/collection have yet to gain wider acceptance” [5]
32/36
Revisiting Objectives
Introduce the concept of distributed streams
The concept of distributed streams was explained, together with stream synopses and their requirements: “single pass”, “small space”, and “small time”.
Understand the main factors that affect the Distributed Stream model.
The querying model types lead to the conclusion that “approximation must be accepted for distributed massive data streams”.
Dealing with network topologies introduces the need for “duplicate-insensitive” aggregation; some simple algorithms were presented as examples.
Loss is expected in the Distributed Stream model. Both message loss and node failure have to be handled.
33/36
Revisiting Objectives (cont’d)
How to manage “One-shot queries” in Distributed Streams?
It is better to handle one-shot queries via “in-network” query processing whenever possible.
Handling message loss in Distributed Streams.
Unreliability is expected in Distributed Streams; detection and correction mechanisms were introduced to handle it.
Introduce decentralized computation techniques in Distributed Streams.
Decentralized computation concepts were introduced with gossiping.
Gossiping was used to solve the aggregate computation problem.
34/36
Revisiting Objectives (cont’d)
A brief overview of Aggregate Computation and its main problems.
The behavior of the continuous distributed model was studied.
A naïve solution was introduced (continuously centralize all data).
The naïve solution was improved via polling.
Polling means accepting approximate answers.
Overview of Minos Garofalakis’s view on the future of Distributed Stream Systems
Approximate query processing tools are still not widely adopted.
35/36
Future Work
How to perform stream mining (clustering, associations, classification, change detection, etc.)?
Can all queries be converted to ODI?
Compressing and Filtering XML Streams [6]
Graph-data streams [7] [8]
Extend ERD to express streams (ERD stream modeling)
A general distributed query language (dist-streamSQL?)
Define a language so that a query optimizer can find a plan that guarantees good performance and low communication?
36/36
References
[1] Minos Garofalakis, Processing Massive Data Streams, Yahoo Research & UC Berkeley, The VLDB school in Egypt, April 2008.
[2] Dr. Minos Garofalakis, homepage, http://www.softnet.tuc.gr/~minos/ (last visited 25th May 2009)
[3] Graham Cormode, Minos N. Garofalakis, S. Muthukrishnan and Rajeev Rastogi, Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles, ACM SIGMOD Conference, Baltimore, Maryland, USA, 14th-16th June 2005.
[4] Graham Cormode, Fundamentals of Analyzing and Mining Data Streams, Workshop on data stream analysis, San Leucio, Italy, 15th-16th March 2007.
[5] Quoted from reference 1, slide 185
[6] Giovanni Guardalben, Compressing and Filtering XML Streams, The W3C Workshop on Binary Interchange of XML Information Item Sets, Santa Clara, California, USA, 24th-26th September 2003.
[7] Jian Zhang, Massive Data Streams in Graph Theory and Computational Geometry, Yale University, New Haven, CT, USA, 2005.
[8] Prof. Dr. Joan Feigenbaum’s publications, http://www.cs.yale.edu/homes/jf/Massive-Data-Pubs.html (last visited 25th May 2009)