A ggregation  C omputation Over  D istributed  D ata  S treams (partial content) Yueshen Xu [email_address] Middleware, CC...
Paper reference <ul><li>What's Different: Distributed, Continuous Monitoring of  Duplicate- Resilient Aggregates on Data S...
Background <ul><li>Distributed Data Streams </li></ul><ul><ul><li>Where and why? </li></ul></ul><ul><ul><ul><li>Large scal...
Constraints and Features <ul><li>Constraints </li></ul><ul><ul><li>Space  </li></ul></ul><ul><ul><ul><li>Embedded equipmen...
Trouble <ul><li>Duplication    Why? </li></ul><ul><ul><li>Wide scale monitoring invariably encounters the same events at ...
Topic <ul><li>What kind of topics are researchers interested in ? </li></ul><ul><ul><li>Aggregation computation </li></ul>...
Problems and Concerns <ul><li>Distinct count </li></ul><ul><li>To obtain the number of distinct data (item, record, etc) i...
Distinct Counting:  Flajolet-Martin Sketch <ul><li>Flajolet-Martin Sketch </li></ul><ul><ul><li>P. Flajolet, G. Martin . P...
Flajolet-Martin Sketch(Cont.) <ul><li>Preliminary    what do we need? </li></ul><ul><ul><li>the Multi-set M, containing a...
Flajolet-Martin Sketch(Cont.) <ul><li>The algorithm itself </li></ul><ul><ul><li>the core task: remarking the position of ...
Flajolet-Martin Sketch(Cont.) <ul><li>The explanation </li></ul><ul><ul><li>The fact: bitmap[k] equals to 1 iff  after exe...
Flajolet-Martin Sketch(Cont.) <ul><li>Conclusion </li></ul><ul><ul><li>Analysis </li></ul></ul><ul><ul><ul><li>Bit-based, ...
Question <ul><li>Is the CFV in my last report a kind of sketch? </li></ul><ul><ul><li>Yes, I think so </li></ul></ul><ul><...
<ul><li>Q&A </li></ul>12/19/11 Middleware, CCNT, ZJU
Upcoming SlideShare
Loading in …5
×

Aggregation computation over distributed data streams(the final version)

585 views

Published on

I have fixed a few mistakes in the original version and appended some new contents. Likewise, I hope it is of help for you.

Published in: Education, Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
585
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
11
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Aggregation computation over distributed data streams(the final version)

  1. 1. A ggregation C omputation Over D istributed D ata S treams (partial content) Yueshen Xu [email_address] Middleware, CCNT Zhejiang Univ Middleware, CCNT, ZJU 12/19/11
  2. 2. Paper reference <ul><li>What's Different: Distributed, Continuous Monitoring of Duplicate- Resilient Aggregates on Data Streams </li></ul><ul><ul><li>Published in ICDE, 2006 </li></ul></ul><ul><ul><li>Cited by 61 times </li></ul></ul><ul><ul><li>By Graham Cormode, S. Muthukrishnan etc. </li></ul></ul>12/19/11 Middleware, CCNT, ZJU <ul><li>I think it’s a good reading suitable for freshmen on distributed data streams </li></ul>Bell Lab Expert/27 Rutgers Expert/45 !
  3. 3. Background <ul><li>Distributed Data Streams </li></ul><ul><ul><li>Where and why? </li></ul></ul><ul><ul><ul><li>Large scale monitoring applications </li></ul></ul></ul><ul><ul><ul><li>Many sensors distributed over a wide area </li></ul></ul></ul>12/19/11 Middleware, CCNT, ZJU Just one example <ul><li>Distributed Streaming Model </li></ul><ul><li>What do we research?  Query paradigm </li></ul><ul><ul><li>Centralized </li></ul></ul><ul><ul><li>Decentralized </li></ul></ul>VS
  4. 4. Constraints and Features <ul><li>Constraints </li></ul><ul><ul><li>Space </li></ul></ul><ul><ul><ul><li>Embedded equipments don’t have enough memory </li></ul></ul></ul><ul><ul><li>Processing power </li></ul></ul><ul><ul><ul><li>The same reason </li></ul></ul></ul><ul><ul><li>Communication capability </li></ul></ul><ul><ul><ul><li>Unreliable, spotty and sporadic </li></ul></ul></ul>12/19/11 Middleware, CCNT, ZJU All resources are restricted <ul><li>Features </li></ul><ul><ul><li>Different from ad hoc queries in DBMS, but continuous </li></ul></ul><ul><ul><li>Continuous is one of core characteristics in queries over data streams </li></ul></ul>What’s different?
  5. 5. Trouble <ul><li>Duplication  Why? </li></ul><ul><ul><li>Wide scale monitoring invariably encounters the same events at different points </li></ul></ul>12/19/11 Middleware, CCNT, ZJU <ul><ul><li>Instances </li></ul></ul><ul><ul><ul><li>The same flow will be observed in different routers </li></ul></ul></ul><ul><ul><ul><li>The same individual will be observed by several mobile sensors </li></ul></ul></ul><ul><ul><li>Requirement </li></ul></ul><ul><ul><ul><li>Duplicate-resilient aggregate </li></ul></ul></ul><ul><ul><li>Two vital questions </li></ul></ul><ul><ul><ul><li>What is the amount of duplication in the network? </li></ul></ul></ul><ul><ul><ul><li>What are the versions of classical aggregates in the presence of duplicates? </li></ul></ul></ul>  root of all evil
  6. 6. Topic <ul><li>What kind of topics are researchers interested in ? </li></ul><ul><ul><li>Aggregation computation </li></ul></ul><ul><ul><li>Routing algorithms </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><ul><li>What is the aggregation? </li></ul></ul><ul><ul><li>Summarization, namely a statistical variable computed from the original data sets </li></ul></ul><ul><ul><li>Examples </li></ul></ul><ul><ul><ul><li>min, max, quantile, heavy hitter </li></ul></ul></ul><ul><ul><ul><li>distinct counts, average, sum </li></ul></ul></ul><ul><ul><ul><li>… </li></ul></ul></ul>12/19/11 Middleware, CCNT, ZJU Not strange contacting with data streams Why only aggregation?  transaction Mirror the topic in data base
  7. 7. Problems and Concerns <ul><li>Distinct count </li></ul><ul><li>To obtain the number of distinct data (item, record, etc) in multi-sets, namely the cardinality </li></ul><ul><li>Distinct sample </li></ul><ul><li>Important, but I’m sorry that I haven’t finished this part  </li></ul>12/19/11 Middleware, CCNT, ZJU <ul><li>What does this paper concern about? </li></ul><ul><ul><li>Priority: correctness, communication cost </li></ul></ul><ul><ul><li>Computational cost, space cost </li></ul></ul>! Features attached to those algorithms applied to distributed environments <ul><li>What do we concern about dealing with aggregation computation? </li></ul><ul><ul><li>Space complexity may be more attention-getting than time complexity </li></ul></ul>
  8. 8. Distinct Counting: Flajolet-Martin Sketch <ul><li>Flajolet-Martin Sketch </li></ul><ul><ul><li>P. Flajolet, G. Martin . Probabilistic Counting Algorithms for Data Base Applications . Journal of Computer and System Sciences, 1985(Cited by 628) </li></ul></ul><ul><ul><li>Goal: To estimate the cardinalities of multi-sets of data using relative small space by one pass scan </li></ul></ul><ul><ul><li>The sketch is a kind of data structure, which is the way to obtain the aggregation results.  aggregation </li></ul></ul><ul><ul><li>I think this method can be regarded as the classical application of probability without complexity. </li></ul></ul>12/19/11 Middleware, CCNT, ZJU <ul><li>Give a question: How about you dealing with this problem? </li></ul><ul><ul><li>The computing paradigm of sketching </li></ul></ul><ul><ul><li>Space complexity </li></ul></ul>Be appropriate for using in data streams inherently Just think about one scene in TaoBao
  9. 9. Flajolet-Martin Sketch(Cont.) <ul><li>Preliminary  what do we need? </li></ul><ul><ul><li>the Multi-set M, containing all items/records, and |M| = n </li></ul></ul><ul><ul><li>the upper bound on the number of distinct items/records U, which is more than n </li></ul></ul><ul><ul><li>one bitmap, consisting of L elements, and 2 L = U </li></ul></ul><ul><ul><li>the hash function h(x: item/record), transforming each items into a binary string distributed uniformly over the range of [1…2 L ], just like b 1 b 2 …b L , in which b 1 is the lowest digit, and b L is the highest </li></ul></ul><ul><ul><li>the p(x), attaining the left most position of ‘1’ </li></ul></ul>12/19/11 Middleware, CCNT, ZJU counting not computing 1 1 … 0 0 PPT VS Whiteboard ? x record h(x) 1 L
  10. 10. Flajolet-Martin Sketch(Cont.) <ul><li>The algorithm itself </li></ul><ul><ul><li>the core task: remarking the position of which the leftmost ‘1’ of the hash value recorded by p(x) in bitmap B </li></ul></ul>12/19/11 Middleware, CCNT, ZJU for i:=1 to L do bitmap[i] :=0 for all x in M do begin index := p(hash(x)); if bitmap[index] = the bitmap[index] :=1; end <ul><ul><li>Why? </li></ul></ul>1 0 … 1 0 Bitmap 1 L
  11. 11. Flajolet-Martin Sketch(Cont.) <ul><li>The explanation </li></ul><ul><ul><li>The fact: bitmap[k] equals to 1 iff after execution a pattern of the form 0 k-1 1 has appeared amongst hashed values of records in M </li></ul></ul><ul><ul><li>The probability: the occurrence probability of the pattern 0 k-1 1 is 1/2 k </li></ul></ul><ul><ul><li>Occurrence times: so if |M| = n, then bitmap[1] is accessed approximately n/2 times, bitmap[2] approximately n/4 times </li></ul></ul><ul><ul><li>Extension: bitmap[k] will almost certainly be zero if k >> log 2 (n) and one if k << log 2 (n) wit a fringe of 0 and 1 for k ≈ log 2 (n) </li></ul></ul><ul><ul><li>Selection: the leftmost 0, the rightmost 1 or something else </li></ul></ul>12/19/11 Middleware, CCNT, ZJU U <ul><ul><li>The most practical part is over, and the left is very complicated taken for proving and error analysis, namely all about mathematic </li></ul></ul>for i:=1 to L do bitmap[i] :=0 for all x in M do begin index := p(hash(x)); if bitmap[index] = the bitmap[index] :=1; end
  12. 12. Flajolet-Martin Sketch(Cont.) <ul><li>Conclusion </li></ul><ul><ul><li>Analysis </li></ul></ul><ul><ul><ul><li>Bit-based, reducing the space complexity by constant level </li></ul></ul></ul><ul><ul><ul><li>Space complexity O(log(n))  O(log(log(n))) </li></ul></ul></ul><ul><ul><ul><li>Duplicate-insensitive  duplicate-resilient and flexible </li></ul></ul></ul><ul><ul><ul><li>Order-insensitive  stable and robust </li></ul></ul></ul><ul><ul><ul><li>Additivity  The ability to merge two FM sketches together, and the merger is simply the bitwise-or of each pair of corresponding bitmaps </li></ul></ul></ul><ul><ul><li>Questions </li></ul></ul><ul><ul><ul><li>How to make the value of U? What’s the relationship of U and n? </li></ul></ul></ul><ul><ul><ul><li>How to make the analysis to the error?  log(αn) </li></ul></ul></ul><ul><ul><ul><li>… </li></ul></ul></ul>12/19/11 Middleware, CCNT, ZJU nice qualities for distributed aggregation
  13. 13. Question <ul><li>Is the CFV in my last report a kind of sketch? </li></ul><ul><ul><li>Yes, I think so </li></ul></ul><ul><li>What’s the relationship between sketch and skyline? </li></ul><ul><ul><li>Are they the same? No, just trust me </li></ul></ul><ul><ul><li>I hold the opposite opinion </li></ul></ul><ul><li>Does the aggregation computation belong to the research fields of data mining ? </li></ul><ul><ul><li>No, I suppose it doesn’t and I don’t care </li></ul></ul>12/19/11 Middleware, CCNT, ZJU What’s skyline?
  14. 14. <ul><li>Q&A </li></ul>12/19/11 Middleware, CCNT, ZJU

×