Distributed Streams


Distributed Streams: how they are constructed and managed, and the broader vision behind their use.

Published in: Technology

  1. 1. Distributed Streams [1] Presented by: Ashraf Bashir
  2. 2. Objectives <ul><li>Introduce the model of Distributed Streams. </li></ul><ul><li>Understand the main factors that affect the Distributed Stream model. </li></ul><ul><li>How to manage “One-shot queries” in Distributed Streams? </li></ul><ul><li>Handling “message loss” in Distributed Streams. </li></ul><ul><li>Introduce “decentralized computation techniques” in Distributed Streams. </li></ul><ul><li>A brief overview of “Aggregate Computation” and the main problems that arise in it. </li></ul><ul><li>Minos Garofalakis’s [2] outlook on the future of Distributed Streams Systems </li></ul>1/36
  3. 3. Agenda <ul><li>Data Stream Model </li></ul><ul><li>Distributed Stream Model </li></ul><ul><ul><li>Requirements for stream synopses </li></ul></ul><ul><li>Distributed Stream Querying Space </li></ul><ul><ul><li>Querying Model </li></ul></ul><ul><ul><li>Class of Queries </li></ul></ul><ul><ul><li>Communication Model </li></ul></ul><ul><li>Distributed Stream Processing </li></ul><ul><ul><li>One-shot distributed-stream querying </li></ul></ul><ul><ul><ul><li>Tree-based aggregation </li></ul></ul></ul><ul><ul><ul><li>Loss </li></ul></ul></ul><ul><ul><ul><li>Decentralized computation and gossiping </li></ul></ul></ul><ul><ul><li>Continuous distributed-stream </li></ul></ul><ul><ul><ul><li>Polling </li></ul></ul></ul><ul><ul><ul><li>Tolerance </li></ul></ul></ul><ul><ul><li>Distributed Data Streams System </li></ul></ul><ul><li>Revisiting objectives </li></ul><ul><li>Future Work </li></ul><ul><li>References </li></ul>2/36
  4. 4. Data Stream Model 3/36
  5. 5. Distributed Data Stream Model 4/36
  6. 6. Requirements for stream synopses <ul><li>Single pass: </li></ul><ul><ul><li>Each record is examined at most once </li></ul></ul><ul><li>Small space: </li></ul><ul><ul><li>Memory used for storage and for messages should be kept as small as possible </li></ul></ul><ul><li>Small time: </li></ul><ul><ul><li>Low per-record processing time </li></ul></ul>5/36
  7. 7. Distributed Stream Querying Space 6/36
  8. 8. Distributed Stream Querying Space (cont’d) <ul><li>One-shot queries: </li></ul><ul><ul><li>Pull the query answer from the network on demand </li></ul></ul><ul><li>Continuous queries: </li></ul><ul><ul><li>Track/monitor the answer at the query site at all times </li></ul></ul><ul><ul><li>Detect anomalous behavior in (near) real time, i.e., “distributed triggers” </li></ul></ul><ul><ul><li>The main challenge is to minimize communication </li></ul></ul><ul><ul><li>May use one-shot algorithms as subroutines </li></ul></ul>7/36
  9. 9. Distributed Stream Querying Space (cont’d) <ul><ul><li>Minimizing communication often requires approximation </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><li>Continuously monitoring an average value: an exact answer requires sending every change, whereas an approximate answer needs only the ‘significant’ changes. </li></ul></ul>8/36
  10. 10. Distributed Stream Querying Space (cont’d) <ul><li>Simple algebraic vs. holistic aggregates [3] </li></ul><ul><ul><li>Holistic aggregates need the whole input to compute the query result (no summary suffices). E.g., for an exact count distinct , all distinct items must be remembered to tell whether a new item is distinct or not </li></ul></ul><ul><li>Duplicate-sensitive vs. duplicate-insensitive </li></ul><ul><ul><li>Duplicate sensitivity indicates whether the result of the aggregate is affected by a duplicated reading. </li></ul></ul><ul><ul><li>E.g., BINARY AND (duplicate-insensitive) vs. SUM (duplicate-sensitive) </li></ul></ul><ul><li>Complex queries </li></ul><ul><ul><li>E.g., distributed joins </li></ul></ul>9/36
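The duplicate-sensitivity distinction on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the readings are made-up values, and `bitwise_and` is a helper name introduced here, not something from the presentation.

```python
from functools import reduce

# Hypothetical sensor readings; the second list delivers the reading 7 twice.
readings = [6, 7, 14]
with_duplicate = readings + [7]


def bitwise_and(values):
    """BINARY AND over all values; x & x == x, so repeats cannot change it."""
    return reduce(lambda a, b: a & b, values)


# Duplicate-insensitive aggregates give the same answer on both lists.
print(bitwise_and(readings), bitwise_and(with_duplicate))
print(max(readings), max(with_duplicate))

# SUM is duplicate-sensitive: the repeated reading is counted twice.
print(sum(readings), sum(with_duplicate))
```

This matters later in the talk: when a network may deliver the same reading more than once, only duplicate-insensitive aggregates stay correct without extra bookkeeping.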
  11. 11. Distributed Stream Querying Space (cont’d) <ul><li>Topology: </li></ul><ul><li>Routing Schemes: </li></ul>10/36
  12. 12. Distributed Stream Querying Space (cont’d) <ul><li>Other network characteristics: </li></ul><ul><ul><li>Node failures </li></ul></ul><ul><ul><li>loss </li></ul></ul><ul><ul><li>… </li></ul></ul>11/36
  13. 13. One-shot distributed-stream querying <ul><li>Tree-based aggregation </li></ul><ul><li>Loss </li></ul><ul><li>Decentralized computation </li></ul><ul><li>Gossiping </li></ul>12/36
  14. 14. Tree-based aggregation <ul><li>The goal is for the root to compute a function of the data at the leaves </li></ul><ul><li>Trivial solution: push all data up the tree and compute at the base station </li></ul><ul><ul><li>Bottleneck at nodes near the root </li></ul></ul><ul><ul><li>A lot of communication </li></ul></ul>13/36
  15. 15. Tree-based aggregation (cont’d) <ul><li>Can do much better with “in-network” query processing </li></ul><ul><li>Example: computing max </li></ul><ul><li>Instead of sending all data to the root and having the root compute the max, each node hears from all of its children, computes the max and sends it to its parent. </li></ul>14/36
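The in-network max computation described on this slide can be sketched in Python as follows. The tree layout and readings are made-up illustrative values, and the sketch assumes (beyond what the slide states) that every node, not only the leaves, holds a reading of its own.

```python
def in_network_max(tree, readings, node):
    """Return (max over the subtree rooted at `node`, messages sent in it).

    Each node waits for one value per child, folds those values into its
    own reading with max(), and forwards a single value to its parent.
    """
    best = readings[node]
    messages = 0
    for child in tree.get(node, []):
        child_max, child_messages = in_network_max(tree, readings, child)
        best = max(best, child_max)
        messages += child_messages + 1  # one message from child to this node
    return best, messages


# Hypothetical routing tree: node -> list of children.
tree = {"root": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
readings = {"root": 4, "a": 9, "b": 1, "c": 12, "d": 3, "e": 7}

print(in_network_max(tree, readings, "root"))
```

Every non-root node sends exactly one message, so this tree needs 5 messages in total. Pushing every raw reading to the root hop by hop would instead cost one message per hop per reading (8 for this tree), with the load concentrated near the root.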
  16. 16. Tree-based aggregation (cont’d) <ul><li>Aggregates of interest </li></ul><ul><ul><li>SQL Primitives: </li></ul></ul><ul><ul><ul><li>min, max, sum, count, avg </li></ul></ul></ul><ul><ul><li>More complex: </li></ul></ul><ul><ul><ul><li>count distinct, range queries </li></ul></ul></ul><ul><ul><li>Data mining: </li></ul></ul><ul><ul><ul><li>association rules, clusterings </li></ul></ul></ul>15/36
  17. 17. Tree-based aggregation (cont’d) <ul><li>Formal framework for in-network aggregation ( GFE model “ generate / fuse / evaluate”) </li></ul><ul><ul><li>Define functions: </li></ul></ul><ul><ul><ul><li>Generate, g(i): </li></ul></ul></ul><ul><ul><ul><ul><li>take input, produce summary (at leaves) </li></ul></ul></ul></ul><ul><ul><ul><li>Fusion, f(x,y): </li></ul></ul></ul><ul><ul><ul><ul><li>merge two summaries (at internal nodes) </li></ul></ul></ul></ul><ul><ul><ul><li>Evaluate, e(x): </li></ul></ul></ul><ul><ul><ul><ul><li>output result (at root) </li></ul></ul></ul></ul>16/36
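As one possible instance of the GFE model, the sketch below computes an average. The choice of a (sum, count) pair as the summary is an assumption made here for illustration; the slide only defines the roles of g, f and e.

```python
def generate(value):
    """g(i): turn one leaf reading into a summary (sum, count)."""
    return (value, 1)


def fuse(x, y):
    """f(x, y): merge two summaries at an internal node."""
    return (x[0] + y[0], x[1] + y[1])


def evaluate(x):
    """e(x): produce the final answer at the root."""
    total, count = x
    return total / count


# Hypothetical leaf readings 10 and 20 meet at one internal node,
# and that partial summary is fused with a third leaf (60) at the root.
left = fuse(generate(10), generate(20))
root = fuse(left, generate(60))
print(evaluate(root))
```

The average itself cannot be fused directly (the average of two averages is wrong when the groups differ in size), which is why the summary carries the sum and the count and the division happens only in `evaluate`.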
  18. 18. Loss <ul><li>Unreliability </li></ul><ul><li>Tree aggregation techniques assumed a reliable network </li></ul><ul><ul><li>assuming no node failure </li></ul></ul><ul><ul><li>Assuming no loss of messages </li></ul></ul><ul><li>Failure can dramatically affect the computation </li></ul><ul><ul><li>E.g., sum – if a node near the root fails, then a whole subtree may be lost </li></ul></ul>17/36
  19. 19. Loss (cont’d) <ul><li>Unreliability </li></ul><ul><li>Failure detection </li></ul><ul><ul><li>Message Loss: </li></ul></ul><ul><ul><ul><li>Timeout </li></ul></ul></ul><ul><ul><li>Node Failure </li></ul></ul><ul><ul><ul><li>Keep-Alive Messages </li></ul></ul></ul><ul><li>Failure correction </li></ul><ul><ul><li>Message Loss: </li></ul></ul><ul><ul><ul><li>resending the message </li></ul></ul></ul><ul><ul><li>Node Failure: </li></ul></ul><ul><ul><ul><li>Rebuild the whole tree (if needed) </li></ul></ul></ul><ul><ul><ul><li>Rerun the protocol </li></ul></ul></ul>18/36
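The timeout-and-resend handling of message loss can be sketched as below. Everything here is a simulation: `send` is a hypothetical callable standing in for a real network call, and `make_lossy_channel` is a test helper invented for this example.

```python
def send_with_retry(send, message, max_attempts=3):
    """Resend `message` until `send` reports an acknowledgement.

    `send` is a hypothetical callable that returns True if an ack arrived
    before the timeout and False if the timeout expired.
    Returns the number of attempts used, or None if every attempt timed out.
    """
    for attempt in range(1, max_attempts + 1):
        if send(message):
            return attempt
    return None  # the caller may now suspect a node failure


def make_lossy_channel(drops):
    """Simulated channel that loses the first `drops` transmissions."""
    remaining = {"drops": drops}

    def send(message):
        if remaining["drops"] > 0:
            remaining["drops"] -= 1
            return False
        return True

    return send


print(send_with_retry(make_lossy_channel(2), "partial sum = 17"))
print(send_with_retry(make_lossy_channel(5), "partial sum = 17"))
```

Note that if only the acknowledgement is lost, a resend delivers the same partial result twice, which is harmless only for duplicate-insensitive aggregates.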
  20. 20. Loss (cont’d) <ul><li>Order and duplicate insensitivity ( ODI ) </li></ul><ul><ul><li>e.g., SUM is not ODI </li></ul></ul><ul><ul><li>e.g., MIN and MAX are ODI </li></ul></ul><ul><li>How to make ODI summaries for other aggregates? </li></ul><ul><ul><li>Example: transform “ count distinct ” to ODI using the FM sketch (Flajolet, Martin ‘85) </li></ul></ul>19/36
  21. 21. Loss (cont’d) <ul><li>The Flajolet-Martin sketch [4] </li></ul><ul><li>Target: </li></ul><ul><ul><li>Estimate the number of distinct inputs ( count distinct ) </li></ul></ul><ul><li>Given: </li></ul><ul><ul><li>A sequence of inputs x_i </li></ul></ul><ul><li>Steps: </li></ul><ul><ul><li>Use a hash function h that maps each input item to position i with probability 2^{-i} </li></ul></ul><ul><ul><li>Pr[ h(x) = 1 ] = 1/2, Pr[ h(x) = 2 ] = 1/4, </li></ul></ul><ul><ul><li>Pr[ h(x) = 3 ] = 1/8, … etc. </li></ul></ul><ul><ul><li>Construct an FM bitmap of length log N, where N is the number of inputs </li></ul></ul>20/36
  22. 22. Loss (cont’d) <ul><li>The Flajolet-Martin sketch (cont’d) </li></ul><ul><ul><li>For each incoming value x, set FM[h(x)] = 1 </li></ul></ul><ul><ul><li>The position of the least significant 0 in the bitmap indicates the logarithm of the number of distinct items seen. </li></ul></ul><ul><li>Accuracy: </li></ul><ul><ul><li>Repeating the sketch with independently chosen random hash functions and combining the results improves the accuracy </li></ul></ul>21/36
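A minimal Python sketch of the Flajolet-Martin steps above, under several assumptions that are not in the slides: SHA-256 stands in for the random hash functions, bit positions are 0-based (so position i is hit with probability 2^{-(i+1)}), 64 repetitions are averaged, and the estimate is divided by the correction factor 0.77351 from Flajolet and Martin's analysis.

```python
import hashlib

PHI = 0.77351  # correction factor from Flajolet and Martin's analysis


def fm_position(item, seed):
    """Map `item` to a 0-based bit position i with probability 2^-(i+1)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    if h == 0:
        return 63
    return (h & -h).bit_length() - 1  # number of trailing zero bits


def fm_sketch(items, num_hashes=64):
    """One bitmap (stored as an int) per hash function."""
    bitmaps = [0] * num_hashes
    for item in items:
        for seed in range(num_hashes):
            bitmaps[seed] |= 1 << fm_position(item, seed)
    return bitmaps


def fm_merge(a, b):
    """Fuse two sketches; bitwise OR is order- and duplicate-insensitive."""
    return [x | y for x, y in zip(a, b)]


def lowest_zero(bitmap):
    """Position of the least significant 0 bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r


def fm_estimate(bitmaps):
    """Average the lowest-zero positions, then undo the logarithm."""
    mean_r = sum(lowest_zero(b) for b in bitmaps) / len(bitmaps)
    return 2 ** mean_r / PHI


# Two sites see overlapping parts of a stream with 1000 distinct items.
site_a = fm_sketch(range(500))
site_b = fm_sketch(range(300, 1000))
print(round(fm_estimate(fm_merge(site_a, site_b))))
```

Because merging is a bitwise OR, the overlap between the two sites (items 300 to 499) and any resent messages leave the merged sketch unchanged, which is what makes the summary ODI. The printed value is an estimate, not the exact count.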
  23. 23. Decentralized Computation <ul><li>Concepts: </li></ul><ul><li>All participate in computation. </li></ul><ul><li>All get the result. </li></ul><ul><li>Anyone can talk to anyone else directly. </li></ul>22/36
  24. 24. Gossiping <ul><li>At each round, everyone who knows the data sends it to one of the n participants, chosen at random. </li></ul><ul><li>After O(log n) rounds, all n participants know the information with high probability. </li></ul>23/36
  25. 25. Aggregate Computation via Gossip <ul><li>Gossiping to exchange n secrets. </li></ul><ul><li>If we have an ODI summary, we can gossip with this: </li></ul><ul><ul><li>When new summary received, merge with current summary </li></ul></ul><ul><ul><li>ODI properties ensure repeated merging stays accurate </li></ul></ul><ul><li>After O(log n) rounds everyone knows the merged summary. </li></ul><ul><li>O(n log n) messages in total </li></ul>24/36
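The gossip-based aggregation on this slide can be simulated with MAX as the ODI summary. This is a sketch under assumptions: push-style gossip in synchronous rounds, a seeded pseudo-random generator, and a stopping test that peeks at the true maximum, which a real node could not do and which is used here only to end the simulation.

```python
import random


def gossip_max(values, seed=0, max_rounds=1000):
    """Simulate push gossip; returns (final per-node summaries, rounds used)."""
    rng = random.Random(seed)
    state = list(values)
    n = len(state)
    target = max(state)  # used only to decide when the simulation can stop
    rounds = 0
    while any(v != target for v in state) and rounds < max_rounds:
        snapshot = list(state)  # messages carry start-of-round summaries
        for sender in range(n):
            receiver = rng.randrange(n)
            # Merge step: max() is order- and duplicate-insensitive.
            state[receiver] = max(state[receiver], snapshot[sender])
        rounds += 1
    return state, rounds


values = list(range(64))  # hypothetical local readings, one per node
final_state, rounds = gossip_max(values)
print(rounds, set(final_state))
```

Each round costs n messages, so a run that finishes in O(log n) rounds sends O(n log n) messages in total, matching the bound on the slide.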
  26. 26. Continuous Distributed Model <ul><li>must continuously centralize all data. </li></ul><ul><li>Enormous communication overhead ! </li></ul>25/36
  27. 27. Continuous Distributed Model (cont’d) <ul><li>Continuous </li></ul><ul><ul><li>Real-time tracking, rather than one-shot query/response </li></ul></ul><ul><li>Distributed </li></ul><ul><ul><li>Each remote site only observes part of the global stream(s) </li></ul></ul><ul><li>Communication constraints: </li></ul><ul><ul><li>must minimize monitoring burden </li></ul></ul><ul><li>Streaming </li></ul><ul><ul><li>Each site sees a high-speed local data stream </li></ul></ul>26/36
  30. 30. Continuous Distributed Model (cont’d) <ul><li>Naïve solution </li></ul><ul><li>continuously centralize all data. </li></ul><ul><ul><li>Enormous communication overhead ! </li></ul></ul><ul><li>So what about polling ? </li></ul>29/36
  31. 31. Continuous Distributed Model (cont’d) <ul><li>Polling </li></ul><ul><li>Sometimes periodic polling suffices for simple tasks </li></ul><ul><li>Must balance polling frequency against communication: </li></ul><ul><ul><li>Very frequent polling causes high communication </li></ul></ul><ul><ul><li>Infrequent polling means delays in observing events </li></ul></ul><ul><li>Exact answers are not available at all times, so approximate answers must be acceptable. </li></ul>30/36
  32. 32. Continuous Distributed Model (cont’d) <ul><li>Because we allow approximation, there is slack . </li></ul><ul><li>The tolerance for error between the computed answer and the truth is either: </li></ul><ul><ul><li>Absolute: |Y – Y’| ≤ e </li></ul></ul><ul><ul><ul><li>Where e: slack </li></ul></ul></ul><ul><ul><ul><li>Y: computed answer </li></ul></ul></ul><ul><ul><ul><li>Y’: exact answer </li></ul></ul></ul><ul><ul><li>Relative: 1 – e ≤ Y’/Y ≤ 1 + e </li></ul></ul><ul><ul><ul><li>Where eY: slack, i.e., |Y – Y’| ≤ eY for positive Y </li></ul></ul></ul><ul><ul><ul><li>Y: computed answer </li></ul></ul></ul><ul><ul><ul><li>Y’: exact answer </li></ul></ul></ul>31/36
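One way to exploit an absolute slack e is for a remote site to stay silent until its local value drifts more than e away from the last value it reported. The sketch below illustrates that idea for a single site tracking a single value; the function name and the sample stream are made up for this example and are not taken from the presentation.

```python
def filtered_updates(stream, slack):
    """Return the (time, value) updates a site would actually transmit.

    The coordinator keeps the last transmitted value as its computed
    answer Y; the site sends a new value only when the exact value Y'
    violates |Y - Y'| <= slack.
    """
    last_sent = None
    sent = []
    for t, value in enumerate(stream):
        if last_sent is None or abs(value - last_sent) > slack:
            last_sent = value
            sent.append((t, value))
        # Invariant here: abs(value - last_sent) <= slack
    return sent


stream = [10, 11, 12, 15, 14, 9, 9.5]  # hypothetical local readings
print(filtered_updates(stream, slack=2))
# → [(0, 10), (3, 15), (5, 9)]
```

Three messages replace seven, and the coordinator's value is never more than 2 away from the true one. A larger slack saves more communication at the cost of a coarser answer, which is the trade-off the slide describes.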
  33. 33. Distributed Data Streams System <ul><li>Minos Garofalakis’s [2] outlook on the future of DSS </li></ul><ul><li>Main algorithmic idea: </li></ul><ul><ul><li>Trade off communication against approximation </li></ul></ul><ul><li>“ Unfortunately, approximate query processing tools are still not widely adopted in current Stream Processing Engines” [5] </li></ul><ul><li>“ More complex tools for approximate in-network data processing/collection have yet to gain wider acceptance” [5] </li></ul>32/36
  34. 34. Revisiting Objectives <ul><li>Introduce the concept of distributed streams </li></ul><ul><ul><li>The concept of distributed streams was explained, along with stream synopses and their requirements: “ pass ”, “ space ” and “ time ”. </li></ul></ul><ul><li>Understand the main factors that affect the Distributed Stream model. </li></ul><ul><ul><li>The querying model types lead to the conclusion that “approximation must be accepted for massive distributed data streams”. </li></ul></ul><ul><ul><li>Dealing with network topologies introduces the need for “ duplicate-insensitive ” aggregates; some simple algorithms were presented as examples. </li></ul></ul><ul><ul><li>Loss is expected in the Distributed Streams model. Both message loss and node failure have to be handled. </li></ul></ul>33/36
  35. 35. Revisiting Objectives (cont’d) <ul><li>How to manage “One-shot queries” in Distributed Streams? </li></ul><ul><ul><li>It is better to handle a “One-shot query” via “in-network” query processing whenever this is possible. </li></ul></ul><ul><li>Handling message loss in Distributed Streams. </li></ul><ul><ul><li>Unreliability is expected in Distributed Streams; mechanisms for failure detection and correction were introduced to handle it. </li></ul></ul><ul><li>Introduce decentralized computation techniques in Distributed Streams. </li></ul><ul><ul><li>Decentralized computation concepts were introduced with Gossiping . </li></ul></ul><ul><ul><li>Gossiping was used to solve the Aggregate Computation problem </li></ul></ul>34/36
  36. 36. Revisiting Objectives (cont’d) <ul><li>A brief overview of Aggregate Computation and the main problems that arise in it. </li></ul><ul><ul><li>Its behavior was studied </li></ul></ul><ul><ul><li>A naïve solution was introduced (continuously centralize all data) </li></ul></ul><ul><ul><li>The naïve solution was improved via polling </li></ul></ul><ul><ul><li>Polling implies accepting approximations </li></ul></ul><ul><li>Minos Garofalakis’s outlook on the future of Distributed Streams Systems </li></ul><ul><ul><li>Approximate query processing tools are still not widely adopted. </li></ul></ul>35/36
  37. 37. Future Work <ul><li>How can stream mining (clustering, associations, classification, change detection, etc.) be performed? </li></ul><ul><li>Can all queries be converted to ODI? </li></ul><ul><li>Compressing and Filtering XML Streams [6] </li></ul><ul><li>Graph-data streams [7] [8] </li></ul><ul><li>Extend ERD to express streams (ERD stream modeling) </li></ul><ul><li>A general distributed query language (dist-streamSQL?) </li></ul><ul><ul><li>Define a language so that a query optimizer can find a plan that guarantees good performance and low communication? </li></ul></ul>36/36
  38. 38. References <ul><li>[1] Minos Garofalakis, Processing Massive Data Streams, Yahoo Research & UC Berkeley, The VLDB School in Egypt, April 2008. </li></ul><ul><li>[2] Dr. Minos Garofalakis’s homepage, http://www.softnet.tuc.gr/~minos/ (last visited 25th May 2009) </li></ul><ul><li>[3] Graham Cormode, Minos N. Garofalakis, S. Muthukrishnan and Rajeev Rastogi, Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles, ACM SIGMOD Conference, Baltimore, Maryland, USA, 14th-16th June 2005. </li></ul><ul><li>[4] Graham Cormode, Fundamentals of Analyzing and Mining Data Streams, Workshop on Data Stream Analysis, San Leucio, Italy, 15th-16th March 2007 </li></ul><ul><li>[5] Quoted from reference 1, slide 185 </li></ul><ul><li>[6] Giovanni Guardalben, Compressing and Filtering XML Streams, The W3C Workshop on Binary Interchange of XML Information Item Sets, 24th-26th September 2003, Santa Clara, California, USA </li></ul><ul><li>[7] Jian Zhang, Massive Data Streams in Graph Theory and Computational Geometry, Yale University, New Haven, CT, USA, 2005 </li></ul><ul><li>[8] Prof. Dr. Joan Feigenbaum’s publications, http://www.cs.yale.edu/homes/jf/Massive-Data-Pubs.html (last visited 25th May 2009) </li></ul>
  39. 39. Arabic Dictionary <ul><li>Aggregation = تجميع </li></ul><ul><li>Anomalous = شاذ </li></ul><ul><li>Burden = عبء </li></ul><ul><li>Fuse = صهر </li></ul><ul><li>Gossiping = النميمة </li></ul><ul><li>Slack = تراخي / اهمال </li></ul><ul><li>Synopses = موجز / مختصر </li></ul><ul><li>Unreliability = عدم الاعتمادية / عدم الثقة </li></ul>
  40. 40. Acronyms <ul><li>DSS = Distributed Streams Systems </li></ul><ul><li>FM = Flajolet and Martin </li></ul><ul><li>GFE = Generate, fuse and evaluate </li></ul><ul><li>ODI = Order and duplicate insensitivity </li></ul>