Distributed Streams

Distributed streams: how they are constructed and managed, and the broader vision behind their use.

Transcript

  • 1. Distributed Streams [1] Presented by: Ashraf Bashir
  • 2. Objectives
    • Introduce the model of Distributed Streams.
    • Understand the main factors that affect the Distributed Stream model.
    • How to manage “One-shot queries” in Distributed Streams?
    • Handling “message loss” in Distributed Streams.
    • Introduce “decentralized computations techniques” in Distributed Streams.
    • A brief overview on “Aggregate Computation” and main problems that exist in it.
    • Overview of Minos Garofalakis’s view [2] on the future of Distributed Stream Systems
  • 3. Agenda
    • Data Stream Model
    • Distributed Stream Model
      • Requirements for stream synopses
    • Distributed Stream Querying Space
      • Querying Model
      • Class of Queries
      • Communication Model
    • Distributed Stream Processing
      • One-shot distributed-stream querying
        • Tree-based aggregation
        • Loss
        • Decentralized computation and gossiping
      • Continuous distributed-stream querying
        • Polling
        • Tolerance
      • Distributed Data Streams System
    • Revisiting objectives
    • Future Work
    • References
  • 4. Data Stream Model
  • 5. Distributed Data Stream Model
  • 6. Requirements for stream synopses
    • Single Pass:
      • Each record is examined at most once
    • Small Space:
      • Memory used and communicated should be kept as small as possible
    • Small Time:
      • Low per-record processing time
  • 7. Distributed Stream Querying Space
  • 8. Distributed Stream Querying Space (cont’d)
    • One-shot queries:
      • On-demand pull query answer from network
    • Continuous queries:
      • Track/monitor answer at query site at all times
      • Detect anomalous behavior in (near) real-time, i.e., “Distributed triggers”
      • Main challenge is to minimize communication
      • May use one-shot algorithms as subroutines
  • 9. Distributed Stream Querying Space (cont’d)
      • Minimizing communication often needs approximation
      • Example:
      • Continuously monitoring an average value: an exact answer would require sending every change, while an approximate answer only needs the “significant” changes.
  • 10. Distributed Stream Querying Space (cont’d)
    • Simple algebraic vs. holistic aggregates [3]
      • Holistic aggregates need the whole input to compute the query result (no small summary suffices). E.g., count distinct: to tell whether a new item is distinct, all distinct items seen so far must be remembered.
    • Duplicate-sensitive vs. duplicate-insensitive
      • Duplicate sensitivity indicates whether the result of the aggregation is affected by duplicated readings.
      • E.g., BINARY AND vs SUM
    • Complex queries
      • E.g., distributed joins
  • 11. Distributed Stream Querying Space (cont’d)
    • Topology:
    • Routing Schemes:
  • 12. Distributed Stream Querying Space (cont’d)
    • Other network characteristics:
      • Node failures
      • Message loss
  • 13. One-shot distributed-stream querying
    • Tree-based aggregation
    • Loss
    • Decentralized computation
    • Gossiping
  • 14. Tree-based aggregation
    • Goal is for root to compute a function of data at leaves
    • Trivial solution: push all data up the tree and compute at the base station
      • Bottleneck at nodes near the root
      • A lot of communication
  • 15. Tree-based aggregation (cont’d)
    • Can do much better with “in-network” query processing
    • Example: computing max
    • Instead of sending all data to the root, which then computes the max, each node hears from all of its children, computes the max, and sends only that value to its parent (see the sketch below).
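
    A minimal sketch of this in-network max computation (illustrative Python, not from the slides; the tree layout and names are hypothetical):

      # Each node forwards a single value: the max of its own reading
      # and the maxes reported by its children.
      def aggregate_max(node, children, readings):
          child_maxes = [aggregate_max(c, children, readings)
                         for c in children.get(node, [])]
          return max([readings[node]] + child_maxes)

      # Hypothetical tree: root 0 with children 1 and 2; node 1 has children 3, 4.
      children = {0: [1, 2], 1: [3, 4]}
      readings = {0: 5, 1: 3, 2: 9, 3: 7, 4: 2}
      print(aggregate_max(0, children, readings))  # -> 9

    Each link carries one value instead of a whole subtree’s data, which removes the near-root bottleneck.
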
  • 16. Tree-based aggregation (cont’d)
    • Aggregates of interest
      • SQL Primitives:
        • min, max, sum, count, avg
      • More complex:
        • count distinct, range queries
      • Data mining:
        • association rules, clusterings
  • 17. Tree-based aggregation (cont’d)
    • Formal framework for in-network aggregation (the GFE model: “generate / fuse / evaluate”); an AVG sketch follows this slide
      • Define functions:
        • Generate, g(i):
          • take input, produce summary (at leaves)
        • Fusion, f(x,y):
          • merge two summaries (at internal nodes)
        • Evaluate, e(x):
          • output result (at root)
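
    As an illustration of the GFE pattern (a sketch with hypothetical names; the slides give no code), AVG can be computed in-network by generating (sum, count) pairs at the leaves, fusing them by component-wise addition, and evaluating only at the root:

      from functools import reduce

      def generate(value):            # at leaves: raw reading -> summary
          return (value, 1)           # (sum, count)

      def fuse(x, y):                 # at internal nodes: merge two summaries
          return (x[0] + y[0], x[1] + y[1])

      def evaluate(x):                # at root: summary -> final answer
          return x[0] / x[1]

      summaries = [generate(v) for v in [4, 8, 6]]
      print(evaluate(reduce(fuse, summaries)))  # -> 6.0
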
  • 18. Loss
    • Unreliability
    • Tree aggregation techniques assumed a reliable network:
      • no node failures
      • no message loss
    • Failure can dramatically affect the computation
      • E.g., sum – if a node near the root fails, then a whole subtree may be lost
  • 19. Loss (cont’d)
    • Unreliability
    • Failure detection
      • Message Loss:
        • Timeout
      • Node Failure:
        • Keep-Alive Messages
    • Failure correction
      • Message Loss:
        • resending the message
      • Node Failure:
        • Rebuild the whole tree (if needed)
        • Rerun the protocol
  • 20. Loss (cont’d)
    • Order and duplicate insensitivity (ODI)
      • e.g., SUM is not ODI
      • e.g., MIN and MAX are ODIs
    • How to make ODI summaries for other aggregates?
      • Example: transform “count distinct” into an ODI summary using the FM sketch (Flajolet and Martin, 1985)
  • 21. Loss (cont’d)
    • The Flajolet-Martin sketch [4]
    • Target:
      • Estimates number of distinct inputs ( count distinct )
    • Given:
      • A sequence of inputs x_i
    • Steps:
      • Use a hash function h mapping input items to level i with probability 2^(-i):
        • Pr[h(x) = 1] = 1/2, Pr[h(x) = 2] = 1/4, Pr[h(x) = 3] = 1/8, etc.
      • Construct an FM bitmap of length log N, where N is the number of inputs
  • 22. Loss (cont’d)
    • The Flajolet-Martin sketch (cont’d)
      • For each incoming value x, set FM[h(x)] = 1
      • The position of the least significant 0 in the bitmap indicates (up to a correction factor) the logarithm of the number of distinct items seen.
    • Accuracy:
      • Taking repetitions with randomly chosen hash functions improves the accuracy (see the sketch after this slide)
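
    A minimal sketch of the FM idea (assumptions: Python’s built-in hash stands in for a random hash function, 32 levels, and the standard 0.77351 correction factor; a production version would use pairwise-independent hashes and average many repetitions):

      import random

      def fm_level(x, salt=0):
          # Map x to level i with probability ~2^(-i): one plus the
          # number of trailing zero bits of the hash value.
          h = hash((salt, x)) & 0xFFFFFFFF
          i = 1
          while h & 1 == 0 and i < 32:
              h >>= 1
              i += 1
          return i

      def fm_update(bitmap, x):
          bitmap[fm_level(x) - 1] = 1

      def fm_merge(b1, b2):
          # Bitwise OR makes the summary order- and duplicate-insensitive
          # (ODI): re-sent or re-ordered readings cannot change the result.
          return [a | b for a, b in zip(b1, b2)]

      def fm_estimate(bitmap):
          r = bitmap.index(0)         # position of the least significant 0
          return 2 ** r / 0.77351     # FM correction factor

      bitmap = [0] * 32
      for x in (random.randint(0, 999) for _ in range(5000)):
          fm_update(bitmap, x)
      print(round(fm_estimate(bitmap)), "~ number of distinct values (<= 1000)")
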
  • 23. Decentralized Computation
    • Concepts:
      • All participate in the computation.
      • All get the result.
      • Anyone can talk to anyone else directly.
  • 24. Gossiping
    • At each round, everyone who knows the data sends it to one of the n participants, chosen at random.
    • After O(log n) rounds, all n participants know the information (a toy simulation follows this slide).
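
    A toy simulation of this spreading process (illustrative, not from the slides): every participant who knows the rumor pushes it to one uniformly random peer per round.

      import math
      import random

      def gossip_rounds(n):
          informed = {0}              # one participant starts with the data
          rounds = 0
          while len(informed) < n:
              pushes = {random.randrange(n) for _ in informed}
              informed |= pushes
              rounds += 1
          return rounds

      n = 1024
      print(gossip_rounds(n), "rounds; log2(n) =", math.log2(n))
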
  • 25. Aggregate Computation via Gossip
    • Gossiping to exchange n secrets.
    • If we have an ODI summary, we can gossip with this:
      • When new summary received, merge with current summary
      • ODI properties ensure repeated merging stays accurate
    • After O(log n) rounds everyone knows the merged summary.
    • O(n log n) messages in total (a merge-based sketch follows this slide)
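
    A sketch of gossip-based aggregation with an ODI merge (MAX is used here because it is trivially order- and duplicate-insensitive; an FM bitmap merged by OR would work the same way; all names are illustrative):

      import random

      def gossip_aggregate(values, rounds):
          summaries = list(values)    # each site starts from its own reading
          n = len(summaries)
          for _ in range(rounds):
              for i in range(n):
                  j = random.randrange(n)                   # random partner
                  merged = max(summaries[i], summaries[j])  # ODI merge
                  summaries[i] = summaries[j] = merged
          return summaries

      vals = [random.random() for _ in range(64)]
      out = gossip_aggregate(vals, rounds=10)   # ~O(log n) rounds suffice
      print(all(s == max(vals) for s in out))   # -> True (w.h.p.)
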
  • 26. Continuous Distributed Model
    • Must continuously centralize all data.
    • Enormous communication overhead!
  • 27. Continuous Distributed Model (cont’d)
    • Continuous
      • Real-time tracking, rather than one-shot query/response
    • Distributed
      • Each remote site only observes part of the global stream(s)
    • Communication constraints:
      • must minimize monitoring burden
    • Streaming
      • Each site sees a high-speed local data stream
  • 30. Continuous Distributed Model (cont’d)
    • Naïve solution: continuously centralize all data.
      • Enormous communication overhead!
    • So what about polling?
  • 31. Continuous Distributed Model (cont’d)
    • Polling
    • Sometimes periodic polling suffices for simple tasks
    • Must balance polling frequency against communication:
      • Very frequent polling causes high communication
      • Infrequent polling means delays in observing events
    • Exact answers are not available at all times, so approximate answers must be acceptable.
  • 32. Continuous Distributed Model (cont’d)
    • Because we allow approximation, there is slack.
    • The tolerance for error between the computed answer Y and the exact answer Y’ can be:
      • Absolute: |Y – Y’| <= e
        • where the slack e is a fixed amount
      • Relative: |Y – Y’| <= e·Y’ (equivalently, Y/Y’ lies in [1 – e, 1 + e])
        • where the slack e·Y’ scales with the exact answer
    • A sketch of both checks follows this slide.
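
    Stated as simple checks (a sketch using the symbols above):

      def within_absolute(Y, Y_exact, e):
          # |Y - Y'| <= e: the slack is the fixed amount e
          return abs(Y - Y_exact) <= e

      def within_relative(Y, Y_exact, e):
          # |Y - Y'| <= e * Y': the slack scales with the exact answer
          return abs(Y - Y_exact) <= e * abs(Y_exact)

      print(within_absolute(103, 100, 5))      # True: off by 3 <= 5
      print(within_relative(103, 100, 0.02))   # False: 3% error > 2%
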
  • 33. Distributed Data Streams System
    • Overview of Minos Garofalakis’s view [2] on the future of DSS
    • Main algorithmic idea:
      • Trade off communication against approximation
    • “Unfortunately, approximate query processing tools are still not widely adopted in current Stream Processing Engines” [5]
    • “More complex tools for approximate in-network data processing/collection have yet to gain wider acceptance” [5]
  • 34. Revisiting Objectives
    • Introduce the concept of distributed streams
      • The concept of distributed streams was explained, along with its synopses and the relevant factors “pass”, “space” and “time”.
    • Understand the main factors that affect the Distributed Stream model.
      • Querying model types lead to the conclusion that “approximation must be accepted for distributed massive data streams”.
      • Dealing with network topologies introduces the need for “duplicate-insensitive” aggregates; some simple algorithms were introduced in this presentation as examples.
      • Loss is expected in the Distributed Streams model: both data loss and node failures have to be handled.
  • 35. Revisiting Objectives (cont’d)
    • How to manage “One-shot queries” in Distributed Streams?
      • It is better to handle a one-shot query via “in-network” query processing whenever possible.
    • Handling message loss in Distributed Streams.
      • Unreliability is expected behavior in Distributed Streams; mechanisms for detection and correction were introduced to handle it.
    • Introduce decentralized computation techniques in Distributed Streams.
      • Decentralized computation concepts were introduced together with gossiping.
      • Gossiping was used to solve the aggregate computation problem.
  • 36. Revisiting Objectives (cont’d)
    • A brief overview on Aggregate Computation and main problems that exist in it.
      • Its behavior was studied
      • A naïve solution was introduced (continuously centralize all data)
      • The naïve solution was improved via the polling technique
      • Polling implies accepting approximations
    • Overview of Minos Garofalakis’s view on the future of Distributed Stream Systems
      • Approximate query processing tools are still not widely adopted.
  • 37. Future Work
    • How can stream mining (clustering, association rules, classification, change detection, etc.) be performed?
    • Can all queries be converted to ODI summaries?
    • Compressing and Filtering XML Streams [6]
    • Graph-data streams [7] [8]
    • Extend ERD to express streams (ERD stream modeling)
    • A general distributed query language (dist-streamSQL?)
      • Define a language so that a query optimizer can find a plan that guarantees good performance and low communication
  • 38. References
    • [1] Minos Garofalakis. “Processing Massive Data Streams.” Yahoo Research & UC Berkeley, The VLDB School in Egypt, April 2008.
    • [2] Dr. Minos Garofalakis’s homepage, http://www.softnet.tuc.gr/~minos/ (last visited 25 May 2009).
    • [3] Graham Cormode, Minos N. Garofalakis, S. Muthukrishnan and Rajeev Rastogi. “Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles.” ACM SIGMOD Conference, Baltimore, Maryland, USA, 14–16 June 2005.
    • [4] Graham Cormode. “Fundamentals of Analyzing and Mining Data Streams.” Workshop on Data Stream Analysis, San Leucio, Italy, 15–16 March 2007.
    • [5] Quoted from reference [1], slide 185.
    • [6] Giovanni Guardalben. “Compressing and Filtering XML Streams.” W3C Workshop on Binary Interchange of XML Information Item Sets, Santa Clara, California, USA, 24–26 September 2003.
    • [7] Jian Zhang. “Massive Data Streams in Graph Theory and Computational Geometry.” Yale University, New Haven, CT, USA, 2005.
    • [8] Prof. Joan Feigenbaum’s publications, http://www.cs.yale.edu/homes/jf/Massive-Data-Pubs.html (last visited 25 May 2009).
  • 39. Arabic Dictionary
    • Aggregation = تجميع
    • Anomalous = شاذ
    • Burden = عبء
    • Fuse = صهر
    • Gossiping = النميمة
    • Slack = تراخي / اهمال
    • Synopses = موجز / مختصر
    • Unreliability = عدم الاعتمادية / عدم الثقة
  • 40. Acronyms
    • DSS = Distributed Stream Systems
    • FM = Flajolet and Martin
    • GFE = Generate, fuse and evaluate
    • ODI = Order and duplicate insensitivity