Streaming Algorithms


An overview of streaming algorithms: what they are, the general principles behind them, and how they fit into a big data architecture, plus four specific examples of streaming algorithms and use cases.

Transcript

  • 1. Streaming Algorithms. Joe Kelley, Data Engineer. July 2013.
  • 2. Accelerating Your Time to Value. IMAGINE: Strategy and Roadmap. ILLUMINATE: Training and Education. IMPLEMENT: Hands-On Data Science and Data Engineering. Leading Provider of Data Science & Engineering for Big Analytics.
  • 3. What is a Streaming Algorithm?
    • Operates on a continuous stream of data
    • Unknown or infinite size
    • Only one pass; for each item, the options are:
      • Store it
      • Lose it
      • Store an approximation
    • Limited processing time per item
    • Limited total memory
    [Diagram: the input stream flows into the algorithm, which uses bounded memory and disk; standing queries and ad-hoc queries read its state to produce output]
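    To make the model concrete, here is a minimal sketch (mine, not from the deck) of the streaming pattern in Python: one pass, bounded memory, and a standing query answered from a small summary rather than from stored items. The stream source and the query are hypothetical.

        import random

        def stream():
            """Hypothetical unbounded source: yields one item at a time."""
            while True:
                yield random.randint(0, 9)

        # Bounded state: a running count and sum, not the items themselves.
        count, total = 0, 0
        for item in stream():
            count += 1      # one pass: each item is seen exactly once
            total += item   # keep a summary, then drop the item
            if count == 1_000_000:
                break

        # Standing query answered from the summary, not from stored data.
        print("running mean:", total / count)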
  • 4. Why use a Streaming Algorithm? Compare to the typical "Big Data" approach: store everything, analyze later, scale linearly.
    • Streaming pros:
      • Lower latency
      • Lower storage cost
    • Streaming cons:
      • Less flexibility
      • Lower precision (sometimes)
    • Answer? Why not both?
    [Diagram: the stream feeds a streaming algorithm, whose result is the initial answer; it also lands in long-term storage feeding a batch algorithm, whose result is the authoritative answer]
  • 5. General Techniques
    1. Tunable approximation
    2. Sampling
      • Sliding window
      • Fixed number
      • Fixed percentage
    3. Hashing: useful randomness
  • 6. Example 1: Sampling device error rates
    • Stream of (device_id, event, timestamp)
    • Scenario:
      • Not enough space to store everything
      • Simple queries → storing 1% is good enough
    [Data flow: three per-device streams, e.g.
      Device-1: (Device-1, event-1, 10001123), (Device-1, event-3, 10001126), (Device-1, event-1, 10001129), ...
      Device-2: (Device-2, event-2, 10001124), (Device-2, ERROR, 10001130), (Device-2, event-4, 10001132), ...
      Device-3: (Device-3, event-3, 10001122), (Device-3, event-1, 10001127), (Device-3, ERROR, 10001135), ...
    merged by timestamp into a single interleaved input stream]
  • 7. Example 1: Sampling device error rates
    • Stream of (device_id, event, timestamp)
    • Scenario:
      • Not enough space to store everything
      • Simple queries → storing 1% is good enough
    Algorithm:
      for each element e:
        with probability 0.01:
          store e
        else:
          throw out e
    Can lead to some insidious statistical "bugs"…
  • 8. Example 1: Sampling device error rates
    • Stream of (device_id, event, timestamp)
    • Scenario:
      • Not enough space to store everything
      • Simple queries → storing 1% is good enough
    Query: how many errors has the average device encountered?
    Answer:
      SELECT AVG(n) FROM (
        SELECT COUNT(*) AS n
        FROM events
        WHERE event = 'ERROR'
        GROUP BY device_id
      )
    Simple… but off by up to 100x: each device had only 1% of its events sampled. Can we just multiply by 100? No: devices whose errors were all dropped by sampling vanish from the GROUP BY entirely, so the average is taken over a biased subset of devices.
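    A quick way to see the bug is to simulate it. The sketch below (an illustration, not from the deck) compares the true average error count per device with the "sample 1% of events, multiply by 100" estimate; the device counts and error rate are made up.

        import random

        random.seed(42)
        NUM_DEVICES = 1000
        EVENTS_PER_DEVICE = 200   # each device emits 200 events
        ERROR_RATE = 0.02         # 2% of events are errors

        true_errors, sampled_errors = {}, {}
        for dev in range(NUM_DEVICES):
            for _ in range(EVENTS_PER_DEVICE):
                is_error = random.random() < ERROR_RATE
                if is_error:
                    true_errors[dev] = true_errors.get(dev, 0) + 1
                if random.random() < 0.01 and is_error:   # naive 1% sampling
                    sampled_errors[dev] = sampled_errors.get(dev, 0) + 1

        # True answer to the query: average error count over devices
        # that logged at least one error. Roughly 4 here.
        print(sum(true_errors.values()) / len(true_errors))

        # Naive estimate: average over devices that survive the sample, x100.
        # Devices whose errors were all dropped vanish from the denominator,
        # so this comes out far too high (roughly 100 here).
        print(100 * sum(sampled_errors.values()) / len(sampled_errors))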
  • 9. Example 1: Sampling device error rates
    • Stream of (device_id, event, timestamp)
    • Scenario:
      • Not enough space to store everything
      • Simple queries → storing 1% is good enough
    Better algorithm:
      for each element e:
        if (hash(e.device_id) mod 100) == 0:
          store e
        else:
          throw out e
    Hashing on device_id keeps every event for roughly 1% of devices, so per-device statistics stay unbiased. Choose what you hash on carefully… or keep separate samples hashed on each key you will query by.
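    A minimal runnable version of the hash-based sampler; a sketch only, where the field names, the stand-in event list, and the use of Python's hashlib are my choices, not the deck's.

        import hashlib

        def keep(device_id: str) -> bool:
            """Keep an event iff its device falls in the 1% sample bucket.
            Uses a stable hash (not Python's randomized hash()) so the same
            devices are sampled across runs and machines."""
            digest = hashlib.md5(device_id.encode()).digest()
            return int.from_bytes(digest[:8], "big") % 100 == 0

        events = [("Device-1", "event-1", 10001123),
                  ("Device-2", "ERROR", 10001130)]   # stand-in for the stream
        sample = []
        for device_id, event, ts in events:
            if keep(device_id):
                sample.append((device_id, event, ts))  # all events for sampled devices
            # everything else is thrown out
        print(sample)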
  • 10. Example 2: Sampling fixed number
    Want to sample a fixed count (k), not a fixed percentage.
    Algorithm:
      let arr = array of size k
      for each element e:
        if arr is not yet full:
          add e to arr
        else:
          with probability p:
            replace a random element of arr with e
          else:
            throw out e
    Choice of p is crucial:
    • p = constant → prefer more recent elements; higher p = more recent
    • p = k/n (n = items seen so far) → sample uniformly from the entire stream (reservoir sampling)
  • 11. CONFIDENTIAL | 11 Example 2: Sampling fixed number
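    As a concrete instance, here is standard reservoir sampling (Algorithm R), which is exactly what the p = k/n choice gives; a sketch with the stream stubbed out by a range.

        import random

        def reservoir_sample(stream, k):
            """Uniform sample of k items from a stream of unknown length."""
            arr = []
            for n, e in enumerate(stream, start=1):
                if len(arr) < k:
                    arr.append(e)                  # fill the reservoir first
                elif random.random() < k / n:      # p = k/n keeps the sample uniform
                    arr[random.randrange(k)] = e   # replace a random element with e
                # else: throw out e
            return arr

        print(reservoir_sample(range(1_000_000), k=5))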
  • 12. Example 3: Counting unique users
    • Input: stream of (user_id, action, timestamp)
    • Want to know how many distinct users are seen over a time period
    • Naïve approach: store all user_ids in a list/tree/hashtable
      • Millions of users = a lot of memory
    • Better approach: store all user_ids in a database
      • Good, but maybe it's not fast enough…
    • What if an approximate count is OK?
  • 13. Example 3: Counting unique users
    • Input: stream of (user_id, action, timestamp)
    • Want to know how many distinct users are seen over a time period
    • Approximate count is OK
    • Flajolet-Martin idea:
      • Hash each user_id into a bit string
      • Count the trailing zeros
      • Remember the maximum number of trailing zeros seen

      user_id      H(user_id)   trailing zeros   max(trailing zeros)
      john_doe     0111001001   0                0
      jane_doe     1011011100   2                2
      alan_t       0010111000   3                3
      EWDijkstra   1101011110   1                3
      jane_doe     1011011100   2                3
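    A minimal sketch of the idea with a single hash function; the choice of hash is mine, since the deck does not specify one.

        import hashlib

        def trailing_zeros(x: int) -> int:
            """Number of trailing zero bits in x (treat 0 as having none)."""
            return (x & -x).bit_length() - 1 if x else 0

        def h(user_id: str) -> int:
            digest = hashlib.md5(user_id.encode()).digest()
            return int.from_bytes(digest[:8], "big")

        max_tz = 0
        for user_id in ["john_doe", "jane_doe", "alan_t", "EWDijkstra", "jane_doe"]:
            max_tz = max(max_tz, trailing_zeros(h(user_id)))   # duplicates don't change it

        print("estimated distinct users:", 2 ** max_tz)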
  • 14. Example 3: Counting unique users
    • Input: stream of (user_id, action, timestamp)
    • Want to know how many distinct users are seen over a time period
    • Intuition:
      • If we had seen 2 distinct users, we would expect 1 trailing zero
      • If we had seen 4, we would expect 2 trailing zeros
      • If we had seen 2^k, we would expect k trailing zeros
      • In general, if the maximum number of trailing zeros seen is R, then 2^R is a reasonable estimate of the number of distinct users
    • Want more precision? Use more independent hash functions, and combine the results:
      • Median alone → only yields powers of two
      • Mean alone → subject to skew
      • Median of means of groups works well in practice
  • 15. Example 3: Counting unique users
    • Input: stream of (user_id, action, timestamp)
    • Want to know how many distinct users are seen over a time period
    Flajolet-Martin, all together:
      arr = int[k]
      for each item e:
        for i in 0...k-1:
          z = trailing_zeros(hash_i(e))
          if z > arr[i]:
            arr[i] = z
      means = group_means(arr)
      median = median(means)
      return pow(2, median)
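    The same estimator as runnable Python. The hash family (one salted hash) and the group size are illustrative choices of mine, not prescribed by the deck.

        import hashlib
        from statistics import mean, median

        K, GROUP = 12, 4   # K hash functions, in groups of 4 (illustrative values)

        def hash_i(i: int, item: str) -> int:
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()  # salted hash family
            return int.from_bytes(digest[:8], "big")

        def trailing_zeros(x: int) -> int:
            return (x & -x).bit_length() - 1 if x else 0

        def fm_estimate(stream):
            arr = [0] * K
            for e in stream:
                for i in range(K):
                    arr[i] = max(arr[i], trailing_zeros(hash_i(i, e)))
            # median of means of groups, as on the slide
            groups = [arr[g:g + GROUP] for g in range(0, K, GROUP)]
            return 2 ** median(mean(g) for g in groups)

        users = [f"user_{n % 500}" for n in range(10_000)]   # 500 distinct users
        print(fm_estimate(users))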
  • 16. Example 3: Counting unique users
    Flajolet-Martin in practice. The devil is in the details:
    • Tunable precision: more hash functions = more precise (see the paper for bounds on precision)
    • Tunable latency: more hash functions = higher latency; faster hash functions = lower latency
    • But: faster hash functions = more possibility of correlation = less precision
    Remember: streaming algorithm for a quick, imprecise answer; back-end batch algorithm for the slower, exact answer.
  • 17. Example 4: Counting Individual Item Frequencies
    Want to keep track of how many times each item has appeared in the stream.
    Many applications:
    • How popular is each search term?
    • How many times has this hashtag been tweeted?
    • Which IP addresses are DDoS'ing me?
    Again, two obvious approaches:
    • In-memory hashmap of item → count
    • Database
    But can we be more clever?
  • 18. Example 4: Counting Individual Item Frequencies
    Want to keep track of how many times each item has appeared in the stream.
    Idea:
    • Maintain an array of counts
    • Hash each item, increment the array at that index
    • To check the count of an item, hash it again and check the array at that index
    • Over-estimates because of hash collisions
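    A sketch of that single-array counter; the array width and hash are illustrative choices.

        import hashlib

        W = 1024                      # array width (illustrative)
        counts = [0] * W

        def idx(item: str) -> int:
            digest = hashlib.md5(item.encode()).digest()
            return int.from_bytes(digest[:8], "big") % W

        def add(item: str):
            counts[idx(item)] += 1

        def count(item: str) -> int:
            # Never under-estimates, but collisions inflate the answer.
            return counts[idx(item)]

        for tag in ["#bigdata", "#bigdata", "#streaming"]:
            add(tag)
        print(count("#bigdata"))      # 2, unless another tag collided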
  • 19. Example 4: Counting Individual Item Frequencies
    Count-Min Sketch algorithm:
    • Maintain a 2-d array of size w x d
    • Choose d different hash functions; each row in the array corresponds to one hash function
    • Hash each item with every hash function, increment the appropriate position in each row
    • To query an item, hash it d times again and take the minimum value across all rows
  • 20. Example 4: Counting Individual Item Frequencies
    Want to keep track of how many times each item has appeared in the stream.
    Count-Min Sketch, all together:
      arr = int[d][w]
      for each item e:
        for i in 0...d-1:
          j = hash_i(e) mod w
          arr[i][j]++

      def frequency(q):
        min = +infinity
        for i in 0...d-1:
          j = hash_i(q) mod w
          if arr[i][j] < min:
            min = arr[i][j]
        return min
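    The same thing as runnable Python. The dimensions and the salted hash family are illustrative; real implementations usually derive w and d from target error bounds.

        import hashlib

        class CountMinSketch:
            def __init__(self, w=1024, d=4):
                self.w, self.d = w, d
                self.arr = [[0] * w for _ in range(d)]

            def _j(self, i, item):
                digest = hashlib.md5(f"{i}:{item}".encode()).digest()  # salted hash family
                return int.from_bytes(digest[:8], "big") % self.w

            def add(self, item):
                for i in range(self.d):
                    self.arr[i][self._j(i, item)] += 1

            def frequency(self, q):
                # Min across rows: collisions only ever inflate counts,
                # so the smallest cell is the tightest over-estimate.
                return min(self.arr[i][self._j(i, q)] for i in range(self.d))

        cms = CountMinSketch()
        for term in ["cat", "dog", "cat", "cat"]:
            cms.add(term)
        print(cms.frequency("cat"))   # 3 (possibly more if collisions occur)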
  • 21. Example 4: Counting Individual Item Frequencies
    Count-Min Sketch in practice. The devil is in the details:
    • Tunable precision: bigger array = more precise (see the paper for bounds on precision)
    • Tunable latency: more hash functions = higher latency
    • Better at estimating more frequent items
    • Can subtract out an estimation of collisions
    Remember: streaming algorithm for a quick, imprecise answer; back-end batch algorithm for the slower, exact answer.
  • 22. Questions? Feel free to reach out:
    • www.thinkbiganalytics.com
    • joe.kelley@thinkbiganalytics.com
    • www.slideshare.net/jfkelley1
    References:
    • Cormode & Muthukrishnan, "An Improved Data Stream Summary: The Count-Min Sketch and its Applications": http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
    • Leskovec, Rajaraman & Ullman, Mining of Massive Datasets: http://infolab.stanford.edu/~ullman/mmds.html
    We're hiring! Engineers and Data Scientists.