Streaming Algorithms
An overview of streaming algorithms: what they are, the general principles behind them, and how they fit into a big data architecture. Also four specific examples of streaming algorithms and their use cases.

Presentation Transcript

Streaming Algorithms
Joe Kelley, Data Engineer
July 2013
Accelerating Your Time to Value
• IMAGINE: Strategy and Roadmap
• ILLUMINATE: Training and Education
• IMPLEMENT: Hands-On Data Science and Data Engineering
Leading Provider of Data Science & Engineering for Big Analytics
What is a Streaming Algorithm?
• Operates on a continuous stream of data of unknown or infinite size
• Only one pass over the data; for each item, the options are: store it, lose it, or store an approximation of it
• Limited processing time per item
• Limited total memory
(Diagram: the algorithm answers a standing query and ad-hoc queries over the input stream, producing output while using bounded memory and disk.)
Why use a Streaming Algorithm?
• Compare to the typical "Big Data" approach: store everything, analyze later, scale linearly
• Streaming pros: lower latency, lower storage cost
• Streaming cons: less flexibility, lower precision (sometimes)
• The answer? Why not both?
(Diagram: the input feeds a streaming algorithm that yields an initial answer immediately, while long-term storage feeds a batch algorithm that yields the authoritative answer.)
General Techniques
1. Tunable approximation
2. Sampling: sliding window, fixed number, or fixed percentage (a sliding-window sketch follows below; fixed percentage and fixed number are Examples 1 and 2)
3. Hashing: useful randomness
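Of the three sampling flavors, only the sliding window is not covered by the examples that follow. A minimal sketch, assuming a window defined by a fixed element count (the window_size name is illustrative):

    from collections import deque

    def sliding_window_sample(stream, window_size=1000):
        """Always hold exactly the most recent window_size elements."""
        window = deque(maxlen=window_size)  # old elements fall off automatically
        for e in stream:
            window.append(e)
            # ... answer queries against `window` here ...
        return list(window)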
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario: not enough space to store everything; the queries are simple → storing 1% is good enough
(Diagram: three per-device streams, e.g. (Device-1, event-1, 10001123), (Device-2, ERROR, 10001130), and (Device-3, ERROR, 10001135), interleaved by timestamp into a single input stream.)
Example 1: Sampling device error rates
Algorithm:
    for each element e:
        with probability 0.01:
            store e
        else:
            throw out e
This can lead to some insidious statistical "bugs"...
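A minimal runnable version of the sampler above (the function name and 1% rate parameter are illustrative):

    import random

    def sample_fixed_percentage(stream, rate=0.01):
        """Keep each element independently with probability `rate`."""
        for e in stream:
            if random.random() < rate:
                yield e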
Example 1: Sampling device error rates
Query: how many errors has the average device encountered?
Answer:
    SELECT AVG(n) FROM (
        SELECT COUNT(*) AS n
        FROM events
        WHERE event = 'ERROR'
        GROUP BY device_id
    )
Simple... but off by up to 100x: each device had only 1% of its events sampled. Can we just multiply by 100?
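To see why a flat x100 correction fails, here is a hypothetical simulation (the names and the assumption that every device has exactly five true errors are illustrative): devices whose sampled error count is zero vanish from the GROUP BY, so the average over the surviving devices is biased.

    import random

    def simulate(num_devices=10_000, true_errors=5, rate=0.01):
        sampled = [sum(random.random() < rate for _ in range(true_errors))
                   for _ in range(num_devices)]
        survivors = [n for n in sampled if n > 0]   # devices the query still sees
        avg = sum(survivors) / len(survivors)       # what AVG(n) computes
        print(f"true avg = {true_errors}, query avg = {avg:.2f}, "
              f"x100 correction = {avg * 100:.0f}")

    simulate()  # query avg comes out near 1.02, so x100 gives ~102, not 5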
Example 1: Sampling device error rates
Better algorithm:
    for each element e:
        if hash(e.device_id) mod 100 == 0:
            store e
        else:
            throw out e
Choose how to hash carefully... or keep a separate sample hashed on each key you may need to query.
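A runnable sketch of the per-device sampler (hashlib.md5 stands in for a stable hash; Python's built-in hash() is randomized per process, so it would not give a reproducible sample):

    import hashlib

    def keep_event(device_id, buckets=100):
        """Keep all events from ~1% of devices, none from the rest."""
        h = int(hashlib.md5(device_id.encode()).hexdigest(), 16)
        return h % buckets == 0

Because a kept device contributes all of its events, per-device statistics computed on the sample are exact for those devices.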
Example 2: Sampling a fixed number
Want to sample a fixed count (k), not a fixed percentage.
Algorithm:
    let arr = array of size k
    for each element e:
        if arr is not yet full:
            add e to arr
        else:
            with probability p:
                replace a random element of arr with e
            else:
                throw out e
The choice of p is crucial:
• p = constant → prefers more recent elements; higher p = more recent
• p = k/n → samples uniformly from the entire stream (this is reservoir sampling)
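A minimal runnable version of the uniform (p = k/n) case, classic reservoir sampling:

    import random

    def reservoir_sample(stream, k):
        """Uniform random sample of k elements from a stream of unknown length."""
        arr = []
        for n, e in enumerate(stream, start=1):
            if len(arr) < k:
                arr.append(e)                   # fill the reservoir first
            elif random.random() < k / n:       # keep element n with prob k/n
                arr[random.randrange(k)] = e    # evict a uniformly random slot
        return arr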
Example 2: Sampling a fixed number
(Diagram-only slide.)
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period
• Naïve approach: store all user_ids in a list/tree/hashtable; millions of users = a lot of memory
• Better approach: store all user_ids in a database; good, but maybe it's not fast enough...
• What if an approximate count is OK?
Example 3: Counting unique users
• Approximate count is OK
• Flajolet-Martin idea:
  • Hash each user_id into a bit string
  • Count the trailing zeros
  • Remember the maximum number of trailing zeros seen

    user_id      H(user_id)    trailing zeros   max(trailing zeros)
    john_doe     0111001001    0                0
    jane_doe     1011011100    2                2
    alan_t       0010111000    3                3
    EWDijkstra   1101011110    1                3
    jane_doe     1011011100    2                3
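Counting trailing zeros is a one-liner on integers; a small helper (the convention for input 0 is a choice, here it returns 0):

    def trailing_zeros(x):
        """Number of trailing zero bits in x (e.g. 0b1011011100 -> 2)."""
        if x == 0:
            return 0
        return (x & -x).bit_length() - 1   # x & -x isolates the lowest set bit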
Example 3: Counting unique users
• Intuition:
  • If we had seen 2 distinct users, we would expect 1 trailing zero
  • If we had seen 4, we would expect 2 trailing zeros
  • If we had seen 2^r, we would expect r trailing zeros
• In general, if the maximum number of trailing zeros seen is r, then 2^r is a reasonable estimate of the number of distinct users
• Want more precision? Use more independent hash functions, and combine the results:
  • Median alone → you only ever get powers of two
  • Mean alone → subject to skew from outliers
  • Median of means of groups works well in practice
Example 3: Counting unique users
Flajolet-Martin, all together (a runnable Python rendering of the slide's pseudocode; `hashes` is an assumed list of k independent hash functions, and trailing_zeros is defined above):

    import statistics

    def fm_distinct_estimate(stream, hashes, group_size=5):
        k = len(hashes)
        arr = [0] * k                        # max trailing zeros per hash fn
        for e in stream:
            for i in range(k):
                z = trailing_zeros(hashes[i](e))
                if z > arr[i]:
                    arr[i] = z
        # median of the group means, then convert exponent to a count
        means = [statistics.mean(arr[i:i + group_size])
                 for i in range(0, k, group_size)]
        return 2 ** statistics.median(means)
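The slides leave the hash family unspecified; one illustrative way to build it (salting a single strong hash, an assumption, not from the deck):

    import hashlib

    def make_hashes(k):
        """k deterministic, roughly independent hash functions: item -> int."""
        def make(i):
            return lambda e: int(hashlib.md5(f"{i}:{e}".encode()).hexdigest(), 16)
        return [make(i) for i in range(k)]

Usage would then be: estimate = fm_distinct_estimate(user_ids, make_hashes(20)).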
Example 3: Counting unique users
Flajolet-Martin in practice:
• The devil is in the details
• Tunable precision: more hash functions = more precise (see the paper for bounds on precision)
• Tunable latency: more hash functions = higher latency; faster hash functions = lower latency, but more possibility of correlation = less precision
Remember: use the streaming algorithm for a quick, imprecise answer and a back-end batch algorithm for the slower, exact answer.
Example 4: Counting individual item frequencies
Want to keep track of how many times each item has appeared in the stream. Many applications:
• How popular is each search term?
• How many times has this hashtag been tweeted?
• Which IP addresses are DDoS'ing me?
Again, two obvious approaches: an in-memory hashmap of item → count, or a database. But can we be more clever?
Example 4: Counting individual item frequencies
Idea:
• Maintain an array of counts
• Hash each item; increment the array at that index
• To check the count of an item, hash it again and read the array at that index
• Over-estimates because of hash collisions
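A sketch of this single-array counter (the width and the md5-based hash are illustrative choices):

    import hashlib

    class SingleRowCounter:
        def __init__(self, width=1024):
            self.width = width
            self.counts = [0] * width

        def _index(self, item):
            return int(hashlib.md5(str(item).encode()).hexdigest(), 16) % self.width

        def add(self, item):
            self.counts[self._index(item)] += 1

        def estimate(self, item):
            # never under-estimates: collisions only add to a bucket
            return self.counts[self._index(item)]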
Example 4: Counting individual item frequencies
Count-Min Sketch algorithm:
• Maintain a 2-d array of size w x d
• Choose d different hash functions; each row of the array corresponds to one hash function
• Hash each item with every hash function, and increment the appropriate position in each row
• To query an item, hash it with all d functions again and take the minimum value across the rows
Example 4: Counting individual item frequencies
Count-Min Sketch, all together (runnable Python; `hashes` is an assumed list of d hash functions, and the slide's `frequency` hashed `e` where it should hash the query item `q`):

    d, w = 4, 1000
    arr = [[0] * w for _ in range(d)]

    def add(e):
        for i in range(d):
            j = hashes[i](e) % w
            arr[i][j] += 1

    def frequency(q):
        # the minimum across rows bounds the over-count from collisions
        return min(arr[i][hashes[i](q) % w] for i in range(d))
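Reusing the hypothetical make_hashes helper from Example 3, usage looks like:

    hashes = make_hashes(d)
    for term in ["cat", "dog", "cat"]:
        add(term)
    print(frequency("cat"))   # 2, or an over-estimate if collisions occurred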
Example 4: Counting individual item frequencies
Count-Min Sketch in practice:
• The devil is in the details
• Tunable precision: a bigger array = more precise (see the paper for bounds on precision)
• Tunable latency: more hash functions = higher latency
• Better at estimating more frequent items
• Can subtract out an estimate of the collisions
Remember: use the streaming algorithm for a quick, imprecise answer and a back-end batch algorithm for the slower, exact answer.
Questions?
• Feel free to reach out:
  • www.thinkbiganalytics.com
  • joe.kelley@thinkbiganalytics.com
  • www.slideshare.net/jfkelley1
• References:
  • Cormode & Muthukrishnan, "An Improved Data Stream Summary: The Count-Min Sketch and its Applications": http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
  • Rajaraman & Ullman, "Mining of Massive Datasets": http://infolab.stanford.edu/~ullman/mmds.html
We're hiring! Engineers and Data Scientists.