Streaming Algorithms


An overview of streaming algorithms: what they are, the general principles behind them, and how they fit into a big data architecture, plus four specific examples of streaming algorithms and use cases.

Transcript

  • 1. Streaming Algorithms. Joe Kelley, Data Engineer. July 2013.
  • 2. Accelerating Your Time to Value. IMAGINE: Strategy and Roadmap. ILLUMINATE: Training and Education. IMPLEMENT: Hands-On Data Science and Data Engineering. Leading Provider of Data Science & Engineering for Big Analytics.
  • 3. What is a Streaming Algorithm?
    • Operates on a continuous stream of data
    • Unknown or infinite size
    • Only one pass; for each item, the options are:
      • Store it
      • Lose it
      • Store an approximation
    • Limited processing time per item
    • Limited total memory
    [Diagram: the input stream flows into the algorithm, which uses bounded memory and disk; standing queries and ad-hoc queries read its state to produce output]
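    To make the model concrete, here is a minimal sketch (mine, not from the deck) of the streaming pattern in Python: one pass, bounded memory, and a standing query answered from a small summary rather than from stored items. The stream source and the query are hypothetical.

        import random

        def stream():
            """Hypothetical unbounded source: yields one item at a time."""
            while True:
                yield random.randint(0, 9)

        # Bounded state: a running count and sum, not the items themselves.
        count, total = 0, 0
        for item in stream():
            count += 1      # one pass: each item is seen exactly once
            total += item   # keep a summary, then drop the item
            if count == 1_000_000:
                break

        # Standing query answered from the summary, not from stored data.
        print("running mean:", total / count)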
  • 4. Why use a Streaming Algorithm? Compare to the typical "Big Data" approach: store everything, analyze later, scale linearly.
    • Streaming pros:
      • Lower latency
      • Lower storage cost
    • Streaming cons:
      • Less flexibility
      • Lower precision (sometimes)
    • Answer? Why not both?
    [Diagram: the stream feeds a streaming algorithm, whose result is the initial answer; it also lands in long-term storage feeding a batch algorithm, whose result is the authoritative answer]
  • 5. General Techniques
    1. Tunable approximation
    2. Sampling
      • Sliding window
      • Fixed number
      • Fixed percentage
    3. Hashing: useful randomness
  • 6. Example 1: Sampling device error rates
    • Stream of (device_id, event, timestamp)
    • Scenario:
      • Not enough space to store everything
      • Simple queries → storing 1% is good enough
    [Data flow: three per-device streams, e.g.
      Device-1: (Device-1, event-1, 10001123), (Device-1, event-3, 10001126), (Device-1, event-1, 10001129), ...
      Device-2: (Device-2, event-2, 10001124), (Device-2, ERROR, 10001130), (Device-2, event-4, 10001132), ...
      Device-3: (Device-3, event-3, 10001122), (Device-3, event-1, 10001127), (Device-3, ERROR, 10001135), ...
    merged by timestamp into a single interleaved input stream]
  • 7. Example 1: Sampling device error rates
    • Stream of (device_id, event, timestamp)
    • Scenario:
      • Not enough space to store everything
      • Simple queries → storing 1% is good enough
    Algorithm:
      for each element e:
        with probability 0.01:
          store e
        else:
          throw out e
    Can lead to some insidious statistical "bugs"…
  • 8. Example 1: Sampling device error rates
    • Stream of (device_id, event, timestamp)
    • Scenario:
      • Not enough space to store everything
      • Simple queries → storing 1% is good enough
    Query: how many errors has the average device encountered?
    Answer:
      SELECT AVG(n) FROM (
        SELECT COUNT(*) AS n
        FROM events
        WHERE event = 'ERROR'
        GROUP BY device_id
      )
    Simple… but off by up to 100x: each device had only 1% of its events sampled. Can we just multiply by 100? No: devices whose errors were all dropped by sampling vanish from the GROUP BY entirely, so the average is taken over a biased subset of devices.
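    A quick way to see the bug is to simulate it. The sketch below (an illustration, not from the deck) compares the true average error count per device with the "sample 1% of events, multiply by 100" estimate; the device counts and error rate are made up.

        import random

        random.seed(42)
        NUM_DEVICES = 1000
        EVENTS_PER_DEVICE = 200   # each device emits 200 events
        ERROR_RATE = 0.02         # 2% of events are errors

        true_errors, sampled_errors = {}, {}
        for dev in range(NUM_DEVICES):
            for _ in range(EVENTS_PER_DEVICE):
                is_error = random.random() < ERROR_RATE
                if is_error:
                    true_errors[dev] = true_errors.get(dev, 0) + 1
                if random.random() < 0.01 and is_error:   # naive 1% sampling
                    sampled_errors[dev] = sampled_errors.get(dev, 0) + 1

        # True answer to the query: average error count over devices
        # that logged at least one error. Roughly 4 here.
        print(sum(true_errors.values()) / len(true_errors))

        # Naive estimate: average over devices that survive the sample, x100.
        # Devices whose errors were all dropped vanish from the denominator,
        # so this comes out far too high (roughly 100 here).
        print(100 * sum(sampled_errors.values()) / len(sampled_errors))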
  • 9. Example 1: Sampling device error rates
    • Stream of (device_id, event, timestamp)
    • Scenario:
      • Not enough space to store everything
      • Simple queries → storing 1% is good enough
    Better algorithm:
      for each element e:
        if (hash(e.device_id) mod 100) == 0:
          store e
        else:
          throw out e
    Hashing on device_id keeps every event for roughly 1% of devices, so per-device statistics stay unbiased. Choose what you hash on carefully… or keep separate samples hashed on each key you will query by.
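    A minimal runnable version of the hash-based sampler; a sketch only, where the field names, the stand-in event list, and the use of Python's hashlib are my choices, not the deck's.

        import hashlib

        def keep(device_id: str) -> bool:
            """Keep an event iff its device falls in the 1% sample bucket.
            Uses a stable hash (not Python's randomized hash()) so the same
            devices are sampled across runs and machines."""
            digest = hashlib.md5(device_id.encode()).digest()
            return int.from_bytes(digest[:8], "big") % 100 == 0

        events = [("Device-1", "event-1", 10001123),
                  ("Device-2", "ERROR", 10001130)]   # stand-in for the stream
        sample = []
        for device_id, event, ts in events:
            if keep(device_id):
                sample.append((device_id, event, ts))  # all events for sampled devices
            # everything else is thrown out
        print(sample)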
  • 10. Example 2: Sampling fixed number
    Want to sample a fixed count (k), not a fixed percentage.
    Algorithm:
      let arr = array of size k
      for each element e:
        if arr is not yet full:
          add e to arr
        else:
          with probability p:
            replace a random element of arr with e
          else:
            throw out e
    Choice of p is crucial:
    • p = constant → prefer more recent elements; higher p = more recent
    • p = k/n (n = items seen so far) → sample uniformly from the entire stream (reservoir sampling)
  • 11. CONFIDENTIAL | 11 Example 2: Sampling fixed number
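    As a concrete instance, here is standard reservoir sampling (Algorithm R), which is exactly what the p = k/n choice gives; a sketch with the stream stubbed out by a range.

        import random

        def reservoir_sample(stream, k):
            """Uniform sample of k items from a stream of unknown length."""
            arr = []
            for n, e in enumerate(stream, start=1):
                if len(arr) < k:
                    arr.append(e)                  # fill the reservoir first
                elif random.random() < k / n:      # p = k/n keeps the sample uniform
                    arr[random.randrange(k)] = e   # replace a random element with e
                # else: throw out e
            return arr

        print(reservoir_sample(range(1_000_000), k=5))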
  • 12. Example 3: Counting unique users
    • Input: stream of (user_id, action, timestamp)
    • Want to know how many distinct users are seen over a time period
    • Naïve approach: store all user_ids in a list/tree/hashtable
      • Millions of users = a lot of memory
    • Better approach: store all user_ids in a database
      • Good, but maybe it's not fast enough…
    • What if an approximate count is OK?
  • 13. Example 3: Counting unique users
    • Input: stream of (user_id, action, timestamp)
    • Want to know how many distinct users are seen over a time period
    • Approximate count is OK
    • Flajolet-Martin idea:
      • Hash each user_id into a bit string
      • Count the trailing zeros
      • Remember the maximum number of trailing zeros seen

      user_id      H(user_id)   trailing zeros   max(trailing zeros)
      john_doe     0111001001   0                0
      jane_doe     1011011100   2                2
      alan_t       0010111000   3                3
      EWDijkstra   1101011110   1                3
      jane_doe     1011011100   2                3
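    A minimal sketch of the idea with a single hash function; the choice of hash is mine, since the deck does not specify one.

        import hashlib

        def trailing_zeros(x: int) -> int:
            """Number of trailing zero bits in x (treat 0 as having none)."""
            return (x & -x).bit_length() - 1 if x else 0

        def h(user_id: str) -> int:
            digest = hashlib.md5(user_id.encode()).digest()
            return int.from_bytes(digest[:8], "big")

        max_tz = 0
        for user_id in ["john_doe", "jane_doe", "alan_t", "EWDijkstra", "jane_doe"]:
            max_tz = max(max_tz, trailing_zeros(h(user_id)))   # duplicates don't change it

        print("estimated distinct users:", 2 ** max_tz)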
  • 14. Example 3: Counting unique users
    • Input: stream of (user_id, action, timestamp)
    • Want to know how many distinct users are seen over a time period
    • Intuition:
      • If we had seen 2 distinct users, we would expect 1 trailing zero
      • If we had seen 4, we would expect 2 trailing zeros
      • If we had seen 2^k, we would expect k trailing zeros
      • In general, if the maximum number of trailing zeros seen is R, then 2^R is a reasonable estimate of the number of distinct users
    • Want more precision? Use more independent hash functions, and combine the results:
      • Median alone → only yields powers of two
      • Mean alone → subject to skew
      • Median of means of groups works well in practice
  • 15. Example 3: Counting unique users
    • Input: stream of (user_id, action, timestamp)
    • Want to know how many distinct users are seen over a time period
    Flajolet-Martin, all together:
      arr = int[k]
      for each item e:
        for i in 0...k-1:
          z = trailing_zeros(hash_i(e))
          if z > arr[i]:
            arr[i] = z
      means = group_means(arr)
      median = median(means)
      return pow(2, median)
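    The same estimator as runnable Python. The hash family (one salted hash) and the group size are illustrative choices of mine, not prescribed by the deck.

        import hashlib
        from statistics import mean, median

        K, GROUP = 12, 4   # K hash functions, in groups of 4 (illustrative values)

        def hash_i(i: int, item: str) -> int:
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()  # salted hash family
            return int.from_bytes(digest[:8], "big")

        def trailing_zeros(x: int) -> int:
            return (x & -x).bit_length() - 1 if x else 0

        def fm_estimate(stream):
            arr = [0] * K
            for e in stream:
                for i in range(K):
                    arr[i] = max(arr[i], trailing_zeros(hash_i(i, e)))
            # median of means of groups, as on the slide
            groups = [arr[g:g + GROUP] for g in range(0, K, GROUP)]
            return 2 ** median(mean(g) for g in groups)

        users = [f"user_{n % 500}" for n in range(10_000)]   # 500 distinct users
        print(fm_estimate(users))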
  • 16. Example 3: Counting unique users
    Flajolet-Martin in practice. The devil is in the details:
    • Tunable precision: more hash functions = more precise (see the paper for bounds on precision)
    • Tunable latency: more hash functions = higher latency; faster hash functions = lower latency
    • But: faster hash functions = more possibility of correlation = less precision
    Remember: streaming algorithm for a quick, imprecise answer; back-end batch algorithm for the slower, exact answer.
  • 17. Example 4: Counting Individual Item Frequencies
    Want to keep track of how many times each item has appeared in the stream.
    Many applications:
    • How popular is each search term?
    • How many times has this hashtag been tweeted?
    • Which IP addresses are DDoS'ing me?
    Again, two obvious approaches:
    • In-memory hashmap of item → count
    • Database
    But can we be more clever?
  • 18. Example 4: Counting Individual Item Frequencies
    Want to keep track of how many times each item has appeared in the stream.
    Idea:
    • Maintain an array of counts
    • Hash each item, increment the array at that index
    • To check the count of an item, hash it again and check the array at that index
    • Over-estimates because of hash collisions
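    A sketch of that single-array counter; the array width and hash are illustrative choices.

        import hashlib

        W = 1024                      # array width (illustrative)
        counts = [0] * W

        def idx(item: str) -> int:
            digest = hashlib.md5(item.encode()).digest()
            return int.from_bytes(digest[:8], "big") % W

        def add(item: str):
            counts[idx(item)] += 1

        def count(item: str) -> int:
            # Never under-estimates, but collisions inflate the answer.
            return counts[idx(item)]

        for tag in ["#bigdata", "#bigdata", "#streaming"]:
            add(tag)
        print(count("#bigdata"))      # 2, unless another tag collided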
  • 19. Example 4: Counting Individual Item Frequencies
    Count-Min Sketch algorithm:
    • Maintain a 2-d array of size w x d
    • Choose d different hash functions; each row in the array corresponds to one hash function
    • Hash each item with every hash function, increment the appropriate position in each row
    • To query an item, hash it d times again and take the minimum value across all rows
  • 20. Example 4: Counting Individual Item Frequencies
    Want to keep track of how many times each item has appeared in the stream.
    Count-Min Sketch, all together:
      arr = int[d][w]
      for each item e:
        for i in 0...d-1:
          j = hash_i(e) mod w
          arr[i][j]++

      def frequency(q):
        min = +infinity
        for i in 0...d-1:
          j = hash_i(q) mod w
          if arr[i][j] < min:
            min = arr[i][j]
        return min
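    The same thing as runnable Python. The dimensions and the salted hash family are illustrative; real implementations usually derive w and d from target error bounds.

        import hashlib

        class CountMinSketch:
            def __init__(self, w=1024, d=4):
                self.w, self.d = w, d
                self.arr = [[0] * w for _ in range(d)]

            def _j(self, i, item):
                digest = hashlib.md5(f"{i}:{item}".encode()).digest()  # salted hash family
                return int.from_bytes(digest[:8], "big") % self.w

            def add(self, item):
                for i in range(self.d):
                    self.arr[i][self._j(i, item)] += 1

            def frequency(self, q):
                # Min across rows: collisions only ever inflate counts,
                # so the smallest cell is the tightest over-estimate.
                return min(self.arr[i][self._j(i, q)] for i in range(self.d))

        cms = CountMinSketch()
        for term in ["cat", "dog", "cat", "cat"]:
            cms.add(term)
        print(cms.frequency("cat"))   # 3 (possibly more if collisions occur)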
  • 21. Example 4: Counting Individual Item Frequencies
    Count-Min Sketch in practice. The devil is in the details:
    • Tunable precision: bigger array = more precise (see the paper for bounds on precision)
    • Tunable latency: more hash functions = higher latency
    • Better at estimating more frequent items
    • Can subtract out an estimation of collisions
    Remember: streaming algorithm for a quick, imprecise answer; back-end batch algorithm for the slower, exact answer.
  • 22. Questions? Feel free to reach out:
    • www.thinkbiganalytics.com
    • joe.kelley@thinkbiganalytics.com
    • www.slideshare.net/jfkelley1
    References:
    • Cormode & Muthukrishnan, "An Improved Data Stream Summary: The Count-Min Sketch and its Applications": http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
    • Leskovec, Rajaraman & Ullman, Mining of Massive Datasets: http://infolab.stanford.edu/~ullman/mmds.html
    We're hiring! Engineers and Data Scientists.