Scalable real-time processing techniques
How to almost count
Lars Albertsson, Schibsted
“We promised to count live...
...but since you can’t do that, we used historical numbers and this cool math to extrapolate.”
?!?
Stream counting is simple 
You already have the building blocks 
Yet many wait for batch execution,
or jump through estimation hoops.
Accurate counting 
[Diagram: servers publish events to a bus; bucketisers partition the stream; an aggregator sums the buckets.]
● Straightforward, with some plumbing. 
● Heavier than you need.
Now or later? Exact or rough? 
Approximation now >> accurate later
Basic scenarios 
● How many distinct items in the last x minutes?
● What are the top k items in the last x minutes?
● How many Ys in the last x minutes?
These base techniques are sufficient for 
implementing e.g. personalisation and 
recommendation algorithms.
Cardinality - distinct stream count
● Naive: Set of hashes. X bits per item.
● Naive 2: Set approximation with Bloom filter + counter.
Counting in context 
● Look backward, different time windows, 
compare. 
● Count for a small time quantum, keep 
history. 
● Aggregate old windows. 
● Monoid representations are desirable, so that windows merge associatively (see the sketch below).
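A minimal sketch of the quantum-plus-merge idea, using collections.Counter addition as the monoid; the class, the quantum length, and the parameter names are assumptions, not from the slides:

    from collections import Counter, deque

    class WindowedCounts:
        """Count per small time quantum and merge quanta on demand.
        Counter addition is the monoid: associative, with the empty
        Counter as identity, so old windows aggregate freely."""

        def __init__(self, quanta=60):
            self.windows = deque(maxlen=quanta)  # oldest quanta fall off
            self.current = Counter()

        def add(self, item):
            self.current[item] += 1

        def roll(self):
            # Close the current quantum, e.g. once per minute.
            self.windows.append(self.current)
            self.current = Counter()

        def last(self, n):
            # "Last n quanta" is just a fold of the monoid merge.
            total = Counter()
            for window in list(self.windows)[-n:]:
                total += window
            return total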
Cardinality - distinct stream count (continued)
● Naive 3: Hash to bitmap. Count bits.
● Attempt 4: Hash to bitmap, count bits + collision compensation: the Linear Probabilistic Counter (see the sketch below).
● Read the papers… -> HyperLogLog counter.
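A minimal sketch of Attempt 4, the Linear Probabilistic Counter: hash into a bitmap, count set bits, and compensate for collisions by inverting the expected zero fraction. Bitmap size and hash choice are assumptions:

    import hashlib
    import math

    def linear_count(items, m=4096):
        """Linear Probabilistic Counter: hash items into an m-bit bitmap,
        then invert the expected zero fraction to compensate collisions."""
        bitmap = [0] * m
        for item in items:
            h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
            bitmap[h % m] = 1
        zeros = bitmap.count(0)
        if zeros == 0:
            return float(m)  # bitmap saturated; m was chosen too small
        # After n distinct items, E[zero fraction] = (1 - 1/m)^n ~ e^(-n/m),
        # so n ~ -m * ln(zeros / m).
        return -m * math.log(zeros / m)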
Cardinality - distinct stream count
[Chart omitted: cardinality estimates on Shakespeare's text. Source: highscalability.com]
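The slides stop at the name, so as a rough illustration only, here is a minimal HyperLogLog sketch. The hash choice, register count, and the omitted small-range and bias corrections are all assumptions:

    import hashlib

    def hyperloglog(items, b=10):
        """Rough HyperLogLog: 2^b registers, each keeping the maximum
        'leading zeros + 1' rank of the hashes routed to it."""
        m = 1 << b
        registers = [0] * m
        for item in items:
            h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
            h &= (1 << 64) - 1            # use 64 bits of the hash
            idx = h & (m - 1)             # low b bits pick a register
            w = h >> b                    # remaining 64 - b bits
            rank = (64 - b) - w.bit_length() + 1
            registers[idx] = max(registers[idx], rank)
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant for m >= 128
        return alpha * m * m / sum(2.0 ** -r for r in registers)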
Top K counting
(k = 5; Peps arrives and enters at the lowest count + 1, evicting Dolly; Dolly returns and re-enters at 19 + 1)

  U2      65   |   U2      65   |   U2      65
  Gaga    46   |   Gaga    46   |   Gaga    46
  Avicii  23   |   Avicii  23   |   Avicii  23
  Eminem  21   |   Eminem  21   |   Eminem  21
  Dolly   18   |   Peps    19   |   Dolly   20

● Keep k items; assume absentees have the lowest value.
● Accurate at the top, overcounting at the bottom (see the sketch below).
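The eviction rule above is essentially the Space-Saving algorithm. A minimal Python sketch that reproduces the slide's numbers:

    def top_k_update(counts, item, k=5):
        """One step of the bounded top-k counter (essentially the
        Space-Saving algorithm): absentees are assumed to have the
        current lowest count, hence overcounting at the bottom."""
        if item in counts or len(counts) < k:
            counts[item] = counts.get(item, 0) + 1
        else:
            victim = min(counts, key=counts.get)
            lowest = counts.pop(victim)
            counts[item] = lowest + 1  # assume the absentee had `lowest`
        return counts

    counts = {"U2": 65, "Gaga": 46, "Avicii": 23, "Eminem": 21, "Dolly": 18}
    top_k_update(counts, "Peps")   # evicts Dolly; Peps enters at 18 + 1 = 19
    top_k_update(counts, "Dolly")  # evicts Peps; Dolly re-enters at 19 + 1 = 20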
Approx counting - Count-Min Sketch
● Compute n hashes for the key.
● Increment one cell per row; the column is hash mod width.
● Retrieve by min() over the rows (see the sketch below).

   3    7   20    3   11    6  3+1    4    1    1
   3    8    6  2+1   17   13    1    0    4    5
  12    7    6   14    2    0    2    3  6+1    7
   3    2   12  8+1   10    2    7    2   11    2
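A minimal Count-Min Sketch in Python; the width, depth, and seeded-MD5 row hashes are assumptions, not from the slides:

    import hashlib

    class CountMinSketch:
        """Count-Min Sketch: depth rows of width counters, one seeded
        hash per row; estimates never undercount, only overcount."""

        def __init__(self, width=1024, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _columns(self, key):
            # One column per row: hash the key with a per-row seed,
            # take it mod the width.
            for row in range(self.depth):
                digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.width

        def add(self, key, count=1):
            # "Increment one cell per row."
            for row, col in enumerate(self._columns(key)):
                self.table[row][col] += count

        def estimate(self, key):
            # "Retrieve by min() over the rows": collisions only inflate
            # counters, so the minimum is the tightest estimate.
            return min(self.table[row][col]
                       for row, col in enumerate(self._columns(key)))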
Top K with Count-Min Sketch
(Peps is absent from the list, so its count comes from a CMS lookup rather than the lowest list value)

  U2      65   |   U2      65   |   U2      65
  Gaga    46   |   Gaga    46   |   Gaga    46
  Avicii  23   |   Avicii  23   |   Avicii  23
  Eminem  21   |   Eminem  21   |   Eminem  21
  Dolly   18   |   Peps     2   |   Dolly   19

● Keep a Heavy Hitters list.
● Look up absentees in the CMS.
● The risk of overcounting is smaller and spread out (sketch below).
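A sketch combining the two structures, reusing the CountMinSketch class above. The admission policy (evict the current minimum, enter at the CMS estimate) is one reading of the slide's numbers:

    class TopKWithCMS:
        """Heavy-hitters list backed by a Count-Min Sketch: the CMS counts
        everything cheaply; the list tracks only the current top k."""

        def __init__(self, k, sketch):
            self.k = k
            self.sketch = sketch
            self.heavy = {}  # item -> estimated count

        def update(self, item):
            self.sketch.add(item)
            if item in self.heavy:
                self.heavy[item] = self.sketch.estimate(item)
                return
            if len(self.heavy) >= self.k:
                # Evict the current minimum; the newcomer enters at its
                # CMS estimate rather than an assumed lowest-plus-one,
                # which is why overcounting is smaller and spread out.
                victim = min(self.heavy, key=self.heavy.get)
                del self.heavy[victim]
            self.heavy[item] = self.sketch.estimate(item)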
Cubic CMS
● Decorate each song event with geo, age, etc. Pour all key combinations into the CMS.
● Keep heavy hitters per geo and age group.

  *:*:<U2>       +1
  SE:*:<U2>      +1
  *:31-40:<U2>   +1
  SE:31-40:<U2>  +1
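A hypothetical helper showing how one play event fans out into every wildcard combination before being poured into the CMS; the function and key format mirror the slide:

    from itertools import product

    def cube_keys(song, **dims):
        """Fan one play event out into every wildcard combination,
        e.g. cube_keys("U2", geo="SE", age="31-40") yields
        *:*:<U2>, *:31-40:<U2>, SE:*:<U2>, SE:31-40:<U2>."""
        choices = [("*", value) for value in dims.values()]
        for combo in product(*choices):
            yield ":".join(combo) + f":<{song}>"

    # Each generated key gets +1 in the same Count-Min Sketch:
    # for key in cube_keys("U2", geo="SE", age="31-40"):
    #     cms.add(key)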
Machinery
O(10^4) messages/s per machine.
You probably only need one machine. If not, use Storm.
Read and write to a pub/sub channel, e.g. Kafka or ZeroMQ.
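As an illustration of that plumbing, a minimal sketch assuming the kafka-python client, a local broker, and hypothetical topic names "events" and "counts" (none of these are from the slides):

    import json
    from kafka import KafkaConsumer, KafkaProducer  # kafka-python

    consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    sketch = CountMinSketch()  # from the sketch earlier in the deck

    # Consume events, update the sketch, publish running estimates.
    for message in consumer:
        item = json.loads(message.value)["item"]
        sketch.add(item)
        estimate = {"item": item, "count": sketch.estimate(item)}
        producer.send("counts", json.dumps(estimate).encode())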
Brute force alternative
Dump every single message into Elasticsearch.
Suitable for high-dimensionality cubes.
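A minimal sketch of the brute-force path, assuming the official elasticsearch-py client (8.x signature) and a hypothetical index name "events":

    from elasticsearch import Elasticsearch  # elasticsearch-py, 8.x API

    es = Elasticsearch("http://localhost:9200")

    def ingest(message):
        # Every message becomes a document; counting across arbitrary
        # dimensions is then an aggregation query at read time.
        es.index(index="events", document=message)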
Recommendations, you said?
● Collaborative filtering - similarity matrix (Users × Items):

           Items
  Users    2 4 1 1 5 2
           0 1 7 1 0 6
           5 2 9 0 3 0
           3 8 0 6 0 7
Shave the matrix

Start from the similarity matrix (Users × Items):

  2 4 1 1 5 2
  0 1 7 1 0 6
  5 2 9 0 3 0
  3 8 0 6 0 7

Flatten it into (coordinate, value) pairs:

  0,0  3
  0,1  5
  0,2  0
  0,3  2
  1,0  8
  ...  ...

Flip and sort by value, then cut everything below the top entries:

  2,1  9
  1,0  8
  2,2  7
  5,0  7
  5,2  6

Rebuild a sparse matrix from the survivors - noise removed, fine for recommendations (see the sketch below):

  0 0 0 0 0 0
  0 0 7 0 0 6
  0 0 9 0 0 0
  0 8 0 0 0 7
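A minimal sketch of the flatten-sort-cut step in Python; the keep parameter and the tie-breaking are assumptions (the slide keeps the top five values):

    def shave(matrix, keep=5):
        """Flatten, sort by value, cut below the top `keep` entries,
        and rebuild a sparse matrix (ties broken arbitrarily)."""
        cells = [(value, r, c)
                 for r, row in enumerate(matrix)
                 for c, value in enumerate(row)]
        cells.sort(reverse=True)                    # "Flip Sort"
        top = {(r, c) for _, r, c in cells[:keep]}  # "Cut"
        return [[v if (r, c) in top else 0 for c, v in enumerate(row)]
                for r, row in enumerate(matrix)]

    matrix = [[2, 4, 1, 1, 5, 2],
              [0, 1, 7, 1, 0, 6],
              [5, 2, 9, 0, 3, 0],
              [3, 8, 0, 6, 0, 7]]
    shaved = shave(matrix)  # only the largest similarities survive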
Hungry for more?
Mikio Braun: http://www.berlinbuzzwords.de/session/real-time-personalization-and-recommendation-stream-mining
Ted Dunning on deep learning for real-time anomaly detection: http://www.berlinbuzzwords.de/session/deep-learning-high-performance-time-series-databases
Ted Dunning on Storm: http://www.youtube.com/watch?v=7PcmbI5aC20
Open source: stream-lib, Algebird
Want to work in this area? 
lalle@schibsted.com