Scalable real-time processing techniques
How to almost count
Lars Albertsson, Schibsted
“We promised to count live...
...but since you can’t do that, we used historical numbers and this cool math to extrapolate.”
?!?
Stream counting is simple 
You already have the building blocks 
Yet many wait for batch execution,
or jump through estimation hoops.
Accurate counting 
[Diagram: servers publish events to a bus; bucketisers partition the stream; an aggregator sums the buckets.]
● Straightforward, with some plumbing. 
● Heavier than you need.
Now or later? Exact or rough? 
Approximation now >> accurate later
Basic scenarios 
● How many distinct items in the last x minutes?
● What are the top k items in the last x minutes?
● How many Ys in the last x minutes?
These base techniques are sufficient for 
implementing e.g. personalisation and 
recommendation algorithms.
Cardinality - distinct stream count
● Naive: Set of hashes. X bits per item.
● Naive 2: Set approximation with Bloom filter + counter.
Counting in context 
● Look backward, different time windows, 
compare. 
● Count for a small time quantum, keep 
history. 
● Aggregate old windows. 
● Monoid representations are desirable, so that windows merge associatively (see the sketch below).
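A minimal sketch of the quantum-plus-merge idea, using collections.Counter addition as the monoid; the class, the quantum length, and the parameter names are assumptions, not from the slides:

    from collections import Counter, deque

    class WindowedCounts:
        """Count per small time quantum and merge quanta on demand.
        Counter addition is the monoid: associative, with the empty
        Counter as identity, so old windows aggregate freely."""

        def __init__(self, quanta=60):
            self.windows = deque(maxlen=quanta)  # oldest quanta fall off
            self.current = Counter()

        def add(self, item):
            self.current[item] += 1

        def roll(self):
            # Close the current quantum, e.g. once per minute.
            self.windows.append(self.current)
            self.current = Counter()

        def last(self, n):
            # "Last n quanta" is just a fold of the monoid merge.
            total = Counter()
            for window in list(self.windows)[-n:]:
                total += window
            return total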
Cardinality - distinct stream count (continued)
● Naive 3: Hash to bitmap. Count bits.
● Attempt 4: Hash to bitmap, count bits + collision compensation: the Linear Probabilistic Counter (see the sketch below).
● Read the papers… -> HyperLogLog counter.
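A minimal sketch of Attempt 4, the Linear Probabilistic Counter: hash into a bitmap, count set bits, and compensate for collisions by inverting the expected zero fraction. Bitmap size and hash choice are assumptions:

    import hashlib
    import math

    def linear_count(items, m=4096):
        """Linear Probabilistic Counter: hash items into an m-bit bitmap,
        then invert the expected zero fraction to compensate collisions."""
        bitmap = [0] * m
        for item in items:
            h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
            bitmap[h % m] = 1
        zeros = bitmap.count(0)
        if zeros == 0:
            return float(m)  # bitmap saturated; m was chosen too small
        # After n distinct items, E[zero fraction] = (1 - 1/m)^n ~ e^(-n/m),
        # so n ~ -m * ln(zeros / m).
        return -m * math.log(zeros / m)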
Cardinality - distinct stream count
[Chart omitted: cardinality estimates on Shakespeare's text. Source: highscalability.com]
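The slides stop at the name, so as a rough illustration only, here is a minimal HyperLogLog sketch. The hash choice, register count, and the omitted small-range and bias corrections are all assumptions:

    import hashlib

    def hyperloglog(items, b=10):
        """Rough HyperLogLog: 2^b registers, each keeping the maximum
        'leading zeros + 1' rank of the hashes routed to it."""
        m = 1 << b
        registers = [0] * m
        for item in items:
            h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
            h &= (1 << 64) - 1            # use 64 bits of the hash
            idx = h & (m - 1)             # low b bits pick a register
            w = h >> b                    # remaining 64 - b bits
            rank = (64 - b) - w.bit_length() + 1
            registers[idx] = max(registers[idx], rank)
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant for m >= 128
        return alpha * m * m / sum(2.0 ** -r for r in registers)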
Top K counting
(k = 5; Peps arrives and enters at the lowest count + 1, evicting Dolly; Dolly returns and re-enters at 19 + 1)

  U2      65   |   U2      65   |   U2      65
  Gaga    46   |   Gaga    46   |   Gaga    46
  Avicii  23   |   Avicii  23   |   Avicii  23
  Eminem  21   |   Eminem  21   |   Eminem  21
  Dolly   18   |   Peps    19   |   Dolly   20

● Keep k items; assume absentees have the lowest value.
● Accurate at the top, overcounting at the bottom (see the sketch below).
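The eviction rule above is essentially the Space-Saving algorithm. A minimal Python sketch that reproduces the slide's numbers:

    def top_k_update(counts, item, k=5):
        """One step of the bounded top-k counter (essentially the
        Space-Saving algorithm): absentees are assumed to have the
        current lowest count, hence overcounting at the bottom."""
        if item in counts or len(counts) < k:
            counts[item] = counts.get(item, 0) + 1
        else:
            victim = min(counts, key=counts.get)
            lowest = counts.pop(victim)
            counts[item] = lowest + 1  # assume the absentee had `lowest`
        return counts

    counts = {"U2": 65, "Gaga": 46, "Avicii": 23, "Eminem": 21, "Dolly": 18}
    top_k_update(counts, "Peps")   # evicts Dolly; Peps enters at 18 + 1 = 19
    top_k_update(counts, "Dolly")  # evicts Peps; Dolly re-enters at 19 + 1 = 20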
Approx counting - Count-Min Sketch
● Compute n hashes for the key.
● Increment one cell per row; the column is hash mod width.
● Retrieve by min() over the rows (see the sketch below).

   3    7   20    3   11    6  3+1    4    1    1
   3    8    6  2+1   17   13    1    0    4    5
  12    7    6   14    2    0    2    3  6+1    7
   3    2   12  8+1   10    2    7    2   11    2
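A minimal Count-Min Sketch in Python; the width, depth, and seeded-MD5 row hashes are assumptions, not from the slides:

    import hashlib

    class CountMinSketch:
        """Count-Min Sketch: depth rows of width counters, one seeded
        hash per row; estimates never undercount, only overcount."""

        def __init__(self, width=1024, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _columns(self, key):
            # One column per row: hash the key with a per-row seed,
            # take it mod the width.
            for row in range(self.depth):
                digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.width

        def add(self, key, count=1):
            # "Increment one cell per row."
            for row, col in enumerate(self._columns(key)):
                self.table[row][col] += count

        def estimate(self, key):
            # "Retrieve by min() over the rows": collisions only inflate
            # counters, so the minimum is the tightest estimate.
            return min(self.table[row][col]
                       for row, col in enumerate(self._columns(key)))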
Top K with Count-Min Sketch
(Peps is absent from the list, so its count comes from a CMS lookup rather than the lowest list value)

  U2      65   |   U2      65   |   U2      65
  Gaga    46   |   Gaga    46   |   Gaga    46
  Avicii  23   |   Avicii  23   |   Avicii  23
  Eminem  21   |   Eminem  21   |   Eminem  21
  Dolly   18   |   Peps     2   |   Dolly   19

● Keep a Heavy Hitters list.
● Look up absentees in the CMS.
● The risk of overcounting is smaller and spread out (sketch below).
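A sketch combining the two structures, reusing the CountMinSketch class above. The admission policy (evict the current minimum, enter at the CMS estimate) is one reading of the slide's numbers:

    class TopKWithCMS:
        """Heavy-hitters list backed by a Count-Min Sketch: the CMS counts
        everything cheaply; the list tracks only the current top k."""

        def __init__(self, k, sketch):
            self.k = k
            self.sketch = sketch
            self.heavy = {}  # item -> estimated count

        def update(self, item):
            self.sketch.add(item)
            if item in self.heavy:
                self.heavy[item] = self.sketch.estimate(item)
                return
            if len(self.heavy) >= self.k:
                # Evict the current minimum; the newcomer enters at its
                # CMS estimate rather than an assumed lowest-plus-one,
                # which is why overcounting is smaller and spread out.
                victim = min(self.heavy, key=self.heavy.get)
                del self.heavy[victim]
            self.heavy[item] = self.sketch.estimate(item)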
Cubic CMS
● Decorate each song event with geo, age, etc. Pour all key combinations into the CMS.
● Keep heavy hitters per geo and age group.

  *:*:<U2>       +1
  SE:*:<U2>      +1
  *:31-40:<U2>   +1
  SE:31-40:<U2>  +1
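A hypothetical helper showing how one play event fans out into every wildcard combination before being poured into the CMS; the function and key format mirror the slide:

    from itertools import product

    def cube_keys(song, **dims):
        """Fan one play event out into every wildcard combination,
        e.g. cube_keys("U2", geo="SE", age="31-40") yields
        *:*:<U2>, *:31-40:<U2>, SE:*:<U2>, SE:31-40:<U2>."""
        choices = [("*", value) for value in dims.values()]
        for combo in product(*choices):
            yield ":".join(combo) + f":<{song}>"

    # Each generated key gets +1 in the same Count-Min Sketch:
    # for key in cube_keys("U2", geo="SE", age="31-40"):
    #     cms.add(key)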
Machinery
O(10^4) messages/s per machine.
You probably only need one machine. If not, use Storm.
Read and write to a pub/sub channel, e.g. Kafka or ZeroMQ.
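As an illustration of that plumbing, a minimal sketch assuming the kafka-python client, a local broker, and hypothetical topic names "events" and "counts" (none of these are from the slides):

    import json
    from kafka import KafkaConsumer, KafkaProducer  # kafka-python

    consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    sketch = CountMinSketch()  # from the sketch earlier in the deck

    # Consume events, update the sketch, publish running estimates.
    for message in consumer:
        item = json.loads(message.value)["item"]
        sketch.add(item)
        estimate = {"item": item, "count": sketch.estimate(item)}
        producer.send("counts", json.dumps(estimate).encode())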
Brute force alternative
Dump every single message into Elasticsearch.
Suitable for high-dimensionality cubes.
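A minimal sketch of the brute-force path, assuming the official elasticsearch-py client (8.x signature) and a hypothetical index name "events":

    from elasticsearch import Elasticsearch  # elasticsearch-py, 8.x API

    es = Elasticsearch("http://localhost:9200")

    def ingest(message):
        # Every message becomes a document; counting across arbitrary
        # dimensions is then an aggregation query at read time.
        es.index(index="events", document=message)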
Recommendations, you said?
● Collaborative filtering - similarity matrix (Users × Items):

           Items
  Users    2 4 1 1 5 2
           0 1 7 1 0 6
           5 2 9 0 3 0
           3 8 0 6 0 7
Shave the matrix

Start from the similarity matrix (Users × Items):

  2 4 1 1 5 2
  0 1 7 1 0 6
  5 2 9 0 3 0
  3 8 0 6 0 7

Flatten it into (coordinate, value) pairs:

  0,0  3
  0,1  5
  0,2  0
  0,3  2
  1,0  8
  ...  ...

Flip and sort by value, then cut everything below the top entries:

  2,1  9
  1,0  8
  2,2  7
  5,0  7
  5,2  6

Rebuild a sparse matrix from the survivors - noise removed, fine for recommendations (see the sketch below):

  0 0 0 0 0 0
  0 0 7 0 0 6
  0 0 9 0 0 0
  0 8 0 0 0 7
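A minimal sketch of the flatten-sort-cut step in Python; the keep parameter and the tie-breaking are assumptions (the slide keeps the top five values):

    def shave(matrix, keep=5):
        """Flatten, sort by value, cut below the top `keep` entries,
        and rebuild a sparse matrix (ties broken arbitrarily)."""
        cells = [(value, r, c)
                 for r, row in enumerate(matrix)
                 for c, value in enumerate(row)]
        cells.sort(reverse=True)                    # "Flip Sort"
        top = {(r, c) for _, r, c in cells[:keep]}  # "Cut"
        return [[v if (r, c) in top else 0 for c, v in enumerate(row)]
                for r, row in enumerate(matrix)]

    matrix = [[2, 4, 1, 1, 5, 2],
              [0, 1, 7, 1, 0, 6],
              [5, 2, 9, 0, 3, 0],
              [3, 8, 0, 6, 0, 7]]
    shaved = shave(matrix)  # only the largest similarities survive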
Hungry for more?
Mikio Braun: http://www.berlinbuzzwords.de/session/real-time-personalization-and-recommendation-stream-mining
Ted Dunning on deep learning for real-time anomaly detection: http://www.berlinbuzzwords.de/session/deep-learning-high-performance-time-series-databases
Ted Dunning on Storm: http://www.youtube.com/watch?v=7PcmbI5aC20
Open source: stream-lib, Algebird
Want to work in this area? 
lalle@schibsted.com