Scalable real-time processing techniques

A glance at a few scalable stream processing techniques.


  1. Scalable real-time processing techniques: How to almost count. Lars Albertsson, Schibsted
  2. “We promised to count live... but since you can’t do that, we used historical numbers and this cool math to extrapolate.” ?!?
  3. Stream counting is simple. You already have the building blocks. Yet many wait for batch execution, or go through estimation hoops.
  4. Accurate counting
     [Diagram: Servers, Bus, Bucketisers, Aggregator]
     ● Straightforward, with some plumbing.
     ● Heavier than you need.
  5. Now or later? Exact or rough? Approximation now >> accurate later.
  6. Basic scenarios
     ● How many distinct items in the last x minutes?
     ● What are the top k items in the last x minutes?
     ● How many Ys in the last x minutes?
     These base techniques are sufficient for implementing e.g. personalisation and recommendation algorithms.
  7. Cardinality - distinct stream count
     ● Naive: Set of hashes. X bits per item.
  8. Cardinality - distinct stream count
     ● Naive: Set of hashes. X bits per item.
     ● Naive 2: Set approximation with Bloom filter + counter.
  9. Counting in context
     ● Look backward, different time windows, compare.
     ● Count for a small time quantum, keep history.
     ● Aggregate old windows.
     ● Monoid representations are desirable.
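The "count per quantum, aggregate old windows" idea can be sketched as follows. This is a minimal illustration, not taken from the slides: the class name `WindowedCounter`, its parameters, and the use of `collections.Counter` as the mergeable (monoid-like) representation are all assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import Counter, deque


class WindowedCounter:
    """Keeps one Counter per time quantum and merges quanta on query.

    Hypothetical helper for illustration only. Counters merge by addition,
    which is associative and has an empty identity, so any run of adjacent
    quanta can be combined into a larger window.
    """

    def __init__(self, quantum_seconds=60, max_quanta=60):
        self.quantum = quantum_seconds
        # Each entry is (quantum_index, Counter); the oldest entries fall off.
        self.buckets = deque(maxlen=max_quanta)

    def add(self, key, timestamp):
        idx = int(timestamp // self.quantum)
        # Assumes timestamps are non-decreasing, so only the newest bucket grows.
        if not self.buckets or self.buckets[-1][0] != idx:
            self.buckets.append((idx, Counter()))
        self.buckets[-1][1][key] += 1

    def count_last(self, n_quanta, now):
        """Merge the counts of the n_quanta most recent quanta, ending at `now`."""
        newest = int(now // self.quantum)
        total = Counter()
        for idx, counts in self.buckets:
            if newest - n_quanta < idx <= newest:
                total.update(counts)
        return total


wc = WindowedCounter(quantum_seconds=60)
for key, ts in [("U2", 0), ("U2", 30), ("Gaga", 70), ("U2", 130)]:
    wc.add(key, ts)
last_two_minutes = wc.count_last(2, now=130)
```

Because merging is just addition, the same structure would work with any sketch that supports a merge operation in place of `Counter`.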
  10. Cardinality - distinct stream count
      ● Naive: Set of hashes. X bits per item.
      ● Naive 2: Set approximation with Bloom filter + counter.
      ● Naive 3: Hash to bitmap. Count bits.
  11. Cardinality - distinct stream count
      ● Naive: Set of hashes. X bits per item.
      ● Naive 2: Set approximation with Bloom filter + counter.
      ● Naive 3: Hash to bitmap. Count bits.
      ● Attempt 4: Hash, bitmap, count + collision compensation. Linear Probabilistic Counter.
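A minimal sketch of the "hash, bitmap, count + collision compensation" step, assuming the standard linear counting estimate n ≈ -m·ln(V), where m is the bitmap size and V the fraction of slots still empty. The class name `LinearCounter`, the SHA-1 based hash, and the one-byte-per-slot storage are illustrative choices, not something the slides specify.

```python
import hashlib
import math


class LinearCounter:
    """Linear probabilistic counter (illustrative sketch).

    Uses one byte per slot for readability; a real bitmap would pack 8 slots
    per byte.
    """

    def __init__(self, m=1 << 16):
        self.m = m
        self.slots = bytearray(m)

    def _index(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        self.slots[self._index(item)] = 1

    def slots_set(self):
        # "Naive 3": just count the set slots (undercounts on collisions).
        return sum(self.slots)

    def estimate(self):
        # "Attempt 4": compensate for collisions using the empty fraction.
        zeros = self.m - self.slots_set()
        if zeros == 0:
            return float("inf")  # bitmap saturated, estimate is unbounded
        return -self.m * math.log(zeros / self.m)


lc = LinearCounter()
for i in range(10_000):
    lc.add(f"user-{i}")
naive_count = lc.slots_set()
compensated = lc.estimate()
```

The raw slot count can only undershoot the true cardinality, since colliding items share a slot; the logarithmic correction accounts for the expected number of such collisions.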
  12. Cardinality - distinct stream count
      ● Naive: Set of hashes. X bits per item.
      ● Naive 2: Set approximation with Bloom filter + counter.
      ● Naive 3: Hash to bitmap. Count bits.
      ● Attempt 4: Hash, bitmap, count + collision compensation. Linear Probabilistic Counter.
      ● Read papers… -> HyperLogLog counter
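For orientation, a compact HyperLogLog sketch is shown below. It follows the commonly published formulation (register index from the top p hash bits, rank of the first 1-bit in the rest, harmonic-mean estimate with a small-range fallback to linear counting); the class name, the SHA-1 based 64-bit hash, and the default p=12 are assumptions for illustration. In production, a library such as the stream-lib or Algebird projects named in this deck would be the usual choice.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog with 2**p registers (assumes p >= 7)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        idx = h >> (64 - self.p)                     # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in `rest`, counting from 1.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)             # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)           # small-range correction
        return raw

    def merge(self, other):
        """Union of two sketches: element-wise max of registers."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
approx_distinct = hll.estimate()
```

The `merge` method is what makes the counter fit the per-quantum approach: sketches for adjacent time windows combine into a sketch for the larger window.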
  13. Cardinality - distinct stream count. Source: Shakespeare, highscalability.com
  14. Top K counting
      U2      65    U2      65    U2      65
      Gaga    46    Gaga    46    Gaga    46
      Avicii  23    Avicii  23    Avicii  23
      Eminem  21    Eminem  21    Eminem  21
      Dolly   18    Peps    19    Dolly   20
      ● Keep k items, assume absentees have lowest value.
      ● Accurate at top, overcounting in bottom.
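The rule "keep k items, assume absentees have lowest value" can be sketched as below. The class name `TopK` is hypothetical, and the eviction rule (a newcomer replaces the current minimum and inherits its count plus one) is my reading of the slide's three tables; it matches what is often called the Space-Saving approach.

```python
class TopK:
    """Tracks at most k items; a newcomer inherits the lowest count + 1."""

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            # Evict the current minimum and assume the absentee had its count.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.counts[item] = floor + 1

    def top(self):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])


# Reproduce the slide's three states, starting from the first table.
tk = TopK(k=5)
tk.counts = {"U2": 65, "Gaga": 46, "Avicii": 23, "Eminem": 21, "Dolly": 18}
tk.add("Peps")    # Peps replaces Dolly and is credited 18 + 1
after_peps = dict(tk.counts)
tk.add("Dolly")   # Dolly replaces Peps and is credited 19 + 1
after_dolly = dict(tk.counts)
```

This shows where the overcounting at the bottom comes from: Dolly ends at 20 although only 19 Dolly events were seen.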
  15. Approx counting - Count-Min Sketch
      ● Compute n hashes for the key, one per row.
      ● In each row, increment the cell at column hash mod width.
      ● Retrieve by min() over rows.
       3    7   20    3   11    6  3+1    4    1    1
       3    8    6  2+1   17   13    1    0    4    5
      12    7    6   14    2    0    2    3  6+1    7
       3    2   12  8+1   10    2    7    2   11    2
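The three steps above can be sketched as follows. The class name `CountMinSketch`, the default sizes, and the use of SHA-1 salted with the row number as a stand-in for n independent hash functions are assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch: depth rows, width columns."""

    def __init__(self, depth=4, width=1000):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # One hash per row; salting with the row index stands in for
        # independent hash functions.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._columns(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions only ever add to a cell, so the minimum over rows is
        # the least-inflated value and never undercounts.
        return min(self.table[row][col] for row, col in self._columns(key))


cms = CountMinSketch()
for _ in range(65):
    cms.add("U2")
for _ in range(46):
    cms.add("Gaga")
u2_estimate = cms.estimate("U2")
```

The estimate is an upper bound on the true count; widening the table lowers the expected overcount, and adding rows lowers the chance of a bad estimate.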
  16. Top K with Count-Min Sketch
      U2      65    U2      65    U2      65
      Gaga    46    Gaga    46    Gaga    46
      Avicii  23    Avicii  23    Avicii  23
      Eminem  21    Eminem  21    Eminem  21
      Dolly   18    Peps     2    Dolly   19
      ● Keep Heavy Hitters list.
      ● Lookup absentees in CMS.
      ● Risk of overcount is smaller and spread out.
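One way to combine a heavy hitters list with a Count-Min Sketch is sketched below. The class name `HeavyHitters` and the promotion rule (an absent key enters the list only when its sketch estimate exceeds the smallest listed count) are assumptions; the slide does not spell out the exact policy. A small sketch is embedded so the example is self-contained.

```python
import hashlib


class HeavyHitters:
    """Top-k list backed by a Count-Min Sketch (illustrative sketch)."""

    def __init__(self, k, depth=4, width=2048):
        self.k = k
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        self.top = {}  # key -> latest sketch estimate

    def _cells(self, key):
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key):
        # Every event goes into the sketch, listed or not.
        cells = list(self._cells(key))
        for row, col in cells:
            self.table[row][col] += 1
        estimate = min(self.table[row][col] for row, col in cells)

        if key in self.top or len(self.top) < self.k:
            self.top[key] = estimate
            return
        # Absentee: compare its sketch estimate against the weakest entry.
        weakest = min(self.top, key=self.top.get)
        if estimate > self.top[weakest]:
            del self.top[weakest]
            self.top[key] = estimate


hh = HeavyHitters(k=5)
plays = [("U2", 65), ("Gaga", 46), ("Avicii", 23), ("Eminem", 21), ("Dolly", 18), ("Peps", 2)]
for artist, n in plays:
    for _ in range(n):
        hh.add(artist)
```

Unlike the plain top-k list, a newcomer such as Peps is credited with its own (approximate) count from the sketch rather than the evicted item's count, so it does not displace Dolly after two plays.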
  17. Cubic CMS
      ● Decorate song with geo, age, etc. Pour into CMS.
      ● Keep heavy hitters per geo, age group.
      *:*:<U2>         +1
      SE:*:<U2>        +1
      *:31-40:<U2>     +1
      SE:31-40:<U2>    +1
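The key decoration can be sketched as below: each event is expanded into one key per combination of concrete dimension value and wildcard, and every key is incremented. The function name `cube_keys` is hypothetical, and a plain `Counter` stands in for the Count-Min Sketch to keep the example short.

```python
from collections import Counter
from itertools import product


def cube_keys(item, dimensions):
    """Expand an item into 2**len(dimensions) keys, using '*' as wildcard."""
    choices = [(value, "*") for value in dimensions]
    return [":".join(combo) + f":<{item}>" for combo in product(*choices)]


# Stand-in for the sketch: anything with an "increment this key" operation works.
counts = Counter()
for key in cube_keys("U2", ["SE", "31-40"]):
    counts[key] += 1
```

The number of keys doubles with each added dimension, which is one reason the sketch's fixed memory footprint is attractive here.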
  18. Machinery: O(10^4) messages/s per machine. You probably only need one. If not, use Storm. Read from and write to a pub/sub channel, e.g. Kafka or ZeroMQ.
  19. Brute force alternative: Dump every single message into ElasticSearch. Suitable for high dimensionality cubes.
  20. Recommendations, you said?
      ● Collaborative filtering - similarity matrix
      Axes: Users, Items
      2 4 1 1 5 2
      0 1 7 1 0 6
      5 2 9 0 3 0
      3 8 0 6 0 7
  21. Shave the matrix (axes: Users, Items)
      Original:
      2 4 1 1 5 2
      0 1 7 1 0 6
      5 2 9 0 3 0
      3 8 0 6 0 7
      Flip (coordinate, value): 0,0 3; 0,1 5; 0,2 0; 0,3 2; 1,0 8; ...
      Sort: 2,1 9; 1,0 8; 2,2 7; 5,0 7; 5,2 6; ...
      Cut: 2,1 9; 1,0 8; 2,2 7; 5,0 7; 5,2 6
      Result:
      0 0 0 0 0 0
      0 0 7 0 0 6
      0 0 9 0 0 0
      0 8 0 0 0 7
      Noise removed - fine for recommendations.
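The flip, sort, cut sequence can be sketched as follows, using the slide's matrix as input. The function name `shave` and the `keep` parameter are illustrative. Cells are indexed here as (row, column) from the top-left, which differs from the coordinate pairs printed on the slide, and ties are broken by row-major order, which is an arbitrary choice.

```python
def shave(matrix, keep):
    """Zero out everything except the `keep` largest cells."""
    # Flip: turn the matrix into a list of (value, row, col) entries.
    cells = [
        (value, r, c)
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
    ]
    # Sort: largest values first (the sort is stable, so ties keep row-major order).
    cells.sort(key=lambda cell: -cell[0])
    # Cut: keep only the top entries and rebuild a sparse matrix from them.
    shaved = [[0] * len(row) for row in matrix]
    for value, r, c in cells[:keep]:
        shaved[r][c] = value
    return shaved


similarity = [
    [2, 4, 1, 1, 5, 2],
    [0, 1, 7, 1, 0, 6],
    [5, 2, 9, 0, 3, 0],
    [3, 8, 0, 6, 0, 7],
]
shaved = shave(similarity, keep=5)
```

With `keep=5` this yields the same sparse matrix as the slide's result; in a streaming setting the kept entries are the ones a heavy hitters list would retain.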
  22. Hungry for more?
      Mikio Braun: http://www.berlinbuzzwords.de/session/real-time-personalization-and-recommendation-stream-mining
      Ted Dunning on deep learning for real-time anomaly detection: http://www.berlinbuzzwords.de/session/deep-learning-high-performance-time-series-databases
      Ted Dunning on Storm: http://www.youtube.com/watch?v=7PcmbI5aC20
      Open source: stream-lib, Algebird
  23. Want to work in this area? lalle@schibsted.com