Realtime Analytics with Apache Cassandra

The latest version of my talk, as given at the NoSQL Roadshow Amsterdam, 29th Nov 2012

  1. 1. Realtime Analytics with Apache Cassandra Tom Wilkie Founder & CTO, Acunu Ltd @tom_wilkie
  2. 2. 101 • BigTable-style data model combined with Dynamo-style consistency • Simple queries: put, get, range queries • Multi-master architecture: no SPOF • Tunable consistency, multi-DC aware • Optimised for random writes & random range queries • Atomic counters, wide rows, composite keys
  3. 3. Combining “big” and “real-time” is hard: live & historical aggregates, drill-downs and roll-ups, trends...
  4. 4. Existing solutions and their cons: scalability, $$$, not realtime, spartan query semantics => complex, DIY solutions
  5. 5. Example I: e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”. A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data (http://blog.twitter.com/2011/03/numbers.html)
  6. 6. Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters.
  7. 7. Preparing the data. Step 1: get a feed of the tweets (12:32:15 “I like #trafficlights”, 12:33:43 “Nobody expects...”, 12:33:49 “I ate a #bee; woe is...”, 12:34:04 “Man, @acunu rocks!”). Step 2: tokenise the tweet. Step 3: increment counters in time buckets for each token, e.g. [1234, man] +1, [1234, acunu] +1, [1234, rock] +1 (see the ingestion sketch after the transcript).
  8. 8. Querying. Step 1: do a range query from start [01/05/11, acunu] to end [30/05/11, acunu]. Step 2: the result table maps key to #mentions, e.g. [01/05/11 00:01, acunu] → 3, [01/05/11 00:02, acunu] → 5, ... Step 3: plot a pretty graph of mentions per day, May through November (see the query sketch after the transcript).
  9. 9. Cassandra keys (k1, k2, k3, k4, ...) are distributed based on the hash of the row key, i.e. randomly.
  10. 10. Instead of one narrow row per key → #mentions ([01/05/11 00:01, acunu] → 3, [01/05/11 00:02, acunu] → 5, ...), we use wide rows: the row key holds the ‘big’ time bucket ([01/05/11, acunu], [02/05/11, acunu], ...) and the column key holds the ‘small’ time bucket (00:01 → 3, 00:02 → 5, ...) (see the wide-row sketch after the transcript).
  11. 11. Towards a more general solution... (Example II)
  12. 12. Aggregates (count, count distinct(session), avg(duration), ...) grouped by day, geography, browser, ...
  13. 13. An incoming event { cust_id: user01, session_id: 102, geography: UK, browser: IE, time: 22:02, ... } shown against the counter table: one row per grouping (time buckets 21:00, 22:00, ...; geographies UK, US, ...; combinations such as UK + 22:00; and the empty grouping ∅), each row holding an ‘all’ total plus a per-minute, per-user or per-geography breakdown (e.g. 21:00 → all 1345, :00 45, :01 62, :02 87; UK → all 228, user01 1, user14 12, user99 7; ∅ → all 87314, UK 238, US 354).
  14. 14. The same table after the event has been applied: every counter the event touches is incremented, so 22:00 → all 3222 (:02 → 105), UK → all 229 (user01 → 2), UK + 22:00 → all 1905, and ∅ → all 87315 (UK → 239).
  15. 15. The counter table on its own: rows for time buckets, geographies, combined dimensions and the empty grouping ∅, ready to be read back at query time.
  16. 16. where time 21:00-22:00 count(*): read the ‘all’ counter of the 21:00 row.
  17. 17. where time 22:00-23:00, group by minute: read the per-minute counters of the 22:00 row.
  18. 18. where geography=UK, group all by user: read the per-user counters of the UK row.
  19. 19. count all: read the ‘all’ counter of the ∅ row.
  20. 20. group all by geo: read the per-geography counters of the ∅ row (see the multi-dimensional counter sketch after the transcript).
  21. 21. What about more than just aggregates?
  22. 22. Approximate analytics: trading exact answers for results that are both real-time and large scale.
  23. 23. Count distinct, plan A: keep a list of all the things you’ve seen and count them at query time. Quick to update... but at scale it takes lots of space and a long time to query.
  24. 24. Approximate distinct: track the maximum number of leading zeroes seen so far across the hashes of the items. E.g. item x hashes to 00101001110... (2 leading zeroes, max so far 2), y to 11010100111... (0 leading zeroes, max 2), z to 00011101011... (3 leading zeroes, max 3), ... To see a maximum of M leading zeroes takes about 2^M distinct items.
  25. 25. Approximate distinct: to reduce variance, average over m = 2^k sub-streams. The first k bits of the hash pick the sub-stream and the leading zeroes are counted in the remaining bits: x → 00101001110... (index 0, 0 zeroes, maxima 0,0,0,0), y → 11010100111... (index 3, 1 zero, maxima 0,0,0,1), z → 00011101011... (index 0, 1 zero, maxima 1,0,0,1), ... then take the harmonic mean of the sub-stream estimates (see the approximate-distinct sketch after the transcript).
  26. 26. Okay... now what?
  27. 27. Click stream events, sensor data, etc. flow into Acunu Analytics as counter updates • Aggregate incrementally, on the fly • Store live + historical aggregates
  28. 28. 10x vs MySQL...
  29. 29. Dashboard UI
  30. 30. “Up and running in about 4 hours” • “We found out a competitor was scraping our data” • “We keep discovering use cases we hadn’t thought of” (http://vimeo.com/54026096)
  31. 31. "Quick, efficient and easy to get started" "Were still finding new and interesting use cases, which just arent possible with our current datastores." Analytics
  32. 32. Thanks! Questions? http://www.acunu.com/download contact@acunu.com
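The sketches below illustrate a few of the techniques from the slides. First, the ingestion step from slide 7: tokenise each tweet and increment a counter per (time bucket, token). This is a minimal in-memory stand-in for Cassandra counter updates; the minute-level bucketing and the regex tokeniser are assumptions of the sketch, not Acunu's actual pipeline.

```python
# Minimal in-memory sketch of slide 7: tokenise each tweet and bump a counter
# per (time bucket, token). A Cassandra deployment would issue counter
# increments instead of mutating a dict; the bucketing and tokeniser here are
# illustrative assumptions.
import re
from collections import defaultdict
from datetime import datetime

counters = defaultdict(int)          # (minute bucket, token) -> count

def ingest(tweet_time: datetime, text: str) -> None:
    bucket = tweet_time.strftime("%Y-%m-%d %H:%M")
    for token in re.findall(r"[#@]?\w+", text.lower()):
        counters[(bucket, token.lstrip("#@"))] += 1

ingest(datetime(2011, 5, 1, 12, 34, 4), "Man, @acunu rocks!")
print(counters[("2011-05-01 12:34", "acunu")])   # -> 1
```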
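Next, the query step from slide 8, continuing the same in-memory stand-in: read back every counter whose key falls in the requested range and roll the counts up per day. In Cassandra this would be a single range query over the ordered keys rather than a scan of all entries.

```python
# Sketch of slide 8: read back the counters for one token over a date range
# and aggregate them per day. Uses the `counters` dict from the ingestion
# sketch above; in Cassandra this would be one range query, not a scan.
from collections import defaultdict

def mentions_per_day(counters, token, start_day, end_day):
    per_day = defaultdict(int)
    for (bucket, tok), count in counters.items():
        day = bucket[:10]                         # "YYYY-MM-DD" prefix
        if tok == token and start_day <= day <= end_day:
            per_day[day] += count
    return dict(sorted(per_day.items()))

# e.g. mentions_per_day(counters, "acunu", "2011-05-01", "2011-11-30")
```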
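A sketch of the wide-row layout from slide 10: the 'big' time bucket (the day) sits in the row key next to the token, and the 'small' time bucket (the minute) becomes a column inside that row, so a day's worth of counts is one contiguous row rather than many scattered rows. The dict-of-dicts below only illustrates the key layout.

```python
# Sketch of slide 10: wide rows keyed by (day, token), with one column per
# minute bucket. Illustrates the layout only; Cassandra would store these as
# counter columns inside a single wide row.
from collections import defaultdict

wide_rows = defaultdict(dict)        # (day, token) -> {minute: count}

def increment(day: str, minute: str, token: str, delta: int = 1) -> None:
    row = wide_rows[(day, token)]
    row[minute] = row.get(minute, 0) + delta

increment("01/05/11", "00:01", "acunu", 3)
increment("01/05/11", "00:02", "acunu", 5)
print(wide_rows[("01/05/11", "acunu")])          # {'00:01': 3, '00:02': 5}
```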
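A sketch of the multi-dimensional counters walked through on slides 13-20: each incoming event increments one counter row per grouping it belongs to (its hour bucket, its geography, the combination of the two, and the empty grouping), keeping a breakdown inside each row so that the later queries become plain counter reads. The exact set of groupings below is an assumption modelled on the slides.

```python
# Sketch of slides 13-20: every event bumps an 'all' total plus a breakdown
# counter in each grouping it belongs to, so queries reduce to counter reads.
# The groupings below mirror the slides but are otherwise illustrative.
from collections import defaultdict

tables = defaultdict(lambda: defaultdict(int))   # group key -> {sub key: count}

def ingest_event(event: dict) -> None:
    hour, minute = event["time"].split(":")
    groups = [
        (("hour", hour + ":00"),                         "minute:" + minute),
        (("geo", event["geography"]),                    "user:" + event["cust_id"]),
        (("geo+hour", event["geography"], hour + ":00"), None),
        (("all",),                                       "geo:" + event["geography"]),
    ]
    for group, sub in groups:
        tables[group]["all"] += 1
        if sub is not None:
            tables[group][sub] += 1

ingest_event({"cust_id": "user01", "session_id": 102,
              "geography": "UK", "browser": "IE", "time": "22:02"})

print(tables[("hour", "22:00")]["all"])          # count(*) where time 22:00-23:00
print(tables[("geo", "UK")]["user:user01"])      # group by user where geography=UK
print(tables[("all",)]["geo:UK"])                # group all by geo
```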
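Finally, a sketch of the approximate distinct counting from slides 24-25: hash each item, use the first k bits of the hash to pick one of m = 2^k sub-streams, keep the maximum "leading zeroes + 1" seen in the remaining bits per sub-stream, and combine the maxima with a harmonic mean. This follows the LogLog/HyperLogLog idea the slides describe; the hash function, bit widths and bias constant are assumptions of this sketch, not Acunu's implementation.

```python
# Sketch of slides 24-25: LogLog/HyperLogLog-style approximate distinct count.
# The first k hash bits select a sub-stream; each sub-stream keeps the max
# (leading zeroes + 1) seen; the estimate is a bias-corrected harmonic mean.
import hashlib

K = 6                       # m = 2^k sub-streams
M = 1 << K
W = 32                      # bits examined for leading zeroes
ALPHA = 0.709               # bias-correction constant for m = 64

class ApproxDistinct:
    def __init__(self):
        self.maxima = [0] * M

    def add(self, item) -> None:
        bits = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        index = bits >> (64 - K)                        # first k bits pick the sub-stream
        rest = (bits >> (64 - K - W)) & ((1 << W) - 1)  # next W bits
        rank = W + 1 if rest == 0 else W - rest.bit_length() + 1
        self.maxima[index] = max(self.maxima[index], rank)

    def estimate(self) -> float:
        return ALPHA * M * M / sum(2.0 ** -x for x in self.maxima)

sketch = ApproxDistinct()
for i in range(10000):
    sketch.add(f"user-{i}")
print(round(sketch.estimate()))   # roughly 10,000 (standard error ~13% for m = 64)
```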
