Realtime Analytics with Apache Cassandra. Tom Wilkie, Founder & CTO, Acunu Ltd, @tom_wilkie

Realtime analytics with Apache Cassandra - Tom Wilkie

Realtime analytics is seen as a key approach to extracting value from the new wave of Big Data. In this talk we will discuss different approaches to realtime analytics, before focusing on how to build realtime analytics applications using Apache Cassandra. We will talk about some of the common use cases, how to model the data, and some more advanced topics, such as algorithms for approximate analytics.

Published in: Technology

  1. Realtime Analytics with Apache Cassandra. Tom Wilkie, Founder & CTO, Acunu Ltd, @tom_wilkie
  2. Combining “big” and “real-time” is hard
       - Live & historical aggregates...
       - Drill downs and roll ups
       - Trends...
  3. Solution / Con
       - Scalability $$$
       - Not realtime
       - Spartan query semantics => complex, DIY solutions
  4. Example I: e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”. A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data. http://blog.twitter.com/2011/03/numbers.html
  5. Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters.
  6. Preparing the data
       Step 1: Get a feed of the tweets
         12:32:15  I like #trafficlights
         12:33:43  Nobody expects...
         12:33:49  I ate a #bee; woe is...
         12:34:04  Man, @acunu rocks!
       Step 2: Tokenise the tweet
       Step 3: Increment counters in time buckets for each token
         [1234, man]    +1
         [1234, acunu]  +1
         [1234, rock]   +1
  7. Querying
       Step 1: Do a range query
         start: [01/05/11, acunu]
         end:   [30/05/11, acunu]
       Step 2: Result table
         Key                        #Mentions
         [01/05/11 00:01, acunu]    3
         [01/05/11 00:02, acunu]    5
         ...                        ...
       Step 3: Plot pretty graph (y-axis: 0 to 90; x-axis: May to Nov)
  8. Instead of this...
         Key                        #Mentions
         [01/05/11 00:01, acunu]    3
         [01/05/11 00:02, acunu]    5
         ...                        ...
     We do this
         Key                  00:01   00:02   ...
         [01/05/11, acunu]    3       5       ...
         [02/05/11, acunu]    12      4       ...
         ...                  ...     ...
     Row key is the ‘big’ time bucket; column key is the ‘small’ time bucket.
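
The counter scheme from slides 5 to 8 can be sketched in a few lines. This is a minimal in-memory illustration, not the talk's implementation: a nested dictionary stands in for a Cassandra counter table, and the names `counters`, `tokenise`, `record` and `mentions_per_day` are hypothetical. The tokeniser only lowercases and splits; it does no stemming, so "rocks" stays "rocks" rather than becoming "rock" as on slide 6.

```python
import re
from collections import defaultdict
from datetime import datetime

# In-memory stand-in for a counter table: row key -> {column key -> count}.
# Row key is the 'big' bucket (day, token); column key is the 'small' bucket (HH:MM).
counters = defaultdict(lambda: defaultdict(int))

def tokenise(text):
    """Lowercase the tweet and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def record(timestamp, text):
    """Increment one counter per distinct token in the tweet."""
    day = timestamp.strftime("%Y-%m-%d")
    minute = timestamp.strftime("%H:%M")
    for token in set(tokenise(text)):
        counters[(day, token)][minute] += 1

def mentions_per_day(token, start_day, end_day):
    """Range query over row keys: mentions of `token` per day in [start_day, end_day]."""
    return {
        day: sum(columns.values())
        for (day, tok), columns in sorted(counters.items(), key=lambda kv: kv[0])
        if tok == token and start_day <= day <= end_day
    }

record(datetime(2011, 5, 1, 12, 34, 4), "Man, @acunu rocks!")
record(datetime(2011, 5, 1, 12, 35, 10), "Trying out acunu today")
record(datetime(2011, 5, 2, 9, 0, 0), "acunu acunu acunu")
print(mentions_per_day("acunu", "2011-05-01", "2011-05-31"))
# {'2011-05-01': 2, '2011-05-02': 1}
```

Because each tweet's tokens are de-duplicated before counting, the counters measure tweets that mention a token, not raw occurrences. Answering the per-day query only reads counters; no tweet is reprocessed.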
  9. Towards a more general solution... (Example II)
  10. Aggregates (count, count distinct(session), avg(duration), ...) grouped by ... (day, geography, browser, ...)
  11. Incoming event:
         { cust_id: user01, session_id: 102, geography: UK, browser: IE, time: 22:02, ... }
      Counters before the update:
         21:00        all→1345   :00→45     :01→62     :02→87     ...
         22:00        all→3221   :00→22     :01→19     :02→104    ...
         ...
         UK           all→228    user01→1   user14→12  user99→7   ...
         US           all→354    user01→4   user04→8   user56→17  ...
         ...
         UK, 22:00    all→1904   ...
         ∅            all→87314  UK→238     US→354     ...
  12. Incoming event:
         { cust_id: user01, session_id: 102, geography: UK, browser: IE, time: 22:02, ... }
      Counters after the update:
         21:00        all→1345   :00→45     :01→62     :02→87     ...
         22:00        all→3222   :00→22     :01→19     :02→105    ...
         ...
         UK           all→229    user01→2   user14→12  user99→7   ...
         US           all→354    user01→4   user04→8   user56→17  ...
         ...
         UK, 22:00    all→1905   ...
         ∅            all→87315  UK→239     US→354     ...
  13. Counters:
         21:00        all→1345   :00→45     :01→62     :02→87     ...
         22:00        all→3221   :00→22     :01→19     :02→104    ...
         ...
         UK           all→228    user01→1   user14→12  user99→7   ...
         US           all→354    user01→4   user04→8   user56→17  ...
         ...
         UK, 22:00    all→1904   ...
         ∅            all→87314  UK→238     US→354     ...
  14. Counters, with the query each row answers:
         21:00        all→1345   :00→45     :01→62     :02→87     ...   <- where time 21:00-22:00 count(*)
         22:00        all→3222   :00→22     :01→19     :02→105    ...
         ...
         UK           all→229    user01→2   user14→12  user99→7   ...
         US           all→354    user01→4   user04→8   user56→17  ...
         ...
         UK, 22:00    all→1905   ...
         ∅            all→87315  UK→239     US→354     ...
  15. Counters, with the query each row answers:
         21:00        all→1345   :00→45     :01→62     :02→87     ...   <- where time 21:00-22:00 count(*)
         22:00        all→3222   :00→22     :01→19     :02→105    ...   <- where time 22:00-23:00, group by minute
         ...
         UK           all→229    user01→2   user14→12  user99→7   ...
         US           all→354    user01→4   user04→8   user56→17  ...
         ...
         UK, 22:00    all→1905   ...
         ∅            all→87315  UK→239     US→354     ...
  16. Counters, with the query each row answers:
         21:00        all→1345   :00→45     :01→62     :02→87     ...   <- where time 21:00-22:00 count(*)
         22:00        all→3222   :00→22     :01→19     :02→105    ...   <- where time 22:00-23:00, group by minute
         ...
         UK           all→229    user01→2   user14→12  user99→7   ...   <- where geography=UK group all by user
         US           all→354    user01→4   user04→8   user56→17  ...
         ...
         UK, 22:00    all→1905   ...
         ∅            all→87315  UK→239     US→354     ...
  17. Counters, with the query each row answers:
         21:00        all→1345   :00→45     :01→62     :02→87     ...   <- where time 21:00-22:00 count(*)
         22:00        all→3222   :00→22     :01→19     :02→105    ...   <- where time 22:00-23:00, group by minute
         ...
         UK           all→229    user01→2   user14→12  user99→7   ...   <- where geography=UK group all by user
         US           all→354    user01→4   user04→8   user56→17  ...
         ...
         UK, 22:00    all→1905   ...
         ∅            all→87315  UK→239     US→354     ...               <- count all
  18. Counters, with the query each row answers:
         21:00        all→1345   :00→45     :01→62     :02→87     ...   <- where time 21:00-22:00 count(*)
         22:00        all→3222   :00→22     :01→19     :02→105    ...   <- where time 22:00-23:00, group by minute
         ...
         UK           all→229    user01→2   user14→12  user99→7   ...   <- where geography=UK group all by user
         US           all→354    user01→4   user04→8   user56→17  ...
         ...
         UK, 22:00    all→1905   ...
         ∅            all→87315  UK→239     US→354     ...               <- count all; group all by geo
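
The update and read paths shown on slides 11 to 18 can be sketched as follows. This is an illustrative in-memory model under stated assumptions, not Acunu's implementation: the dictionary `counters` stands in for counter rows, and `ALL_ROW`, `record`, `count` and `group_by` are hypothetical names. One event increments exactly the seven counters that change between slides 11 and 12.

```python
from collections import defaultdict

ALL_ROW = "∅"  # row with no dimension fixed (the ∅ row on the slides)

# row key -> {column key -> count}
counters = defaultdict(lambda: defaultdict(int))

def record(event):
    """Fan one event out into every (row, column) counter it contributes to."""
    hour = event["time"][:2] + ":00"   # '22:02' -> '22:00'
    minute = ":" + event["time"][3:]   # '22:02' -> ':02'
    geo = event["geography"]
    user = event["cust_id"]
    updates = [
        (hour, "all"), (hour, minute),   # time row
        (geo, "all"), (geo, user),       # geography row
        ((geo, hour), "all"),            # geography + time row
        (ALL_ROW, "all"), (ALL_ROW, geo),
    ]
    for row, column in updates:
        counters[row][column] += 1

def count(row):
    """count(*) with the row's dimensions fixed: read the 'all' column."""
    return counters.get(row, {}).get("all", 0)

def group_by(row):
    """Grouped counts for a row: every column except 'all'."""
    return {c: n for c, n in counters.get(row, {}).items() if c != "all"}

record({"cust_id": "user01", "session_id": 102, "geography": "UK",
        "browser": "IE", "time": "22:02"})

print(count("22:00"))          # where time 22:00-23:00 count(*)
print(group_by("22:00"))       # where time 22:00-23:00, group by minute
print(group_by("UK"))          # where geography=UK group all by user
print(count(ALL_ROW))          # count all
print(group_by(ALL_ROW))       # group all by geo
```

Each query is a single row read. The trade-off is write amplification: every event costs one increment per pre-aggregated view, so the set of rows has to be chosen from the queries that are expected.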
  19. What about more than just aggregates?
  20. Approximate Analytics: Exact / Real-time / Large Scale
  21. Count Distinct
       Plan A: keep a list of all the things you’ve seen; count them at query time.
       Quick to update ... but at scale ...
         - Takes lots of space
         - Takes a long time to query
  22. Approximate Distinct: track the max # of leading zeroes seen so far
         item   hash             leading zeroes   max so far
         x      00101001110...   2                2
         y      11010100111...   0                2
         z      00011101011...   3                3
         ...    ...
       To see a max of M takes about 2^M items.
  23. Approximate Distinct: to reduce variance, average over m = 2^k sub-streams
         item   hash             index, zeroes   max so far
         x      00101001110...   0, 0            0,0,0,0
         y      11010100111...   3, 1            0,0,0,1
         z      00011101011...   0, 1            1,0,0,1
         ...
       Take the harmonic mean.
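
The estimator outlined on slides 22 and 23 closely resembles the HyperLogLog family of algorithms, although the slides do not name one. The sketch below is written on that assumption: the bias constant `alpha`, the small-range correction, and the use of "leading zeroes + 1" as the stored value come from HyperLogLog rather than from the talk, and the class name `ApproxDistinct` is hypothetical.

```python
import hashlib
import math

class ApproxDistinct:
    """Approximate distinct counter using m = 2**k sub-streams (HyperLogLog-style)."""

    def __init__(self, k=10):
        # The alpha formula used in estimate() is only valid for m >= 128, so k >= 7.
        self.k = k
        self.m = 1 << k
        self.registers = [0] * self.m  # max rank seen so far, per sub-stream

    def add(self, item):
        # 64-bit hash of the item (first 8 bytes of SHA-1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        rest_bits = 64 - self.k
        index = h >> rest_bits               # first k bits pick the sub-stream
        rest = h & ((1 << rest_bits) - 1)    # remaining bits
        # Rank = leading zeroes in the remaining bits, plus one (HyperLogLog convention).
        rank = rest_bits - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        # Harmonic mean of 2**register across sub-streams, scaled by alpha * m.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * m and empty:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / empty)
        return raw
```

Memory use is fixed at m small integers regardless of how many items are added, and adding an item that was already seen never changes the state. The expected relative error is roughly 1.04 / sqrt(m), so larger k trades space for accuracy.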
  24. Okay... now what?
  25. Acunu Analytics (diagram: click stream, sensor data, etc. -> events -> Acunu Analytics -> counter updates)
       • Aggregate incrementally, on the fly
       • Store live + historical aggregates
  26. 10x vs MySQL...
  27. Dashboard UI
  28. “Up and running in about 4 hours”
      “We found out a competitor was scraping our data”
      “We keep discovering use cases we hadn’t thought of”
  29. "Quick, efficient and easy to get started"
      "We're still finding new and interesting use cases, which just aren't possible with our current datastores."
  30. Thanks! Questions?
