Realtime Analytics with Apache Cassandra
Upcoming SlideShare
Loading in...5
×
 

Realtime Analytics with Apache Cassandra

on

  • 3,029 views

The latest version of my talk, as given at the NoSQL Roadshow Amasterdam, 29th Nov 2012

The latest version of my talk, as given at the NoSQL Roadshow Amasterdam, 29th Nov 2012

Statistics

Views

Total Views
3,029
Views on SlideShare
2,981
Embed Views
48

Actions

Likes
5
Downloads
49
Comments
0

2 Embeds 48

https://twitter.com 43
http://www.linkedin.com 5

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Realtime Analytics with Apache Cassandra Realtime Analytics with Apache Cassandra Presentation Transcript

    • Realtime Analytics with Apache Cassandra Tom Wilkie Founder & CTO, Acunu Ltd @tom_wilkie
    • 101• BigTable-style datamodel combined with Dynamo-style consistency• Simple queries - put, get, range queries• Multi-master architecture: no SPOF• Tunable consistency, multi-DC aware• Optimised for random writes & random range queries• Atomic counters, wide rows, composite keys2 Analytics
    • Combining “big” and “real-time” is hard Live & historical Drill downs Trends... aggregates... and roll ups3 Analytics
    • Solution Con Scalability $$$ Not realtime Spartan query semantics => complex, DIY solutions4 Analytics
    • Example I eg “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter” Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data http://blog.twitter.com/2011/03/numbers.html5 Analytics
    • Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters6 Analytics
    • Preparing the data 12:32:15 I like #trafficlightsStep 1: Get a feed of 12:33:43 Nobody expects... the tweets 12:33:49 I ate a #bee; woe is... 12:34:04 Man, @acunu rocks!Step 2: Tokenise the tweetStep 3: Increment counters [1234, man] +1 in time buckets for [1234, acunu] +1 each token [1234, rock] +17 Analytics
    • Querying start: [01/05/11, acunu]Step 1: Do a range query end: [30/05/11, acunu] Key #Mentions [01/05/11 00:01, acunu] 3Step 2: Result table [01/05/11 00:02, acunu] 5 ... ... 90Step 3: Plot pretty graph 45 0 May Jun Jul Aug Sept Oct Nov8 Analytics
    • k2 k4 k3 k1 Cassandra keys distributed based on hash or row key, ie randomly9 Analytics
    • Instead of this... Key #Mentions [01/05/11 00:01, acunu] 3 [01/05/11 00:02, acunu] 5 ... ... We do this Key 00:01 00:02 ... [01/05/11, acunu] 3 5 ... [02/05/11, acunu] 12 4 ... ... ... ... Row key is ‘big’ Column key is ‘small’ time bucket time bucket10 Analytics
    • Towards a more general solution... (Example II)11 Analytics
    • count grouped by ... day count distinct(session) count ... geographyavg(duration) ... browser12 Analytics
    • 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3221 :00→22 :00→19 :02→104 ...{ cust_id: user01, ... ... session_id: 102, UK all→228 user01→1 user14→12 user99→7 ... geography: UK, US all→354 user01→4 user04→8 user56→17 ... browser: IE, time: 22:02, ...} UK, 22:00 all→1904 ... ∅ all→87314 UK→238 US→354 ...13 Analytics
    • 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3222 :00→22 :00→19 :02→105 ...{ cust_id: user01, ... ... session_id: 102, UK all→229 user01→2 user14→12 user99→7 ... geography: UK, US all→354 user01→4 user04→8 user56→17 ... browser: IE, time: 22:02, ...} UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ...14 Analytics
    • 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3221 :00→22 :00→19 :02→104 ... ... ... UK all→228 user01→1 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1904 ... ∅ all→87314 UK→238 US→354 ...15 Analytics
    • where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3222 :00→22 :01→19 :02→105 ... ... ... UK all→229 user01→2 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ...16 Analytics
    • where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ...where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ...17 Analytics
    • where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ...where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ...where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ...18 Analytics
    • where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ...where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ...where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ...count all ∅ all→87315 UK→239 US→354 ...19 Analytics
    • where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ...where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ...where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ...count all ∅ all→87315 UK→239 US→354 ...group all by geo20 Analytics
    • What about more than just aggregates?21 Analytics
    • Approximate Analytics Exact Real-time Large Scale22 Analytics
    • Count Distinct Plan A: keep a list of all the things you’ve seen count them at query time Quick to update ... but at scale ... Takes lots of space Takes a long time to query23 Analytics
    • Approximate Distinct max # leading zeroes seen so far item hash leading zeroes max so far x 00101001110... 2 2 y 11010100111... 0 2 z 00011101011... 3 3 ... ... to see a max of M takes about 2M items24 Analytics
    • Approximate Distinct to reduce var, average over m=2k sub-streams item hash index, zeroes max so far x 00101001110... 0, 0 0,0,0,0 y 11010100111... 3, 1 0,0,0,1 z 00011101011... 0, 1 1,0,0,1 ... take the harmonic mean25 Analytics
    • Okay... now what? Analytics
    • Analytics counter updatesClick stream events AcunuSensor data Analytics etc • Aggregate incrementally, on the fly • Store live + historical aggregates
    • 10x vs MySQL... Analytics
    • Dashboard UI29 Analytics
    • “Up and running in about 4 hours”“We found out a competitor was scraping our data” “We keep discovering use cases we hadn’t thought of ”http://vimeo.com/54026096 Analytics
    • "Quick, efficient and easy to get started" "Were still finding new and interesting use cases, which just arent possible with our current datastores." Analytics
    • Thanks! Questions? http://www.acunu.com/download contact@acunu.com32 Analytics