Realtime Analytics with Apache Cassandra - JAX London
1. Realtime Analytics with Apache Cassandra
Tom Wilkie
Founder & CTO, Acunu Ltd
@tom_wilkie
2. Combining “big” and “real-time” is hard
Live & historical aggregates...
Drill downs and roll ups...
Trends...
3. Existing solutions, and their cons:
Scalability
$$$
Not realtime
Spartan query semantics => complex, DIY solutions
4. Example I
e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”
A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data.
http://blog.twitter.com/2011/03/numbers.html
5. Okay, so how are we going to do it?
For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters.
6. Preparing the data
Step 1: Get a feed of the tweets
  12:32:15 I like #trafficlights
  12:33:43 Nobody expects...
  12:33:49 I ate a #bee; woe is...
  12:34:04 Man, @acunu rocks!

Step 2: Tokenise the tweet

Step 3: Increment counters in time buckets for each token
  [1234, man] +1
  [1234, acunu] +1
  [1234, rock] +1
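A minimal sketch of the ingest path, with a plain dict standing in for Cassandra's counter columns; `tokenise` and `ingest` are illustrative names, not Acunu's API:

```python
from collections import defaultdict

# In-memory stand-in for Cassandra counter columns; a real deployment
# would issue counter increments against the cluster instead.
counters = defaultdict(int)

def tokenise(tweet):
    """Step 2: split a tweet into lower-cased tokens, stripping # and @."""
    return [w.strip("#@.,!").lower() for w in tweet.split() if w.strip("#@.,!")]

def ingest(timestamp, tweet):
    """Step 3: bump one counter per (minute bucket, token) pair."""
    bucket = timestamp[:5]            # '12:34:04' -> '12:34'
    for token in tokenise(tweet):
        counters[(bucket, token)] += 1

ingest("12:34:04", "Man, @acunu rocks!")
# counters: {('12:34', 'man'): 1, ('12:34', 'acunu'): 1, ('12:34', 'rocks'): 1}
```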
7. Querying
Step 1: Do a range query
  start: [01/05/11, acunu]
  end:   [30/05/11, acunu]

Step 2: Result table
  Key                      #Mentions
  [01/05/11 00:01, acunu]  3
  [01/05/11 00:02, acunu]  5
  ...                      ...

Step 3: Plot pretty graph
  [chart: mentions of ‘acunu’ per day, May-Nov 2011]
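The read side, continuing the same sketch: Cassandra's ordered row keys would serve this as a single range query, so here the in-memory dict is filtered and sorted instead. `mentions` is an illustrative name:

```python
def mentions(term, start, end):
    """Steps 1-2: fetch every (bucket, term) counter with start <= bucket <= end."""
    return sorted(
        (bucket, count)
        for (bucket, token), count in counters.items()
        if token == term and start <= bucket <= end
    )

# Step 3: hand the series to a plotting library for the pretty graph.
for bucket, count in mentions("acunu", "12:00", "13:00"):
    print(bucket, count)
```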
8. Instead of this...
Key                      #Mentions
[01/05/11 00:01, acunu]  3
[01/05/11 00:02, acunu]  5
...                      ...

...we do this:

Key                00:01  00:02  ...
[01/05/11, acunu]  3      5      ...
[02/05/11, acunu]  12     4      ...
...                ...    ...

Row key is the ‘big’ time bucket; column key is the ‘small’ time bucket.
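The same data in the wide-row layout, sketched with nested dicts standing in for Cassandra rows: the 'big' bucket is the row key, the 'small' buckets are columns, so a whole day of counters is one row read. Names are illustrative:

```python
from collections import defaultdict

# Row key = (day, term), the 'big' bucket; column key = minute, the
# 'small' bucket.
rows = defaultdict(lambda: defaultdict(int))

def ingest(day, minute, term):
    rows[(day, term)][minute] += 1

def day_of_mentions(day, term):
    """A single row lookup replaces a range scan over many keys."""
    return dict(rows[(day, term)])

ingest("01/05/11", "00:01", "acunu")
ingest("01/05/11", "00:02", "acunu")
print(day_of_mentions("01/05/11", "acunu"))   # {'00:01': 1, '00:02': 1}
```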
9. Towards a more general solution... (Example II)
10. Metrics: count, count distinct(session), avg(duration), ...
...grouped by: day, geography, browser, ...
15-18. Queries map onto rows of a single counter table:

where time 21:00-22:00, count(*)
where time 22:00-23:00, group by minute
where geography=UK, group all by user
count all
group all by geo

Row        Columns
21:00      all→1345   :00→45     :01→62     :02→87    ...
22:00      all→3222   :00→22     :01→19     :02→105   ...
...
UK         all→229    user01→2   user14→12  user99→7  ...
US         all→354    user01→4   user04→8   user56→17 ...
...
UK, 22:00  all→1905   ...
∅          all→87315  UK→239     US→354     ...
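A sketch of how one incoming event could feed every grouping in the table above; the row and column names mirror the slide, while the event shape and function names are assumptions:

```python
from collections import defaultdict

# One row per grouping value (an hour, a geography, a pair, or the
# empty grouping ∅); each row keeps an 'all' total plus finer columns.
table = defaultdict(lambda: defaultdict(int))

def ingest(event):
    """A single event bumps one counter per grouping it falls into."""
    hour = event["time"][:2] + ":00"          # '22:01:37' -> '22:00'
    minute = ":" + event["time"][3:5]         #            -> ':01'
    geo, user = event["geo"], event["user"]

    table[hour]["all"] += 1
    table[hour][minute] += 1                  # group by minute
    table[geo]["all"] += 1
    table[geo][user] += 1                     # group all by user
    table[(geo, hour)]["all"] += 1            # combined grouping
    table["∅"]["all"] += 1                    # count all
    table["∅"][geo] += 1                      # group all by geo

ingest({"time": "22:01:37", "geo": "UK", "user": "user01"})

# Queries are now single counter reads:
print(table["22:00"]["all"])   # count(*) where time 22:00-23:00
print(table["∅"]["UK"])        # group all by geo -> UK
```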
21. Count Distinct
Plan A: keep a list of all the things you’ve seen; count them at query time.
Quick to update...
...but at scale:
  takes lots of space
  takes a long time to query
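Plan A as a sketch, to make the trade-off concrete: exact and trivially updatable, but the stored set (and the query-time read) grows with the number of distinct items:

```python
# Plan A: remember everything, count at query time.
seen = set()

def record(item):
    seen.add(item)          # quick to update

def count_distinct():
    return len(seen)        # at scale: lots of space, slow to query
```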
22. Approximate Distinct
Track the max # of leading zeroes seen so far:

  item  hash            leading zeroes  max so far
  x     00101001110...  2               2
  y     11010100111...  0               2
  z     00011101011...  3               3
  ...

...to see a max of M takes about 2^M items.
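A sketch of the trick: hash each item to a fixed-width bit string, track the longest run of leading zeroes, and use 2^max as the estimate. `hash_bits` is an illustrative helper built on Python's standard `hashlib`:

```python
import hashlib

def hash_bits(item, bits=32):
    """Hash an item to a fixed-width bit string (md5 used for stability)."""
    h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
    return format(h % (1 << bits), "0{}b".format(bits))

def leading_zeroes(bitstring):
    return len(bitstring) - len(bitstring.lstrip("0"))

max_so_far = 0
for item in ["x", "y", "z"]:
    max_so_far = max(max_so_far, leading_zeroes(hash_bits(item)))

# Seeing a max of M leading zeroes takes about 2^M distinct items,
# so 2^max_so_far is a (high-variance) estimate of the distinct count.
print(2 ** max_so_far)
```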
23. Approximate Distinct
To reduce variance, average over m = 2^k sub-streams; the first k hash bits pick the sub-stream:

  item  hash            index, zeroes  maxes so far
  x     00101001110...  0, 0           0,0,0,0
  y     11010100111...  3, 1           0,0,0,1
  z     00011101011...  0, 1           1,0,0,1
  ...

...then take the harmonic mean.
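And the averaging step as a sketch: the first k hash bits route each item to one of m = 2^k registers, each register tracks its own max, and the estimate is the scaled harmonic mean of the per-register 2^max values. Full HyperLogLog additionally applies a bias-correction constant that this sketch omits:

```python
import hashlib

K = 2                        # first K hash bits pick the sub-stream
NUM_STREAMS = 1 << K         # m = 2^K sub-streams
registers = [0] * NUM_STREAMS

def hash_bits(item, bits=32):
    h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
    return format(h % (1 << bits), "0{}b".format(bits))

def add(item):
    bits = hash_bits(item)
    index = int(bits[:K], 2)                  # sub-stream index
    rest = bits[K:]
    zeroes = len(rest) - len(rest.lstrip("0"))
    registers[index] = max(registers[index], zeroes)

def estimate():
    # m^2 / sum(2^-max_j): the harmonic mean of the per-stream 2^max
    # estimates, scaled by m. (The bias-correction constant of full
    # HyperLogLog is omitted here.)
    return NUM_STREAMS ** 2 / sum(2.0 ** -r for r in registers)

for i in range(1000):
    add("item-%d" % i)
print(round(estimate()))
```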
25. Acunu Analytics
[diagram: click stream events, sensor data, etc. feed into Acunu Analytics, which emits counter updates]
• Aggregate incrementally, on the fly
• Store live + historical aggregates
28. “Up and running in about 4 hours”
“We found out a competitor was scraping our data”
“We keep discovering use cases we hadn’t thought of”
29. "Quick, efficient and easy to
get started"
"We're still finding new and
interesting use cases, which just
aren't possible with our
current datastores."