Realtime Analytics on the Twitter Firehose with Cassandra


Tutorial given by Tom Wilkie at Progressive NoSQL conference, 11/5/12


  1. Realtime Analytics with Cassandra, or: How I Learned to Stop Worrying and Love Counting
  2. What is Realtime Analytics? E.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”. A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data (http://blog.twitter.com/2011/03/numbers.html).
  3. Introduction: live & historical aggregates...
  4. Realtime trends...
  5. Drill downs and roll ups
  6. Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters.
  7. Preparing the data
     Step 1: Get a feed of the tweets
       12:32:15 I like #trafficlights
       12:33:43 Nobody expects...
       12:33:49 I ate a #bee; woe is...
       12:34:04 Man, @acunu rocks!
     Step 2: Tokenise the tweet
     Step 3: Increment counters in time buckets for each token
       [1234, man] +1
       [1234, acunu] +1
       [1234, rock] +1
  8. Querying
     Step 1: Do a range query
       start: [01/05/11, acunu]
       end:   [30/05/11, acunu]
     Step 2: Result table
       Key                       #Mentions
       [01/05/11 00:01, acunu]   3
       [01/05/11 00:02, acunu]   5
       ...                       ...
     Step 3: Plot pretty graph (figure: mentions, 0 to 90, by month from May to Nov)
  9. Except it’s not that easy...
     • Cassandra best practice is to use RandomPartitioner, so range queries on rows are not possible
     • Could manually work out each row in the range and do lots of point gets
       • This would suck - each query would be 100’s of random IOs on disk
     • Need to use wide rows: the range query becomes a column slice, each query ~1 IO - denormalisation
  10. So instead of this...
        Key                       #Mentions
        [01/05/11 00:01, acunu]   3
        [01/05/11 00:02, acunu]   5
        ...                       ...
      We do this
        Key                 00:01   00:02   ...
        [01/05/11, acunu]   3       5       ...
        [02/05/11, acunu]   12      4       ...
        ...                 ...     ...
      Row key is the ‘big’ time bucket; column key is the ‘small’ time bucket.
  11. Demo: ./painbird.py -u tom_wilkie
  12. Now it’s your turn...
  13. Steps:
      1. Get a Twitter account - http://twitter.com
      2. Get some Cassandra VMs - http://goo.gl/O9hkv
      3. Cluster them up
      4. Get the code - http://goo.gl
      5. Implement the missing bits!
      6. (Prizes for the ones that spot bugs!)
  14. Get some Cassandra VMs: http://goo.gl/O9hkv
  15. Cluster them up
      • SSH in, set password (on both!)
      • Check you can connect to the UI
      • Use UI (click add host)
  16. Get the code
      SSH into one of the VMs:
        # curl https://acunu-oss.s3.amazonaws.com/painbird.tar.gz | tar zxf -
        # curl -o pycassa.rpm https://acunu-oss.s3.amazonaws.com/pycassa.rpm
        # rpm -i pycassa.rpm
        # cd release
        # ./painbird.py -u tom_wilkie
  17. Implement the “core”
      • In core.py
      • def insert_tweet(cassandra, tweet):
      • def do_query(cassandra, term, start, finish):
  18. Check your data
        -bash-3.2$ cassandra-cli
        Connected to: "Test Cluster" on localhost/9160
        Welcome to Cassandra CLI version 1.0.8.acunu2
        Type help; or ? for help.
        Type quit; or exit; to quit.
        [default@unknown] use painbird;
        Authenticated to keyspace: painbird
        [default@painbird] list keywords;
        Using default limit of 100
        -------------------
        RowKey: m-5-"woe
        => (counter=11, value=1)
  19. Extensions
  20. UI
      • Pretty graphs
      • Automatically periodically update
      • Search multiple terms
      Painbird
      • mentions of multiple terms
      • sentiment analysis - http://www.nltk.org/
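
The "tokenise, then increment counters in time buckets" steps from the slides can be sketched roughly as follows. This is a minimal illustration, not the code shipped in painbird. It assumes a tweet is a dict with hypothetical `created_at` and `text` fields, and it uses an in-memory `defaultdict(Counter)` as a stand-in for a Cassandra counter column family. The row key is the ‘big’ bucket (day plus token) and the column key is the ‘small’ bucket (minute), as in the wide-row layout on the slides. Unlike the slide, which shows "rocks" counted as "rock", this sketch does no stemming.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime


def tokenise(text):
    """Lowercase the text and split it into word tokens.

    '#', '@' and punctuation are dropped, so '@acunu' becomes 'acunu'.
    """
    return re.findall(r"[a-z0-9_]+", text.lower())


def insert_tweet(cassandra, tweet):
    """Increment one counter per distinct token in the tweet's time bucket.

    `cassandra` is assumed to be a mapping of row key -> Counter of columns;
    a real store would need a counter increment call here instead.
    """
    when = tweet["created_at"]          # hypothetical field name
    day = when.strftime("%d/%m/%y")     # 'big' time bucket, part of the row key
    minute = when.strftime("%H:%M")     # 'small' time bucket, the column key
    # set() so a tweet repeating a token still counts as one mention
    for token in set(tokenise(tweet["text"])):
        cassandra[(day, token)][minute] += 1


# In-memory stand-in for the counter column family (an assumption).
store = defaultdict(Counter)
insert_tweet(store, {"created_at": datetime(2011, 5, 1, 12, 34, 4),
                     "text": "Man, @acunu rocks!"})
insert_tweet(store, {"created_at": datetime(2011, 5, 1, 12, 34, 30),
                     "text": "acunu acunu"})
```

Counting each token once per tweet is a design choice made here for illustration; counting every occurrence would only require dropping the `set()` call.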

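
The query side of the wide-row layout can be sketched in the same spirit. With one row per (day, term), a "mentions per day" query needs one row read per day in the range, and each read is a slice over that row's columns. The function signature matches the one on the slides; the body, the plain dict standing in for Cassandra, and the use of `date` objects for `start` and `finish` are assumptions made for illustration. The sample rows reuse the numbers from the wide-row table.

```python
from datetime import date, timedelta


def do_query(cassandra, term, start, finish):
    """Return [(day, mentions)] for every day from start to finish inclusive.

    `cassandra` is assumed to be a mapping of row key -> {column: count}.
    Each day costs one row lookup; summing the row's columns rolls the
    per-minute counters up into a per-day total.
    """
    results = []
    day = start
    while day <= finish:
        row = cassandra.get((day.strftime("%d/%m/%y"), term), {})
        results.append((day, sum(row.values())))
        day += timedelta(days=1)
    return results


# Stand-in data laid out like the wide rows on the slides.
store = {
    ("01/05/11", "acunu"): {"00:01": 3, "00:02": 5},
    ("02/05/11", "acunu"): {"00:01": 12, "00:02": 4},
}
counts = do_query(store, "acunu", date(2011, 5, 1), date(2011, 5, 3))
```

Days with no row simply report zero mentions, which keeps the result easy to plot as a continuous series.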