Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012

Slides from my tutorial at Denormalized London on 21 Sept 2012

Published in: Technology, Design

  1. Realtime Analytics with Cassandra
     or: How I Learned to Stop Worrying and Love Counting

  2. Combining “big” and “real-time” is hard
     Analytics:
     • Live & historical aggregates...
     • Drill downs and roll ups
     • Trends...

  3. What is Realtime Analytics?
     e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”
     A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data.
     http://blog.twitter.com/2011/03/numbers.html
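
The figures on this slide can be sanity-checked with a little arithmetic. The sketch below assumes that "between May and November 2011" means 1 May to 30 November inclusive and that "TB" means decimal terabytes; neither assumption is stated in the slides.

```python
from datetime import date

TWEETS = 30e9        # ~30 billion tweets (figure from the slide)
DATA_BYTES = 4.2e12  # ~4.2 TB, read as decimal terabytes (assumption)

# Assumption: the query window runs from 1 May to 30 November 2011, inclusive.
days = (date(2011, 11, 30) - date(2011, 5, 1)).days + 1

bytes_per_tweet = DATA_BYTES / TWEETS  # implied average tweet size
tweets_per_day = TWEETS / days         # implied average daily volume

print(days)                          # 214
print(round(bytes_per_tweet))        # 140
print(round(tweets_per_day / 1e6))   # 140 (million tweets per day)
```

Under these assumptions the slide's numbers imply roughly 140 bytes per tweet and about 140 million tweets per day, which is the volume a batch job would have to scan to answer one query.
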
  4. Okay, so how are we going to do it?
     Diagram: Twitter → tweets → ? → counter updates
     • Push processing into ingest phase
     • Make queries fast

  5. Okay, so how are we going to do it?
     For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters.

  6. Preparing the data
     Step 1: Get a feed of the tweets
         12:32:15 I like #trafficlights
         12:33:43 Nobody expects...
         12:33:49 I ate a #bee; woe is...
         12:34:04 Man, @acunu rocks!
     Step 2: Tokenise the tweet
     Step 3: Increment counters in time buckets for each token
         [1234, man]   +1
         [1234, acunu] +1
         [1234, rock]  +1
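
The three steps on this slide can be sketched in a few lines of Python. This is an illustration only, not the painbird code: the function names (`tokenise`, `minute_bucket`, `count_tweet`) are hypothetical, the tokenising rule is an assumption, and a `Counter` stands in for Cassandra counters. The slide shows "rocks" reduced to "rock", which suggests stemming; that step is left out here.

```python
import re
from collections import Counter
from datetime import datetime


def tokenise(text):
    """Lower-case the tweet and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def minute_bucket(timestamp):
    """Truncate a timestamp to its minute, e.g. 12:34:04 -> '1234'."""
    return timestamp.strftime("%H%M")


def count_tweet(counters, timestamp, text):
    """Add 1 to the [bucket, token] counter for each distinct token."""
    bucket = minute_bucket(timestamp)
    # Assumption: a token repeated within one tweet counts as one mention.
    for token in set(tokenise(text)):
        counters[(bucket, token)] += 1


counters = Counter()  # in-memory stand-in for Cassandra counter columns
count_tweet(counters, datetime(2011, 5, 1, 12, 34, 4), "Man, @acunu rocks!")
print(sorted(counters.items()))
# [(('1234', 'acunu'), 1), (('1234', 'man'), 1), (('1234', 'rocks'), 1)]
```
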
  7. Querying
     Step 1: Do a range query
         start: [01/05/11, acunu]
         end:   [30/05/11, acunu]
     Step 2: Result table
         Key                       #Mentions
         [01/05/11 00:01, acunu]   3
         [01/05/11 00:02, acunu]   5
         ...                       ...
     Step 3: Plot pretty graph
         (graph of mentions over time; y-axis 0 to 90, x-axis May to Nov)

  8. Except it’s not that easy...
     • Cassandra best practice is to use RandomPartitioner, so range queries on rows are not possible
     • Could manually work out each row in the range and do lots of point gets
       • This would suck - each query would be 100s of random IOs on disk
     • Need to use wide rows: the range query becomes a column slice, and each query is ~1 IO - Denormalisation

  9. So instead of this...
         Key                       #Mentions
         [01/05/11 00:01, acunu]   3
         [01/05/11 00:02, acunu]   5
         ...                       ...
     We do this
         Key                 00:01   00:02   ...
         [01/05/11, acunu]   3       5       ...
         [02/05/11, acunu]   12      4       ...
         ...                 ...     ...     ...
     Row key is the ‘big’ time bucket; column key is the ‘small’ time bucket.
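
The wide-row layout on this slide can be modelled as a dict of dicts: the outer key is the row key (day plus term) and the inner key is the column (minute). The sketch below is an in-memory model under that assumption, not Cassandra client code; `increment` and `column_slice` are hypothetical names, and dates are written in ISO form rather than the slide's dd/mm/yy.

```python
from collections import defaultdict
from datetime import datetime

# rows[(day, term)][minute] -> count; stands in for a counter column family.
rows = defaultdict(lambda: defaultdict(int))


def increment(rows, timestamp, term):
    row_key = (timestamp.strftime("%Y-%m-%d"), term)  # 'big' bucket: day + term
    column_key = timestamp.strftime("%H:%M")          # 'small' bucket: minute
    rows[row_key][column_key] += 1


def column_slice(rows, row_key, start, finish):
    """Return (column, count) pairs with start <= column <= finish, in order."""
    row = rows.get(row_key, {})
    return [(c, row[c]) for c in sorted(row) if start <= c <= finish]


# Reproduce the first row of the slide's table: 3 mentions at 00:01, 5 at 00:02.
for minute, mentions in [(1, 3), (2, 5)]:
    for _ in range(mentions):
        increment(rows, datetime(2011, 5, 1, 0, minute), "acunu")

print(column_slice(rows, ("2011-05-01", "acunu"), "00:00", "23:59"))
# [('00:01', 3), ('00:02', 5)]
```

All the counters for one day and one term live in a single row, so reading a time range within that day touches one row instead of one row per minute.
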
  10. Demo
      ./painbird.py -u tom_wilkie
      http://ec2-176-34-212-226.eu-west-1.compute.amazonaws.com:8000

  11. Now it’s your turn.....

  12. Steps
      1. Get a Twitter account - http://twitter.com
      2. Get some Cassandra VMs - http://goo.gl/Ruqlt
      3. Cluster them up
      4. Get the code - http://goo.gl/VxXKB
      5. Implement the missing bits!
      6. (Prizes for the ones that spot bugs!)

  13. Get some Cassandra VMs
      http://goo.gl/O9hkv

  14. Cluster them up
      • SSH in, set password (on both!)
      • Check you can connect to the UI
      • Use UI (click add host)

  15. Get the code
      SSH into one of the VMs:
          # curl https://acunu-oss.s3.amazonaws.com/painbird-2.tar.gz | tar zxf -
          # cd release
          # ./painbird.py -u tom_wilkie

  16. Implement the “core”
      • In core.py
      • def insert_tweet(cassandra, tweet):
      • def do_query(cassandra, term, start, finish):
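
One possible shape for the two functions is sketched below. Only the function names and parameters come from the slide. Everything else is an assumption made for illustration: `InMemoryCounters` is a hypothetical stand-in for the `cassandra` argument (not a real client API), a tweet is assumed to be a dict with `created_at` and `text` keys, and `start` and `finish` are assumed to be `date` objects.

```python
import re
from collections import defaultdict
from datetime import date, datetime, timedelta


class InMemoryCounters:
    """Hypothetical stand-in for the `cassandra` argument; not a real client."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(int))

    def add(self, row_key, column, value=1):
        self.rows[row_key][column] += value

    def get_row(self, row_key):
        return dict(self.rows.get(row_key, {}))


def insert_tweet(cassandra, tweet):
    """Increment one counter per distinct token, in the tweet's minute bucket."""
    day = tweet["created_at"].strftime("%Y-%m-%d")   # row key: 'big' bucket
    minute = tweet["created_at"].strftime("%H:%M")   # column key: 'small' bucket
    for token in set(re.findall(r"[a-z0-9_]+", tweet["text"].lower())):
        cassandra.add((day, token), minute)


def do_query(cassandra, term, start, finish):
    """Return [(day, mentions)] for each day from start to finish, inclusive."""
    results = []
    day = start
    while day <= finish:
        key = day.strftime("%Y-%m-%d")
        row = cassandra.get_row((key, term.lower()))  # one row read per day
        results.append((key, sum(row.values())))
        day += timedelta(days=1)
    return results


store = InMemoryCounters()
insert_tweet(store, {"created_at": datetime(2011, 5, 1, 12, 34, 4),
                     "text": "Man, @acunu rocks!"})
insert_tweet(store, {"created_at": datetime(2011, 5, 2, 9, 0, 0),
                     "text": "acunu acunu"})
print(do_query(store, "acunu", date(2011, 5, 1), date(2011, 5, 3)))
# [('2011-05-01', 1), ('2011-05-02', 1), ('2011-05-03', 0)]
```

Summing a whole row gives the per-day total from the example query earlier in the deck; a finer-grained query would slice the columns instead of summing them.
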
  17. Check your data
          -bash-3.2$ cassandra-cli
          Connected to: "Test Cluster" on localhost/9160
          Welcome to Cassandra CLI version 1.0.8.acunu2
          Type help; or ? for help.
          Type quit; or exit; to quit.
          [default@unknown] use painbird;
          Authenticated to keyspace: painbird
          [default@painbird] list keywords;
          Using default limit of 100
          -------------------
          RowKey: m-5-"woe
          => (counter=11, value=1)

  18. Extensions

  19. Extensions
      UI
      • Pretty graphs
      • Automatically periodically update?
      • Search multiple terms
      Painbird
      • mentions of multiple terms
      • sentiment analysis - http://www.nltk.org/
      • filtering by multiple fields (geo + keyword)
