Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012
Slides from my tutorial at Denormalized London on 21 Sept 2012


  • 1. Realtime Analytics with Cassandra or: How I Learned to Stop Worrying and Love Counting
  • 2. Analytics: combining “big” and “real-time” is hard. Live & historical; drill downs; trends... aggregates... and roll ups
  • 3. What is Realtime Analytics? e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”. A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data
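As a sanity check on those figures (the ~140-byte average tweet size is my assumption; the slide only gives the totals):

```python
# Rough arithmetic behind the slide's figures (140 B/tweet is assumed, not stated)
tweets = 30_000_000_000          # ~30 billion tweets
bytes_per_tweet = 140            # assumed average payload size
total_tb = tweets * bytes_per_tweet / 10**12
print(round(total_tb, 1))        # 4.2
```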
  • 4. Okay, so how are we going to do it? Twitter → tweets → ? → counter updates. Push processing into the ingest phase; make queries fast
  • 5. Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters
  • 6. Preparing the data
        Step 1: Get a feed of the tweets
          12:32:15 I like #trafficlights
          12:33:43 Nobody expects...
          12:33:49 I ate a #bee; woe is...
          12:34:04 Man, @acunu rocks!
        Step 2: Tokenise the tweet
        Step 3: Increment counters in time buckets for each token
          [1234, man]   +1
          [1234, acunu] +1
          [1234, rock]  +1
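The ingest steps above can be sketched in Python. A plain `Counter` stands in for Cassandra's counter columns, and the tokeniser is a guess at the slide's scheme; no stemming is applied (the slide's `rock` for "rocks" suggests the real code stemmed tokens):

```python
import re
from collections import Counter

counters = Counter()  # (time_bucket, token) -> count; stands in for Cassandra

def insert_tweet(minute_bucket, text):
    # Step 2: tokenise - lowercase alphanumeric runs, which also strips # and @
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Step 3: one counter increment per (time bucket, token)
    for token in set(tokens):
        counters[(minute_bucket, token)] += 1

insert_tweet(1234, "Man, @acunu rocks!")
print(counters[(1234, "acunu")])  # 1
```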
  • 7. Querying
        Step 1: Do a range query (start: [01/05/11, acunu], end: [30/05/11, acunu])
        Step 2: Result table
          Key                      #Mentions
          [01/05/11 00:01, acunu]  3
          [01/05/11 00:02, acunu]  5
          ...                      ...
        Step 3: Plot pretty graph (mentions per month, May–Nov)
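A minimal sketch of the query side, reading back the counters written at ingest time (a dict stands in for Cassandra, and bucket keys are simplified to integers):

```python
# (minute_bucket, term) -> count, as written during ingest
counters = {(1, "acunu"): 3, (2, "acunu"): 5}

def do_query(term, start, finish):
    # Steps 1-2: read every counter for `term` across the bucket range,
    # treating missing buckets as zero
    return [(b, counters.get((b, term), 0)) for b in range(start, finish + 1)]

print(do_query("acunu", 1, 3))  # [(1, 3), (2, 5), (3, 0)]
```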
  • 8. Except it’s not that easy... Cassandra best practice is to use RandomPartitioner, so range queries over row keys are not possible. We could manually work out each row in the range and do lots of point gets, but that would suck: each query would be hundreds of random IOs on disk. Instead, use wide rows: a range query becomes a column slice, and each query is ~1 IO. Denormalisation.
  • 9. So instead of this...
          Key                      #Mentions
          [01/05/11 00:01, acunu]  3
          [01/05/11 00:02, acunu]  5
          ...                      ...
        ...we do this:
          Key                00:01  00:02  ...
          [01/05/11, acunu]  3      5      ...
          [02/05/11, acunu]  12     4      ...
          ...                ...    ...    ...
        The row key is the ‘big’ time bucket; the column key is the ‘small’ time bucket.
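A sketch of that wide-row layout, with a dict of dicts standing in for Cassandra rows and columns (key names and bucket sizes are my simplification of the slide's scheme):

```python
# Wide rows: row key = (day, term), column key = minute-of-day.
# All of a day's counters share one row, so a day's query is one column slice.
rows = {}  # row_key -> {column_key: count}; stands in for Cassandra

def increment(day, minute, term):
    row = rows.setdefault((day, term), {})
    row[minute] = row.get(minute, 0) + 1

def slice_query(day, term, start, finish):
    # one row fetch, then slice its columns - roughly one IO per day queried
    row = rows.get((day, term), {})
    return {m: c for m, c in sorted(row.items()) if start <= m <= finish}

for _ in range(3):
    increment("01/05/11", 1, "acunu")
increment("01/05/11", 2, "acunu")
print(slice_query("01/05/11", "acunu", 0, 59))  # {1: 3, 2: 1}
```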
  • 10. Demo: ./ -u tom_wilkie
  • 11. Now it’s your turn...
  • 12. 1. Get a Twitter account - http://twitter.com
        2. Get some Cassandra VMs
        3. Cluster them up
        4. Get the code
        5. Implement the missing bits!
        6. (Prizes for the ones that spot bugs!)
  • 13. Get some Cassandra VMs
  • 14. Cluster them up
        • SSH in, set password (on both!)
        • Check you can connect to the UI
        • Use the UI (click add host)
  • 15. Get the code. SSH into one of the VMs:
        # curl | tar zxf -
        # cd release
        # ./ -u tom_wilkie
  • 16. Implement the “core”
        • In
        • def insert_tweet(cassandra, tweet):
        • def do_query(cassandra, term, start, finish):
  • 17. Check your data
        -bash-3.2$ cassandra-cli
        Connected to: "Test Cluster" on localhost/9160
        Welcome to Cassandra CLI version 1.0.8.acunu2
        Type help; or ? for help.
        Type quit; or exit; to quit.
        [default@unknown] use painbird;
        Authenticated to keyspace: painbird
        [default@painbird] list keywords;
        Using default limit of 100
        -------------------
        RowKey: m-5-"woe
        => (counter=11, value=1)
  • 18. Extensions
  • 19. Extensions
        UI:
        • Pretty graphs
        • Automatically update periodically?
        • Search multiple terms
        Painbird:
        • Mentions of multiple terms
        • Sentiment analysis -
        • Filtering by multiple fields (geo + keyword)