Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012

Slides from my tutorial at Denormalized London on 21 Sept 2012

Transcript

  • 1. Realtime Analytics with Cassandra, or: How I Learned to Stop Worrying and Love Counting
  • 2. Combining “big” and “real-time” analytics is hard: live & historical data, drill downs, trends, aggregates and roll ups
  • 3. What is Realtime Analytics? e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”. A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data. http://blog.twitter.com/2011/03/numbers.html
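The batch-side arithmetic on this slide can be sanity-checked in a couple of lines (the per-tweet byte size is derived from the slide's two figures, not stated there):

```python
# Back-of-envelope cost of the batch (Hadoop) approach.
tweets = 30e9           # ~30 billion tweets, May-November 2011
data_bytes = 4.2e12     # ~4.2 TB of raw tweet data

bytes_per_tweet = data_bytes / tweets
print(round(bytes_per_tweet))  # ~140 bytes per tweet

# Every query re-scans the full 4.2 TB; the counter approach below
# instead reads a handful of pre-aggregated rows.
```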
  • 4. Okay, so how are we going to do it? Twitter → tweets → ? → counter updates • Push processing into the ingest phase • Make queries fast
  • 5. Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters
  • 6. Preparing the data. Step 1: Get a feed of the tweets (12:32:15 I like #trafficlights; 12:33:43 Nobody expects...; 12:33:49 I ate a #bee; woe is...; 12:34:04 Man, @acunu rocks!). Step 2: Tokenise the tweet. Step 3: Increment counters in time buckets for each token: [1234, man] +1, [1234, acunu] +1, [1234, rock] +1
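The three steps above can be sketched in plain Python. The `counters` dict stands in for a Cassandra counter column family, and the tokeniser and bucket granularity are illustrative choices, not the tutorial's exact code:

```python
import re
from collections import Counter

# Stand-in for a Cassandra counter column family:
# key = (time bucket, token), value = running count.
counters = Counter()

def tokenise(text):
    """Step 2: lower-case the tweet, split into word/#hashtag/@user tokens."""
    return re.findall(r"[#@]?\w+", text.lower())

def insert_tweet(bucket, text):
    """Step 3: increment one counter per token in this time bucket."""
    for token in tokenise(text):
        counters[(bucket, token)] += 1

insert_tweet(1234, "Man, @acunu rocks!")
insert_tweet(1234, "I like @acunu")
# counters[(1234, "@acunu")] is now 2
```

In the real pipeline each increment would be a counter-column update sent to Cassandra at ingest time, which is what makes the later query so cheap.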
  • 7. Querying. Step 1: Do a range query (start: [01/05/11, acunu], end: [30/05/11, acunu]). Step 2: Build the result table (Key → #Mentions: [01/05/11 00:01, acunu] → 3; [01/05/11 00:02, acunu] → 5; ...). Step 3: Plot a pretty graph of mentions per month, May to November.
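Under this naive one-row-per-minute schema, the read path can be sketched with a toy in-memory table (names and granularity are illustrative):

```python
from datetime import datetime, timedelta

# Toy counter table keyed by (minute bucket, term), as on the slide.
mentions = {
    (datetime(2011, 5, 1, 0, 1), "acunu"): 3,
    (datetime(2011, 5, 1, 0, 2), "acunu"): 5,
}

def do_query(term, start, finish):
    """Steps 1+2: one point-read per minute bucket in [start, finish]."""
    rows, t = [], start
    while t <= finish:
        rows.append((t, mentions.get((t, term), 0)))
        t += timedelta(minutes=1)
    return rows

result = do_query("acunu", datetime(2011, 5, 1, 0, 1),
                  datetime(2011, 5, 1, 0, 2))
# result == [(datetime(2011, 5, 1, 0, 1), 3), (datetime(2011, 5, 1, 0, 2), 5)]
```

Note the loop issues one read per bucket; the next slide explains why that is the problem this schema runs into on a real cluster.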
  • 8. Except it’s not that easy... • Cassandra best practice is to use RandomPartitioner, so it is not possible to do range queries over rows • We could manually work out each row in the range and do lots of point gets • This would suck - each query would be 100s of random IOs on disk • Need to use wide rows, so the range query becomes a column slice and each query is ~1 IO - denormalisation
  • 9. So instead of one row per (time, term) pair (Key → #Mentions: [01/05/11 00:01, acunu] → 3; [01/05/11 00:02, acunu] → 5; ...), we do one wide row per day: Key [01/05/11, acunu] → columns 00:01 = 3, 00:02 = 5, ...; Key [02/05/11, acunu] → columns 00:01 = 12, 00:02 = 4, ... The row key is a ‘big’ time bucket, the column key a ‘small’ time bucket.
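The wide-row layout can be mimicked with a dict of dicts - one row per (day, term), one column per minute - so a whole day's counts come back as a single slice. A sketch under those assumptions, not the tutorial's code:

```python
from collections import defaultdict

# Wide row: row key = (day bucket, term), column key = minute bucket.
rows = defaultdict(dict)

def bump(day, minute, term, n=1):
    """Increment the counter column for one minute within a day's row."""
    cols = rows[(day, term)]
    cols[minute] = cols.get(minute, 0) + n

def column_slice(day, term, first, last):
    """One 'IO': read a contiguous slice of columns from a single row."""
    return {m: c for m, c in sorted(rows[(day, term)].items())
            if first <= m <= last}

bump("01/05/11", "00:01", "acunu", 3)
bump("01/05/11", "00:02", "acunu", 5)
bump("02/05/11", "00:01", "acunu", 12)
# column_slice("01/05/11", "acunu", "00:00", "23:59") == {"00:01": 3, "00:02": 5}
```

In Cassandra the columns of a row are kept sorted by the comparator, which is what makes the slice a single sequential read rather than many point gets.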
  • 10. Demo: ./painbird.py -u tom_wilkie http://ec2-176-34-212-226.eu-west-1.compute.amazonaws.com:8000
  • 11. Now it’s your turn...
  • 12. 1. Get a Twitter account - http://twitter.com 2. Get some Cassandra VMs - http://goo.gl/Ruqlt 3. Cluster them up 4. Get the code - http://goo.gl/VxXKB 5. Implement the missing bits! 6. (Prizes for the ones that spot bugs!)
  • 13. Get some Cassandra VMs: http://goo.gl/O9hkv
  • 14. Cluster them up• SSH in, set password (on both!)• Check you can connect to the UI• Use UI (click add host)
  • 15. Get the code. SSH into one of the VMs:
    # curl https://acunu-oss.s3.amazonaws.com/painbird-2.tar.gz | tar zxf -
    # cd release
    # ./painbird.py -u tom_wilkie
  • 16. Implement the “core” • In core.py • def insert_tweet(cassandra, tweet): • def do_query(cassandra, term, start, finish):
  • 17. Check your data:
    -bash-3.2$ cassandra-cli
    Connected to: "Test Cluster" on localhost/9160
    Welcome to Cassandra CLI version 1.0.8.acunu2
    Type help; or ? for help.
    Type quit; or exit; to quit.
    [default@unknown] use painbird;
    Authenticated to keyspace: painbird
    [default@painbird] list keywords;
    Using default limit of 100
    -------------------
    RowKey: m-5-"woe
    => (counter=11, value=1)
  • 18. Extensions
  • 19. Extensions. UI: • Pretty graphs • Automatically periodically update? • Search multiple terms. Painbird: • mentions of multiple terms • sentiment analysis - http://www.nltk.org/ • filtering by multiple fields (geo + keyword)
