SlideShare a Scribd company logo
1 of 20
Download to read offline
Realtime Analytics
 with Cassandra
   or: How I Learned to
  Stopped Worrying and
       Love Counting


                          1
What is Realtime Analytics?
    eg “show me the number of mentions of
        ‘Acunu’ per day, between May and
          November 2011, on Twitter”


       Batch (Hadoop) approach would
    require processing ~30 billion tweets,
              or ~4.2 TB of data
                 http://blog.twitter.com/2011/03/numbers.html




                                                                2
Introduction



    Live & historical
      aggregates...




3

                                       3
Realtime trends...




4

                         4
Drill downs
    and roll ups


5

                   5
Okay, so how are we
     going to do it?
For each tweet,
increment a bunch of counters,
such that answering a query
is as easy as reading some counters

                                      6
Preparing the data
                              12:32:15 I like #trafficlights
Step 1: Get a feed of    12:33:43 Nobody expects...
        the tweets     12:33:49 I ate a #bee; woe is...
                      12:34:04 Man, @acunu rocks!

Step 2: Tokenise the
        tweet

Step 3: Increment counters            [1234, man]   +1
        in time buckets for           [1234, acunu] +1
        each token                    [1234, rock] +1

                                                              7
Querying
                            start: [01/05/11, acunu]
Step 1: Do a range query    end:   [30/05/11, acunu]

                                       Key            #Mentions
                              [01/05/11 00:01, acunu]    3
Step 2: Result table          [01/05/11 00:02, acunu]    5
                                        ...              ...


                              90

Step 3: Plot pretty graph     45
                               0
                                   May Jun Jul Aug Sept Oct Nov


                                                                  8
Except it’s not that easy...
• Cassandra best practice is to use RandomPartitioner,
  so not possible to range queries on rows
• Could manually work out each row in range, do lots of
  point gets
  • This would suck - each query would be 100’s of random
    IOs on disk
• Need to use wide rows, range query is a column slice,
  each query ~1 IO - Denormalisation

                                                            9
So instead of this...
                              Key            #Mentions
                     [01/05/11 00:01, acunu]    3
                     [01/05/11 00:02, acunu]    5
                               ...              ...




                     We do this
                  Key           00:01       00:02        ...
            [01/05/11, acunu]     3           5          ...
            [02/05/11, acunu]    12           4          ...
                    ...           ...                    ...

Row key is ‘big’                     Column key is ‘small’
 time bucket                             time bucket
                                                               10
Demo
./painbird.py -u tom_wilkie




                              11
Now its your
  turn.....



               12
1. Get a twitter account - http://twitter.com

2. Get some Cassandra VMs - http://goo.gl/O9hkv

3. Cluster them up

4. Get the code - http://goo.gl

5. Implement the missing bits!

6. (Prizes for the ones that spot bugs!)


                                                  13
Get some Cassandra
       VMs


http://goo.gl/O9hkv


                      14
Cluster them up
• SSH in, set password (on both!)
• Check you can connect to the UI
• Use UI (click add host)




                                    15
Get the code
SSH into one of the VMs:
# curl https://acunu-
oss.s3.amazonaws.com/painbird.tar.gz
| tar zxf -
# curl -o pycassa.rpm https://acunu-
oss.s3.amazonaws.com/pycassa.rpm
# rpm -i pycassa.rpm
# cd release
# ./painbird.py -u tom_wilkie
                                       16
Implement the “core”

• In core.py
• def insert_tweet(cassandra, tweet):
• def do_query(cassandra, term, start, finish):


                                                 17
Check you data
-bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)
                                             18
Extensions



             19
UI                        Painbird
• Pretty graphs           •   mentions of multiple
• Automatically               terms
  periodically update     •   sentiment analysis -
• Search multiple terms       http://www.nltk.org/




                                                     20

More Related Content

Similar to Realtime Analytics on the Twitter Firehose with Cassandra

Columnar processing for SQL-on-Hadoop: The best is yet to come
Columnar processing for SQL-on-Hadoop: The best is yet to comeColumnar processing for SQL-on-Hadoop: The best is yet to come
Columnar processing for SQL-on-Hadoop: The best is yet to come
Wang Zuo
 

Similar to Realtime Analytics on the Twitter Firehose with Cassandra (20)

All Your Base
All Your BaseAll Your Base
All Your Base
 
The Hacker News: Hacking Wireless DSL routers via Admin Panel Password Reset ...
The Hacker News: Hacking Wireless DSL routers via Admin Panel Password Reset ...The Hacker News: Hacking Wireless DSL routers via Admin Panel Password Reset ...
The Hacker News: Hacking Wireless DSL routers via Admin Panel Password Reset ...
 
Kafka Summit NYC 2017 - Running Hundreds of Kafka Clusters with 5 People
Kafka Summit NYC 2017 - Running Hundreds of Kafka Clusters with 5 PeopleKafka Summit NYC 2017 - Running Hundreds of Kafka Clusters with 5 People
Kafka Summit NYC 2017 - Running Hundreds of Kafka Clusters with 5 People
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
 
Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015
 
What we Learned Implementing Puppet at Backstop
What we Learned Implementing Puppet at BackstopWhat we Learned Implementing Puppet at Backstop
What we Learned Implementing Puppet at Backstop
 
Sangam 19 - Successful Applications on Autonomous
Sangam 19 - Successful Applications on AutonomousSangam 19 - Successful Applications on Autonomous
Sangam 19 - Successful Applications on Autonomous
 
CBGTBT - Part 4 - Mining
CBGTBT - Part 4 - MiningCBGTBT - Part 4 - Mining
CBGTBT - Part 4 - Mining
 
Easier, Better, Faster, Safer Deployment with Docker and Immutable Containers
Easier, Better, Faster, Safer Deployment with Docker and Immutable ContainersEasier, Better, Faster, Safer Deployment with Docker and Immutable Containers
Easier, Better, Faster, Safer Deployment with Docker and Immutable Containers
 
Sangam 18 - Database Development: Return of the SQL Jedi
Sangam 18 - Database Development: Return of the SQL JediSangam 18 - Database Development: Return of the SQL Jedi
Sangam 18 - Database Development: Return of the SQL Jedi
 
A Percona Support Engineer Walkthrough on pt-stalk
A Percona Support Engineer Walkthrough on pt-stalkA Percona Support Engineer Walkthrough on pt-stalk
A Percona Support Engineer Walkthrough on pt-stalk
 
Columnar processing for SQL-on-Hadoop: The best is yet to come
Columnar processing for SQL-on-Hadoop: The best is yet to comeColumnar processing for SQL-on-Hadoop: The best is yet to come
Columnar processing for SQL-on-Hadoop: The best is yet to come
 
Psychtoolbox (PTB) practical course by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...
Psychtoolbox (PTB) practical course  by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...Psychtoolbox (PTB) practical course  by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...
Psychtoolbox (PTB) practical course by Volodymyr B. Bogdanov, Lyon/Kyiv 2018...
 
Tokyo APAC Groundbreakers tour - The Complete Java Developer
Tokyo APAC Groundbreakers tour - The Complete Java DeveloperTokyo APAC Groundbreakers tour - The Complete Java Developer
Tokyo APAC Groundbreakers tour - The Complete Java Developer
 
The basics of hacking and penetration testing 이제 시작이야 해킹과 침투 테스트 kenneth.s.kwon
The basics of hacking and penetration testing 이제 시작이야 해킹과 침투 테스트 kenneth.s.kwonThe basics of hacking and penetration testing 이제 시작이야 해킹과 침투 테스트 kenneth.s.kwon
The basics of hacking and penetration testing 이제 시작이야 해킹과 침투 테스트 kenneth.s.kwon
 
Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...
Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...
Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...
 
Analyze database system using a 3 d method
Analyze database system using a 3 d methodAnalyze database system using a 3 d method
Analyze database system using a 3 d method
 
Kubernetes in 30 minutes (2017/03/10)
Kubernetes in 30 minutes (2017/03/10)Kubernetes in 30 minutes (2017/03/10)
Kubernetes in 30 minutes (2017/03/10)
 
Let's write a Debugger!
Let's write a Debugger!Let's write a Debugger!
Let's write a Debugger!
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
 

More from Acunu

Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problems
Acunu
 
Acunu Analytics
Acunu AnalyticsAcunu Analytics
Acunu Analytics
Acunu
 

More from Acunu (20)

Acunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on Cassandra
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational Aspirin
 
Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problems
 
Acunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra Apps
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time Cassandra
 
Realtime Analytics with Cassandra
Realtime Analytics with CassandraRealtime Analytics with Cassandra
Realtime Analytics with Cassandra
 
Acunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra London
 
Exploring Big Data value for your business
Exploring Big Data value for your businessExploring Big Data value for your business
Exploring Big Data value for your business
 
Progressive NOSQL: Cassandra
Progressive NOSQL: CassandraProgressive NOSQL: Cassandra
Progressive NOSQL: Cassandra
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
 
Cassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraCassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into Cassandra
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
 
Next Generation Cassandra
Next Generation CassandraNext Generation Cassandra
Next Generation Cassandra
 
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
 
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixCassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
 
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
 
Cassandra EU 2012 - Data modelling workshop by Richard Low
Cassandra EU 2012 - Data modelling workshop by Richard LowCassandra EU 2012 - Data modelling workshop by Richard Low
Cassandra EU 2012 - Data modelling workshop by Richard Low
 
Acunu Analytics
Acunu AnalyticsAcunu Analytics
Acunu Analytics
 
Cassandra Performance: Past, present & future
Cassandra Performance: Past, present & futureCassandra Performance: Past, present & future
Cassandra Performance: Past, present & future
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Realtime Analytics on the Twitter Firehose with Cassandra

  • 1. Realtime Analytics with Cassandra or: How I Learned to Stopped Worrying and Love Counting 1
  • 2. What is Realtime Analytics? eg “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter” Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data http://blog.twitter.com/2011/03/numbers.html 2
  • 3. Introduction Live & historical aggregates... 3 3
  • 5. Drill downs and roll ups 5 5
  • 6. Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters 6
  • 7. Preparing the data 12:32:15 I like #trafficlights Step 1: Get a feed of 12:33:43 Nobody expects... the tweets 12:33:49 I ate a #bee; woe is... 12:34:04 Man, @acunu rocks! Step 2: Tokenise the tweet Step 3: Increment counters [1234, man] +1 in time buckets for [1234, acunu] +1 each token [1234, rock] +1 7
  • 8. Querying start: [01/05/11, acunu] Step 1: Do a range query end: [30/05/11, acunu] Key #Mentions [01/05/11 00:01, acunu] 3 Step 2: Result table [01/05/11 00:02, acunu] 5 ... ... 90 Step 3: Plot pretty graph 45 0 May Jun Jul Aug Sept Oct Nov 8
  • 9. Except it’s not that easy... • Cassandra best practice is to use RandomPartitioner, so not possible to range queries on rows • Could manually work out each row in range, do lots of point gets • This would suck - each query would be 100’s of random IOs on disk • Need to use wide rows, range query is a column slice, each query ~1 IO - Denormalisation 9
  • 10. So instead of this... Key #Mentions [01/05/11 00:01, acunu] 3 [01/05/11 00:02, acunu] 5 ... ... We do this Key 00:01 00:02 ... [01/05/11, acunu] 3 5 ... [02/05/11, acunu] 12 4 ... ... ... ... Row key is ‘big’ Column key is ‘small’ time bucket time bucket 10
  • 12. Now its your turn..... 12
  • 13. 1. Get a twitter account - http://twitter.com 2. Get some Cassandra VMs - http://goo.gl/O9hkv 3. Cluster them up 4. Get the code - http://goo.gl 5. Implement the missing bits! 6. (Prizes for the ones that spot bugs!) 13
  • 14. Get some Cassandra VMs http://goo.gl/O9hkv 14
  • 15. Cluster them up • SSH in, set password (on both!) • Check you can connect to the UI • Use UI (click add host) 15
  • 16. Get the code SSH into one of the VMs: # curl https://acunu- oss.s3.amazonaws.com/painbird.tar.gz | tar zxf - # curl -o pycassa.rpm https://acunu- oss.s3.amazonaws.com/pycassa.rpm # rpm -i pycassa.rpm # cd release # ./painbird.py -u tom_wilkie 16
  • 17. Implement the “core” • In core.py • def insert_tweet(cassandra, tweet): • def do_query(cassandra, term, start, finish): 17
  • 18. Check you data -bash-3.2$ cassandra-cli Connected to: "Test Cluster" on localhost/9160 Welcome to Cassandra CLI version 1.0.8.acunu2 Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit. [default@unknown] use painbird; Authenticated to keyspace: painbird [default@painbird] list keywords; Using default limit of 100 ------------------- RowKey: m-5-"woe => (counter=11, value=1) 18
  • 20. UI Painbird • Pretty graphs • mentions of multiple • Automatically terms periodically update • sentiment analysis - • Search multiple terms http://www.nltk.org/ 20