Cassandra at Twitter for Real-Time Analytics

mubarakss

Cassandra at Twitter
(one use case)
Ryan King

Cassandra Meetup
January 12, 2011


History
‣ Started port of Tweets to Cassandra June 2009
‣ Started other Cassandra projects in 2009 (more on this later)
‣ Abandoned the Tweets port in 2010

Use cases
‣ realtime traffic/engagement analytics
‣ systems monitoring

Time Series Data
‣ write heavy
‣ stored temporally
‣ viewed temporarily
‣ hierarchical aggregation

Data Model
‣ Distributed Counters (CASSANDRA-1072)
‣ each time series is a row (or rows) of counters
‣ slice over rows to get recent data
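The bullets above can be sketched as a toy in-memory model, assuming each row key names a time series, each column name is a minute timestamp, and each value is a counter. This is not Cassandra's API: the class `CounterStore` and its methods `incr` and `slice_recent` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class CounterStore:
    """Toy stand-in for a counter column family (hypothetical, not Cassandra's API)."""

    def __init__(self):
        # row key -> {column name -> counter value}
        self.rows = defaultdict(lambda: defaultdict(int))

    def incr(self, row_key, column, delta=1):
        """Add delta to one counter; counters start at zero."""
        self.rows[row_key][column] += delta

    def slice_recent(self, row_key, count):
        """Return the `count` most recent (column, value) pairs, oldest first.

        ISO-8601 timestamps sort lexicographically in time order,
        so sorting the column names is enough.
        """
        if count <= 0:
            return []
        columns = sorted(self.rows[row_key].items())
        return columns[-count:]


store = CounterStore()
store.incr("host:web1:load1", "2011-01-12T09:59", 6)
store.incr("host:web1:load1", "2011-01-12T10:00", 5)
store.incr("host:web1:load1", "2011-01-12T10:01", 4)

# Read only the two newest columns of the series.
recent = store.slice_recent("host:web1:load1", 2)
print(recent)
```

Because writes are plain increments, a writer never has to read the current value first, which suits a write-heavy workload.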

Data Model
‣ An example (not exactly the way we do it):

                         2011-01-12T10:00  2011-01-12T10:01  ...
host:web1:load1          5                 4                 ...
host:web2:load1          4                 3                 ...
cluster:web:load1:sum    576               505               ...
cluster:web:load1:count  100               95                ...
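The `:sum` and `:count` rows in the example suggest how an average could be derived at read time by dividing one counter by the other. A minimal sketch, using only the two columns shown above; the function name `average_series` is an assumption for illustration.

```python
# Values copied from the cluster rows of the example above.
cluster_sum = {"2011-01-12T10:00": 576, "2011-01-12T10:01": 505}
cluster_count = {"2011-01-12T10:00": 100, "2011-01-12T10:01": 95}


def average_series(sums, counts):
    """Combine a :sum row and a :count row into per-minute averages."""
    averages = {}
    for minute in sorted(sums):
        n = counts.get(minute, 0)
        if n:  # skip minutes with no measurements to avoid dividing by zero
            averages[minute] = sums[minute] / n
    return averages


averages = average_series(cluster_sum, cluster_count)
for minute, value in averages.items():
    print(minute, round(value, 2))
```

Storing the sum and the count separately keeps every write a simple increment; the division happens only when someone reads the series.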

Aggregation
‣ Measured every minute (or continuously)
‣ Rollup to coarser granularities
‣ More Counters! (aka, let’s do it live)
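One way to read "More Counters!" is that each measurement increments a counter at every granularity when it is written, so no separate rollup job is needed. The sketch below assumes that reading; the `record` function, the `GRANULARITIES` table, and the in-memory `counters` dictionary are hypothetical and stand in for real counter writes.

```python
from collections import defaultdict
from datetime import datetime

# granularity name -> strftime pattern used for the column name (assumed layout)
GRANULARITIES = {
    "minutes": "%Y-%m-%dT%H:%M",
    "hours": "%Y-%m-%dT%H",
}

# (granularity, row key) -> {column name -> counter value}
counters = defaultdict(lambda: defaultdict(int))


def record(series, when, value):
    """Increment the sum and count counters at every granularity."""
    for granularity, pattern in GRANULARITIES.items():
        column = when.strftime(pattern)
        counters[(granularity, series + ":sum")][column] += value
        counters[(granularity, series + ":count")][column] += 1


record("host:web1:load1", datetime(2011, 1, 12, 10, 0), 5)
record("host:web1:load1", datetime(2011, 1, 12, 10, 1), 4)

hour_sum = counters[("hours", "host:web1:load1:sum")]["2011-01-12T10"]
hour_count = counters[("hours", "host:web1:load1:count")]["2011-01-12T10"]
print(hour_sum, hour_count)
```

The cost of this design is write amplification: every measurement turns into two increments per granularity.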

Aggregation
Minutes                  2011-01-12T10:00  2011-01-12T10:01  ...
host:web1:load1:sum      5                 4                 ...
host:web1:load1:count    1                 1                 ...
host:web2:load1:sum      4                 3                 ...
host:web2:load1:count    1                 1                 ...
cluster:web:load1:sum    576               505               ...
cluster:web:load1:count  100               95                ...

Hours                    2011-01-12T10     2011-01-12T11     ...
host:web1:load1:sum      300               250               ...
host:web1:load1:count    60                59                ...
host:web2:load1:sum      4                 3                 ...
host:web2:load1:count    61                60                ...
cluster:web:load1:sum    3010              2995              ...
cluster:web:load1:count  6000              5990              ...

Aggregation
‣ other dimensions besides time:
‣ clusters
‣ racks / dcs, etc
‣ And combinations of the above
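To illustrate aggregating by combinations of dimensions, the sketch below enumerates one row key per non-empty combination of a measurement's dimensions. The key layout extends the `cluster:web:load1` style from the earlier example, but the function `aggregate_keys` and the rack and dc values are hypothetical.

```python
from itertools import combinations


def aggregate_keys(metric, dimensions):
    """Return one row key per non-empty combination of dimensions.

    `dimensions` maps a dimension name to this measurement's value,
    e.g. {"cluster": "web", "rack": "r12"}.
    """
    names = sorted(dimensions)
    keys = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            prefix = ":".join(f"{name}:{dimensions[name]}" for name in combo)
            keys.append(f"{prefix}:{metric}")
    return keys


# Hypothetical dimension values for a single load measurement.
keys = aggregate_keys("load1", {"cluster": "web", "rack": "r12", "dc": "dc1"})
for key in keys:
    print(key)
```

With n dimensions this yields 2^n - 1 keys per metric, each of which would need its own counters, so the number of stored series grows quickly as dimensions are added.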

Pros / Cons
‣ Pros
‣ real-time data (average 30s between measurement and visibility)
‣ real-time aggregation
‣ flexible data retention (once counters and TTLs work together)
‣ Cons
‣ Storage-intensive
‣ Slow reads

Questions?

ryan@twitter.com
twitter.com/rk




Obligatory Plug

twitter.com/jobs

Editor's Notes

‣ this is our most successful use case; we had a general need for realtime, high-scale time series data
‣ counters work in trunk; some things, like averages, can be modeled as several counters that get combined at read time
‣ you can aggregate by many dimensions at once; every combination is persisted separately
