Rainbird: Realtime Analytics at Twitter (Strata 2011)

Kevin Weil
Kevin WeilAnalytics Lead at Twitter
Rainbird:
Real-time Analytics @Twitter
Kevin Weil -- @kevinweil
Product Lead for Revenue, Twitter




                                    TM
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
My Background
‣   Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh
    routing algorithms, GBs of data
‣   Cooliris (web media): Hadoop and Pig for
    analytics, TBs of data
‣   Twitter: Hadoop, Pig, HBase, Cassandra, data
    viz, social graph analysis, soon to be PBs of data
My Background
‣   Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh
    routing algorithms, GBs of data
‣   Cooliris (web media): Hadoop and Pig for
    analytics, TBs of data
‣   Twitter: Hadoop, Pig, HBase, Cassandra, data
    viz, social graph analysis, soon to be PBs of data
    Now revenue products!
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Why Real-time Analytics
‣   Twitter is real-time
Why Real-time Analytics
‣   Twitter is real-time
‣   ... even in space
And My Personal Favorite
And My Personal Favorite
Real-time Reporting
‣   Discussion around ad-based revenue model
‣   Help shape the conversation in real-time with
    Promoted Tweets
Real-time Reporting
‣   Discussion around ad-based revenue model
‣   Help shape the conversation in real-time with
    Promoted Tweets
‣   Realtime reporting
    ties it all together
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS

‣   High read volume
‣      Needs to scale to 10,000s of RPS
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS

‣   High read volume
‣      Needs to scale to 10,000s of RPS

‣   Horizontally scalable (reads, storage, etc)
‣      Needs to scale to 100+ TB
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS

‣   High read volume
‣      Needs to scale to 10,000s of RPS

‣   Horizontally scalable (reads, storage, etc)
‣      Needs to scale to 100+ TB

‣   Low latency
‣      Most reads <100 ms (esp. recent data)
Cassandra
‣   Pro: In-house expertise
‣   Pro: Open source Apache project
‣   Pro: Writes are extremely fast
‣   Pro: Horizontally scalable, low latency
‣   Pro: Other startup adoption (Digg, SimpleGeo)
Cassandra
‣   Pro: In-house expertise
‣   Pro: Open source Apache project
‣   Pro: Writes are extremely fast
‣   Pro: Horizontally scalable, low latency
‣   Pro: Other startup adoption (Digg, SimpleGeo)




‣   Con: It was really young (0.3a)
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
‣   And @lenn0x
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
‣   And @lenn0x
‣   A dude from
    Sweden began helping: @skr
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
‣   And @lenn0x
‣   A dude from
    Sweden began helping: @skr


‣   Now all at Twitter :)
Rainbird
‣   It counts things. Really quickly.
‣   Layers on top of the distributed
    counters patch, CASSANDRA-1072
Rainbird
‣   It counts things. Really quickly.
‣   Layers on top of the distributed
    counters patch, CASSANDRA-1072


‣   Relies on Zookeeper, Cassandra, Scribe, Thrift
‣   Written in Scala
Rainbird Design
‣   Aggregators
    buffer for 1m
‣   Intelligent
    flush to
    Cassandra
‣   Query
    servers read
    once written
‣   1m is
    configurable
Rainbird Data Structures
struct Event
{
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               Unix timestamp of event
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               Stat category name
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               Stat keys (hierarchical)
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               Actual count (diff)
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               More later
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Hierarchical Aggregation
‣   Say we’re counting Promoted Tweet impressions
‣   category = pti
‣   keys = [advertiser_id, campaign_id, tweet_id]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [advertiser_id, campaign_id, tweet_id]
‣      [advertiser_id, campaign_id]
‣      [advertiser_id]
‣   Means fast queries over each level of hierarchy
‣   Configurable in rainbird.conf, or dynamically via ZK
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]
‣      [com, amazon]
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]          How many people tweeted
‣      [com, amazon]                 full URL?
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]          How many people tweeted
‣      [com, amazon]                 any music.amazon.com URL?
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]          How many people tweeted
‣      [com, amazon]                 any amazon.com URL?
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]          How many people tweeted
‣      [com, amazon]                 any .com URL?
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Temporal Aggregation
‣   Rainbird also does (configurable) temporal
    aggregation
‣   Each count is kept minutely, but also
    denormalized hourly, daily, and all time
‣   Gives us quick counts at varying granularities
    with no large scans at read time
‣      Trading storage for latency
Multiple Formulas
‣   So far we have talked about sums
‣   Could also store counts (1 for each event)
‣   ... which gives us a mean
‣   And sums of squares (count * count for each event)
‣   ... which gives us a standard deviation
‣   And min/max as well


‣   Configure this per-category in rainbird.conf
Rainbird
‣   Write 100,000s of events per second, each with
    hierarchical structure
‣   Query with minutely granularity over any level of
    the hierarchy, get back a time series
‣   Or query all time values
‣   Or query all time means, standard deviations
‣   Latency < 100ms
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Production Uses
‣   It turns out we need to count things all the time
‣   As soon as we had this service, we started
    finding all sorts of use cases for it
‣      Promoted Products
‣      Tweeted URLs, by domain/subdomain
‣      Per-user Tweet interactions (fav, RT, follow)
‣      Arbitrary terms in Tweets
‣      Clicks on t.co URLs
Use Cases
‣   Promoted Tweet Analytics
Each different metric is part
Production Uses                of the key hierarchy

‣   Promoted Tweet Analytics
Uses the temporal
                               aggregation to quickly show
Production Uses                different levels of granularity

‣   Promoted Tweet Analytics
Data can be historical, or
Production Uses                from 60 seconds ago

‣   Promoted Tweet Analytics
Production Uses
‣   Internal Monitoring and Alerting




‣   We require operational reporting on all internal services
‣   Needs to be real-time, but also want longer-term
    aggregates
‣   Hierarchical, too: [stat,   datacenter, service, machine]
Production Uses
‣   Tweet Button Counts




‣   Tweet Button counts are requested many many
    times each day from across the web
‣   Uses the all time field
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Open Source?
‣   Yes!
Open Source?
‣   Yes!   ... but not yet
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
‣      ... also relies on some internal frameworks we need to
    open source
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
‣      ... also relies on some internal frameworks we need to
    open source
‣   It will happen
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
‣      ... also relies on some internal frameworks we need to
    open source
‣   It will happen
‣   See http://github.com/twitter for proof of how much
    Twitter    open source
Team
‣   John Corwin (@johnxorz)
‣   Adam Samet (@damnitsamet)
‣   Johan Oskarsson (@skr)
‣   Kelvin Kakugawa (@kelvin)
‣   Chris Goffinet (@lenn0x)
‣   Steve Jiang (@sjiang)
‣   Kevin Weil (@kevinweil)
If You Only Remember One Slide...
‣   Rainbird is a distributed, high-volume counting
    service built on top of Cassandra
‣   Write 100,000s events per second, query it with
    hierarchy and multiple time granularities, returns
    results in <100 ms
‣   Used by Twitter for multiple products internally,
    including our Promoted Products, operational
    monitoring and Tweet Button
‣   Will be open sourced so the community can use and
    improve it!
Questions?
        Follow me: @kevinweil




                       TM
1 of 60

Recommended

Real Time Analytics: Algorithms and Systems by
Real Time Analytics: Algorithms and SystemsReal Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and SystemsArun Kejariwal
23K views180 slides
Unique ID generation in distributed systems by
Unique ID generation in distributed systemsUnique ID generation in distributed systems
Unique ID generation in distributed systemsDave Gardner
79.5K views31 slides
Distributed Counters in Cassandra (Cassandra Summit 2010) by
Distributed Counters in Cassandra (Cassandra Summit 2010)Distributed Counters in Cassandra (Cassandra Summit 2010)
Distributed Counters in Cassandra (Cassandra Summit 2010)kakugawa
2.4K views41 slides
Making Apache Spark Better with Delta Lake by
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
5.4K views40 slides
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn by
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn confluent
2.3K views56 slides
A visual introduction to Apache Kafka by
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache KafkaPaul Brebner
4.7K views84 slides

More Related Content

What's hot

Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams... by
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...confluent
4.4K views55 slides
Data Streaming Ecosystem Management at Booking.com by
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com confluent
6K views75 slides
Druid by
DruidDruid
DruidDori Waldman
1.6K views45 slides
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C... by
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...Databricks
6K views88 slides
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 by
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah
61.3K views43 slides
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf by
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfAltinity Ltd
3.2K views48 slides

What's hot(20)

Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams... by confluent
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
confluent4.4K views
Data Streaming Ecosystem Management at Booking.com by confluent
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
confluent6K views
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C... by Databricks
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Databricks6K views
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 by mumrah
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
mumrah61.3K views
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf by Altinity Ltd
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Altinity Ltd3.2K views
Introduction to Redis by Dvir Volk
Introduction to RedisIntroduction to Redis
Introduction to Redis
Dvir Volk121.1K views
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc... by Databricks
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Databricks855 views
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at... by Yael Garten
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Yael Garten2.7K views
Extending Twitter's Data Platform to Google Cloud by DataWorks Summit
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit1K views
My first 90 days with ClickHouse.pdf by Alkin Tezuysal
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdf
Alkin Tezuysal501 views
Airflow를 이용한 데이터 Workflow 관리 by YoungHeon (Roy) Kim
Airflow를 이용한  데이터 Workflow 관리Airflow를 이용한  데이터 Workflow 관리
Airflow를 이용한 데이터 Workflow 관리
YoungHeon (Roy) Kim9.1K views
Kafka Tutorial - Introduction to Apache Kafka (Part 1) by Jean-Paul Azar
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Jean-Paul Azar8.9K views
Kafka Security 101 and Real-World Tips by confluent
Kafka Security 101 and Real-World Tips Kafka Security 101 and Real-World Tips
Kafka Security 101 and Real-World Tips
confluent6.6K views
A Thorough Comparison of Delta Lake, Iceberg and Hudi by Databricks
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks11.1K views
[236] 카카오의데이터파이프라인 윤도영 by NAVER D2
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영
NAVER D212.9K views

Viewers also liked

Scalable Event Analytics with MongoDB & Ruby on Rails by
Scalable Event Analytics with MongoDB & Ruby on RailsScalable Event Analytics with MongoDB & Ruby on Rails
Scalable Event Analytics with MongoDB & Ruby on RailsJared Rosoff
28.3K views45 slides
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010) by
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Kevin Weil
5.4K views75 slides
Hadoop and pig at twitter (oscon 2010) by
Hadoop and pig at twitter (oscon 2010)Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Kevin Weil
9.5K views35 slides
NoSQL at Twitter (NoSQL EU 2010) by
NoSQL at Twitter (NoSQL EU 2010)NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)Kevin Weil
94.2K views156 slides
Hadoop, Pig, and Twitter (NoSQL East 2009) by
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
143.2K views58 slides
Protocol Buffers and Hadoop at Twitter by
Protocol Buffers and Hadoop at TwitterProtocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterKevin Weil
41.9K views49 slides

Viewers also liked(19)

Scalable Event Analytics with MongoDB & Ruby on Rails by Jared Rosoff
Scalable Event Analytics with MongoDB & Ruby on RailsScalable Event Analytics with MongoDB & Ruby on Rails
Scalable Event Analytics with MongoDB & Ruby on Rails
Jared Rosoff28.3K views
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010) by Kevin Weil
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Kevin Weil5.4K views
Hadoop and pig at twitter (oscon 2010) by Kevin Weil
Hadoop and pig at twitter (oscon 2010)Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)
Kevin Weil9.5K views
NoSQL at Twitter (NoSQL EU 2010) by Kevin Weil
NoSQL at Twitter (NoSQL EU 2010)NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)
Kevin Weil94.2K views
Hadoop, Pig, and Twitter (NoSQL East 2009) by Kevin Weil
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
Kevin Weil143.2K views
Protocol Buffers and Hadoop at Twitter by Kevin Weil
Protocol Buffers and Hadoop at TwitterProtocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at Twitter
Kevin Weil41.9K views
Hadoop summit 2010 frameworks panel elephant bird by Kevin Weil
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant bird
Kevin Weil2.1K views
Hadoop at Twitter (Hadoop Summit 2010) by Kevin Weil
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)
Kevin Weil7.1K views
Big Data at Twitter, Chirp 2010 by Kevin Weil
Big Data at Twitter, Chirp 2010Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010
Kevin Weil14K views
Email Optimization: A discussion about how A/B testing generated $500 million... by MarketingSherpa
Email Optimization: A discussion about how A/B testing generated $500 million...Email Optimization: A discussion about how A/B testing generated $500 million...
Email Optimization: A discussion about how A/B testing generated $500 million...
MarketingSherpa5K views
Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013 by Passle
Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013
Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013
Passle5.3K views
Using Single Keyword Ad Groups To Drive PPC Performance by Sam Owen
Using Single Keyword Ad Groups To Drive PPC PerformanceUsing Single Keyword Ad Groups To Drive PPC Performance
Using Single Keyword Ad Groups To Drive PPC Performance
Sam Owen4.8K views
The Social Consumer Study by Don Bulmer
The Social Consumer StudyThe Social Consumer Study
The Social Consumer Study
Don Bulmer7.6K views
Brand Equity Measures by sat2coolguy
Brand Equity MeasuresBrand Equity Measures
Brand Equity Measures
sat2coolguy2.1K views
Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog... by Affinitive
Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...
Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...
Affinitive3.8K views
The Response Rates of Personalized Cross-Media Marketing Campaigns by Aaron Corson
The Response Rates of Personalized Cross-Media Marketing CampaignsThe Response Rates of Personalized Cross-Media Marketing Campaigns
The Response Rates of Personalized Cross-Media Marketing Campaigns
Aaron Corson9.3K views

Similar to Rainbird: Realtime Analytics at Twitter (Strata 2011)

Elastic Data Analytics Platform @Datadog by
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
1.1K views95 slides
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson by
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac DawsonCODE BLUE
1.1K views60 slides
Streams by
StreamsStreams
StreamsMarielle Lange
801 views21 slides
Creating PostgreSQL-as-a-Service at Scale by
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
1.2K views45 slides
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale by
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleSriram Krishnan
4.1K views42 slides
Realtime Analytics on AWS by
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWSSungmin Kim
503 views79 slides

Similar to Rainbird: Realtime Analytics at Twitter (Strata 2011)(20)

Elastic Data Analytics Platform @Datadog by C4Media
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
C4Media1.1K views
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson by CODE BLUE
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
CODE BLUE1.1K views
Creating PostgreSQL-as-a-Service at Scale by Sean Chittenden
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
Sean Chittenden1.2K views
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale by Sriram Krishnan
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Sriram Krishnan4.1K views
Realtime Analytics on AWS by Sungmin Kim
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWS
Sungmin Kim503 views
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform by ScyllaDB
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformReal-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
ScyllaDB2K views
Using Event Streams in Serverless Applications by Jonathan Dee
Using Event Streams in Serverless ApplicationsUsing Event Streams in Serverless Applications
Using Event Streams in Serverless Applications
Jonathan Dee108 views
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ... by Landon Robinson
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Landon Robinson140 views
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo by Ian Massingham
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
Ian Massingham511 views
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi... by Amazon Web Services
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
Amazon Web Services2.3K views
AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data by Amazon Web Services
 AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data
AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data
Amazon Web Services3.6K views
TSAR (TimeSeries AggregatoR) Tech Talk by Anirudh Todi
TSAR (TimeSeries AggregatoR) Tech TalkTSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech Talk
Anirudh Todi1.2K views
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring by Databricks
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Databricks3.5K views
Functional architectural patterns by Lars Albertsson
Functional architectural patternsFunctional architectural patterns
Functional architectural patterns
Lars Albertsson1.7K views
Introduction to Artificial Intelligence and Machine Learning services at AWS ... by Amazon Web Services
Introduction to Artificial Intelligence and Machine Learning services at AWS ...Introduction to Artificial Intelligence and Machine Learning services at AWS ...
Introduction to Artificial Intelligence and Machine Learning services at AWS ...
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP by Daniel Zivkovic
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Daniel Zivkovic189 views
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo by Sid Anand
Cloud Native Data Pipelines (in Eng & Japanese)  - QCon TokyoCloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Sid Anand1K views

Rainbird: Realtime Analytics at Twitter (Strata 2011)

  • 1. Rainbird: Real-time Analytics @Twitter Kevin Weil -- @kevinweil Product Lead for Revenue, Twitter TM
  • 2. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 3. My Background ‣ Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data
  • 4. My Background ‣ Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data Now revenue products!
  • 5. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 6. Why Real-time Analytics ‣ Twitter is real-time
  • 7. Why Real-time Analytics ‣ Twitter is real-time ‣ ... even in space
  • 8. And My Personal Favorite
  • 9. And My Personal Favorite
  • 10. Real-time Reporting ‣ Discussion around ad-based revenue model ‣ Help shape the conversation in real-time with Promoted Tweets
  • 11. Real-time Reporting ‣ Discussion around ad-based revenue model ‣ Help shape the conversation in real-time with Promoted Tweets ‣ Realtime reporting ties it all together
  • 12. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 13. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS
  • 14. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS
  • 15. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS ‣ Horizontally scalable (reads, storage, etc) ‣ Needs to scale to 100+ TB
  • 16. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS ‣ Horizontally scalable (reads, storage, etc) ‣ Needs to scale to 100+ TB ‣ Low latency ‣ Most reads <100 ms (esp. recent data)
  • 17. Cassandra ‣ Pro: In-house expertise ‣ Pro: Open source Apache project ‣ Pro: Writes are extremely fast ‣ Pro: Horizontally scalable, low latency ‣ Pro: Other startup adoption (Digg, SimpleGeo)
  • 18. Cassandra ‣ Pro: In-house expertise ‣ Pro: Open source Apache project ‣ Pro: Writes are extremely fast ‣ Pro: Horizontally scalable, low latency ‣ Pro: Other startup adoption (Digg, SimpleGeo) ‣ Con: It was really young (0.3a)
  • 19. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra
  • 20. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin
  • 21. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x
  • 22. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x ‣ A dude from Sweden began helping: @skr
  • 23. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x ‣ A dude from Sweden began helping: @skr ‣ Now all at Twitter :)
  • 24. Rainbird ‣ It counts things. Really quickly. ‣ Layers on top of the distributed counters patch, CASSANDRA-1072
  • 25. Rainbird ‣ It counts things. Really quickly. ‣ Layers on top of the distributed counters patch, CASSANDRA-1072 ‣ Relies on Zookeeper, Cassandra, Scribe, Thrift ‣ Written in Scala
  • 26. Rainbird Design ‣ Aggregators buffer for 1m ‣ Intelligent flush to Cassandra ‣ Query servers read once written ‣ 1m is configurable
  • 27. Rainbird Data Structures struct Event { 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 28. Rainbird Data Structures struct Event { Unix timestamp of event 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 29. Rainbird Data Structures struct Event { Stat category name 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 30. Rainbird Data Structures struct Event { Stat keys (hierarchical) 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 31. Rainbird Data Structures struct Event { Actual count (diff) 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 32. Rainbird Data Structures struct Event { More later 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 33. Hierarchical Aggregation ‣ Say we’re counting Promoted Tweet impressions ‣ category = pti ‣ keys = [advertiser_id, campaign_id, tweet_id] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [advertiser_id, campaign_id, tweet_id] ‣ [advertiser_id, campaign_id] ‣ [advertiser_id] ‣ Means fast queries over each level of hierarchy ‣ Configurable in rainbird.conf, or dynamically via ZK
  • 34. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] ‣ [com, amazon] ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 35. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] How many people tweeted ‣ [com, amazon] full URL? ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 36. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] How many people tweeted ‣ [com, amazon] any music.amazon.com URL? ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 37. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] How many people tweeted ‣ [com, amazon] any amazon.com URL? ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 38. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] How many people tweeted ‣ [com, amazon] any .com URL? ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 39. Temporal Aggregation ‣ Rainbird also does (configurable) temporal aggregation ‣ Each count is kept minutely, but also denormalized hourly, daily, and all time ‣ Gives us quick counts at varying granularities with no large scans at read time ‣ Trading storage for latency
  • 40. Multiple Formulas ‣ So far we have talked about sums ‣ Could also store counts (1 for each event) ‣ ... which gives us a mean ‣ And sums of squares (count * count for each event) ‣ ... which gives us a standard deviation ‣ And min/max as well ‣ Configure this per-category in rainbird.conf
  • 41. Rainbird ‣ Write 100,000s of events per second, each with hierarchical structure ‣ Query with minutely granularity over any level of the hierarchy, get back a time series ‣ Or query all time values ‣ Or query all time means, standard deviations ‣ Latency < 100ms
  • 42. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 43. Production Uses ‣ It turns out we need to count things all the time ‣ As soon as we had this service, we started finding all sorts of use cases for it ‣ Promoted Products ‣ Tweeted URLs, by domain/subdomain ‣ Per-user Tweet interactions (fav, RT, follow) ‣ Arbitrary terms in Tweets ‣ Clicks on t.co URLs
  • 44. Use Cases ‣ Promoted Tweet Analytics
  • 45. Each different metric is part Production Uses of the key hierarchy ‣ Promoted Tweet Analytics
  • 46. Uses the temporal aggregation to quickly show Production Uses different levels of granularity ‣ Promoted Tweet Analytics
  • 47. Data can be historical, or Production Uses from 60 seconds ago ‣ Promoted Tweet Analytics
  • 48. Production Uses ‣ Internal Monitoring and Alerting ‣ We require operational reporting on all internal services ‣ Needs to be real-time, but also want longer-term aggregates ‣ Hierarchical, too: [stat, datacenter, service, machine]
  • 49. Production Uses ‣ Tweet Button Counts ‣ Tweet Button counts are requested many many times each day from across the web ‣ Uses the all time field
  • 50. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 52. Open Source? ‣ Yes! ... but not yet
  • 53. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra
  • 54. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8)
  • 55. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source
  • 56. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source ‣ It will happen
  • 57. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source ‣ It will happen ‣ See http://github.com/twitter for proof of how much Twitter open source
  • 58. Team ‣ John Corwin (@johnxorz) ‣ Adam Samet (@damnitsamet) ‣ Johan Oskarsson (@skr) ‣ Kelvin Kakugawa (@kelvin) ‣ Chris Goffinet (@lenn0x) ‣ Steve Jiang (@sjiang) ‣ Kevin Weil (@kevinweil)
  • 59. If You Only Remember One Slide... ‣ Rainbird is a distributed, high-volume counting service built on top of Cassandra ‣ Write 100,000s events per second, query it with hierarchy and multiple time granularities, returns results in <100 ms ‣ Used by Twitter for multiple products internally, including our Promoted Products, operational monitoring and Tweet Button ‣ Will be open sourced so the community can use and improve it!
  • 60. Questions? Follow me: @kevinweil TM

Editor's Notes

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n
  19. \n
  20. \n
  21. \n
  22. \n
  23. \n
  24. \n
  25. \n
  26. \n
  27. \n
  28. \n
  29. \n
  30. \n
  31. \n
  32. \n
  33. \n
  34. \n
  35. \n
  36. \n
  37. \n
  38. \n
  39. \n
  40. \n
  41. \n
  42. \n
  43. \n
  44. \n
  45. \n
  46. \n
  47. \n
  48. \n
  49. \n
  50. \n
  51. \n
  52. \n
  53. \n
  54. \n
  55. \n
  56. \n
  57. \n
  58. \n
  59. \n
  60. \n