Cassandra as the central nervous
system of your distributed systems

            /*
            Joe Stein
            http://www.linkedin.com/in/charmalloc
            @allthingshadoop
            @cassandranosql
            @allthingsscala
            @charmalloc
            */



            http://www.medialets.com




Overview
• Architecture
• Aggregate Metrics/Time Series
• Implementation Over Cassandra




Medialets

Architecture




Medialets
•   Largest deployment of rich media ads for mobile devices
•   Over 300,000,000 devices supported
•   3-4 TB of new data every day
•   Thousands of services in production
•   Hundreds of thousands of events received every second
•   Response times are measured in microseconds
•   Languages
     – 35% JVM (20% Scala & 10% Java)
     – 30% Ruby
     – 20% C/C++
     – 13% Python
     – 2% Bash


The million foot view



[Architecture diagram: AdServing, Collection, Kafka, Hadoop, Cassandra, Muse, and three mysql instances]

Medialets

Aggregate Metrics/Time Series




Let's look at just one data point captured

•   09/10/2011 11:12:13
•   App = Yahoo!
•   Platform = iOS
•   OS = 4.3.4
•   Device = iPad2,1
•   Resolution = 768x1024
•   Events
    – videoPlayPercent = 38
    – Taste = great




The time series part of it

• 09/10/2011 11:12:13

       Quarter                   Q3
       Month                     201109
       Week                      201136
       Day                       20110910
       Hour                      2011091011
       Minute                    201109101112
       Second                    20110910111213
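
The bucket values in the table can be derived from the timestamp with ordinary date formatting. The sketch below is one way to do it; the function name `time_buckets` is hypothetical, and the week value assumes ISO week numbering, which happens to match the 201136 shown above.

```python
from datetime import datetime

def time_buckets(ts: datetime) -> dict:
    """Return one bucket key per granularity, formatted as in the table."""
    iso_year, iso_week, _ = ts.isocalendar()
    return {
        "Quarter": f"Q{(ts.month - 1) // 3 + 1}",
        "Month": ts.strftime("%Y%m"),
        "Week": f"{iso_year}{iso_week:02d}",  # assumption: ISO week numbering
        "Day": ts.strftime("%Y%m%d"),
        "Hour": ts.strftime("%Y%m%d%H"),
        "Minute": ts.strftime("%Y%m%d%H%M"),
        "Second": ts.strftime("%Y%m%d%H%M%S"),
    }

# 09/10/2011 11:12:13 is September 10, 2011
buckets = time_buckets(datetime(2011, 9, 10, 11, 12, 13))
```

Because each key is a fixed-width string, a coarser bucket is always a prefix of the finer ones (Day is a prefix of Hour, Hour of Minute, and so on).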




Metrics For Different Wants

Yahoo! + iOS + 4.3.4 + iPad2,1 + 768x1024

Yahoo! + videoPlayPercent = 30 + Taste = great

Yahoo! + Taste = great

Yahoo! + videoPlayPercent = 30

iPad2,1 + videoPlayPercent = 30 + Taste = great

768x1024 + videoPlayPercent = 30 + Taste = great

iOS + 4.3.4 + iPad2,1

Medialets

Implementation Over Cassandra




Storing the time series

CREATE COLUMN FAMILY ByDay
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByHour
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByMinute
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY BySecond
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

Column families hold your rows of data. The row key in each column
family is the time period you are dealing with, so an "event" occurring
at 09/10/2011 12:13:14 is written to 4 rows:

    BySecond = 20110910121314
    ByMinute = 201109101213
    ByHour   = 2011091012
    ByDay    = 20110910




Why multiple column families?
http://www.datastax.com/docs/1.0/configuration/storage_configuration




Generically group by
• app+platform+osversion+device+resolution

• app+event1+event2

• app+event1

• app+event2

• device+event1+event2

• resolution+event1+event2

• platform+osversion+device



As columns – names are composites

• app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024

• app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great

• app+event1#Yahoo!+Taste=great

• app+event2#Yahoo!+videoPlayPercent=30

• device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great

• resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great

• platform+osversion+device#iOS+4.3.4+iPad2,1
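
The naming scheme above (group-by label, then `#`, then the matching values, each half joined with `+`) can be sketched in a few lines. The helper `column_name` and the `event` dictionary are hypothetical; `event1` is mapped to Taste and `event2` to videoPlayPercent here, following the single-event column names on the slide.

```python
SEPARATOR = "#"  # splits the group-by label from the values
JOINER = "+"     # joins the fields inside each half

def column_name(fields, event):
    """Build '<field+field...>#<value+value...>' for one group-by."""
    label = JOINER.join(fields)
    values = JOINER.join(event[f] for f in fields)
    return f"{label}{SEPARATOR}{values}"

# One captured data point, keyed by the generic field names.
event = {
    "app": "Yahoo!", "platform": "iOS", "osversion": "4.3.4",
    "device": "iPad2,1", "resolution": "768x1024",
    "event1": "Taste=great", "event2": "videoPlayPercent=30",
}

group_bys = [
    ("app", "platform", "osversion", "device", "resolution"),
    ("app", "event1"),
    ("app", "event2"),
    ("platform", "osversion", "device"),
]

names = [column_name(g, event) for g in group_bys]
```

Adding a new aggregate is then a matter of adding one tuple to `group_bys`; nothing in the storage schema has to change.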




The rows

• ByHour=2011091011
   – app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024
   – app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great
   – app+event1#Yahoo!+Taste=great
   – app+event2#Yahoo!+videoPlayPercent=30
   – device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great
   – resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great
   – platform+osversion+device#iOS+4.3.4+iPad2,1

• ByDay=20110910
   – app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024
   – app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great
   – app+event1#Yahoo!+Taste=great
   – app+event2#Yahoo!+videoPlayPercent=30
   – device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great
   – resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great
   – platform+osversion+device#iOS+4.3.4+iPad2,1
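
To make the fan-out concrete: every event increments the same set of columns once in each time-bucket row. The sketch below uses nested dictionaries as an in-memory stand-in for the counter column families; it is not Cassandra or Hector code, and `record` and `store` are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime

# Column family -> strftime pattern that produces its row key.
ROW_KEY_FORMATS = {
    "ByDay": "%Y%m%d",
    "ByHour": "%Y%m%d%H",
    "ByMinute": "%Y%m%d%H%M",
    "BySecond": "%Y%m%d%H%M%S",
}

# In-memory stand-in: store[column_family][row_key][column_name] -> count
store = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

def record(ts, column_names):
    """Increment every column once in each time-bucket row for this event."""
    for column_family, fmt in ROW_KEY_FORMATS.items():
        row_key = ts.strftime(fmt)
        for name in column_names:
            store[column_family][row_key][name] += 1

columns = [
    "app+event1#Yahoo!+Taste=great",
    "platform+osversion+device#iOS+4.3.4+iPad2,1",
]
record(datetime(2011, 9, 10, 11, 12, 13), columns)
record(datetime(2011, 9, 10, 11, 12, 50), columns)
```

Two events in the same minute land in one ByDay, ByHour and ByMinute row each, but in two separate BySecond rows.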




Inserting data with Hector
• mutator.insertCounter("20110910", "ByDay",
  HFactory.createCounterColumn(
  "app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024", 1));

• mutator.insertCounter("20110910", "ByDay",
  HFactory.createCounterColumn("app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great", 1));

• mutator.insertCounter("20110910", "ByDay",
  HFactory.createCounterColumn("app+event1#Yahoo!+Taste=great", 1));

• mutator.insertCounter("20110910", "ByDay",
  HFactory.createCounterColumn("app+event2#Yahoo!+videoPlayPercent=30", 1));

• mutator.insertCounter("20110910", "ByDay",
  HFactory.createCounterColumn("device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great", 1));

• mutator.insertCounter("20110910", "ByDay",
  HFactory.createCounterColumn("resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great", 1));

• mutator.insertCounter("20110910", "ByDay",
  HFactory.createCounterColumn("platform+osversion+device#iOS+4.3.4+iPad2,1", 1));


Inserting data with Skeletor
Skeletor is the Scala wrapper of Hector for Cassandra
https://github.com/joestein/skeletor

aggregateColumnNames("AppPlatformOSVersionDeviceResolution") =
  "app+platform+osversion+device+resolution#"

// build the composite column name and hand it to the callback c
def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
  c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) +
    p(device) + p(resolution))
}

// rows we are going to write to
aggregateKeys(KEYSPACE \ "ByMonth") = month    // 201109
aggregateKeys(KEYSPACE \ "ByDay") = day        // 20110910
aggregateKeys(KEYSPACE \ "ByHour") = hour      // 2011091012
aggregateKeys(KEYSPACE \ "ByMinute") = minute  // 201109101213

// for each time-bucket row, increment the counter stored under columnName
def r(columnName: String): Unit = {
  aggregateKeys.foreach { tuple: (ColumnFamily, String) =>
    val (columnFamily, row) = tuple
    if (row != null && row.size > 0)
      rows add (columnFamily -> row has columnName inc) // increment the counter
  }
}

ccAppPlatformOSVersionDeviceResolution(r)
Retrieving Data
MultigetSliceCounterQuery

•   setColumnFamily("ByDay")
•   setKeys("20110910")
•   setRange("app+event1#", "app+event1#~", false, 1000)
•   We will get all the apps and counts for event1

•   setRange("app+event2#", "app+event2#~", false, 1000)
•   We will get all the apps and the counts for event2

By app tastes great vs less filling

• Sample code for the aggregate metrics and retrieving them
  https://github.com/joestein/apophis

• What is with the tilde?
Sort for success
Not magic, just Cassandra
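
The tilde works because column names within a row are kept in sorted order by the comparator (UTF8Type in the column families above), and `~` sorts after the letters, digits and punctuation used in these names. A range from a prefix to that prefix plus `~` therefore covers every column that starts with the prefix. The sketch below imitates this with a sorted Python list; `slice_range` is a hypothetical stand-in for the server-side slice, "Google" is a made-up second app, and it assumes the value after the prefix is ASCII and does not itself start with `~`.

```python
from bisect import bisect_left, bisect_right

# Column names of one row, in the order a byte-wise string comparator yields.
columns = sorted([
    "app+event1#Google+Taste=great",  # hypothetical second app
    "app+event1#Yahoo!+Taste=great",
    "app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great",
    "app+event2#Yahoo!+videoPlayPercent=30",
    "device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great",
])

def slice_range(sorted_names, start, finish):
    """Return all names with start <= name <= finish, like a column slice."""
    lo = bisect_left(sorted_names, start)
    hi = bisect_right(sorted_names, finish)
    return sorted_names[lo:hi]

prefix = "app+event1#"
hits = slice_range(columns, prefix, prefix + "~")
```

Note that `app+event1+event2#...` is not returned: at the position where the prefix has `#`, that name has `+`, which sorts after `#`, so it falls past the end of the range.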




A few more things about retrieving data

• You need to work backwards from here: start with how you will retrieve the data.

• If you want to do things ad hoc, then map/reduce is better

• Sometimes more rows are better, allowing more nodes to do work
  – If you need to look at 100,000 metrics, it is better to pull them out
    of 100 rows than out of 1
  – Don't be afraid to make CFs and composite keys out of Time +
    Aggregate data
      • 20111023+app=Yahoo!
      • This could be the row that holds ALL of the app information
        for that day. If you want to look at 100 apps at once with 1000
        metrics for each per time period, this could be the way to go
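
A rough sketch of the Time + Aggregate row key idea: one row per day per app, so a request for 100 apps becomes a multiget over 100 keys. The helper names are hypothetical, the app names other than Yahoo! are made up, and `toy_node_for` is only a toy illustration of hash-based placement (MD5 modulo an assumed node count), not how a real cluster assigns token ranges.

```python
import hashlib
from collections import Counter

def row_key(day: str, app: str) -> str:
    """Composite row key: time bucket plus the aggregate it holds."""
    return f"{day}+app={app}"

def toy_node_for(key: str, node_count: int) -> int:
    """Toy placement: hash the row key and take it modulo the node count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count

# "Yahoo!" is from the slide; the other 99 app names are placeholders.
apps = ["Yahoo!"] + [f"app{i:03d}" for i in range(99)]
keys = [row_key("20111023", app) for app in apps]

# How the 100 rows would spread over an assumed 4-node cluster.
rows_per_node = Counter(toy_node_for(k, 4) for k in keys)
```

With a single row per day, every read for that day goes to the replicas of one key; with one row per app per day, the same read is spread over many keys and so, typically, over more nodes.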




Q&A
/*
* Joe Stein
*http://www.linkedin.com/in/charmalloc
*@allthingshadoop
*@cassandranosql
*@allthingsscala
*@charmalloc
*http://github.com/joestein
*/


Medialets
The rich media
ad platform for mobile.
                       connect@medialets.com
                       www.medialets.com/showcase




