jstein.cassandra.nyc.2011

Cassandra as the central nervous
system of your distributed systems

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
@allthingshadoop
@cassandranosql
@allthingsscala
@charmalloc
*/

http://www.medialets.com

1

Overview
• Architecture
• Aggregate Metrics/Time Series
• Implementation Over Cassandra

2

Medialets

Architecture

3

Medialets
• Largest deployment of rich media ads for mobile devices
• Over 300,000,000 devices supported
• 3-4 TB of new data every day
• Thousands of services in production
• Hundreds of thousands of events received every second
• Response times are measured in microseconds
• Languages
– 35% JVM (20% Scala & 10% Java)
– 30% Ruby
– 20% C/C++
– 13% Python
– 2% Bash

4

The million foot view

[Architecture diagram. Components shown: AdServing, Collection, Kafka, MySQL, Hadoop, Cassandra, Muse]

Medialets

Aggregate Metrics/Time Series

6

Let's look at just one data point captured

• 09/10/2011 11:12:13
• App = Yahoo!
• Platform = iOS
• OS = 4.3.4
• Device = iPad2,1
• Resolution = 768x1024
• Events
– videoPlayPercent = 38
– Taste = great

7

The time series part of it

• 09/10/2011 11:12:13

Quarter Q3
Month 201109
Week 201136
Day 20110910
Hour 2011091011
Minute 201109101112
Second 20110910111213
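
The bucket keys above can be derived mechanically from the timestamp. The following is a minimal sketch using only java.time, not code from the talk. The class name TimeBuckets and the method bucketsFor are hypothetical, and the week key assumes ISO-8601 week numbering, which the slide does not state but which yields 201136 for this date.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.IsoFields;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: derives one key per time granularity from a timestamp.
class TimeBuckets {
    // Formats the timestamp with a fixed-width numeric pattern such as "yyyyMMddHH".
    private static String fmt(LocalDateTime t, String pattern) {
        return t.format(DateTimeFormatter.ofPattern(pattern, Locale.ROOT));
    }

    // Returns granularity name -> bucket key, coarsest first.
    static Map<String, String> bucketsFor(LocalDateTime t) {
        Map<String, String> b = new LinkedHashMap<>();
        b.put("Quarter", "Q" + t.get(IsoFields.QUARTER_OF_YEAR));
        b.put("Month", fmt(t, "yyyyMM"));
        // Assumption: ISO-8601 week-based year and week number.
        b.put("Week", String.format(Locale.ROOT, "%04d%02d",
                t.get(IsoFields.WEEK_BASED_YEAR),
                t.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR)));
        b.put("Day", fmt(t, "yyyyMMdd"));
        b.put("Hour", fmt(t, "yyyyMMddHH"));
        b.put("Minute", fmt(t, "yyyyMMddHHmm"));
        b.put("Second", fmt(t, "yyyyMMddHHmmss"));
        return b;
    }

    public static void main(String[] args) {
        // The data point from the slide: 09/10/2011 11:12:13 (September 10).
        bucketsFor(LocalDateTime.of(2011, 9, 10, 11, 12, 13))
                .forEach((name, key) -> System.out.println(name + " " + key));
    }
}
```

Because every key is fixed-width and zero-padded, keys of the same granularity sort chronologically as plain strings.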

8

Metrics For Different Wants

Yahoo! + iOS + 4.3.4 + iPad2,1 + 768x1024

Yahoo! + videoPlayPercent = 30 + Taste = great

Yahoo! + Taste = great

Yahoo! + videoPlayPercent = 30

iPad2,1 + videoPlayPercent = 30 + Taste = great

768x1024 + videoPlayPercent = 30 + Taste = great

iOS + 4.3.4 + iPad2,1

9

Medialets

Implementation Over Cassandra

10

Storing the time series

CREATE COLUMN FAMILY ByDay
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByHour
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByMinute
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY BySecond
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

Column Families hold your rows of data. Each row in each column family will be
equal to the time period you are dealing with. So an "event" occurring at
09/10/2011 12:13:14 will become 4 rows:

BySecond = 20110910121314
ByMinute = 201109101213
ByHour = 2011091012
ByDay = 20110910

11

Why multiple column families?
http://www.datastax.com/docs/1.0/configuration/storage_configuration

12

Generically group by
• app+platform+osversion+device+resolution

• app+event1+event2

• app+event1

• app+event2

• device+event1+event2

• resolution+event1+event2

• platform+osversion+device

13

As columns – names are composites

• app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024

• app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great

• app+event1#Yahoo!+Taste=great

• app+event2#Yahoo!+videoPlayPercent=30

• device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great

• resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great

• platform+osversion+device#iOS+4.3.4+iPad2,1
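
The composite names above follow one rule: the grouping's dimension names joined with "+", then "#", then the matching values joined with "+". The sketch below illustrates that rule under stated assumptions and is not the talk's implementation. CompositeColumns, GROUPINGS, columnNames and sampleEvent are hypothetical names. It also assumes event1 is videoPlayPercent=30 and event2 is Taste=great, following the two-event composite; the single-event names on the slide pair them the other way round.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that builds "dimensions#values" column names.
class CompositeColumns {
    // The seven groupings from the slides, as ordered lists of dimension names.
    static final List<List<String>> GROUPINGS = Arrays.asList(
            Arrays.asList("app", "platform", "osversion", "device", "resolution"),
            Arrays.asList("app", "event1", "event2"),
            Arrays.asList("app", "event1"),
            Arrays.asList("app", "event2"),
            Arrays.asList("device", "event1", "event2"),
            Arrays.asList("resolution", "event1", "event2"),
            Arrays.asList("platform", "osversion", "device"));

    // Builds one column name, or returns null if the event lacks a dimension.
    static String columnName(List<String> grouping, Map<String, String> event) {
        List<String> values = new ArrayList<>();
        for (String dimension : grouping) {
            String value = event.get(dimension);
            if (value == null) {
                return null;
            }
            values.add(value);
        }
        return String.join("+", grouping) + "#" + String.join("+", values);
    }

    // Builds the column names for every grouping the event can satisfy.
    static List<String> columnNames(Map<String, String> event) {
        List<String> names = new ArrayList<>();
        for (List<String> grouping : GROUPINGS) {
            String name = columnName(grouping, event);
            if (name != null) {
                names.add(name);
            }
        }
        return names;
    }

    // The example data point, with event1/event2 assigned as assumed above.
    static Map<String, String> sampleEvent() {
        Map<String, String> e = new LinkedHashMap<>();
        e.put("app", "Yahoo!");
        e.put("platform", "iOS");
        e.put("osversion", "4.3.4");
        e.put("device", "iPad2,1");
        e.put("resolution", "768x1024");
        e.put("event1", "videoPlayPercent=30");
        e.put("event2", "Taste=great");
        return e;
    }

    public static void main(String[] args) {
        columnNames(sampleEvent()).forEach(System.out::println);
    }
}
```

Keeping the groupings as data means a new "want" is one more list entry, with no change to the naming code.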

14

The rows

• ByHour=2011091011
– app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024
– app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great
– app+event1#Yahoo!+Taste=great
– app+event2#Yahoo!+videoPlayPercent=30
– device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great
– resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great
– platform+osversion+device#iOS+4.3.4+iPad2,1

• ByDay=20110910
– app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024
– app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great
– app+event1#Yahoo!+Taste=great
– app+event2#Yahoo!+videoPlayPercent=30
– device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great
– resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great
– platform+osversion+device#iOS+4.3.4+iPad2,1
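
To make the row layout concrete, here is a small in-memory model of counter column families. It is only a sketch of the data shape (column family, then row key, then column name, then count) and does not talk to Cassandra. CounterStore and its methods are hypothetical names, and the demo uses two of the seven columns to stay short.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for counter column families.
class CounterStore {
    // column family -> row key -> column name -> count (sorted, like a comparator would)
    private final Map<String, Map<String, Map<String, Long>>> data = new TreeMap<>();

    // Adds delta to one counter, creating the row and column on first use.
    void increment(String columnFamily, String rowKey, String column, long delta) {
        data.computeIfAbsent(columnFamily, k -> new TreeMap<>())
            .computeIfAbsent(rowKey, k -> new TreeMap<>())
            .merge(column, delta, Long::sum);
    }

    // Reads one counter; a counter that was never incremented reads as 0.
    long get(String columnFamily, String rowKey, String column) {
        Map<String, Map<String, Long>> rows = data.get(columnFamily);
        if (rows == null) {
            return 0L;
        }
        Map<String, Long> columns = rows.get(rowKey);
        if (columns == null) {
            return 0L;
        }
        return columns.getOrDefault(column, 0L);
    }

    // One event increments every aggregate column in every time-period row.
    void recordEvent(Map<String, String> rowKeys, List<String> columns) {
        for (Map.Entry<String, String> e : rowKeys.entrySet()) {
            for (String column : columns) {
                increment(e.getKey(), e.getValue(), column, 1L);
            }
        }
    }

    // Records the same example event twice in the ByHour and ByDay rows.
    static CounterStore demo() {
        Map<String, String> rowKeys = new LinkedHashMap<>();
        rowKeys.put("ByHour", "2011091011");
        rowKeys.put("ByDay", "20110910");
        List<String> columns = Arrays.asList(
                "app+event1#Yahoo!+Taste=great",
                "platform+osversion+device#iOS+4.3.4+iPad2,1");
        CounterStore store = new CounterStore();
        store.recordEvent(rowKeys, columns);
        store.recordEvent(rowKeys, columns);
        return store;
    }
}
```

The write cost per event is (number of time-period rows) x (number of aggregate columns), which is the trade the design makes to keep reads cheap.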

15

Inserting data with Hector
• mutator.insertCounter("20110910", "ByDay",
    HFactory.createCounterColumn("app+platform+osversion+device+resolution#Yahoo!+iOS+4.3.4+iPad2,1+768x1024", 1))

• mutator.insertCounter("20110910", "ByDay",
    HFactory.createCounterColumn("app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great", 1))

• mutator.insertCounter("20110910", "ByDay",
    HFactory.createCounterColumn("app+event1#Yahoo!+Taste=great", 1))

• mutator.insertCounter("20110910", "ByDay",
    HFactory.createCounterColumn("app+event2#Yahoo!+videoPlayPercent=30", 1))

• mutator.insertCounter("20110910", "ByDay",
    HFactory.createCounterColumn("device+event1+event2#iPad2,1+videoPlayPercent=30+Taste=great", 1))

• mutator.insertCounter("20110910", "ByDay",
    HFactory.createCounterColumn("resolution+event1+event2#768x1024+videoPlayPercent=30+Taste=great", 1))

• mutator.insertCounter("20110910", "ByDay",
    HFactory.createCounterColumn("platform+osversion+device#iOS+4.3.4+iPad2,1", 1))

16

Inserting data with Skeletor
Skeletor is the Scala wrapper of Hector for Cassandra
https://github.com/joestein/skeletor

aggregateColumnNames("AppPlatformOSVersionDeviceResolution") =
  "app+platform+osversion+device+resolution#"

def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
  c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) +
    p(device) + p(resolution))
}

// rows we are going to write to
aggregateKeys(KEYSPACE \ "ByMonth") = month   // 201109
aggregateKeys(KEYSPACE \ "ByDay") = day       // 20110910
aggregateKeys(KEYSPACE \ "ByHour") = hour     // 2011091012
aggregateKeys(KEYSPACE \ "ByMinute") = minute // 201109101213

def r(columnName: String): Unit = {
  aggregateKeys.foreach { tuple: (ColumnFamily, String) =>
    val (columnFamily, row) = tuple
    if (row != null && row.size > 0)
      rows add (columnFamily -> row has columnName inc) // increment the counter
  }
}

ccAppPlatformOSVersionDeviceResolution(r)
17

Retrieving Data
MultigetSliceCounterQuery

• setColumnFamily("ByDay")
• setKeys("20110910")
• setRange("app+event1#", "app+event1#~", false, 1000)
• We will get all the apps and counts for event1

• setRange("app+event2#", "app+event2#~", false, 1000)
• We will get all the apps and the counts for event2

By app tastes great vs less filling

• Sample code for the aggregate metrics and retrieving them
https://github.com/joestein/apophis

• What is with the tilde?
18

Sort for success
Not magic, just Cassandra
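
The tilde trick relies only on sort order: "~" (0x7E) is the last printable ASCII character, so a range from a prefix to the prefix plus "~" covers every column whose name continues that prefix with printable ASCII. The sketch below imitates a sorted row with a java.util.TreeMap and is not a Cassandra query. TildeRange, slice and sampleRow are hypothetical names, "ExampleApp" and all counts are made-up illustration data, and it assumes the "#" separator from the composite column names and ASCII-only names, where Java string order matches byte order.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical demo of a prefix range slice over sorted column names.
class TildeRange {
    // Returns the columns from prefix through prefix + "~", both inclusive.
    static NavigableMap<String, Long> slice(NavigableMap<String, Long> row, String prefix) {
        return row.subMap(prefix, true, prefix + "~", true);
    }

    // One row (for example ByDay = 20110910) with illustrative counts.
    static NavigableMap<String, Long> sampleRow() {
        NavigableMap<String, Long> row = new TreeMap<>();
        row.put("app+event1#Yahoo!+Taste=great", 3L);
        row.put("app+event1#ExampleApp+Taste=lessFilling", 5L); // hypothetical app
        row.put("app+event1+event2#Yahoo!+videoPlayPercent=30+Taste=great", 2L);
        row.put("app+event2#Yahoo!+videoPlayPercent=30", 4L);
        row.put("platform+osversion+device#iOS+4.3.4+iPad2,1", 7L);
        return row;
    }

    public static void main(String[] args) {
        // Prints only the two "app+event1#" columns, in sorted order.
        slice(sampleRow(), "app+event1#")
                .forEach((name, count) -> System.out.println(name + " = " + count));
    }
}
```

The "app+event1+event2#" column is excluded because "+" (0x2B) sorts after "#" (0x23), so it falls beyond the end of the range rather than inside it.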

19

A few more things about retrieving data

• You need to start backwards from here: design the rows and columns around how you will retrieve the data.

• If you want to do things ad hoc, then map/reduce is better

• Sometimes more rows are better, allowing more nodes to do work
– If you need to look at 100,000 metrics it is better to pull this out of 100 rows than out of 1
– Don't be afraid to make CF and composite keys out of Time + Aggregate data
• 20111023+app=Yahoo!
• This could be the row that holds ALL of the app information for that day. If you want to look at 100 apps at once with 1000 metrics for each per time period, this could be the way to go

20

Q&A
/*
 * Joe Stein
 * http://www.linkedin.com/in/charmalloc
 * @allthingshadoop
 * @cassandranosql
 * @allthingsscala
 * @charmalloc
 * http://github.com/joestein
 */

Medialets
The rich media
ad platform for mobile.
connect@medialets.com
www.medialets.com/showcase

21
