Cassandra Day Denver 2014: Transitioning to Cassandra for an Already Giant Product
Who We Are
• Holistic video advertising platform for publishers
• Most transparent global marketplace for sellers
• Founded in 2007, 180+ employees globally
• First to market with video RTB in 2010
• Integrated with over half of comScore top 100 pubs
• 2+ billion ad decisions per day
• Reaching 335+ million uniques every month
• Serving impressions in 100+ countries
• Integrated with 100,000+ publishers
• Connected to 35+ DSPs
• Partnerships with industry-leading trading desks
• 10,000+ brand-name advertisers
How Big is Our Data?
● Over 2 billion ad auctions per day
● Each auction generates an average of 20-30 “records”
  ● Audience data
  ● Bid data
  ● Event tracking
● A “record everything” approach would result in approximately 50 billion records per day
  ● Normalized: ~1.5 TB / day uncompressed
  ● Denormalized: ~5 TB / day uncompressed
  ● Possibly up to 150 TB of data per month
● We are not currently using a “record everything” approach, but we want to get there
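The volume estimates above can be checked with a few lines of arithmetic. This is a minimal sketch, not the deck's own calculation: the function names are hypothetical, a 30-day month is assumed, and 1 TB is taken as 10^12 bytes.

```python
def records_per_day(auctions_per_day, records_per_auction):
    """Records written per day under a record-everything approach."""
    return auctions_per_day * records_per_auction


def terabytes_per_month(terabytes_per_day, days_per_month=30):
    """Monthly volume, assuming a 30-day month (an assumption)."""
    return terabytes_per_day * days_per_month


def bytes_per_record(terabytes_per_day, records):
    """Average record size implied by a daily volume (1 TB = 10**12 bytes assumed)."""
    return terabytes_per_day * 10**12 / records


AUCTIONS_PER_DAY = 2_000_000_000

# 20-30 records per auction brackets the ~50 billion records/day figure.
print(records_per_day(AUCTIONS_PER_DAY, 20))  # 40000000000
print(records_per_day(AUCTIONS_PER_DAY, 30))  # 60000000000

# 5 TB/day denormalized gives the 150 TB/month upper bound.
print(terabytes_per_month(5))  # 150

# Implied average record sizes at 50 billion records/day.
print(bytes_per_record(1.5, 50_000_000_000))  # 30.0
print(bytes_per_record(5, 50_000_000_000))    # 100.0
```

The implied sizes (roughly 30 bytes normalized, 100 bytes denormalized) are derived from the slide's figures, not stated on it.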
How Fast Does Our Data Grow?
[Chart: Auctions with growth curve, 10/9/13 to 9/9/14; y-axis 0 to 2.5E+09]
● Typically our numbers double every 6 months
● We expect more rapid growth over the next year or two
How Fast Does Our Data Grow?
[Chart: Auctions with growth curve; y-axis 0 to 1.2E+10]
● Typically our numbers double every 6 months
● We expect more rapid growth over the next year or two
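A doubling period translates into a simple exponential projection. The sketch below is illustrative only: the function names are hypothetical, and the 2 billion starting point is the daily auction count quoted earlier in the deck.

```python
import math


def projected_auctions(current_per_day, months_ahead, doubling_months=6.0):
    """Project daily auctions assuming steady exponential growth."""
    return current_per_day * 2 ** (months_ahead / doubling_months)


def required_doubling_months(current_per_day, target_per_day, months_ahead):
    """Doubling period needed to reach a target within a given horizon."""
    return months_ahead / math.log2(target_per_day / current_per_day)


# Doubling every 6 months turns 2 billion auctions/day into 8 billion in a year.
print(projected_auctions(2_000_000_000, 12))  # 8000000000.0

# Reaching 10 billion/day in a year needs a doubling period of about 5.2 months.
print(round(required_doubling_months(2_000_000_000, 10_000_000_000, 12), 2))  # 5.17
```

A year-out figure above 8 billion therefore implies growth faster than the historical 6-month doubling, which matches the expectation of more rapid growth.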
How Big Might Our Data Get in a Year?
● Over 10 billion ad auctions per day
● Each auction generates an average of 30-40 “records”
  ● Audience data
  ● Bid data
  ● Event tracking
● A “record everything” approach would result in approximately 350 billion records per day
  ● Normalized: ~10.5 TB / day uncompressed
  ● Denormalized: ~35 TB / day uncompressed
  ● Possibly up to 1 PB of data per month
Excited to see how we’re using Cassandra for all this?
Too bad, we aren’t (yet)!
Where Do We Start?
Audience Data
● Information about the people that are viewing ads
  ● Segment data (demographics, browsing history, etc.)
  ● Ads viewed
  ● ID syncing
● Used for advertisers to reach their target audience
  ● “My product is relevant only to bald, left-handed, highly educated immigrants from Uzbekistan.”
● Historically stored in cookies
● Technology advancement necessitates abandoning the cookie strategy
  ● Track users on multiple devices
  ● Mobile devices and connected TVs don’t typically support cookies
● Offline availability of data provides analytics opportunities
  ● Discover trends
  ● Look-alike segments
Cookie-based Workflow
[Sequence diagram: Browser, SpotXchange, Data Partner]
● Browser requests an ad via HTTP
● Server responds with an ad
  ● The ad payload includes data partner URLs
● Browser requests partner URL
  ● Request payload includes partner’s cookies
● Data provider replies with a redirect containing segment information
● Browser redirects to us
● We respond with our own cookies containing their segment data
● Browser requests an ad via HTTP
  ● Now including our cookies
● Server responds with an ad targeted at audience segments
Moving Away from Cookies
● Cookies are overly constraining and getting worse
  ● Limited to desktop traffic
  ● Payload is expensive
    ● Bandwidth
    ● Processing (encryption and encoding)
  ● Impossible to run deep analytics
  ● Impossible to perform server-to-server synchronization
● Newer identification standards are emerging
  ● Apple IDFA, Android ID, UIDH
  ● Facebook/Google ID
  ● Device fingerprinting
● Moving audience data onto the server allows data to be associated with any identifier, and even allows multiple identifiers to be tied together
Server-side Storage Workflow
[Sequence diagram: Browser, SpotXchange, Data Partner]
● Browser requests an ad via HTTP
● Server responds with an ad
  ● The ad payload includes data partner URLs
● Browser requests partner URL with SpotX audience ID attached
● Data provider replies with a redirect containing segment information and partner audience ID
● Browser redirects to us
● We store segment information on the server
● Browser requests an ad via HTTP
● Server responds with an ad targeted at audience segments
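The server-side steps above can be sketched with an in-memory store standing in for the database. Everything here is a hypothetical illustration: `AudienceRecord`, `handle_partner_redirect`, and `segments_for_ad_request` are made-up names, and a plain dict replaces whatever storage the real system uses.

```python
from dataclasses import dataclass, field


@dataclass
class AudienceRecord:
    """Server-side record for one audience ID (hypothetical shape)."""
    segments: dict = field(default_factory=dict)     # segment ID -> value
    foreign_ids: dict = field(default_factory=dict)  # partner ID -> partner's audience ID


def handle_partner_redirect(store, audience_id, partner_id,
                            partner_audience_id, segments):
    """Store what the partner sent back in the redirect, keyed by our audience ID."""
    record = store.setdefault(audience_id, AudienceRecord())
    record.foreign_ids[partner_id] = partner_audience_id
    record.segments.update(segments)
    return record


def segments_for_ad_request(store, audience_id):
    """Look up segments for targeting; unknown users simply have none."""
    record = store.get(audience_id)
    return dict(record.segments) if record else {}


store = {}
handle_partner_redirect(store, "aud-1", "7180", "967992447104804725", {"123": 1})
print(segments_for_ad_request(store, "aud-1"))  # {'123': 1}
```

Because the segment data never travels in a cookie, the next ad request only needs to carry an identifier, and the lookup happens on the server.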
Additional Capabilities
[Sequence diagram: Data Partner, Browser, SpotXchange]
● User visits a site that provides the partner new data about that user
● Provider recognizes that they have synced this user with us in the past
● Partner calls us server-to-server with the user information, including our ID and new data
● We store the new information
● Browser requests an ad via HTTP
● Server responds with an ad targeted at the new audience segments
Storing Audience Data In Cassandra
● Data Modeling
● Cluster Sizing
● Replication Strategy
Data Modeling
● Solution must minimize latency
● Attempt to constrain to one read or one write per event whenever possible
    {
      "audience_id" : "12345678-1234-1234-1234-123456789012",
      "segments" : {"123": 1, "456": 3, "789": 1},
      "foreign_ids" : {
        "7180" : "967992447104804725",
        "7347" : "bWv2-HOyJD8y6D",
        "6960" : "404_53e3bfa26d377"
      },
      "pacing" : {
        "2235" : 1412892591
      }
    }
Data Modeling
● Ad auctioning requires reading nearly all the data at once
● Most events write to one and only one data type (segments, ids, etc.)
    {
      "audience_id" : "12345678-1234-1234-1234-123456789012",
      "segments" : {"123": 1, "456": 3, "789": 1},
      "foreign_ids" : {
        "7180" : "967992447104804725",
        "7347" : "bWv2-HOyJD8y6D",
        "6960" : "404_53e3bfa26d377"
      },
      "pacing" : {
        "2235" : 1412892591
      }
    }
Data Modeling
● Store an entire user record in one row so it can be read all at once
● All data can be represented as a tuple with a unique identifier
    CREATE TABLE audience_data (
        audience_id uuid,
        type int,
        key text,
        value text,
        PRIMARY KEY (audience_id, type, key)
    );

    SELECT * FROM audience_data WHERE
        audience_id = 12345678-1234-1234-1234-123456789012;

    SELECT * FROM audience_data WHERE
        audience_id = 12345678-1234-1234-1234-123456789012 AND
        type = 1;

    INSERT INTO audience_data (audience_id, type, key, value) VALUES
        (12345678-1234-1234-1234-123456789012, 1, '123', '1');
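To illustrate how a nested user record maps onto `(audience_id, type, key, value)` tuples, here is a minimal sketch in plain Python with no database involved. The type code 1 for segments follows the INSERT example above; the codes 2 and 3 for `foreign_ids` and `pacing` are assumptions made for illustration, as are the function names.

```python
# Type codes: 1 = segments matches the INSERT example; 2 and 3 are assumed.
FIELD_TYPES = {"segments": 1, "foreign_ids": 2, "pacing": 3}
TYPE_FIELDS = {code: name for name, code in FIELD_TYPES.items()}


def to_rows(record):
    """Flatten a nested record into (audience_id, type, key, value) tuples."""
    rows = []
    for field_name, type_code in FIELD_TYPES.items():
        for key, value in record.get(field_name, {}).items():
            # The value column is text, so every value is stored as a string.
            rows.append((record["audience_id"], type_code, key, str(value)))
    return rows


def from_rows(rows):
    """Rebuild the nested record from the rows of a single partition."""
    if not rows:
        return None
    record = {"audience_id": rows[0][0]}
    for name in FIELD_TYPES:
        record[name] = {}
    for _audience_id, type_code, key, value in rows:
        record[TYPE_FIELDS[type_code]][key] = value
    return record


user = {
    "audience_id": "12345678-1234-1234-1234-123456789012",
    "segments": {"123": 1, "456": 3, "789": 1},
    "foreign_ids": {"7180": "967992447104804725"},
    "pacing": {"2235": 1412892591},
}
rows = to_rows(user)
print(rows[0])  # ('12345678-1234-1234-1234-123456789012', 1, '123', '1')
print(from_rows(rows)["segments"])  # {'123': '1', '456': '3', '789': '1'}
```

With this layout, a full read is one partition lookup on `audience_id`, while an event that touches only one data type writes rows under a single `type` value.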
Cluster Sizing
● Distributed a modified version of our implementation to production
  ● Replaced Cassandra calls with writes to a log file
● Created a spreadsheet detailing each operation and how much load to expect during peak times
● Used peak load to size the cluster for each data center
● Used formula provided by Aaron Morton at The Last Pickle
    (system_constant * #cores * #nodes) / replication_factor = ops / sec

    ops = 1 read or write to one row (cluster in a partition)

    system_constant = 3000 for AWS
                      4000 for spinning disk
                      7-12K for SSD
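Solving the formula for the node count gives nodes = ops/sec * replication_factor / (system_constant * cores). The sketch below is one way to apply it. The function name, the rounding up, and the optional minimum cluster size are assumptions. The replication factor of 2 is also an assumption: the deck does not state it, but it reproduces the 17 and 10 nodes listed in the deck's sizing table for a peak of 263,889 ops/sec on 8-core machines.

```python
import math


def nodes_required(peak_ops_per_sec, system_constant, cores_per_node,
                   replication_factor, minimum_nodes=1):
    """Solve the sizing formula for the node count, rounding up."""
    raw = peak_ops_per_sec * replication_factor / (system_constant * cores_per_node)
    return max(minimum_nodes, math.ceil(raw))


PEAK_OPS = 263_889      # peak ops/sec from the deck's sizing table
REPLICATION_FACTOR = 2  # assumed, not stated in the deck

print(nodes_required(PEAK_OPS, 4000, 8, REPLICATION_FACTOR))  # 17 (spinning disk)
print(nodes_required(PEAK_OPS, 7000, 8, REPLICATION_FACTOR))  # 10 (SSD, low end of 7-12K)
```

The `minimum_nodes` parameter is a hypothetical way to enforce a floor on cluster size when the formula yields a very small number for a low-traffic data center.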
Our Backwards Scenario
● Typically clusters start small and grow as product adoption grows
● Our cluster will be working hardest when we first turn it on
  ● Existing cookie data needs to migrate to Cassandra
  ● As data migrates the load will decrease, normalize, and then increase slowly over the next few months
  ● Don’t expect to match original load for nearly a year
[Chart: Peak OPS; y-axis 0 to 140,000]
Cluster Sizing
                                       den01     iad02     lon01     hkg01
% of total traffic                       40%       40%       13%        7%
Normal tag rate                          0.1       0.1       0.1       0.1
Migration tag rate                      0.75      0.75      0.75      0.75
SELECT
  DC Avg                              46,296    46,296    15,046     8,102
  DC Peak                            138,889   138,889    45,139    24,306
  FE Avg                                 126       263       684       675
  FE Peak                                377       789     2,052     2,025
UPDATE tag (typical load)
  DC Avg                               4,630     4,630     1,505       810
  DC Peak                             13,889    13,889     4,514     2,431
  FE Avg                                  13        26        68        68
  FE Peak                                 38        79       205       203
UPDATE tag (migration)
  DC Avg                              30,093    30,093     9,780     5,266
  DC Peak                             90,278    90,278    29,340    15,799
  FE Avg                                  82       171       445       439
  FE Peak                                245       513     1,334     1,317
Total DC ops (normal load)
  Avg                                 51,389    51,389    16,701     8,993
  Peak                               154,167   154,167    50,104    26,979
Total DC ops (migration)
  Avg                                 87,963    87,963    28,588    15,394
  Peak                               263,889   263,889    85,764    46,181
Nodes required (8 core)
  Spinning disk (constant 4000)           17        17         6         3
  SSD (constant 7000)                     10        10         4         3
Cluster Sizing
                           den01     iad02     lon01     hkg01
% total traffic           40.00%    40.00%    13.00%     7.00%
Tag
  Daily GB                   0.9       0.9       0.3       0.2
  Total GB                    84        84        27        15
Frqcap
  Daily GB                   0.6       0.6       0.2       0.1
  Total GB                   3.9       3.9       1.3       0.7
Partner
  Daily GB                     8         8         3         1
  Total GB                  1509      1509       490       264
Total GB                    3193      3193      1038       559
Per Node GB                  456       456       346       186
Replication Strategy
● Typical Cassandra replication is expensive
  ● Each write is replicated to all data centers
  ● Each cluster must be approximately the same size
  ● Need a large pipe between data centers
    ● 3.7 million columns updated per second at peak load
  ● Amount of replication needed increases with each new data center
Replication Strategy
● Alternate strategies suggested:
  ● Offline copying of SSTables
  ● Maintain a log of changed records and run a process to copy those periodically
● We realized that this data doesn’t need to be available in all places at all times
  ● People don’t often move far enough to switch data centers
  ● Data integrity is of fairly low importance
  ● If our data isn’t replicated the user will appear to be new when they switch data centers, but that only has a minor short-term impact on application performance
● Other replication strategies we considered:
  ● None
  ● Just-in-time
  ● Queued
Replication Strategy: None
● Don’t replicate at all
● Each data center has its own completely self-contained cluster
● Advantage: Simplicity
● Disadvantage: Limits our ability to target users when they move or we reassign regions to a different data center
Replication Strategy: Just-In-Time
● Each data center has its own completely self-contained cluster
● The user’s identifier cookie contains a data center identifier
● When an incoming request’s cookie says it’s from a different data center, read from that data center in real time and replicate on the fly to the local data center
● Reassign the cookie using the new data center
● Advantage
  ● Audience data is (almost) always available (99.99%)
● Disadvantages
  ● Additional latency while waiting for user data
  ● In cookie-less situations we’d need to query all data centers if the local data center has no data
Replication Strategy: Just-In-Time
    // Identify the user and the data center that last served them
    audience_dc_id = null
    if(cookie != null) {
        audience_id = cookie[id]
        audience_dc_id = cookie[dc_id]
    }
    else {
        audience_id = some other identifier
    }

    if(audience_dc_id == local_dc) {
        audience_data = local_dc->cassandra->fetch(audience_id)
    }
    else {
        // The cookie's data center if known, otherwise every data center
        other_dcs = audience_dc_id != null ?
            {audience_dc_id} : {dc1, dc2, dc3}
        for dc in other_dcs {
            audience_data = dc->cassandra->fetch(audience_id)
            if(audience_data != null) {
                // Replicate on the fly to the local data center
                local_dc->cassandra->write(
                    audience_id, audience_data
                )
                break;
            }
        }
    }
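A runnable version of the just-in-time lookup might look like the sketch below. It is an illustration, not the production code: `DataCenter` is a hypothetical in-memory stand-in for a per-data-center Cassandra cluster, and `fetch_just_in_time` is a made-up name. One deliberate difference from the slide's pseudocode: in the cookie-less case this sketch checks the local data center first and only then queries the others, which is how the preceding slide describes the behavior.

```python
class DataCenter:
    """In-memory stand-in for one data center's cluster (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def fetch(self, audience_id):
        return self.rows.get(audience_id)

    def write(self, audience_id, data):
        self.rows[audience_id] = data


def fetch_just_in_time(local_dc, all_dcs, audience_id, audience_dc_id=None):
    """Fetch audience data, copying it to the local data center on a remote hit."""
    if audience_dc_id == local_dc.name:
        return local_dc.fetch(audience_id)
    if audience_dc_id is None:
        # Cookie-less: try locally, then fall back to every other data center.
        data = local_dc.fetch(audience_id)
        if data is not None:
            return data
        candidates = [dc for name, dc in all_dcs.items() if name != local_dc.name]
    else:
        candidates = [all_dcs[audience_dc_id]] if audience_dc_id in all_dcs else []
    for dc in candidates:
        data = dc.fetch(audience_id)
        if data is not None:
            local_dc.write(audience_id, data)  # replicate on the fly
            return data
    return None


dcs = {name: DataCenter(name) for name in ("den01", "iad02", "lon01")}
dcs["lon01"].write("user-1", {"segments": {"123": 1}})
print(fetch_just_in_time(dcs["den01"], dcs, "user-1", "lon01"))  # {'segments': {'123': 1}}
print(dcs["den01"].fetch("user-1"))  # {'segments': {'123': 1}}
```

The remote read sits on the request path, which is where the additional latency listed as a disadvantage comes from.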
Replication Strategy: Queued
● Each data center has its own completely self-contained cluster
● When a fetch attempt misses, the user ID is added to a queue for reconciliation
● Treat the user as a new user and store their data locally
● Background process consumes IDs from the queue and attempts to fetch data from other data centers for reconciliation
● Advantages
  ● Audience data is mostly available (98%)
  ● Minimal additional latency introduced
● Disadvantages
  ● Additional operational complexity
  ● Occasional data misses
Replication Strategy: Queued
    // Request path: read locally, queue a migration on a miss
    audience_dc_id = null
    if(cookie != null) {
        audience_id = cookie[id]
        audience_dc_id = cookie[dc_id]
    }
    else {
        audience_id = some other identifier
    }

    audience_data = local_dc->cassandra->fetch(audience_id)
    if(audience_data == null) {
        local_dc->cassandra->queue_for_migration(
            audience_id, audience_dc_id
        )
    }
Replication Strategy: Queued
    // Background process: drain the queue and copy data from other data centers
    audience_migrations = local_dc->fetch_from_queue()
    for {audience_id, audience_dc_id} in audience_migrations {
        other_dcs = audience_dc_id != null ?
            {audience_dc_id} : {dc1, dc2, dc3}
        for dc in other_dcs {
            audience_data = dc->cassandra->fetch(audience_id)
            if(audience_data != null) {
                local_dc->cassandra->write(
                    audience_id, audience_data
                )
            }
        }
    }
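The queued strategy splits into a request path and a background worker, which the following sketch makes concrete. All names are hypothetical: `DataCenter` is an in-memory stand-in, the queue is a `collections.deque` where the slide's pseudocode uses `queue_for_migration`, and the merge policy (locally written values win over migrated ones) is an assumption, since the deck does not describe how reconciliation resolves conflicts.

```python
from collections import deque


class DataCenter:
    """In-memory stand-in for one data center's cluster (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.migration_queue = deque()

    def fetch(self, audience_id):
        return self.rows.get(audience_id)

    def write(self, audience_id, data):
        self.rows[audience_id] = data


def fetch_or_queue(local_dc, audience_id, audience_dc_id=None):
    """Request path: local read only; a miss is queued, not resolved inline."""
    data = local_dc.fetch(audience_id)
    if data is None:
        local_dc.migration_queue.append((audience_id, audience_dc_id))
    return data


def drain_migration_queue(local_dc, all_dcs):
    """Background worker: pull queued IDs and copy data from other data centers."""
    migrated = 0
    while local_dc.migration_queue:
        audience_id, audience_dc_id = local_dc.migration_queue.popleft()
        if audience_dc_id is not None:
            candidates = [all_dcs[audience_dc_id]] if audience_dc_id in all_dcs else []
        else:
            candidates = list(all_dcs.values())
        for dc in candidates:
            if dc is local_dc:
                continue
            remote = dc.fetch(audience_id)
            if remote is not None:
                # Assumed merge policy: values written locally since the miss win.
                local = local_dc.fetch(audience_id) or {}
                local_dc.write(audience_id, {**remote, **local})
                migrated += 1
                break
    return migrated


dcs = {name: DataCenter(name) for name in ("den01", "lon01")}
dcs["lon01"].write("user-1", {"123": "1"})
print(fetch_or_queue(dcs["den01"], "user-1", "lon01"))  # None
print(drain_migration_queue(dcs["den01"], dcs))         # 1
print(fetch_or_queue(dcs["den01"], "user-1", "lon01"))  # {'123': '1'}
```

The first request is served as if the user were new, which is the occasional data miss the slide accepts in exchange for keeping remote reads off the request path.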
THANK YOU
Andrew Kuttig
Send questions, deck requests, complaints, cat videos, and resumes to:
akuttig@spotxchange.com

More Related Content

Viewers also liked

Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...DataStax Academy
 
Cassandra Day Denver 2014: Python & Cassandra Best Friends
Cassandra Day Denver 2014: Python & Cassandra Best FriendsCassandra Day Denver 2014: Python & Cassandra Best Friends
Cassandra Day Denver 2014: Python & Cassandra Best FriendsDataStax Academy
 
Cassandra Day Denver 2014: So, You Want to Use Cassandra?
Cassandra Day Denver 2014: So, You Want to Use Cassandra?Cassandra Day Denver 2014: So, You Want to Use Cassandra?
Cassandra Day Denver 2014: So, You Want to Use Cassandra?DataStax Academy
 
Cassandra Day Denver 2014: Building Java Applications with Apache Cassandra
Cassandra Day Denver 2014: Building Java Applications with Apache CassandraCassandra Day Denver 2014: Building Java Applications with Apache Cassandra
Cassandra Day Denver 2014: Building Java Applications with Apache CassandraDataStax Academy
 
Cassandra Day Denver 2014: Cassandra Anti-Pattern Jeopardy
Cassandra Day Denver 2014: Cassandra Anti-Pattern JeopardyCassandra Day Denver 2014: Cassandra Anti-Pattern Jeopardy
Cassandra Day Denver 2014: Cassandra Anti-Pattern JeopardyDataStax Academy
 
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...DataStax Academy
 
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...DataStax Academy
 
Cassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat Videos
Cassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat VideosCassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat Videos
Cassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat VideosDataStax Academy
 
Cassandra Day Denver 2014: Introduction to Apache Cassandra
Cassandra Day Denver 2014: Introduction to Apache CassandraCassandra Day Denver 2014: Introduction to Apache Cassandra
Cassandra Day Denver 2014: Introduction to Apache CassandraDataStax Academy
 
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at ScaleCassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at ScaleDataStax Academy
 
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...DataStax Academy
 

Viewers also liked (11)

Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
Cassandra Day Denver 2014: Setting up a DataStax Enterprise Instance on Micro...
 
Cassandra Day Denver 2014: Python & Cassandra Best Friends
Cassandra Day Denver 2014: Python & Cassandra Best FriendsCassandra Day Denver 2014: Python & Cassandra Best Friends
Cassandra Day Denver 2014: Python & Cassandra Best Friends
 
Cassandra Day Denver 2014: So, You Want to Use Cassandra?
Cassandra Day Denver 2014: So, You Want to Use Cassandra?Cassandra Day Denver 2014: So, You Want to Use Cassandra?
Cassandra Day Denver 2014: So, You Want to Use Cassandra?
 
Cassandra Day Denver 2014: Building Java Applications with Apache Cassandra
Cassandra Day Denver 2014: Building Java Applications with Apache CassandraCassandra Day Denver 2014: Building Java Applications with Apache Cassandra
Cassandra Day Denver 2014: Building Java Applications with Apache Cassandra
 
Cassandra Day Denver 2014: Cassandra Anti-Pattern Jeopardy
Cassandra Day Denver 2014: Cassandra Anti-Pattern JeopardyCassandra Day Denver 2014: Cassandra Anti-Pattern Jeopardy
Cassandra Day Denver 2014: Cassandra Anti-Pattern Jeopardy
 
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
 
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca...
 
Cassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat Videos
Cassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat VideosCassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat Videos
Cassandra Day Denver 2014: A Cassandra Data Model for Serving up Cat Videos
 
Cassandra Day Denver 2014: Introduction to Apache Cassandra
Cassandra Day Denver 2014: Introduction to Apache CassandraCassandra Day Denver 2014: Introduction to Apache Cassandra
Cassandra Day Denver 2014: Introduction to Apache Cassandra
 
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at ScaleCassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
 
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...
Cassandra Summit 2014: Reading Cassandra SSTables Directly for Offline Data A...
 

Similar to Cassandra Day Denver 2014: Transitioning to Cassandra for an Already Giant Product

#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitterTwitter Developers
 
Democratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druidDemocratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druidDataWorks Summit
 
REX Hadoop et R
REX Hadoop et RREX Hadoop et R
REX Hadoop et Rpkernevez
 
Billions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right NowBillions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right NowRob Winters
 
The New Edge of the Network
The New Edge of the NetworkThe New Edge of the Network
The New Edge of the NetworkTom Paseka
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Security Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetSecurity Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetJuan Berner
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
Webinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDBWebinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDBMongoDB
 
MongoDB Best Practices
MongoDB Best PracticesMongoDB Best Practices
MongoDB Best PracticesLewis Lin 🦊
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Jaroslav Gergic
 
Myntra.com's Big Data Platform
Myntra.com's Big Data PlatformMyntra.com's Big Data Platform
Myntra.com's Big Data PlatformApoorva Gaurav
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...DataStax
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleItai Yaffe
 
Optimizing Total Cost of Ownership for the AWS Cloud
Optimizing Total Cost of Ownership for the AWS CloudOptimizing Total Cost of Ownership for the AWS Cloud
Optimizing Total Cost of Ownership for the AWS CloudAmazon Web Services
 
Introduction to Amazon Kinesis Data Streams
Introduction to Amazon Kinesis Data StreamsIntroduction to Amazon Kinesis Data Streams
Introduction to Amazon Kinesis Data StreamsKnoldus Inc.
 
Syncsort & comScore Big Data Warehouse Meetup Sept 2013
Syncsort & comScore Big Data Warehouse Meetup Sept 2013Syncsort & comScore Big Data Warehouse Meetup Sept 2013
Syncsort & comScore Big Data Warehouse Meetup Sept 2013Steven Totman
 

Similar to Cassandra Day Denver 2014: Transitioning to Cassandra for an Already Giant Product (20)

#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter
 
Democratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druidDemocratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druid
 
REX Hadoop et R
REX Hadoop et RREX Hadoop et R
REX Hadoop et R
 
Billions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right NowBillions of Rows, Millions of Insights, Right Now
Billions of Rows, Millions of Insights, Right Now
 
The New Edge of the Network
The New Edge of the NetworkThe New Edge of the Network
The New Edge of the Network
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Security Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetSecurity Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budget
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Webinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDBWebinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDB
 
MongoDB Best Practices
MongoDB Best PracticesMongoDB Best Practices
MongoDB Best Practices
 
MicroStrategy at Badoo
MicroStrategy at BadooMicroStrategy at Badoo
MicroStrategy at Badoo
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
 
Myntra.com's Big Data Platform
Myntra.com's Big Data PlatformMyntra.com's Big Data Platform
Myntra.com's Big Data Platform
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
 
Scaling Your Data: Data Democratisation and DataOps
Scaling Your Data: Data Democratisation and DataOpsScaling Your Data: Data Democratisation and DataOps
Scaling Your Data: Data Democratisation and DataOps
 
Optimizing Total Cost of Ownership for the AWS Cloud
Optimizing Total Cost of Ownership for the AWS CloudOptimizing Total Cost of Ownership for the AWS Cloud
Optimizing Total Cost of Ownership for the AWS Cloud
 
Introduction to Amazon Kinesis Data Streams
Introduction to Amazon Kinesis Data StreamsIntroduction to Amazon Kinesis Data Streams
Introduction to Amazon Kinesis Data Streams
 
Syncsort & comScore Big Data Warehouse Meetup Sept 2013
Syncsort & comScore Big Data Warehouse Meetup Sept 2013Syncsort & comScore Big Data Warehouse Meetup Sept 2013
Syncsort & comScore Big Data Warehouse Meetup Sept 2013
 

More from DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 

Cassandra Day Denver 2014: Transitioning to Cassandra for an Already Giant Product

  • 1. Who We Are
    • Holistic video advertising platform for publishers
    • Most transparent global marketplace for sellers
    • Founded in 2007, 180+ employees globally
    • First to market with video RTB in 2010
    • Integrated with over half of comScore top 100 pubs
    2+ Billion ad decisions per day
    Reaching 335+ Million uniques every month
    Serving impressions in 100+ countries
    Integrated with 100,000+ publishers
    Connected to 35+ DSPs
    Partnerships with industry-leading trading desks
    10,000+ brand name advertisers
  • 2. How Big is Our Data?
    ● Over 2 billion ad auctions per day
    ● Each auction generates an average of 20-30 “records”
      ● Audience data
      ● Bid data
      ● Event tracking
    ● A “record everything” approach would result in approximately 50 billion records per day
      ● Normalized: ~ 1.5 TB / day uncompressed
      ● Denormalized: ~ 5 TB / day uncompressed
      ● Possibly up to 150 TB of data per month
    ● We are not currently using a “record everything” approach, but we want to get there
  • 3. How Fast Does Our Data Grow?
    [Chart: daily auctions with growth curve, 10/9/13 to 9/9/14; y-axis 0 to 2.5E+09]
    ● Typically our numbers double every 6 months
    ● We expect more rapid growth over the next year or two
  • 4. How Fast Does Our Data Grow?
    [Chart: auctions with growth curve; y-axis 0 to 1.2E+10]
    ● Typically our numbers double every 6 months
    ● We expect more rapid growth over the next year or two
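    The doubling rule on these two slides can be turned into a quick projection. The following is a minimal sketch, assuming clean exponential growth with a 6-month doubling period and the roughly 2 billion daily auctions quoted on slide 2 as the starting point; the function name is hypothetical and not from the deck.

```python
def project_auctions(current_per_day: float, months: float,
                     doubling_months: float = 6.0) -> float:
    """Project daily auctions, assuming steady exponential growth."""
    return current_per_day * 2 ** (months / doubling_months)


if __name__ == "__main__":
    # Starting point: ~2 billion auctions per day (slide 2).
    for months in (6, 12, 18):
        projected = project_auctions(2e9, months)
        print(f"{months:>2} months: {projected:.1e} auctions/day")
```

    Doubling alone gives about 8 billion auctions per day after a year; the 10 billion figure used on slide 5 is consistent with the expectation of faster-than-usual growth.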
  • 5. How Big Might Our Data Get in a Year?
    ● Over 10 billion ad auctions per day
    ● Each auction generates an average of 30-40 “records”
      ● Audience data
      ● Bid data
      ● Event tracking
    ● A “record everything” approach would result in approximately 350 billion records per day
      ● Normalized: ~ 10.5 TB / day uncompressed
      ● Denormalized: ~ 35 TB / day uncompressed
      ● Possibly up to 1 PB of data per month
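    The volume figures on slides 2 and 5 follow from simple arithmetic. The sketch below reproduces them under two assumptions that are not stated in the deck: the midpoint of each "records per auction" range is used (25 and 35), and 1 TB is taken as 10^12 bytes. The helper names are hypothetical.

```python
TB = 1e12  # assumption: decimal terabytes


def daily_records(auctions_per_day: float, records_per_auction: float) -> float:
    """Records written per day under a "record everything" approach."""
    return auctions_per_day * records_per_auction


def monthly_tb(tb_per_day: float, days: int = 30) -> float:
    """Monthly volume in TB for a given daily volume."""
    return tb_per_day * days


def bytes_per_record(tb_per_day: float, records_per_day: float) -> float:
    """Average record size implied by a daily volume and record count."""
    return tb_per_day * TB / records_per_day


if __name__ == "__main__":
    today = daily_records(2e9, 25)        # ~50 billion records per day
    next_year = daily_records(10e9, 35)   # ~350 billion records per day
    print(f"today:     {today:.2e} records/day, {monthly_tb(5):.0f} TB/month denormalized")
    print(f"next year: {next_year:.2e} records/day, {monthly_tb(35):.0f} TB/month denormalized")
    print(f"implied size: {bytes_per_record(1.5, today):.0f} B normalized, "
          f"{bytes_per_record(5, today):.0f} B denormalized")
```

    Under these assumptions the deck's numbers imply roughly 30 bytes per normalized record and 100 bytes per denormalized record, both today and in the one-year projection.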
  • 6. Excited to see how we’re using Cassandra for all this?
  • 7. Too bad, we aren’t (yet)!
  • 8. Where Do We Start?
  • 9. Audience Data
    ● Information about the people that are viewing ads
      ● Segment data (demographics, browsing history, etc.)
      ● Ads viewed
      ● ID syncing
    ● Used for advertisers to reach their target audience
      ● “My product is relevant only to bald, left-handed, highly educated immigrants from Uzbekistan.”
    ● Historically stored in cookies
    ● Technology advancement necessitates abandoning the cookie strategy
      ● Track users on multiple devices
      ● Mobile devices and connected TVs don’t typically support cookies
    ● Offline availability of data provides analytics opportunities
      ● Discover trends
      ● Look-alike segments
  • 10. Cookie-based Workflow
    [Sequence diagram between Browser, SpotXchange, and Data Partner]
    ● Browser requests an ad via HTTP
    ● Server responds with an ad
      ● The ad payload includes data partner URLs
    ● Browser requests partner URL
      ● Request payload includes partner’s cookies
    ● Data provider replies with a redirect containing segment information
    ● Browser redirects to us
    ● We respond with our own cookies containing their segment data
    ● Browser requests an ad via HTTP
      ● Now including our cookies
    ● Server responds with an ad targeted at audience segments
  • 11. Moving Away from Cookies
    ● Cookies are overly constraining and getting worse
      ● Limited to desktop traffic
      ● Payload is expensive
        ● Bandwidth
        ● Processing (encryption and encoding)
      ● Impossible to run deep analytics
      ● Impossible to perform server-to-server synchronization
    ● Newer identification standards are emerging
      ● Apple IDFA, Android ID, UIDH
      ● Facebook/Google ID
      ● Device fingerprinting
    ● Moving audience data onto the server allows data to be associated with any identifier, and even allows multiple identifiers to be tied together
  • 12. Server-side Storage Workflow
    [Sequence diagram between Browser, SpotXchange, and Data Partner]
    ● Browser requests an ad via HTTP
    ● Server responds with an ad
      ● The ad payload includes data partner URLs
    ● Browser requests partner URL with SpotX audience ID attached
    ● Data provider replies with a redirect containing segment information and partner audience ID
    ● Browser redirects to us
    ● We store segment information on the server
    ● Browser requests an ad via HTTP
    ● Server responds with an ad targeted at audience segments
  • 13. Additional Capabilities
    [Sequence diagram between Data Partner, Browser, and SpotXchange]
    ● User visits a site that provides the partner new data about that user
    ● Provider recognizes that they have synced this user with us in the past
    ● Partner calls us server-to-server with the user information, including our ID and new data
    ● We store the new information
    ● Browser requests an ad via HTTP
    ● Server responds with an ad targeted at the new audience segments
  • 14. Storing Audience Data In Cassandra
    Data Modeling
    Cluster Sizing
    Replication Strategy
  • 16. Data Modeling
    ● Solution must minimize latency
    ● Attempt to constrain to one read or one write per event whenever possible

      {
        "audience_id" : "12345678-1234-1234-1234-123456789012",
        "segments" : {"123": 1, "456": 3, "789": 1},
        "foreign_ids" : {
          "7180" : "967992447104804725",
          "7347" : "bWv2-HOyJD8y6D",
          "6960" : "404_53e3bfa26d377"
        },
        "pacing" : {
          "2235" : 1412892591
        }
      }
  • 17. Data Modeling
    ● Ad auctioning requires reading nearly all the data at once
    ● Most events write to one and only one data type (segments, ids, etc.)

      {
        "audience_id" : "12345678-1234-1234-1234-123456789012",
        "segments" : {"123": 1, "456": 3, "789": 1},
        "foreign_ids" : {
          "7180" : "967992447104804725",
          "7347" : "bWv2-HOyJD8y6D",
          "6960" : "404_53e3bfa26d377"
        },
        "pacing" : {
          "2235" : 1412892591
        }
      }
  • 18. Data Modeling
    ● Store an entire user record in one row so it can be read all at once
    ● All data can be represented as a tuple with a unique identifier

      CREATE TABLE audience_data (
          audience_id uuid,
          type int,
          key text,
          value text,
          PRIMARY KEY (audience_id, type, key)
      );

      SELECT * FROM audience_data WHERE
          audience_id = 12345678-1234-1234-1234-123456789012;

      SELECT * FROM audience_data WHERE
          audience_id = 12345678-1234-1234-1234-123456789012 AND
          type = 1;

      INSERT INTO audience_data (audience_id, type, key, value) VALUES
          (12345678-1234-1234-1234-123456789012, 1, '123', '1');
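    To make the tuple representation concrete, the sketch below flattens the example record from slides 16 and 17 into (audience_id, type, key, value) rows matching the table above. The numeric type codes are hypothetical: the deck's INSERT only shows type = 1 used with a segment-style key and value, so the codes for foreign_ids and pacing are assumptions, as is the function name.

```python
from typing import Iterator, Tuple

# Hypothetical type codes; only "1" appears in the deck, the rest are assumed.
TYPE_CODES = {"segments": 1, "foreign_ids": 2, "pacing": 3}

Row = Tuple[str, int, str, str]  # (audience_id, type, key, value)

RECORD = {
    "audience_id": "12345678-1234-1234-1234-123456789012",
    "segments": {"123": 1, "456": 3, "789": 1},
    "foreign_ids": {
        "7180": "967992447104804725",
        "7347": "bWv2-HOyJD8y6D",
        "6960": "404_53e3bfa26d377",
    },
    "pacing": {"2235": 1412892591},
}


def flatten(record: dict) -> Iterator[Row]:
    """Yield one row per (type, key) pair; values are stored as text."""
    audience_id = record["audience_id"]
    for field, type_code in TYPE_CODES.items():
        for key, value in record.get(field, {}).items():
            yield (audience_id, type_code, str(key), str(value))


if __name__ == "__main__":
    for row in flatten(RECORD):
        print(row)
```

    Because audience_id is the partition key, all of these rows live together and can be read with the first SELECT above, while an event that touches only one data type writes a single row.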
  • 20. Cluster Sizing
    ● Distributed a modified version of our implementation to production
      ● Replaced Cassandra calls with writes to a log file
    ● Created a spreadsheet detailing each operation and how much load to expect during peak times
    ● Used peak load to size the cluster for each data center
    ● Used formula provided by Aaron Morton at The Last Pickle

      (system_constant * #cores * #nodes) / replication_factor = ops / sec

      ops = 1 read or write to one row (cluster in a partition)

      system_constant = 3000 for AWS
                        4000 for spinning disk
                        7-12K for SSD
  • 21. Our Backwards Scenario
    ● Typically clusters start small and grow as product adoption grows
    ● Our cluster will be working hardest when we first turn it on
      ● Existing cookie data needs to migrate to Cassandra
      ● As data migrates the load will decrease, normalize, and then increase slowly over the next few months
      ● Don’t expect to match original load for nearly a year
    [Chart: Peak OPS over time; y-axis 0 to 140,000]
  • 22. Cluster Sizing

                                             den01     iad02     lon01     hkg01
    % of total traffic                         40%       40%       13%        7%
    Normal tag rate                            0.1       0.1       0.1       0.1
    Migration tag rate                        0.75      0.75      0.75      0.75

    SELECT
      DC Avg                                46,296    46,296    15,046     8,102
      DC Peak                              138,889   138,889    45,139    24,306
      FE Avg                                   126       263       684       675
      FE Peak                                  377       789     2,052     2,025

    UPDATE tag (typical load)
      DC Avg                                 4,630     4,630     1,505       810
      DC Peak                               13,889    13,889     4,514     2,431
      FE Avg                                    13        26        68        68
      FE Peak                                   38        79       205       203

    UPDATE tag (migration)
      DC Avg                                30,093    30,093     9,780     5,266
      DC Peak                               90,278    90,278    29,340    15,799
      FE Avg                                    82       171       445       439
      FE Peak                                  245       513     1,334     1,317

    Total DC ops (normal load)
      Avg                                   51,389    51,389    16,701     8,993
      Peak                                 154,167   154,167    50,104    26,979

    Total DC ops (migration)
      Avg                                   87,963    87,963    28,588    15,394
      Peak                                 263,889   263,889    85,764    46,181

    Nodes required (8 core)
      Spinning disk (constant 4000)             17        17         6         3
      SSD (constant 7000)                       10        10         4         3
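    The node counts in this table can be reproduced from the slide 20 formula by solving it for #nodes. The sketch below is a hedged reconstruction: the deck does not state the replication factor or a minimum cluster size, so replication_factor = 2 and a floor of 3 nodes are assumptions, chosen because they match every node count in the table when applied to the peak migration load. The function name is hypothetical.

```python
import math


def nodes_required(peak_ops: float, system_constant: float, cores: int = 8,
                   replication_factor: int = 2, min_nodes: int = 3) -> int:
    """Solve (system_constant * cores * nodes) / replication_factor = ops/sec for nodes.

    replication_factor and min_nodes are inferred from the table, not given in the deck.
    """
    raw = peak_ops * replication_factor / (system_constant * cores)
    return max(min_nodes, math.ceil(raw))


if __name__ == "__main__":
    # Peak "Total DC ops (migration)" per data center, taken from the table.
    peak_migration_ops = {"den01": 263_889, "iad02": 263_889,
                          "lon01": 85_764, "hkg01": 46_181}
    for dc, ops in peak_migration_ops.items():
        spinning = nodes_required(ops, 4000)
        ssd = nodes_required(ops, 7000)
        print(f"{dc}: spinning disk {spinning}, SSD {ssd}")
```

    Sizing against the migration peak instead of the normal peak fits the "backwards scenario" on slide 21, where the cluster works hardest on the day it is switched on.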
  • 23. Cluster Sizing

                                             den01     iad02     lon01     hkg01
    % total traffic                         40.00%    40.00%    13.00%     7.00%

    Tag
      Daily GB                                 0.9       0.9       0.3       0.2
      Total GB                                  84        84        27        15

    Frqcap
      Daily GB                                 0.6       0.6       0.2       0.1
      Total GB                                 3.9       3.9       1.3       0.7

    Partner
      Daily GB                                   8         8         3         1
      Total GB                                1509      1509       490       264

    Total GB                                  3193      3193      1038       559
    Per Node GB                                456       456       346       186
  • 25. Replication Strategy
    ● Typical Cassandra replication is expensive
      ● Each write is replicated to all data centers
      ● Each cluster must be approximately the same size
      ● Need a large pipe between data centers
      ● 3.7 million columns updated per second at peak load
      ● Amount of replication needed increases with each new data center
  • 26. Replication Strategy
    ● Alternate strategies suggested:
      ● Offline copying of SSTables
      ● Maintain a log of changed records and run a process to copy those periodically
    ● We realized that this data doesn’t need to be available in all places at all times
      ● People don’t often move far enough to switch data centers
      ● Data integrity is of fairly low importance
      ● If our data isn’t replicated the user will appear to be new when they switch data centers, but that only has a minor short-term impact on application performance
    ● Other replication strategies we considered:
      ● None
      ● Just-in-time
      ● Queued
  • 27. Replication Strategy: None
    ● Don’t replicate at all
    ● Each data center has its own completely self-contained cluster
    ● Advantage: Simplicity
    ● Disadvantage: Limits our ability to target users when they move or we reassign regions to a different data center
  • 28. Replication Strategy: Just-In-Time
    ● Each data center has its own completely self-contained cluster
    ● The user’s identifier cookie contains a data center identifier
    ● When an incoming request’s cookie says it’s from a different data center, read from that data center in real time and replicate on the fly to the local data center
    ● Reassign the cookie using the new data center
    ● Advantage
      ● Audience data is (almost) always available (99.99%)
    ● Disadvantages
      ● Additional latency while waiting for user data
      ● In cookie-less situations we’d need to query all data centers if the local data center has no data
  • 29. Replication Strategy: Just-In-Time

      if(cookie != null) {
          audience_id = cookie[id]
          audience_dc_id = cookie[dc_id]
      }
      else {
          audience_id = some other identifier
          audience_dc_id = null
      }

      if(audience_dc_id == local_dc) {
          audience_data = local_dc->cassandra->fetch(audience_id)
      }
      else {
          other_dcs = audience_dc_id != null ?
                  {audience_dc_id} : {dc1, dc2, dc3}
          for dc in other_dcs {
              audience_data = dc->cassandra->fetch(audience_id)
              if(audience_data != null) {
                  local_dc->cassandra->write(
                      audience_id, audience_data
                  )
                  break;
              }
          }
      }
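    The just-in-time flow can also be sketched as runnable Python. This is an illustration under stated assumptions, not the deck's implementation: each data center's Cassandra cluster is replaced by an in-memory dict, the function and variable names are hypothetical, and in the cookie-less case the local data center is checked first, following the wording of slide 28.

```python
from typing import Dict, Optional

# Stand-in for one Cassandra cluster per data center: audience_id -> audience data.
Cluster = Dict[str, dict]


def fetch_just_in_time(clusters: Dict[str, Cluster], local_dc: str,
                       audience_id: str,
                       cookie_dc: Optional[str] = None) -> Optional[dict]:
    """Return audience data, copying it into the local cluster on a remote hit."""
    local = clusters[local_dc]
    if cookie_dc == local_dc:
        return local.get(audience_id)
    if cookie_dc is None:
        # Cookie-less: try locally, then fall back to every other data center.
        data = local.get(audience_id)
        if data is not None:
            return data
        candidates = [dc for dc in clusters if dc != local_dc]
    else:
        # The cookie names the data center that owns this user.
        candidates = [cookie_dc]
    for dc in candidates:
        data = clusters[dc].get(audience_id)
        if data is not None:
            local[audience_id] = data  # replicate on the fly
            return data
    return None


if __name__ == "__main__":
    clusters = {"den01": {}, "iad02": {"u1": {"segments": {"123": 1}}}, "lon01": {}}
    print(fetch_just_in_time(clusters, "den01", "u1", cookie_dc="iad02"))
    print("u1" in clusters["den01"])
```

    After a remote hit the caller would reassign the cookie to the local data center, as slide 28 describes; that step is left out of the sketch.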
  • 30. Replication Strategy: Queued
    ● Each data center has its own completely self-contained cluster
    ● When a fetch attempt misses, the user ID is added to a queue for reconciliation
    ● Treat the user as a new user and store their data locally
    ● Background process consumes IDs from the queue and attempts to fetch data from other data centers for reconciliation
    ● Advantages
      ● Audience data is mostly available (98%)
      ● Minimal additional latency introduced
    ● Disadvantages
      ● Additional operational complexity
      ● Occasional data misses
  • 31. Replication Strategy: Queued

      if(cookie != null) {
          audience_id = cookie[id]
          audience_dc_id = cookie[dc_id]
      }
      else {
          audience_id = some other identifier
          audience_dc_id = null
      }

      audience_data = local_dc->cassandra->fetch(audience_id)
      if(audience_data == null) {
          local_dc->cassandra->queue_for_migration(
              audience_id, audience_dc_id
          )
      }
  • 32. Replication Strategy: Queued

      audience_migrations = local_dc->fetch_from_queue()
      for {audience_id, audience_dc_id} in audience_migrations {
          other_dcs = audience_dc_id != null ?
                  {audience_dc_id} : {dc1, dc2, dc3}
          for dc in other_dcs {
              audience_data = dc->cassandra->fetch(audience_id)
              if(audience_data != null) {
                  local_dc->cassandra->write(
                      audience_id, audience_data
                  )
              }
          }
      }
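    The two halves of the queued strategy, the request path and the background reconciler, can be sketched together in Python. As with any illustration of this kind, several pieces are assumptions and not taken from the deck: clusters are in-memory dicts, the queue is a collections.deque, all names are hypothetical, and remote data is merged under whatever was stored locally in the meantime (locally written keys win), which is only one possible reconciliation rule.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

Cluster = Dict[str, dict]                       # audience_id -> audience data
MigrationQueue = Deque[Tuple[str, Optional[str]]]  # (audience_id, cookie_dc)


def fetch_or_queue(clusters: Dict[str, Cluster], queue: MigrationQueue,
                   local_dc: str, audience_id: str,
                   cookie_dc: Optional[str] = None) -> Optional[dict]:
    """Request path: read locally only; on a miss, queue the ID and move on."""
    data = clusters[local_dc].get(audience_id)
    if data is None:
        queue.append((audience_id, cookie_dc))
    return data


def reconcile(clusters: Dict[str, Cluster], queue: MigrationQueue,
              local_dc: str) -> int:
    """Background path: drain the queue, pulling data from other data centers."""
    migrated = 0
    while queue:
        audience_id, cookie_dc = queue.popleft()
        candidates = ([cookie_dc] if cookie_dc is not None
                      else [dc for dc in clusters if dc != local_dc])
        for dc in candidates:
            remote = clusters[dc].get(audience_id)
            if remote is not None:
                # Assumed merge rule: keys written locally since the miss win.
                local_now = clusters[local_dc].get(audience_id, {})
                clusters[local_dc][audience_id] = {**remote, **local_now}
                migrated += 1
                break
    return migrated


if __name__ == "__main__":
    clusters = {"den01": {}, "iad02": {"u1": {"segments": {"123": 1}}}}
    queue: MigrationQueue = deque()
    print(fetch_or_queue(clusters, queue, "den01", "u1", cookie_dc="iad02"))
    print(reconcile(clusters, queue, "den01"), clusters["den01"])
```

    The request path never waits on another data center, which is where the "minimal additional latency" advantage on slide 30 comes from; the cost is that the first request after a move sees the user as new.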
  • 33. THANK YOU
    Andrew Ku0g
    Send questions, deck requests, complaints, cat videos, and resumes to: aku0g@spotxchange.com