CASSANDRA DAY SILICON VALLEY 2014 – APRIL 7TH – MATT JURIK
SCALING VIDEO PROGRESS TRACKING
MATT JURIK
SOFTWARE DEVELOPER
WHAT IS HULU?
HULU’S MISSION
Help people find and enjoy the world’s premium content
when, where and how they want it.
•  Service Oriented Architecture
•  Follow the Unix Philosophy
•  Small services with specialized scopes
•  Small teams focusing on specific areas
•  Right tool for the job
•  Many languages, frameworks, formats
•  Cross team development encouraged
•  If something you depend on needs fixing, feel free to fix it
VIDEO PROGRESS TRACKING
CODENAME: HUGETOP
AGENDA
•  Old architecture
•  New architecture
•  Keyspace design
•  Migrating to Cassandra
•  Operations
OLD ARCHITECTURE (MYSQL)
HUGETOP (PYTHON)
OTHER SERVICES / DEVICES / HULU.COM
64 Redis Shards
(Persistence-enabled)
API (PYTHON)
8 MySQL Shards
NEW ARCHITECTURE (C*)
HUGETOP (PYTHON)
OTHER SERVICES / DEVICES / HULU.COM
64 Redis Shards
(Cache-only)
CRAPI (JAVA)
8 Cassandra Nodes
WHY SWITCH?
The dilemma
•  Unbounded data growth
•  MySQL very stable, but servers running out of space
•  “Manually resharding is fun!” – No one, ever
Why Cassandra?
•  Our data fits Cassandra’s data model well
•  Cassandra promises (and delivers) great scalability
•  Highly available
•  Multi-DC
INTERACTION BETWEEN REDIS + CASSANDRA
HUGETOP
64 Redis Shards
(Cache-only)
CRAPI
8 Cassandra Nodes
Video position updates
1.  Write position info to Cassandra
2.  Update Redis
Video position requests
Check Redis:
    If data is loaded in Redis, return it.
    Else:
        Fetch user’s history from Cassandra,
        Queue job to update Redis,
        Return data fetched from Cassandra.
Redis
•  Maintains complex indices
•  Enriches data by simulating joins with Lua
Cassandra
•  Provides durability
•  Replenishes Redis as necessary
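The update/request flow above is a cache-aside pattern. A minimal self-contained sketch in Python, with plain dicts standing in for the 64 Redis shards and the Cassandra cluster (all names here are hypothetical, not HUGETOP’s actual API):

```python
redis_cache = {}   # user_id -> list of (video_id, position) records
cassandra = {}     # durable store, same shape

def record_position(user_id, video_id, position):
    # Updates: write position info to Cassandra first (durability),
    # then update Redis (simplest form: invalidate the cached entry).
    cassandra.setdefault(user_id, []).append((video_id, position))
    redis_cache.pop(user_id, None)

def get_history(user_id):
    # Requests: check Redis; on a miss, fetch the user's history from
    # Cassandra, warm the cache (a queued job in the real service,
    # done inline here), and return the data fetched from Cassandra.
    cached = redis_cache.get(user_id)
    if cached is not None:
        return cached
    rows = cassandra.get(user_id, [])
    redis_cache[user_id] = rows
    return rows
```

Because every write lands in Cassandra before Redis is touched, the cache can be rebuilt from Cassandra at any time.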
HARDWARE CONSIDERATIONS
Take one
•  Hadoop-class machines
•  Physical boxes (i.e., no VMs)
•  6 standard 7200 RPM drives
•  32 GB RAM
•  Leveled compaction + JBOD
•  Write throughput ☺
•  Read latency ☹
Take two
•  SSD-based machines
•  Physical boxes (C-states disabled)
•  550 GB RAID 5
•  48 GB RAM
•  Leveled compaction
•  Write throughput ☺
•  Read latency ☺
•  16 nodes split between 2 DCs
KEYSPACE DESIGN
•  Query last position for user=X, video=Y
•  Query last position for user=X, video=*
•  Daily log of all views needed by other services
•  Two tables: one for updates; one for deletes
•  Shard data across rows
•  TTL’d
Copy 1
CREATE TABLE views (
    u int,         -- User ID
    v int,         -- Video ID
    c boolean,     -- Is completed?
    p float,       -- Video position
    t timestamp,   -- Last viewed at
    ...,           -- Other fields
    PRIMARY KEY (u, v)
);
Copy 2
CREATE TABLE daily_user_views (
    s int,         -- Partition key
    u int,         -- User ID
    v int,         -- Video ID
    ...,           -- Other fields
    PRIMARY KEY (s, u, v)
);
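The two point-query access patterns map directly onto the views table; a sketch in CQL (the IDs are made up):

```sql
-- Last position for user=X, video=Y
SELECT p, t, c FROM views WHERE u = 12345 AND v = 67890;

-- Last positions for user=X, video=* (the whole partition)
SELECT v, p, t, c FROM views WHERE u = 12345;
```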
SHARDING!?
•  Single row containing one day’s worth of data = too BIG + causes hotspots
•  Fetching a single row cannot be parallelized, so it is slow
•  Solution: shard each day across 128 rows
   => Spreads data across multiple nodes
   => Query multiple nodes in parallel
Partition key
userID % 128 + daysBetween(EPOCH, viewDate) * 128
April 7th, 2014 (daysBetween(EPOCH, “April 7th, 2014”) = 16167):
for (int i = 0; i < 128; i++) {
    int k = i % 128 + 16167 * 128;
    execute("SELECT * FROM daily_user_views WHERE s = " + k);
}
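The loop above scans the 128 shards one at a time; since the shard keys land on different nodes, the queries can also run concurrently, as the slide suggests. A Python sketch with a thread pool, where execute() is a stand-in for the real CQL call:

```python
from concurrent.futures import ThreadPoolExecutor

SHARDS = 128

def shard_keys(day_number):
    # Same formula as the partition key: shard + day * 128
    return [i + day_number * SHARDS for i in range(SHARDS)]

def execute(key):
    # Stand-in for: SELECT * FROM daily_user_views WHERE s = <key>
    return []

def fetch_day(day_number):
    # Issue all 128 shard queries concurrently and flatten the results.
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(execute, shard_keys(day_number)))
    return [row for rows in results for row in rows]
```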
MIGRATING FROM MYSQL → CASSANDRA
HUGETOP
MySQL          Cassandra
1  Read/write to MySQL
2  Duplicate writes+deletes to Cassandra
   - column timestamps = last_played_at date  ← Critical for next step
   - apply deletions, but also temporarily store them in deletion_ledger
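The column-timestamp trick in step 2 can be expressed directly in CQL with USING TIMESTAMP; a hedged sketch (values made up, timestamp in microseconds since epoch):

```sql
-- Dual write: stamp the write with the row's last_played_at,
-- not "now", so the later backfill cannot clobber newer positions.
INSERT INTO views (u, v, c, p, t)
VALUES (12345, 67890, false, 120.5, '2014-04-07 00:00:00')
USING TIMESTAMP 1396828800000000;
```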
MIGRATING FROM MYSQL → CASSANDRA
HUGETOP
MySQL          Cassandra
3  Backfill old data
   Again, write to Cassandra with column timestamp = last_viewed_at date (prevents old position from overwriting new position)
4  Replay deletions stored in deletion_ledger
   Just like inserts, you can specify a timestamp for deletions.
   column timestamp = time at which original deletion occurred (prevents deleting new data)
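Step 4’s replayed deletions use the same mechanism, since DELETE also accepts USING TIMESTAMP (values made up):

```sql
-- Replay from deletion_ledger: stamp the delete with the time the
-- original deletion happened, so data written since then survives.
DELETE FROM views
USING TIMESTAMP 1396828800000000
WHERE u = 12345 AND v = 67890;
```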
OPERATIONS
•  Use internal tool for automating repairs, backups, etc.
•  Metrics
   •  Dump metrics to Graphite via a custom -javaagent which hooks into Yammer Metrics
   •  Implement a MetricPredicate to filter boring metrics
•  High-level monitoring (something is usually wrong if):
   •  d(hint count)/dt > 0
   •  Large number of old-gen collections
   •  Lots of SSTables in L0 (and not importing data, bootstrapping, etc.)
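The first monitoring rule above, d(hint count)/dt > 0, amounts to checking the slope between two samples of the hinted-handoff counter. A sketch (the sampling source is hypothetical; in this setup the real numbers would come out of the Yammer Metrics -javaagent):

```python
def hints_increasing(samples):
    """samples: list of (unix_seconds, total_hints) pairs, oldest first."""
    (t0, h0), (t1, h1) = samples[-2], samples[-1]
    # A positive slope means the cluster is actively storing hints,
    # i.e. some node is down or not keeping up with writes.
    return (h1 - h0) / (t1 - t0) > 0
```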
OPERATIONS
•  SSTable corruption
   •  nodetool scrub
   •  sstablescrub – if things are really bad
•  Things to watch:
   •  Snapshots are awesome, but can quickly burn disk space
   •  Keep nodes under 50% disk utilization, even if using Leveled Compaction
THANK YOU
QUESTIONS?
