C* Summit 2013: Cassandra at Instagram by Rick Branson

Speaker: Rick Branson, Infrastructure Engineer at Instagram
Cassandra is a critical part of Instagram's large scale site infrastructure that supports more than 100 million active users. This talk is a practical deep dive into data models, systems architecture, and challenges encountered during the implementation process.

Transcript

  • 1. CASSANDRA AT INSTAGRAM Rick Branson, Infrastructure Engineer @rbranson 2013 Cassandra Summit #cassandra13 June 12, 2013 San Francisco, CA
  • 2. September 2012 Redis fillin' up.
  • 3. What sucks?
  • 4. THE OBVIOUS Memory is expensive.
  • 5. LESS OBVIOUS: In-memory "degrades" poorly
  • 6. •Flat namespace. What's in there? •Heap fragmentation •Single threaded
  • 7. BGSAVE
  • 8. •Boils down to centralized logging •VERY high skew of writes to reads (1,000:1) •Ever growing data set •Durability highly valued The Data
  • 9. • Cassandra 1.1 • 3 EC2 m1.xlarge (2-core, 15GB RAM) • RAIDed ephemerals (1.6TB of SATA) • RF=3 • 6GB Heap, 200MB NewSize • HSHA The Setup
  • 10. It worked. Mostly.
  • 11. The horrible cool thing about Chef...
  • 12. commit a1489a34d2aa69316b010146ab5254895f7b9141 Author: Rick Branson Date: Thu Oct 18 20:05:16 2012 -0700 Follow the rules for Cassandra listen_address so I don't burn a whole day fixing my retarded mistake
  • 13. commit 41c96f3243a902dd6af4ea29ef6097351a16494a Author: Rick Branson Date: Tue Oct 30 17:12:00 2012 -0700 Use 256k JVM stack size for C* -- fixes a bug that got integrated with 1.1.6 packaging + Java 1.6.0_u34+
  • 14. November 2012 Doubled to 6 nodes. 18,000 connections. Spread those more evenly.
  • 15. commit 3f2e4f2e5da6fe99d7f3fc13c0da09b464b3a9e0 Author: Rick Branson Date: Wed Nov 21 09:50:21 2012 -0800 Drop key cache size on C*UA cluster: was causing heap issues, and apparently 1GB is _WAY_ outside of the normal range of operation for nodes of this size.
  • 16. commit 5926aa5ce69d48e5f2bb7c0d0e86b411645bc786 Author: Rick Branson Date: Mon Dec 24 12:41:13 2012 -0800 Lower memtable sizes on C* UA cluster to make more room for compression metadata / bloom filters on heap
  • 17. 1.2.1. It went well. Well... until...
  • 18. commit 84982635d5c807840d625c22a8bd4407c1879eba Author: Rick Branson Date: Thu Jan 31 09:43:56 2013 -0800 Switch Cassandra from tokens to vnodes commit e990acc5dc69468c8a96a848695fca56e79f8b83 Author: Rick Branson Date: Sun Feb 10 20:26:32 2013 -0800 We aren't ready for vnodes yet guys
  • 19. TAKEAWAY Let stupid enterprising, experienced operators who will submit patches take the first few bullets on brand-new major versions.
  • 20. commit acb02daea57dca889c2aa45963754a271fa51566 Author: Rick Branson Date: Sun Feb 10 20:36:34 2013 -0800 Doubled C* cluster
  • 21. commit cc13a4c15ee0051bb7c4e3b13bd6ae56301ac670 Author: Rick Branson Date: Thu Mar 14 16:23:18 2013 -0700 Subtract token from C*ua7 to replace the node
  • 22. pycassa exceptions (last 6 months)
  • 23. •3.4TB •Will try vnode migration again soon...
  • 24. TAKEAWAY Adopt a technology by understanding what it's best at and letting it do that first, then expand...
  • 25. •Sharded Redis •32x68GB (m2.4xlarge) •Space (memory) bound •Resharding sucks •Let's get some better availability...
  • 26. user_id: [ activity, activity, ... ]
  • 27. user_id: [ activity, activity, ... ] Thrift Serialized Activity
  • 28. Bound the Size user_id: [ activity1, activity2, ... activity100, activity101, ... ] LTRIM <user_id> 0 99
  • 29. Undo user_id: [ activity1, activity2, activity3, ... ] LREM <user_id> 0 <activity2>
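
For concreteness, a minimal sketch of the Redis scheme the last few slides describe, using redis-py. The key format and the push path are assumptions for illustration; only the LTRIM 0 99 bounding and the LREM 0 <value> undo come from the slides.

```python
# Hypothetical sketch of the Redis inbox model above (not Instagram's code).
import redis

r = redis.StrictRedis(host="localhost", port=6379)

def push_activity(user_id, serialized_activity):
    """Prepend a thrift-serialized activity and bound the list to 100 entries."""
    key = "inbox:%d" % user_id          # key naming is an assumption
    pipe = r.pipeline()
    pipe.lpush(key, serialized_activity)
    pipe.ltrim(key, 0, 99)              # keep only the newest 100, as on the slide
    pipe.execute()

def undo_activity(user_id, serialized_activity):
    """Remove every occurrence of this exact payload (the undo case)."""
    r.lrem("inbox:%d" % user_id, 0, serialized_activity)
```
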
  • 30. C* data model user_id TimeUUID1 TimeUUID2 ... TimeUUID101 user_id <activity> <activity> ... <activity>
  • 31. Bound the Size user_id TimeUUID1 TimeUUID2 ... TimeUUID101 user_id <activity> <activity> ... <activity> get(<user_id>) delete(<user_id>, columns=[<TimeUUID101>, <TimeUUID102>, <TimeUUID103>, ...])
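
A rough pycassa illustration of that wide-row model and the naive trim (read the row, then delete whatever spills past the cap). The keyspace, pool, and helper names are assumptions; only the column-per-activity layout and the get-then-delete trim come from the slides.

```python
# Illustrative pycassa sketch of the wide-row model and the naive trim.
import time
import pycassa
from pycassa.util import convert_time_to_uuid

pool = pycassa.ConnectionPool("instagram")                  # keyspace name assumed
cf = pycassa.ColumnFamily(pool, "InboxActivitiesByUserID")

def append_activity(user_id, serialized_activity):
    cf.insert(user_id, {convert_time_to_uuid(time.time()): serialized_activity})

def fetch_and_trim(user_id, limit=100):
    # Newest first; over-fetch so we can see what spills past the cap.
    row = cf.get(user_id, column_count=1000, column_reversed=True)
    names = list(row.keys())
    if len(names) > limit:
        # One tombstone per deleted column -- the problem the next slides hit.
        cf.remove(user_id, columns=names[limit:])
    return [row[n] for n in names[:limit]]
```
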
  • 32. The great destroyer of systems shows up. Tombstones abound.
  • 33. user_id TimeUUID1 TimeUUID2 ... TimeUUID101 user_id <activity> <activity> ... <activity> user_id timestamp1 timestamp2 ... timestamp101 TimeUUID = timestamp
  • 34. user_id TimeUUID1 TimeUUID2 ... TimeUUID101 user_id <activity> <activity> ... <activity> user_id timestamp1 timestamp2 ... timestamp101 delete(<user_id>, timestamp=<timestamp101>) Row Delete Deletes any data on a row with a timestamp value equal to or less than the timestamp provided in the delete operation.
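
A hedged sketch of that trick in pycassa: write each column with a timestamp derived from its TimeUUID, then trim with a single row-level delete at the 101st-newest column's timestamp. The microsecond convention and helper names are assumptions, and it presumes the client lets you pass an explicit timestamp to remove().

```python
# Sketch of trim-by-row-delete: one row tombstone instead of many column
# tombstones. Assumes column write timestamps mirror the TimeUUID's time.
import time
import pycassa
from pycassa.util import convert_time_to_uuid, convert_uuid_to_time

pool = pycassa.ConnectionPool("instagram")
cf = pycassa.ColumnFamily(pool, "InboxActivitiesByUserID")

def append_activity(user_id, serialized_activity, now=None):
    now = now or time.time()
    name = convert_time_to_uuid(now)
    cf.insert(user_id, {name: serialized_activity}, timestamp=int(now * 1e6))

def trim(user_id, limit=100):
    row = cf.get(user_id, column_count=limit + 1, column_reversed=True)
    names = list(row.keys())
    if len(names) > limit:
        cutoff = int(convert_uuid_to_time(names[limit]) * 1e6)
        # Row delete: shadows every column with timestamp <= cutoff.
        cf.remove(user_id, timestamp=cutoff)
```
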
  • 35. Optimizes Reads SSTable max_ts=100 SSTable max_ts=200 SSTable max_ts=300 SSTable max_ts=400 SSTable max_ts=500 SSTable max_ts=600 SSTable max_ts=700 SSTable max_ts=800 One SSTable contains a row tombstone with timestamp 350; SSTables whose max timestamp is at or below it can be safely ignored using in-memory metadata.
  • 36. ~10% of actions are undos.
  • 37. Undo Support user_id TimeUUID1 TimeUUID2 ... TimeUUID101 user_id <activity> <activity> ... <activity> get(<user_id>) delete(<user_id>, columns=[<TimeUUID2>])
  • 38. get(<user_id>) delete(<user_id>, columns=[<TimeUUID2>]) Simple Race Condition The state of the row may have changed between these two operations. 💩
  • 39. Diverging Replicas: the writer inserts B ("like Z"), which succeeds on one replica (now [A, B]) but fails on the other (still [A]); the undo then reads [A] from the stale replica, never sees B, and cannot remove the like.
  • 40. SuperColumn = Old/Busted AntiColumn = New/Hotness user_id (0, <TimeUUID>) (1, <TimeUUID>) (1, <TimeUUID>) user_id anti-column activity activity "Anti-Column" Contains an MD5 hash of the activity data it is marking as deleted.
  • 41. user_id (0, <TimeUUID>) (1, <TimeUUID>) (1, <TimeUUID>) user_id anti-column activity activity Composite Column First component is zero for anti-columns, splitting the row into two independent lists, and ensuring the anti-columns always appear at the head.
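
A hedged sketch of the anti-column scheme under that composite layout: component 0 == 0 marks an anti-column whose value is the MD5 of the activity it cancels, component 0 == 1 marks a real activity. It assumes the column family's comparator is CompositeType(IntegerType, TimeUUIDType) so pycassa accepts tuple column names, and the read-side filtering is illustrative only.

```python
# Illustrative anti-column writes and reads (not Instagram's code).
import hashlib
import time
import pycassa
from pycassa.util import convert_time_to_uuid

ANTI, ACTIVITY = 0, 1

pool = pycassa.ConnectionPool("instagram")
cf = pycassa.ColumnFamily(pool, "InboxActivitiesByUserID")

def write_activity(user_id, serialized_activity):
    # serialized_activity is the thrift-serialized bytes
    cf.insert(user_id, {(ACTIVITY, convert_time_to_uuid(time.time())): serialized_activity})

def undo_activity(user_id, serialized_activity):
    # No read-before-write: append an anti-column naming what it cancels.
    digest = hashlib.md5(serialized_activity).digest()
    cf.insert(user_id, {(ANTI, convert_time_to_uuid(time.time())): digest})

def read_inbox(user_id, limit=100):
    row = cf.get(user_id, column_count=1000)   # anti-columns sort to the head
    cancelled = set(value for (kind, _), value in row.items() if kind == ANTI)
    return [value for (kind, _), value in row.items()
            if kind == ACTIVITY
            and hashlib.md5(value).digest() not in cancelled][:limit]
```
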
  • 42. Diverging Replicas: Solved. The writer inserts B ("like Z"); it succeeds on one replica ([A, B, C]) but fails on the other ([A, C]). The undo inserts anti-column C instead of reading first, so both replicas receive C and the like is cancelled everywhere.
  • 43. TAKEAWAY Read-before-write is a smell. Try to model data as a log of user "intent" rather than manhandling the data into place.
  • 44. •Keep 30% "buffer" for trims. •Undo without read. 👍 •Large lists suck for this. 👎 •CASSANDRA-5527
  • 45. Built in two days. Experience pays.
  • 46. Reusability is key to rapid rollout. Great documentation eases concerns.
  • 47. •C* 1.2.3 •vnodes, LeveledCompactionStrategy •12 hi1.4xlarge (8-core, 60GB, SSD) •3 AZs, RF=3, W=2, R=1 •8GB heap, 800MB NewSize
  • 48. 1. Dial up Double Writes 2. Test with "Shadow" Reads 3. Dial up "Real" Reads Rollout
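
The three rollout steps lend themselves to a simple percentage-dial wrapper. The sketch below is an illustration of that pattern, not Instagram's code; the store interfaces, dial names, and mismatch logging are all assumptions.

```python
# Illustrative dial-driven rollout: step 1 doubles writes into Cassandra,
# step 2 issues shadow reads that are compared but never served,
# step 3 serves reads from Cassandra.
import random

def log_mismatch(user_id):
    pass    # placeholder for a metrics counter / statsd increment

class InboxRollout(object):
    def __init__(self, redis_store, cassandra_store, dials):
        self.redis = redis_store
        self.cassandra = cassandra_store
        # e.g. {"double_write": 1.0, "shadow_read": 0.1, "real_read": 0.0}
        self.dials = dials

    def _on(self, dial):
        return random.random() < self.dials.get(dial, 0.0)

    def write(self, user_id, activity):
        self.redis.write(user_id, activity)
        if self._on("double_write"):               # step 1
            self.cassandra.write(user_id, activity)

    def read(self, user_id):
        if self._on("real_read"):                  # step 3
            return self.cassandra.read(user_id)
        result = self.redis.read(user_id)
        if self._on("shadow_read"):                # step 2: compare, don't serve
            if self.cassandra.read(user_id) != result:
                log_mismatch(user_id)
        return result
```
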
  • 49. commit 1c3d99a9e337f9383b093009dba074b8ade20768 Author: Rick Branson Date: Mon May 6 14:58:54 2013 -0700 Bump C* inbox heap size 8G -> 10G, seeing heap pressure
  • 50. Bootstrapping sucked because compacting 10,000 SSTables takes forever. sstable_size_in_mb: 5 => 25
  • 51. Come in on Monday: one of the nodes was unable to flush and had built up 8,000+ commit log segments.
  • 52. "Normal" Rebuild Process 1. /etc/init.d/cassandra stop 2. mv /data/cassandra /data/cassandra.old 3. /etc/init.d/cassandra start
  • 53. For "non-vnode" clusters, best practice is to set the initial_token in cassandra.yaml.
  • 54. For vnode clusters, multiple tokens are selected randomly when a node is bootstrapped.
  • 55. IP address is effectively the "primary key" for nodes in a ring.
  • 56. What had happened was... 1. Rebuilding node generated entirely new tokens and joined the cluster. 2. Rest of cluster dropped the previously stored token data associated with the rebuilding node's IP address. 3. Token ranges shifted massively.
  • 57. UPDATE COLUMN FAMILY InboxActivitiesByUserID WITH read_repair_chance = 1.0; stats.inbox.empty
  • 58. Kicked off "nodetool repair" and waited... and waited...
  • 59. LeveledCompactionStrategy + vnodes = tragedy.
  • 60. kill -3 <cassandra> "AntiEntropyStage:1" java.lang.Thread.State: RUNNABLE <...> at io.sstable.SSTableReader.decodeKey(SSTableReader.java:1014) at io.sstable.SSTableReader.getPosition(SSTableReader.java:802) at io.sstable.SSTableReader.getPosition(SSTableReader.java:717) at io.sstable.SSTableReader.getPositionsForRanges(SSTableReader.java:664) at streaming.StreamOut.createPendingFiles(StreamOut.java:155) at streaming.StreamOut.transferSSTables(StreamOut.java:140) at streaming.StreamingRepairTask.initiateStreaming(StreamingRepairTask.java: at streaming.StreamingRepairTask.run(StreamingRepairTask.java:115) <...> Every repair task was scanning every SSTable file to find ranges to repair.
  • 61. Scan all the things. •Standard Compaction: Only a few dozen SSTables. •Non-VNodes: Repair is done once per token, and there is only one token.
  • 62. ~20X increase in repair performance.
  • 63. TAKEAWAY If you want to use VNodes and LeveledCompactionStrategy, wait until the 1.2.6 release when CASSANDRA-5569 is merged in.
  • 64. Where were we? It was bad to not know the data was inconsistent until we saw an increase in user-reported problems.
  • 65. CASSANDRA-5618 $ nodetool netstats Mode: NORMAL Not sending any streams. Not receiving any streams. Read Repair Statistics: Attempted: 3192520 Mismatch (Blocking): 0 Mismatch (Background): 11584 Pool Name Active Pending Completed Commands n/a 0 1837765727 Responses n/a 1 1750784545 UPDATE COLUMN FAMILY InboxActivitiesByUserID WITH read_repair_chance = 0.01; 99.63% consistent
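
The 99.63% figure on the slide presumably falls out of those read-repair counters; a quick back-of-the-envelope check, assuming consistency is taken as one minus background mismatches over attempted read repairs:

```python
# Back-of-the-envelope consistency estimate from the netstats counters above.
attempted = 3192520
mismatch_background = 11584
consistent = 100.0 * (1 - float(mismatch_background) / attempted)
print("%.2f%% consistent" % consistent)   # ~99.64%, shown on the slide as 99.63%
```
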
  • 66. TAKEAWAY The way to rebuild a box in a vnode cluster is to build a brand new node, then remove the old one with "nodetool removenode."
  • 67. Fetch & Deserialize Time (measured from app) Mean vs P90 (ms), trough-to-peak
  • 68. Column Family: InboxActivitiesByUserID SSTable count: 3264 SSTables in each level: [1, 10, 105/100, 1053/1000, 2095, 0, 0] Space used (live): 80114509324 Space used (total): 80444164726 Memtable Columns Count: 2315159 Memtable Data Size: 112197632 Memtable Switch Count: 1312 Read Count: 316192445 Read Latency: 1.982 ms. Write Count: 1581610760 Write Latency: 0.031 ms. Pending Tasks: 0 Bloom Filter False Positives: 481617 Bloom Filter False Ratio: 0.08558 Bloom Filter Space Used: 54723960 Compacted row minimum size: 25 Compacted row maximum size: 545791 Compacted row mean size: 3020
  • 69. Thank you! We're hiring!