A brief history of Instagram's adoption of the open source distributed database Apache Cassandra, along with details about its use case and implementation. This was presented at the San Francisco Cassandra Meetup at the Disqus HQ in August 2013.
10. The Data
•Boils down to centralized logging
•VERY high skew of writes to reads (1,000:1)
•Ever growing data set
•Durability highly valued
•Dumb to store it in RAM, basically...
18. commit 3f2e4f2e5da6fe99d7f3fc13c0da09b464b3a9e0
Author: Rick Branson
Date: Wed Nov 21 09:50:21 2012 -0800
Drop key cache size on C*UA cluster: was causing heap
issues, and apparently 1GB is _WAY_ outside of the normal
range of operation for nodes of this size.
23. commit 84982635d5c807840d625c22a8bd4407c1879eba
Author: Rick Branson
Date: Thu Jan 31 09:43:56 2013 -0800
Switch Cassandra from tokens to vnodes
commit e990acc5dc69468c8a96a848695fca56e79f8b83
Author: Rick Branson
Date: Sun Feb 10 20:26:32 2013 -0800
We aren't ready for vnodes yet guys
40. Column Delete
[Diagram: row user_id; column names TimeUUID1, TimeUUID2, ..., TimeUUID2; values <activity>, <activity>, ..., [tombstone]; column timestamps timestamp1, timestamp2, ..., timestamp2]
Cassandra internally stores deletes as tombstones, which mark data for a given column as deleted at-or-before a timestamp.
The tombstone timestamp is >= the live column timestamp, so the column will be hidden from queries and compacted away.
41. TimeUUID = timestamp
[Diagram: row user_id; column names TimeUUID1, TimeUUID2, ..., TimeUUID101; values <activity>, <activity>, ..., <activity>; column timestamps timestamp1, timestamp2, ..., timestamp101]
To avoid tombstones, exploit the fact that the timestamp embedded in our TimeUUID (used for ordering) is the same as the column timestamp.
42. Row Delete
[Diagram: row user_id; column names TimeUUID1, TimeUUID2, ..., TimeUUID101; values <activity>, <activity>, ..., <activity>; column timestamps timestamp1, timestamp2, ..., timestamp101]
delete(<user_id>, timestamp=<timestamp101>)
Cassandra can also store row tombstones, which delete all data from a row at-or-before the timestamp provided.
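A minimal sketch of the row-delete trim, as a toy in-memory model rather than a real Cassandra client. The names Row, insert, delete_row, live and trim are hypothetical, and it assumes that a larger timestamp means a newer activity and that each column's name and write timestamp are the same number.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    """Toy model of one Cassandra row; not a real client API."""
    columns: dict = field(default_factory=dict)  # timestamp -> activity
    row_tombstone: int = -1                      # -1 means no row delete yet

    def insert(self, timestamp, activity):
        # Column name and write timestamp are the same number here,
        # mirroring "TimeUUID = timestamp".
        self.columns[timestamp] = activity

    def delete_row(self, timestamp):
        # A row tombstone hides everything written at-or-before it.
        self.row_tombstone = max(self.row_tombstone, timestamp)

    def live(self):
        # Oldest first; columns covered by the tombstone are invisible.
        return [(ts, a) for ts, a in sorted(self.columns.items())
                if ts > self.row_tombstone]


def trim(row, keep):
    """Keep only the newest `keep` columns using a single row delete."""
    live = row.live()
    if len(live) > keep:
        cutoff = live[len(live) - keep - 1][0]  # newest column to drop
        row.delete_row(cutoff)


row = Row()
for ts in range(1, 102):          # 101 activities
    row.insert(ts, f"activity-{ts}")
trim(row, keep=100)
print(len(row.live()))            # 100
```

In this model a single row tombstone replaces one column tombstone per trimmed activity, and a late write carrying an old timestamp stays hidden as well.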
50. "Anti-Column"
SuperColumn = Old/Busted
AntiColumn = New/Hotness
[Diagram: row user_id; composite column names (0, <TimeUUID>), (1, <TimeUUID>), (1, <TimeUUID>); values anti-column, activity, activity]
Borrowing from the idea of Cassandra's by-name tombstones, an anti-column contains an MD5 hash of the activity data "value" it is marking as deleted.
51. Composite Column
[Diagram: row user_id; composite column names (0, <TimeUUID>), (1, <TimeUUID>), (1, <TimeUUID>); values anti-column, activity, activity]
The first component is zero for anti-columns, splitting the row into two independent lists, and ensuring the anti-columns always appear at the head.
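A rough sketch of how a reader might apply anti-columns, assuming the row arrives as a list of ((component, time_id), payload) pairs. The helper names activity_column, anti_column and read_inbox are hypothetical, and plain integers stand in for TimeUUIDs.

```python
import hashlib

ANTI, ACTIVITY = 0, 1  # first composite component; 0 sorts ahead of 1


def activity_column(time_id, value):
    # A normal activity entry: the payload is the activity data itself.
    return ((ACTIVITY, time_id), value)


def anti_column(time_id, value):
    # The payload is the MD5 hash of the activity value being undone.
    return ((ANTI, time_id), hashlib.md5(value).hexdigest())


def read_inbox(columns):
    """Return live activities, skipping any whose hash has an anti-column."""
    deleted = set()
    visible = []
    # Sorting by composite name puts every anti-column at the head of the
    # row, so all deletions are known before the first activity is seen.
    for (kind, time_id), payload in sorted(columns, key=lambda c: c[0]):
        if kind == ANTI:
            deleted.add(payload)
        elif hashlib.md5(payload).hexdigest() not in deleted:
            visible.append((time_id, payload))
    return visible


columns = [
    activity_column(1, b"alice liked your photo"),
    activity_column(2, b"bob commented"),
    anti_column(3, b"alice liked your photo"),  # undo the like
]
print(read_inbox(columns))  # [(2, b'bob commented')]
```

Because the anti-columns sort first, the reader can filter in a single pass over the row.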
53. Diverging Replicas: Solved
[Diagram: Undo Like. Replica [A, B, C]; Replica [A, C]; Writer: insert C; Replica [A, B, C]. Result: OK]
Instead of read-before-write, an anti-column is inserted to mark the activity as deleted.
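A simplified model of why insert-only undo lets replicas converge. It treats each replica as a set of inserted columns and a merge as a set union. That is an assumption made for illustration, not a description of Cassandra's actual reconciliation, and the names apply_write and visible are hypothetical.

```python
def apply_write(replica, column):
    # Every operation, including an undo, is an insert into the set.
    return replica | {column}


def visible(replica):
    # An activity is shown unless an anti-column targets it.
    undone = {target for kind, target in replica if kind == "anti"}
    return sorted(t for kind, t in replica if kind == "like" and t not in undone)


writes = [("like", "A"), ("like", "B"), ("like", "C"), ("anti", "B")]

in_order, reordered = set(), set()
for w in writes:
    in_order = apply_write(in_order, w)
for w in reversed(writes):
    reordered = apply_write(reordered, w)

# A replica that missed the undo still holds [A, B, C] until it is merged.
stale = {("like", "A"), ("like", "B"), ("like", "C")}
merged = stale | in_order

print(visible(in_order), visible(reordered), visible(merged))
```

In this model the inserts commute, so the order in which replicas receive them does not change what a reader sees once they hold the same columns.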
54. TAKEAWAY
Read-before-write is a smell. Try to model data as
a log of user "intent" rather than manhandling the
data into place.
55. •Keep 30% "buffer" for trims.
•Undo without read. (thumbsup)
•Large lists suck for this. (thumbsdown)
•CASSANDRA-5527
62. Monitor Consistency
$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Read Repair Statistics:
Attempted: 3192520
Mismatch (Blocking): 0
Mismatch (Background): 11584
Pool Name Active Pending Completed
Commands n/a 0 1837765727
Responses n/a 1 1750784545
UPDATE COLUMN FAMILY
InboxActivitiesByUserID
WITH read_repair_chance = 0.01;
99.63% consistent
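The consistency figure appears to be the share of attempted read repairs that found no mismatch. The sketch below makes that assumption and parses only the three counters shown in the output above; the function name read_repair_consistency is hypothetical.

```python
import re

NETSTATS = """\
Read Repair Statistics:
Attempted: 3192520
Mismatch (Blocking): 0
Mismatch (Background): 11584
"""


def read_repair_consistency(netstats_text):
    """Fraction of attempted read repairs that found no mismatch."""
    pattern = r"^[ \t]*(Attempted|Mismatch \((?:Blocking|Background)\)):\s*(\d+)"
    counters = dict(re.findall(pattern, netstats_text, flags=re.M))
    attempted = int(counters["Attempted"])
    mismatches = (int(counters["Mismatch (Blocking)"])
                  + int(counters["Mismatch (Background)"]))
    if attempted == 0:
        return None  # nothing sampled yet, so no ratio to report
    return 1 - mismatches / attempted


ratio = read_repair_consistency(NETSTATS)
print(f"{ratio:.4%} consistent")
```

With read_repair_chance = 0.01 only a sample of reads is checked, so the result is an estimate rather than an exact measure.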
63. SSTable Size (again)
Saw lots of GC pressure related to buffer
garbage. Eventually they landed on a new
default in 1.2.9+ (160MB).
sstable_size_in_mb: 25 => 128
65. Fetch & Deserialize Time (measured from app)
Mean vs P90 (ms), trough-to-peak
66. Space used (live): 180114509324
Space used (total): 180444164726
Memtable Columns Count: 2315159
Memtable Data Size: 112197632
Memtable Switch Count: 1312
Read Count: 316192445
Read Latency: 1.982 ms.
Write Count: 1581610760
Write Latency: 0.031 ms.
Pending Tasks: 0
Bloom Filter False Positives: 481617
Bloom Filter False Ratio: 0.08558
Bloom Filter Space Used: 54723960
Compacted row minimum size: 25
Compacted row maximum size: 545791
Compacted row mean size: 3020
68. Exciting Future Things
•Python Native Protocol Driver
•Read CPU Consumption Work
•Mass CQL Adoption
•Triggers
•CAS (for limited use cases)
69. Next 6 Months...
•Node repair visibility & monitoring
•Objects & Associations Storage API on
C* + memcache
•Migrate more from Redis
•New major use case
•Cassandra 2.0?