Cassandra at Instagram (August 2013)

A brief history of Instagram's adoption cycle of the open source distributed database Apache Cassandra, with details about its use case and implementation. Presented at the San Francisco Cassandra Meetup at Disqus HQ in August 2013.

Transcript of "Cassandra at Instagram (August 2013)"

  1. CASSANDRA AT INSTAGRAM. Rick Branson, Infrastructure Engineer, @rbranson. SF Cassandra Meetup, August 29, 2013, Disqus HQ.
  2. September 2012: Redis fillin' up.
  3. What sucks?
  4. THE OBVIOUS: Memory is expensive.
  5. LESS OBVIOUS: In-memory "degrades" poorly.
  6. • Flat namespace. What's in there?
     • Heap fragmentation
     • Single threaded
  7. BGSAVE
  8. The Data
     • Boils down to centralized logging
     • VERY high skew of writes to reads (1,000:1)
     • Ever-growing data set
     • Durability highly valued
     • Dumb to store it in RAM, basically...
  9. The Setup
     • Cassandra 1.1
     • 3 EC2 m1.xlarge (2-core, 15GB RAM)
     • RAIDed ephemerals (1.6TB of SATA)
     • RF=3
     • 6GB Heap, 200MB NewSize
     • HSHA
  10. It worked. Mostly.
  11. The horrible/cool thing about Chef...
  12. commit a1489a34d2aa69316b010146ab5254895f7b9141
      Author: Rick Branson
      Date: Thu Oct 18 20:05:16 2012 -0700
      Follow the rules for Cassandra listen_address so I don't burn a whole day fixing my retarded mistake
  13. commit 41c96f3243a902dd6af4ea29ef6097351a16494a
      Author: Rick Branson
      Date: Tue Oct 30 17:12:00 2012 -0700
      Use 256k JVM stack size for C* -- fixes a bug that got integrated with 1.1.6 packaging + Java 1.6.0_u34+
  14. November 2012: Doubled to 6 nodes. 18,000 connections. Spread those more evenly.
  15. commit 3f2e4f2e5da6fe99d7f3fc13c0da09b464b3a9e0
      Author: Rick Branson
      Date: Wed Nov 21 09:50:21 2012 -0800
      Drop key cache size on C* UA cluster: was causing heap issues, and apparently 1GB is _WAY_ outside of the normal range of operation for nodes of this size.
  16. commit 5926aa5ce69d48e5f2bb7c0d0e86b411645bc786
      Author: Rick Branson
      Date: Mon Dec 24 12:41:13 2012 -0800
      Lower memtable sizes on C* UA cluster to make more room for compression metadata / bloom filters on heap
  17. 1.2.1. It went well. Well... until...
  18. commit 84982635d5c807840d625c22a8bd4407c1879eba
      Author: Rick Branson
      Date: Thu Jan 31 09:43:56 2013 -0800
      Switch Cassandra from tokens to vnodes

      commit e990acc5dc69468c8a96a848695fca56e79f8b83
      Author: Rick Branson
      Date: Sun Feb 10 20:26:32 2013 -0800
      We aren't ready for vnodes yet guys
  19. TAKEAWAY: Let stupid/enterprising, experienced operators who will submit patches take the first few bullets on brand-new major versions.
  20. commit acb02daea57dca889c2aa45963754a271fa51566
      Author: Rick Branson
      Date: Sun Feb 10 20:36:34 2013 -0800
      Doubled C* cluster
  21. commit cc13a4c15ee0051bb7c4e3b13bd6ae56301ac670
      Author: Rick Branson
      Date: Thu Mar 14 16:23:18 2013 -0700
      Subtract token from C*ua7 to replace the node
  22. pycassa exceptions (last 6 months)
  23. • 3.4TB
      • vnode migration still pending
  24. TAKEAWAY: Adopt a technology by understanding what it's best at and letting it do that first, then expand...
  25. • Sharded master/slave Redis
      • 32x68GB (m2.4xlarge)
      • Space (memory) bound
      • Resharding sucks
      • Failover is manual, wakes us up at night
  26. user_id: [ activity, activity, ... ]
  27. user_id: [ activity, activity, ... ] (Thrift Serialized Activity)
  28. Bound the Size:
      user_id: [ activity1, activity2, ... activity100, activity101, ... ]
      LTRIM <user_id> 0 99
  29. Undo:
      user_id: [ activity1, activity2, activity3, ... ]
      LREM <user_id> 0 <activity2>
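A minimal sketch of the Redis pattern in slides 28-29, assuming redis-py 3.x (where lrem takes count before value); the helper names are illustrative, not Instagram's actual code:

    import redis

    r = redis.Redis()
    MAX_FEED = 100  # keep only the newest 100 activities per user

    def push_activity(user_id, activity_blob):
        # Prepend the newest activity, then trim the list to its bound,
        # mirroring LTRIM <user_id> 0 99 from the slide.
        pipe = r.pipeline()
        pipe.lpush(user_id, activity_blob)
        pipe.ltrim(user_id, 0, MAX_FEED - 1)
        pipe.execute()

    def undo_activity(user_id, activity_blob):
        # LREM with count=0 removes every occurrence of the value,
        # mirroring LREM <user_id> 0 <activity2>.
        r.lrem(user_id, 0, activity_blob)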
  30. C* data model:
      user_id -> { TimeUUID1: <activity>, TimeUUID2: <activity>, ..., TimeUUID101: <activity> }
  31. Bound the Size:
      get(<user_id>)
      delete(<user_id>, columns=[<TimeUUID101>, <TimeUUID102>, <TimeUUID103>, ...])
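A pycassa-style sketch of that read-then-delete trim; the keyspace name and pool setup are assumptions (the column family name appears on slide 54):

    import pycassa

    pool = pycassa.ConnectionPool('instagram', server_list=['127.0.0.1:9160'])
    cf = pycassa.ColumnFamily(pool, 'InboxActivitiesByUserID')
    MAX_FEED = 100

    def trim_feed_naive(user_id):
        # Read the row (columns sorted by TimeUUID), then delete by name
        # everything past the first 100. This is the version that creates
        # one tombstone per trimmed column.
        row = cf.get(user_id, column_count=1000)
        overflow = list(row.keys())[MAX_FEED:]
        if overflow:
            cf.remove(user_id, columns=overflow)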
  32. The great destroyer of systems shows up. Tombstones abound.
  33. Column Delete
      user_id | TimeUUID1  | TimeUUID2  | ... | TimeUUID2
              | <activity> | <activity> | ... | [tombstone]
              | timestamp1 | timestamp2 | ... | timestamp2
      Cassandra internally stores deletes as tombstones, which mark data for a given column as deleted at-or-before a timestamp. The column-delete tombstone's timestamp is >= the live column's timestamp, so the column will be hidden from queries and compacted away.
  34. TimeUUID = timestamp
      user_id | TimeUUID1  | TimeUUID2  | ... | TimeUUID101
              | <activity> | <activity> | ... | <activity>
              | timestamp1 | timestamp2 | ... | timestamp101
      To avoid tombstones, exploit that the timestamp embedded in our TimeUUID (the ordering) is the same as the column timestamp.
  35. Row Delete
      delete(<user_id>, timestamp=<timestamp101>)
      Cassandra can also store row tombstones, which delete all data from a row at-or-before the timestamp provided.
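A sketch of the tombstone-free trim this enables, assuming pycassa and that remove accepts an explicit timestamp; one row tombstone at the 101st column's embedded timestamp covers that column and everything older:

    import pycassa
    from pycassa.util import convert_uuid_to_time

    pool = pycassa.ConnectionPool('instagram')
    cf = pycassa.ColumnFamily(pool, 'InboxActivitiesByUserID')
    MAX_FEED = 100

    def trim_feed(user_id):
        # Columns are TimeUUIDs ordered newest-first (reversed comparator
        # assumed). Fetch one column past the bound; if it exists, issue a
        # single ROW delete at that column's embedded timestamp instead of
        # per-column tombstones.
        row = cf.get(user_id, column_count=MAX_FEED + 1)
        if len(row) > MAX_FEED:
            cutoff_uuid = list(row.keys())[-1]
            cutoff_micros = int(convert_uuid_to_time(cutoff_uuid) * 1e6)
            cf.remove(user_id, timestamp=cutoff_micros)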
  36. Optimizes Reads
      SSTables with max_ts = 100, 200, 300, 400, 500, 600, 700, 800.
      One of them contains a row tombstone with timestamp 350; on reads, the SSTables whose max_ts is at or below 350 are safely ignored using in-memory metadata.
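Illustrative-only Python for that skip rule (not Cassandra's actual code): an SSTable whose newest data is at or before the row tombstone's timestamp cannot contribute live columns, so the read path never touches it:

    def sstables_to_read(max_timestamps, row_tombstone_ts):
        # Keep only SSTables that could hold data newer than the tombstone.
        return [ts for ts in max_timestamps if ts > row_tombstone_ts]

    print(sstables_to_read([100, 200, 300, 400, 500, 600, 700, 800], 350))
    # -> [400, 500, 600, 700, 800]; the older SSTables are skipped entirely.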
  37. ~10% of actions are undos.
  38. Undo Support
      user_id | TimeUUID1  | TimeUUID2  | ... | TimeUUID101
              | <activity> | <activity> | ... | <activity>
      get(<user_id>)
      delete(<user_id>, columns=[<TimeUUID2>])
  39. Simple Race Condition: The state of the row may have changed between these two operations. 💩
  40. Like: Diverging Replicas
      Writer inserts B: two replicas now hold [A, B]; the insert FAILs on the third, which still holds [A].
  41. Undo Like: Diverging Replicas
      Writer reads [A] from the replica that is missing B. If a read is required to find B before deleting it, it's going to fail.
  42. SuperColumn = Old/Busted; AntiColumn = New/Hotness
      user_id | (0, <TimeUUID>) | (1, <TimeUUID>) | (1, <TimeUUID>)
              | anti-column     | activity        | activity
      "Anti-Column": borrowing from the idea of Cassandra's by-name tombstones, it contains an MD5 hash of the activity data "value" it is marking as deleted.
  43. Composite Column
      user_id | (0, <TimeUUID>) | (1, <TimeUUID>) | (1, <TimeUUID>)
              | anti-column     | activity        | activity
      The first component is zero for anti-columns, splitting the row into two independent lists and ensuring the anti-columns always appear at the head.
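A hedged pycassa-style sketch of the two write paths, assuming a composite comparator of (int, TimeUUID); the helper names are illustrative:

    import time
    import hashlib
    import pycassa
    from pycassa.util import convert_time_to_uuid

    pool = pycassa.ConnectionPool('instagram')
    cf = pycassa.ColumnFamily(pool, 'InboxActivitiesByUserID')
    ANTI, ACTIVITY = 0, 1  # first composite component

    def add_activity(user_id, activity_blob):
        cf.insert(user_id, {(ACTIVITY, convert_time_to_uuid(time.time())): activity_blob})

    def undo_activity(user_id, activity_blob):
        # No read-before-write: insert an anti-column whose value is the
        # MD5 of the activity it cancels. Component 0 sorts it to the
        # head of the row, ahead of all (1, ...) activity columns.
        digest = hashlib.md5(activity_blob).digest()
        cf.insert(user_id, {(ANTI, convert_time_to_uuid(time.time())): digest})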
  44. Like: Diverging Replicas (Solved)
      Writer inserts B: two replicas hold [A, B]; the insert FAILs on the third, which still holds [A].
  45. Undo Like: Diverging Replicas (Solved)
      Writer inserts anti-column C, which succeeds everywhere: the replicas hold [A, B, C], [A, C], [A, B, C]. Instead of read-before-write, an anti-column is inserted to mark the activity as deleted.
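The read side then reconciles the two lists; a minimal sketch (pure Python, names assumed) of filtering activities against the anti-columns at the head of the row:

    import hashlib

    def live_activities(row):
        # row: {(component, TimeUUID): value} as pycassa would return it,
        # with anti-columns (component 0) sorted first. Collect their MD5
        # digests, then drop any activity whose digest matches one.
        cancelled = set()
        out = []
        for (component, _uuid), value in row.items():
            if component == 0:
                cancelled.add(value)
            elif hashlib.md5(value).digest() not in cancelled:
                out.append(value)
        return out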
  46. TAKEAWAY: Read-before-write is a smell. Try to model data as a log of user "intent" rather than manhandling the data into place.
  47. • Keep 30% "buffer" for trims.
      • Undo without read. (thumbs up)
      • Large lists suck for this. (thumbs down)
      • CASSANDRA-5527
  48. Built in two days. Experience paid off.
  49. Reusability is key to rapid rollout. Great documentation eases concerns.
  50. Initial Setup
      • C* 1.2.3
      • vnodes, LeveledCompactionStrategy
      • 12 hi1.4xlarge (8-core, 60GB, 2T SSD)
      • 3 AZs, RF=3, CL W=TWO R=ONE
      • 8G heap, 800M NewSize
  51. Rollout
      1. Dial up Double Writes
      2. Test with "Shadow" Reads
      3. Dial up "Real" Reads
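A sketch of how such a staged rollout is commonly gated; the percentage knobs and store stubs are assumptions, not Instagram's code:

    import random

    def redis_fetch(user_id):      # stand-in for the legacy store
        return ['activity-from-redis']

    def cassandra_fetch(user_id):  # stand-in for the new store
        return ['activity-from-cassandra']

    DOUBLE_WRITE_PCT = 100  # stage 1: mirror writes to Cassandra
    SHADOW_READ_PCT = 10    # stage 2: read C*, compare, discard
    REAL_READ_PCT = 0       # stage 3: actually serve from C*

    def fetch_feed(user_id):
        roll = random.uniform(0, 100)
        if roll < REAL_READ_PCT:
            return cassandra_fetch(user_id)
        if roll < SHADOW_READ_PCT:
            # "Shadow" read: exercise the new path and log mismatches,
            # but still serve the old store's answer.
            if cassandra_fetch(user_id) != redis_fetch(user_id):
                print('shadow mismatch for', user_id)
        return redis_fetch(user_id)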
  52. commit 1c3d99a9e337f9383b093009dba074b8ade20768
      Author: Rick Branson
      Date: Mon May 6 14:58:54 2013 -0700
      Bump C* inbox heap size 8G -> 10G, seeing heap pressure
  53. Bootstrapping sucked because compacting 10,000 SSTables takes forever.
      sstable_size_in_mb: 5 => 25
  54. Monitor Consistency
      $ nodetool netstats
      Mode: NORMAL
      Not sending any streams.
      Not receiving any streams.
      Read Repair Statistics:
      Attempted: 3192520
      Mismatch (Blocking): 0
      Mismatch (Background): 11584
      Pool Name    Active   Pending   Completed
      Commands     n/a      0         1837765727
      Responses    n/a      1         1750784545
      UPDATE COLUMN FAMILY InboxActivitiesByUserID WITH read_repair_chance = 0.01;
      99.63% consistent
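The "99.63% consistent" figure follows from the read repair counters above; a quick worked check (the slide rounds slightly differently):

    attempted = 3192520
    mismatch_blocking = 0
    mismatch_background = 11584

    rate = (mismatch_blocking + mismatch_background) / float(attempted)
    print('%.2f%% consistent' % (100 * (1 - rate)))
    # -> 99.64% consistent, matching the slide's ~99.63% figure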
  55. SSTable Size (again)
      Saw lots of GC pressure related to buffer garbage. Eventually they landed on a new default in 1.2.9+ (160MB).
      sstable_size_in_mb: 25 => 128
  56. Fetch & Deserialize Time (measured from app): Mean vs P90 (ms), trough-to-peak.
  57. Space used (live): 180114509324
      Space used (total): 180444164726
      Memtable Columns Count: 2315159
      Memtable Data Size: 112197632
      Memtable Switch Count: 1312
      Read Count: 316192445
      Read Latency: 1.982 ms.
      Write Count: 1581610760
      Write Latency: 0.031 ms.
      Pending Tasks: 0
      Bloom Filter False Positives: 481617
      Bloom Filter False Ratio: 0.08558
      Bloom Filter Space Used: 54723960
      Compacted row minimum size: 25
      Compacted row maximum size: 545791
      Compacted row mean size: 3020
  58. Peak Stats
      • 20K 200-column slice reads/sec
      • 30K 1-column mutations/sec
      • 30% CPU utilization
      • 48K clients
  59. Exciting Future Things
      • Python Native Protocol Driver
      • Read CPU Consumption Work
      • Mass CQL Adoption
      • Triggers
      • CAS (for limited use cases)
  60. Next 6 Months...
      • Node repair visibility & monitoring
      • Objects & Associations Storage API on C* + memcache
      • Migrate more from Redis
      • New major use case
      • Cassandra 2.0?
  61. We're hiring!