C* Summit 2013: Cassandra at Instagram by Rick Branson

Speaker: Rick Branson, Infrastructure Engineer at Instagram
Cassandra is a critical part of Instagram's large scale site infrastructure that supports more than 100 million active users. This talk is a practical deep dive into data models, systems architecture, and challenges encountered during the implementation process.

Transcript

  • 1. CASSANDRA AT INSTAGRAM Rick Branson, Infrastructure Engineer @rbranson 2013 Cassandra Summit #cassandra13 June 12, 2013 San Francisco, CA
  • 2. September 2012 Redis fillin' up.
  • 3. What sucks?
  • 4. THE OBVIOUS Memory is expensive.
  • 5. LESS OBVIOUS: In-memory "degrades" poorly
  • 6. •Flat namespace. What's in there? •Heap fragmentation •Single threaded
  • 7. BGSAVE
  • 8. •Boils down to centralized logging •VERY high skew of writes to reads (1,000:1) •Ever growing data set •Durability highly valued The Data
  • 9. • Cassandra 1.1 • 3 EC2 m1.xlarge (2-core, 15GB RAM) • RAIDed ephemerals (1.6TB of SATA) • RF=3 • 6GB Heap, 200MB NewSize • HSHA The Setup
  • 10. It worked. Mostly.
  • 11. The horrible cool thing about Chef...
  • 12. commit a1489a34d2aa69316b010146ab5254895f7b9141 Author: Rick Branson Date: Thu Oct 18 20:05:16 2012 -0700 Follow the rules for Cassandra listen_address so I don't burn a whole day fixing my retarded mistake
  • 13. commit 41c96f3243a902dd6af4ea29ef6097351a16494a Author: Rick Branson Date: Tue Oct 30 17:12:00 2012 -0700 Use 256k JVM stack size for C* -- fixes a bug that got integrated with 1.1.6 packaging + Java 1.6.0_u34+
  • 14. November 2012 Doubled to 6 nodes. 18,000 connections. Spread those more evenly.
  • 15. commit 3f2e4f2e5da6fe99d7f3fc13c0da09b464b3a9e0 Author: Rick Branson Date: Wed Nov 21 09:50:21 2012 -0800 Drop key cache size on C*UA cluster: was causing heap issues, and apparently 1GB is _WAY_ outside of the normal range of operation for nodes of this size.
  • 16. commit 5926aa5ce69d48e5f2bb7c0d0e86b411645bc786 Author: Rick Branson Date: Mon Dec 24 12:41:13 2012 -0800 Lower memtable sizes on C* UA cluster to make more room for compression metadata / bloom filters on heap
  • 17. 1.2.1. It went well. Well... until...
  • 18. commit 84982635d5c807840d625c22a8bd4407c1879eba Author: Rick Branson Date: Thu Jan 31 09:43:56 2013 -0800 Switch Cassandra from tokens to vnodes commit e990acc5dc69468c8a96a848695fca56e79f8b83 Author: Rick Branson Date: Sun Feb 10 20:26:32 2013 -0800 We aren't ready for vnodes yet guys
  • 19. TAKEAWAY Let stupid enterprising, experienced operators who will submit patches take the first few bullets on brand-new major versions.
  • 20. commit acb02daea57dca889c2aa45963754a271fa51566 Author: Rick Branson Date: Sun Feb 10 20:36:34 2013 -0800 Doubled C* cluster
  • 21. commit cc13a4c15ee0051bb7c4e3b13bd6ae56301ac670 Author: Rick Branson Date: Thu Mar 14 16:23:18 2013 -0700 Subtract token from C*ua7 to replace the node
  • 22. pycassa exceptions (last 6 months)
  • 23. •3.4TB •Will try vnode migration again soon...
  • 24. TAKEAWAY Adopt a technology by understanding what it's best at and letting it do that first, then expand...
  • 25. •Sharded Redis •32x68GB (m2.4xlarge) •Space (memory) bound •Resharding sucks •Let's get some better availability...
  • 26. user_id: [ activity, activity, ... ]
  • 27. user_id: [ activity, activity, ... ] Thrift Serialized Activity
  • 28. Bound the Size user_id: [ activity1, activity2, ... activity100, activity101, ... ] LTRIM <user_id> 0 99
  • 29. Undo user_id: [ activity1, activity2, activity3, ... ] LREM <user_id> 0 <activity2>
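
For concreteness, a minimal sketch of the Redis scheme the last few slides describe, using redis-py. The key format and the push path are assumptions for illustration; only the LTRIM 0 99 bounding and the LREM 0 <value> undo come from the slides.

```python
# Hypothetical sketch of the Redis inbox model above (not Instagram's code).
import redis

r = redis.StrictRedis(host="localhost", port=6379)

def push_activity(user_id, serialized_activity):
    """Prepend a thrift-serialized activity and bound the list to 100 entries."""
    key = "inbox:%d" % user_id          # key naming is an assumption
    pipe = r.pipeline()
    pipe.lpush(key, serialized_activity)
    pipe.ltrim(key, 0, 99)              # keep only the newest 100, as on the slide
    pipe.execute()

def undo_activity(user_id, serialized_activity):
    """Remove every occurrence of this exact payload (the undo case)."""
    r.lrem("inbox:%d" % user_id, 0, serialized_activity)
```
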
  • 30. C* data model user_id TimeUUID1 TimeUUID2 ... TimeUUID101 user_id <activity> <activity> ... <activity>
  • 31. Bound the Size user_id TimeUUID1 TimeUUID2 ... TimeUUID101 user_id <activity> <activity> ... <activity> get(<user_id>) delete(<user_id>, columns=[<TimeUUID101>, <TimeUUID102>, <TimeUUID103>, ...])
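
A rough pycassa illustration of that wide-row model and the naive trim (read the row, then delete whatever spills past the cap). The keyspace, pool, and helper names are assumptions; only the column-per-activity layout and the get-then-delete trim come from the slides.

```python
# Illustrative pycassa sketch of the wide-row model and the naive trim.
import time
import pycassa
from pycassa.util import convert_time_to_uuid

pool = pycassa.ConnectionPool("instagram")                  # keyspace name assumed
cf = pycassa.ColumnFamily(pool, "InboxActivitiesByUserID")

def append_activity(user_id, serialized_activity):
    cf.insert(user_id, {convert_time_to_uuid(time.time()): serialized_activity})

def fetch_and_trim(user_id, limit=100):
    # Newest first; over-fetch so we can see what spills past the cap.
    row = cf.get(user_id, column_count=1000, column_reversed=True)
    names = list(row.keys())
    if len(names) > limit:
        # One tombstone per deleted column -- the problem the next slides hit.
        cf.remove(user_id, columns=names[limit:])
    return [row[n] for n in names[:limit]]
```
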
  • 32. The great destroyer of systems shows up. Tombstones abound.
  • 33. user_id TimeUUID1 TimeUUID2 ... TimeUUID101 user_id <activity> <activity> ... <activity> user_id timestamp1 timestamp2 ... timestamp101 TimeUUID = timestamp
  • 34. user_id TimeUUID1 TimeUUID2 ... TimeUUID101 user_id <activity> <activity> ... <activity> user_id timestamp1 timestamp2 ... timestamp101 delete(<user_id>, timestamp=<timestamp101>) Row Delete Deletes any data on a row with a timestamp value equal to or less than the timestamp provided in the delete operation.
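
A hedged sketch of that trick in pycassa: write each column with a timestamp derived from its TimeUUID, then trim with a single row-level delete at the 101st-newest column's timestamp. The microsecond convention and helper names are assumptions, and it presumes the client lets you pass an explicit timestamp to remove().

```python
# Sketch of trim-by-row-delete: one row tombstone instead of many column
# tombstones. Assumes column write timestamps mirror the TimeUUID's time.
import time
import pycassa
from pycassa.util import convert_time_to_uuid, convert_uuid_to_time

pool = pycassa.ConnectionPool("instagram")
cf = pycassa.ColumnFamily(pool, "InboxActivitiesByUserID")

def append_activity(user_id, serialized_activity, now=None):
    now = now or time.time()
    name = convert_time_to_uuid(now)
    cf.insert(user_id, {name: serialized_activity}, timestamp=int(now * 1e6))

def trim(user_id, limit=100):
    row = cf.get(user_id, column_count=limit + 1, column_reversed=True)
    names = list(row.keys())
    if len(names) > limit:
        cutoff = int(convert_uuid_to_time(names[limit]) * 1e6)
        # Row delete: shadows every column with timestamp <= cutoff.
        cf.remove(user_id, timestamp=cutoff)
```
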
  • 35. Optimizes Reads SSTable max_ts=100 SSTable max_ts=200 SSTable max_ts=300 SSTable max_ts=400 SSTable max_ts=500 SSTable max_ts=600 SSTable max_ts=700 SSTable max_ts=800 One SSTable contains a row tombstone with timestamp 350; SSTables whose max timestamp is at or below it can be safely ignored using in-memory metadata.
  • 36. ~10% of actions are undos.
  • 37. Undo Support user_id TimeUUID1 TimeUUID2 ... TimeUUID101 user_id <activity> <activity> ... <activity> get(<user_id>) delete(<user_id>, columns=[<TimeUUID2>])
  • 38. get(<user_id>) delete(<user_id>, columns=[<TimeUUID2>]) Simple Race Condition The state of the row may have changed between these two operations. 💩
  • 39. Diverging Replicas: the writer inserts B ("like Z"), which succeeds on one replica (now [A, B]) but fails on the other (still [A]); the undo then reads [A] from the stale replica, never sees B, and cannot remove the like.
  • 40. SuperColumn = Old/Busted AntiColumn = New/Hotness user_id (0, <TimeUUID>) (1, <TimeUUID>) (1, <TimeUUID>) user_id anti-column activity activity "Anti-Column" Contains an MD5 hash of the activity data it is marking as deleted.
  • 41. user_id (0, <TimeUUID>) (1, <TimeUUID>) (1, <TimeUUID>) user_id anti-column activity activity Composite Column First component is zero for anti-columns, splitting the row into two independent lists, and ensuring the anti-columns always appear at the head.
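
A hedged sketch of the anti-column scheme under that composite layout: component 0 == 0 marks an anti-column whose value is the MD5 of the activity it cancels, component 0 == 1 marks a real activity. It assumes the column family's comparator is CompositeType(IntegerType, TimeUUIDType) so pycassa accepts tuple column names, and the read-side filtering is illustrative only.

```python
# Illustrative anti-column writes and reads (not Instagram's code).
import hashlib
import time
import pycassa
from pycassa.util import convert_time_to_uuid

ANTI, ACTIVITY = 0, 1

pool = pycassa.ConnectionPool("instagram")
cf = pycassa.ColumnFamily(pool, "InboxActivitiesByUserID")

def write_activity(user_id, serialized_activity):
    # serialized_activity is the thrift-serialized bytes
    cf.insert(user_id, {(ACTIVITY, convert_time_to_uuid(time.time())): serialized_activity})

def undo_activity(user_id, serialized_activity):
    # No read-before-write: append an anti-column naming what it cancels.
    digest = hashlib.md5(serialized_activity).digest()
    cf.insert(user_id, {(ANTI, convert_time_to_uuid(time.time())): digest})

def read_inbox(user_id, limit=100):
    row = cf.get(user_id, column_count=1000)   # anti-columns sort to the head
    cancelled = set(value for (kind, _), value in row.items() if kind == ANTI)
    return [value for (kind, _), value in row.items()
            if kind == ACTIVITY
            and hashlib.md5(value).digest() not in cancelled][:limit]
```
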
  • 42. Diverging Replicas: Solved. The writer inserts B ("like Z"); it succeeds on one replica ([A, B, C]) but fails on the other ([A, C]). The undo inserts anti-column C instead of reading first, so both replicas receive C and the like is cancelled everywhere.
  • 43. TAKEAWAY Read-before-write is a smell. Try to model data as a log of user "intent" rather than manhandling the data into place.
  • 44. •Keep 30% "buffer" for trims. •Undo without read. 👍 •Large lists suck for this. 👎 •CASSANDRA-5527
  • 45. Built in two days. Experience pays.
  • 46. Reusability is key to rapid rollout. Great documentation eases concerns.
  • 47. •C* 1.2.3 •vnodes, LeveledCompactionStrategy •12 hi1.4xlarge (8-core, 60GB, SSD) •3 AZs, RF=3, W=2, R=1 •8GB heap, 800MB NewSize
  • 48. 1. Dial up Double Writes 2. Test with "Shadow" Reads 3. Dial up "Real" Reads Rollout
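
The three rollout steps lend themselves to a simple percentage-dial wrapper. The sketch below is an illustration of that pattern, not Instagram's code; the store interfaces, dial names, and mismatch logging are all assumptions.

```python
# Illustrative dial-driven rollout: step 1 doubles writes into Cassandra,
# step 2 issues shadow reads that are compared but never served,
# step 3 serves reads from Cassandra.
import random

def log_mismatch(user_id):
    pass    # placeholder for a metrics counter / statsd increment

class InboxRollout(object):
    def __init__(self, redis_store, cassandra_store, dials):
        self.redis = redis_store
        self.cassandra = cassandra_store
        # e.g. {"double_write": 1.0, "shadow_read": 0.1, "real_read": 0.0}
        self.dials = dials

    def _on(self, dial):
        return random.random() < self.dials.get(dial, 0.0)

    def write(self, user_id, activity):
        self.redis.write(user_id, activity)
        if self._on("double_write"):               # step 1
            self.cassandra.write(user_id, activity)

    def read(self, user_id):
        if self._on("real_read"):                  # step 3
            return self.cassandra.read(user_id)
        result = self.redis.read(user_id)
        if self._on("shadow_read"):                # step 2: compare, don't serve
            if self.cassandra.read(user_id) != result:
                log_mismatch(user_id)
        return result
```
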
  • 49. commit 1c3d99a9e337f9383b093009dba074b8ade20768 Author: Rick Branson Date: Mon May 6 14:58:54 2013 -0700 Bump C* inbox heap size 8G -> 10G, seeing heap pressure
  • 50. Bootstrapping sucked because compacting 10,000 SSTables takes forever. sstable_size_in_mb: 5 => 25
  • 51. Come in on Monday: one of the nodes was unable to flush and had built up 8,000+ commit log segments.
  • 52. "Normal" Rebuild Process 1. /etc/init.d/cassandra stop 2. mv /data/cassandra /data/cassandra.old 3. /etc/init.d/cassandra start
  • 53. For "non-vnode" clusters, best practice is to set the initial_token in cassandra.yaml.
  • 54. For vnode clusters, multiple tokens are selected randomly when a node is bootstrapped.
  • 55. IP address is effectively the "primary key" for nodes in a ring.
  • 56. What had happened was... 1. Rebuilding node generated entirely new tokens and joined the cluster. 2. Rest of cluster dropped the previously stored token data associated with the rebuilding node's IP address. 3. Token ranges shifted massively.
  • 57. UPDATE COLUMN FAMILY InboxActivitiesByUserID WITH read_repair_chance = 1.0; stats.inbox.empty
  • 58. Kicked off "nodetool repair" and waited... and waited...
  • 59. LeveledCompactionStrategy + vnodes = tragedy.
  • 60. kill -3 <cassandra> "AntiEntropyStage:1" java.lang.Thread.State: RUNNABLE <...> at io.sstable.SSTableReader.decodeKey(SSTableReader.java:1014) at io.sstable.SSTableReader.getPosition(SSTableReader.java:802) at io.sstable.SSTableReader.getPosition(SSTableReader.java:717) at io.sstable.SSTableReader.getPositionsForRanges(SSTableReader.java:664) at streaming.StreamOut.createPendingFiles(StreamOut.java:155) at streaming.StreamOut.transferSSTables(StreamOut.java:140) at streaming.StreamingRepairTask.initiateStreaming(StreamingRepairTask.java: at streaming.StreamingRepairTask.run(StreamingRepairTask.java:115) <...> Every repair task was scanning every SSTable file to find ranges to repair.
  • 61. Scan all the things. •Standard Compaction: Only a few dozen SSTables. •Non-VNodes: Repair is done once per token, and there is only one token.
  • 62. ~20X increase in repair performance.
  • 63. TAKEAWAY If you want to use VNodes and LeveledCompactionStrategy, wait until the 1.2.6 release when CASSANDRA-5569 is merged in.
  • 64. Where were we? It was bad to not know the data was inconsistent until we saw an increase in user-reported problems.
  • 65. CASSANDRA-5618 $ nodetool netstats Mode: NORMAL Not sending any streams. Not receiving any streams. Read Repair Statistics: Attempted: 3192520 Mismatch (Blocking): 0 Mismatch (Background): 11584 Pool Name Active Pending Completed Commands n/a 0 1837765727 Responses n/a 1 1750784545 UPDATE COLUMN FAMILY InboxActivitiesByUserID WITH read_repair_chance = 0.01; 99.63% consistent
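
The 99.63% figure on the slide presumably falls out of those read-repair counters; a quick back-of-the-envelope check, assuming consistency is taken as one minus background mismatches over attempted read repairs:

```python
# Back-of-the-envelope consistency estimate from the netstats counters above.
attempted = 3192520
mismatch_background = 11584
consistent = 100.0 * (1 - float(mismatch_background) / attempted)
print("%.2f%% consistent" % consistent)   # ~99.64%, shown on the slide as 99.63%
```
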
  • 66. TAKEAWAY The way to rebuild a box in a vnode cluster is to build a brand new node, then remove the old one with "nodetool removenode."
  • 67. Fetch & Deserialize Time (measured from app) Mean vs P90 (ms), trough-to-peak
  • 68. Column Family: InboxActivitiesByUserID SSTable count: 3264 SSTables in each level: [1, 10, 105/100, 1053/1000, 2095, 0, 0] Space used (live): 80114509324 Space used (total): 80444164726 Memtable Columns Count: 2315159 Memtable Data Size: 112197632 Memtable Switch Count: 1312 Read Count: 316192445 Read Latency: 1.982 ms. Write Count: 1581610760 Write Latency: 0.031 ms. Pending Tasks: 0 Bloom Filter False Positives: 481617 Bloom Filter False Ratio: 0.08558 Bloom Filter Space Used: 54723960 Compacted row minimum size: 25 Compacted row maximum size: 545791 Compacted row mean size: 3020
  • 69. Thank you! We're hiring!