This document summarizes Librato's experience migrating its Cassandra infrastructure from Amazon EC2 instance storage to Amazon EBS volumes, Elastic Network Interfaces (ENIs), and Amazon VPC. The migration improved performance, cut real-time ring costs by 35%, simplified operations by reducing maintenance time, and added flexibility and capacity headroom for scaling. Key steps included testing configurations against production traffic, tracking down write timeouts, optimizing commitlog storage, and tuning disk access modes (mmap vs. standard I/O).
11. Librato circa 2015
●Cassandra 2.0.11 + patches
●i2.2xlarge
• Instance type preferred by DataStax and Netflix
●160 instances
●Never Amazon EBS – only instance store
●RAID 0 over instance store (1.5 TB)
12. Operational challenges
●CPU/cost ratio low on i2.2xlarge
• Kept rings hot to maximize efficiency
●Persistent data tied to instance
• Long MTTR to stream large data sets
●Instance store capped maximum data volume size
• Had to scale rings out just for data capacity
15. Enter Amazon VPC migration – 3Q 2015
●Librato moving Classic → Amazon VPC
●Code all the things: SaltStack / Terraform / Flask
●Opportunity to overhaul Cassandra
●Emboldened by CrowdStrike's Amazon EBS talk @ re:Invent 2015
• We can do this!
• Anticipating a big win
21. Write timeouts – When did this start?
[Timeline: write timeouts started Feb 2016; bisected in March between the last known-good tested version and the first bad build]
22. Write timeouts – Found it!
● No timeouts in C* 2.1.4
● Appear at 2.1.5
[Timeline: timeouts traced to the 2.1.4 → 2.1.5 upgrade]
23. Write timeouts – CASSANDRA-11302
24. EBS metrics
● Spent a lot of time second-guessing EBS performance
● No metrics for GP2 burst-credit scheduling
● EBS CloudWatch metrics only at 5-minute resolution
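For reference, a minimal boto3 sketch of pulling those EBS metrics; the volume ID and region are placeholders, and Period=300 reflects the 5-minute floor the slide calls out:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)

# Volume ID and region are placeholders.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="VolumeWriteOps",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],
    StartTime=end - timedelta(hours=6),
    EndTime=end,
    Period=300,  # 300 s: the finest granularity basic EBS monitoring offers
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    # VolumeWriteOps is a count per period; divide by seconds for avg IOPS.
    print(point["Timestamp"], point["Sum"] / 300.0)
```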
26. Commitlog scaling – April 2016
● Started with a 200 GB GP2 volume
● 600 IOPS max
● Hit bottlenecks during tests (15-30 min+)
● Workaround
• Bumped to a 1 TB commitlog volume (3K IOPS)
• Tested sharing the commitlog on the data disk
[Timeline: commitlog scaling began April 2016]
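The 600 and 3,000 IOPS figures follow from GP2's provisioning rule of 3 IOPS per GB. A small sketch with a hypothetical helper name (limits as of 2016); note that gp2 volumes under 1 TB can also burst to 3,000 IOPS from a credit bucket, which is consistent with bottlenecks that only show up 15-30 minutes into a test:

```python
def gp2_baseline_iops(size_gb):
    """Hypothetical helper: gp2 baseline is 3 IOPS/GB, floor 100,
    capped at 10,000 (the limit circa 2016)."""
    return min(max(3 * size_gb, 100), 10_000)

print(gp2_baseline_iops(200))   # 600  -> the original commitlog volume
print(gp2_baseline_iops(1000))  # 3000 -> the 1 TB workaround volume
```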
28. New commitlog config
● Throughput Optimized HDD (st1)
● Use a 600 GB st1 partition
● Cost <50% of GP2
● Commitlog separate from data
[Timeline: st1 GA April 9; st1 testing through May]
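Provisioning that volume is a one-liner with boto3; a sketch with placeholder AZ and tags. st1 is provisioned for throughput rather than IOPS (baseline 40 MB/s per TB), which suits the commitlog's append-only, sequential write pattern:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 600 GB Throughput Optimized HDD for the commitlog; AZ and tag are placeholders.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=600,
    VolumeType="st1",
    TagSpecifications=[{
        "ResourceType": "volume",
        "Tags": [{"Key": "role", "Value": "cassandra-commitlog"}],
    }],
)
print(volume["VolumeId"])
```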
29. Ring connection timeouts – June 2016
● Added our read load
● Small message drops
● Slow start that grew to collapse of the ring
● Rolling restarts fixed it for a day
[Timeline: connection timeouts began June 2016]
31. Ring connection timeouts
● Called The Last Pickle
• Fix: otc_coalescing_strategy: DISABLED
33. Production reached! – July 2016
We’ll live with more network traffic for now
34. Librato today
Split ring configurations:

            Real Time                      Long Retention
Retention   One week                       Over a year
Instance    c4.4xlarge                     m4.2xlarge
Data        EBS 2 TB GP2 data partition    EBS 4 TB GP2 data partition
Commitlog   EBS 600 GB ST1 commitlog       EBS 600 GB ST1 commitlog
36. Real-time rings: before and after
Before
● 120 * i2.2xlarge
● Instance cost: $62k monthly*
After
● 66 * c4.4xlarge
● 2 TB GP2 + 600 GB ST1
● Instance cost: $25k monthly*
● EBS cost: $15k monthly
● Total: $40k monthly
Total savings: 35%
(*) 1-year up-front pricing
37. Long retention rings: before and after
Before
● 36 * i2.2xlarge
● Instance cost: $19k monthly*
After
● 30 * m4.2xlarge
● 4 TB GP2 + 600 GB ST1
● Instance cost: $6k monthly*
● EBS cost: $13k monthly
● Total: $19k monthly
Even cost, 2x+ more disk capacity
(*) 1-year up-front pricing
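A quick check of the arithmetic behind both comparisons, using the monthly figures straight from the two slides:

```python
# Monthly figures from the two slides; (*) = 1-year up-front pricing.
rt_before = 62_000                 # 120 * i2.2xlarge
rt_after = 25_000 + 15_000         # 66 * c4.4xlarge instances + EBS
print(f"real-time savings: {1 - rt_after / rt_before:.0%}")  # -> 35%

lr_before = 19_000                 # 36 * i2.2xlarge
lr_after = 6_000 + 13_000          # 30 * m4.2xlarge instances + EBS
print(f"long-retention monthly delta: ${lr_after - lr_before}")  # -> $0, even
```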
39. Reducing MTTR
● Two critical pieces of state for Cassandra:
• Data files (commitlog and sstables)
• Network interface
● Data now on EBS
● ENI provides a detachable IP address (Amazon VPC only)
● Mobility provides a lot of flexibility
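In practice that mobility means a replacement instance can adopt a dead node's volumes and ring IP. A minimal boto3 sketch; the IDs, device names, and helper name are ours, not Librato's tooling:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def adopt_node_state(instance_id, data_vol, commitlog_vol, eni_id):
    """Hypothetical helper: attach a dead node's volumes and ring IP
    to a replacement instance. Device names must match the image's
    mount configuration."""
    ec2.attach_volume(InstanceId=instance_id, VolumeId=data_vol,
                      Device="/dev/xvdf")
    ec2.attach_volume(InstanceId=instance_id, VolumeId=commitlog_vol,
                      Device="/dev/xvdg")
    # The ENI carries the node's IP, so the replacement rejoins the
    # ring with the same identity: no token movement, no streaming.
    ec2.attach_network_interface(NetworkInterfaceId=eni_id,
                                 InstanceId=instance_id, DeviceIndex=1)
```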
40. Bring them up, bring them down
● New rings now come up fast
● Easier to automate
When not in use
● Shut down nodes
● Park the disks
… Or just destroy them
41. We’ve grown up: managing resources with Terraform
● Query Terraform for state
● Create EBS
● Create ENI
● Create Security groups
42. Organized to let us remove resources
Keeps us from being cloud hoarders. When we’re done with a ring, we
can remove resources easily.
● Remove EBS
● Remove ENI
● Remove Security groups
Snapshots are still available in case we need them.
43. We launch rings with SaltStack
● Launch instances
● Attach EBS and ENI
● Configure rings
● Augment with the Salt API
● Clear guardrails built into the process
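A sketch of what "augment with the Salt API" can look like from the master once Terraform has created the volumes, ENIs, and security groups; the target glob and state name are hypothetical:

```python
import salt.client

# Target the new ring's minions and apply the Cassandra state.
local = salt.client.LocalClient()
result = local.cmd("ring-realtime-*", "state.apply", ["cassandra"])
for minion, ret in result.items():
    print(minion, "ok" if ret else "failed")
```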
45. Disaster recovery
● Previously
• Tablesnap to Amazon S3
• Required constant pruning (tablechop)
• High Amazon S3 bill
● Now
• EBS snapshots
• Cron job to snapshot EBS via Ops API
• Cron job to clean old snapshots via Ops API
• Snapshots store only block differences: no pruning needed
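The "Ops API" in the deck is Librato-internal; as a stand-in, the two cron jobs look roughly like this with direct boto3 calls (the tag key and retention window are assumptions):

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2", region_name="us-east-1")
RETAIN_DAYS = 14  # retention window is an assumption

def snapshot_ring_volumes():
    # Snapshot every volume tagged as Cassandra data; tag key is hypothetical.
    vols = ec2.describe_volumes(
        Filters=[{"Name": "tag:role", "Values": ["cassandra-data"]}])
    for vol in vols["Volumes"]:
        ec2.create_snapshot(VolumeId=vol["VolumeId"],
                            Description="cassandra nightly")

def prune_old_snapshots():
    # EBS snapshots are block-incremental, so deleting old ones is cheap
    # and there is no per-sstable pruning to do.
    cutoff = datetime.now(timezone.utc) - timedelta(days=RETAIN_DAYS)
    snaps = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "description", "Values": ["cassandra nightly"]}])
    for snap in snaps["Snapshots"]:
        if snap["StartTime"] < cutoff:
            ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])
```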
46. In-place ring scale up
Now we have a button to push
● Sudden load change
● Rolling operation to scale up instances
● Example: scale from c4.4xl → c4.8xl
Once comfortable with capacity, we can still scale the ring out with bootstrap
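One node's worth of the rolling operation, sketched with boto3; because data lives on EBS and the ring IP on an ENI, both survive the stop/start. The helper name is ours; drain Cassandra first:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def resize_node(instance_id, new_type="c4.8xlarge"):
    """Hypothetical helper: stop, change instance type, start.
    Run nodetool drain on the node before calling this; the EBS
    volumes and ENI stay attached across the stop/start."""
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(InstanceId=instance_id,
                                  InstanceType={"Value": new_type})
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```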
48. Disk access mode
● MMap (4 KB faults) and Standard (read/write syscalls)
● MMap works well for small, random-access row reads
● Read ahead kept small for performant small reads
● Large compaction operations are sequential I/O
● What does this mean?
51. This impacts cost
● We must provision for a high baseline IOPS load
● Disk size much larger than used capacity
● EBS GP2 counts I/O in units of up to 256 KB
● What can be done?
52. Hybrid disk access mode
● MMap reads for row queries
● Standard mode (read/write) during compaction
● Ensure reads are chunked
● Chunk size configurable per disk (e.g., 256 KB for GP2)
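Cassandra's actual implementation is Java; the sketch below is only a Python illustration of the two access patterns and of why chunking matters on GP2 (names and constants are ours, mmap flags are POSIX-only):

```python
import mmap

CHUNK = 256 * 1024  # match GP2's 256 KB I/O unit: one read per IOP

def row_read(path, offset, length):
    # Row queries: mmap and touch only the pages holding the row, so a
    # small read costs a few 4 KB page faults (readahead kept small).
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            return mm[offset:offset + length]

def compaction_scan(path):
    # Compaction: sequential buffered reads in 256 KB chunks, so the same
    # bytes cost roughly 64x fewer I/O operations than 4 KB faults would.
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            yield chunk
```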
55. Wrapup
● Make it easy to test with production traffic
● Instance flexibility with EBS
● Operational simplicity and reduced MTTR
● Reduced cost and increased headroom
Future
● Debug network coalescing
● Cassandra 3.0
● More testing of hybrid disk access modes