PagerDuty: Span the WAN? Yes you can!

Most Cassandra usages take advantage of its exceptional performance and ability to handle massive data sets. At PagerDuty, we use Cassandra for entirely different reasons: to reliably manage mutable application states and to maintain durability requirements even in the face of full data center outages. We achieve this by deploying Cassandra clusters with hosts in multiple WAN-separated data centers, configured with per-data center replica placement requirements, and with significant application-level support to use Cassandra as a consistent datastore. Over several years of experience with this approach, we've learned how to accommodate the impact of WAN network latency on Cassandra queries, how to scale horizontally while maintaining our placement invariants, why nodes in different data centers experience asymmetric load, and more. This talk will go over our workload and design goals, detail the resultant Cassandra system design, and explain a number of our unintuitive operational learnings about this novel Cassandra usage paradigm.

Published in: Technology

  1. 2015-10-01 Span the WAN? Yes you can! paul@pagerduty.com #CassandraSummit
  2. MAKING PAGERDUTY MORE RELIABLE USING PXC Span the WAN. Why?
  4. SPAN THE WAN? YES YOU CAN!
  5. PagerDuty: some history
       • Monolithic Ruby on Rails + MySQL
       • Hosted in AWS us-east-1
       • AWS outages in 2010 and 2011
       • …including correlated multi-AZ failures
       • PagerDuty was heavily impacted
       • Needed resiliency to this failure mode
  6. Design goals
       • Continuity during a DC drop (AZ or Region)
       • No operator intervention
       • Can’t lose data
       • Can’t delay data (shelf life)
       • Timely notifications, always
       • Measured in tens of seconds
  7. Design decisions
       • Masterless: peer-based & clustered
       • Can’t tolerate staleness: synchronous WAN replication
       • Manage state: consistent reads
       • Opted to use Cassandra
       • …despite many of Cassandra’s features not being relevant
  8. How Cassandra is often used
       • Massive throughput
       • Lots of data
       • Horizontally scalable
       • Eventually consistent
       • High write:read ratio
       • High performance individual operations
  9. Essential Cassandra features for PagerDuty
       • Quorum operations
       • Tuneable consistency
       • Synchronous WAN replication
  10. WAN-spanning system design
  11. System architecture
       • Shared cross-DC datastore (Cassandra)
       • Distributed Coordination (ZooKeeper)
       • Clustered Application
  12. Quorum consistency systems
       • Each item replicated N times
       • Writes: require W of N replicas
       • Reads: require R of N replicas
       • W + R <= N: read can miss a write
       • W + R > N: read can’t miss a write
       [Diagram: WRITE and READ quorums]
  13. Cassandra setup
       • Replication factor: N=5
       • Three DCs
       • DC-aware placement strategy
       • W=3: all writes hit multiple DCs
       • R=3: all reads hit multiple DCs
       • 3 + 3 > 5: consistent reads
       [Diagram: Cass 1 to Cass 5 placed across DC-A, DC-B, and DC-C]
  14. Data layer summary
       • Data safe against DC failure
       • Consistent reads (of acknowledged writes)
       • Expensive multi-DC writes & reads
       • Managing state: No ACID transactions!
       • Enforce “transactions” in the application layer
  15. Application layer: “transactions”
       • Sequence of logic and Cassandra operations
       • Implement sequence as idempotent
       • Failure is not an option
       • Enforce transaction ordering
       • Expect (some) (transient) inconsistencies
  16. Tales from production
  17. What about the network?
       • Network diversity limits DC choices
       • Result? Uneven network latencies
       [Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC latencies 24 ms, 24 ms, 3 ms]
  18. …and how you should think of the network
       [Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC latencies 24 ms, 24 ms, 3 ms]
  19. Reads and writes
       [Diagram: Client and replicas R1 to R5 across DC-A, DC-B, and DC-C]
  20. Read and write performance
       • R and W = 3 means always hitting replicas in two DCs (by design)
       • Reads coordinated from DC-B or DC-C nodes will take >3 ms
       • Reads coordinated from DC-A nodes will take >24 ms
       [Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC latencies 24 ms, 24 ms, 3 ms]
  21. Another latency effect? Per-node read volume
  22. Per-node read volume: why so skewed?
  23. Writes: Which replicas are involved? All 5
       [Diagram: Client and replicas R1 to R5 across DC-A, DC-B, and DC-C]
  24. Writes: per-node volume
       • N=5, so there is a write op on each replica
       • All replicas experience the same per-node write load
       [Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC latencies 24 ms, 24 ms, 3 ms]
  25. Reads: Which replicas are involved? Only 3!
       [Diagram: Client and replicas R1 to R5 across DC-A, DC-B, and DC-C]
  26. Reads: per-node volume
       • Coordinator chooses R fastest replicas (R=3)
       • Network latency steers to the nearest replicas
       [Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC latencies 24 ms, 24 ms, 3 ms]
  27. Reads: per-node volume (Cass 3 as coord)
       • Chooses 3, 4, and 5
       • Same when Cass 4 or Cass 5 coordinates
       [Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC latencies 24 ms, 24 ms, 3 ms]
  28. Reads: per-node volume (Cass 1 as coord)
       • Hits 1, 2 and (randomly) one of 3, 4, 5
       • Same when Cass 2 coordinates
       [Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC latencies 24 ms, 24 ms, 3 ms]
  29. Reads: per-node volume, uniform coord usage
       Coordinator \ Node   Cass 1   Cass 2   Cass 3   Cass 4   Cass 5
       Cass 1               1        1        0.33     0.33     0.33
       Cass 2               1        1        0.33     0.33     0.33
       Cass 3               0        0        1        1        1
       Cass 4               0        0        1        1        1
       Cass 5               0        0        1        1        1
       Total requests       2        2        3.66     3.66     3.66
  30. Per-node read volume: reality vs. theory
  31. What about scaling out?
       • Asymmetrical per-node read volumes
       • So each DC has different CPU and disk IO needs
       • Different node size?
       • Different per-DC node count?
       • What about DC degradation or loss?
       • End up with same-sized nodes
  32. When a data center vanishes…
  33. Major outage: DC-C (May, 2015)
       • All hosts unreachable for ~5 hours
  34. Seamless data center migration (August 2015)
       • Moved DC-C fleet from one provider to another
       • Remove old node; add new node
       • No application-level migration needed
       • Zero customer impact
  35. DC-A to DC-B fiber cut (September, 2015)
       • DC-A to DC-B network latency 24ms -> 200ms, lasted 48 hours
       • All Cass ops now take 24ms
       [Diagram labels: FIBER CUT, EAST-1]
  36. And back to where we started
  37. What have we learned?
       • WAN-spanning synchronous replication is a thing
       • Data layer consistent reads are practical
       • Application layer consequences for managing state
       • Network topology affects:
            • Request performance
            • Per-node load
       • Trade off latency for reliability
  38. Span the WAN? Yes you can!
  39. paul@pagerduty.com PAGERDUTY.COM/JOBS
  40. Questions? paul@pagerduty.com
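
The quorum rule from slides 12 and 13 (W + R > N means a read can't miss a write) can be checked by brute force. The following is a minimal Python sketch, not code from the talk; the function name `quorums_always_overlap` is hypothetical, and replicas are modelled as plain integers rather than real Cassandra nodes.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """Return True if every possible write set of size w shares at least
    one replica with every possible read set of size r, out of n replicas.

    Illustrative model only: replicas are the integers 0..n-1.
    """
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


if __name__ == "__main__":
    # The setup from slide 13: N=5, W=3, R=3, so 3 + 3 > 5.
    print(quorums_always_overlap(5, 3, 3))
    # A weaker setting where W + R <= N, so a read can miss a write.
    print(quorums_always_overlap(5, 2, 3))
```

Enumerating every pair of write and read sets is only practical for small N, but it makes the pigeonhole argument concrete: with N=5 and W=R=3, any three replicas that acknowledged a write must include at least one of any three replicas that answer a read.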
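
The read skew on slides 26 to 29 follows from the coordinator picking the R nearest replicas. The sketch below is a simplified Python model of that reasoning under stated assumptions, not a description of Cassandra's actual replica selection. The node placement (Cass 1 and Cass 2 in DC-A, Cass 3 and Cass 4 in DC-B, Cass 5 in DC-C) and the assignment of the 24 ms, 24 ms, and 3 ms latencies to specific DC pairs are assumptions inferred from the slide text, and all names (`NODE_DC`, `read_shares`, `total_read_volume`, `quorum_latency_ms`) are hypothetical. It also assumes every node holds a replica of every item (N=5 on five nodes) and that ties in latency are broken uniformly at random.

```python
from fractions import Fraction

# Assumed placement and inter-DC latencies (not stated explicitly in the slides).
NODE_DC = {
    "Cass 1": "DC-A", "Cass 2": "DC-A",
    "Cass 3": "DC-B", "Cass 4": "DC-B",
    "Cass 5": "DC-C",
}
LATENCY_MS = {
    frozenset(("DC-A", "DC-B")): 24,
    frozenset(("DC-A", "DC-C")): 24,
    frozenset(("DC-B", "DC-C")): 3,
}


def latency(a: str, b: str) -> int:
    """Latency between two nodes; nodes in the same DC are treated as 0 ms apart."""
    dc_a, dc_b = NODE_DC[a], NODE_DC[b]
    return 0 if dc_a == dc_b else LATENCY_MS[frozenset((dc_a, dc_b))]


def read_shares(coordinator: str, r: int = 3) -> dict:
    """Expected fraction of this coordinator's reads served by each node,
    when the r nearest replicas are chosen and ties are split evenly."""
    by_distance = {}
    for node in NODE_DC:
        by_distance.setdefault(latency(coordinator, node), []).append(node)
    shares = {node: Fraction(0) for node in NODE_DC}
    remaining = r
    for distance in sorted(by_distance):
        group = by_distance[distance]
        take = min(remaining, len(group))
        for node in group:
            shares[node] = Fraction(take, len(group))
        remaining -= take
        if remaining == 0:
            break
    return shares


def total_read_volume(r: int = 3) -> dict:
    """Per-node read volume when each node coordinates one read."""
    totals = {node: Fraction(0) for node in NODE_DC}
    for coordinator in NODE_DC:
        for node, share in read_shares(coordinator, r).items():
            totals[node] += share
    return totals


def quorum_latency_ms(coordinator: str, r: int = 3) -> int:
    """Latency to the r-th nearest replica, a lower bound on the read's network cost."""
    return sorted(latency(coordinator, node) for node in NODE_DC)[r - 1]


if __name__ == "__main__":
    for node, volume in total_read_volume().items():
        print(node, float(volume))
```

Under these assumptions the model gives Cass 1 and Cass 2 a volume of 2 each and Cass 3, Cass 4, and Cass 5 a volume of 11/3 each, which the table on slide 29 shows as 3.66. The same latency ordering gives the per-coordinator bounds on slide 20: the third-nearest replica is 24 ms away from a DC-A coordinator and 3 ms away from a DC-B or DC-C coordinator.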
