
How Netflix manages petabyte scale Apache Cassandra in the cloud

At Netflix, we manage petabytes of data in Apache Cassandra, which must be reliably accessible to users in mere milliseconds. To achieve this, we have built sophisticated control planes that turn our persistence layer based on Apache Cassandra into a truly self-driving system. We will start with the user interface that Netflix developers use to interact with their Cassandra databases and dive deep into the automation that powers it all. From cluster creation, through scaling up, to cluster death, complex automation drives large fleets of virtual machines hosted on the AWS cloud.

First, we will cover the basics of how Netflix deploys Apache Cassandra. In particular, this begins with how we mold Apache Cassandra to the Netflix philosophy of immutable infrastructure, including managing software and hardware upgrades in the face of ever-failing hardware. Then we will explore the concrete techniques needed for such a massive deployment, specifically pull-based control planes and auto-healing strategies.

Next, we will cover how Netflix has automated complex but critical Apache Cassandra maintenance tasks such as continuous snapshot backups and always-on anti-entropy repair for keeping our datasets safe and consistent. Both of these systems have gone through multiple architectural evolutions, and we have learned many lessons along the way.

Lastly, we will share some of the ways this has gone wrong and what you can do to avoid the same mistakes. We will cover a few case studies of major Cassandra outages at Netflix, their root causes, and what we learned from those incidents. At the end of this talk, we hope that participants leave with a concrete understanding of the challenges in running massive-scale Apache Cassandra, as well as solid advice and techniques for building their own self-driving data persistence layer.

Published in: Technology

  1. How Netflix manages petabyte scale Apache Cassandra in the cloud
     Joey Lynch, Vinay Chella
     Netflix’s Distributed Database Engineers
  2. Who are we?
     - Vinay Chella: Distributed Systems Engineer, Cloud Data Engineering, Netflix. Focusing on Apache Cassandra and data abstractions.
     - Joey Lynch: Distributed Systems Engineer, Cloud Data Engineering, Netflix. Distributed system addict and data wrangler.
  3. Agenda
     - Why use Cassandra?
     - Scale of Cassandra
     - Life of a Cassandra cluster
       - Where does it start?
       - Provisioning
       - Keep it running
       - Migration / Retiring
     - Murphy’s law applied
  4. Why Apache Cassandra
     - Millions of operations per second
     - Global data replication
     - Failure isolation at rack level
     - Chaos-ready database
     - Tunable consistency
     - Log-structured storage engine
  5. Scale
     - Tens of thousands of instances
     - Hundreds of global C* clusters
     - >6 PB of data
     - Millions of requests per second
     - Replicating several GiB/sec of data across the globe
  6. Story of Apache Cassandra and Netflix: Inception, Provision, Keep it running, Migrations
  7. Inception: Where does it all start?
  8. Inception
  9. Inception: Service philosophy
     - Context not control
       - Education
       - Tooling
     - SLOs are key
       - Size, rate, latency, availability
     - Every party must be responsible
  10. Inception
  11. Inception: Invest in DevEd
  12. Inception
  13. Better Tooling
  14. Inception: Cost insights
  15. Inception: Maintenance
  16. Inception: Maintenance - Repair Insights
  17. Maintenance - Backup Insights
  18. Inception: Maintenance - Node Insights
  19. Inception: SLOs are Key
  20. Inception: Whom to page?
  21. Inception: Good contracts make good partners!
  22. Story of C*: Inception, Provision, Keep it running, Migrations
  23. Provision: Get up and running fast!
  24. Provision: Prove it before serving it
      - Robust AMI build process
      - Pre-flight check
  25. Provision: AMI build process
  26. Provision: Pre-flight check; EC2 instance lifecycle
  27. Story of C*: Inception, Provision, Keep it running, Migrations
  28. Keep it Running
  29. Keep it Running: Imperative Control Plane vs. Declarative Control Plane
  30. Keep it Running
      Imperative Control Plane
      - Pros:
        - Easy to implement
        - Easy to understand
      - Cons:
        - Fragile to failure
        - Scaling issues
        - Have to implement parallelism
      Declarative Control Plane
      - Pros:
        - Lower total cost of ownership
        - Parallel out of the box
      - Cons:
        - Requires different engineering mindset
        - Hard to implement
        - Have to stop parallelism
  31. Keep it Running
      Imperative Tools
      - “Orchestrators”
        - Jenkins (jenkins.io/)
        - Fabric (fabfile.org/)
        - Ansible (ansible.com/)
      - Imperative operation
      Declarative Tools (puppet.com, consul.io/)
      - Distributed databases
      - Distributed operation
      - Similar to Actor model
  32. Keep it Running: Declarative Control Plane
      1. Obtain desires from datastore
      2. Obtain state of world
      3. If state != desire, act!
  33. Keep it Running: Ring Health, Imperative Solution
      - O(#nodes) commands
      - Single network path
      - Hard to synthesize global view
  34. Keep it Running: Ring Health, Declarative Solution
      - Stream processing!
  35. Keep it Running: Ring Health, Declarative Solution
      - Desire: Healthy <30m
      - Observe: Full cluster state
      - Action: Remediate or page
  36. Keep it Running: Hardware Health, Imperative Solution
      - O(#nodes) messages
      - Degraded hardware is hard to SSH into
        - Failed ephemerals
        - Failed network
  37. Keep it Running: Hardware Health, Declarative Solution
      - Desire: Degraded hardware should die <10m
      - Observe: Local disks, creds, filesystems
      - Action: Request termination
  38. Keep it Running: JVM Health, Declarative Solution
      - Desire: JVM throughput > 90%
      - Observe: GC vs runtime
      - Action: Core dump + terminate
      - https://github.com/Netflix-Skunkworks/jvmquake
  39. Keep it Running: Backup, Declarative Solution
      - Desire: Full snapshot every interval
      - Observe:
        - State of data
        - State of S3
      - Action:
        - Upload diff
  40. Keep it Running: Case Study: Cassandra Repair
  41. Keep it Running: Case Study: Cassandra Repair
  42. Keep it Running: At scale, subrange is required
  43. Keep it Running: Repair, Imperative Solution
      - O(lots) commands
      - Remote JMX = :(
      - Single point of failure = :(
      - 1 month job duration = D:
  44. Keep it Running: Repair, Declarative Solution
      - Distributed control plane
      - Makes progress when database is available
      - Every node follows repair state machine
      - https://issues.apache.org/jira/browse/CASSANDRA-14346
  45. Keep it Running: Repair, Declarative Solution (#14346)
  46. Keep it Running: Cassandra Ecosystem
  47. Keep it Running: C* instance ecosystem
      [Diagram. Labels: C* instances, Priam, Bolt, Winston, Atlas, Mantis, CDE Service, Cluster Metadata, Pre-flight check, AWS IO detection, Alert, Remediation, Page CDE]
  48. Story of C*: Inception, Provision, Keep it running, Migration
  49. Migration
  50. Migration: Migration Philosophy
      - Immutable infrastructure
      - Seamless and transparent migrations
  51. Migration: Software Migration
  52. Migration: Hardware Migration
      [Diagram: an agent on the source instance (i2.xl) takes incremental backups of C* data to S3. Agents ask the Desires store "desired instance?" and receive i2.2xl. The new instance (i2.2xl) downloads the latest SSTables from S3 and copies the differential data from the source.]
  53. Hardware Migration: The Numbers
      - Naive solution, sequential transfer
      - Parallel S3 download, sequential fixup
      - Parallel download and zone-wide fixup
  54. Migration: Data Migration
  55. Case Studies: Murphy’s law
  56. Cautionary Tale: Rarely run automation
      - Auto remediation to clean up stale backups, repair snapshots
        - Triggers only above FATAL disk threshold
        - Assumption mismatch
  57. Cautionary Tale: Rarely run automation
      - Auto remediation to clean up stale backups, repair snapshots
        - Triggers only above FATAL disk threshold
        - Assumption mismatch
      - Result:
        - Automation ended up deleting real backup data
  58. Do you restore your backups constantly?
  59. Take Away: Invest in context sharing, tooling, and declarative (desire-driven) infrastructure to run Apache Cassandra at petabyte scale
  60. Thank You.
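
The three-step declarative loop on slide 32 (obtain desires, obtain the state of the world, act when they differ) can be sketched roughly as below. This is a minimal illustration and not Netflix's implementation: the function name `reconcile`, the dictionary-based desire and state stores, and the `act` callback are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

# (key, observed value, desired value); a hypothetical record of one correction.
Action = Tuple[str, object, object]


def reconcile(
    desired: Dict[str, object],
    observed: Dict[str, object],
    act: Callable[[str, object, object], None],
) -> List[Action]:
    """Run one pass of a declarative control loop.

    For every key whose observed value differs from the desired value,
    call `act` and record the correction. Matching keys are skipped.
    """
    actions: List[Action] = []
    for key in sorted(desired):
        want = desired[key]
        have = observed.get(key)  # None when nothing was observed for this key
        if have != want:
            act(key, have, want)
            actions.append((key, have, want))
    return actions


if __name__ == "__main__":
    # Hypothetical desires and observations for a single node.
    desires = {"instance_type": "i2.2xl", "cassandra_running": True}
    world = {"instance_type": "i2.xl", "cassandra_running": True}
    reconcile(desires, world, lambda k, have, want: print(f"{k}: {have} -> {want}"))
```

Because every pass recomputes the difference from scratch, a crashed or interrupted pass needs no recovery logic: the next pass simply observes the world again.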
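
The ring health desire on slide 35 ("Healthy <30m", observe full cluster state, remediate or page) could be expressed along these lines. The sketch assumes, beyond what the slide states, that automation is tried first and a human is paged only once a node has been unhealthy for longer than the 30-minute window. The function name `ring_health_actions` and its input shape are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

HEALTHY_WITHIN = timedelta(minutes=30)  # the "<30m" desire from the slide


def ring_health_actions(
    unhealthy_since: Dict[str, Optional[datetime]],
    now: datetime,
    deadline: timedelta = HEALTHY_WITHIN,
) -> Dict[str, str]:
    """Map each node to "ok", "remediate", or "page".

    `unhealthy_since[node]` is None for a healthy node, otherwise the time
    at which the node was first observed unhealthy.
    """
    actions: Dict[str, str] = {}
    for node, since in unhealthy_since.items():
        if since is None:
            actions[node] = "ok"
        elif now - since < deadline:
            actions[node] = "remediate"  # assumption: try automation first
        else:
            actions[node] = "page"  # assumption: escalate once the desire is violated
    return actions
```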
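
Slide 38 states the JVM health desire as "JVM throughput > 90%", observed as GC time versus runtime. One simple reading of that ratio is the fraction of a measurement window spent running application code rather than paused in garbage collection. The sketch below computes that fraction; it does not attempt to reproduce jvmquake's own logic, and the function names and the windowed inputs are assumptions made for illustration.

```python
def jvm_throughput(app_time_ms: float, gc_pause_ms: float) -> float:
    """Fraction of the window spent running application code (0.0 to 1.0)."""
    if app_time_ms < 0 or gc_pause_ms < 0:
        raise ValueError("durations must be non-negative")
    total = app_time_ms + gc_pause_ms
    if total == 0:
        raise ValueError("window must contain some elapsed time")
    return app_time_ms / total


def violates_desire(
    app_time_ms: float, gc_pause_ms: float, min_throughput: float = 0.90
) -> bool:
    """True when throughput is not strictly above the desired minimum."""
    return jvm_throughput(app_time_ms, gc_pause_ms) <= min_throughput
```

For example, a window with 800 ms of application time and 200 ms of GC pauses has a throughput of 0.8 and would violate the desire, which on the slide leads to a core dump and termination.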
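
The backup action on slide 39 ("Upload diff", after observing the state of data and the state of S3) amounts to a set difference. The sketch below assumes that a file name uniquely identifies an SSTable component on a node, which leans on SSTables being immutable once written. The function name and the example file names are hypothetical, and no S3 calls are made.

```python
from typing import Iterable, List


def files_to_upload(
    local_sstables: Iterable[str], already_in_s3: Iterable[str]
) -> List[str]:
    """Return local files that are not yet present in the remote backup.

    Assumption: names are unique per node, so a name match means the
    content is already backed up.
    """
    remote = set(already_in_s3)
    return sorted(name for name in set(local_sstables) if name not in remote)


if __name__ == "__main__":
    local = ["mc-1-big-Data.db", "mc-2-big-Data.db", "mc-3-big-Data.db"]
    remote = ["mc-1-big-Data.db", "mc-2-big-Data.db"]
    print(files_to_upload(local, remote))
```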
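
Slide 42 notes that subrange repair is required at scale. A rough sketch of the splitting step follows: it divides one token range into contiguous pieces of near-equal width that could each be repaired on its own. It assumes integer tokens (as with Murmur3Partitioner) and does not handle ranges that wrap around the ring. The function name `split_token_range` is hypothetical, and this is not the scheduling logic from CASSANDRA-14346.

```python
from typing import List, Tuple


def split_token_range(start: int, end: int, parts: int) -> List[Tuple[int, int]]:
    """Split the range (start, end] into at most `parts` contiguous subranges.

    Each returned pair is (exclusive start, inclusive end), and consecutive
    pairs share a boundary so that together they cover the whole range.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    if start >= end:
        raise ValueError("wrapping or empty ranges are not handled in this sketch")
    width = end - start
    parts = min(parts, width)  # never create an empty subrange
    bounds = [start + (width * i) // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    # Full Murmur3 token space, split four ways.
    for sub in split_token_range(-(2**63), 2**63 - 1, 4):
        print(sub)
```

Smaller subranges keep each repair session short, so a failure costs only the work for one piece rather than the whole range.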
