Experiences Building a Multi-Region Cassandra Operations Orchestrator on AWS

  1. Diego Pacheco, Jackson Oliveira, Marcelo Serpa (a.k.a. Tarzan) Experiences Building a Multi-Region Cassandra Operations Orchestrator on AWS
  2. About us - Diego Pacheco @diego_pacheco ❏ Cat's Father ❏ Principal Software Architect ❏ Agile Coach ❏ SOA/Microservices Expert ❏ DevOps Practitioner ❏ Speaker ❏ Author diegopacheco http://diego-pacheco.blogspot.com.br/ https://diegopacheco.github.io/
  3. About us - Jackson Oliveira ❏ Father ❏ Software Architect ❏ DevOps Engineer ❏ GCP Certified Cloud Architect http://jackson-s-oliveira.blogspot.com/ @cyber_jso cyberjso
  4. About us - Marcelo Serpa (Tarzan) @_marceloserpa ❏ Software Developer ❏ Microservices / DevOps Practitioner ❏ Speaker ❏ Meetup coordinator - NodeJS POA marceloserpa https://medium.com/@marceloserpa
  5. About ilegra.com T
  6. CM Blog Posts http://diego-pacheco.blogspot.com/2018/07/experiences-building-cassandra.html http://ilegra.com/beyonddata/2018/08/experiences-building-a-cassandra-orchestratorcm/ T
  7. Agenda ❏ About us ❏ Problem, Principles and Design ❏ Team Practices ❏ Outages, Issues and Lessons ❏ Remediation & Cmsh ❏ Lessons Learned ❏ Q&A T
  8. Problem, Principles and Design D
  9. Problems CM solves ❏ Operation Automation ❏ Create, Decommission and Search Clusters ❏ Observability and Remediation ❏ Deployment Automation ❏ Security Groups ❏ Launch Configurations ❏ Auto Scaling Groups ❏ Route53 DNS Entries ❏ EIPs ❏ S3 Buckets ❏ Scaling Cloud Operations ❏ No Code Needed ❏ No Manual Work D
 10. Why build CM? D
 11. Why Java? Team background; troubleshooting; Cassandra is written in Java D
 12. CM Features ❏ Support for CASS 2.2.x and 3.1.x ❏ Backups and point-in-time restores ❏ Seeds / Token Management ❏ Full AWS Automation (SG, LC and ASG) ❏ Automated node replacement ❏ Automated node-by-node repairs ❏ Multi-DC support ❏ REST interfaces ❏ CM internal state durability / recovery (local disk and S3) ❏ 100% automated operations for: ❏ Cluster: creation, search, shutdown D
 13. CM Philosophy: Self Healing - Self Operating D
 14. Dynomite Experiences https://www.youtube.com/watch?v=Z4_rzsZd70o&feature=youtu.be (Netflix 2016) D
 15. CM use cases ❏ Source of truth for most microservices ❏ Single-Region Cluster ❏ Batch/Streaming Application (previously on HBase) ❏ Multi-Region Cluster ❏ API Gateway (Kong) ❏ Authentication Microservice D
 16. CM Architecture D
 17. Internal Design D
 18. Heartbeat Algo and Design J
 19. Heartbeat Algo and Design J
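The heartbeat design above can be sketched in a few lines of Java. This is an illustrative reconstruction, not CM's actual code: `HeartbeatTracker` and its method names are assumptions. The idea is simply that each Cassandra node reports in periodically, and any node silent past a threshold becomes a candidate for remediation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical heartbeat tracker: nodes call beat() periodically;
// a checker thread asks isHealthy() to find silent nodes.
public class HeartbeatTracker {
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    public HeartbeatTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat message arrives from a node.
    public void beat(String nodeId, long nowMillis) {
        lastSeen.put(nodeId, nowMillis);
    }

    // A node is healthy if it reported within the timeout window.
    public boolean isHealthy(String nodeId, long nowMillis) {
        Long seen = lastSeen.get(nodeId);
        return seen != null && nowMillis - seen <= timeoutMillis;
    }
}
```

The timeout value matters: the "Fast vs Slow" slide later mentions a 90-second internal TTL playing exactly this kind of role.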
 20. Step Framework ❏ One task has multiple steps ❏ Ordering ❏ Run a list of steps on Cassandra nodes ❏ Track the current step running per node ❏ Skip steps ❏ If a step fails, send a message to the Slack channel ❏ SignalFX BACKUP: 1- Create directories 2- Copy data 3- Send to S3 RESTORE: 1- Download backups 2- Copy data 3- Restart Cassandra J
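The step framework above (ordered steps per node, a tracker for the current step, skippable steps, failure notifications) can be sketched as follows. All names here are illustrative assumptions, not CM's real classes; CM posts failures to Slack and SignalFX, which is modeled as a plain callback.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical step framework: a task is an ordered list of steps
// executed per Cassandra node.
public class StepRunner {
    public interface Step {
        String name();
        boolean run(String node); // true on success
        default boolean skip(String node) { return false; }
    }

    private final List<Step> steps;
    private final Consumer<String> onFailure; // e.g. a Slack notifier

    public StepRunner(List<Step> steps, Consumer<String> onFailure) {
        this.steps = steps;
        this.onFailure = onFailure;
    }

    // Runs steps in order for one node, stopping on the first failure.
    // Returns the name of the last step that ran (the "tracker").
    public String runFor(String node) {
        String current = null;
        for (Step step : steps) {
            if (step.skip(node)) continue; // skippable steps
            current = step.name();
            if (!step.run(node)) {
                onFailure.accept("Step " + current + " failed on " + node);
                break;
            }
        }
        return current;
    }
}
```

A BACKUP task would then be three steps (create directories, copy data, send to S3) and RESTORE another three, reusing the same runner.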
 21. Graceful shutdown J
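Graceful shutdown for a worker pool like CM's task threads is a standard JDK pattern: stop accepting work, let in-flight tasks drain within a deadline, then force-stop. This is a generic sketch under that assumption, not CM's code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

public class GracefulShutdown {
    // Returns true if all in-flight tasks finished within the deadline.
    public static boolean drain(ExecutorService pool, long timeoutSeconds)
            throws InterruptedException {
        pool.shutdown(); // reject new tasks, keep running queued ones
        if (pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
            return true; // everything finished cleanly
        }
        pool.shutdownNow(); // interrupt stragglers
        return false;
    }
}
```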
 22. Recovery: old and new model ❏ OLD way ❏ Disk first ❏ S3 every minute ❏ Flaky: not covering all corner cases ❏ NEW way ❏ Disk ❏ Send to all Cass nodes ❏ In case of failure, call all Cass nodes ❏ Get the highest TIMESTAMP and use it ❏ More reliable J
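The "highest TIMESTAMP" selection in the new recovery model boils down to asking every Cassandra node for its copy of CM's persisted state and keeping the freshest one. A minimal sketch, where `StateCopy` is a hypothetical holder rather than CM's real type:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class RecoveryPicker {
    // One node's copy of CM's internal state, stamped when it was written.
    public record StateCopy(String node, long timestampMillis, String payload) {}

    // Pick the copy with the highest timestamp across all responding nodes.
    public static Optional<StateCopy> newest(List<StateCopy> copies) {
        return copies.stream()
                .max(Comparator.comparingLong(StateCopy::timestampMillis));
    }
}
```

Returning `Optional.empty()` when no node responds forces the caller to handle the "no state anywhere" corner case explicitly, which is the kind of gap the old model missed.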
 23. Jenkins Jobs J
 24. Multi-Region Design ❏ CM Topology ❏ Dedicated: 1-1 ❏ Shared: 1-N ❏ Infrastructure details: ❏ CMs in both regions exchange information ❏ CM internode communication with EIP ❏ Public IP + PEM -> VPC Peering ❏ Cassandra: ❏ 2 seeds in US, 1 seed in EU ❏ Seeds boot up first ❏ Replication is async between regions J
 25. Team Practices D
 26. How the team works? Practices. ❏ Clients are: developers and cloud operators ❏ Quarterly planning ❏ Tech Lead / Coach ❏ Monthly retrospectives ❏ Coaching sessions - 1:1s ❏ Design sessions ❏ Reviews ❏ Refactoring ❏ Kanban + Google Sheets + Trello ❏ DevOps principles - e.g. Immutable Infrastructure D
 27. How the team works? Tracking. ❏ Name an engineer who likes JIRA? Only PMs like JIRA. ❏ We were not using issue tracking at first ❏ Issues got lost ❏ Digging through emails ❏ Asking several times about issues ❏ Repeating the same design over and over ❏ Came up in a retrospective ❏ GitHub as issue tracking ❏ Log issues: bugs and enhancements ❏ GitHub release tracking D
 28. How the team works? Kanban + Predictability ❏ Simple Google Sheets ❏ Items / Weeks ❏ Check every week whether you are on track or not ❏ 100% accuracy for features ❏ 100% WRONG estimates for BUGS (2 weeks ~ 2 months) ❏ Different nature: microservices vs data layer ❏ Very hard to estimate bugs - solution? ❏ You can't automate what you don't know ❏ Stability mindset ❏ Don't introduce bugs == developer checklists ❏ Forces you to know what to automate later D
 29. How the team works? Releases. Stabilization Windows ❏ 4 quarters ❏ ~Monthly releases ❏ Looks like waterfall or buffering ❏ Avoid shipping bugs to customers ❏ Avoid downtime ❏ Avoid losing data ❏ It's a must in the data layer ❏ The data layer needs to be more reliable than microservices ❏ How we did it? ❏ Single Region - Stabilization Window 1 ❏ Multi-DC - Stabilization Window 2 D
 30. How the team works? Documentation and Scalability ❏ About our customer: a 42-country organization ❏ Meetings are a bottleneck for scalability ❏ Jenkins DSL (code in general) kills scalability ❏ Self-service kills tickets ❏ Documentation kills meetings ❏ Documentation matters ❏ Time zones ❏ English ❏ Avoid repetition D
 31. How the team works? Tests! Stability + Checklists ❏ Unit tests ❏ Integration tests ❏ Exploratory tests ❏ Release 1 - 30 issues (mostly bugs) ❏ Release 2 - 20 issues (mostly enhancements) ❏ Stability mindset / principles ❏ Exploratory tests are a MUST ❏ Try to maximize the coverage spectrum ❏ Developer checklists work very well D
 32. How the team works? Refactorings. ❏ Strategic vs Tactical Programming ❏ Several important refactorings (re-designs) like: ❏ Thread model ❏ Task responsibilities ❏ Utils ❏ And much more… ❏ Easy to do in Java with good tooling like Eclipse ❏ Pays off in the long run ❏ Kills you if you don't do it D
 33. Flaky Tests ❏ Integration tests ❏ ~20 minutes ❏ Cassandra 3.x and Cassandra 2.x ❏ Hard to maintain ❏ Async AWS APIs (SG, LC and ASG) ❏ Fixed timeout == unstable tests ❏ Solution: progressive timeout T
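The "progressive timeout" fix for flaky tests against async AWS APIs can be sketched like this: instead of one fixed sleep, poll the condition with a growing delay until a ceiling is reached. The class and parameter names are illustrative.

```java
import java.util.function.BooleanSupplier;

public class ProgressiveWait {
    // Polls the condition with exponentially growing waits; returns true
    // as soon as it holds, false once the total budget is exhausted.
    public static boolean until(BooleanSupplier condition,
                                long initialMillis,
                                long maxTotalMillis) throws InterruptedException {
        long waited = 0, delay = initialMillis;
        while (waited < maxTotalMillis) {
            if (condition.getAsBoolean()) return true;
            Thread.sleep(delay);
            waited += delay;
            delay *= 2; // progressively longer waits
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }
}
```

Fast environments exit on the first cheap checks, while slow AWS propagation gets the long tail of the budget, which is what stabilizes the test suite.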
 34. Remediation & Cmsh D
 35. Remediation ❏ Why remediate? ❏ Manual steps are dangerous ❏ Bad times == lots of pressure ❏ Started with Dynomite ❏ Scale Up ❏ AMI Patch ❏ Refactored to support Cassandra and CM ❏ Calls DM and CM health checkers ❏ Procedural process ❏ Relies on: DM cold bootstrap and CM node_replace + repair D
 36. Remediation D
 37. Downtime vs No Downtime: Forklift + Dual Write ❏ Downtime ❏ Dump data to a file ❏ Dump keyspace/schema to a file ❏ Upload to S3 ❏ Import into the new cluster ❏ No downtime ❏ Forklift + dual-writer pattern ❏ Requires code in the microservices ❏ Requires orchestration in Spinnaker D
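The dual-writer half of the no-downtime path above is conceptually tiny: during migration, the microservice writes every record to both the old and the new cluster while reads stay on the old one until the forklift backfill catches up. In this sketch the sinks are plain callbacks standing in for Cassandra sessions; the class is illustrative, not the authors' code.

```java
import java.util.function.Consumer;

public class DualWriter<T> {
    private final Consumer<T> oldCluster;
    private final Consumer<T> newCluster;

    public DualWriter(Consumer<T> oldCluster, Consumer<T> newCluster) {
        this.oldCluster = oldCluster;
        this.newCluster = newCluster;
    }

    public void write(T record) {
        oldCluster.accept(record); // source of truth during migration
        newCluster.accept(record); // shadow write to the target cluster
    }
}
```

Once both clusters are in sync, reads are flipped to the new cluster and the old sink is dropped, which is the orchestration piece the slide assigns to Spinnaker.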
 38. CMSH D
 39. Outages, issues and lessons... J
 40. Troubleshooting / Police Forensic Skills J
 41. Troubleshooting / Police Forensic Skills Remediation killed too many nodes and the replace did not happen... why? A) AWS EC2 B) Jenkins C) CM Java code D) Python daemon E) Java remediation code F) AWS S3 G) Cassandra node H) Cassandra cluster I) Time J) None of the above J
 42. Troubleshooting / Police Forensic Skills [diagram: Remediation and CM against Cassandra nodes in US 2A/2B/2C and EU; the ASG kills a box in US 2A, which comes back with a new IP - was there cluster activity?] J
 43. Fast vs Slow Issue! ❏ At first, only theories ❏ EVIDENCE to back up our theories/assumptions ❏ Simulations ❏ Solution: ❏ AWS chaos service :-) ❏ < 1 min = FAST ❏ > 3 min = SLOW ❏ At the end of the day it's all about the 90s internal TTL ❏ Wait for the replace to make sure it reflects the REAL world ❏ Wait for the HC to make sure it captures the real world J
 44. Kilometers approach J
 45. Tar Pits D
 46. Outage in prod 32k J
 47. Outage in prod J
 48. Outage in prod: no outage because there was no live data there! J
 49. Threads Re-design D
 50. Threads Re-design D
 51. Observability Rules! T
 52. Cass 2.1.x to Cass 2.2.x issues ❏ Node replace stopped working ❏ We generate the Cass config files ❏ Positions and parameters changed from 2.1 to 2.2 ❏ Our code broke ❏ Big changes in the migration from Cass 2.1.x to 2.2.x ❏ Improvements ❏ Improved repair performance ❏ The commit log is compressed to save disk space ❏ Fixes ❏ Fix repair hang when snapshot failed (CASSANDRA-10057) ❏ Fix potential NPE on ORDER BY queries with IN (CASSANDRA-10955) ❏ Fix handling of nulls and unsets in IN conditions (CASSANDRA-12981) ❏ https://github.com/apache/cassandra/blob/cassandra-2.2/CHANGES.txt T
 53. S3 Upload issue J
 54. CASS Stress/Load Tests ❏ Some bugs only appear when testing with volume ❏ Adding volume can be tricky and time-consuming ❏ Latency (do not run scripts from your local env) ❏ Filling up a table from a few text files takes too much time ❏ Parallelization is needed ❏ The cassandra-stress tool comes in handy in such scenarios ❏ Customize how many rows and how many parallel write threads ❏ We used tables with blobs ❏ Customize schema, replication factor and consistency level while running scripts J
 55. OOM Outage! EBS vs Instance Store ❏ EBS is a SPOF ❏ EBS is more expensive ❏ EBS is less performant ❏ EBS is more flexible ❏ Disk space was critical for us ❏ You don't want to run out of disk, believe us... ❏ Dynamic disk-space definition while launching a cluster ❏ Disk-space validation before starting a backup J
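The "disk-space validation before starting a backup" idea above is worth making concrete: refuse to start a snapshot if the data volume lacks the headroom it needs. The JDK exposes this directly via `File.getUsableSpace()`; the class name and threshold handling are illustrative.

```java
import java.io.File;

public class DiskGuard {
    // True if the volume holding the Cassandra data directory has at
    // least requiredBytes free for the caller (backup, compaction, ...).
    public static boolean hasRoom(File volume, long requiredBytes) {
        return volume.getUsableSpace() >= requiredBytes;
    }
}
```

A backup step would call this first and fail fast with a clear message, instead of filling the disk halfway through the copy.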
 56. Side note on Cass 4.x ❏ Cass 3.x is better than Cass 2.x right now ❏ Cass 4.x will be awesome ❏ Netflix's work on incremental repairs ❏ Bug fixes - like the gossip threads and restart issue ❏ Way more stable - everybody should migrate ❏ Having fewer Cassandra versions reduces complexity ❏ Different configurations ❏ Bugs that were fixed but you don't get the fix - lack of backports to old versions D
 57. Lessons Learned D
 58. Design is strategic: avoid complexity and bugs, and reduce cost D
 59. Avoid Classitis - FAT classes rule D
 60. Java over Bash always | The right tool for the job D [diagram: Tooling, Refactoring, Dev vs Ops, Tooling / Mindset]
 61. Proof of 9 - Validate the code, not the tests D
 62. Hard-to-estimate bugs (data layer) = stabilization payoff D [diagram: Microservices vs Data Layer]
 63. Make sure you expand your test coverage radius D
 64. Forensic Mindset & Skill | Observability over debugging D
 65. DevOps Is Plumbing. Automate the Hidden Pipelines! ❏ Remediation ❏ Scale Up ❏ Patch ❏ Upgrade ❏ OS patches ❏ Telemetry ❏ Discovery ❏ Destroy ❏ Restore ❏ Much more... D
 66. Make tools for your Tools ❏ REPL - Cmsh ❏ Better than: ❏ Runbooks ❏ REST ❏ Bash aliases ❏ Shared dashboards ❏ Avoid problems ❏ What to monitor ❏ Self-Service Jobs ❏ Better than: ❏ Coding ❏ Jenkins DSL ❏ TF templates D
 67. Diego Pacheco, Jackson Oliveira, Marcelo Serpa (a.k.a. Tarzan) Experiences Building a Multi-Region Cassandra Operations Orchestrator on AWS