
DevOps Fest 2020. Immutable Infrastructure as Code. True story.

In this talk I’ll explain how we went from classic Pet servers to immutable infrastructure, fully described as code, with Cattle instances. I’ll also share which tools we use and how we evolved our experience with them.

Published in: Technology

  1. 1. Immutable Infrastructure as Code. True story. Vladlen Fedosov, Director of R&D @Namecheap, Inc
  2. 2. Vladlen Fedosov, Director of R&D @Namecheap. TL;DR: • 10 years in the industry • Went the path from Junior to Architect • Amateur DevOps evangelist • AWS ninja • Believe in self-organized, cross-functional teams
  3. 3. “Opening the door for everyone to a free and open Internet”
  4. 4. The beginning Disclaimer: Here I’m talking mostly about my own experience and the part of the infrastructure that my team was responsible for, so the statements that follow may not apply to every department in the company. The timeline of the things mentioned below was changed to simplify storytelling.
  5. 5. Project takeover from outsourcing company
  6. 6. Brave New World For a small dev team with no infra support / capabilities
  7. 7. So what do we have now?
  8. 8. State of the project after takeover ● Half-broken Chef cookbooks ● Sketchy CD pipelines ● Fault tolerance in place ● And… everything went down after the failure of a single etcd node (out of 3). We realized that we had fault tolerance only on paper
  9. 9. After some refactoring… We’ve got typical setup
  10. 10. Issues we noticed - Multiple apps were sharing the same OS, language versions, dependencies - Horizontal scaling was hard - Chef scripts sometimes failed on random instances, mostly due to network errors & configuration differences
  11. 11. Configuration Synchronisation problem, in short martinfowler.com/bliki/ConfigurationSynchronization.html
  12. 12. Issues we noticed - Manual sync of the AWS setup between environments - Manually configured CI/CD - Easy to break something and hard to repair or modify anything
  13. 13. Blue sky vision ● Immutable infrastructure ● Everything as code ○ Infrastructure as code ○ CI/CD as code
  14. 14. Blue sky vision ● Hard to break ● Easy to repair ● Easy to modify
  15. 15. Immutable infrastructure + extra takeaways
  16. 16. What is immutable infrastructure? martinfowler.com/bliki/ImmutableServer.html
  17. 17. Why go Immutable? ● Forget about change flow, only “create” matters ● Defeat Configuration Drift ● Use much simpler tools ● Build highly available systems more easily ● Fix issues faster
  18. 18. How to achieve this? ● Complexity → Docker (or AMI) images ● OS → Docker runtime only ● App/OS configuration methods: ○ K8s pod definitions (or similar) ○ “cloud-init” ● Terraform to define “datacenter config”
  19. 19. Main tools we use ● AMI images & Docker images ● Cloud-Init (98%) ● Hashicorp Packer (2%) ● Terraform, Terraform everywhere
  20. 20. Single fact you could memorise here: Immutable Infrastructure allows you to significantly simplify management steps and consequently reduce the number of bugs your customers will face. Work with images rather than servers.
  21. 21. CI/CD as code
  22. 22. Everything as code: full list ● Everything as code, it supports: ○ job steps definition as code (via pipelines) ○ jobs creation as code (via job dsl) ○ system configuration as code (via groovy API, XML configs & CasC yml) ● Shared libraries, the ability to share common steps between apps P.S.: Talk to me if you know a better alternative ;)
  23. 23. Other factors influenced the choice ● Has deployment dashboard so we can see the state of all the environments ● Highly extensible ● Elastic EC2 instances as agents
  24. 24. Nowadays ● It went beyond one project and now almost every team at Namecheap uses it ● 300+ pipelines ● Around 38 projects ● We expect even more in the future
  25. 25. Nowadays: CI
        @Library('namecheap/common') _
        node('CommonCLarge') {
            ciJavascript.servicePipeline {
                productName = "ProductA"
                serviceName = "Apps.Api"
            }
        }
  26. 26. Nowadays: CD
        @Library('namecheap/common') _
        properties([parameters([
            string(name: 'image', description: 'Application container name'),
            string(name: 'version', description: 'Application container tag'),
            choice(choices: ['production', 'sandbox'], description: 'Environment', name: 'env')
        ])])
        node() {
            def authToken = "XXXX"
            stage('Deploy') {
                deployToOKD(params, authToken)
            }
        }
  27. 27. Lessons learned ● Hide all complexity inside. Provide as much logic as you can in the form of Shared Libraries. ● Documentation & examples are crucial for developers ● If possible, provide standardized pipelines invoked as a Shared Library function ● It takes about 1.5 months for 2 people to set up Jenkins properly for the first time
  28. 28. Single fact you could memorise here: Always keep your CI/CD configuration, written as code, near the app, in the same repo. It lets everyone in your company understand how to build & deploy any app.
  29. 29. Infrastructure as code: Terraform
  30. 30. How Terraform is different to Ansible/Chef/Puppet Terraform is not a configuration management tool. It focuses on the higher-level abstraction of the datacenter (or cloud provider), without sacrificing the ability to use configuration management tools to do what they do best: bootstrapping and initializing resources.
        resource "aws_instance" "web" {
          ami           = "ami-dbc3b9aa"
          instance_type = "t2.micro"
        }
  31. 31. Imperative VS Declarative infrastructure code Declarative “Can I have a cup of coffee on my desk at 9AM on Monday morning?” Imperative “Go to that machine, then get the glass jar, then fill it with water, then put it back in the machine” …you get the idea...
  32. 32. Why go Declarative? Key challenges this approach solves for us: ● Dealing with “Configuration Drift” / State management ● Idempotency ● Dependency graph management, correct order of operations
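The three points on the slide above (drift, idempotency, ordering) can be illustrated with a small sketch. This is not how Terraform is implemented; it is a toy model under stated assumptions: resources are plain dictionaries, the resource names (`vpc`, `subnet`, `instance`) are hypothetical, and the function names `creation_order` and `plan` are made up for this example.

```python
from graphlib import TopologicalSorter


def creation_order(dependencies):
    """Return resource names ordered so that every dependency comes first.

    `dependencies` maps a resource name to the set of names it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


def plan(desired, actual):
    """Diff desired state against actual state into convergence actions.

    Drift shows up as a difference between the two mappings; when they
    already match, the plan is empty, which is what makes re-running safe.
    """
    actions = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("delete", name))
        elif desired[name] != actual[name]:
            actions.append(("replace", name))
    return actions


# Hypothetical resources, for illustration only.
dependencies = {
    "vpc": set(),
    "subnet": {"vpc"},
    "instance": {"subnet"},
}
desired = {"vpc": {"cidr": "10.0.0.0/16"}, "instance": {"type": "t2.micro"}}
drifted = {"vpc": {"cidr": "10.0.0.0/16"}, "instance": {"type": "t2.large"}}
```

Here `plan(desired, drifted)` reports only the instance as needing replacement, and `plan(desired, desired)` reports nothing, so applying the same configuration twice changes nothing the second time.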
  33. 33. Terraform CI Try now: https://www.runatlantis.io/ https://app.terraform.io/
  34. 34. Deploy with Terraform What we had: 1. Write TF configs 2. Run TF to create infrastructure 3. Take TF outputs & enter them to Jenkins 4. Deploy app itself with Jenkins What we wanted to have: 1. Write TF configs 2. Deploy app
  35. 35. Deploy with Terraform
  36. 36. Deploy with Terraform
        ├── vars    <-- Environment specific variables
        │   ├── production.tfvars
        │   ├── staging.tfvars
        ├── main.tf
        ├── io.tf
        ├── db.tf
        ├── etc.tf
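One way to remove the manual "take TF outputs & enter them to Jenkins" step is to let the deploy job read the outputs itself. The sketch below is an assumption about how that glue could look, not the pipeline used in the talk: it builds a `terraform apply` command for one of the per-environment variable files in the layout above and flattens the JSON printed by `terraform output -json`. The function names and the `db_endpoint` output are hypothetical.

```python
import json


def terraform_apply_command(environment):
    """Build the apply command for one environment's variable file."""
    return [
        "terraform",
        "apply",
        "-auto-approve",
        f"-var-file=vars/{environment}.tfvars",
    ]


def parse_outputs(output_json):
    """Flatten `terraform output -json` into a plain name -> value mapping.

    Each output in that JSON is an object with "sensitive", "type"
    and "value" keys; only the value is needed for deployment.
    """
    raw = json.loads(output_json)
    return {name: entry["value"] for name, entry in raw.items()}


# Hypothetical output of `terraform output -json`, for illustration only.
sample_output = """
{
  "db_endpoint": {
    "sensitive": false,
    "type": "string",
    "value": "db.example.internal:5432"
  }
}
"""
```

A deploy step could run the command list with `subprocess.run(...)`, capture the output of `terraform output -json`, and pass the parsed values straight to the application deployment.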
  37. 37. Single fact you could memorize here: Try to have as many declarative infrastructure configs as you can; avoid imperative scripts at all costs.
  38. 38. Learnings & further improvements
  39. 39. Tests for infrastructure code The more infrastructure code you have - the more bugs you see.
  40. 40. Chaos monkey We wrote a Lambda that reboots every instance once a day at a random time. This simple tweak ensures that: ● Apps you launch can survive instance failure ● Updates to the cluster setup are easy, as you’re sure that you won’t harm anyone by killing outdated machines ● Problem resolution can sometimes be simpler: you can always reboot/kill any instance that behaves abnormally as a first action. Apply this to the control fleet too
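The slide does not show the Lambda itself, so the following is only a sketch of one possible design, assuming the function is triggered by an hourly schedule. Each instance is assigned a stable pseudo-random hour per day by hashing its ID together with the date, so every instance is rebooted exactly once a day without keeping any state. The helper names `reboot_hour` and `due_instances` are hypothetical; the `boto3` calls (`describe_instances`, `reboot_instances`) are the standard EC2 client operations.

```python
import hashlib
from datetime import datetime, timezone


def reboot_hour(instance_id, day):
    """Map an instance ID and an ISO date to a stable hour in 0..23."""
    digest = hashlib.sha256(f"{instance_id}:{day}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 24


def due_instances(instance_ids, day, hour):
    """Return the instances whose assigned reboot hour is `hour`."""
    return [i for i in instance_ids if reboot_hour(i, day) == hour]


def handler(event, context):
    """Lambda entry point; assumed to be invoked once per hour."""
    import boto3  # imported lazily so the helpers above work without it

    now = datetime.now(timezone.utc)
    ec2 = boto3.client("ec2")

    # Collect the IDs of all running instances.
    running = []
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                running.append(instance["InstanceId"])

    due = due_instances(running, now.date().isoformat(), now.hour)
    if due:
        ec2.reboot_instances(InstanceIds=due)
    return {"rebooted": due}
```

In a real fleet the filter would probably be narrowed with a tag so that only opted-in instances are rebooted; that choice is left out here.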
  41. 41. No SSH key distribution across instances ● If you’re using AWS, simply install the SSM agent on your instances and disable the SSH daemon. You will still be able to open a console session through SSM to perform your administrative actions. ● If you’re not in AWS, you can use Hashicorp Vault. It provides an SSH backend that allows central management & audit of login identities.
  42. 42. Things that work for 3 teams don’t work for 10 [Chart: issues vs. # of users]
  43. 43. Things that work for 3 teams don’t work for 10. Key learnings here: ● Operational work grows exponentially the more teams you add ● 1 new tool/approach for devs at a time ● Conduct educational courses for big new things like Docker, AWS, Terraform ● First gain the trust of the team you’re challenging with a change ● Documentation is paramount, start it as early as possible
  44. 44. DevOps on Call & transparent SLAs ● We’ve established an “on call” schedule ● Agreed on SLAs & shared them among the teams ● Created a chat room & Jira board Result: significant reduction of the distraction level, better productivity, happier teams. Even if you’re doing well now, you can make it even better with this practice
  45. 45. Encourage feedback ● Ask for it proactively, show that it's important for you ● Public retrospectives ● Respond to the feedback
  46. 46. Summing up
  47. 47. What we’ve achieved ● Immutable infrastructure (done) ○ ECS with immutable data plane ○ Immutable EC2 instances for stateful instances ● Everything as code (done) ○ Infrastructure as code: Terraform, Cloud-init ○ CI/CD as code: Jenkins Now it’s hard to break things, easy to repair them, and easy to track changes.
  48. 48. Vlad Fedosov Director of R&D @Namecheap, Inc vlad.fedosov@gmail.com Or just scan it:
  49. 49. Evolving Terraform experience
