Cloud adoption fails - 5 ways deployments go wrong and 5 solutions

  1. Cloud adoption fails: 5 ways deployments go wrong and 5 solutions
  2. “All happy cloud deployments are alike; each unhappy cloud deployment is unhappy in its own way.” Leo Tolstoy Site Reliability Engineer
  3. I’m Yevgeniy Brikman ybrikman.com
  4. Author
  5. Co-founder of Gruntwork gruntwork.io
  6. At Gruntwork, I’ve seen the cloud adoption journeys of hundreds of companies
  7. I’ve seen some go well. I’ve seen some go poorly.
  8. I've seen things you people wouldn’t believe. DDoS attacks starting fires off the shoulder of Ohio (us-east-2). I watched C-suite foreheads glitter in the dark near their Fargate bills. All those moments will be lost in time, like tears in rain... Image credit: Blade Runner, Warner Bros, 1982
  9. Why is it so hard?
  10. Because everything has changed about how we build software.
  11. The shift to DevOps and the cloud:

      | | Before | After |
      |---|---|---|
      | Dev team | Write code, “toss it over the wall” | Write code, deploy |
      | Ops team | Rack servers, deploy code | Write code, deploy |
      | Servers | Dedicated physical servers | Elastic virtual servers |
      | Connectivity | Static IPs | Dynamic IPs, service discovery |
      | Security | Physical, strong perimeter, high trust | Virtual, end-to-end, zero trust |
      | Infra provisioning | Manual | Infrastructure as Code (IaC) tools |
      | Server configuration | Manual | Configuration management tools |
      | Testing | Manual | Automated testing |
      | Deployments | Manual | Automated |
      | Deployment cadence | Weeks or months | Many times per day |
      | Change process | Change request tickets | Self-service |
      | Change cadence | Weeks or months | Minutes |
  12. Adopting the cloud without acknowledging these changes leads to problems
  13. This talk is about 5 common causes of cloud adoption failure…
  14. Plus 5 solutions based on the patterns that worked across hundreds of companies
  15. The 5 solutions are part of the Gruntwork Production Framework https://docs.gruntwork.io/guides/production-framework/
  16. 1. Do it by hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once Outline
  18. NUMBER 1: FAIL
  19. Deploying by using the web console for your cloud provider: “ClickOps”
  20. Almost everyone starts this way. Almost everyone regrets it.
  21. Problems with ClickOps:
      1. Slow: Hours of clicking to spin up a new environment.
      2. No reuse: Every deploy must be done from scratch. No leverage from previous work.
      3. No audit trail: All info trapped in one person’s head. No versioning.
      4. Error-prone: Manual task = human error. Deployment problems. Snowflake servers. Can’t use tests.
      5. Tedious: No one likes doing slow, repetitive, error-prone, risky work over and over again.
  22. “Realizing your DevOps Engineer left... After deploying everything via ClickOps.” Vasily Vereshchagin Oil on canvas, 1887
  23. Side note: credit to Classic Programmer Paintings for the comic inspiration! https://classicprogrammerpaintings.com/
  24. NUMBER 1: SOLUTION
  25. Create a Service Catalog
  26. A modern Service Catalog.
  27. The modern Service Catalog:
      1. Defined as code: Using tools such as Terraform, CloudFormation, Docker, Kubernetes, etc.
      2. Designed for production use: Not a “5 minute demo,” but production-grade code.
      3. Meets company requirements out-of-the-box: Scalability, HA, security, compliance (e.g., SOC 2, ISO 27001, PCI, HIPAA), etc.
      4. Tested to meet company requirements: Code reviews, static analysis, functional testing, policy enforcement, etc.
      5. Infrastructure and app code: Defines templates and patterns for both infrastructure and applications.
  28. Infrastructure templates: This is your Cloud API. https://docs.gruntwork.io/guides/production-framework/ingredients/service-catalog/infrastructure-templates
  29. Application templates: This is your API between the cloud and your apps. https://docs.gruntwork.io/guides/production-framework/ingredients/service-catalog/application-templates
  30. Real-world example: Gruntwork Service Catalog
  31. Example infrastructure template for EKS
  32. Example application template for Node.js
  33. Key idea #1: Manage everything as code in a Service Catalog.
  34. From manual to automated:
      - Manual provisioning → Infrastructure as code
      - Manual server config → Configuration management
      - Manual app config → Configuration files
      - Manual builds → Continuous integration
      - Manual deployment → Continuous delivery
      - Manual testing → Automated testing
      - Manual policies → Automated policies (OPA)
      - Manual DBA work → Schema migrations
      - Manual specs → Automated specs (BDD)
  35. Recall the problems with ClickOps:
      1. Slow: Hours of clicking to spin up a new environment.
      2. No reuse: Every deploy must be done from scratch. No leverage from previous work.
      3. No audit trail: All info trapped in one person’s head. No reproducibility. No versioning.
      4. Error-prone: Manual task = human error. Every environment a little bit different. No testing.
      5. Tedious: No one likes doing slow, repetitive, error-prone, risky work over and over again.
  36. Advantages of code:
      1. Slow → Fast: Computers can do in seconds what it takes a human hours to do.
      2. No reuse → Reusable: Leverage your previous work and the work of others. Evolve your code over time.
      3. No audit trail → Logged & versioned: Everything is in your version control system, including the full history of changes.
      4. Error-prone → Reliable: Code + automated tests + code reviews dramatically reduce errors.
      5. Tedious → Enjoyable: Writing code and being creative is more fun than repetitive, stressful, manual work.
  37. 1. Do it by hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once Outline
  38. NUMBER 2: FAIL
  39. Making everyone an admin:

          {
            "Version": "2012-10-17",
            "Statement": [
              {
                "Effect": "Allow",
                "Action": "*",
                "Resource": "*"
              }
            ]
          }
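A minimal sketch, in Python, of how a policy like the one on this slide could be flagged automatically before it is attached to anyone. The function and variable names are illustrative assumptions, not part of any AWS SDK, and the check is deliberately narrow: it only catches `Allow` statements where both `Action` and `Resource` include the bare `*` wildcard, so broad but narrower grants (for example `iam:*`) would need additional rules.

```python
import json


def _as_list(value):
    """IAM accepts either a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def admin_statements(policy):
    """Return the Allow statements that grant every action on every resource."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a lone statement may appear without a list
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        if "*" in actions and "*" in resources:
            flagged.append(stmt)
    return flagged


# The policy from the slide.
ADMIN_POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": "*", "Resource": "*"}
  ]
}
""")

# A hypothetical scoped policy for comparison (bucket name is made up).
SCOPED_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}

if __name__ == "__main__":
    for name, policy in [("admin", ADMIN_POLICY), ("scoped", SCOPED_POLICY)]:
        print(name, "admin-equivalent statements:", len(admin_statements(policy)))
```

A check like this could run as one of the static-analysis steps in a pipeline, failing the build whenever the flagged list is non-empty.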
  40. Initially, most companies try to limit permissions…
  41. But IAM is hard Image from Why is AWS IAM So Hard? by Stephen Kuenzli
  42. And frustrating. It’s just “Access Denied” over and over and over again:

          An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied
          (tweak the IAM policy)
          An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied
          (tweak the IAM policy)
          An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied
  43. The inevitable result: “F*ck it, we’ll do it live!” and you make everyone an admin.
  44. Problems when everyone is an admin:
      1. Weak security: Huge blast radius from any mistake. Any compromised credentials may result in a severe security incident. Any guard rails you put in place are ineffective.
      2. Sprawl: Tons of new accounts and resources spun up and no one knows what they are for.
      3. No consistency: Everything is configured differently: logging, networking, security controls, etc.
      4. Difficult to fix: If everyone is an admin, it is very hard to “undo” the damage: you don’t know what they’ve done and you’re never 100% confident you’ve reined things in.
  45. “Attempting to get all the AWS accounts under control” Jacques-Louis David Oil on canvas, 1799
  46. NUMBER 2: SOLUTION
  47. Set up your Landing Zone as early as possible
  48. landing zone noun /ˈlændɪŋ zəʊn/ A streamlined way to create new accounts in your cloud provider that are configured out-of-the-box with best practices (e.g., authentication, authorization, logging, monitoring, tagging, guard rails, etc.).
  49. Key ingredients of a Landing Zone: 1. Account structure 2. Account baselines 3. Account vending machine
  50. Key ingredients of a Landing Zone: 1. Account structure 2. Account baselines 3. Account vending machine
  51. account structure noun /əˈkaʊnt ˈstrʌktʃə(r) / How to configure multiple inter-connected accounts in the cloud to provide isolation, compartmentalization, authentication, authorization, auditing, and reporting.
  52. Each cloud recommends different account structures
  53. Key ingredients of a Landing Zone: 1. Account structure 2. Account baselines 3. Account vending machine
  54. account baseline noun /əˈkaʊnt ˈbeɪslaɪn/ The basic set of controls installed in every account to enforce a common set of best practices (e.g., authentication, authorization, logging, monitoring, tagging, guard rails, etc.).
  55. Account baselines should handle:

      | | Description | Examples |
      |---|---|---|
      | Authentication | User identity, login, MFA | IAM users & roles, SSO, IdPs |
      | Authorization | User permissions and access | IAM policies & groups, ACLs, RBAC |
      | Monitoring | Audit logging, app logging, metrics | CloudTrail, Elastic stack, Grafana |
      | Networking | IPs, routing, DNS, connectivity | VPCs, NAT, Route 53, VPN, SSH, RDP |
      | Hardening | Network hardening, intrusion detection | WAF, IPS, Squid Proxy, GuardDuty |
      | Guard rails | Limit what actions can be taken | IAM policies, SCPs, OPA, AWS Config |
      | Compliance | Enforce compliance requirements | SOC2, ISO 27001, CIS, PCI, HIPAA |
      | Ownership | Associate accounts & resources with teams | Tagging, billing |
  56. Define your account baselines as code:

          module "account_baseline" {
            source = "github.com/gruntwork-io/account-baseline"

            enable_cloudtrail = true
            enable_aws_config = true
            enable_guard_duty = true

            child_accounts = {
              dev   = "accounts+dev@company.com"
              stage = "accounts+stage@company.com"
              prod  = "accounts+prod@company.com"
            }
          }
  57. Key ingredients of a Landing Zone: 1. Account structure 2. Account baselines 3. Account vending machine
  58. account vending machine noun /əˈkaʊnt ˈvendɪŋ məˈʃiːn/ An official tool or process for spinning up new accounts that ensures each of those accounts is configured with the appropriate account baseline.
  59. Key ingredients for an account vending machine:
      1. Self-service: Teams should be able to spin up new accounts for themselves on-demand.
      2. GitOps-driven: Under the hood, manage accounts as code checked into version control.
      3. Apply baselines: The vending machine ensures the proper baseline is applied to every new account.
      4. Provision access: The vending machine not only creates accounts, but also grants teams access to them (e.g., via SSO).
  60. Example vending machine: update a file, commit, CI / CD system deploys it:

          module "account_baseline" {
            source = "github.com/gruntwork-io/account-baseline"

            child_accounts = {
              dev   = "accounts+dev@company.com"
              stage = "accounts+stage@company.com"
              prod  = "accounts+prod@company.com"

              # Add new account
              example = "accounts+example@company.com"
            }
          }
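To make the vending-machine idea concrete, here is a small Python sketch of the reconciliation step that a CI / CD job might perform after such a commit: compare the accounts declared in code against the accounts that already exist, create the missing ones, apply the baseline, and grant access. Everything here is a hypothetical in-memory model; `Account`, `vend_accounts`, the control names, and the `<name>-developers` SSO group convention are assumptions for illustration, and no cloud API is called.

```python
from dataclasses import dataclass, field

# Baseline controls, named after the flags on the account-baseline slide.
BASELINE_CONTROLS = ("cloudtrail", "aws_config", "guard_duty")


@dataclass
class Account:
    name: str
    email: str
    controls: set = field(default_factory=set)
    sso_groups: set = field(default_factory=set)


def vend_accounts(desired, existing):
    """Create every account declared in `desired` that is missing from `existing`.

    `desired` maps account name -> root email (the content of the committed file).
    `existing` maps account name -> Account and is updated in place.
    Returns the names of the accounts that were created.
    """
    created = []
    for name, email in desired.items():
        if name in existing:
            continue  # already vended; nothing to do
        account = Account(name=name, email=email)
        account.controls.update(BASELINE_CONTROLS)    # apply the baseline
        account.sso_groups.add(f"{name}-developers")  # provision access (hypothetical group)
        existing[name] = account
        created.append(name)
    return created


if __name__ == "__main__":
    existing = {
        n: Account(n, f"accounts+{n}@company.com", set(BASELINE_CONTROLS))
        for n in ("dev", "stage", "prod")
    }
    desired = {
        "dev": "accounts+dev@company.com",
        "stage": "accounts+stage@company.com",
        "prod": "accounts+prod@company.com",
        "example": "accounts+example@company.com",  # the newly committed entry
    }
    print("created:", vend_accounts(desired, existing))
```

Running the reconciliation a second time with the same input creates nothing, which is the property that makes it safe to trigger on every commit.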
  61. Key idea #2: Set up your Landing Zone as early as you can.
  62. 1. Do it by hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once Outline
  63. NUMBER 3: FAIL
  64. Deployments are done by humans from their own computers
  65. Even with IaC, relying on a person to do deployments leads to problems
  66. Problems with a person deploying:
      1. Error-prone: Manual process = human error. E.g., fat-fingering a command, forgetting some step.
      2. Not reproducible: E.g., wrong version installed locally, accidentally deploying uncommitted changes.
      3. Low bus factor: Often only 1 or 2 devs can deploy. What if they go on vacation or leave the company?
      4. Race conditions: Different devs accidentally deploy different code (e.g., different branches) = conflicts.
      5. Not secure: Deploying arbitrary changes requires arbitrary (admin) permissions. We already know what happens when you give too many people admin permissions.
  67. “Realizing you just ran terraform destroy in prod.” Gustave Courbet Oil on canvas, 1845
  68. NUMBER 3: SOLUTION
  69. Do all deploys through a CI / CD pipeline
  70. Key CI / CD pipeline features:

      | Feature | Description |
      |---|---|
      | GitOps-driven | The pipeline is triggered by commits to version control |
      | Defined as code | The full workflow should be defined as code |
      | Automated tests | The pipeline should run pre-, during-, and post-deploy checks |
      | Preview environments | Deploy the changes in each PR into an ephemeral environment |
      | Promotion workflows | Promote immutable artifacts across environments: e.g., dev → stage → prod |
      | Approval workflows | For some types of changes, require human approval for deployment to prod |
      | Deployment workflows | Blue/green deploys, rolling deploys, canary deploys, feature toggles |
      | App and infra code | You need workflows for both application and infrastructure code |
  71. The workflows for app & infra code are similar, but with key differences.
  72. Workflows for app & infra code:

      | | Application code | Infrastructure code |
      |---|---|---|
      | Run locally | Run the code on localhost; make a change, refresh | Run the code in the cloud (sandboxes); make a change, redeploy (use stages!) |
      | Code review | Submit pull request with code changes | Submit pull request with code changes |
      | Test | Static analysis: linter; functional tests: unit, integration, e2e | Static analysis: linter, policy enforcement; functional tests: plan, integration |
      | Release | Merge pull request; build immutable, versioned artifact | Merge pull request; create git tag |
      | CI config | CI server has limited permissions; CI server triggers K8S, ECS, EC2, etc. | Isolated worker has admin permissions; CI server triggers isolated worker |
      | Deploy | Promote artifacts: e.g., dev → stage → prod; rolling, blue/green, canary, feature flags | Promote tags: e.g., dev → stage → prod; plan, approve, deploy, hope |
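The promotion and approval workflows named on these slides can be illustrated with a short Python sketch. It models the rule that the same immutable version moves dev → stage → prod in order, and that prod additionally requires a human approval. The `promote` function, its arguments, and the in-memory `deployed` mapping are assumptions for illustration; a real pipeline would record this state in its own system.

```python
# Environments in the order a version must pass through them.
PROMOTION_ORDER = ["dev", "stage", "prod"]


def promote(deployed, version, target, approved=False):
    """Record `version` as deployed to `target` if the promotion rules allow it.

    `deployed` maps environment name -> currently deployed version.
    Raises ValueError if the version has not been deployed to the previous
    environment, and PermissionError if prod is targeted without approval.
    """
    index = PROMOTION_ORDER.index(target)  # ValueError for unknown environments
    if index > 0:
        previous = PROMOTION_ORDER[index - 1]
        if deployed.get(previous) != version:
            raise ValueError(
                f"{version} must be deployed to {previous} before {target}"
            )
    if target == "prod" and not approved:
        raise PermissionError("deployments to prod require human approval")
    deployed[target] = version
    return deployed


if __name__ == "__main__":
    state = {}
    promote(state, "v1.2.0", "dev")
    promote(state, "v1.2.0", "stage")
    promote(state, "v1.2.0", "prod", approved=True)
    print(state)
```

Because the version string never changes between environments, what reaches prod is exactly the artifact (or git tag, for infrastructure code) that was exercised in dev and stage.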
  73. Key idea #3: The CI / CD pipeline is the only thing that can deploy to prod.
  74. No one has write access to prod (let alone admin access) except the pipeline.
  75. Key idea #4: The CI / CD pipeline will only deploy vetted services from the Service Catalog to prod.
  76. The Catalog + Pipeline are the only path to prod; the API between Devs and Ops.
  77. Key idea #5: The CI / CD pipeline protects its permissions for prod.
  78. To deploy arbitrary infra changes, you need arbitrary (admin) permissions!

          {
            "Version": "2012-10-17",
            "Statement": [
              {
                "Effect": "Allow",
                "Action": "*",
                "Resource": "*"
              }
            ]
          }
  79. Giving your CI server direct access to admin permissions considered harmful.
  80. This is a BAD combination: 1. Everyone in your company can access your CI server 2. You use the CI server to execute arbitrary code 3. The CI server has admin permissions
  81. Congratulations, everyone in your company has admin permissions again!
  82. And so do hackers outside your company! https://research.nccgroup.com/2022/01/13/10-real-world-stories-of-how-weve-compromised-ci-cd-pipelines/
  83. The solution: only give admin permissions to an isolated worker
  84. The isolated worker:
      1. Is highly locked down: Unlike the CI server, no one at the company has direct access to the worker.
      2. Can only be triggered by the CI server: The CI server only has permissions to trigger the worker via an API & stream logs from it.
      3. Exposes a limited, locked-down API: The worker only allows you to run certain commands (e.g., terraform apply), in certain repos, in certain branches, in certain folders, etc.
      4. Minimizes the potential damage: If an attacker gets access to your CI server, the worst they can do is trigger a deploy on your own code. They do NOT get admin permissions directly.
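As a rough illustration of the "limited, locked-down API" described above, the Python sketch below validates an incoming deploy request against allowlists of commands, repos, branches, and folders before the worker would act on it. The request shape, the allowlist values (including the repo URL), and the `validate` function are all hypothetical; the slides do not specify how the worker's API is implemented.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRequest:
    command: str
    repo: str
    branch: str
    folder: str


# Hypothetical allowlists; a real worker would load these from its own config.
ALLOWED_COMMANDS = {"terraform plan", "terraform apply"}
ALLOWED_REPOS = {"github.com/example-org/infrastructure-live"}
ALLOWED_BRANCHES = {"main"}
ALLOWED_FOLDER_PREFIXES = ("dev/", "stage/", "prod/")


def validate(request):
    """Return a list of rejection reasons; an empty list means the request is allowed."""
    errors = []
    if request.command not in ALLOWED_COMMANDS:
        errors.append(f"command not allowed: {request.command}")
    if request.repo not in ALLOWED_REPOS:
        errors.append(f"repo not allowed: {request.repo}")
    if request.branch not in ALLOWED_BRANCHES:
        errors.append(f"branch not allowed: {request.branch}")
    # Normalize first so that "prod/../../etc" cannot slip past the prefix check.
    folder = posixpath.normpath(request.folder) + "/"
    if not folder.startswith(ALLOWED_FOLDER_PREFIXES):
        errors.append(f"folder not allowed: {request.folder}")
    return errors


if __name__ == "__main__":
    ok = DeployRequest("terraform apply", "github.com/example-org/infrastructure-live", "main", "prod/vpc")
    bad = DeployRequest("terraform destroy", "github.com/example-org/infrastructure-live", "feature-x", "prod/../../etc")
    print(validate(ok))
    print(validate(bad))
```

The point of the design is that the CI server only ever submits a request like this; the admin credentials stay on the worker, which decides for itself whether the request falls inside what it is willing to run.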
  85. 1. Do it by hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once Outline
  86. NUMBER 4: FAIL
  87. Only Ops is allowed to deploy
  88. The Ops team, trying to protect the company, acts as a gatekeeper.
  89. But that usually backfires:
  90. Inevitably, the Ops team is overwhelmed and becomes a bottleneck
  91. So the Dev team finds a workaround…
  92. So Ops adds more process… but that just makes things even more backed up.
  93. “The Ops team explains the new 95-step change request process to the Dev team.” Ferdinand Pauwels Oil on canvas, 1872
  94. NUMBER 4: SOLUTION
  95. Provide developers with self-service
  96. Key idea #6: Any team can deploy their own infra + apps from the Service Catalog
  97. The cloud is primarily a tool for Devs, not Ops.
  98. One of the biggest benefits of the cloud: Devs can be more self-sufficient.
  99. Ops team as a gatekeeper: Devs aren’t self-sufficient, go slow.
  100. Ops team as an enabler: Devs are self-sufficient, go fast.
  101. Enable self-service safely via the Catalog + Pipeline: your API on top of the cloud.
  102. Devs should have sandbox accounts for easy testing, learning, etc.
  103. Run cleanup tools in cron jobs to remove old resources in sandbox accounts:

      | Tool | Clouds | Features |
      |---|---|---|
      | cloud-nuke | AWS | Delete all resources older than a certain date; in a certain region; of a certain type. |
      | safe-scrub | Google Cloud | Safely delete unwanted resources in a GCP project |
      | Azure PowerShell | Azure | Includes native commands to delete Resource Groups |
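The selection logic such a cleanup job applies (older than a cutoff, optionally limited to certain regions and types) can be sketched in a few lines of Python. This is not how cloud-nuke or the other tools are implemented; `select_for_cleanup` and the resource dictionaries are hypothetical, and the sketch only decides what would be deleted, without calling any cloud API.

```python
from datetime import datetime, timedelta, timezone


def select_for_cleanup(resources, now, max_age, regions=None, types=None):
    """Return the ids of resources older than `max_age`, optionally filtered.

    Each resource is a dict with "id", "type", "region", and a timezone-aware
    "created_at" datetime. `regions` and `types` are optional sets; None means
    no filtering on that attribute.
    """
    cutoff = now - max_age
    selected = []
    for resource in resources:
        if resource["created_at"] >= cutoff:
            continue  # too new to delete
        if regions is not None and resource["region"] not in regions:
            continue
        if types is not None and resource["type"] not in types:
            continue
        selected.append(resource["id"])
    return selected


if __name__ == "__main__":
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    sandbox = [
        {"id": "i-old", "type": "ec2", "region": "us-east-2",
         "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
        {"id": "i-new", "type": "ec2", "region": "us-east-2",
         "created_at": datetime(2024, 1, 9, 12, tzinfo=timezone.utc)},
        {"id": "b-old", "type": "s3", "region": "eu-west-1",
         "created_at": datetime(2023, 12, 1, tzinfo=timezone.utc)},
    ]
    print(select_for_cleanup(sandbox, now, timedelta(days=2), types={"ec2"}))
```

Keeping the selection separate from the deletion makes it easy to run the job in a dry-run mode first and review what it would remove from a sandbox account.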
  104. In prod, Devs deploy via self-service with the Service Catalog + CI / CD Pipeline.
  105. Key self-service features:
      1. GitOps-driven: Everything is managed as code and driven by commits to version control. Allows code review, testing, audit log, versioning, etc.
      2. UI-driven (optional): Web UI as a layer on top of the GitOps layer to make it more accessible.
      3. Focus on common use cases: E.g., account vending machine, data store deployment, app deployment. Don’t have to solve everything right away.
      4. Access controls: Different teams can access/deploy different things. E.g., NetOps team might be able to deploy networking, whereas app teams can deploy orchestration tools and data stores.
  106. Example of self-service: update a file, commit, CI / CD system deploys it:

          module "account_baseline" {
            source = "github.com/gruntwork-io/account-baseline"

            child_accounts = {
              dev   = "accounts+dev@company.com"
              stage = "accounts+stage@company.com"
              prod  = "accounts+prod@company.com"

              # Add new account
              example = "accounts+example@company.com"
            }
          }
  107. Key idea #7: Any team can contribute to the Service Catalog.
  108. Modern software involves many moving pieces (diagram: stage and prod environments)
  109. If only Ops can add those pieces to the Service Catalog, that’ll be a bottleneck
  110. Instead, allow everyone to contribute and enforce company requirements through code reviews and automated tests:

          Automated tests:
          ✓ tflint
          ✓ tfsec
          ✓ OPA
          ✓ steampipe
          ✓ checkov
          ✓ Terratest
          Passed: 6. Failed: 0. Skipped: 0.
          Test run successful.
  111. 1. Do it by hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once Outline
  112. NUMBER 5: FAIL
  113. Not taking into account ongoing maintenance work
  114. Not only are there many moving pieces, but they’re all also constantly changing (diagram: stage and prod environments).
  115. AWS is constantly changing. Sources: “The last S3 security document that we’ll ever need, and how to use it”; “How To Keep Up With AWS Announcements”
  116. Docker is constantly changing. Source: Docker Releases
  117. Kubernetes is constantly changing. Source: Kubernetes Wikipedia page
  118. Terraform is constantly changing. Source: Terraform Upgrade Guides
  119. Many companies assume that the initial cloud deployment is the hard part.
  120. It isn’t.
  121. “Software maintenance cost is increasingly growing and estimates showed that about 90% of software life cost is related to its maintenance phase.” Source: “Which Factors Affect Software Projects Maintenance Cost More?”, Sayed Mehdi Hejazi Dehaghani and Nafiseh Hajrahimi
  122. If you don’t have a plan for maintenance, all that code you wrote will rot.
  123. “Coming back to that Terraform codebase after 6 months.” Eero Järnefelt Oil on canvas, 1893
  124. NUMBER 5: SOLUTION
  125. Set up automatic updates
  126. Key auto-update features:
      1. Automation-driven: Updates are discovered and the code is updated automatically. No relying on a human to remember it. Update cadence should be configurable.
      2. GitOps-driven: The code is updated via automated pull requests.
      3. Automated testing: You must have automated tests in place and running against each pull request to let you know if the updated code still works.
      4. Automated deployment: Once a pull request is merged, it must deploy automatically via the CI / CD pipeline, promoting the update across environments: e.g., dev → stage → prod.
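The "updates are discovered automatically" step can be sketched in Python as a comparison between the versions pinned in code and the versions available upstream, producing one pull-request proposal per outdated dependency. The function names, the `vMAJOR.MINOR.PATCH` tag format, and the `auto-update/...` branch naming are assumptions for illustration; opening the actual pull request and running the tests is left to whatever tooling the pipeline uses.

```python
def parse_version(tag):
    """Turn a tag such as 'v1.10.2' into a tuple (1, 10, 2) for numeric comparison."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def propose_updates(pinned, available):
    """Return one update proposal per module whose pinned version is outdated.

    `pinned` maps module name -> version tag currently used in code.
    `available` maps module name -> list of version tags published upstream.
    """
    proposals = []
    for module, current in sorted(pinned.items()):
        candidates = available.get(module, [])
        if not candidates:
            continue  # nothing known upstream; leave the pin alone
        latest = max(candidates, key=parse_version)
        if parse_version(latest) > parse_version(current):
            proposals.append({
                "module": module,
                "from": current,
                "to": latest,
                "branch": f"auto-update/{module}-{latest}",  # hypothetical naming
            })
    return proposals


if __name__ == "__main__":
    pinned = {"vpc": "v0.9.0", "eks": "v1.4.2"}
    available = {"vpc": ["v0.9.0", "v0.10.1"], "eks": ["v1.4.2"]}
    for proposal in propose_updates(pinned, available):
        print(proposal)
```

Versions are compared as tuples of integers rather than as strings, since a plain string comparison would rank `v0.9.0` above `v0.10.1`.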
  127. Key idea #8: Updates are pushed to the code via PRs, automatically.
  128. Key idea #9: Code without automated tests will rot.
  129. How to do automated testing for infrastructure code https://terratest.gruntwork.io/docs/getting-started/introduction/#watch-how-to-test-infrastructure-code
  130. 1. Do it by hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once Outline
  131. Let’s recap:
  132. Key ideas:
      1. Manage everything as code in a Service Catalog.
      2. Set up your Landing Zone as early as you can.
      3. Only the CI / CD Pipeline can deploy to prod.
      4. The CI / CD Pipeline only deploys from the Service Catalog.
      5. The CI / CD Pipeline protects its admin permissions.
      6. Any team can deploy infra + apps from the Service Catalog.
      7. Any team can contribute to the Service Catalog.
      8. Updates are pushed to the code via PRs, automatically.
      9. Code without automated tests will rot.
  133. 5 cloud adoption fails and solutions:

      | Fail | Description | Solution |
      |---|---|---|
      | Do it by hand | ClickOps | Service Catalog |
      | Do it live | Everyone is an admin | Landing Zone |
      | Do it on my machine | People deploying from their computers | CI / CD Pipeline |
      | Do it only on my machine | Only Ops can deploy | Self-Service |
      | Do it once | Not taking maintenance into account | Automatic Updates |
  134. The 5 solutions are part of the Gruntwork Production Framework https://docs.gruntwork.io/guides/production-framework/
  135. If you use this framework, here’s the experience for your Ops team:
  136. Step 1: Create a Service Catalog. Everything defined as code. Works for app + infra. You could build from scratch or on top of an existing one (e.g., Gruntwork Service Catalog).
  137. Step 2: Set up your Landing Zone. Set up your basic account structure, define account baselines, etc.
  138. Step 3: Set up a CI / CD pipeline. Ensure it’s the only way to deploy to prod. Make it work for apps + infra.
  139. Step 4: Provide self-service. Enable all teams to deploy. Start with a GitOps solution. Add UI later.
  140. Step 5: Set up automatic updates. PRs opened automatically. Automated tests in place for app + infra code.
  141. And here’s the experience for your Dev team:
  142. Step 1: Scaffold a new app. Leverage vetted application templates from the Service Catalog and the logic built in: e.g., service discovery, packaging, monitoring, testing, etc.
  143. Step 2: Deploy infrastructure. Leverage Self-Service + Service Catalog + CI / CD Pipeline.
  144. Step 3: Iterate on the app. Leverage CI / CD built into the templates to deploy subsequent changes.
  145. Step 4: Debug issues. Leverage monitoring, logging, alerting, etc. built into the templates.
  146. Step 5: Stay up to date. Leverage auto-update built into the templates. Automated PRs + tests.
  147. “The Cloud you always wanted.” Thomas Cole Oil on canvas, 1836
  148. Questions? info@gruntwork.io