Cloud adoption fails - 5 ways deployments go wrong and 5 solutions


“All happy cloud
deployments are alike;
each unhappy cloud
deployment is unhappy
in its own way.”
Leo Tolstoy
Site Reliability Engineer

I’m
Yevgeniy
Brikman
ybrikman.com

Co-founder of
Gruntwork
gruntwork.io

At Gruntwork,
I’ve seen the
cloud adoption
journeys of
hundreds of
companies

I’ve seen some go well.
I’ve seen some go poorly.

I've seen things you people
wouldn’t believe. DDoS attacks
starting fires off the shoulder
of Ohio (us-east-2). I watched
C-suite foreheads glitter in the
dark near their Fargate bills.
All those moments will be lost
in time, like tears in rain...
Image credit: Blade Runner, Warner Bros, 1982

Because everything has changed
about how we build software.

The shift to DevOps and the cloud:

|  | Before | After |
| Dev team | Write code, “toss it over the wall” | Write code, deploy |
| Ops team | Rack servers, deploy code | Write code, deploy |
| Servers | Dedicated physical servers | Elastic virtual servers |
| Connectivity | Static IPs | Dynamic IPs, service discovery |
| Security | Physical, strong perimeter, high trust | Virtual, end-to-end, zero trust |
| Infra provisioning | Manual | Infrastructure as Code (IaC) tools |
| Server configuration | Manual | Configuration management tools |
| Testing | Manual | Automated testing |
| Deployments | Manual | Automated |
| Deployment cadence | Weeks or months | Many times per day |
| Change process | Change request tickets | Self-service |
| Change cadence | Weeks or months | Minutes |

Adopting the cloud without acknowledging
these changes leads to problems

This talk is about 5 common causes of
cloud adoption failure…

Plus 5 solutions
based on the
patterns that
worked across
hundreds of
companies

The 5 solutions
are part of the
Gruntwork
Production
Framework
https://docs.gruntwork.io/guides/production-framework/

Outline
1. Do it by hand
2. Do it live
3. Do it on my machine
4. Do it only on my machine
5. Do it once

Deploying by using the web console
for your cloud provider: “ClickOps”

Almost everyone starts this way.
Almost everyone regrets it.

Problems with ClickOps:
1. Slow
Hours of clicking to spin up a new environment.
2. No reuse
Every deploy must be done from scratch. No leverage from previous work.
3. No audit trail
All info trapped in one person’s head. No versioning.
4. Error-prone
Manual task = human error. Deployment problems. Snowflake servers. Can’t use tests.
5. Tedious
No one likes doing slow, repetitive, error-prone, risky work over and over again.

“Realizing your
DevOps Engineer left...
After deploying
everything via
ClickOps.”
Vasily Vereshchagin
Oil on canvas, 1887

Side note:
credit to Classic
Programmer
Paintings for the
comic inspiration!
https://classicprogrammerpaintings.com/

The modern Service Catalog:
1. Defined as code
Using tools such as Terraform, CloudFormation, Docker, Kubernetes, etc.
2. Designed for production use
Not a “5 minute demo,” but production-grade code.
3. Meet company requirements out-of-the-box
Scalability, HA, security, compliance (e.g., SOC 2, ISO 27001, PCI, HIPAA), etc.
4. Tested to meet company requirements
Code reviews, static analysis, functional testing, policy enforcement, etc.
5. Infrastructure and app code
Defines templates and patterns for both infrastructure and applications.

Infrastructure
templates
This is your Cloud API
https://docs.gruntwork.io/guides/production-framework/ingredients/service-catalog/infrastructure-templates

Application
templates
This is your API between the
cloud and your apps
https://docs.gruntwork.io/guides/production-framework/ingredients/service-catalog/application-templates

Real-world example: Gruntwork Service Catalog

Example infrastructure template for EKS

Example application template for Node.js

Key idea #1: Manage everything as
code in a Service Catalog.

Manual provisioning → Infrastructure as code
Manual server config → Configuration management
Manual app config → Configuration files
Manual builds → Continuous integration
Manual deployment → Continuous delivery
Manual testing → Automated testing
Manual policies → Automated policies (OPA)
Manual DBA work → Schema migrations
Manual specs → Automated specs (BDD)

Recall the problems with ClickOps:
1. Slow
Hours of clicking to spin up a new environment.
2. No reuse
Every deploy must be done from scratch. No leverage from previous work.
3. No audit trail
All info trapped in one person’s head. No reproducibility. No versioning.
4. Error-prone
Manual task = human error. Every environment a little bit different. No testing.
5. Tedious
No one likes doing slow, repetitive, error-prone, risky work over and over again.

Advantages of code:
1. Slow → Fast
Computers can do in seconds what it takes a human hours to do.
2. No reuse → Reusable
Leverage your previous work and the work of others. Evolve your code over time.
3. No audit trail → Logged & versioned
Everything is in your version control system, including the full history of changes.
4. Error-prone → Reliable
Code + automated tests + code reviews dramatically reduce errors.
5. Tedious → Enjoyable
Writing code and being creative is more fun than repetitive, stressful, manual work.
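
To make the “reusable” and “logged & versioned” points concrete, here is a minimal sketch that assumes you generate Terraform's JSON configuration syntax from Python. The function name, the bucket naming scheme, and the tags are hypothetical and chosen only for illustration; they are not taken from the Gruntwork Service Catalog.

```python
import json


def environment_config(env_name: str) -> dict:
    """Build a Terraform JSON configuration for one environment.

    The same template is reused for every environment; only the name changes.
    The bucket name and tags are illustrative assumptions.
    """
    return {
        "resource": {
            "aws_s3_bucket": {
                "logs": {
                    "bucket": f"example-logs-{env_name}",
                    "tags": {"Environment": env_name, "ManagedBy": "terraform"},
                }
            }
        }
    }


# One definition, three environments: no clicking, no drift between them.
CONFIGS = {env: environment_config(env) for env in ("dev", "stage", "prod")}


def render(env_name: str) -> str:
    """Serialize a configuration so it can be committed as <env>/main.tf.json."""
    return json.dumps(CONFIGS[env_name], indent=2, sort_keys=True)
```

Because the output is deterministic, the rendered files can be checked into version control and every change shows up as a reviewable diff.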

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
Making everyone an admin

Initially, most companies try to limit
permissions…

But IAM is hard
Image from Why is AWS IAM So Hard? by Stephen Kuenzli

An error occurred (AccessDenied) when calling the
ListBuckets operation: Access Denied
(tweak the IAM policy)
(tweak the IAM policy)
And frustrating. It’s just “Access Denied”
over and over and over again.

The inevitable result: “F*ck it, we’ll do it
live!” and you make everyone an admin.

Problems with making everyone an admin:
1. Weak security
Huge blast radius from any mistake. Any compromised credentials may result in a
severe security incident. Any guard rails you put in place are ineffective.
2. Sprawl
Tons of new accounts and resources spun up and no one knows what they are for.
3. No consistency
Everything is configured differently: logging, networking, security controls, etc.
4. Difficult to fix it
If everyone is an admin, very hard to “undo” the damage: you don’t know what they’ve
done and you’re never 100% confident you’ve reined things in.
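
One way to catch the “everyone is an admin” policy before it spreads is a small check in code review or CI. The sketch below is an assumption about how such a check could look, not a tool from the talk: it only flags the literal wildcard policy shown earlier (Allow, Action "*", Resource "*") and does not try to evaluate service-level wildcards such as "s3:*". The policy constants and the bucket ARN are illustrative.

```python
def as_list(value):
    """IAM allows a single value or a list in these fields; normalize to a list."""
    return value if isinstance(value, list) else [value]


def find_admin_statements(policy: dict) -> list:
    """Return the Allow statements that grant every action on every resource."""
    flagged = []
    for statement in as_list(policy.get("Statement", [])):
        actions = as_list(statement.get("Action", []))
        resources = as_list(statement.get("Resource", []))
        if statement.get("Effect") == "Allow" and "*" in actions and "*" in resources:
            flagged.append(statement)
    return flagged


# The policy from the slide: full admin.
ADMIN_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}

# A narrower, hypothetical policy: read-only access to one bucket.
SCOPED_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                "arn:aws:s3:::example-bucket",
                "arn:aws:s3:::example-bucket/*",
            ],
        }
    ],
}
```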

“Attempting to
get all the AWS
accounts under
control”
Jacques-Louis David
Oil on canvas, 1799

Set up your Landing Zone as
early as possible

landing zone noun
/ˈlændɪŋ zəʊn/
A streamlined way to create new accounts in your cloud provider that are
configured out-of-the-box with best practices (e.g., authentication, authorization,
logging, monitoring, tagging, guard rails, etc.).

Key ingredients of a Landing Zone:
1. Account structure
2. Account baselines
3. Account vending machine

account structure noun
/əˈkaʊnt ˈstrʌktʃə(r)/
How to configure multiple inter-connected accounts in the cloud to provide
isolation, compartmentalization, authentication, authorization, auditing, and
reporting.

Each cloud recommends different
account structures

account baseline noun
/əˈkaʊnt ˈbeɪslaɪn/
The basic set of controls installed in every account to enforce a common set of
best practices (e.g., authentication, authorization, logging, monitoring, tagging,
guard rails, etc.).

Account baselines should handle:

|  | Description | Examples |
| Authentication | User identity, login, MFA | IAM users & roles, SSO, IdPs |
| Authorization | User permissions and access | IAM policies & groups, ACLs, RBAC |
| Monitoring | Audit logging, app logging, metrics | CloudTrail, Elastic stack, Grafana |
| Networking | IPs, routing, DNS, connectivity | VPCs, NAT, Route 53, VPN, SSH, RDP |
| Hardening | Network hardening, intrusion detection | WAF, IPS, Squid Proxy, GuardDuty |
| Guard rails | Limit what actions can be taken | IAM policies, SCPs, OPA, AWS Config |
| Compliance | Enforce compliance requirements | SOC 2, ISO 27001, CIS, PCI, HIPAA |
| Ownership | Associate accounts & resources with teams | Tagging, billing |

module "account_baseline" {
  source = "github.com/gruntwork-io/account-baseline"

  enable_cloudtrail = true
  enable_aws_config = true
  enable_guard_duty = true

  child_accounts = {
    dev   = "accounts+dev@company.com"
    stage = "accounts+stage@company.com"
    prod  = "accounts+prod@company.com"
  }
}
Define your account baselines as code
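
Once the baseline is code, you can also check it in code. The sketch below assumes each account's settings are available as a plain dictionary keyed by the same flag names used in the module above; the function name and the sample accounts are hypothetical.

```python
# Controls every account is expected to enable (names follow the module above).
REQUIRED_CONTROLS = ("enable_cloudtrail", "enable_aws_config", "enable_guard_duty")


def missing_controls(account_settings: dict) -> list:
    """Return the required controls that are absent or switched off."""
    return [
        control
        for control in REQUIRED_CONTROLS
        if not account_settings.get(control, False)
    ]


# Hypothetical settings for two accounts.
ACCOUNTS = {
    "dev": {
        "enable_cloudtrail": True,
        "enable_aws_config": True,
        "enable_guard_duty": True,
    },
    "legacy": {
        "enable_cloudtrail": True,
        "enable_aws_config": False,
    },
}

# Map each non-compliant account to the controls it is missing.
VIOLATIONS = {
    name: missing_controls(settings)
    for name, settings in ACCOUNTS.items()
    if missing_controls(settings)
}
```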

account vending machine noun
/əˈkaʊnt ˈvendɪŋ məˈʃiːn/
An official tool or process for spinning up new accounts which ensures that each of
those accounts is configured with the appropriate account baseline.

Key ingredients for an account vending machine:
1. Self-service
Teams should be able to spin up new accounts for themselves on-demand.
2. GitOps-driven
Under the hood, manage accounts as code checked into version control.
3. Apply baselines
The vending machine ensures the proper baseline is applied to every new account.
4. Provision access
The vending machine not only creates accounts, but also grants teams access to them
(e.g., via SSO).

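module "account_baseline" {
  # ... existing settings unchanged ...
  child_accounts = {
    # ... existing accounts unchanged ...
    # Add new account
    example = "accounts+example@company.com"
  }
}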
Example vending machine: update a
file, commit, CI / CD system deploys it
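
Behind that one-line change, the vending machine has a few jobs: create the account, apply the baseline, and grant access. Here is a rough sketch of that logic under stated assumptions: the `Account` type, the `vend_account` function, the baseline flags, and the email scheme are all hypothetical, and a real implementation would call your cloud provider and SSO system instead of updating a dictionary.

```python
from dataclasses import dataclass, field

# Baseline applied to every new account (flag names are illustrative).
BASELINE = {
    "enable_cloudtrail": True,
    "enable_aws_config": True,
    "enable_guard_duty": True,
}


@dataclass
class Account:
    name: str
    email: str
    baseline: dict
    sso_groups: list = field(default_factory=list)


def vend_account(accounts: dict, name: str, team: str,
                 email_domain: str = "company.com") -> Account:
    """Register a new account with the baseline applied and team access granted."""
    if name in accounts:
        raise ValueError(f"account {name!r} already exists")
    account = Account(
        name=name,
        email=f"accounts+{name}@{email_domain}",
        baseline=dict(BASELINE),   # copy so accounts cannot mutate the shared baseline
        sso_groups=[team],         # the requesting team gets access via SSO
    )
    accounts[name] = account
    return account
```

Refusing duplicates and copying the baseline are small choices, but they are what keep a self-service process from turning back into sprawl.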

Key idea #2: Set up your Landing Zone
as early as you can.

Deployments are done by humans
from their own computers

Even with IaC, relying on a person to do
deployments leads to problems

Problems with a person deploying:
1. Error-prone
Manual process = human error. E.g., fat-fingering a command, forgetting some step.
2. Not reproducible
E.g., wrong version installed locally, accidentally deploying uncommitted changes.
3. Low bus factor
Often only 1 or 2 devs can deploy. What if they go on vacation or leave the company?
4. Race conditions
Different devs accidentally deploy different code (e.g., different branches) = conflicts.
5. Not secure
Deploying arbitrary changes requires arbitrary—admin—permissions. We already know
what happens when you give too many people admin permissions.

“Realizing you
just ran terraform
destroy in prod.”
Gustave Courbet
Oil on canvas, 1845

Do all deploys through a
CI / CD pipeline

Key CI / CD pipeline features:

|  | Description |
| GitOps-driven | The pipeline is triggered by commits to version control |
| Defined as code | The full workflow should be defined as code |
| Automated tests | The pipeline should run checks before, during, and after each deploy |
| Preview environments | Deploy the changes in each PR into an ephemeral environment |
| Promotion workflows | Promote immutable artifacts across environments: e.g., dev → stage → prod |
| Approval workflows | For some types of changes, require human approval for deployment to prod |
| Deployment workflows | Blue/green deploys, rolling deploys, canary deploys, feature toggles |
| App and infra code | You need workflows for both application and infrastructure code |
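
The promotion and approval workflows can be sketched in a few lines. This is an illustration under assumptions, not how any particular CI / CD product works: `promote`, `deploy_and_verify`, and the approval rule are hypothetical names, and the callback stands in for whatever deploys the artifact and runs its checks.

```python
from typing import Callable, List, Optional

ENVIRONMENTS = ["dev", "stage", "prod"]
REQUIRES_APPROVAL = {"prod"}


def promote(artifact: str,
            deploy_and_verify: Callable[[str, str], bool],
            approved_by: Optional[str] = None) -> List[str]:
    """Promote one immutable artifact through the environments in order.

    Returns the environments where the deploy passed its checks. Promotion
    stops at the first failure, or before any environment that needs a human
    approval that has not been given.
    """
    reached = []
    for env in ENVIRONMENTS:
        if env in REQUIRES_APPROVAL and approved_by is None:
            break  # wait for a human before touching this environment
        if not deploy_and_verify(artifact, env):
            break  # failed checks stop the promotion
        reached.append(env)
    return reached
```

The same artifact string is passed to every environment, which is the point of promoting immutable artifacts: what you tested in stage is exactly what reaches prod.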

The workflows for app & infra code are
similar, but with key differences.

Workflows for app & infra code:

|  | Application code | Infrastructure code |
| Run locally | Run the code on localhost; make a change, refresh | Run the code in the cloud (sandboxes); make a change, redeploy (use stages!) |
| Code review | Submit pull request with code changes | Submit pull request with code changes |
| Test | Static analysis: linter; functional tests: unit, integration, e2e | Static analysis: linter, policy enforcement; functional tests: plan, integration |
| Release | Merge pull request; build immutable, versioned artifact | Merge pull request; create git tag |
| CI config | CI server has limited permissions; CI server triggers K8S, ECS, EC2, etc. | Isolated worker has admin permissions; CI server triggers isolated worker |
| Deploy | Promote artifacts: e.g., dev → stage → prod; rolling, blue/green, canary, feature flags | Promote tags: e.g., dev → stage → prod; plan, approve, deploy, hope |

Key idea #3: The CI / CD pipeline is the
only thing that can deploy to prod.

No one has write access to prod (let
alone admin access) except the pipeline.

Key idea #4: The CI / CD pipeline will
only deploy vetted services from the
Service Catalog to prod.

The Catalog + Pipeline are the only path
to prod; the API between Devs and Ops.

Key idea #5: The CI / CD pipeline
protects its permissions for prod.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
To deploy arbitrary infra changes, you
need arbitrary (admin) permissions!

Giving your CI server direct access to
admin permissions considered harmful.

This is a BAD combination:
1. Everyone in your company can access your CI server
2. You use the CI server to execute arbitrary code
3. The CI server has admin permissions

Congratulations, everyone in your
company has admin permissions again!

And so do
hackers
outside your
company!
https://research.nccgroup.com/2022/01/13/10-real-world-stories-of-how-weve-compromised-ci-cd-pipelines/

The solution: only give admin
permissions to an isolated worker

The isolated worker:
1. Is highly locked down
Unlike the CI server, no one at the company has direct access to the worker.
2. Can only be triggered by the CI server
The CI server only has permissions to trigger the worker via an API & stream logs from it.
3. Exposes a limited, locked-down API
The worker only allows you to run certain commands (e.g., terraform apply), in certain
repos, in certain branches, in certain folders, etc.
4. Minimizes the potential damage
If an attacker gets access to your CI server, the worst they can do is trigger a deploy on
your own code. They do NOT get admin permissions directly.
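
The “limited, locked-down API” boils down to an allowlist check that runs inside the worker before anything executes. A minimal sketch, assuming the CI server sends a request with four fields; the field names, the repo URL, and the allowlists are hypothetical and would come from your own configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRequest:
    command: str
    repo: str
    branch: str
    folder: str


# Everything not listed here is refused (values are illustrative).
ALLOWED_COMMANDS = {"terraform plan", "terraform apply"}
ALLOWED_REPOS = {"github.com/example-org/infrastructure-live"}
ALLOWED_BRANCHES = {"main"}
ALLOWED_FOLDER_PREFIXES = ("stage/", "prod/")


def rejection_reasons(request: DeployRequest) -> list:
    """Return why the worker refuses a request; an empty list means it may run."""
    reasons = []
    if request.command not in ALLOWED_COMMANDS:
        reasons.append("command not allowed")
    if request.repo not in ALLOWED_REPOS:
        reasons.append("repo not allowed")
    if request.branch not in ALLOWED_BRANCHES:
        reasons.append("branch not allowed")
    # Reject path traversal as well as folders outside the allowed prefixes.
    traversal = ".." in request.folder.split("/")
    if traversal or not request.folder.startswith(ALLOWED_FOLDER_PREFIXES):
        reasons.append("folder not allowed")
    return reasons
```

The check runs in the worker, not in the CI server, so compromising the CI server does not let an attacker change the rules.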

The Ops team, trying to protect the
company, acts as a gatekeeper.

Inevitably, the Ops team is overwhelmed
and becomes a bottleneck

So the Dev team finds a workaround…

So Ops adds more process… but that just
makes things even more backed up.

“The Ops team
explains the new
95-step change
request process
to the Dev team.”
Ferdinand Pauwels
Oil on canvas, 1872

Provide
developers with
self-service

Key idea #6: Any team can deploy their
own infra + apps from the Service Catalog

The cloud is primarily a tool for Devs,
not Ops.

One of the biggest benefits of the cloud:
Devs can be more self-sufficient.

Ops team as a gatekeeper: Devs
aren’t self-sufficient, go slow.

Ops team as enabler: Devs are
self-sufficient, go fast.

Enable self-service safely via the Catalog
+ Pipeline: your API on top of the cloud.

Devs should have sandbox accounts
for easy testing, learning, etc.

Run cleanup tools in cron jobs to remove old resources in sandbox accounts:

| Tool | Clouds | Features |
| cloud-nuke | AWS | Delete all resources older than a certain date; in a certain region; of a certain type. |
| safe-scrub | Google Cloud | Safely delete unwanted resources in a GCP project |
| Azure PowerShell | Azure | Includes native commands to delete Resource Groups |
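
The selection logic behind a cleanup job is simple enough to sketch. This is not the API of cloud-nuke or any of the tools listed; it is a hypothetical filter over an inventory you have already fetched, with made-up field names, that decides what a nightly job would delete.

```python
from datetime import datetime, timedelta, timezone


def resources_to_delete(resources, now, max_age, regions, types):
    """Return the IDs of resources older than max_age in the given regions and types."""
    cutoff = now - max_age
    return [
        resource["id"]
        for resource in resources
        if resource["created_at"] < cutoff
        and resource["region"] in regions
        and resource["type"] in types
    ]


# Hypothetical sandbox inventory.
NOW = datetime(2024, 1, 10, tzinfo=timezone.utc)
INVENTORY = [
    {"id": "i-old", "type": "ec2", "region": "us-east-2",
     "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "i-new", "type": "ec2", "region": "us-east-2",
     "created_at": datetime(2024, 1, 9, tzinfo=timezone.utc)},
    {"id": "db-old", "type": "rds", "region": "us-east-2",
     "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "i-old-west", "type": "ec2", "region": "us-west-2",
     "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]

# Delete EC2 instances in us-east-2 that are more than a week old.
DOOMED = resources_to_delete(
    INVENTORY, now=NOW, max_age=timedelta(days=7),
    regions={"us-east-2"}, types={"ec2"},
)
```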

In prod, Devs deploy via self-service with
the Service Catalog + CI / CD Pipeline.

Key self-service features:
1. GitOps-driven
Everything is managed as code and driven by commits to version control. Allows code
review, testing, audit log, versioning, etc.
2. UI-driven (optional)
Web UI as a layer on top of GitOps layer to make it more accessible.
3. Focus on common use cases
E.g., Account vending machine, data store deployment, app deployment. Don’t have to
solve everything right away.
4. Access controls
Different teams can access/deploy different things. E.g., NetOps team might be able to
deploy networking, whereas app teams can deploy orchestration tools and data stores.
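
The access-control point can be expressed as a small mapping from teams to the Service Catalog categories they may deploy. The team names and categories below are assumptions drawn from the example above, and `can_deploy` is a hypothetical helper, not part of any Gruntwork tool.

```python
# Which catalog categories each team may deploy (illustrative values).
TEAM_PERMISSIONS = {
    "netops": {"networking"},
    "app-team": {"orchestration", "data-store"},
}


def can_deploy(team: str, category: str) -> bool:
    """Unknown teams and unlisted categories are denied by default."""
    return category in TEAM_PERMISSIONS.get(team, set())
```

Denying by default keeps the mapping short: you only list what a team is allowed to do.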

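module "account_baseline" {
  # ... existing settings unchanged ...
  child_accounts = {
    # ... existing accounts unchanged ...
    # Add new account
    example = "accounts+example@company.com"
  }
}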
Example of self-service: update a file,
commit, CI / CD system deploys it

Key idea #7: Any team can contribute
to the Service Catalog.

Modern software involves many
moving pieces

If only Ops can add those pieces to the
Service Catalog, that’ll be a bottleneck

Automated tests:
✓ tflint
✓ tfsec
✓ OPA
✓ steampipe
✓ checkov
✓ Terratest
Passed: 6. Failed: 0. Skipped: 0.
Test run successful.
Instead, allow
everyone to contribute
and enforce company
requirements through
code reviews and
automated tests

Not taking into account ongoing
maintenance work

Not only are there many moving pieces,
but they’re all also constantly changing.

AWS is
constantly
changing
The last S3 security document that we’ll ever need, and how to use it
How To Keep Up With AWS Announcements

Docker is
constantly
changing
Docker Releases

Kubernetes is
constantly
changing
Kubernetes Wikipedia page

Terraform is
constantly
changing
Terraform Upgrade Guides

Many companies assume that the initial
cloud deployment is the hard part.

“Software maintenance
cost is increasingly
growing and estimates
showed that about 90%
of software life cost is
related to its
maintenance phase.”
Which Factors Affect Software Projects
Maintenance Cost More?
Sayed Mehdi Hejazi Dehaghani and Nafiseh Hajrahimi

If you don’t have a plan for maintenance,
all that code you wrote will rot.

“Coming back to that
Terraform codebase
after 6 months.”
Eero Järnefelt
Oil on canvas, 1893

Key auto-update features:
1. Automation-driven
Updates are discovered and the code is updated automatically. No relying on a human
to remember it. Update cadence should be configurable.
2. GitOps-driven
The code is updated via automated pull requests.
3. Automated testing
You must have automated tests in place and running against each pull request to let
you know if the updated code still works.
4. Automated deployment
Once a pull request is merged, it must deploy automatically via the CI / CD pipeline,
promoting the update across environments: e.g., dev → stage → prod.
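
The discovery step of an auto-update workflow can be sketched as a comparison between the versions you have pinned and the latest known releases. The module names, version numbers, and the `plan_updates` function are hypothetical; a real bot would look up releases from a registry and open pull requests instead of returning titles.

```python
def parse_version(version: str) -> tuple:
    """Turn 'v1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def plan_updates(pinned: dict, latest: dict) -> list:
    """Return one pull request title per dependency that has a newer release."""
    titles = []
    for name in sorted(pinned):
        current = pinned[name]
        newest = latest.get(name, current)
        if parse_version(newest) > parse_version(current):
            titles.append(f"Update {name} from {current} to {newest}")
    return titles


# Hypothetical pins from the codebase and the latest known releases.
PINNED = {"vpc": "v1.2.0", "eks-cluster": "v0.9.3"}
LATEST = {"vpc": "v1.10.0", "eks-cluster": "v0.9.3"}
PULL_REQUESTS = plan_updates(PINNED, LATEST)
```

Comparing parsed tuples rather than raw strings matters here: as strings, "v1.10.0" sorts before "v1.2.0" and the update would be missed.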

Key idea #8: Updates are pushed to the
code via PRs, automatically.

Key idea #9: Code without automated
tests will rot.

How to do automated testing for infrastructure code
https://terratest.gruntwork.io/docs/getting-started/introduction/#watch-how-to-test-infrastructure-code

Key ideas:
1. Manage everything as code in a Service Catalog.
2. Set up your Landing Zone as early as you can.
3. Only the CI / CD Pipeline can deploy to prod.
4. The CI / CD Pipeline only deploys from the Service Catalog.
5. The CI / CD Pipeline protects its admin permissions.
6. Any team can deploy infra + apps from the Service Catalog.
7. Any team can contribute to the Service Catalog.
8. Updates are pushed to the code via PRs, automatically.
9. Code without automated tests will rot.

5 cloud adoption fails and solutions:

| Fail | Description | Solution |
| Do it by hand | ClickOps | Service Catalog |
| Do it live | Everyone is an admin | Landing Zone |
| Do it on my machine | People deploying from their computers | CI / CD Pipeline |
| Do it only on my machine | Only Ops can deploy | Self-Service |
| Do it once | Not taking maintenance into account | Automatic Updates |

If you use this framework, here’s the
experience for your Ops team:

Step 1: Create a Service Catalog
Everything defined as code. Works for app + infra. You could build from
scratch or on top of an existing one (e.g., Gruntwork Service Catalog).

Step 2: Set up your Landing Zone
Set up your basic account structure, define account baselines, etc.

Step 3: Set up a CI / CD pipeline
Ensure it’s the only way to deploy to prod. Make it work for apps + infra.

Step 4: Provide self-service
Enable all teams to deploy. Start with a GitOps solution. Add UI later.

Step 5: Set up automatic updates
PRs opened automatically. Automated tests in place for app + infra code.

And here’s the experience for your
Dev team:

Step 1: Scaffold a new app
Leverage vetted application templates from the Service Catalog and the
logic built in: e.g., service discovery, packaging, monitoring, testing, etc.

Step 2: Deploy infrastructure
Leverage Self-Service + Service Catalog + CI / CD Pipeline.

Step 3: Iterate on the app
Leverage CI / CD built into the templates to deploy subsequent changes.

Step 4: Debug issues
Leverage monitoring, logging, alerting, etc. built into the templates.

Step 5: Stay up to date
Leverage auto update built into the templates. Automated PRs + tests.

“The Cloud
you always
wanted.”
Thomas Cole
Oil on canvas, 1836
