Cloud adoption fails
5 ways deployments go wrong and 5 solutions
“All happy cloud deployments are alike; each unhappy cloud deployment is unhappy in its own way.”
Leo Tolstoy, Site Reliability Engineer
I’m Yevgeniy Brikman
ybrikman.com
Author
Co-founder of Gruntwork
gruntwork.io
At Gruntwork, I’ve seen the cloud adoption journeys of hundreds of companies
I’ve seen some go well.
I’ve seen some go poorly.
I’ve seen things you people wouldn’t believe. DDoS attacks starting fires off the shoulder of Ohio (us-east-2). I watched C-suite foreheads glitter in the dark near their Fargate bills. All those moments will be lost in time, like tears in rain...
Image credit: Blade Runner, Warner Bros, 1982
Why is it so hard?
Because everything has changed about how we build software.
The shift to DevOps and the cloud
Area | Before | After
Dev team | Write code, “toss it over the wall” | Write code, deploy
Ops team | Rack servers, deploy code | Write code, deploy
Servers | Dedicated physical servers | Elastic virtual servers
Connectivity | Static IPs | Dynamic IPs, service discovery
Security | Physical, strong perimeter, high trust | Virtual, end-to-end, zero trust
Infra provisioning | Manual | Infrastructure as Code (IaC) tools
Server configuration | Manual | Configuration management tools
Testing | Manual | Automated testing
Deployments | Manual | Automated
Deployment cadence | Weeks or months | Many times per day
Change process | Change request tickets | Self-service
Change cadence | Weeks or months | Minutes
Adopting the cloud without acknowledging these changes leads to problems
This talk is about 5 common causes of cloud adoption failure…
Plus 5 solutions based on the patterns that worked across hundreds of companies
The 5 solutions are part of the Gruntwork Production Framework
https://docs.gruntwork.io/guides/production-framework/
Outline
1. Do it by hand
2. Do it live
3. Do it on my machine
4. Do it only on my machine
5. Do it once
NUMBER 1: FAIL
Deploying by using the web console for your cloud provider: “ClickOps”
Almost everyone starts this way.
Almost everyone regrets it.
Problems with ClickOps:
1. Slow
Hours of clicking to spin up a new environment.
2. No reuse
Every deploy must be done from scratch. No leverage from previous work.
3. No audit trail
All info trapped in one person’s head. No versioning.
4. Error-prone
Manual task = human error. Deployment problems. Snowflake servers. Can’t use tests.
5. Tedious
No one likes doing slow, repetitive, error-prone, risky work over and over again.
“Realizing your DevOps Engineer left... After deploying everything via ClickOps.”
Vasily Vereshchagin
Oil on canvas, 1887
Side note: credit to Classic Programmer Paintings for the comic inspiration!
https://classicprogrammerpaintings.com/
NUMBER 1: SOLUTION
Create a Service Catalog
A modern Service Catalog.
The modern Service Catalog:
1. Defined as code
Using tools such as Terraform, CloudFormation, Docker, Kubernetes, etc.
2. Designed for production use
Not a “5 minute demo,” but production-grade code.
3. Meets company requirements out of the box
Scalability, HA, security, compliance (e.g., SOC 2, ISO 27001, PCI, HIPAA), etc.
4. Tested to meet company requirements
Code reviews, static analysis, functional testing, policy enforcement, etc.
5. Infrastructure and app code
Defines templates and patterns for both infrastructure and applications.
Infrastructure templates
This is your Cloud API
https://docs.gruntwork.io/guides/production-framework/ingredients/service-catalog/infrastructure-templates
Application templates
This is your API between the cloud and your apps
https://docs.gruntwork.io/guides/production-framework/ingredients/service-catalog/application-templates
Real-world example: Gruntwork Service Catalog
Example infrastructure template for EKS
Example application template for Node.js
Key idea #1: Manage everything as code in a Service Catalog.
Manual provisioning → Infrastructure as code
Manual server config → Configuration management
Manual app config → Configuration files
Manual builds → Continuous integration
Manual deployment → Continuous delivery
Manual testing → Automated testing
Manual policies → Automated policies (OPA)
Manual DBA work → Schema migrations
Manual specs → Automated specs (BDD)
Recall the problems with ClickOps:
1. Slow
Hours of clicking to spin up a new environment.
2. No reuse
Every deploy must be done from scratch. No leverage from previous work.
3. No audit trail
All info trapped in one person’s head. No reproducibility. No versioning.
4. Error-prone
Manual task = human error. Every environment a little bit different. No testing.
5. Tedious
No one likes doing slow, repetitive, error-prone, risky work over and over again.
Advantages of code:
1. Slow → Fast
Computers can do in seconds what it takes a human hours to do.
2. No reuse → Reusable
Leverage your previous work and the work of others. Evolve your code over time.
3. No audit trail → Logged & versioned
Everything is in your version control system, including the full history of changes.
4. Error-prone → Reliable
Code + automated tests + code reviews dramatically reduce errors.
5. Tedious → Enjoyable
Writing code and being creative is more fun than repetitive, stressful, manual work.
NUMBER 2: FAIL
Making everyone an admin
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
Initially, most companies try to limit permissions…
But IAM is hard
Image from Why is AWS IAM So Hard? by Stephen Kuenzli
An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied
(tweak the IAM policy)
An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied
(tweak the IAM policy)
An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied
And frustrating. It’s just “Access Denied” over and over and over again.
The inevitable result: “F*ck it, we’ll do it live!” and you make everyone an admin.
Problems with making everyone an admin:
1. Weak security
Huge blast radius from any mistake. Any compromised credentials may result in a severe security incident. Any guard rails you put in place are ineffective.
2. Sprawl
Tons of new accounts and resources spun up and no one knows what they are for.
3. No consistency
Everything is configured differently: logging, networking, security controls, etc.
4. Difficult to fix
If everyone is an admin, it is very hard to “undo” the damage: you don’t know what they’ve done and you’re never 100% confident you’ve reined things in.
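One way to catch the "everyone is an admin" policy before it spreads is to scan policy documents for statements that allow every action on every resource. The following is a minimal sketch, not an official AWS tool: the function name `find_admin_statements` and the two sample policies are assumptions made for illustration, and a real check would also need to handle cases such as `NotAction` and service-level wildcards.

```python
import json


def find_admin_statements(policy_json):
    """Return the indexes of statements that allow every action on every resource."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    # IAM accepts a single statement object as well as a list of them.
    if isinstance(statements, dict):
        statements = [statements]

    def as_list(value):
        # "Action" and "Resource" may each be a string or a list of strings.
        return value if isinstance(value, list) else [value]

    flagged = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        actions = as_list(statement.get("Action", []))
        resources = as_list(statement.get("Resource", []))
        if "*" in actions and "*" in resources:
            flagged.append(index)
    return flagged


# The admin policy from the slide above.
ADMIN_POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": "*", "Resource": "*"}
  ]
}
"""

# A hypothetical scoped alternative that only allows listing buckets.
SCOPED_POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": "s3:ListAllMyBuckets", "Resource": "*"}
  ]
}
"""

if __name__ == "__main__":
    print(find_admin_statements(ADMIN_POLICY))
    print(find_admin_statements(SCOPED_POLICY))
```

A check like this could run in code review or CI so that a wildcard policy is flagged before it is applied.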
“Attempting to get all the AWS accounts under control”
Jacques-Louis David
Oil on canvas, 1799
NUMBER 2: SOLUTION
Set up your Landing Zone as early as possible
landing zone (noun) /ˈlændɪŋ zəʊn/
A streamlined way to create new accounts in your cloud provider that are configured out-of-the-box with best practices (e.g., authentication, authorization, logging, monitoring, tagging, guard rails, etc.).
Key ingredients of a Landing Zone:
1. Account structure
2. Account baselines
3. Account vending machine
account structure (noun) /əˈkaʊnt ˈstrʌktʃə(r)/
How to configure multiple inter-connected accounts in the cloud to provide isolation, compartmentalization, authentication, authorization, auditing, and reporting.
Each cloud recommends different account structures
account baseline (noun) /əˈkaʊnt ˈbeɪslaɪn/
The basic set of controls installed in every account to enforce a common set of best practices (e.g., authentication, authorization, logging, monitoring, tagging, guard rails, etc.).
Account baselines should handle:
Category | Description | Examples
Authentication | User identity, login, MFA | IAM users & roles, SSO, IdPs
Authorization | User permissions and access | IAM policies & groups, ACLs, RBAC
Monitoring | Audit logging, app logging, metrics | CloudTrail, Elastic stack, Grafana
Networking | IPs, routing, DNS, connectivity | VPCs, NAT, Route 53, VPN, SSH, RDP
Hardening | Network hardening, intrusion detection | WAF, IPS, Squid Proxy, GuardDuty
Guard rails | Limit what actions can be taken | IAM policies, SCPs, OPA, AWS Config
Compliance | Enforce compliance requirements | SOC 2, ISO 27001, CIS, PCI, HIPAA
Ownership | Associate accounts & resources with teams | Tagging, billing
Define your account baselines as code
module "account_baseline" {
  source = "github.com/gruntwork-io/account-baseline"

  enable_cloudtrail = true
  enable_aws_config = true
  enable_guard_duty = true

  child_accounts = {
    dev   = "accounts+dev@company.com"
    stage = "accounts+stage@company.com"
    prod  = "accounts+prod@company.com"
  }
}
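Because a baseline is only useful if every account actually has it, it can help to audit accounts against the required controls. The sketch below is a hedged illustration: the control names mirror the `enable_*` flags in the module above, while `REQUIRED_CONTROLS`, `audit_accounts`, and the sample `ACCOUNTS` data are hypothetical and not part of any Gruntwork API.

```python
# Controls every account is expected to enable (names taken from the module above).
REQUIRED_CONTROLS = ("enable_cloudtrail", "enable_aws_config", "enable_guard_duty")


def missing_controls(settings):
    """Return the required controls that are not explicitly set to True."""
    return [name for name in REQUIRED_CONTROLS if settings.get(name) is not True]


def audit_accounts(accounts):
    """Map each non-compliant account name to the controls it is missing."""
    report = {}
    for account_name, settings in accounts.items():
        missing = missing_controls(settings)
        if missing:
            report[account_name] = missing
    return report


# Hypothetical settings, e.g. as parsed from each account's configuration.
ACCOUNTS = {
    "dev": {"enable_cloudtrail": True, "enable_aws_config": True, "enable_guard_duty": True},
    "stage": {"enable_cloudtrail": True, "enable_aws_config": False, "enable_guard_duty": True},
    "prod": {"enable_cloudtrail": True, "enable_guard_duty": True},
}

if __name__ == "__main__":
    for name, missing in audit_accounts(ACCOUNTS).items():
        print(name, "is missing:", ", ".join(missing))
```

Treating an absent flag the same as `False` is a deliberate choice here: a control counts as enabled only when it is switched on explicitly.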
account vending machine (noun) /əˈkaʊnt ˈvendɪŋ məˈʃiːn/
An official tool or process for spinning up new accounts which ensures that each of those accounts is configured with the appropriate account baseline.
Key ingredients for an account vending machine:
1. Self-service
Teams should be able to spin up new accounts for themselves on-demand.
2. GitOps-driven
Under the hood, manage accounts as code checked into version control.
3. Apply baselines
The vending machine ensures the proper baseline is applied to every new account.
4. Provision access
The vending machine not only creates accounts, but also grants teams access to them (e.g., via SSO).
Example vending machine: update a file, commit, CI / CD system deploys it
module "account_baseline" {
  source = "github.com/gruntwork-io/account-baseline"

  child_accounts = {
    dev   = "accounts+dev@company.com"
    stage = "accounts+stage@company.com"
    prod  = "accounts+prod@company.com"

    # Add new account
    example = "accounts+example@company.com"
  }
}
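The vending machine flow above (edit the account list, commit, let the pipeline do the rest) boils down to comparing the desired accounts with the ones that already exist. Here is a minimal sketch of that comparison, assuming the account list has already been parsed into a dictionary; `plan_new_accounts`, the step fields, and the `team-` group naming are hypothetical and only illustrate the "apply baselines" and "provision access" ingredients.

```python
def plan_new_accounts(desired, existing, baseline):
    """Return one creation step for each desired account that does not exist yet.

    desired:  account name -> root email address (the child_accounts map)
    existing: set of account names that have already been created
    baseline: settings applied to every new account
    """
    steps = []
    for name in sorted(desired):
        if name in existing:
            continue
        steps.append({
            "create_account": name,
            "email": desired[name],
            "apply_baseline": dict(baseline),    # every new account gets the baseline
            "grant_access_to": "team-" + name,   # hypothetical SSO group name
        })
    return steps


DESIRED = {
    "dev": "accounts+dev@company.com",
    "stage": "accounts+stage@company.com",
    "prod": "accounts+prod@company.com",
    "example": "accounts+example@company.com",  # the newly committed account
}
EXISTING = {"dev", "stage", "prod"}
BASELINE = {"enable_cloudtrail": True, "enable_aws_config": True, "enable_guard_duty": True}

if __name__ == "__main__":
    for step in plan_new_accounts(DESIRED, EXISTING, BASELINE):
        print(step)
```

Running the plan twice with the new account added to `EXISTING` yields no steps, which is the idempotent behavior a GitOps-driven process relies on.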
Key idea #2: Set up your Landing Zone as early as you can.
NUMBER 3: FAIL
Deployments are done by humans from their own computers
Even with IaC, relying on a person to do deployments leads to problems
Problems with a person deploying:
1. Error-prone
Manual process = human error. E.g., fat-fingering a command, forgetting some step.
2. Not reproducible
E.g., wrong version installed locally, accidentally deploying uncommitted changes.
3. Low bus factor
Often only 1 or 2 devs can deploy. What if they go on vacation or leave the company?
4. Race conditions
Different devs accidentally deploy different code (e.g., different branches) = conflicts.
5. Not secure
Deploying arbitrary changes requires arbitrary (admin) permissions. We already know what happens when you give too many people admin permissions.
“Realizing you just ran terraform destroy in prod.”
Gustave Courbet
Oil on canvas, 1845
NUMBER 3: SOLUTION
Do all deploys through a CI / CD pipeline
Key CI / CD pipeline features:
Feature | Description
GitOps-driven | The pipeline is triggered by commits to version control
Defined as code | The full workflow should be defined as code
Automated tests | The pipeline should run pre-deploy, during-deploy, and post-deploy checks
Preview environments | Deploy the changes in each PR into an ephemeral environment
Promotion workflows | Promote immutable artifacts across environments: e.g., dev → stage → prod
Approval workflows | For some types of changes, require human approval for deployment to prod
Deployment workflows | Blue/green deploys, rolling deploys, canary deploys, feature toggles
App and infra code | You need workflows for both application and infrastructure code
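The promotion and approval workflows listed above can be sketched as a small state machine: the same immutable artifact moves through the environments in order, and the last hop requires a human sign-off. This is an illustrative sketch only; the `Pipeline` class, the environment order, and the artifact name are assumptions, and a real pipeline would call out to a CI / CD system rather than update a dictionary.

```python
from dataclasses import dataclass, field

PROMOTION_ORDER = ("dev", "stage", "prod")
REQUIRES_APPROVAL = frozenset({"prod"})


@dataclass
class Pipeline:
    # Maps environment name -> artifact version currently deployed there.
    deployed: dict = field(default_factory=dict)

    def next_environment(self, artifact):
        """Return the first environment that does not run this artifact yet."""
        for environment in PROMOTION_ORDER:
            if self.deployed.get(environment) != artifact:
                return environment
        return None

    def promote(self, artifact, tests_passed, approved_by=None):
        """Deploy the artifact to the next environment and return its name.

        Returns None when the artifact is already deployed everywhere.
        """
        environment = self.next_environment(artifact)
        if environment is None:
            return None
        if not tests_passed:
            raise RuntimeError("automated tests failed; refusing to promote")
        if environment in REQUIRES_APPROVAL and approved_by is None:
            raise PermissionError(environment + " requires human approval")
        self.deployed[environment] = artifact
        return environment


if __name__ == "__main__":
    pipeline = Pipeline()
    artifact = "web:1.4.2"  # hypothetical immutable, versioned artifact
    print(pipeline.promote(artifact, tests_passed=True))
    print(pipeline.promote(artifact, tests_passed=True))
    print(pipeline.promote(artifact, tests_passed=True, approved_by="release-manager"))
```

Because the artifact identifier never changes between environments, what reaches prod is exactly what was tested in dev and stage.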
The workflows for app & infra code are similar, but with key differences.
Workflows for app & infra code:
Stage | Application code | Infrastructure code
Run locally | Run the code on localhost; make a change, refresh | Run the code in the cloud (sandboxes); make a change, redeploy (use stages!)
Code review | Submit pull request with code changes | Submit pull request with code changes
Test | Static analysis: linter; functional tests: unit, integration, e2e | Static analysis: linter, policy enforcement; functional tests: plan, integration
Release | Merge pull request; build immutable, versioned artifact | Merge pull request; create git tag
CI config | CI server has limited permissions; CI server triggers K8S, ECS, EC2, etc. | Isolated worker has admin permissions; CI server triggers isolated worker
Deploy | Promote artifacts: e.g., dev → stage → prod; rolling, blue/green, canary, feature flags | Promote tags: e.g., dev → stage → prod; plan, approve, deploy, hope
Key idea #3: The CI / CD pipeline is the only thing that can deploy to prod.
No one has write access to prod (let alone admin access) except the pipeline.
Key idea #4: The CI / CD pipeline will only deploy vetted services from the Service Catalog to prod.
The Catalog + Pipeline are the only path to prod; the API between Devs and Ops.
Key idea #5: The CI / CD pipeline protects its permissions for prod.
To deploy arbitrary infra changes, you need arbitrary (admin) permissions!
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
Giving your CI server direct access to admin permissions considered harmful.
This is a BAD combination:
1. Everyone in your company can access your CI server
2. You use the CI server to execute arbitrary code
3. The CI server has admin permissions
Congratulations, everyone in your company has admin permissions again!
And so do hackers outside your company!
https://research.nccgroup.com/2022/01/13/10-real-world-stories-of-how-weve-compromised-ci-cd-pipelines/
The solution: only give admin permissions to an isolated worker
The isolated worker:
1. Is highly locked down
Unlike the CI server, no one at the company has direct access to the worker.
2. Can only be triggered by the CI server
The CI server only has permissions to trigger the worker via an API & stream logs from it.
3. Exposes a limited, locked-down API
The worker only allows you to run certain commands (e.g., terraform apply), in certain repos, in certain branches, in certain folders, etc.
4. Minimizes the potential damage
If an attacker gets access to your CI server, the worst they can do is trigger a deploy on your own code. They do NOT get admin permissions directly.
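The "limited, locked-down API" of the isolated worker can be pictured as an allowlist check that runs before anything is executed with admin permissions. The sketch below is a simplified illustration under stated assumptions: the repo URL, branch name, folder prefixes, and the `DeployRequest` / `rejection_reason` names are all hypothetical, and a real worker would also authenticate the caller.

```python
import posixpath
from dataclasses import dataclass

# Hypothetical allowlists; a real worker would load these from its own locked-down config.
ALLOWED_COMMANDS = frozenset({"terraform plan", "terraform apply"})
ALLOWED_REPOS = frozenset({"github.com/example-org/infrastructure-live"})
ALLOWED_BRANCHES = frozenset({"main"})
ALLOWED_FOLDERS = ("stage/", "prod/")


@dataclass(frozen=True)
class DeployRequest:
    command: str
    repo: str
    branch: str
    folder: str


def rejection_reason(request):
    """Return None if the request may run, otherwise a short explanation."""
    if request.command not in ALLOWED_COMMANDS:
        return "command not allowed: " + request.command
    if request.repo not in ALLOWED_REPOS:
        return "repo not allowed: " + request.repo
    if request.branch not in ALLOWED_BRANCHES:
        return "branch not allowed: " + request.branch
    # Normalize first so "prod/../secrets" cannot slip past the prefix check.
    folder = posixpath.normpath(request.folder) + "/"
    if not folder.startswith(ALLOWED_FOLDERS):
        return "folder not allowed: " + request.folder
    return None


if __name__ == "__main__":
    repo = "github.com/example-org/infrastructure-live"
    print(rejection_reason(DeployRequest("terraform apply", repo, "main", "prod/vpc")))
    print(rejection_reason(DeployRequest("terraform destroy", repo, "main", "prod/vpc")))
```

The CI server only ever submits a `DeployRequest`; it never holds the admin credentials itself, so compromising it yields at most the ability to trigger an allowed deploy.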
NUMBER 4: FAIL
Only Ops is allowed to deploy
The Ops team, trying to protect the company, acts as a gatekeeper.
But that usually backfires:
Inevitably, the Ops team is overwhelmed and becomes a bottleneck
So the Dev team finds a workaround…
So Ops adds more process… but that just makes things even more backed up.
“The Ops team explains the new 95-step change request process to the Dev team.”
Ferdinand Pauwels
Oil on canvas, 1872
NUMBER 4: SOLUTION
Provide developers with self-service
Key idea #6: Any team can deploy their own infra + apps from the Service Catalog
The cloud is primarily a tool for Devs, not Ops.
One of the biggest benefits of the cloud: Devs can be more self-sufficient.
Ops team as a gatekeeper: Devs aren’t self-sufficient, go slow.
Ops team as enabler: Devs are self-sufficient, go fast.
Enable self-service safely via the Catalog + Pipeline: your API on top of the cloud.
Devs should have sandbox accounts for easy testing, learning, etc.
Run cleanup tools in cron jobs to remove old resources in sandbox accounts
Tool | Clouds | Features
cloud-nuke | AWS | Delete all resources older than a certain date; in a certain region; of a certain type.
safe-scrub | Google Cloud | Safely delete unwanted resources in a GCP project
Azure PowerShell | Azure | Includes native commands to delete Resource Groups
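The selection logic behind such a cleanup job (older than a cutoff, optionally limited to certain regions and resource types) is simple enough to sketch. This is not how cloud-nuke or any of the tools above is implemented; `select_expired`, the record fields, and the sample inventory are hypothetical and stand in for whatever your cloud's inventory API returns.

```python
from datetime import datetime, timedelta, timezone


def select_expired(resources, now, max_age, regions=None, types=None):
    """Return the ids of resources created more than max_age before now.

    regions / types: optional sets that narrow the selection; None means "any".
    """
    cutoff = now - max_age
    expired = []
    for resource in resources:
        if resource["created_at"] >= cutoff:
            continue  # still young enough to keep
        if regions is not None and resource["region"] not in regions:
            continue
        if types is not None and resource["type"] not in types:
            continue
        expired.append(resource["id"])
    return expired


# Hypothetical sandbox inventory.
NOW = datetime(2024, 1, 10, tzinfo=timezone.utc)
SANDBOX = [
    {"id": "i-old", "type": "ec2", "region": "us-east-2",
     "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "i-new", "type": "ec2", "region": "us-east-2",
     "created_at": datetime(2024, 1, 9, 12, 0, tzinfo=timezone.utc)},
    {"id": "bucket-old", "type": "s3", "region": "us-east-1",
     "created_at": datetime(2023, 12, 1, tzinfo=timezone.utc)},
]

if __name__ == "__main__":
    print(select_expired(SANDBOX, NOW, timedelta(days=2)))
    print(select_expired(SANDBOX, NOW, timedelta(days=2), regions={"us-east-2"}))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the selection easy to test before it is wired up to anything that deletes resources.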
In prod, Devs deploy via self-service with the Service Catalog + CI / CD Pipeline.
Key self-service features:
1. GitOps-driven
Everything is managed as code and driven by commits to version control. Allows code review, testing, audit log, versioning, etc.
2. UI-driven (optional)
Web UI as a layer on top of the GitOps layer to make it more accessible.
3. Focus on common use cases
E.g., account vending machine, data store deployment, app deployment. Don’t have to solve everything right away.
4. Access controls
Different teams can access/deploy different things. E.g., NetOps team might be able to deploy networking, whereas app teams can deploy orchestration tools and data stores.
Example of self-service: update a file, commit, CI / CD system deploys it
module "account_baseline" {
  source = "github.com/gruntwork-io/account-baseline"

  child_accounts = {
    dev   = "accounts+dev@company.com"
    stage = "accounts+stage@company.com"
    prod  = "accounts+prod@company.com"

    # Add new account
    example = "accounts+example@company.com"
  }
}
Key idea #7: Any team can contribute to the Service Catalog.
Modern software involves many moving pieces
If only Ops can add those pieces to the Service Catalog, that’ll be a bottleneck
Automated tests:
✓ tflint
✓ tfsec
✓ OPA
✓ steampipe
✓ checkov
✓ Terratest
Passed: 6. Failed: 0. Skipped: 0.
Test run successful.
Instead, allow everyone to contribute and enforce company requirements through code reviews and automated tests
NUMBER 5: FAIL
Not taking into account ongoing maintenance work
Not only are there many moving pieces, but they’re all also constantly changing.
AWS is constantly changing
The last S3 security document that we’ll ever need, and how to use it
How To Keep Up With AWS Announcements
Docker is constantly changing
Docker Releases
Kubernetes is constantly changing
Kubernetes Wikipedia page
Terraform is constantly changing
Terraform Upgrade Guides
Many companies assume that the initial cloud deployment is the hard part.
It isn’t.
“Software maintenance cost is increasingly growing and estimates showed that about 90% of software life cost is related to its maintenance phase.”
Which Factors Affect Software Projects Maintenance Cost More?
Sayed Mehdi Hejazi Dehaghani and Nafiseh Hajrahimi
If you don’t have a plan for maintenance, all that code you wrote will rot.
“Coming back to that Terraform codebase after 6 months.”
Eero Järnefelt
Oil on canvas, 1893
NUMBER 5: SOLUTION
Set up automatic updates
Key auto-update features:
1. Automation-driven
Updates are discovered and the code is updated automatically. No relying on a human to remember it. Update cadence should be configurable.
2. GitOps-driven
The code is updated via automated pull requests.
3. Automated testing
You must have automated tests in place and running against each pull request to let you know if the updated code still works.
4. Automated deployment
Once a pull request is merged, it must deploy automatically via the CI / CD pipeline, promoting the update across environments: e.g., dev → stage → prod.
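The "discover updates, then open a pull request" step above can be sketched as a comparison between pinned versions and the latest known releases. This is a minimal illustration, assuming simple dotted version strings with an optional leading "v"; `propose_updates`, the dependency names, and the branch and title formats are hypothetical, and opening the actual pull request is left to whatever Git hosting API you use.

```python
def parse_version(text):
    """Turn 'v1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def propose_updates(pinned, latest):
    """Return one pull request proposal per dependency with a newer release.

    pinned: dependency name -> version currently used in the code
    latest: dependency name -> newest released version
    """
    proposals = []
    for name in sorted(pinned):
        current = pinned[name]
        newest = latest.get(name)
        if newest is None:
            continue  # no release information for this dependency
        if parse_version(newest) > parse_version(current):
            proposals.append({
                "dependency": name,
                "from": current,
                "to": newest,
                "branch": "auto-update/" + name + "-" + newest,
                "title": "Update " + name + " from " + current + " to " + newest,
            })
    return proposals


# Hypothetical module pins and release data.
PINNED = {"vpc-module": "v1.2.0", "eks-module": "v0.9.1"}
LATEST = {"vpc-module": "v1.10.0", "eks-module": "v0.9.1"}

if __name__ == "__main__":
    for proposal in propose_updates(PINNED, LATEST):
        print(proposal["title"])
```

Comparing parsed tuples rather than raw strings matters here: as text, "v1.10.0" sorts before "v1.2.0", which would hide the update.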
Key idea #8: Updates are pushed to the code via PRs, automatically.
Key idea #9: Code without automated tests will rot.
How to do automated testing for infrastructure code
https://terratest.gruntwork.io/docs/getting-started/introduction/#watch-how-to-test-infrastructure-code
Let’s recap:
Key ideas:
1. Manage everything as code in a Service Catalog.
2. Set up your Landing Zone as early as you can.
3. Only the CI / CD Pipeline can deploy to prod.
4. The CI / CD Pipeline only deploys from the Service Catalog.
5. The CI / CD Pipeline protects its admin permissions.
6. Any team can deploy infra + apps from the Service Catalog.
7. Any team can contribute to the Service Catalog.
8. Updates are pushed to the code via PRs, automatically.
9. Code without automated tests will rot.
5 cloud adoption fails and solutions:
Fail | Description | Solution
Do it by hand | ClickOps | Service Catalog
Do it live | Everyone is an admin | Landing Zone
Do it on my machine | People deploying from their computers | CI / CD Pipeline
Do it only on my machine | Only Ops can deploy | Self-Service
Do it once | Not taking maintenance into account | Automatic Updates
The 5 solutions are part of the Gruntwork Production Framework
https://docs.gruntwork.io/guides/production-framework/
If you use this framework, here’s the experience for your Ops team:
Step 1: Create a Service Catalog
Everything defined as code. Works for app + infra. You could build from scratch or on top of an existing one (e.g., Gruntwork Service Catalog).
Step 2: Set up your Landing Zone
Set up your basic account structure, define account baselines, etc.
Step 3: Set up a CI / CD pipeline
Ensure it’s the only way to deploy to prod. Make it work for apps + infra.
Step 4: Provide self-service
Enable all teams to deploy. Start with a GitOps solution. Add UI later.
Step 5: Set up automatic updates
PRs opened automatically. Automated tests in place for app + infra code.
And here’s the experience for your Dev team:
Step 1: Scaffold a new app
Leverage vetted application templates from the Service Catalog and the logic built in: e.g., service discovery, packaging, monitoring, testing, etc.
Step 2: Deploy infrastructure
Leverage Self-Service + Service Catalog + CI / CD Pipeline.
Step 3: Iterate on the app
Leverage CI / CD built into the templates to deploy subsequent changes.
Step 4: Debug issues
Leverage monitoring, logging, alerting, etc. built into the templates.
Step 5: Stay up to date
Leverage auto update built into the templates. Automated PRs + tests.
“The Cloud you always wanted.”
Thomas Cole
Oil on canvas, 1836
Questions?
info@gruntwork.io

AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 
A Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationA Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data Migration
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
 
What need to be mastered as AI-Powered Java Developers
What need to be mastered as AI-Powered Java DevelopersWhat need to be mastered as AI-Powered Java Developers
What need to be mastered as AI-Powered Java Developers
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
 
10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
 
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
 
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
 
The Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion ProductionThe Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion Production
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
 
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfImplementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 

Cloud adoption fails - 5 ways deployments go wrong and 5 solutions

  • 1. Cloud adoption fails: 5 ways deployments go wrong and 5 solutions
  • 2. “All happy cloud deployments are alike; each unhappy cloud deployment is unhappy in its own way.” Leo Tolstoy Site Reliability Engineer
  • 6. At Gruntwork, I’ve seen the cloud adoption journeys of hundreds of companies
  • 7. I’ve seen some go well. I’ve seen some go poorly.
  • 8. I've seen things you people wouldn’t believe. DDoS attacks starting fires off the shoulder of Ohio (us-east-2). I watched C-suite foreheads glitter in the dark near their Fargate bills. All those moments will be lost in time, like tears in rain... Image credit: Blade Runner, Warner Bros, 1982
  • 9. Why is it so hard?
  • 10. Because everything has changed about how we build software.
  • 11. The shift to DevOps and the cloud:
        Area | Before | After
        Dev team | Write code, “toss it over the wall” | Write code, deploy
        Ops team | Rack servers, deploy code | Write code, deploy
        Servers | Dedicated physical servers | Elastic virtual servers
        Connectivity | Static IPs | Dynamic IPs, service discovery
        Security | Physical, strong perimeter, high trust | Virtual, end-to-end, zero trust
        Infra provisioning | Manual | Infrastructure as Code (IaC) tools
        Server configuration | Manual | Configuration management tools
        Testing | Manual | Automated testing
        Deployments | Manual | Automated
        Deployment cadence | Weeks or months | Many times per day
        Change process | Change request tickets | Self-service
        Change cadence | Weeks or months | Minutes
  • 12. Adopting the cloud without acknowledging these changes leads to problems
  • 13. This talk is about 5 common causes of cloud adoption failure…
  • 14. Plus 5 solutions based on the patterns that worked across hundreds of companies
  • 15. The 5 solutions are part of the Gruntwork Production Framework https://docs.gruntwork.io/guides/production-framework/
  • 16. Outline: 1. Do it by hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once
  • 19. Deploying by using the web console for your cloud provider: “ClickOps”
  • 20. Almost everyone starts this way. Almost everyone regrets it.
  • 21. Problems with ClickOps:
        1. Slow: Hours of clicking to spin up a new environment.
        2. No reuse: Every deploy must be done from scratch. No leverage from previous work.
        3. No audit trail: All info trapped in one person’s head. No versioning.
        4. Error-prone: Manual task = human error. Deployment problems. Snowflake servers. Can’t use tests.
        5. Tedious: No one likes doing slow, repetitive, error-prone, risky work over and over again.
  • 22. “Realizing your DevOps Engineer left... After deploying everything via ClickOps.” Vasily Vereshchagin Oil on canvas, 1887
  • 23. Side note: credit to Classic Programmer Paintings for the comic inspiration! https://classicprogrammerpaintings.com/
  • 25. Create a Service Catalog
  • 26. A modern Service Catalog.
  • 27. The modern Service Catalog:
        1. Defined as code: Using tools such as Terraform, CloudFormation, Docker, Kubernetes, etc.
        2. Designed for production use: Not a “5 minute demo,” but production-grade code.
        3. Meets company requirements out-of-the-box: Scalability, HA, security, compliance (e.g., SOC 2, ISO 27001, PCI, HIPAA), etc.
        4. Tested to meet company requirements: Code reviews, static analysis, functional testing, policy enforcement, etc.
        5. Infrastructure and app code: Defines templates and patterns for both infrastructure and applications.
  • 28. Infrastructure templates: This is your Cloud API. https://docs.gruntwork.io/guides/production-framework/ingredients/service-catalog/infrastructure-templates
  • 29. Application templates: This is your API between the cloud and your apps. https://docs.gruntwork.io/guides/production-framework/ingredients/service-catalog/application-templates
  • 30. Real-world example: Gruntwork Service Catalog
  • 33. Key idea #1: Manage everything as code in a Service Catalog.
  • 34. Manual provisioning → Infrastructure as code
        Manual server config → Configuration management
        Manual app config → Configuration files
        Manual builds → Continuous integration
        Manual deployment → Continuous delivery
        Manual testing → Automated testing
        Manual policies → Automated policies (OPA)
        Manual DBA work → Schema migrations
        Manual specs → Automated specs (BDD)
  • 35. Recall the problems with ClickOps:
        1. Slow: Hours of clicking to spin up a new environment.
        2. No reuse: Every deploy must be done from scratch. No leverage from previous work.
        3. No audit trail: All info trapped in one person’s head. No reproducibility. No versioning.
        4. Error-prone: Manual task = human error. Every environment a little bit different. No testing.
        5. Tedious: No one likes doing slow, repetitive, error-prone, risky work over and over again.
  • 36. Advantages of code:
        1. Slow → Fast: Computers can do in seconds what it takes a human hours to do.
        2. No reuse → Reusable: Leverage your previous work and the work of others. Evolve your code over time.
        3. No audit trail → Logged & versioned: Everything is in your version control system, including the full history of changes.
        4. Error-prone → Reliable: Code + automated tests + code reviews dramatically reduce errors.
        5. Tedious → Enjoyable: Writing code and being creative is more fun than repetitive, stressful, manual work.
  • 37. Outline: 1. Do it by hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once
  • 39. Making everyone an admin:
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Action": "*",
              "Resource": "*"
            }
          ]
        }
  • 40. Initially, most companies try to limit permissions…
  • 41. But IAM is hard Image from Why is AWS IAM So Hard? by Stephen Kuenzli
  • 42. And frustrating. It’s just “Access Denied” over and over and over again:
        An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied
        (tweak the IAM policy)
        An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied
        (tweak the IAM policy)
        An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied
  • 43. The inevitable result: “F*ck it, we’ll do it live!” and you make everyone an admin.
  • 45. Problems with everyone being an admin:
        1. Weak security: Huge blast radius from any mistake. Any compromised credentials may result in a severe security incident. Any guard rails you put in place are ineffective.
        2. Sprawl: Tons of new accounts and resources spun up, and no one knows what they are for.
        3. No consistency: Everything is configured differently: logging, networking, security controls, etc.
        4. Difficult to fix: If everyone is an admin, it is very hard to “undo” the damage: you don’t know what they’ve done and you’re never 100% confident you’ve reined things in.
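One way to notice the "everyone is an admin" policy from slide 39 before it spreads is a small automated check. The sketch below is illustrative only: the function names are hypothetical, and it flags only the literal `"*"` on both `Action` and `Resource`, not narrower wildcards such as `iam:*` or `NotAction` tricks.

```python
import json


def is_admin_statement(statement: dict) -> bool:
    """Return True if the statement allows every action on every resource."""
    if statement.get("Effect") != "Allow":
        return False
    actions = statement.get("Action", [])
    resources = statement.get("Resource", [])
    # IAM accepts either a single string or a list of strings for these keys.
    if isinstance(actions, str):
        actions = [actions]
    if isinstance(resources, str):
        resources = [resources]
    return "*" in actions and "*" in resources


def find_admin_statements(policy_json: str) -> list[int]:
    """Return the indexes of statements that grant full admin access."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    # A policy may also hold a single statement object instead of a list.
    if isinstance(statements, dict):
        statements = [statements]
    return [i for i, s in enumerate(statements) if is_admin_statement(s)]


ADMIN_POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": "*", "Resource": "*"}
  ]
}
"""

READ_ONLY_POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["s3:ListAllMyBuckets"], "Resource": "*"}
  ]
}
"""
```

A check like this could run in code review or CI against every policy file, so the admin shortcut is at least a visible, reviewed decision.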
  • 46. “Attempting to get all the AWS accounts under control” Jacques-Louis David Oil on canvas, 1799
  • 48. Set up your Landing Zone as early as possible
  • 49. landing zone noun /ˈlændɪŋ zəʊn/ A streamlined way to create new accounts in your cloud provider that are configured out-of-the-box with best practices (e.g., authentication, authorization, logging, monitoring, tagging, guard rails, etc.).
  • 50. Key ingredients of a Landing Zone: 1. Account structure 2. Account baselines 3. Account vending machine
  • 52. account structure noun /əˈkaʊnt ˈstrʌktʃə(r)/ How to configure multiple inter-connected accounts in the cloud to provide isolation, compartmentalization, authentication, authorization, auditing, and reporting.
  • 53. Each cloud recommends different account structures
  • 54. Key ingredients of a Landing Zone: 1. Account structure 2. Account baselines 3. Account vending machine
  • 55. account baseline noun /əˈkaʊnt ˈbeɪslaɪn/ The basic set of controls installed in every account to enforce a common set of best practices (e.g., authentication, authorization, logging, monitoring, tagging, guard rails, etc.).
  • 56. Account baselines should handle:
        Area | Description | Examples
        Authentication | User identity, login, MFA | IAM users & roles, SSO, IdPs
        Authorization | User permissions and access | IAM policies & groups, ACLs, RBAC
        Monitoring | Audit logging, app logging, metrics | CloudTrail, Elastic stack, Grafana
        Networking | IPs, routing, DNS, connectivity | VPCs, NAT, Route 53, VPN, SSH, RDP
        Hardening | Network hardening, intrusion detection | WAF, IPS, Squid Proxy, GuardDuty
        Guard rails | Limit what actions can be taken | IAM policies, SCPs, OPA, AWS Config
        Compliance | Enforce compliance requirements | SOC 2, ISO 27001, CIS, PCI, HIPAA
        Ownership | Associate accounts & resources with teams | Tagging, billing
  • 57. Define your account baselines as code:
        module "account_baseline" {
          source = "github.com/gruntwork-io/account-baseline"
          enable_cloudtrail = true
          enable_aws_config = true
          enable_guard_duty = true
          child_accounts = {
            dev   = "accounts+dev@company.com"
            stage = "accounts+stage@company.com"
            prod  = "accounts+prod@company.com"
          }
        }
  • 58. Key ingredients of a Landing Zone: 1. Account structure 2. Account baselines 3. Account vending machine
  • 59. account vending machine noun /əˈkaʊnt ˈvendɪŋ məˈʃiːn/ An official tool or process for spinning up new accounts which ensures that each of those accounts is configured with the appropriate account baseline.
  • 60. Key ingredients for an account vending machine:
        1. Self-service: Teams should be able to spin up new accounts for themselves on-demand.
        2. GitOps-driven: Under the hood, manage accounts as code checked into version control.
        3. Apply baselines: The vending machine ensures the proper baseline is applied to every new account.
        4. Provision access: The vending machine not only creates accounts, but also grants teams access to them (e.g., via SSO).
  • 61. Example vending machine: update a file, commit, CI / CD system deploys it:
        module "account_baseline" {
          source = "github.com/gruntwork-io/account-baseline"
          child_accounts = {
            dev   = "accounts+dev@company.com"
            stage = "accounts+stage@company.com"
            prod  = "accounts+prod@company.com"
            # Add new account
            example = "accounts+example@company.com"
          }
        }
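To make the vending-machine idea concrete, here is a rough sketch of what the CI / CD system might compute after that commit: compare the accounts declared in code with the accounts that already exist, and plan the same steps for each new one. The step names and the `plan_vending` function are hypothetical illustrations, not part of any Gruntwork tool.

```python
# Steps applied to every newly vended account, in order (illustrative names).
VENDING_STEPS = ["create_account", "apply_baseline", "grant_team_access"]


def plan_vending(desired: dict[str, str], existing: set[str]) -> list[tuple[str, str, str]]:
    """Return (step, account_name, email) tuples for accounts not yet created."""
    plan = []
    for name, email in desired.items():
        if name in existing:
            continue  # already vended; nothing to do
        for step in VENDING_STEPS:
            plan.append((step, name, email))
    return plan


# The child_accounts map from the slide, after the "example" account is added.
desired_accounts = {
    "dev": "accounts+dev@company.com",
    "stage": "accounts+stage@company.com",
    "prod": "accounts+prod@company.com",
    "example": "accounts+example@company.com",
}

plan = plan_vending(desired_accounts, existing={"dev", "stage", "prod"})
```

Because every new account goes through the same step list, the baseline and access provisioning cannot be skipped for any of them.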
  • 62. Key idea #2: Set up your Landing Zone as early as you can.
  • 63. Outline: 1. Do it by hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once
  • 65. Deployments are done by humans from their own computers
  • 66. Even with IaC, relying on a person to do deployments leads to problems
  • 67. Problems with a person deploying:
        1. Error-prone: Manual process = human error. E.g., fat-fingering a command, forgetting some step.
        2. Not reproducible: E.g., wrong version installed locally, accidentally deploying uncommitted changes.
        3. Low bus factor: Often only 1 or 2 devs can deploy. What if they go on vacation or leave the company?
        4. Race conditions: Different devs accidentally deploy different code (e.g., different branches) = conflicts.
        5. Not secure: Deploying arbitrary changes requires arbitrary (admin) permissions. We already know what happens when you give too many people admin permissions.
  • 68. “Realizing you just ran terraform destroy in prod.” Gustave Courbet Oil on canvas, 1845
  • 70. Do all deploys through a CI / CD pipeline
  • 71. Key CI / CD pipeline features:
        Feature | Description
        GitOps-driven | The pipeline is triggered by commits to version control.
        Defined as code | The full workflow should be defined as code.
        Automated tests | The pipeline should run pre-, post-, and during-deploy checks.
        Preview environments | Deploy the changes in each PR into an ephemeral environment.
        Promotion workflows | Promote immutable artifacts across environments: e.g., dev → stage → prod.
        Approval workflows | For some types of changes, require human approval for deployment to prod.
        Deployment workflows | Blue/green deploys, rolling deploys, canary deploys, feature toggles.
        App and infra code | You need workflows for both application and infrastructure code.
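The promotion and approval workflows can be illustrated with a few lines of code. This is a minimal sketch under assumed names (`ENVIRONMENTS`, `REQUIRES_APPROVAL`, `promote` are all hypothetical): the same immutable artifact moves one environment at a time, and the step into prod is refused without an explicit approval.

```python
# Promotion order and the environments gated by human approval (assumed values).
ENVIRONMENTS = ["dev", "stage", "prod"]
REQUIRES_APPROVAL = {"prod"}


def promote(artifact: str, current_env: str, approved: bool = False) -> tuple[str, str]:
    """Return (target_env, artifact) for the next environment in the chain.

    The artifact identifier is passed through unchanged: what was tested in
    one environment is exactly what gets deployed to the next.
    """
    position = ENVIRONMENTS.index(current_env)  # ValueError for unknown envs
    if position == len(ENVIRONMENTS) - 1:
        raise ValueError(f"{current_env} is the last environment")
    target = ENVIRONMENTS[position + 1]
    if target in REQUIRES_APPROVAL and not approved:
        raise PermissionError(f"promotion to {target} requires approval")
    return target, artifact
```

For example, `promote("app:v1.2.3", "dev")` moves the artifact to stage, while promoting from stage only succeeds when `approved=True` is passed.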
  • 72. The workflows for app & infra code are similar, but with key differences.
  • 73. Workflows for app & infra code:
        Stage | Application code | Infrastructure code
        Run locally | Run the code on localhost; make a change, refresh | Run the code in the cloud (sandboxes); make a change, redeploy (use stages!)
        Code review | Submit pull request with code changes | Submit pull request with code changes
        Test | Static analysis: linter; functional tests: unit, integration, e2e | Static analysis: linter, policy enforcement; functional tests: plan, integration
        Release | Merge pull request; build immutable, versioned artifact | Merge pull request; create git tag
        CI config | CI server has limited permissions; CI server triggers K8S, ECS, EC2, etc. | Isolated worker has admin permissions; CI server triggers isolated worker
        Deploy | Promote artifacts: e.g., dev → stage → prod; rolling, blue/green, canary, feature flags | Promote tags: e.g., dev → stage → prod; plan, approve, deploy, hope
  • 74. Key idea #3: The CI / CD pipeline is the only thing that can deploy to prod.
  • 75. No one has write access to prod (let alone admin access) except the pipeline.
  • 76. Key idea #4: The CI / CD pipeline will only deploy vetted services from the Service Catalog to prod.
  • 77. The Catalog + Pipeline are the only path to prod; the API between Devs and Ops.
  • 78. Key idea #5: The CI / CD pipeline protects its permissions for prod.
  • 79. To deploy arbitrary infra changes, you need arbitrary (admin) permissions!
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Action": "*",
              "Resource": "*"
            }
          ]
        }
  • 80. Giving your CI server direct access to admin permissions considered harmful.
  • 81. This is a BAD combination: 1. Everyone in your company can access your CI server 2. You use the CI server to execute arbitrary code 3. The CI server has admin permissions
  • 82. Congratulations, everyone in your company has admin permissions again!
  • 83. And so do hackers outside your company! https://research.nccgroup.com/2022/01/13/10-real-world-stories-of-how-weve-compromised-ci-cd-pipelines/
  • 84. The solution: only give admin permissions to an isolated worker
  • 85. The isolated worker:
        1. Is highly locked down: Unlike the CI server, no one at the company has direct access to the worker.
        2. Can only be triggered by the CI server: The CI server only has permissions to trigger the worker via an API & stream logs from it.
        3. Exposes a limited, locked-down API: The worker only allows you to run certain commands (e.g., terraform apply), in certain repos, in certain branches, in certain folders, etc.
        4. Minimizes the potential damage: If an attacker gets access to your CI server, the worst they can do is trigger a deploy on your own code. They do NOT get admin permissions directly.
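The "limited, locked-down API" of the isolated worker boils down to an allow-list check on every request it receives. The sketch below shows one way such a check might look; the repo name, branch, folder prefixes, and all identifiers are made-up assumptions for illustration, not a description of any real worker implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRequest:
    repo: str
    branch: str
    folder: str
    command: str


# Everything not listed here is rejected (all values are hypothetical).
ALLOWED_COMMANDS = {"terraform plan", "terraform apply"}
ALLOWED_REPOS = {"github.com/example-org/infrastructure-live"}
ALLOWED_BRANCHES = {"main"}
ALLOWED_FOLDER_PREFIXES = ("stage/", "prod/")


def is_allowed(request: DeployRequest) -> bool:
    """Return True only if every field of the request is on the allow-list."""
    return (
        request.command in ALLOWED_COMMANDS
        and request.repo in ALLOWED_REPOS
        and request.branch in ALLOWED_BRANCHES
        and request.folder.startswith(ALLOWED_FOLDER_PREFIXES)
        # Reject path traversal out of the allowed folders.
        and ".." not in request.folder.split("/")
    )


ok = DeployRequest(
    repo="github.com/example-org/infrastructure-live",
    branch="main",
    folder="prod/vpc",
    command="terraform apply",
)
```

With this shape, a compromised CI server can still only ask the worker to run a vetted command against reviewed code on an approved branch.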
  • 86. Outline: 1. Do it by hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once
  • 88. Only Ops is allowed to deploy
  • 89. The Ops team, trying to protect the company, acts as a gatekeeper.
  • 90. But that usually backfires:
  • 91. Inevitably, the Ops team is overwhelmed and becomes a bottleneck
  • 92. So the Dev team finds a workaround…
  • 94. So Ops adds more process… but that just makes things even more backed up.
  • 95. “The Ops team explains the new 95-step change request process to the Dev team.” Ferdinand Pauwels Oil on canvas, 1872
  • 98. Key idea #6: Any team can deploy their own infra + apps from the Service Catalog
  • 99. The cloud is primarily a tool for Devs, not Ops.
  • 100. One of the biggest benefits of the cloud: Devs can be more self-sufficient.
  • 101. Ops team as a gatekeeper: Devs aren’t self-sufficient, go slow.
  • 102. Ops team as enabler: Devs are self-sufficient, go fast.
  • 103. Enable self-service safely via the Catalog + Pipeline: your API on top of the cloud.
  • 104. Devs should have sandbox accounts for easy testing, learning, etc.
  • 105. Run cleanup tools in cron jobs to remove old resources in sandbox accounts:
        Tool | Clouds | Features
        cloud-nuke | AWS | Delete all resources older than a certain date; in a certain region; of a certain type.
        safe-scrub | Google Cloud | Safely delete unwanted resources in a GCP project.
        Azure PowerShell | Azure | Includes native commands to delete Resource Groups.
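The core of any such cleanup job is an age filter. The following sketch is not how cloud-nuke or the other tools are implemented; it only illustrates the selection logic, using a hypothetical in-memory resource list with `id` and `created_at` fields.

```python
from datetime import datetime, timedelta, timezone


def select_expired(resources: list[dict], now: datetime, max_age: timedelta) -> list[str]:
    """Return the ids of resources created before (now - max_age)."""
    cutoff = now - max_age
    return [r["id"] for r in resources if r["created_at"] < cutoff]


# Hypothetical sandbox inventory, with a fixed "now" so the result is stable.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
sandbox_resources = [
    {"id": "ec2-old", "created_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": "s3-recent", "created_at": datetime(2024, 5, 31, tzinfo=timezone.utc)},
]

expired = select_expired(sandbox_resources, now, max_age=timedelta(days=7))
```

A scheduled job would pass the selected ids to whatever deletion tool fits the cloud in use, and should only ever be pointed at sandbox accounts.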
  • 106. In prod, Devs deploy via self-service with the Service Catalog + CI / CD Pipeline.
  • 107. Key self-service features:
        1. GitOps-driven: Everything is managed as code and driven by commits to version control. Allows code review, testing, audit log, versioning, etc.
        2. UI-driven (optional): Web UI as a layer on top of the GitOps layer to make it more accessible.
        3. Focus on common use cases: E.g., account vending machine, data store deployment, app deployment. Don’t have to solve everything right away.
        4. Access controls: Different teams can access/deploy different things. E.g., the NetOps team might be able to deploy networking, whereas app teams can deploy orchestration tools and data stores.
  • 108. Example of self-service: update a file, commit, CI / CD system deploys it:
        module "account_baseline" {
          source = "github.com/gruntwork-io/account-baseline"
          child_accounts = {
            dev   = "accounts+dev@company.com"
            stage = "accounts+stage@company.com"
            prod  = "accounts+prod@company.com"
            # Add new account
            example = "accounts+example@company.com"
          }
        }
  • 109. Key idea #7: Any team can contribute to the Service Catalog.
  • 110. Modern software involves many moving pieces
  • 111. If only Ops can add those pieces to the Service Catalog, that’ll be a bottleneck
  • 112. Instead, allow everyone to contribute and enforce company requirements through code reviews and automated tests. Automated tests: ✓ tflint ✓ tfsec ✓ OPA ✓ steampipe ✓ checkov ✓ Terratest. Passed: 6. Failed: 0. Skipped: 0. Test run successful.
  • 113. Outline: 1. Do it by hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once
  • 115. Not taking into account ongoing maintenance work
  • 116. Not only are there many moving pieces, but they’re all also constantly changing.
  • 117. AWS is constantly changing. Articles shown: “The last S3 security document that we’ll ever need, and how to use it”; “How To Keep Up With AWS Announcements”
  • 121. Many companies assume that the initial cloud deployment is the hard part.
  • 123. “Software maintenance cost is increasingly growing and estimates showed that about 90% of software life cost is related to its maintenance phase.” Which Factors Affect Software Projects Maintenance Cost More? Sayed Mehdi Hejazi Dehaghani and Nafiseh Hajrahimi
  • 124. If you don’t have a plan for maintenance, all that code you wrote will rot.
  • 125. “Coming back to that Terraform codebase after 6 months.” Eero Järnefelt Oil on canvas, 1893
  • 128. Key auto-update features:
        1. Automation-driven: Updates are discovered and the code is updated automatically. No relying on a human to remember it. Update cadence should be configurable.
        2. GitOps-driven: The code is updated via automated pull requests.
        3. Automated testing: You must have automated tests in place and running against each pull request to let you know if the updated code still works.
        4. Automated deployment: Once a pull request is merged, it must deploy automatically via the CI / CD pipeline, promoting the update across environments: e.g., dev → stage → prod.
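The "updates are discovered" step can be sketched as a comparison between the versions pinned in your code and the latest released versions. The module names and version tags below are invented for illustration, and the sketch assumes plain `vMAJOR.MINOR.PATCH` tags; a real auto-updater would then open a pull request for each entry it finds.

```python
def parse_version(tag: str) -> tuple[int, ...]:
    """Turn a tag like 'v0.10.1' into (0, 10, 1) so it compares numerically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def find_updates(pinned: dict[str, str], latest: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {module: (current, newest)} for every module with a newer release."""
    updates = {}
    for module, current in pinned.items():
        newest = latest.get(module)
        if newest is not None and parse_version(newest) > parse_version(current):
            updates[module] = (current, newest)
    return updates


# Hypothetical pins from the codebase and latest releases from upstream.
pinned_versions = {"vpc": "v0.9.0", "eks": "v1.2.0"}
latest_versions = {"vpc": "v0.10.1", "eks": "v1.2.0"}

updates = find_updates(pinned_versions, latest_versions)
```

Comparing parsed tuples rather than raw strings matters here: as text, "v0.9.0" sorts after "v0.10.1", which would hide the update.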
  • 129. Key idea #8: Updates are pushed to the code via PRs, automatically.
  • 130. Key idea #9: Code without automated tests will rot.
  • 131. How to do automated testing for infrastructure code https://terratest.gruntwork.io/docs/getting-started/introduction/#watch-how-to-test-infrastructure-code
  • 132. Outline: 1. Do it by hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once
  • 134. Key ideas:
        1. Manage everything as code in a Service Catalog.
        2. Set up your Landing Zone as early as you can.
        3. Only the CI / CD Pipeline can deploy to prod.
        4. The CI / CD Pipeline only deploys from the Service Catalog.
        5. The CI / CD Pipeline protects its admin permissions.
        6. Any team can deploy infra + apps from the Service Catalog.
        7. Any team can contribute to the Service Catalog.
        8. Updates are pushed to the code via PRs, automatically.
        9. Code without automated tests will rot.
  • 135. 5 cloud adoption fails and solutions:
        Fail | Description | Solution
        Do it by hand | ClickOps | Service Catalog
        Do it live | Everyone is an admin | Landing Zone
        Do it on my machine | People deploying from their computers | CI / CD Pipeline
        Do it only on my machine | Only Ops can deploy | Self-Service
        Do it once | Not taking maintenance into account | Automatic Updates
  • 136. The 5 solutions are part of the Gruntwork Production Framework https://docs.gruntwork.io/guides/production-framework/
  • 137. If you use this framework, here’s the experience for your Ops team:
  • 138. Step 1: Create a Service Catalog. Everything defined as code. Works for app + infra. You could build from scratch or on top of an existing one (e.g., Gruntwork Service Catalog).
  • 139. Step 2: Set up your Landing Zone. Set up your basic account structure, define account baselines, etc.
  • 140. Step 3: Set up a CI / CD pipeline. Ensure it’s the only way to deploy to prod. Make it work for apps + infra.
  • 141. Step 4: Provide self-service. Enable all teams to deploy. Start with a GitOps solution. Add UI later.
  • 142. Step 5: Set up automatic updates. PRs opened automatically. Automated tests in place for app + infra code.
  • 143. And here’s the experience for your Dev team:
  • 144. Step 1: Scaffold a new app. Leverage vetted application templates from the Service Catalog and the logic built in: e.g., service discovery, packaging, monitoring, testing, etc.
  • 145. Step 2: Deploy infrastructure. Leverage Self-Service + Service Catalog + CI / CD Pipeline.
  • 146. Step 3: Iterate on the app. Leverage CI / CD built into the templates to deploy subsequent changes.
  • 147. Step 4: Debug issues. Leverage monitoring, logging, alerting, etc. built into the templates.
  • 148. Step 5: Stay up to date. Leverage auto update built into the templates. Automated PRs + tests.
  • 149. “The Cloud you always wanted.” Thomas Cole Oil on canvas, 1836