Slides from https://www.meetup.com/ContinuousDeliveryNYC/events/254036209/
In this talk we will show some techniques to add safeguards to a CI pipeline that manages AWS infrastructure with Terraform. By adding parameters that serve as CAPTCHAs and adding other checks to pipeline stages, you can prevent accidental modification of production environments and unauthorized creation of expensive resources, protecting both your wallet and your infrastructure. We will demonstrate several techniques, describe their tradeoffs, and discuss the potential for future work in this area.
Managing Expensive or Destructive Operations in Jenkins CI
1. Managing Expensive or
Destructive Operations
in Jenkins CI
http://bit.ly/dangerous-jenkins
September 26, 2018
Richard Bullington-McGuire
Principal Architect, Modus Create
richard@moduscreate.com
@obscurerichard
4. My Continuous Integration & CD Experience
Early Experiments - CruiseControl (Java) (2004-2008)
CruiseControl.NET (C#) 2008
Hudson + Windows / Linux + Java: 2009-2011
Jenkins (Linux, Windows, Android, iOS, Java, .NET, Python,
NodeJS, Angular, React, Objective-C, etc.) 2011-2018
Commercial clients in a wide variety of industries
5. What is Expensive?
Downtime for critical systems
Deploying Cloud back end services (at scale)
Large scale load testing
Starting large scale analysis jobs - computational
fluid dynamics, genomics, etc.
6. What is Destructive?
● Cloud provisioning: CloudFormation or Terraform
● Very risky: large changes w/ Infrastructure as Code
● Targeting the wrong environment by mistake
“Oops, I just deleted the prod database server!”
7. Defense Strategies
1. Access Controls
2. Separate Control Systems for Production
3. Code Review
4. Human Check on Deployments
5. Opt-In Switches and CAPTCHAs
6. Small Increments with Metrics & Monitoring
7. Tiered Environments
8. Restricted Branches
9. Case Study: Jenkins & Terraform at Work
● Education company cloud migration (4mo -> prod)
● Apps w/ > 30,000 RPM at peak, measured with New Relic
● Production with 80+ sizeable AMIs baseline
● Auto Scaling to 200+ AMIs under heavy load
● Multiple environments: dev, qa, staging, prod
● Terabyte-scale MySQL Aurora cluster, 50+ TB in S3
● All managed with Jenkins, Terraform, Ansible, Packer,
CodeDeploy, multiple tech stacks
10. What could go wrong?
● Some changes to resources destroy old & create new
○ E.g., instances not part of auto scaling groups
are easy to destroy by accident
● Changes to network environments may require deleting
and recreating whole server stacks
● Terraform has imperfect grasp of some dependencies -
some plans fail to execute
● Accidentally deploying to the wrong environment
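One way to catch the "destroy old & create new" case before it runs is to scan a machine-readable plan for delete actions. The sketch below assumes a Terraform version that supports `terraform show -json` on a saved plan, and that the output carries a `resource_changes` list with `change.actions` per resource. The function name and the sample resource addresses are hypothetical, made up for illustration.

```python
import json
from typing import List


def destructive_changes(plan: dict) -> List[str]:
    """Return addresses of resources the plan would delete or replace.

    A replacement shows up as both "delete" and "create" in the actions
    list, so checking for "delete" covers both cases.
    """
    flagged = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(resource["address"])
    return flagged


# Hand-written fragment shaped like JSON plan output (hypothetical resources).
sample_plan = json.loads("""
{"resource_changes": [
  {"address": "aws_instance.bastion", "change": {"actions": ["delete", "create"]}},
  {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
  {"address": "aws_db_instance.main", "change": {"actions": ["delete"]}},
  {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}}
]}
""")

flagged = destructive_changes(sample_plan)
```

A pipeline stage could stop and ask for extra confirmation whenever `flagged` is not empty, rather than applying the plan straight away.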
11. That’s not exactly Continuous Deployment!
Nope.
But not every organization is ready for that,
Nor will it work to push every change to every environment
without careful review.
12. Access Controls
Use access controls baked into Jenkins & target systems
● Integrate with Enterprise Directory - AD or Google SSO
● Limit access to deploy jobs
● Use Jenkins secrets or other secret store to hide keys
Example: Microsoft Cloud SSO + Jenkins SAML provider
13. Separate Environments
Consider separating dev & prod deploy mechanisms
● Have separate CI systems for dev & prod systems
● Limit access to prod systems more strictly
● May help with Sarbanes-Oxley compliance or other
enterprise security controls. Could also be just a crutch.
E.g., one Jenkins server per AWS account
14. Code Review
Bake code review into deployment pipeline
1. Use GitHub pull requests or equivalent mechanisms
2. Require sign-off from tech lead or multiple people
3. Have linting tools to automate some code review chores
15. Human Check on Deployments
Require human review of critical steps
1. Use Jenkins inputs to pause and provide an abort option
2. Set rules of engagement on how deploys are done
Example: on a mixed team of consultants and company
employees, only do a production deploy with customer staff
watching and signing off
16. Opt-In Switches and CAPTCHAs
Add speed bumps (CAPTCHAs) to deployments
1. Avoid expensive or destructive operations on every commit
2. Make deployers run a job manually and select expensive or
destructive operations specifically
3. Make deployers solve a simple math problem to slow them
down and think about what they are doing
17. Confirm?
Type CONFIRM to continue
What do you think happens
when people try asking for
confirmation like this?
18. BAD IDEA proven awful
through HARD EXPERIENCE
● People always just type CONFIRM,
● Or worse, the browser autofills it!
Confirm? CONFIRM
Type CONFIRM to continue
19. Solve it! 42
What do you get when you multiply 6 by 7?
A simple math problem works
much better in practice
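A minimal sketch of the math-problem check, written in Python rather than as a Jenkins pipeline step. The slide shows a fixed question (6 by 7). Randomizing the operands is an assumption added here so that a browser cannot autofill a previous answer. The function names are hypothetical.

```python
import random


def make_challenge(rng=random):
    """Build a small multiplication question and its expected answer."""
    a = rng.randint(2, 9)
    b = rng.randint(2, 9)
    question = f"What do you get when you multiply {a} by {b}?"
    return question, a * b


def check_answer(expected: int, submitted: str) -> bool:
    """Accept only the correct number; anything non-numeric fails."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        return False
```

A deploy job would show the question as a parameter prompt and abort unless `check_answer` returns `True`. Typing CONFIRM fails the check because it is not a number.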
20. Small Increments with Metrics & Monitoring
Deploy small increments at a time and monitor everything:
1. Make small changes and test them independently
2. Monitor target systems with New Relic or similar systems
3. Write automated tests for the small increments
4. Write load tests that can test your system at scale
21. Restricted Branches & Deploy Constraints
Require some changes to go through a certain branch:
1. Restrict prod deployments to the master branch
2. Ensure changes have gone through code review & testing
3. Make deploy jobs fail fast if they violate constraints
4. Build switches into code that have to be backed out in
commits in order to do some types of dangerous changes
5. Restrict deploys to certain times or by metrics
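The constraints above can be sketched as a single fail-fast check that runs before any deploy work starts. This is an illustration only: the function name, the weekday 09:00-16:00 window, and the 1% error-rate threshold are assumptions, not values from the talk.

```python
from datetime import datetime, time


class DeployBlocked(Exception):
    """Raised when a deploy violates a constraint."""


def check_deploy_constraints(environment, branch, now, error_rate,
                             window=(time(9, 0), time(16, 0)),
                             max_error_rate=0.01):
    """Raise DeployBlocked if a prod deploy breaks any rule; otherwise return None."""
    if environment != "prod":
        return  # only production is restricted in this sketch
    if branch != "master":
        raise DeployBlocked(f"prod deploys must come from master, not {branch}")
    in_window = window[0] <= now.time() < window[1]
    if now.weekday() >= 5 or not in_window:
        raise DeployBlocked("prod deploys are limited to weekday working hours")
    if error_rate > max_error_rate:
        raise DeployBlocked("current error rate is too high to deploy safely")
```

Calling this at the top of the job keeps a rejected deploy cheap: nothing is built or planned before the rule is checked.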
26. Step By Step
● Builds on CIS Baseline demo from NYC DevOps meetup
○ Ansible, Packer, OpenSCAP scan
● Adds building & destroying:
○ AWS VPC
○ Auto Scaling Group with ELB
○ Simple web application (from DevOps Wall St demo)
● Has Auto Scaling Group rotation
27. Defenses implemented
This demo implements 4 of the 8 defenses listed earlier:
1. Access Controls
2. Code Review
3. Human Check on Deployments
4. Opt-In Switches and CAPTCHAs
28. Lessons Learned
● Keeping people in the deploy loop is good sometimes
● Mashing up Terraform, Ansible, Packer & Jenkins works
● Simple math problem CAPTCHAs are way better than
simple confirmation prompts
● Using Docker for both Packer and Terraform works well
● Working around Docker’s tendency to create root owned
files can be done with Docker busybox and a chown -R
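The last workaround might look roughly like the following: run a throwaway busybox container that mounts the workspace and hands ownership back to the build user. The helper name and the `/work` mount point are hypothetical. The slides do not show the exact command used in the demo.

```python
def chown_command(workdir, uid, gid, image="busybox"):
    """Build a docker argv that recursively resets ownership of workdir."""
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/work",   # mount the workspace into the container
        image,
        "chown", "-R", f"{uid}:{gid}", "/work",
    ]
```

A build step could pass the result to `subprocess.run(..., check=True)` with the workspace path and the build user's `os.getuid()` and `os.getgid()`, after the Packer or Terraform containers have finished.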
29. Extending Defenses Further
● Find a way to integrate a better (real) CAPTCHA
● Use TOTP passwords (Google Authenticator style)
during job approval process
● Hook into metrics to avoid changes at most volatile
times - add extra danger indicators
● Adapt amount of approvals needed per environment
● Chef InSpec / compliance tests / Test Kitchen
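As a sketch of the TOTP idea, the code below computes a time-based one-time password as specified in RFC 6238 (HMAC-SHA1, 30-second steps) and compares it with what an approver typed. It assumes the shared secret is already available as raw bytes. Authenticator apps usually exchange it base32-encoded, and that decoding is left out. The `approve` helper is hypothetical and accepts only the current time step.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code using HMAC-SHA1."""
    counter = int(unix_time) // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation, per RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def approve(secret: bytes, submitted: str, unix_time: int) -> bool:
    """Return True only if the submitted code matches the current step."""
    expected = totp(secret, unix_time).encode()
    return hmac.compare_digest(expected, submitted.strip().encode())
```

A real approval step would likely also accept the neighboring time step to tolerate clock drift, and would keep the secret in a credential store rather than in the job definition.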