This presentation looks at scaling patterns for Terraform, an infrastructure provisioning tool/language/framework.
I will also demonstrate a code generator I wrote to help teams adopt the Terraservices pattern as easily as possible.
https://github.com/williamtsoi1/generator-terraform-environments
A working example of the terraservices pattern is here: https://github.com/williamtsoi1/terraservices-example
2. PATTERNS FOR SCALING TERRAFORM
AGENDA
‣ Introduction
‣ Love/Hate relationship with Terraform
‣ Case study – can you relate to the pain?
‣ Something to make your life easier
‣ Things to think about when you get home
3. PATTERNS FOR SCALING TERRAFORM
INTRODUCTION
‣ Consultant, Coach and Engineer in
Continuous Integration, Infrastructure
Automation, Agile and DevOps
‣ Current role: Senior Automation Engineer
@ Vibrato
‣ Vibrato is a professional service IT
consultancy that specialises in
Automation, DevOps, Cloud Migration and
Data Engineering
4. PATTERNS FOR SCALING TERRAFORM
I LOVE
TERRAFORM!
‣ Started in mid-late 2015 (0.6.x)
‣ Lots to love
‣ terraform plan
‣ Dependency management
‣ Domain-specific language
‣ One language, many
providers/clouds
5. PATTERNS FOR SCALING TERRAFORM
I HATE
TERRAFORM!
‣ Live and die by your
terraform.tfstate
‣ “Randomly” destroys stuff
‣ Corrupted state files
‣ Sharing state files between your team
‣ No established usage patterns apart from
“Buy Terraform Enterprise!”
6. PATTERNS FOR SCALING TERRAFORM
LET’S TRY SOME ROLE PLAYING
‣ (scenario largely borrowed from Nicki Watt’s talk from HashiDays London 2017)
7. PATTERNS FOR SCALING TERRAFORM
DAY 1
‣ New green field project
‣ Bastion host
‣ Compute Cluster
‣ Database
‣ Automate all the things!
‣ WIN!
9. PATTERNS FOR SCALING TERRAFORM
DAY 2 –
PRODUCTION!
‣ Deploy to production… now!
‣ terraform-prod.tf
‣ terraform-test.tf
‣ terraform.tfstate
11. PATTERNS FOR SCALING TERRAFORM
DAY 3 – CHANGE
TO TEST
‣ terraform-prod.tfbkp
‣ terraform-test.tf
‣ terraform.tfvars
‣ terraform.tfstate
13. PATTERNS FOR SCALING TERRAFORM
TERRALITH:
CHARACTERISTICS
‣ Single state file
‣ Single definition file
‣ Hard coded config
‣ Local state
‣ Can’t manage environments separately
‣ Config not intuitive
‣ Maintenance nightmare: Duplicate code
15. PATTERNS FOR SCALING TERRAFORM
MULTI-TERRALITH
‣ Separated state file between staging and
production
‣ SLIGHTLY more intuitive (network and
VM split into separate files)
‣ Still lots of duplication (networks.tf and
vms.tf still duplicated)
16. PATTERNS FOR SCALING TERRAFORM
MODULAR!
‣ Database
‣ Amazon RDS
‣ DB Subnet groups
‣ Compute
‣ Instances
‣ Security Groups
‣ Core
‣ VPC
‣ Subnets
‣ Core Routing and Gateways
‣ Bastion Host
18. PATTERNS FOR SCALING TERRAFORM
TERRAMOD
‣ Separate out environment management
(config) and module definitions (code)
‣ Logical components as reusable modules
‣ No config or hard-coding allowed in
modules
‣ Input.tf and output.tf essentially act as
“contracts” of the module
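To make the "contract" idea concrete, here is a minimal sketch of what a compute module could look like. It is not taken from the original slides: the file split, the variable names (instance_flav, ami_id) and the output name are assumptions for illustration, and it uses Terraform 0.12+ syntax. Only priv_subnet_id is a name that appears in the talk.

```hcl
# modules/compute/input.tf (hypothetical): everything the module needs is passed in.
variable "instance_flav" {
  description = "Instance type for the compute nodes (illustrative name)"
  type        = string
}

variable "ami_id" {
  description = "AMI to boot from (illustrative name)"
  type        = string
}

variable "priv_subnet_id" {
  description = "Private subnet to launch into, supplied by the caller"
  type        = string
}

# modules/compute/compute.tf (hypothetical): no literals, only variables.
resource "aws_instance" "node" {
  ami           = var.ami_id
  instance_type = var.instance_flav
  subnet_id     = var.priv_subnet_id
}

# modules/compute/output.tf (hypothetical): what the module promises to expose.
output "node_private_ip" {
  value = aws_instance.node.private_ip
}
```

Because nothing is hard-coded, the same module can be called from staging and production with different values.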
20. PATTERNS FOR SCALING TERRAFORM
LIFE IS PRETTY
GOOD, UNTIL…
‣ You get asked to reduce the size of
the bastion box in production
‣ Piece of cake!
‣ Just change the bastion_flav value
in the production terraform.tfvars!
22. PATTERNS FOR SCALING TERRAFORM
WHERE’D MY
CLUSTER GO?
‣ Someone got lazy and reused the
var.bastion_flav variable!
‣ AWS will destroy the instances and
reprovision them since the sizes
have changed…
‣ Managing environments separately,
but not the logical components!
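A minimal sketch of how this kind of accident can be wired up, assuming a Terramod layout with core and compute modules. This is a hypothetical reconstruction, not the original slide code: the module paths and the instance_flav input are made up for illustration, while var.bastion_flav and module.core.priv_subnet_id are the names used in the talk.

```hcl
# environments/production/terraform.tf (hypothetical reconstruction)

variable "bastion_flav" {
  type = string
}

module "core" {
  source       = "../../modules/core"
  bastion_flav = var.bastion_flav # intended use: size of the bastion box
}

module "compute" {
  source         = "../../modules/compute"
  priv_subnet_id = module.core.priv_subnet_id
  instance_flav  = var.bastion_flav # the shortcut: cluster nodes borrow the bastion's size
}
```

Both module calls live in one configuration and one state file, so editing bastion_flav in terraform.tfvars changes the plan for the cluster as well as the bastion.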
23. PATTERNS FOR SCALING TERRAFORM
SOLUTION:
TERRASERVICES
‣ Expand out environments folder to
separate out logical components
‣ Logical components are separated out!
Changing config for the bastion will no
longer accidentally break things in other
modules, even in the same environment
25. PATTERNS FOR SCALING TERRAFORM
HOW TO DO THE SAME THING USING THE
TERRASERVICES PATTERN?
environments/production/core/compute.tf
environments/production/core/output.tf
26. PATTERNS FOR SCALING TERRAFORM
HOW TO DO THE SAME THING USING THE
TERRASERVICES PATTERN?
environments/production/compute/terraform.tf
27. PATTERNS FOR SCALING TERRAFORM
TERRASERVICES - IMPLICATIONS
‣ Require additional orchestration effort
‣ Deploy the core (VPC + subnets) before deploying compute (EC2 instance)
‣ tfstate explosion = (number of environments) * (number of logical components)
‣ Need a standard practice on laying out the tfstate files
‣ This setup is for larger teams and enterprises. Smaller teams can just use Terramod
‣ Remote state & distributed locking becomes really important
‣ Requires so much more code to write and maintain just to reference all the remote state files!
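The multiplication above can be sketched directly in Terraform. The key convention shown here, `<environment>/<component>/terraform.tfstate`, is only one possible layout and is an assumption, not necessarily what any particular tool produces. The snippet needs Terraform 0.12 or later for `setproduct` and `for` expressions.

```hcl
locals {
  environments = ["staging", "production"]
  components   = ["core", "compute", "database"]

  # One state key per (environment, component) pair.
  state_keys = [
    for pair in setproduct(local.environments, local.components) :
    "${pair[0]}/${pair[1]}/terraform.tfstate"
  ]
}

output "state_keys" {
  value = local.state_keys
}

# 2 environments * 3 components = 6 state files to keep track of.
output "state_file_count" {
  value = length(local.state_keys)
}
```

Whatever convention you pick, the point is that every team member can derive the location of a state file from the environment and component name alone.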
28. PATTERNS FOR SCALING TERRAFORM
INTRODUCING THE
TERRAFORM-
ENVIRONMENTS
CODE GENERATOR!
‣ Yeoman
‣ Supports TerraServices and Terramod
patterns
‣ Currently only supports s3 remote state
‣ Future: support anything with distributed
state locking (azurerm, gcs, consul)
30. PATTERNS FOR SCALING TERRAFORM
INSTRUCTIONS
‣ Install yeoman
‣ npm install -g yo
‣ Install the generator
‣ npm install -g generator-terraform-environments
‣ Create your project folder, cd to it, then run
‣ yo terraform-environments
31. PATTERNS FOR SCALING TERRAFORM
BACKUP FOR IF THE DEMO DOESN’T WORK
‣ https://asciinema.org/a/tUwkFEpuWmR4lJVwYvrKRRkit
32. PATTERNS FOR SCALING TERRAFORM
THINGS TO THINK ABOUT WHEN YOU GET
HOME
‣ Use either Terramod or TerraServices. Do not Terralith!
‣ How to split modules? Think about:
‣ Team/responsibility structures
‣ Release cadence of various components
‣ Overall architecture of the system
‣ What remote state to use? Distributed locking is important!
‣ Branching model for this repo? Github-flow should be fine
‣ Security for remote state – prevent tampering & accessing secrets
33. PATTERNS FOR SCALING TERRAFORM
APPENDIX
‣ Project homepage: https://github.com/williamtsoi1/generator-terraform-environments
‣ Example using TerraServices: https://github.com/williamtsoi1/terraservices-example
‣ Contact details
‣ william.tsoi@vibrato.com.au
‣ @williamtsoi on Twitter
‣ https://about.me/williamtsoi
Editor's Notes
Introduction to myself
Talk about how I got into Terraform, and my experience at the time
Case study, which mimics real life. From the case study we can see an evolution of what your Terraform code layout/pattern should look like as your infrastructure grows
After the case study we should have a fair idea of what the end-state should look like. I’ll present a little open source project that I’ve created to make your life easier so you can get there so much quicker without the pain
Things to think about in your organisation or circumstances so you can make the smart decisions early
Recovering agile coach, now truly recovered.
We’re hiring. Talk to Molly (point her out in the room)
Started with Terraform around 2 years ago.
Customer was doing a digital transformation:
Delivery Process (waterfall to agile)
Architecture (moving from monolith to microservices)
Infrastructure (on-prem to cloud)
Perfect opportunity to introduce Terraform
Customer didn’t go with Terraform in the end; these were some of the criticisms at the time.
Hardest thing to manage in Terraform is the state file
“Randomly” is quoted because it’s not the tool that’s the problem, it’s how you’re using it
Sharing state files
Let’s look at some of the perceived
It’s your first day at a brand new job, and surprisingly you get to work on a green field project.
Something simple. Bastion Host and base networking
A compute cluster, could be Kubernetes
Database, let’s say RDS
Code away. Quite a simple setup – so read the documentation on each of the resource types, and off you go.
Terraform apply, and you have a single state file created
Your boss is so impressed, that by the second day you’ve been asked to deploy to production
Create some separation between the environments, so you make a copy of the file
Make the changes for IP addresses etc in the prod file
Run terraform apply again. Great!
Change CIDR range for test environment. As a good engineer, you make a backup of the prod file so it doesn’t get touched
Ask audience: what happens when I terraform apply?
You could run terraform plan first, but let’s say the change looked so trivial that you were over-confident
Rookie mistake, but just shows how dangerous Terraform can be
Duplicated code between each environment
Separate out production and staging into separate folders with separate tfstate
Broke out network and VM into separate files
Better use of variables (point out the variables)
So if I make a change to staging I won’t blow up production
How do we facilitate code reuse?
You read about modules in the Terraform documentation, and start thinking about how to break the code out
This is the first “sensible solution”
environment-specific config such as instance sizes, cidr ranges etc belong in the environments sub-folder
logical components eg. compute, core, data separated out into a modules sub-folder
The modules sub-folder could even be extracted out into separate repos
What does an environment file look like?
Has 3 main functions:
Consumes the environment-specific terraform.tfvars file and injects the variables into the module
Pass dependencies between the modules (eg. module.core.priv_subnet_id which is only known at runtime)
Manage separated tfstate files (same as multi-terralith)
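Those three functions can be sketched roughly as follows. This is an illustration under assumptions, not the file from the slides: the variable names other than bastion_flav, the values, and the module input names are hypothetical, and the syntax is Terraform 0.12+.

```hcl
# environments/staging/terraform.tfvars would hold the values, for example:
#   vpc_cidr     = "10.1.0.0/16"
#   bastion_flav = "t2.micro"
#   compute_flav = "t2.medium"

# environments/staging/terraform.tf (hypothetical)

# 1. Consume the environment-specific variables...
variable "vpc_cidr" { type = string }
variable "bastion_flav" { type = string }
variable "compute_flav" { type = string }

# ...and inject them into the modules.
module "core" {
  source       = "../../modules/core"
  vpc_cidr     = var.vpc_cidr
  bastion_flav = var.bastion_flav
}

# 2. Pass dependencies between modules: the subnet ID is only known
#    once the core module has been applied.
module "compute" {
  source         = "../../modules/compute"
  instance_flav  = var.compute_flav
  priv_subnet_id = module.core.priv_subnet_id
}

# 3. State is kept per environment folder, so staging and production
#    each have their own tfstate.
```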
Ask the audience how many are using a pattern like this?
Modules can also consume other modules too
End of the talk? Not quite!
It's a config, so go to the environments folder
then go into production
then just change the value of bastion_flav in terraform.tfvars
Then run terraform apply
Ask audience: What will happen?
Changing the value of var.bastion_flav has caused a change in the compute module call as well!
Recap that dependencies are looked up via module output variables (module.core.priv_subnet_id)
On the production/core module I now have a remote state defined in s3, in a given bucket, and a given path
I’m also including an output parameter in the production/core module. When I terraform apply this, the value of the private subnet will be stored within the tfstate file, stored in s3
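A minimal sketch of what the production/core configuration described here could contain. The bucket name, key, region and the vpc_cidr input are placeholders and assumptions, not the values from the slides; only the priv_subnet_id output name comes from the talk.

```hcl
# environments/production/core (hypothetical file contents)

terraform {
  backend "s3" {
    bucket = "my-terraform-remote-state"         # placeholder bucket name
    key    = "production/core/terraform.tfstate" # placeholder path
    region = "ap-southeast-2"                    # placeholder region
  }
}

variable "vpc_cidr" {
  type = string
}

module "core" {
  source   = "../../../modules/core"
  vpc_cidr = var.vpc_cidr
}

# Re-export the module output so it is written to the tfstate file in S3,
# where other components can read it.
output "priv_subnet_id" {
  value = module.core.priv_subnet_id
}
```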
Now that I've run terraform apply and deployed the VPC and the subnets, over in the production/compute module, I need to make a reference to the remote tfstate file that was generated when I terraform applied the core module, using the exact same parameters for s3 bucket and path
Then I can access the value in the tfstate file using the data.terraform_remote_state.core.priv_subnet_id syntax
terraform will then go into the state file in s3, retrieve the value of the subnet and then use it for this run
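On the consuming side, the lookup could look something like the sketch below. The bucket, key and region are placeholders that must match whatever the core component uses, and compute_flav and instance_flav are hypothetical names. One caveat: on Terraform 0.12 and later, remote state values are read through the `outputs` attribute, so the reference becomes data.terraform_remote_state.core.outputs.priv_subnet_id rather than the older form quoted above.

```hcl
# environments/production/compute/terraform.tf (hypothetical file contents)

variable "compute_flav" {
  type = string
}

# Point at the state file written by the core component.
data "terraform_remote_state" "core" {
  backend = "s3"

  config = {
    bucket = "my-terraform-remote-state"         # must match the core backend
    key    = "production/core/terraform.tfstate" # must match the core backend
    region = "ap-southeast-2"
  }
}

module "compute" {
  source        = "../../../modules/compute"
  instance_flav = var.compute_flav

  # Terraform 0.12+ syntax; older releases omit ".outputs".
  priv_subnet_id = data.terraform_remote_state.core.outputs.priv_subnet_id
}
```

The compute component now has its own state file, so a change to the bastion in core no longer shows up in the compute plan.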
orchestration - ordering problem. Need to run one after the other in a specific order
tfstate explosion: in terralith there was only 1 file to manage; in terramod with 3 modules, 3 files. Now with this pattern, if I have 2 environments I'm dealing with 6 tfstate files! How does one keep track of where each of the files is?
How to solve?
Re: problem 1 – use a CI system to orchestrate this, or go and buy Terraform Enterprise
Re: problem 2-4, I have a solution!
mkdir my-terraform-app
cd my-terraform-app
yo terraform-environments
williamtsoi-terraform-remote-state
cd environments/staging/core
terraform init
terraform apply
cd ../../staging/compute
terraform init
terraform apply
With the examples I’ve shown it should be plainly obvious why Terralith is a bad idea
TerraServices might be overkill based on the size of your team and the complexity of your system, but you need to think about this before you begin.
It will be hard to refactor from TerraMod to TerraServices (as you’ll need to manually play around with tfstate files)
How to split?
Team/responsibility structures
Split by logical components if teams have end-to-end responsibility
Often this is not the case – eg. Central CCOE team hands out AWS accounts with pre-cooked VPCs and custom routing tables and VPC peering rules to app development teams. VPCs should be one module, the app infrastructure should be another module (referencing tfstate files created by CCOE team)
Release cadence (eg. App infra could be reprovisioned in a blue/green deployment, but DB hardly ever changes). So better to separate
Overall architecture – how much sharing is there between different logical components (eg. Sharing same VPCs/subnets, R53 zones etc)
Remote state
Terragrunt, but probably no longer needed
Security for remote state:
Love it or hate it, people will put stuff they’re not supposed to in output variables
Security for S3
RBAC for Consul
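For the S3 case, a minimal sketch of a backend block that turns on server-side encryption and DynamoDB-based locking. The bucket, key, region and table names are placeholders. The `encrypt` and `dynamodb_table` arguments belong to the S3 backend, but locking options have changed across Terraform releases, so treat this as an assumption and check the backend documentation for your version.

```hcl
terraform {
  backend "s3" {
    bucket  = "my-terraform-remote-state"         # placeholder bucket name
    key     = "production/core/terraform.tfstate" # placeholder path
    region  = "ap-southeast-2"                    # placeholder region
    encrypt = true                                # encrypt the state object at rest

    # DynamoDB table used for state locking; it needs a string
    # partition key named "LockID".
    dynamodb_table = "terraform-locks" # placeholder table name
  }
}
```

Encryption does not replace access control: anyone who can read the bucket can still read whatever ends up in output variables, so bucket policies matter just as much.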