Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Lessons learned from writing over 300,000 lines of infrastructure code

8,805 views

Published on

This talk is a concise masterclass on how to write infrastructure code. I share key lessons from the “Infrastructure Cookbook” we developed at Gruntwork while creating and maintaining a library of over 300,000 lines of infrastructure code that’s used in production by hundreds of companies. Come and hear our war stories, laugh about all the mistakes we’ve made along the way, and learn what Terraform, Packer, Docker, and Go look like in the wild.

Published in: Software

Lessons learned from writing over 300,000 lines of infrastructure code

  1. 1. Lessons learned from writing 300,000 LINES OF INFRASTRUCTURE CODE
  2. 2. It’s time for a confession:
  3. 3. DevOps is still in the stone ages
  4. 4. We are trying to build this…
  5. 5. Using this.
  6. 6. If you just read the headlines, it all sounds so cutting edge…
  7. 7. Kubernetes, Docker, serverless, microservices, infrastructure as code, distributed tracing, big data systems, data warehouses, data lakes, chaos engineering, zero-trust architecture, streaming architecture, immutable infrastructure, service discovery, service meshes, NoSQL, NewSQL, ChatOps, HugOps, NoOps, DevSecOpsLeanSREAgileWTFBBQ, …
  8. 8. But to me, it doesn’t feel cutting edge. It feels more like…
  9. 9. #thisisdevops
  10. 10. #thisisdevops
  11. 11. #thisisdevops
  12. 12. #thisisdevops
  13. 13. Here’s something we don’t admit often enough:
  14. 14. Building production-grade infrastructure is hard.
  15. 15. And stressful.
  16. 16. And time consuming.
  17. 17. Some rough numbers:
  18. 18. Production-grade infrastructure Project Examples Time estimate Managed service ECS, ELB, RDS, ElastiCache 1 – 2 weeks Distributed system (stateless) nginx, Node.js app, Rails app 2 – 4 weeks Distributed system (stateful) Elasticsearch, Kafka, MongoDB 2 – 4 months Entire cloud architecture Apps, DBs, CI/CD, monitoring, etc. 6 – 24 months
  19. 19. Fortunately, it’s getting a little bit better
  20. 20. One trend I love: manage (almost) everything as code
  21. 21. Manual provisioning à Infrastructure as code Manual server config à Configuration management Manual app config à Configuration files Manual builds à Continuous integration Manual deployment à Continuous delivery Manual testing à Automated testing Manual DBA work à Schema migrations Manual specs à Automated specs (BDD)
  22. 22. The benefits of code: 1.  Automation 2.  Version control 3.  Code review 4.  Testing 5.  Documentation 6.  Reuse
  23. 23. At Gruntwork, we’ve created a reusable library of infrastructure code
  24. 24. Primarily written in Terraform, Go, Python, and Bash
  25. 25. Off-the-shelf, battle-tested solutions for AWS, Docker, VPCs, VPN, MySQL, Postgres, Couchbase, ElasticSearch, Kafka, ZooKeeper, Monitoring, Alerting, secrets management, CI, CD, DNS, …
  26. 26. The library is used in production by hundreds of customers
  27. 27. Which allows you to go from this… Project Examples Time estimate Managed service ECS, ELB, RDS, ElastiCache 1 – 2 weeks Distributed system (stateless) nginx, Node.js app, Rails app 2 – 4 weeks Distributed system (stateful) Elasticsearch, Kafka, MongoDB 2 – 4 months Entire cloud architecture Apps, DBs, CI/CD, monitoring, etc. 6 – 24 months
  28. 28. … To this Project Examples Time estimate Managed service ECS, ELB, RDS, ElastiCache 1 – 2 weeks 1 day Distributed system (stateless) nginx, Node.js app, Rails app 2 – 4 weeks 1 day Distributed system (stateful) Elasticsearch, Kafka, MongoDB 2 – 4 months 1 day Entire cloud architecture Apps, DBs, CI/CD, monitoring, etc. 6 – 24 months 1 day
  29. 29. 3+ years of development. 300,000+ lines of code.
  30. 30. In this talk, I’ll share what we learned along the way!
  31. 31. I’m Yevgeniy Brikman ybrikman.com
  32. 32. Co-founder of Gruntwork gruntwork.io
  33. 33. Author
  34. 34. 1.  Checklist 2.  Tools 3.  Modules 4.  Tests 5.  Releases Outline
  35. 35. 1.  Checklist 2.  Tools 3.  Modules 4.  Tests 5.  Releases Outline
  36. 36. Production-grade infrastructure Project Examples Time estimate Managed service ECS, ELB, RDS, ElastiCache 1 – 2 weeks Distributed system (stateless) nginx, Node.js app, Rails app 2 – 4 weeks Distributed system (stateful) Elasticsearch, Kafka, MongoDB 2 – 4 months Entire cloud architecture Apps, DBs, CI/CD, monitoring, etc. 6 – 24 months
  37. 37. How can it possibly take that long??
  38. 38. Two main reasons:
  39. 39. Reason it takes so long #1: Yak shaving
  40. 40. Yak shaving: a seemingly endless series of small tasks you have to do before you can do what you actually want.
  41. 41. Reason it takes so long #2: It’s a long checklist!
  42. 42. Introducing: The production-grade infrastructure checklist
  43. 43. Task Description Example tools Install Install the software binaries and all dependencies. Bash, Chef, Ansible, Puppet Configure Configure the software at runtime: e.g., configure port settings, file paths, users, leaders, followers, replication, etc. Bash, Chef, Ansible, Puppet Provision Provision the infrastructure: e.g., EC2 Instances, load balancers, network topology, security groups, IAM permissions, etc. Terraform, CloudFormation Deploy Deploy the service on top of the infrastructure. Roll out updates with no downtime: e.g., blue-green, rolling, canary deployments. Scripts, Orchestration tools (ECS, K8S, Nomad) Production-grade infrastructure checklist, part 1/4
  44. 44. Task Description Example tools Security Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. ACM, EBS Volumes, Cognito, Vault, CiS Monitoring Availability metrics, business metrics, app metrics, server, metrics, events, observability, tracing, alerting. CloudWatch, DataDog, New Relic, Honeycomb Logs Rotate logs on disk. Aggregate log data to a central location. CloudWatch Logs, ELK, Sumo Logic, Papertrail Backup and restore Make backups of DBs, caches, and other data on a scheduled basis. Replicate to separate region/account. RDS, ElastiCache, ec2- snapper, Lambda Production-grade infrastructure checklist, part 2/4
  45. 45. Task Description Example tools Networking VPCs, subnets, static and dynamic IPs, service discovery, service mesh, firewalls, DNS, SSH access, VPN access. EIPs, ENIs, VPCs, NACLs, SGs, Route 53, OpenVPN High availability Withstand outages of individual processes, EC2 Instances, services, Availability Zones, and regions. Multi AZ, multi-region, replication, ASGs, ELBs Scalability Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers). ASGs, replication, sharding, caching, divide and conquer Performance Optimize CPU, memory, disk, network, GPU and usage. Query tuning. Benchmarking, load testing, profiling. Dynatrace, valgrind, VisualVM, ab, Jmeter Production-grade infrastructure checklist, part 3/4
  46. 46. Task Description Example tools Cost optimization Pick proper instance types, use spot and reserved instances, use auto scaling, nuke unused resources ASGs, spot instances, reserved instances Documentation Document your code, architecture, and practices. Create playbooks to respond to incidents. READMEs, wikis, Slack Tests Write automated tests for your infrastructure code. Run tests after every commit and nightly. Terratest Production-grade infrastructure checklist, part 4/4
  47. 47. Key takeaway: use a checklist to build production-grade infrastructure.
  48. 48. Full checklist: gruntwork.io/devops-checklist/
  49. 49. 1.  Checklist 2.  Tools 3.  Modules 4.  Tests 5.  Releases Outline
  50. 50. What tools do you use to implement that checklist?
  51. 51. We prefer tools that: 1.  Define infrastructure as code 2.  Are open source & popular 3.  Support multiple providers 4.  Support reuse & composition 5.  Require no extra infrastructure 6.  Support immutable infrastructure
  52. 52. Here’s the toolset we’ve found most effective as of 2018:
  53. 53. Server Server Server Server Server Networking, Load Balancers, Databases, Users, Permissions, etc 1. Deploy all the basic infrastructure using Terraform
  54. 54. Server Server Server Server Server Networking, Load Balancers, Databases, Users, Permissions, etc VM VM VM VM VM 2. Configure the VMs using Packer
  55. 55. Server Server Server Server Server Networking, Load Balancers, Databases, Users, Permissions, etc VM VM VM VM VM Docker Cluster 3. Some of the VMs form a cluster (e.g., ECS or Kubernetes cluster)
  56. 56. Server Server Server Server Server Networking, Load Balancers, Databases, Users, Permissions, etc VM VM VM VM VM Docker Cluster Container Container Container Container Container Container 4. We use that Docker cluster to run Docker containers
  57. 57. Server Server Server Server Server Networking, Load Balancers, Databases, Users, Permissions, etc VM VM VM VM VM Docker Cluster Container Container Container Container Container Container 5. Under the hood: glue everything together with Bash, Go, and Python
  58. 58. Important note:
  59. 59. We find these tools useful…
  60. 60. But tools are not enough. You must change behavior too.
  61. 61. Old way: make changes directly and manually
  62. 62. New way: make changes indirectly and automatically
  63. 63. Learning these takes time
  64. 64. More time than making a change directly…
  65. 65. If you make changes manually, the code will not reflect reality.
  66. 66. And the next person to try to use it will get errors
  67. 67. So then they’ll fall back and make manual changes
  68. 68. But making manual changes does not scale
  69. 69. This does
  70. 70. Key takeaway: tools are not enough. You also need to change behavior.
  71. 71. 1.  Checklist 2.  Tools 3.  Modules 4.  Tests 5.  Releases Outline
  72. 72. It’s tempting to define all of your infrastructure code in 1 file / folder… dev qa test stage prod
  73. 73. Downsides: runs slower; harder to understand; harder to review (plan output unreadable); harder to test; harder to reuse code; need admin permissions; team concurrency limited to 1… dev qa test stage prod
  74. 74. Also, a mistake anywhere could break everything! dev qa test stage prod
  75. 75. Large modules considered harmful. dev qa test stage prod
  76. 76. What you really want is isolation for each environment dev stage prod
  77. 77. And for each “component” MySQL VPC Frontend
  78. 78. Take your architecture…
  79. 79. And break it up into small, reusable, standalone, tested modules module module module module module module module module module module module module module module module
  80. 80. live └ dev └ stage └ prod Break architecture down by environment
  81. 81. live └ dev └ vpc └ mysql └ frontend └ stage └ vpc └ mysql └ frontend └ prod └ vpc └ mysql └ frontend Break environments down by infrastructure type
  82. 82. live └ dev └ vpc └ mysql └ frontend └ stage └ vpc └ mysql └ frontend └ prod └ vpc └ mysql └ frontend modules └ vpc └ mysql └ microservice Implement infrastructure in modules
  83. 83. gruntwork-io └ asg └ alb └ ssh live └ dev └ vpc └ mysql └ frontend └ stage └ vpc └ mysql └ frontend └ prod └ vpc └ mysql └ frontend modules └ vpc └ mysql └ microservice Build complex modules from simpler modules
  84. 84. Example: Vault Modules
  85. 85. terraform-aws-vault └ modules └ examples └ test └ README.md Typical repo has three key folders: /modules, /examples, /test
  86. 86. terraform-aws-vault └ modules └ install-vault └ run-vault └ vault-cluster └ vault-elb └ vault-security-group-rules └ examples └ test └ README.md /modules: implementation code, broken down into standalone sub-modules
  87. 87. terraform-aws-vault └ modules └ install-vault └ install-vault.sh └ run-vault └ vault-cluster └ vault-security-group-rules └ vault-elb └ examples └ test └ README.md install-xxx: sub-module to install the software (e.g., in Packer or Docker)
  88. 88. terraform-aws-vault └ modules └ install-vault └ run-vault └ run-vault.sh └ vault-cluster └ vault-security-group-rules └ vault-elb └ examples └ test └ README.md run-xxx: sub-module to launch the software during boot (e.g., in User Data)
  89. 89. terraform-aws-vault └ modules └ install-vault └ run-vault └ vault-cluster └ main.tf └ vault-security-group-rules └ vault-elb └ examples └ test └ README.md xxx-cluster: sub-module to deploy infrastructure (e.g., into an ASG)
  90. 90. terraform-aws-vault └ modules └ install-vault └ run-vault └ vault-cluster └ vault-security-group-rules └ main.tf └ vault-elb └ main.tf └ examples └ test └ README.md xxx-yyy: sub-modules with shareable components (e.g., Security Group rules)
  91. 91. variable "cluster_name" { description = "Name to use for the Vault cluster" } variable "vpc_id" { description = "ID of the VPC to use" } variable "allowed_inbound_cidr_blocks" { description = "IPs allowed to connect to Vault" type = "list" } Each sub-module exposes variables for configuration and dependencies
  92. 92. Small, configurable sub-modules make code reuse possible
  93. 93. As you can combine and compose them any way you want!
  94. 94. terraform-aws-vault └ modules └ examples └ vault-with-elb └ vault-s3-backend └ vault-ami └ test └ README.md /examples: Runnable example code for how to use the sub-modules
  95. 95. terraform-aws-vault └ modules └ examples └ vault-with-elb └ main.tf └ vault-s3-backend └ main.tf └ vault-ami └ test └ README.md Some examples are Terraform code
  96. 96. terraform-aws-vault └ modules └ examples └ vault-with-elb └ vault-s3-backend └ vault-ami └ vault.json └ test └ README.md Some examples are Packer templates or Dockerfiles
  97. 97. terraform-aws-vault └ modules └ examples └ test └ vault_with_elb_test.go └ vault_s3_backend_test.go └ README.md /tests: Automated tests for the sub- modules.
  98. 98. terraform-aws-vault └ modules └ examples └ vault-with-elb └ vault-s3-backend └ test └ vault_with_elb_test.go └ vault_s3_backend_test.go └ README.md Typically, our tests deploy & validate each example! More on this later.
  99. 99. Key takeaway: build infrastructure from small, composable modules.
  100. 100. 1.  Checklist 2.  Tools 3.  Modules 4.  Tests 5.  Releases Outline
  101. 101. Infrastructure code rots very quickly.
  102. 102. Infrastructure code rots very quickly.
  103. 103. Infrastructure code without automated tests is broken.
  104. 104. For general-purpose languages, we can run unit tests on localhost
  105. 105. For infrastructure as code tools, there is no “localhost” or “unit”
  106. 106. Therefore, the test strategy is: 1.  Deploy real infrastructure 2.  Validate it works 3.  Undeploy the infrastructure
  107. 107. We write these integration tests in Go using Terratest (open source!)
  108. 108. Terratest philosophy: how would you test it manually?
  109. 109. terraformOptions := &terraform.Options { TerraformDir: "../examples/vault-with-elb", } defer terraform.Destroy(t, terraformOptions) terraform.InitAndApply(t, terraformOptions) validateServerIsWorking(t, terraformOptions) Typical test structure
  110. 110. terraformOptions := &terraform.Options { TerraformDir: "../examples/vault-with-elb", } defer terraform.Destroy(t, terraformOptions) terraform.InitAndApply(t, terraformOptions) validateServerIsWorking(t, terraformOptions) Specify where the code lives
  111. 111. terraformOptions := &terraform.Options { TerraformDir: "../examples/vault-with-elb", } defer terraform.Destroy(t, terraformOptions) terraform.InitAndApply(t, terraformOptions) validateServerIsWorking(t, terraformOptions) Run terraform init and terraform apply to deploy
  112. 112. terraformOptions := &terraform.Options { TerraformDir: "../examples/vault-with-elb", } defer terraform.Destroy(t, terraformOptions) terraform.InitAndApply(t, terraformOptions) validateServerIsWorking(t, terraformOptions) Validate the infrastructure works as expected
  113. 113. // Get IPs of servers aws.GetPublicIpsOfEc2Instances(t, ids, region) // Make HTTP requests in a retry loop http.GetWithRetry(t, url, 200, expected, retries, sleep) // Run command over SSH ssh.CheckSshCommand(t, host, "vault operator init") Terratest has many tools built-in for validation
  114. 114. terraformOptions := &terraform.Options { TerraformDir: "../examples/vault-with-elb", } defer terraform.Destroy(t, terraformOptions) terraform.InitAndApply(t, terraformOptions) validateServerIsWorking(t, terraformOptions) At the end of the test, run terraform destroy to clean up
  115. 115. Note: tests create and destroy lots of resources!
  116. 116. Pro tip #1: run tests in completely separate “sandbox” accounts
  117. 117. Pro tip #2: clean up left-over resources with cloud-nuke.
  118. 118. e2e Tests Test pyramid Integration Tests Unit Tests
  119. 119. As you go up the pyramid, tests get more expensive, brittle, and slower Cost, brittleness, run time e2e Tests Integration Tests Unit Tests
  120. 120. How the test pyramid works with infrastructure code:
  121. 121. Individual modules Unit tests for infrastructure code: test individual sub-modules (keep them small!). e2e Tests Integration Tests Unit Tests
  122. 122. Multiple modules Integration tests for infrastructure code: test multiple sub-modules together. Individual modules e2e Tests Integration Tests Unit Tests
  123. 123. Entire stack E2E tests for infrastructure code: test entire environments (stage, prod). Individual modules e2e Tests Integration Tests Unit Tests Multiple modules
  124. 124. 30 – 120+ minutes 5 – 60 minutes Note the test times! This is another reason to use small modules. 1 – 20 minutes e2e Tests Integration Tests Unit Tests
  125. 125. Make sure to check out Terratest best practices for how to speed things up!
  126. 126. Key takeaway: infrastructure code without automated tests is broken.
  127. 127. 1.  Checklist 2.  Tools 3.  Modules 4.  Tests 5.  Releases Outline
  128. 128. Let’s put it all together: checklist, tools, modules, tests
  129. 129. 1. Go through the checklist Task Description Example tools Security Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. ACM, EBS Volumes, Cognito, Vault, CiS Monitoring Availability metrics, business metrics, app metrics, server, metrics, events, observability, tracing, alerting. CloudWatch, DataDog, New Relic, Honeycomb Logs Rotate logs on disk. Aggregate log data to a central location. CloudWatch Logs, ELK, Sumo Logic, Papertrail Backup and restore Make backups of DBs, caches, and other data on a scheduled basis. Replicate to separate region/account. RDS, ElastiCache, ec2- snapper, Lambda
  130. 130. provider "aws" { region = "us-east-1" } resource "aws_instance" "example" { ami = "ami-408c7f28" instance_type = "t2.micro" } 2. Write some code
  131. 131. terraformOptions := &terraform.Options { TerraformDir: "../examples/vault-with-elb", } defer terraform.Destroy(t, terraformOptions) terraform.InitAndApply(t, terraformOptions) validateServerIsWorking(t, terraformOptions) 3. Write automated tests
  132. 132. 4. Do a code review
  133. 133. 5. Release a new version of your code
  134. 134. qa stage prod 6. Promote that versioned code from environment to environment
  135. 135. Key takeaway:
  136. 136. Before… …After
  137. 137. Questions? info@gruntwork.io

×