How LogicMonitor manages AWS resources with Terraform to provide a reliable, repeatable way to both grow our infrastructure organically and deliver disaster recovery solutions.
What is a Pod?
Automating Disaster Recovery
All of the components required to provide LogicMonitor for customers
Tomcat
Kafka
TSDB
MySQL
Relay
Global Resources:
APIs
HAProxy
Redis
S3
SQS
ELBs
Sitemonitor
Proxy
SMTP
Render
ECSSG
DNS
… what’s next?
ElasticSearch
Rserve
IAM
Horizontally scalable Cell Architecture
• Runbook (Cookbook)
• CLI or web interface
• Co-workers .bash_history
• Crossing your fingers?
The Old Way
• Infrastructure as code (self documenting, repeatable)
• Provision and de-provision (important!)
• Scalable (change two parameters to create a new Pod)
Terraform
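The "change two parameters" claim above can be sketched in HCL. This is a hypothetical illustration; the module and variable names are not LogicMonitor's actual code:

```hcl
# Hypothetical sketch: creating a new Pod by changing two parameters.
module "pod_us_west_2" {
  source = "./modules/pod"

  pod_name = "pod7"        # parameter 1: unique Pod identifier
  region   = "us-west-2"   # parameter 2: target AWS region

  # Everything else (instance counts, subnets, AMI lookups) is
  # derived inside the module from these two inputs.
}
```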
Hello. I’m Randall Thomson, Sr. TechOps Engineer at LogicMonitor. Our TechOps team manages the infrastructure that provides the LogicMonitor service for our customers. We straddle the line between SRE and DevOps, whatever you want to call it nowadays. We are always juggling our time between reactive and proactive tasks. This talk is about what our team has done to provide automation in disaster recovery situations using Terraform and AWS.
I tend to jump right into the nitty gritty, so I want to spend a brief couple of slides going over the two main subjects of this talk: Terraform & Pods (I will keep referring to these two things).
Ask audience:
Who has heard of or has experience with Terraform?
Who has, or still does, provision AWS resources via the Web Portal? CLI? Other orchestration tools?
This is Elon Musk’s Disaster Recovery plan for Earth. Not what I will be talking about today but definitely something fun to Google afterwards.
Terraform - an open source tool by HashiCorp (makers of Vagrant, Packer, Consul, Vault) - will quote from the website:
“Terraform enables you to safely and predictably create, change, and improve production infrastructure. It is an open source tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.”
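To make the "declarative configuration files" idea concrete, here is a minimal, hypothetical example; the AMI ID and names are placeholders, not anything from our actual projects:

```hcl
# You describe the resource you want; Terraform reconciles
# real infrastructure to match this description.
provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "tomcat" {
  ami           = "ami-0123456789abcdef0"  # placeholder AMI ID
  instance_type = "t3.medium"

  tags = {
    Name = "pod1-tomcat"
    Role = "application"
  }
}
```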
Context part #2 - A Pod. LogicMonitor uses a Cell Architecture design we internally refer to as “Pods”. These, in addition to a handful of global resources, are the infrastructure that powers the LogicMonitor service for our customers. Most of our Pods are a hybrid cloud model, where some of the resources are in our own datacenters and the rest are in AWS. The list on screen is only a subset (always changing), but as you can see there are a lot of resources that go into building a Pod. Lots of nuts & bolts.
So this leads us to one of our challenges.
How do you provide a reliable way to scale and keep your disaster recovery plan up-to-date?
15m (Resolution) Open with the Old Way of creating infrastructure (CLI, web interface)
In the past, if we wanted to replicate how an existing server was built, we would have to look up the documentation (if any) and then assess whether any manual changes had been made (cross your fingers, or read through a co-worker’s .bash_history). Black magic.
This led to inconsistencies for environments that should ideally be exactly the same.
Cue Terraform.
The terraform code serves both as documentation of how infrastructure is built and a description of existing infrastructure. With Terraform you can both provision new infrastructure to be the same as old, as well as keep your older infrastructure up-to-date as you make changes along the way.
Terraform is able to provision all of the resources that make up our Pods except our bare-metal servers. Our DR plan utilizes a 100% AWS cloud Pod design with no datacenter dependencies.
Scalable
Worthwhile to maintain as it serves as the single source of truth.
Documentation is always up-to-date.
Turned processes we used to fear into near thoughtless tasks.
- HCL, Modules, Projects, and Directory Organization. Private vs public facing resources. Data Providers.
Terraform projects and modules can, or rather should, be stored in a code repository (but not your state files), even in a single-person shop. This gives you all the normal benefits of a software project, but for your infrastructure: revision history, proper change control.
We use modules (reusable resource provisioners) as templates for our various application servers. We define projects to represent our various Pods (and global resources). Each AWS environment has a distinct Terraform code repository. Terraform can operate across multiple AWS environments, but this gets complicated quickly. Suggestion: make use of data providers so that you are not hard-coding values in your code. For example, looking up network ranges or AMI numbers.
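The data-provider suggestion can be sketched like this; the AMI naming convention and tag values are illustrative, not our real ones:

```hcl
# Look values up at plan time instead of hard-coding them.

# Find the newest AMI matching a naming convention.
data "aws_ami" "app_base" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["lm-base-*"]
  }
}

# Look up an existing VPC by tag rather than embedding its ID or CIDR.
data "aws_vpc" "pod" {
  tags = {
    Pod = "pod1"
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.app_base.id  # resolved at plan time
  instance_type = "t3.medium"
}
```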
5m - The ability to preview changes is useful both when creating new resources and especially important when modifying old resources. It’s like a diff output showing additions, subtractions and changes. Somewhat colorized. You can (and will) configure various resources to ignore certain types of changes over time for cases when you don’t need your older resources modified. For example, AMI numbers. You may change the AMI over time but you don’t need to re-provision older servers as Puppet keeps them up-to-date.
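The AMI case above maps to Terraform's `lifecycle` block. A hedged sketch, with an illustrative variable name:

```hcl
# Ignore AMI drift on existing instances: when the AMI variable moves
# forward, Terraform won't propose replacing older servers, since
# Puppet keeps them up-to-date after boot.
resource "aws_instance" "app" {
  ami           = var.current_ami_id  # illustrative variable name
  instance_type = "t3.medium"

  lifecycle {
    ignore_changes = [ami]
  }
}
```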
The Complication. I want to make a brief sidenote on AMIs and the spectrum of generic vs. ready-to-run. We have about a dozen different types of application servers. For us it made sense to build an AMI that gets us about 95% of what we need and let Puppet do the final tweaks. For some it may be best to have a dozen different AMIs ready to run. The time savings can be dramatic when your instances don’t need a lot of post-configuration. It's another example of where you have to put in a lot more work up front to save time later. There are a variety of tools for building AMIs. We happen to use Packer, not because of any Terraform integration but simply because it does its one job very well. Also, make sure you copy your AMIs to any region where you may need to perform DR tasks.
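As a rough illustration of that approach, here is a sketch in Packer's HCL2 syntax; the AMI IDs, names, and provisioning step are placeholders:

```hcl
source "amazon-ebs" "base" {
  region        = "us-east-1"
  instance_type = "t3.small"
  source_ami    = "ami-0123456789abcdef0"  # placeholder base image
  ssh_username  = "ec2-user"
  ami_name      = "lm-base-{{timestamp}}"
  ami_regions   = ["us-west-2"]            # copy to the DR region at build time
}

build {
  sources = ["source.amazon-ebs.base"]

  provisioner "shell" {
    inline = ["sudo yum -y update"]  # bake in the ~95%; Puppet does the rest
  }
}
```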
At this point you may be wondering what all this has to do with Disaster Recovery. So here’s where we are today. We agreed as a group that any resource we provision in AWS must be done via terraform. All of our pods are described in terraform projects. As it so happened, in a serendipitous way, our Disaster Recovery plan was born. We no longer needed one way to provision our production infrastructure and a different method for our DR plan. With Terraform it’s basically the same in either case.
10m - The day has come. Your datacenter lost power. It’s 5am and you’ve been up half the night with your toddler. How much thinking do you want to have to do? How much thinking will you even be capable of? Likely very little.
terraform plan; terraform apply. Copy the project file and repeat. (Hope your VPN works, and that you have AMIs in the target regions.)
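Under those assumptions (VPN up, AMIs copied), the drill might look like the following; the directory layout and paths are hypothetical:

```shell
# Illustrative DR run, assuming one project directory per Pod and
# working AWS credentials. Not our actual repository layout.
cp -r pods/pod1-dc pods/pod1-aws-dr   # copy the project, adjust two parameters
cd pods/pod1-aws-dr
terraform init                        # fetch providers/modules, set up state
terraform plan -out=dr.tfplan         # preview exactly what will be created
terraform apply dr.tfplan             # build the replacement Pod in AWS
```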
10m - We’re currently making use of Terraform to manage our QA environments as well. There is always room for improvement. We are looking at ways to automate application deployment in DR situations. One example we’re testing is using IAM roles combined with EC2 user-data scripts to fetch our WAR files directly from S3. Another example would be having a CI/CD tool (such as Bamboo) run the Terraform commands. Then even your manager could do it.
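The user-data idea we're testing could look something like this; the bucket, paths, and instance profile name are hypothetical:

```hcl
# An instance profile (defined elsewhere) grants S3 read access,
# and a boot script pulls the WAR directly from a bucket.
resource "aws_instance" "app" {
  ami                  = "ami-0123456789abcdef0"  # placeholder
  instance_type        = "t3.medium"
  iam_instance_profile = "app-s3-read"            # illustrative profile name

  user_data = <<-EOF
    #!/bin/bash
    aws s3 cp s3://example-artifacts/releases/app.war /opt/tomcat/webapps/app.war
    systemctl restart tomcat
  EOF
}
```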