How we accelerated our Vault adoption with Terraform

Lucy Davinhart – Sky Betting & Gaming
How we accelerated our Vault adoption, with Terraform

👋 Who am I?
• Senior Automation Engineer
• @ Sky Betting & Gaming
• Part of The Stars Group
• Delivery Engineering Squad
• Part of the Infrastructure & Platforms Tribe
• Among other things we…
• Look after our Vault clusters
• Maintain Vault integrations & tooling
• Control access to AWS (via Vault!)
• Support internal customers
@LUCYDAVINHART @SBGTECHTEAM

🔐 What do we use Vault for?
• Across the company, our Vault users are:
• > 4000 Virtual Machines
• > 500 humans
• > 250 various AppRoles
• And a few more for Kubernetes Auth and AWS Auth
• Main features we use:
• K/V Secrets
• PKI
• AWS Credentials

💬 This Talk
• Our problems managing Vault and onboarding people
• How we went about solving them
• Our initial Terraform solution
• How we have improved it over time
• The Future

✍️ Everything Manual
• Time consuming for us to make changes
• Making the changes
• Comparing policies, AppRoles, LDAP groups, etc.
• Time consuming to see what was in Vault already
• We were regularly asked to troubleshoot why User A doesn’t have access to Secret B
• Lack of standards / best practices
• (and we didn’t really know what we were doing initially)
• Automating Stuff is Cool 😎

🧞♀️ We were too powerful
• We started out with full admin rights and access to everything
• Configure all the auth and secret mounts
• Read and write to all the secrets
• Give ourselves any policies we needed
• But at least none of us had root tokens, right? 😱

🙈 Lack of Audit Trail
• What was changed?
• When was it changed?
• Who changed it?
• How did it change?
• Why did they change it?

Vault Config Ruby Gem
• Downloads Vault config (policies, AppRoles, LDAP groups, etc.) and saves it in a Git repo
• Jenkins job to run this on a schedule
• We now have configuration backups, so we can see what has changed and when
• But not necessarily who or why
• Written very quickly:
• Was useful very quickly
• Was not particularly maintainable

Goldfish Vault UI
• A Vault UI, before one was available in Open Source Vault
• Policy Request feature
• Users edited policies in the UI, and submitted for approval
• Vault admins review changes and apply

Terraform
• Codifies APIs into declarative configuration files
• Reproducible Infrastructure as Code
Terraform Code
resource "vault_policy" "ravenclaw" { … }
resource "vault_policy" "hufflepuff" { … }
Terraform State
vault_policy.ravenclaw
vault_policy.slytherin
Terraform Plan
+ vault_policy.hufflepuff
- vault_policy.slytherin
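
The plan on this slide can be read as a set difference between the resources declared in code and the resources recorded in state. The sketch below is a deliberate simplification (real Terraform also compares attributes and refreshes against the live API); the `plan` function is a hypothetical helper, not part of Terraform.

```python
from __future__ import annotations


def plan(code: set[str], state: set[str]) -> dict[str, list[str]]:
    """Return resource addresses to create and destroy, sorted for stable output."""
    return {
        "add": sorted(code - state),      # declared in code, missing from state
        "destroy": sorted(state - code),  # recorded in state, no longer in code
    }


code = {"vault_policy.ravenclaw", "vault_policy.hufflepuff"}
state = {"vault_policy.ravenclaw", "vault_policy.slytherin"}
result = plan(code, state)
for address in result["add"]:
    print(f"+ {address}")
for address in result["destroy"]:
    print(f"- {address}")
```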

🧞 Terraform Pipeline Design Decisions
• Look like the Vault API as much as possible
• Files which match the Vault API, e.g. sys/policy/foo.json

policies.tf
resource "vault_policy" "example" {
  name   = "dev-team"
  policy = <<EOT
path "secret/my_app" {
  capabilities = ["read"]
}
EOT
}
(The policy body between the EOT markers comes from sys/policy/example.hcl)

• (Initially) Take output from Ruby Gem as input
• Pull Requests to make changes
• Start with Policies, our most common request
• Everything in the repo is in Vault
• Nothing is in Vault that is not in the repo
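[Diagram: Venn diagram of config in Vault and config in the repo. Config only in Vault: delete this. Config only in the repo: create this. The overlap is config in both Vault and the repo.]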

👩💻 What a User Sees

Jenkins Job

Makefile

Init
• Ensures we have valid AWS credentials
• We store Terraform State in S3
• Dynamic AWS credentials from Vault
• terraform init
• Accesses remote Terraform State
• Downloads dependencies
• terraform workspace select test/prod
• Allows us to maintain separate Terraform State for different Vault clusters

Import
• Lists resources in Vault
• Lists resources in Terraform State
• Imports resources not in Terraform State
[Diagram: overlap between config in Vault and config in the repo]
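
A minimal sketch of the Import step for policies, assuming the Terraform resource label is the same as the Vault policy name (which will not hold for names containing characters Terraform does not allow in identifiers). `import_commands` is a hypothetical helper; it only builds the `terraform import` command lines rather than running them.

```python
def import_commands(in_vault, in_state, resource_type="vault_policy"):
    """Build `terraform import` commands for resources Vault has but state lacks."""
    prefix = resource_type + "."
    # Strip the resource type from state addresses such as "vault_policy.ravenclaw".
    tracked = {addr[len(prefix):] for addr in in_state if addr.startswith(prefix)}
    missing = sorted(set(in_vault) - tracked)
    return [f"terraform import {prefix}{name} {name}" for name in missing]


in_vault = ["ravenclaw", "slytherin", "gryffindor"]  # "gryffindor" is a made-up extra
in_state = ["vault_policy.ravenclaw", "vault_policy.slytherin"]
commands = import_commands(in_vault, in_state)
for command in commands:
    print(command)
```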

Generate
• Converts from files representing the Vault API into Terraform code
resource "vault_policy" "example" {
  name   = "dev-team"
  policy = <<EOT
path "secret/my_app" {
  capabilities = ["read"]
}
EOT
}
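
One way the Generate step could work for policies is sketched below: take the body of a policy file and wrap it in a `vault_policy` resource. `render_policy` is a hypothetical function, and the body is passed in as a string here to keep the example self-contained; a real pipeline would presumably read it from a file such as sys/policy/example.hcl.

```python
def render_policy(resource_name, policy_name, body):
    """Wrap a Vault policy body in a Terraform vault_policy resource block."""
    return (
        f'resource "vault_policy" "{resource_name}" {{\n'
        f'  name   = "{policy_name}"\n'
        f"  policy = <<EOT\n"
        f"{body.strip()}\n"  # heredoc content must stay flush-left
        f"EOT\n"
        f"}}\n"
    )


body = 'path "secret/my_app" {\n  capabilities = ["read"]\n}\n'
rendered = render_policy("example", "dev-team", body)
print(rendered)
```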

Validate
• terraform validate
• Ensures all generated Terraform code is syntactically correct
• Resource-specific checks
• Check for common human errors e.g.
• Types of certain resources (e.g. LDAP groups, AD users)
• Some case sensitivity issues
• Most of these are actually done in the Generate phase
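
The talk does not spell out which case sensitivity issues are checked. One plausible check, sketched here as an assumption, is to flag names that differ only by letter case, since those are easy to create by accident. `case_collisions` and the sample group names other than PG-Vault-Foo are hypothetical.

```python
from collections import defaultdict


def case_collisions(names):
    """Return groups of names that are identical apart from letter case."""
    by_lower = defaultdict(set)
    for name in names:
        by_lower[name.lower()].add(name)
    return sorted(sorted(variants) for variants in by_lower.values() if len(variants) > 1)


ldap_groups = ["PG-Vault-Foo", "pg-vault-foo", "PG-Vault-Bar"]
collisions = case_collisions(ldap_groups)
print(collisions)
```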

Plan
terraform plan -out=prod-vault.plan
Terraform will perform the following actions:
+ vault_policy.hufflepuff
- vault_policy.slytherin
Plan: 1 to add, 0 to change, 1 to destroy.

AppRole
• Up to this point, the pipeline only has read-only access to Vault
• It now prompts for a short-lived secret-id to gain write access to Vault

Apply
terraform apply prod-vault.plan
vault_policy.hufflepuff: Creating...
name: "" => "hufflepuff"
policy: "" => "..."
vault_policy.slytherin: Destroying... (ID: slytherin)
vault_policy.slytherin: Destruction complete after 0s
vault_policy.hufflepuff: Creation complete after 0s (ID: hufflepuff)
Apply complete! Resources: 1 added, 0 changed, 1 destroyed.

Commit + Merge
• Commit any generated Terraform code
• Merge release branch to master

LDAP Groups
• One of the most common requests, after policies
• Initially: vault_generic_secret
• Resource to manage arbitrary Vault paths
• Later: vault_ldap_auth_backend_group
• Dedicated LDAP group resource
• LDAP Restructure: Only allow certain LDAP groups to be mapped to policies
• ✅ PG-Vault-Foo
• 🚫 SG-MyTeam
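
The LDAP group restriction could be enforced in the pipeline with a simple naming check. The exact rule is not given in the talk; the `PG-Vault-` prefix below is an assumption inferred from the two examples on this slide, and `is_mappable` is a hypothetical helper.

```python
ALLOWED_PREFIX = "PG-Vault-"  # assumption: inferred from the slide's examples


def is_mappable(group):
    """Return True if this LDAP group may be mapped to Vault policies."""
    return group.startswith(ALLOWED_PREFIX)


for group in ["PG-Vault-Foo", "SG-MyTeam"]:
    print(group, "allowed" if is_mappable(group) else "rejected")
```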

AppRoles
• Another of the most common requests, after policies
• We introduced Terraform variables for CIDR ranges:
variable "cidr_range_prod_jenkins_agents" {
type = "list"
default = [
"1.2.3.4/30", # Production Site A Jenkins Agents
"2.3.4.5/30", # Production Site B Jenkins Agents
...
]
}
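
Because these CIDR lists are maintained by hand, a validation step could parse each entry before it reaches Vault. This is a sketch of one such check, not something described in the talk; `normalise_cidrs` is a hypothetical helper built on Python's standard `ipaddress` module. With `strict=False`, an entry with host bits set is normalised to its network address rather than rejected.

```python
import ipaddress


def normalise_cidrs(cidrs):
    """Return canonical network strings; raises ValueError on malformed input."""
    return [str(ipaddress.ip_network(cidr, strict=False)) for cidr in cidrs]


normalised = normalise_cidrs(["1.2.3.4/30", "2.3.4.5/30"])
print(normalised)
```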

AppRoles
• The AppRole definition then references the variable:
{
"token_bound_cidrs": "${var.cidr_range_prod_jenkins_agents}",
"policies": [
"default",
"terraform_vault-readonly"
],
"token_max_ttl": 120
}

Kubernetes Auth Roles
• The team managing the k8s clusters wrote this one for us!
• Effort needed by them:
• Write Import Script, based on existing scripts
• Write Generate Script, based on existing scripts
• Effort needed by us:
• Review their scripts

AWS Auth Roles
• Some auto-generation of resources
• Get all AWS Account IDs with:
aws organizations list-accounts
• Generate resources:
resource "vault_aws_auth_backend_sts_role" "role" {
backend = "aws"
account_id = "1234567890"
sts_role = "arn:aws:iam::1234567890:role/my-role"
}
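
The auto-generation described on this slide might look roughly like the sketch below: parse the JSON printed by `aws organizations list-accounts`, then render one resource per account. The function names, the `account_<id>` resource label, and the sample JSON are assumptions; the role name `my-role` is taken from the slide.

```python
import json


def account_ids(list_accounts_json):
    """Extract account IDs from `aws organizations list-accounts` JSON output."""
    return [account["Id"] for account in json.loads(list_accounts_json)["Accounts"]]


def render_sts_role(account_id, role_name, backend="aws"):
    """Render a vault_aws_auth_backend_sts_role resource for one account."""
    return (
        f'resource "vault_aws_auth_backend_sts_role" "account_{account_id}" {{\n'
        f'  backend    = "{backend}"\n'
        f'  account_id = "{account_id}"\n'
        f'  sts_role   = "arn:aws:iam::{account_id}:role/{role_name}"\n'
        f"}}\n"
    )


# Trimmed sample of the CLI output; real output carries more fields per account.
sample = '{"Accounts": [{"Id": "1234567890", "Name": "example"}]}'
blocks = [render_sts_role(account_id, "my-role") for account_id in account_ids(sample)]
print("\n".join(blocks))
```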

Active Directory Users
• ad/roles/:role_name
• has a few fields you can’t write to
{
"last_vault_rotation": "2018-05-24T17:14:38.677370855Z",
"password_last_set": "2018-05-24T17:14:38.677370855Z",
"service_account_name": "my-application@example.com",
"ttl": 100
}
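
When importing a role like this, the fields Vault will not accept on write need to be dropped before the data is turned into configuration. A minimal sketch, assuming the two timestamp fields above are the only read-only ones; `writable_fields` is a hypothetical helper.

```python
# Fields returned when reading ad/roles/:role_name that cannot be written back.
READ_ONLY_FIELDS = {"last_vault_rotation", "password_last_set"}


def writable_fields(role):
    """Return a copy of the role data without the read-only fields."""
    return {key: value for key, value in role.items() if key not in READ_ONLY_FIELDS}


role = {
    "last_vault_rotation": "2018-05-24T17:14:38.677370855Z",
    "password_last_set": "2018-05-24T17:14:38.677370855Z",
    "service_account_name": "my-application@example.com",
    "ttl": 100,
}
cleaned = writable_fields(role)
print(cleaned)
```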

Active Directory Users
• vault_generic_endpoint resource
resource "vault_generic_endpoint" "ad_role-vaulttest" {
path = "ad/roles/vaulttest"
data_json = "{\"service_account_name\": \"VaultTest@fancycorp.net\"}"
# When reading, the secret contains keys that cannot be written:
# password_last_set (when did the password last get updated)
# last_vault_rotation (when did Vault last update the password)
ignore_absent_fields = true
}

🎉 What Did All This Give Us?
• Time
• Individual changes take less of our time → we can handle more requests
• Visibility
• Easier to see what’s in Vault
• Easier to debug
• Auditability
• Who, What, When, How, Why
• grep-ability / Searchability
• Find common patterns
• Identify issues before they become problems
• Reducing our own permissions
• Lots of configuration can no longer be done by humans

🆕 New Resources
• PKI
• dynamic X.509 certificates
• Sentinel Policies
• Richer access control functionality than ACL policies
• Namespaces
• Self-managed sub-Vaults

🧞 Auto Generation
• AWS Accounts, all have standard permissions, which correspond to at least…
• 2x Vault Policies per account
• 2x LDAP Groups per account
• Auto-Generated PRs for common functionality
• Service Discovery for AppRole CIDR ranges

🧞🧞 Review Security Trade-Offs
• 2FA to apply changes
• e.g. require 2 Factor Auth before a human can grant Jenkins read/write access
• Flow: Jenkins requests read/write → Human runs command → 2FA prompt → Human pastes token into Jenkins
• Enterprise Control Groups
• e.g. require multiple humans to grant Jenkins read/write access
• Flow: Jenkins requests read/write → First human runs command → Second human runs command → First human pastes token into Jenkins

🧞🧞♀️ More validation before a PR can be merged
• Check resources for sensible parameters
• e.g. TTLs, num_uses, etc.
• Check if Vault has required permissions before approving PRs
• e.g. check if AWS account is in Organization
• e.g. check if AD user is in correct Organizational Unit
• Case sensitivity check on LDAP groups
• We have a script to manually check this
• Deploy to a local dev Vault
• For testing new features in the pipeline
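
A parameter check of the kind listed above could be as small as the sketch below, run against each AppRole definition before a PR is approved. The one-hour ceiling and the rule that CIDR binding is mandatory are hypothetical thresholds chosen for illustration, not the team's actual standards; `check_approle` is a hypothetical helper.

```python
MAX_TOKEN_TTL_SECONDS = 3600  # hypothetical ceiling, not from the talk


def check_approle(role):
    """Return a list of human-readable problems found in an AppRole definition."""
    problems = []
    ttl = role.get("token_max_ttl")
    if ttl is None or ttl <= 0:
        problems.append("token_max_ttl must be a positive number of seconds")
    elif ttl > MAX_TOKEN_TTL_SECONDS:
        problems.append(f"token_max_ttl {ttl} exceeds {MAX_TOKEN_TTL_SECONDS}")
    if not role.get("token_bound_cidrs"):
        problems.append("token_bound_cidrs should not be empty")
    return problems


good = {"token_bound_cidrs": ["1.2.3.4/30"], "policies": ["default"], "token_max_ttl": 120}
bad = {"policies": ["default"], "token_max_ttl": 999999}
print(check_approle(good))
print(check_approle(bad))
```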

Gotta Go Fast!
• End-to-end, ignoring time spent waiting for humans, the pipeline currently takes 3.5 minutes
• But it could be faster!

👤 Make it Generic
• Allow pipeline to be run against child namespaces
• Config for each namespace stored in different repos
• Delegate permissions to other teams

Thank You!
@LucyDavinhart - @SBGTechTeam
Slides: goto.lmhd.me/hc2019slides
