2. Who Is This Guy Talking?
Bryan Krausen
Sr. Solutions Architect @ AHEAD [awesome company]
Blog: itdiversified.com [voted worst blog ever]
Twitter: @btkrausen
Holds [all] AWS Certifications
HashiCorp Vault Intermediate Certified [partner cert]
Working towards Advanced
4. A Story [the problem]
• You [finally] implemented a secrets solution
• You told everyone it was a PoC
• First onboarded application “test” was successful, and
immediately went into production - so other app owners
wanted in….
• The Ops team started saving static secrets in the KV store,
like a good Ops team does….
• Word got out that Vault was a thing and more requests were
submitted to use and, well, you obliged
• Vault goes down and it breaks all the things….
• You realize Vault is a critical piece of your infrastructure….
¯\_(ツ)_/¯
5. How Can I Build Vault As A Highly Available Solution?
High Availability Built Directly Into Vault
High Availability in AWS
Vault Disaster Recovery
Vault Enterprise Features for High Availability
7. Storage Backends
• Configures the location for the storage of Vault data
• Storage is defined in the main Vault configuration file along with
desired parameters
• Not all storage backends are created equal
• Some support high availability
• Others have better tools for management & data protection
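As a sketch of what this looks like in practice, here is a minimal Vault configuration file using Consul as the storage backend (the hostnames are placeholders, not real endpoints):

```hcl
# Consul is one of the storage backends that supports high availability
storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

# api_addr and cluster_addr let standby nodes redirect clients
# to the active node (hostnames below are examples only)
api_addr     = "https://vault-node-1.example.com:8200"
cluster_addr = "https://vault-node-1.example.com:8201"
```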
8. Storage Backends
• Storage Backends that support High Availability
• Consul
• DynamoDB
• Etcd
• FoundationDB
• Google Cloud Spanner
• Google Cloud Storage
• MySQL
• Zookeeper
Your storage backend must be configured for high availability!
14. HashiCorp Vault - Unseal
Key              Value
---              -----
Seal Type        shamir
Sealed           true
Total Shares     5
Threshold        3
Unseal Progress  0/3

Key              Value
---              -----
Seal Type        shamir
Sealed           true
Total Shares     5
Threshold        3
Unseal Progress  1/3

Key              Value
---              -----
Seal Type        shamir
Sealed           true
Total Shares     5
Threshold        3
Unseal Progress  2/3

Key              Value
---              -----
Seal Type        shamir
Sealed           false
Total Shares     5
Threshold        3
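The status output above corresponds to a CLI session like the following (run against a live Vault server; output trimmed):

```shell
# Initialize with 5 key shares and a threshold of 3, matching the output above
vault operator init -key-shares=5 -key-threshold=3

# Each unseal call consumes one key share; after the third, Sealed flips to false
vault operator unseal   # prompts for a share – Unseal Progress 1/3
vault operator unseal   # 2/3
vault operator unseal   # threshold reached – vault is unsealed
```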
15. Cloud Unseal
• Rather than using Shamir key shares, use AWS KMS for seal wrapping
• Uses a designated KMS key
• Automatically unseals when nodes come online or are restarted
• Can use an EC2 instance role to permit access to the Encrypt, Decrypt,
and DescribeKey operations
• Supports KMS key rotation
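A minimal seal stanza for the Vault configuration file, assuming a KMS key alias (the alias below is a placeholder):

```hcl
# awskms auto-unseal – the key ID/alias is an example, not a real key
seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "alias/vault-unseal-key"
}
```

With an EC2 instance role granting the KMS permissions, no credentials need to appear in the file.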
16. Audit Devices
Keep a detailed log of all requests to and responses from Vault
Audit log is formatted using JSON
Sensitive information is hashed before logging
Can [and should] have more than one audit device enabled
! If audit devices are enabled, Vault requires at least one of them to
successfully write the log entry before completing the request
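Enabling two audit devices is a one-liner each (the file path and syslog tag below are examples):

```shell
# File audit device – path is an example
vault audit enable file file_path=/var/log/vault_audit.log

# A second device, so a single failed disk can't block requests
vault audit enable syslog tag="vault" facility="AUTH"

vault audit list
```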
18. Regions and Availability Zones
AZs are the simplest way to provide fault zones in AWS
Regions are harder
[Diagram: a Virtual Private Cloud (VPC) spanning Availability Zones A, B, and C,
with one Primary Node and Standby Nodes spread across the zones]
19. Placement Groups
For nodes within the same Availability Zone, use a Spread
Placement Group
Removes possibility of single point of failure relating to underlying
hardware
[Diagram: within one Availability Zone, a Spread Placement Group places the
Primary Node and a Standby Node on separate physical servers/hypervisors]
20. Security Groups
Use self-referencing security groups to enable communication among
nodes within a cluster
[Vault]: 8200, 8201
Don’t hardcode node IP addresses in the security group (/32)
[Diagram: Primary and Standby Nodes, each in the sg_prod_vault security group,
communicating over tcp/8200 and tcp/8201]
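A self-referencing security group in Terraform might look like this sketch (the variable and names are assumptions):

```hcl
# Members of this group may reach each other on the Vault API (8200)
# and cluster (8201) ports – no hardcoded /32 addresses
resource "aws_security_group" "prod_vault" {
  name   = "sg_prod_vault"
  vpc_id = var.vpc_id   # assumed variable

  ingress {
    from_port = 8200
    to_port   = 8201
    protocol  = "tcp"
    self      = true    # "self" makes the rule reference this same group
  }
}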
21. Load Balancing
Front-end Vault with an Application Load Balancer (ALB)
ALB used for high availability in this case, NOT load balancing
Use ALB health checks to determine the Active Node by way of the
Vault endpoint /v1/sys/health
HTTP Status Codes:
200 – initialized, unsealed, and active
429 – unsealed and standby
501 – not initialized
503 – sealed
[Diagram: an ALB in front of the Active Node and two Standby Nodes]
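A Terraform sketch of the target group health check (resource names and the variable are assumptions): because only the active node returns 200 from /v1/sys/health, the ALB marks standbys unhealthy and sends all traffic to the active node.

```hcl
resource "aws_lb_target_group" "vault" {
  name     = "vault"
  port     = 8200
  protocol = "HTTPS"
  vpc_id   = var.vpc_id   # assumed variable

  health_check {
    path    = "/v1/sys/health"
    matcher = "200"       # standbys answer 429 and are taken out of rotation
  }
}
```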
22. Automation
Automate the provisioning and configuration of Vault
Can use CloudFormation/Terraform to provision network, security
groups, roles, load balancers, and Vault nodes
Use Auto Scaling Groups for Availability (not scalability)
[Diagram: provisioning stack – Network, IAM Roles, Security Groups,
Storage Backend, Vault Nodes]
24. Storage Backend - Backups
The most critical task to protecting Vault is backing up the storage
backend
Use the storage backend’s built-in features to help manage
backups/snapshots and store them in multiple places
25. Consul Snapshots
Consul snapshots save the state of the Consul servers for disaster
recovery
Saves the key/value store, service catalog, prepared queries, sessions, and
ACLs
Run a one-time snapshot [consul snapshot save] or use the
Consul Snapshot Agent for automatic backups (*enterprise feature)
[Diagram: Snapshots 1–3 stored in S3 buckets across two regions]
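A one-time backup run looks like the following sketch (the bucket name is a placeholder; cross-region replication on the bucket keeps a copy in a second region):

```shell
# Snapshot the Consul servers backing Vault
consul snapshot save vault-backup.snap

# Sanity-check the snapshot before shipping it off-host
consul snapshot inspect vault-backup.snap

# Copy to S3 – bucket name is an example
aws s3 cp vault-backup.snap s3://my-vault-backups/$(date +%F)/
```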
26. Consul Autopilot
Built-in solution to assist with managing Consul nodes
Dead Server Cleanup
Server Stabilization
Redundancy Zone Tags
Upgrade Migration
Autopilot is on by default – disable features you don’t want
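Autopilot settings can be viewed and tuned from the CLI; the values below are illustrative, not recommendations:

```shell
# View the current autopilot configuration
consul operator autopilot get-config

# Tune individual features, e.g. longer stabilization, disable cleanup
consul operator autopilot set-config \
  -server-stabilization-time=30s \
  -cleanup-dead-servers=false
```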
27. Consul Autopilot - Dead Server Cleanup
Why did Consul have to clean up the failed node?
…because the infrastructure was a big mesh!
Dead server cleanup will remove failed servers from the cluster once
the replacement comes online, based on a configurable threshold
Cleanup will also be initiated any time a new server joins the cluster
Previously, it would take 72 hours to reap a failed server, or it had to
be done manually.
28. Consul Autopilot – Server Stabilization
New Consul server nodes must be healthy for x amount of time
before being promoted to a full, voting member.
Configurable time – default is 5 seconds
29. Consul Autopilot – Redundancy Zones
Ensure that Consul voting members will be spread across fault zones
to ensure high availability at all times.
In AWS, you can create fault zones based upon Availability Zones
[Diagram: voting members spread across Availability Zones 1 and 2]
30. Consul Autopilot – Upgrade Migrations
New Consul Server version > current Consul Server version
Consul won’t immediately promote newer servers as voting members
Number of ‘new’ nodes must match the number of ‘old’ nodes
[Diagram: existing 1.3.0 servers across Availability Zones 1 and 2, with new
1.4.0 servers waiting to be promoted as voting members]
34. Disaster Recovery Replication
Warm-standby if primary cluster fails
Mirrors all secrets, policies, and authentication tokens and leases
Does NOT service client requests unless promoted
[Diagram: Primary Cluster in Region 1 replicating to Secondary Cluster in Region 2]
35. Disaster Recovery Replication
Requires connectivity between regions for cluster replication
Likely accomplished with VPC Peering, Transit Network, Transit Gateway [NEW]
DR Cluster nodes should be architected in a similar fashion as production
Multiple Availability Zones, Spread Placement Groups, etc
Security Groups and NACLs should permit communication between primary and
secondary cluster
Don’t forget to permit clients
Use a Route53 Failover Routing Policy, along with health checks, to fail over the
primary Vault DNS record to the DR cluster – or use Consul
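Setting up DR replication is a few writes against the sys/replication API (Vault Enterprise; the secondary ID is a placeholder):

```shell
# On the primary cluster
vault write -f sys/replication/dr/primary/enable
vault write sys/replication/dr/primary/secondary-token id="dr-secondary"

# On the DR cluster, using the wrapped token from the previous step
vault write sys/replication/dr/secondary/enable token="<wrapped-token>"
```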
36. Performance Replication
THE way to extend Vault to other regions, public clouds, data centers
Mirrors all secrets and policies but NOT local tokens and leases
Will service local client requests for static secrets. Will create
dynamic secrets and leases separately from primary cluster
[Diagram: clients send requests to the Secondary Cluster in Region 2, which
replicates from the Primary Cluster in Region 1]
37. Performance Replication
Place performance replicated clusters near applications that it will service
i.e., same region, same VPC, same Availability Zones, etc.
Use a separate Route53 record for communication with the local cluster
Should be used heavily for applications needing read-only access to Vault
Use Mount Filters to limit which secrets are replicated
Mount filters can be used to satisfy GDPR requirements
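Enabling performance replication mirrors the DR workflow (Vault Enterprise; the secondary ID is a placeholder, and the exact mount-filter API has changed across Vault versions, so consult the docs for yours):

```shell
# On the primary cluster
vault write -f sys/replication/performance/primary/enable
vault write sys/replication/performance/primary/secondary-token id="perf-us-west"

# On the secondary, deployed near the applications it will serve
vault write sys/replication/performance/secondary/enable token="<wrapped-token>"
```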
38. A Story [a happier one]
• You implemented HashiCorp Vault
• Because you’re ridiculously smart, you knew everybody
would want in on this…
• You used all the points in this presentation to deploy Vault in
a highly available architecture.
• All the app teams were happy and migrated to Vault
• Vault didn’t go down…
• You are an IT hero!