2. Who Is This Guy Talking?
Bryan Krausen
Sr. Solutions Architect @ AHEAD [awesome company]
Blog: itdiversified.com [voted worst blog ever]
Twitter: @btkrausen
Holds [all] AWS Certifications
HashiCorp Vault Intermediate Certified [partner cert]
Working towards Advanced
4. A Story [the problem]
• You [finally] implemented a secrets solution
• You told everyone it was a PoC
• First onboarded application “test” was successful, and
immediately went into production - so other app owners
wanted in….
• The Ops team started saving static secrets in the KV store,
like a good Ops team does….
• Word got out that Vault was a thing and more requests were
submitted to use and, well, you obliged
• Vault goes down and it breaks all the things….
• You realize Vault is a critical piece of your infrastructure….
¯\_(ツ)_/¯
5. How Can I Build Vault As A Highly Available Solution?
High Availability Built Directly Into Vault
High Availability in AWS
Vault Disaster Recovery
Vault Enterprise Features for High Availability
7. Storage Backends
• Configures the location for the storage of Vault data
• Storage is defined in the main Vault configuration file along with
desired parameters
• Not all storage backends are created equal
• Some support high availability
• Others have better tools for management & data protection
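As a sketch of what this looks like in practice, here is a minimal Vault configuration file using Consul as the storage backend (the hostnames are placeholders, not real endpoints):

```hcl
# Consul is one of the storage backends that supports high availability
storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

# api_addr and cluster_addr let standby nodes redirect clients
# to the active node (hostnames below are examples only)
api_addr     = "https://vault-node-1.example.com:8200"
cluster_addr = "https://vault-node-1.example.com:8201"
```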
8. Storage Backends
• Storage Backends that support High Availability
• Consul
• DynamoDB
• Etcd
• FoundationDB
• Google Cloud Spanner
• Google Cloud Storage
• MySQL
• Zookeeper
Your storage backend must be configured for high availability!
14. HashiCorp Vault - Unseal
Key              Value
---              -----
Seal Type        shamir
Sealed           true
Total Shares     5
Threshold        3
Unseal Progress  0/3

Key              Value
---              -----
Seal Type        shamir
Sealed           true
Total Shares     5
Threshold        3
Unseal Progress  1/3

Key              Value
---              -----
Seal Type        shamir
Sealed           true
Total Shares     5
Threshold        3
Unseal Progress  2/3

Key              Value
---              -----
Seal Type        shamir
Sealed           false
Total Shares     5
Threshold        3
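The status output above corresponds to a CLI session like the following (run against a live Vault server; output trimmed):

```shell
# Initialize with 5 key shares and a threshold of 3, matching the output above
vault operator init -key-shares=5 -key-threshold=3

# Each unseal call consumes one key share; after the third, Sealed flips to false
vault operator unseal   # prompts for a share – Unseal Progress 1/3
vault operator unseal   # 2/3
vault operator unseal   # threshold reached – vault is unsealed
```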
15. Cloud Unseal
• Rather than using Shamir key shares, use AWS KMS for seal wrapping
• Uses a designated KMS key
• Automatically unseals when nodes come online or are restarted
• Can use an EC2 instance role to permit access to the Encrypt, Decrypt,
and DescribeKey operations
• Supports KMS key rotation
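A minimal seal stanza for the Vault configuration file, assuming a KMS key alias (the alias below is a placeholder):

```hcl
# awskms auto-unseal – the key ID/alias is an example, not a real key
seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "alias/vault-unseal-key"
}
```

With an EC2 instance role granting the KMS permissions, no credentials need to appear in the file.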
16. Audit Devices
Keep a detailed log of all requests to and responses from Vault
Audit log is formatted using JSON
Sensitive information is hashed before logging
Can [and should] have more than one audit device enabled
! If audit devices are enabled, Vault requires at least one of them to
successfully write the log entry before completing the request
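Enabling two audit devices is a one-liner each (the file path and syslog tag below are examples):

```shell
# File audit device – path is an example
vault audit enable file file_path=/var/log/vault_audit.log

# A second device, so a single failed disk can't block requests
vault audit enable syslog tag="vault" facility="AUTH"

vault audit list
```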
18. Regions and Availability Zones
AZs are the simplest way to provide fault zones in AWS
Regions are harder
[Diagram: a Virtual Private Cloud (VPC) spanning Availability Zones A, B, and C,
with one Primary Node and Standby Nodes spread across the zones]
19. Placement Groups
For nodes within the same Availability Zone, use a Spread
Placement Group
Removes possibility of single point of failure relating to underlying
hardware
[Diagram: within one Availability Zone, a Spread Placement Group places the
Primary Node and a Standby Node on separate physical servers/hypervisors]
20. Security Groups
Use self-referencing security groups to enable communication among
nodes within a cluster
[Vault]: 8200, 8201
Don’t hardcode node IP addresses in the security group (/32)
[Diagram: Primary and Standby Nodes, each in the sg_prod_vault security group,
communicating over tcp/8200 and tcp/8201]
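A self-referencing security group in Terraform might look like this sketch (the variable and names are assumptions):

```hcl
# Members of this group may reach each other on the Vault API (8200)
# and cluster (8201) ports – no hardcoded /32 addresses
resource "aws_security_group" "prod_vault" {
  name   = "sg_prod_vault"
  vpc_id = var.vpc_id   # assumed variable

  ingress {
    from_port = 8200
    to_port   = 8201
    protocol  = "tcp"
    self      = true    # "self" makes the rule reference this same group
  }
}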
21. Load Balancing
Front-end Vault with an Application Load Balancer (ALB)
ALB used for high availability in this case, NOT load balancing
Use ALB health checks to determine the Active Node by way of the
Vault endpoint /v1/sys/health
HTTP Status Codes:
200 – initialized, unsealed, and active
429 – unsealed and standby
501 – not initialized
503 – sealed
[Diagram: an ALB in front of the Active Node and two Standby Nodes]
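A Terraform sketch of the target group health check (resource names and the variable are assumptions): because only the active node returns 200 from /v1/sys/health, the ALB marks standbys unhealthy and sends all traffic to the active node.

```hcl
resource "aws_lb_target_group" "vault" {
  name     = "vault"
  port     = 8200
  protocol = "HTTPS"
  vpc_id   = var.vpc_id   # assumed variable

  health_check {
    path    = "/v1/sys/health"
    matcher = "200"       # standbys answer 429 and are taken out of rotation
  }
}
```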
22. Automation
Automate the provisioning and configuration of Vault
Can use CloudFormation/Terraform to provision network, security
groups, roles, load balancers, and Vault nodes
Use Auto Scaling Groups for Availability (not scalability)
[Diagram: provisioning stack – Network, IAM Roles, Security Groups,
Storage Backend, Vault Nodes]
24. Storage Backend - Backups
The most critical task to protecting Vault is backing up the storage
backend
Use the storage backend’s built-in features to help manage
backups/snapshots and store them in multiple places
25. Consul Snapshots
Consul snapshots save the state of the Consul servers for disaster
recovery
Saves the key/value store, service catalog, prepared queries, sessions, and
ACLs
Run a one-time snapshot [consul snapshot save] or use the
Consul Snapshot Agent for automatic backups (*enterprise feature)
[Diagram: Snapshots 1–3 stored in S3 buckets across two regions]
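A one-time backup run looks like the following sketch (the bucket name is a placeholder; cross-region replication on the bucket keeps a copy in a second region):

```shell
# Snapshot the Consul servers backing Vault
consul snapshot save vault-backup.snap

# Sanity-check the snapshot before shipping it off-host
consul snapshot inspect vault-backup.snap

# Copy to S3 – bucket name is an example
aws s3 cp vault-backup.snap s3://my-vault-backups/$(date +%F)/
```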
26. Consul Autopilot
Built-in solution to assist with managing Consul nodes
Dead Server Cleanup
Server Stabilization
Redundancy Zone Tags
Upgrade Migration
Autopilot is on by default – disable features you don’t want
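Autopilot settings can be viewed and tuned from the CLI; the values below are illustrative, not recommendations:

```shell
# View the current autopilot configuration
consul operator autopilot get-config

# Tune individual features, e.g. longer stabilization, disable cleanup
consul operator autopilot set-config \
  -server-stabilization-time=30s \
  -cleanup-dead-servers=false
```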
27. Consul Autopilot - Dead Server Cleanup
Why did Consul have to clean up the failed node?
…because the infrastructure was a big mesh!
Dead server cleanup will remove failed servers from the cluster once
the replacement comes online, based on a configurable threshold
Cleanup will also be initiated any time a new server joins the cluster
Previously, it would take 72 hours to reap a failed server, or it had to
be done manually.
28. Consul Autopilot – Server Stabilization
New Consul server nodes must be healthy for x amount of time
before being promoted to a full, voting member.
Configurable time – default is 5 seconds
29. Consul Autopilot – Redundancy Zones
Ensure that Consul voting members will be spread across fault zones
to ensure high availability at all times.
In AWS, you can create fault zones based upon Availability Zones
[Diagram: voting members spread across Availability Zones 1 and 2]
30. Consul Autopilot – Upgrade Migrations
New Consul Server version > current Consul Server version
Consul won’t immediately promote newer servers as voting members
Number of ‘new’ nodes must match the number of ‘old’ nodes
[Diagram: existing 1.3.0 servers across Availability Zones 1 and 2, with new
1.4.0 servers waiting to be promoted as voting members]
34. Disaster Recovery Replication
Warm-standby if primary cluster fails
Mirrors all secrets, policies, and authentication tokens and leases
Does NOT service client requests unless promoted
[Diagram: Primary Cluster in Region 1 replicating to Secondary Cluster in Region 2]
35. Disaster Recovery Replication
Requires connectivity between regions for cluster replication
Likely accomplished with VPC Peering, Transit Network, Transit Gateway [NEW]
DR Cluster nodes should be architected in a similar fashion as production
Multiple Availability Zones, Spread Placement Groups, etc
Security Groups and NACLs should permit communication between primary and
secondary cluster
Don’t forget to permit clients
Use a Route53 Failover Routing Policy, along with health checks, to fail over the
primary Vault DNS record to the DR cluster – or use Consul
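Setting up DR replication is a few writes against the sys/replication API (Vault Enterprise; the secondary ID is a placeholder):

```shell
# On the primary cluster
vault write -f sys/replication/dr/primary/enable
vault write sys/replication/dr/primary/secondary-token id="dr-secondary"

# On the DR cluster, using the wrapped token from the previous step
vault write sys/replication/dr/secondary/enable token="<wrapped-token>"
```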
36. Performance Replication
THE way to extend Vault to other regions, public clouds, data centers
Mirrors all secrets and policies but NOT local tokens and leases
Will service local client requests for static secrets. Will create
dynamic secrets and leases separately from primary cluster
[Diagram: clients send requests to the Secondary Cluster in Region 2, which
replicates from the Primary Cluster in Region 1]
37. Performance Replication
Place performance replicated clusters near applications that it will service
i.e., same region, same VPC, same Availability Zones, etc.
Use a separate Route53 record for communication with the local cluster
Should be used heavily for applications needing read-only access to Vault
Use Mount Filters to limit which secrets are replicated
Mount filters can be used to satisfy GDPR requirements
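Enabling performance replication mirrors the DR workflow (Vault Enterprise; the secondary ID is a placeholder, and the exact mount-filter API has changed across Vault versions, so consult the docs for yours):

```shell
# On the primary cluster
vault write -f sys/replication/performance/primary/enable
vault write sys/replication/performance/primary/secondary-token id="perf-us-west"

# On the secondary, deployed near the applications it will serve
vault write sys/replication/performance/secondary/enable token="<wrapped-token>"
```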
38. A Story [a happier one]
• You implemented HashiCorp Vault
• Because you’re ridiculously smart, you knew everybody
would want in on this…
• You used all the points in this presentation to deploy Vault in
a highly available architecture.
• All the app teams were happy and migrated to Vault
• Vault didn’t go down…
• You are an IT hero!