Designing High Availability
for HashiCorp Vault in AWS
Bryan Krausen
Who Is This Guy Talking?
 Bryan Krausen
 Sr. Solutions Architect @ AHEAD [awesome company]
 Blog: itdiversified.com [voted worst blog ever]
 Twitter: @btkrausen
 Holds [ all ] AWS Certifications
 HashiCorp Vault Intermediate Certified [partner cert]
 Working towards Advanced
A Story [the problem]
• You [finally] implemented a secrets solution
• You told everyone it was a PoC
• The first onboarded application, “test”, was successful and
immediately went into production – so other app owners wanted in….
• The Ops team started saving static secrets in the KV store,
like a good Ops team does….
• Word got out that Vault was a thing, and more requests to use it
were submitted and, well, you obliged
• Vault goes down and it breaks all the things….
• You realize Vault is a critical piece of your infrastructure….
¯\_(ツ)_/¯
How Can I Build Vault As A Highly Available Solution?
High Availability Built Directly Into Vault
High Availability in AWS
Vault Disaster Recovery
Vault Enterprise Features for High Availability
HashiCorp Vault
High Availability Built Directly Into Vault
Storage Backends
• Defines the location where Vault data is stored
• Storage is defined in the main Vault configuration file along with
desired parameters
• Not all storage backends are created equal
• Some support high availability
• Others have better tools for management & data protection
Storage Backends
• Storage Backends that support High Availability
• Consul
• DynamoDB
• Etcd
• FoundationDB
• Google Cloud Spanner
• Google Cloud Storage
• MySQL
• Zookeeper
Your storage backend must be
configured for high availability!
Storage Backends – Config File

DynamoDB:

storage "dynamodb" {
  ha_enabled     = "true"
  max_parallel   = 128
  region         = "us-east-1"
  table          = "Vault-Storage-Backend"
  read_capacity  = 10
  write_capacity = 15
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = "true"   # don't use this in prod
}

api_addr = "https://IPADDRESS:8200"
ui = true

Consul Client:

storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
}

api_addr = "https://IPADDRESS:8200"
ui = true
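For production, enable TLS on the listener instead of setting tls_disable. A minimal sketch, assuming the certificate and key already exist at the paths shown:

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault/tls/vault.crt"   # assumed path
  tls_key_file  = "/etc/vault/tls/vault.key"   # assumed path
}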
Local Redundancy
[Diagram: a single node versus a cluster of one primary node and two standby nodes]
Accessing Vault - Multiple Nodes
[Diagram: clients reaching a primary node backed by two standby nodes]
Unsealing Vault - Shamir
[Diagram: Shamir’s Secret Sharing Algorithm splits the master key into five key shares; the reassembled master key decrypts the encryption key, which protects the Vault data]
Unsealing Vault – Key Shares
[Diagram: five key shares]
HashiCorp Vault - Unseal

Key                Value
---                -----
Seal Type          shamir
Sealed             true
Total Shares       5
Threshold          3
Unseal Progress    0/3

After the first key share: Unseal Progress 1/3
After the second key share: Unseal Progress 2/3
After the third key share: Sealed false – Vault is unsealed
Cloud Unseal
• Rather than using Shamir, use AWS KMS to auto-unseal Vault
• Uses a designated KMS key
• Automatically unseals when nodes come online or are restarted
• Can use an EC2 Service Role to permit access to Encrypt, Decrypt,
and Describe Keys
• Supports KMS key rotation
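This maps to the seal "awskms" stanza in the Vault config file; a minimal sketch, with a placeholder key alias:

seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "alias/vault-unseal"   # placeholder – substitute your own key ID or alias
}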
Audit Devices
 Keep a detailed log of all requests to and responses from Vault
 Audit log is formatted using JSON
 Sensitive information is hashed before logging
 Can [and should] have more than one audit device enabled
! If audit devices are enabled, Vault requires at least one audit device
to successfully write the log before completing the request
HashiCorp Vault
High Availability in AWS
Regions and Availability Zones
 AZs are the simplest way to provide fault zones in AWS
 Regions are harder
[Diagram: a VPC spanning Availability Zones A, B, and C, with one primary node and five standby nodes spread across the AZs]
Placement Groups
 For nodes within the same Availability Zone, use a Spread
Placement Group
 Removes the underlying hardware as a single point of failure
[Diagram: within one Availability Zone, the primary and standby nodes run on separate physical servers as part of a Spread Placement Group]
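A minimal Terraform sketch of a spread placement group; the AMI variable and resource names are illustrative:

resource "aws_placement_group" "vault" {
  name     = "vault-spread"
  strategy = "spread"
}

resource "aws_instance" "vault_node" {
  count           = 2
  ami             = var.vault_ami          # assumed variable
  instance_type   = "m5.large"
  placement_group = aws_placement_group.vault.name
}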
Security Groups
 Use self-referencing security groups to enable communication among
nodes within a cluster
 [Vault]: 8200, 8201
 Don’t hardcode node IP addresses in the security group (/32)
[Diagram: primary and standby nodes, both in sg_prod_vault, communicating over tcp/8200 and tcp/8201]
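A sketch of the self-referencing rules in Terraform; the VPC variable and names are illustrative:

resource "aws_security_group" "vault" {
  name   = "sg_prod_vault"
  vpc_id = var.vpc_id                       # assumed variable
}

resource "aws_security_group_rule" "vault_cluster" {
  for_each                 = toset(["8200", "8201"])
  type                     = "ingress"
  from_port                = each.value
  to_port                  = each.value
  protocol                 = "tcp"
  security_group_id        = aws_security_group.vault.id
  source_security_group_id = aws_security_group.vault.id   # self-referencing, no hardcoded /32s
}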
Load Balancing
 Put an Application Load Balancer (ALB) in front of Vault
 ALB used for high availability in this case, NOT load balancing
 Use ALB health checks to determine the Active Node by way of the
Vault endpoint /v1/sys/health
HTTP Status Codes returned by /v1/sys/health:
200 – initialized, unsealed, and active
429 – unsealed and standby
501 – not initialized
503 – sealed
[Diagram: an ALB directing client traffic to the active node among three Vault nodes]
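A Terraform sketch of the target group health check; it assumes TLS on 8200 and treats only the active node's 200 response as healthy:

resource "aws_lb_target_group" "vault" {
  name     = "vault"
  port     = 8200
  protocol = "HTTPS"
  vpc_id   = var.vpc_id                     # assumed variable

  health_check {
    path     = "/v1/sys/health"
    protocol = "HTTPS"
    matcher  = "200"                        # standby (429) and sealed (503) nodes fail the check
  }
}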
Automation
 Automate the provisioning and configuration of Vault
 Can use CloudFormation/Terraform to provision network, security
groups, roles, load balancers, and Vault nodes
 Use Auto Scaling Groups for availability (not scalability) – see the sketch below
[Diagram: provisioning layers – network, IAM roles, security groups, storage backend, Vault nodes]
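A minimal Terraform sketch of a fixed-size ASG spanning multiple AZs; the launch template and subnet variable are assumed to exist:

resource "aws_autoscaling_group" "vault" {
  name                = "vault"
  min_size            = 3
  max_size            = 3                   # fixed size: replaces failed nodes, doesn't scale
  desired_capacity    = 3
  vpc_zone_identifier = var.private_subnet_ids   # assumed: one subnet per AZ

  launch_template {
    id      = aws_launch_template.vault.id  # assumed resource
    version = "$Latest"
  }
}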
HashiCorp Vault
Vault Disaster Recovery
Storage Backend - Backups
 The most critical task in protecting Vault is backing up the storage
backend
 Use the storage backend’s built-in features to help manage
backups/snapshots and store them in multiple places
Consul Snapshots
 Consul snapshots save the state of the Consul servers for disaster
recovery
 Saves key/value data, the service catalog, prepared queries, sessions, and
ACLs
 Run a one-time snapshot [consul snapshot save] or use the
Consul snapshot agent for automatic backups (*enterprise feature)
[Diagram: snapshots 1–3 copied to S3 buckets in two regions]
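A Terraform sketch of a versioned S3 bucket to hold the snapshots; the bucket name is a placeholder, and a second bucket or cross-region replication covers the multi-region copy:

resource "aws_s3_bucket" "consul_snapshots" {
  bucket = "example-consul-snapshots"       # placeholder name
}

resource "aws_s3_bucket_versioning" "consul_snapshots" {
  bucket = aws_s3_bucket.consul_snapshots.id
  versioning_configuration {
    status = "Enabled"
  }
}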
Consul Autopilot
 Built-in solution to assist with managing Consul nodes
 Dead Server Cleanup
 Server Stabilization
 Redundancy Zone Tags
 Upgrade Migration
 Autopilot is on by default – disable features you don’t want
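These features map to the autopilot stanza in the Consul server configuration; a sketch with illustrative values:

autopilot {
  cleanup_dead_servers      = true
  server_stabilization_time = "10s"         # illustrative value
  redundancy_zone_tag       = "az"          # pairs with node_meta set on each server
  disable_upgrade_migration = false
}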
Consul Autopilot - Dead Server Cleanup
Why did Consul have to clean up the failed node?
…because the infrastructure was a big mesh!
 Dead server cleanup will remove failed servers from the cluster once
the replacement comes online, based on a configurable threshold
 Cleanup is also initiated anytime a new server joins the cluster
 Previously, it would take 72 hours to reap a failed server, or it had to
be done manually
Consul Autopilot – Server Stabilization
 New Consul server nodes must be healthy for x amount of time
before being promoted to a full, voting member.
 Configurable time – default is 5 seconds
Consul Autopilot – Redundancy Zones
 Spreads Consul voting members across fault zones to maintain high
availability at all times
 In AWS, you can create fault zones based upon Availability Zones
[Diagram: one voting member in each of Availability Zones 1 and 2]
Consul Autopilot – Upgrade Migrations
 New Consul Server version > current Consul Server version
 Consul won’t immediately promote newer servers as voting members
 Number of ‘new’ nodes must match the number of ‘old’ nodes
[Diagram: two 1.4.0 nodes join a cluster of two 1.3.0 nodes across Availability Zones 1 and 2; once the counts match, Autopilot promotes the new nodes to voting members]
HashiCorp Vault
Vault Enterprise Features for High Availability
Disaster Recovery Replication
 Warm standby in case the primary cluster fails
 Mirrors all secrets, policies, and authentication tokens and leases
 Does NOT service client requests unless promoted
[Diagram: a primary cluster in Region 1 replicating to a secondary cluster in Region 2]
Disaster Recovery Replication
 Requires connectivity between regions for cluster replication
 Likely accomplished with VPC Peering, Transit Network, Transit Gateway [NEW]
 DR Cluster nodes should be architected in a similar fashion to production
 Multiple Availability Zones, Spread Placement Groups, etc
 Security Groups and NACLs should permit communication between primary and
secondary cluster
 Don’t forget to permit clients
 Use a Route53 Failover Routing Policy, along with health checks, to fail the
primary Vault DNS record over to the DR cluster – or use Consul
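A sketch of the failover routing policy in Terraform; the zone variable, record name, ALB references, and health check are all illustrative:

resource "aws_route53_record" "vault_primary" {
  zone_id         = var.zone_id             # assumed variable
  name            = "vault.example.com"     # placeholder record
  type            = "CNAME"
  ttl             = 30
  records         = [aws_lb.vault_primary.dns_name]    # assumed ALB
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.vault.id  # assumed check against /v1/sys/health

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "vault_dr" {
  zone_id        = var.zone_id
  name           = "vault.example.com"
  type           = "CNAME"
  ttl            = 30
  records        = [aws_lb.vault_dr.dns_name]          # assumed ALB in the DR region
  set_identifier = "dr"

  failover_routing_policy {
    type = "SECONDARY"
  }
}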
Performance Replication
 THE way to extend Vault to other regions, public clouds, and data centers
 Mirrors all secrets and policies but NOT local tokens and leases
 Services local client requests for static secrets; creates dynamic
secrets and leases separately from the primary cluster
[Diagram: requests served by a performance secondary cluster in Region 2, which replicates from the primary cluster in Region 1]
Performance Replication
 Place performance replicated clusters near the applications they will service
 i.e., same region, same VPC, same Availability Zones, etc.
 Use a separate Route53 record for communication with the local cluster
 Should be used heavily for applications needing read-only access to Vault
 Use Mount Filters to limit which secrets are replicated
 Mount filters can be used to satisfy GDPR requirements
A Story [a happier one]
• You implemented HashiCorp Vault
• Because you’re ridiculously smart, you knew everybody
would want in on this…
• You used all the points in this presentation to deploy Vault in
a highly available architecture.
• All the app teams were happy and migrated to Vault…
• Vault didn’t go down…
• You are an IT hero!
THE END – Bryan Krausen
[ THIS IS THE LAST SLIDE, I SWEAR ]