Vault is an open source solution for identity and secrets management. It is well suited to both public cloud and private datacenter use, but a common challenge is running Vault, and accessing its secrets, securely in the public cloud. This talk shows how to run Vault securely in the cloud and access its secrets securely from several different cloud platforms. Additionally, the Vault 0.10 release is right around the corner and includes major changes that improve life for both beginners and advanced Vault users. We’ll spend some time on the latest Vault features and use them throughout the talk.
Technical Account Manager at HashiCorp
Using the HashiCorp stack for about 7 years (Vagrant…)
Worn a lot of hats in my time...
Developer, Consultant, Pre-Sales, TAM
Making people’s operational life easier and
DEVOPS ALL THE THINGS
Introductions - Who is this person?
▪ Consul is the main recommended storage backend for Vault
▪ It gives Vault a proper HA and DR story
▪ More info:
Vault and Consul - What a team!
▪ Consul hit 1.0 last year!
▪ Vault is at 0.10… 1.0 is coming “Sooner rather than later” - Mitchell
▪ Other products “Soon”™
▪ Also, cool stuff is coming, come to HashiDays Amsterdam and HashiConf!
Maturing of Products
With maturing comes operationalisation
Our research team is currently soak-testing and measuring Consul at massive scale, so if you’re hitting edge cases or have information for us, we’d like to hear from you!
Come help us with Consul scaling research!
▪ Time-series telemetry data: This involves capturing metrics
from the application, storing them in a special database
designed for that purpose, and analyzing trends in the data
▪ Examples: Grafana, CloudWatch, DataDog, Circonus.
Time-series Telemetry Data
▪ Log analytics. This means capturing log files from the
system and the application, extracting useful signals
from the text, and then analyzing that data.
▪ Examples: Splunk, ELK, SumoLogic.
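As a minimal sketch of “extracting useful signals from the text”, the snippet below counts log lines by severity. It assumes Consul’s older log format, where each line carries a bracketed level such as [WARN] or [ERR]; the sample lines are hypothetical, and in practice a tool like Splunk, ELK or SumoLogic does this at ingest time.

```python
import re
from collections import Counter

# Matches a bracketed level such as "[INFO]" or "[ERR]". The log format is an
# assumption based on Consul's older output; adjust the pattern for your logs.
LEVEL_RE = re.compile(r"\[(TRACE|DEBUG|INFO|WARN|ERR|ERROR)\]")


def count_levels(lines):
    """Return a Counter mapping log level -> number of lines at that level."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "2018/04/10 10:00:00 [INFO] agent: Synced node info",
        "2018/04/10 10:00:05 [WARN] raft: Heartbeat timeout reached",
        "2018/04/10 10:00:06 [ERR] agent: failed to sync remote state",
    ]
    print(count_levels(sample))
```

A sudden rise in the WARN or ERR counts is the kind of signal you would then chart or alert on.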
▪ Active health checks: This involves actively connecting to the application and interacting with it to ensure it is working as expected.
▪ Examples: Nagios, Sensu, Keynote.
Active health checks
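To make “active” concrete, here is a minimal sketch of the kind of probe a tool like Nagios or Sensu might run against Consul. It calls Consul’s /v1/status/leader HTTP endpoint, which returns the current leader’s address as a JSON string (an empty string means no leader). The agent address 127.0.0.1:8500 and the 2-second timeout are assumptions; adjust them for your environment.

```python
import json
import urllib.request


def leader_is_healthy(status_code, body):
    """Interpret a response from Consul's /v1/status/leader endpoint.

    The body is a JSON string such as "10.1.10.12:8300", or "" when the
    cluster currently has no leader.
    """
    if status_code != 200:
        return False
    leader = json.loads(body)
    return isinstance(leader, str) and leader != ""


def check_consul_leader(base_url="http://127.0.0.1:8500", timeout=2.0):
    """Actively probe a Consul agent; True only if it reports a leader.

    base_url is an assumed default agent address, not a requirement.
    """
    try:
        with urllib.request.urlopen(base_url + "/v1/status/leader",
                                    timeout=timeout) as resp:
            return leader_is_healthy(resp.status, resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection refused, timeouts, HTTP errors and malformed JSON
        # all count as an unhealthy result.
        return False


if __name__ == "__main__":
    print("healthy" if check_consul_leader() else "UNHEALTHY")
```

Keeping the response interpretation in its own function makes the check easy to test without a running cluster.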
▪ Vault and Consul use the go-metrics library to export telemetry.
▪ Currently they support the following options:
• DataDog's DogStatsd
▪ Note that DataDog's agent and Statsite are implementations of statsd, so the
last 3 options are nearly the same thing.
How do we get those metrics?
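The reason those statsd flavours are interchangeable is that they share one small text protocol over UDP. The sketch below is not go-metrics itself, just a standard-library illustration of the wire format: plain statsd sends name:value|type, and DogStatsD appends |# followed by comma-separated tags. The localhost:8125 address and the tag names are assumptions for illustration.

```python
import socket


def format_dogstatsd(name, value, metric_type, tags=None):
    """Build one DogStatsD datagram, e.g. 'consul.raft.apply:1|c|#host:server1'.

    metric_type is a statsd type code such as 'c' (counter), 'g' (gauge)
    or 'ms' (timer). Without tags this is also a valid plain statsd line.
    """
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Sorted so the output is deterministic.
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


def send_metric(line, addr=("127.0.0.1", 8125)):
    """Fire-and-forget UDP send, the way statsd clients typically work."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), addr)


if __name__ == "__main__":
    print(format_dogstatsd("consul.raft.apply", 1, "c", {"host": "server1"}))
    # prints: consul.raft.apply:1|c|#host:server1
```

Because UDP is connectionless, a missing or slow agent never blocks the application emitting the metric.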
Where do they go?
▪ Once the metrics reach your statsd-compatible agent, they
need to be forwarded somewhere so they can be stored
and displayed. There are many options...
▪ For this demo we’re sticking to a TIGK Stack:
• Telegraf, InfluxDB, Grafana, Kapacitor
• (Normally that would be TICK, but Chronograf’s dashboards are not as good as Grafana’s, IMO)
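In this stack, Telegraf’s statsd input is the statsd-compatible agent. Its internals are not reproduced here; as a rough sketch, the first thing any such agent does with a datagram is split it into a name, value, type and tags before aggregating and forwarding it (to InfluxDB here). This hypothetical parser handles only the common numeric name:value|type|#tags shape and skips sample rates.

```python
def parse_dogstatsd(datagram):
    """Split one DogStatsD line into (name, value, type, tags).

    Numeric metrics only. Sample-rate sections ('|@0.5') are skipped, and
    tags without a value (e.g. '#env') map to None.
    """
    parts = datagram.strip().split("|")
    # rpartition so that only the last ':' separates name from value.
    name, _, raw_value = parts[0].rpartition(":")
    metric_type = parts[1]
    tags = {}
    for section in parts[2:]:
        if section.startswith("#"):
            for tag in section[1:].split(","):
                key, sep, val = tag.partition(":")
                tags[key] = val if sep else None
    return name, float(raw_value), metric_type, tags


if __name__ == "__main__":
    print(parse_dogstatsd("consul.raft.apply:1|c|#host:server1"))
```

The tags recovered here are what later lets Grafana filter dashboards by host.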
Consul Telemetry - How?
➔ Two entries:
◆ dogstatsd_addr: hostname and port of the statsd daemon.
○ Using the DogStatsD format instead of plain statsd tells Consul to send tags with each metric. Grafana can use these tags to filter data on your dashboards.
◆ disable_hostname: true
○ Tells Consul not to insert the hostname into the names of the metrics it sends to statsd, since the hostnames will be sent as tags.
○ Without this option, the single metric consul.raft.apply would be split into a separate metric for each host.
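For reference, Consul also accepts its configuration as JSON, with these entries nested under a telemetry key. This hypothetical helper renders that fragment; the localhost:8125 address is an assumption (a local DogStatsD-compatible agent), and where you save the file depends on your Consul config directory.

```python
import json


def consul_telemetry_config(statsd_addr="localhost:8125"):
    """Render the two telemetry entries as a Consul JSON config fragment.

    statsd_addr is an assumed local agent address; change it as needed.
    """
    config = {
        "telemetry": {
            "dogstatsd_addr": statsd_addr,  # where the DogStatsD agent listens
            "disable_hostname": True,       # hostnames travel as tags instead
        }
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(consul_telemetry_config())
```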
Vault Telemetry - How?
Pretty much the same! In Vault’s HCL config file, both entries go inside the telemetry stanza:
telemetry {
  dogstatsd_addr   = "localhost:8125"
  disable_hostname = true
}
Consul Telemetry - What?
▪ Consul has 86 different metrics
▪ That’s good but… which ones do I need to look at?
▪ And what’s the threshold before I should get alerted?
Consul Telemetry - Transaction Timing
▪ consul.kvs.apply: Measures the time it takes to complete an update to the KV store.
▪ consul.txn.apply: Measures the time spent applying a transaction operation.
▪ consul.raft.apply: Counts the number of Raft transactions occurring over the interval.
▪ consul.raft.commitTime: Measures the time it takes to commit a new entry to the Raft log on the leader.
Why they're important: Taken together, these metrics indicate how long it takes to complete write operations in
various parts of the Consul cluster. Generally these should all be fairly consistent and no more than a few
milliseconds. Sudden changes in any of the timing values could be due to unexpected load on the Consul servers, or
due to problems on the servers themselves.
What to look for: Deviations (in any of these metrics) of more than 50% from baseline over the previous hour.
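The “more than 50% from baseline” rule can be sketched as below. Taking the arithmetic mean of the previous hour’s samples as the baseline is an assumption; in the TIGK stack this logic would normally live in a Kapacitor alert rather than in a script.

```python
from statistics import mean


def deviates_from_baseline(current, previous_hour, threshold=0.5):
    """True if `current` differs from the previous hour's mean by more than
    `threshold` (0.5 = 50%).

    previous_hour: samples of one metric, e.g. consul.raft.commitTime in ms.
    Using the mean as the baseline is an assumption, not a Consul rule.
    """
    if not previous_hour:
        return False  # no baseline yet, nothing to compare against
    baseline = mean(previous_hour)
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / abs(baseline) > threshold


if __name__ == "__main__":
    last_hour = [4.0, 4.0, 4.0]
    print(deviates_from_baseline(6.1, last_hour))  # True: about 52% above baseline
```

The check is symmetric, so a sudden drop (for example, writes no longer reaching the leader) is flagged just like a spike.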
Vault Telemetry - Seal Status
▪ consul_health_checks[check_name="Vault Sealed Status"].passing: A value of 1 indicates Vault is unsealed; 0 means sealed.
Why it's important: By default, Vault is sealed on startup, so if this value changes to 0 during the day, Vault has restarted for some reason. Until it is unsealed, it won't answer requests from clients.
What to look for: A value of 0 being reported by any host.
NOTE: This metric is actually reported by Telegraf's Consul input plugin, not by Vault itself.
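“A value of 0 being reported by any host” translates into a very small check. The sketch below assumes you have already pulled the latest value of the metric for each host (for example, from InfluxDB) into a dictionary; the host names are hypothetical.

```python
def sealed_hosts(readings):
    """Return the hosts whose "Vault Sealed Status" check reports 0 (sealed).

    readings: mapping of host -> latest value of
    consul_health_checks[check_name="Vault Sealed Status"].passing (1 or 0).
    """
    return sorted(host for host, passing in readings.items() if passing == 0)


if __name__ == "__main__":
    latest = {"vault-a": 1, "vault-b": 0, "vault-c": 1}  # hypothetical hosts
    for host in sealed_hosts(latest):
        print(f"ALERT: Vault on {host} is sealed")
```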