“Infrastructure as Code” has changed not only how we think about configuring infrastructure, but also how we think about the infrastructure itself. AWS has been at the core of this movement, enabling infrastructure teams to benefit from software engineering best practices such as CI/CD, automated testing, and repeatable deployments. Now that you have mastered the art of managing your infrastructure as code, it’s time to apply the same lessons to monitoring and metrics. In this session, we dive into how you can use tooling such as AWS, Terraform, and Datadog to programmatically define your monitoring, so that you can scale your organizational observability along with your infrastructure and attain consistency from local development all the way through production.
Session sponsored by Datadog, Inc.
Monitoring as Code: Getting to Monitoring-Driven Development - DEV314 - re:Invent 2017
1. AWS re:Invent
Monitoring as Code
Getting to Monitoring-Driven Development
DEV314
2. Background
• Adam Kane – Director of Engineering @
• Operating multiple businesses and varied tech stacks
• Hybrid cloud environments
3. Background
• The ultimate digital network for all things movies
• Our portfolio reaches more than 60 million unique visitors per month
4. Agenda
• History of infrastructure & monitoring at Fandango
• Problem Space
• Datadog
• Deploying
• Monitoring & Alerting
• Next steps
5. History
• Primarily datacenter centric services
• Manual monitoring and alert configurations
• Traditional tools (Nagios, CloudWatch, etc.)
• Eventual migration to Sensu
• Move to hybrid cloud
6. Problem Space
• Commonality in monitoring and alerting platforms
• Hybrid cloud challenges
• Adapting to architecture changes in automated ways
• Handling growth of microservices and infrastructure
7. Finding a new solution
• We wanted more automation
• Evaluated Datadog and a few other SaaS solutions
• Flexible APIs and lots of predefined integrations
• Fit into our previous model
10. A bit about Datadog
• Nearly 300 out-of-the-box integrations
• Open source agent and libraries
• Well-documented API
• Trillions of data points per day
24.–28. How many requests per second…
... On my nodes running application:postgresql
... That are part of role:accounting-app
... In region:us-east-1
... By availability-zone
... And show me any that aren’t acting like the others?
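The drill-down above maps onto a single tag-scoped Datadog metric query. A sketch of what the final question might look like — the metric name `postgresql.queries.per_second` and the `outliers()` tolerance are illustrative assumptions, not from the talk:

```
# Illustrative metric name; outliers() flags series that diverge from the group
outliers(avg:postgresql.queries.per_second{application:postgresql, role:accounting-app, region:us-east-1} by {availability-zone}, 'dbscan', 2)
```

Each slide in the build-up adds one tag to the filter scope or the `by {}` grouping, which is why tag-based metrics make this kind of incremental drill-down cheap.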
30. How long?
• AWS CloudWatch
• 3 hours at 1 second granularity
• 15 days at 1 minute granularity
• 63 days at 5 minute granularity
• 15 months at 1 hour granularity
• Datadog
• 15 months at 1 second granularity
33. [Timeseries chart spanning two work weeks, Mon–Fri on the x-axis]
What happened on Tuesday?
Outage or a holiday?
34. A good fit for Fandango
• Integrations with all the technologies at Fandango
• Easy to send metrics from on-prem and AWS
• Single pane of glass for business and system monitoring
• Easily automated
36. Deploying
• Datadog agents are deployed via Puppet
• Monitors, Alerts, and Timeboards are deployed via Terraform
• Not all hosts run agents – CloudWatch metrics
37. class fandango_datadog {
  if $::operatingsystem == 'windows' {
    include fandango_datadog::windows::package
    include fandango_datadog::windows::service
  }
  else {
    include fandango_datadog::linux::package
    include fandango_datadog::linux::config
    include fandango_datadog::linux::service
  }
}
Agent Deployment (Puppet code snippet)
42.–43. Agent Deployment (recap)
• Next time Puppet runs…
• …the datadog agent will install
• …cassandra.yaml will be placed into the proper config directory for the datadog agent
• …dd-agent will receive a HUP from Puppet to start reading the new cassandra.yaml file
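For reference, the check config that Puppet drops into place could look like the following — a minimal sketch assuming the Agent 5 layout (`/etc/dd-agent/conf.d/`) and the JMX-based Cassandra check; the host and port are placeholders:

```yaml
# /etc/dd-agent/conf.d/cassandra.yaml (path assumes Agent 5 layout)
init_config:
  is_jmx: true

instances:
  - host: localhost
    port: 7199   # default Cassandra JMX port
```

Once the agent re-reads its config after the HUP, metrics from this check start flowing without any manual work in the Datadog UI.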
45. resource "datadog_monitor" "cpu_check" {
  name               = "Global - CPU Usage"
  type               = "metric alert"
  message            = "CPU is high on {{host.name}}! @slack-fd-alerts"
  escalation_message = "CPU is STILL high on {{host.name}}! @slack-fd-alerts"
  query              = "avg(last_1h):avg:system.cpu.user{environment:prd} by {host} > 85"
  thresholds {
    ok       = 0
    warning  = 85
    critical = 95
  }...
Monitor Deployment (Terraform code snippet)
46. module.base.datadog_monitor.base_services: Creating...
  name:               "" => "Global - CPU Usage"
  message:            "" => "CPU is high on {{host.name}}! @slack-fd-alerts"
  escalation_message: "" => "CPU is STILL high on {{host.name}}! @slack-fd-alerts"
  query:              "" => "avg(last_1h):avg:system.cpu.user{environment:prd} by {host} > 85"
  thresholds.warning:  "" => "85"
  thresholds.critical: "" => "95"
module.base.datadog_monitor.base_services: Creation complete (ID: 3054683)
Monitor Deployment (terraform apply)
49.–51. Monitor Deployment (recap)
• We wrote some Terraform code…
• …the code first set up the API and APP key access to Datadog
• …then we wrote code to check for CPU usage above 85%
• …executed terraform apply and our monitor is now live!
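The “API and APP key access” step is handled by the Datadog provider block, which must be configured before any `datadog_monitor` resources can be applied. A sketch — the variable names are illustrative, and the keys can also be supplied via the `DATADOG_API_KEY` / `DATADOG_APP_KEY` environment variables:

```hcl
# Provider setup; Terraform authenticates to Datadog with these keys
provider "datadog" {
  api_key = "${var.datadog_api_key}"  # illustrative variable names
  app_key = "${var.datadog_app_key}"
}
```

Keeping the keys in variables (or the environment) rather than hard-coding them lets the same monitor code run against different Datadog organizations.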
53. Alerting
• Alerts are part of the Terraform code
• Separation of “alerts” vs. “notifications”
• Anomaly detection
• Slack integration
• PagerDuty integration
• Email and Slack distributions for notifications
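The “alerts vs. notifications” split can be expressed directly in a monitor’s message using Datadog’s template conditionals, so that only true alerts page while softer notifications go to chat. A sketch — the monitor, metric, threshold, and handles here are illustrative, not Fandango’s actual config:

```hcl
# Sketch: severity-based routing inside one monitor's message
resource "datadog_monitor" "latency" {
  name    = "API Latency"
  type    = "metric alert"
  query   = "avg(last_5m):avg:app.request.latency{environment:prd} > 2"
  message = <<EOF
{{#is_alert}}Latency is critical. @pagerduty-app-oncall{{/is_alert}}
{{#is_warning}}Latency is elevated. @slack-fd-alerts{{/is_warning}}
EOF
}
```

Because the routing lives in the Terraform code, changing who gets paged is a code review and an apply, not a manual edit in the UI.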