“Infrastructure as Code” has changed not only how we think about configuring infrastructure, but also how we think about the infrastructure itself. AWS has been at the core of this movement, enabling infrastructure teams to benefit from software engineering best practices such as CI/CD, automated testing, and repeatable deployments. Now that you have mastered the art of managing your infrastructure as code, it’s time to apply the same lessons to monitoring and metrics. In this session, we dive into how you can use tooling such as AWS, Terraform, and Datadog to programmatically define your monitoring, so that you can scale your organizational observability along with your infrastructure and attain consistency from local development all the way through production.
Session sponsored by Datadog, Inc.
Monitoring as Code: Getting to Monitoring-Driven Development - DEV314 - re:Invent 2017
1. AWS re:Invent
Monitoring as Code
Getting to Monitoring-Driven Development
DEV314
2. Background
• Adam Kane – Director of Engineering @
• Operating multiple businesses and varied tech stacks
• Hybrid cloud environments
3. Background
• The ultimate digital network for all things movies
• Our portfolio reaches more than 60 million unique visitors per month
4. Agenda
• History of infrastructure & monitoring at Fandango
• Problem Space
• Datadog
• Deploying
• Monitoring & Alerting
• Next steps
5. History
• Primarily datacenter centric services
• Manual monitoring and alert configurations
• Traditional tools (Nagios, CloudWatch, etc.)
• Eventual migration to Sensu
• Move to hybrid cloud
6. Problem Space
• Commonality in monitoring and alerting platforms
• Hybrid cloud challenges
• Adapting to architecture changes in automated ways
• Handling growth of microservices and infrastructure
7. Finding a new solution
• We wanted more automation
• Evaluated Datadog and a few other SaaS solutions
• Flexible APIs and lots of predefined integrations
• Fit into our previous model
10. A bit about Datadog
• Nearly 300 out-of-the-box integrations
• Open source agent and libraries
• Well-documented API
• Trillions of data points per day
24.–28. How many requests per second…
... On my nodes running application:postgresql
... That are part of role:accounting-app
... In region:us-east-1
... By availability-zone
... And show me any that aren’t acting like the others?
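The drill-down above maps onto a single tag-scoped Datadog metric query. A sketch of what the final question might look like — the metric name `postgresql.queries.per_second` and the `outliers()` tolerance are illustrative assumptions, not from the talk:

```
# Illustrative metric name; outliers() flags series that diverge from the group
outliers(avg:postgresql.queries.per_second{application:postgresql, role:accounting-app, region:us-east-1} by {availability-zone}, 'dbscan', 2)
```

Each slide in the build-up adds one tag to the filter scope or the `by {}` grouping, which is why tag-based metrics make this kind of incremental drill-down cheap.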
30. How long?
• AWS CloudWatch
• 3 hours at 1 second granularity
• 15 days at 1 minute granularity
• 63 days at 5 minute granularity
• 15 months at 1 hour granularity
• Datadog
• 15 months at 1 second granularity
33. [Timeseries chart spanning two work weeks, Mon–Fri on the x-axis]
What happened on Tuesday?
Outage or a holiday?
34. A good fit for Fandango
• Integrations with all the technologies at Fandango
• Easy to send metrics from on-prem and AWS
• Single pane of glass for business and system monitoring
• Easily automated
36. Deploying
• Datadog agents are deployed via Puppet
• Monitors, Alerts, and Timeboards are deployed via Terraform
• Not all hosts run agents – CloudWatch metrics
37. class fandango_datadog {
  if $::operatingsystem == 'windows' {
    include fandango_datadog::windows::package
    include fandango_datadog::windows::service
  }
  else {
    include fandango_datadog::linux::package
    include fandango_datadog::linux::config
    include fandango_datadog::linux::service
  }
}
Agent Deployment (Puppet code snippet)
42.–43. Agent Deployment (recap)
• Next time Puppet runs…
• …the datadog agent will install
• …cassandra.yaml will be placed into the proper config directory for the datadog agent
• …dd-agent will receive a HUP from Puppet to start reading the new cassandra.yaml file
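For reference, the check config that Puppet drops into place could look like the following — a minimal sketch assuming the Agent 5 layout (`/etc/dd-agent/conf.d/`) and the JMX-based Cassandra check; the host and port are placeholders:

```yaml
# /etc/dd-agent/conf.d/cassandra.yaml (path assumes Agent 5 layout)
init_config:
  is_jmx: true

instances:
  - host: localhost
    port: 7199   # default Cassandra JMX port
```

Once the agent re-reads its config after the HUP, metrics from this check start flowing without any manual work in the Datadog UI.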
45. resource "datadog_monitor" "cpu_check" {
  name               = "Global - CPU Usage"
  type               = "metric alert"
  message            = "CPU is high on {{host.name}}! @slack-fd-alerts"
  escalation_message = "CPU is STILL high on {{host.name}}! @slack-fd-alerts"
  query              = "avg(last_1h):avg:system.cpu.user{environment:prd} by {host} > 85"
  thresholds {
    ok       = 0
    warning  = 85
    critical = 95
  }...
Monitor Deployment (Terraform code snippet)
46. module.base.datadog_monitor.base_services: Creating...
  name:               "" => "Global - CPU Usage"
  message:            "" => "CPU is high on {{host.name}}! @slack-fd-alerts"
  escalation_message: "" => "CPU is STILL high on {{host.name}}! @slack-fd-alerts"
  query:              "" => "avg(last_1h):avg:system.cpu.user{environment:prd} by {host} > 85"
  thresholds.warning:  "" => "85"
  thresholds.critical: "" => "95"
module.base.datadog_monitor.base_services: Creation complete (ID: 3054683)
Monitor Deployment (terraform apply)
49.–51. Monitor Deployment (recap)
• We wrote some Terraform code…
• …the code first set up the API and APP key access to Datadog
• …then we wrote code to check for CPU usage above 85%
• …executed terraform apply and our monitor is now live!
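The “API and APP key access” step is handled by the Datadog provider block, which must be configured before any `datadog_monitor` resources can be applied. A sketch — the variable names are illustrative, and the keys can also be supplied via the `DATADOG_API_KEY` / `DATADOG_APP_KEY` environment variables:

```hcl
# Provider setup; Terraform authenticates to Datadog with these keys
provider "datadog" {
  api_key = "${var.datadog_api_key}"  # illustrative variable names
  app_key = "${var.datadog_app_key}"
}
```

Keeping the keys in variables (or the environment) rather than hard-coding them lets the same monitor code run against different Datadog organizations.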
53. Alerting
• Alerts are part of the Terraform code
• Separation of “alerts” vs. “notifications”
• Anomaly detection
• Slack integration
• PagerDuty integration
• Email and Slack distributions for notifications
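The “alerts vs. notifications” split can be expressed directly in a monitor’s message using Datadog’s template conditionals, so that only true alerts page while softer notifications go to chat. A sketch — the monitor, metric, threshold, and handles here are illustrative, not Fandango’s actual config:

```hcl
# Sketch: severity-based routing inside one monitor's message
resource "datadog_monitor" "latency" {
  name    = "API Latency"
  type    = "metric alert"
  query   = "avg(last_5m):avg:app.request.latency{environment:prd} > 2"
  message = <<EOF
{{#is_alert}}Latency is critical. @pagerduty-app-oncall{{/is_alert}}
{{#is_warning}}Latency is elevated. @slack-fd-alerts{{/is_warning}}
EOF
}
```

Because the routing lives in the Terraform code, changing who gets paged is a code review and an apply, not a manual edit in the UI.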