AWS re:Invent
Monitoring as Code
Getting to Monitoring-Driven Development
DEV314
Background
• Adam Kane – Director of Engineering @
• Operating multiple businesses and varied tech stacks
• Hybrid cloud environments
Background
• The ultimate digital network for all things movies
• Our portfolio reaches more than 60 million unique visitors per month
Agenda
• History of infrastructure & monitoring at Fandango
• Problem Space
• Datadog
• Deploying
• Monitoring & Alerting
• Next steps
History
• Primarily datacenter-centric services
• Manual monitoring and alert configurations
• Traditional tools (Nagios, CloudWatch, etc.)
• Eventual migration to Sensu
• Move to hybrid cloud
Problem Space
• Commonality in monitoring and alerting platforms
• Hybrid cloud challenges
• Adapting to architecture changes in automated ways
• Handling growth of microservices and infrastructure
Finding a new solution
• We wanted more automation
• Evaluated Datadog and a few other SaaS solutions
• Flexible APIs and lots of predefined integrations
• Fit into our previous model
Enter Datadog
A bit about Datadog
• Nearly 300 out-of-the-box integrations
• Open source agent and libraries
• Well-documented API
• Trillions of data points per day
Monitoring fundamentals:
4 qualities of good metrics
1. Well Understood
2. Sufficiently Granular
• 1 second: 46.67% at 14:06:16
• 1 minute: 36% at 14:06
• 5 minutes: 12% at 14:05
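The effect of coarser granularity can be sketched in a few lines of Python. The sample values below are invented for illustration and are not the data behind the slide. The point is only that averaging over longer windows flattens a short spike.

```python
from statistics import mean


def downsample(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# Hypothetical CPU readings, one per second for five minutes:
# a quiet 5% baseline with a 10-second burst at 95%.
per_second = [5.0] * 300
per_second[130:140] = [95.0] * 10

per_minute = downsample(per_second, 60)
per_five_minutes = downsample(per_second, 300)

print(max(per_second))        # peak seen at 1-second granularity
print(max(per_minute))        # peak seen after 1-minute averaging
print(max(per_five_minutes))  # peak seen after 5-minute averaging
```

With these made-up numbers the burst is obvious at 1-second resolution and barely visible once it is averaged over five minutes.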
3. Tagged & Filterable
How many requests per second…
... On my nodes running application:postgresql
... That are part of role:accounting-app
... In region:us-east-1
... By availability-zone
And show me any that aren’t acting like the others
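The chain of questions above amounts to filtering on tags, grouping by one tag, and then looking for the odd one out. A minimal Python sketch, assuming invented data points and a deliberately naive outlier rule (distance from the median), which stands in for whatever outlier detection the monitoring platform actually provides:

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical data points: (tags, requests per second).
points = [
    ({"application": "postgresql", "role": "accounting-app",
      "region": "us-east-1", "availability-zone": "us-east-1a"}, 120.0),
    ({"application": "postgresql", "role": "accounting-app",
      "region": "us-east-1", "availability-zone": "us-east-1a"}, 124.0),
    ({"application": "postgresql", "role": "accounting-app",
      "region": "us-east-1", "availability-zone": "us-east-1b"}, 118.0),
    ({"application": "postgresql", "role": "accounting-app",
      "region": "us-east-1", "availability-zone": "us-east-1c"}, 30.0),
    ({"application": "postgresql", "role": "accounting-app",
      "region": "us-west-2", "availability-zone": "us-west-2a"}, 500.0),
    ({"application": "nginx", "role": "web-app",
      "region": "us-east-1", "availability-zone": "us-east-1a"}, 900.0),
]


def query(data, filters, group_by):
    """Keep points whose tags match every filter, then average per group."""
    groups = defaultdict(list)
    for tags, value in data:
        if all(tags.get(key) == wanted for key, wanted in filters.items()):
            groups[tags.get(group_by)].append(value)
    return {group: mean(values) for group, values in groups.items()}


def outliers(by_group, tolerance=0.5):
    """Groups further than `tolerance` (as a fraction) from the median."""
    middle = median(by_group.values())
    return sorted(g for g, v in by_group.items()
                  if abs(v - middle) > tolerance * middle)


by_az = query(
    points,
    {"application": "postgresql", "role": "accounting-app",
     "region": "us-east-1"},
    group_by="availability-zone",
)
print(by_az)
print(outliers(by_az))
```

None of this works unless every metric carries the tags in the first place, which is why tagging is listed as a quality of the metric itself.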
4. Long-lived
How long?
• AWS CloudWatch
  • 3 hours at 1 second granularity
  • 15 days at 1 minute granularity
  • 63 days at 5 minute granularity
  • 15 months at 1 hour granularity
• Datadog
  • 15 months at 1 second granularity
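The retention tiers above can be read as a lookup: given the age of a data point, what is the finest resolution still available? A small Python sketch, assuming 15 months is approximated as 455 days; the function and constant names are hypothetical.

```python
from datetime import timedelta

# Retention tiers as listed on the slide: (maximum age, finest granularity kept).
# Assumption: "15 months" is approximated as 455 days.
CLOUDWATCH_TIERS = [
    (timedelta(hours=3), timedelta(seconds=1)),
    (timedelta(days=15), timedelta(minutes=1)),
    (timedelta(days=63), timedelta(minutes=5)),
    (timedelta(days=455), timedelta(hours=1)),
]
DATADOG_TIERS = [
    (timedelta(days=455), timedelta(seconds=1)),
]


def finest_granularity(tiers, age):
    """Return the finest granularity kept for data of this age, or None if expired."""
    for max_age, granularity in tiers:
        if age <= max_age:
            return granularity
    return None


a_month_ago = timedelta(days=30)
print(finest_granularity(CLOUDWATCH_TIERS, a_month_ago))
print(finest_granularity(DATADOG_TIERS, a_month_ago))
```

For a point that is a month old, the first lookup falls into the 5-minute tier while the second still returns 1 second.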
[Chart: a metric plotted over two work weeks, M T W Th F, M T W Th F]
What happened on Tuesday?
Outage or a holiday?
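Answering that question means comparing a day against the same weekday in an earlier week, which only works if the history is still around. A rough Python sketch with invented daily totals and an arbitrary 25% cut-off, both of which are assumptions for illustration:

```python
DAYS = ["M", "T", "W", "Th", "F"]

# Hypothetical daily totals for two consecutive work weeks.
last_week = [100, 102, 98, 101, 99]
this_week = [101, 40, 99, 100, 98]


def week_over_week(previous, current):
    """Fractional change per weekday relative to the previous week."""
    return {day: (cur - prev) / prev
            for day, prev, cur in zip(DAYS, previous, current)}


changes = week_over_week(last_week, this_week)
# Flag any weekday that moved more than 25% against the week before.
suspicious = [day for day, change in changes.items() if abs(change) > 0.25]
print(suspicious)
```

The comparison can flag Tuesday, but deciding between an outage and a holiday still takes context that the metric alone does not carry.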
A good fit for Fandango
• Integrations with all the technologies at Fandango
• Easy to send metrics from on-prem and AWS
• Single pane of glass for business and system monitoring
• Easily automated
DEPLOYING
Deploying
• Datadog agents are deployed via Puppet
• Monitors, Alerts, and Timeboards are deployed via Terraform
• Not all hosts run agents; for those we rely on CloudWatch metrics
Agent Deployment (Puppet code snippet)
# Install and run the Datadog agent, picking the classes that match the OS.
class fandango_datadog {
  if $::operatingsystem == 'windows' {
    include fandango_datadog::windows::package
    include fandango_datadog::windows::service
  }
  else {
    include fandango_datadog::linux::package
    include fandango_datadog::linux::config
    include fandango_datadog::linux::service
  }
}
Agent Deployment (Puppet code snippet)
# Production Cassandra nodes get the Cassandra integration config.
node /^prd-cass[0-9]{3}\.fandango\.aws$/ inherits default-prd {
  fandango_datadog::integration { 'cassandra':
    erb_template => 'fandango_datadog/cassandra.yaml.erb',
    cluster_name => 'fandango_cassandra',
    port         => '19096',
  }
}
Agent Deployment (Puppet code snippet)
instances:
  - host: localhost
    port: <%= @port %>
    user: <%= @username %>
    password: <%= @password %>
    name: <%= @cluster_name %>
init_config:
  conf:
    - include:
        domain: org.apache.cassandra.metrics
    ...
Agent Deployment (recap)
• Next time Puppet runs…
• …the Datadog agent will install
• …cassandra.yaml will be placed into the proper config directory for the Datadog agent
• …dd-agent will receive a HUP from Puppet to start reading the new cassandra.yaml file
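The templating step in the recap can be illustrated outside Puppet. The Python sketch below uses `string.Template` as a stand-in for ERB and a shortened, hypothetical version of the template. It only shows how the node's parameters end up in the rendered file and is not how Puppet itself renders templates.

```python
from string import Template

# Stand-in for cassandra.yaml.erb: $name placeholders replace the ERB tags.
# Shortened on purpose; the user and password fields are left out because
# the node definition shown earlier does not pass them.
CASSANDRA_TEMPLATE = Template(
    "instances:\n"
    "  - host: localhost\n"
    "    port: $port\n"
    "    name: $cluster_name\n"
)


def render(params):
    """Fill the template; substitute() raises KeyError if a value is missing."""
    return CASSANDRA_TEMPLATE.substitute(params)


rendered = render({"port": "19096", "cluster_name": "fandango_cassandra"})
print(rendered)
```

Failing loudly on a missing parameter is a deliberate choice here: a half-rendered integration file is harder to notice than a failed run.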
Monitor Deployment (Terraform code snippet)
module "datadog_integration" {
  source          = "./modules/datadog"
  datadog_api_key = "${module.secrets.datadog_api_key}"
  datadog_app_key = "${module.secrets.datadog_app_key}"
}
Monitor Deployment (Terraform code snippet)
resource "datadog_monitor" "cpu_check" {
  name               = "Global - CPU Usage"
  type               = "metric alert"
  message            = "CPU is high on {host}! @slack-fd-alerts"
  escalation_message = "CPU is STILL high on {host}! @slack-fd-alerts"
  query              = "avg(last_1h):avg:system.cpu.user{environment:prd} by {host} > 85"
  thresholds {
    ok       = 0
    warning  = 85
    critical = 95
  }
  ...
}
Monitor Deployment (terraform apply)
module.base.datadog_monitor.base_services: Creating...
  name:                "" => "Global - CPU Usage"
  message:             "" => "CPU is high on {host}! @slack-fd-alerts"
  escalation_message:  "" => "CPU is STILL high on {host}! @slack-fd-alerts"
  query:               "" => "avg(last_1h):avg:system.cpu.user{environment:prd} by {host} > 85"
  thresholds.warning:  "" => "85"
  thresholds.critical: "" => "95"
module.base.datadog_monitor.base_services: Creation complete (ID: 3054683)
Monitor Deployment (it’s live!)
Monitor Deployment (recap)
• We wrote some Terraform code…
• …the code first set up the API and APP key access to Datadog
• …then we wrote code to check for CPU usage above 85%
• …then we executed terraform apply, and our monitor is now live!
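What the monitor checks can be modeled in a few lines: average each host's CPU over the evaluation window, then compare it with the warning and critical thresholds. This Python sketch is a simplified model under those assumptions, not Datadog's evaluation engine, and the host names and samples are made up.

```python
from statistics import mean

WARNING, CRITICAL = 85, 95  # thresholds from the monitor definition


def evaluate(samples_by_host):
    """Classify each host by its average CPU over the evaluation window."""
    states = {}
    for host, samples in samples_by_host.items():
        average = mean(samples)
        if average > CRITICAL:
            states[host] = "critical"
        elif average > WARNING:
            states[host] = "warning"
        else:
            states[host] = "ok"
    return states


# Hypothetical system.cpu.user samples from the last hour, grouped by host.
last_hour = {
    "prd-web001": [40, 55, 60],
    "prd-web002": [88, 90, 86],
    "prd-cass001": [97, 99, 96],
}
print(evaluate(last_hour))
```

Grouping by host is what lets one monitor definition cover every production machine and still report which one is in trouble.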
ALERTING
Alerting
• Alerts are part of the Terraform code
• Separation of “alerts” vs. “notifications”
• Anomaly detection
• Slack integration
• PagerDuty integration
• Email & Slack distributions for notifications
Alerting
resource "datadog_monitor" "cpu_check" {
  ...
  message            = "CPU is high on {host}! @slack-fd-alerts"
  escalation_message = "CPU is STILL high on {host}! @slack-fd-alerts"
  thresholds {
    ok       = 0
    warning  = 85
    critical = 95
  }
  notify_no_data    = true
  renotify_interval = 60
}
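The interplay of message, escalation_message, and renotify_interval can be sketched as a timeline. The Python below is a simplified model that assumes the first notification goes out when the alert triggers and the escalation message is repeated every renotify_interval minutes while the alert stays open. The function name and host are hypothetical.

```python
RENOTIFY_INTERVAL = 60  # minutes, from the monitor definition


def notifications(minutes_in_alert, host):
    """List (minute, text) pairs sent while a host stays in alert."""
    sent = [(0, f"CPU is high on {host}! @slack-fd-alerts")]
    for minute in range(RENOTIFY_INTERVAL, minutes_in_alert + 1, RENOTIFY_INTERVAL):
        sent.append((minute, f"CPU is STILL high on {host}! @slack-fd-alerts"))
    return sent


for minute, text in notifications(150, "prd-web002"):
    print(minute, text)
```

In this model an alert that stays open for two and a half hours produces one initial message and two escalations.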
Next steps
• Service inheritance dashboards
• ChatOps
• Increase global dashboards
• Additional business KPI metrics
QUESTIONS?

Monitoring as Code: Getting to Monitoring-Driven Development - DEV314 - re:Invent 2017

  • 1. AWS re:Invent Monitoring as Code G e t t i n g t o M o n i t o r i n g - D r i v e n D e v e l o p m e n t D E V 3 1 4
  • 2. Background • Adam Kane – Director of Engineering @ • Operating multiple businesses and varied tech stacks • Hybrid cloud environments
  • 3. Background • The ultimate digital network for all things movies • Our portfolio reaches more than 60 million unique visitors per month
  • 4. Agenda • History of infrastructure & monitoring at Fandango • Problem Space • Datadog • Deploying • Monitoring & Alerting • Next steps
  • 5. History • Primarily datacenter-centric services • Manual monitoring and alert configurations • Traditional tools (Nagios, CloudWatch, etc.) • Eventual migration to Sensu • Move to hybrid cloud
  • 6. Problem Space • Commonality in monitoring and alerting platforms • Hybrid cloud challenges • Adapting to architecture changes in automated ways • Handling growth of microservices and infrastructure
  • 7. Finding a new solution • We wanted more automation • Evaluated Datadog and a few other SaaS solutions • Flexible APIs and lots of predefined integrations • Fit into our previous model
  • 10. A bit about Datadog • Nearly 300 out-of-the-box integrations • Open source agent and libraries • Well-documented API • Trillions of data points per day
  • 18. 1 second: 46.67% at 14:06:16 • 1 minute: 36% at 14:06 • 5 minutes: 12% at 14:05
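The three readings above show the same moment reported at three granularities: the wider the averaging window, the flatter a short spike looks. A minimal Python sketch of that effect follows. The sample data and the `rollup_mean` helper are made up for illustration and are not the data behind the slides.

```python
from statistics import mean


def rollup_mean(samples, window):
    """Average consecutive per-second samples in fixed-size windows.

    `window` is the number of seconds per bucket. A trailing partial
    bucket, if any, is averaged over the samples it actually contains.
    """
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# Hypothetical data: 5 minutes of CPU at 10%, with one 20-second burst at 90%.
per_second = [10.0] * 300
per_second[120:140] = [90.0] * 20

peak_1s = max(per_second)                    # peak at full resolution
peak_1m = max(rollup_mean(per_second, 60))   # peak after 1-minute averaging
peak_5m = max(rollup_mean(per_second, 300))  # peak after 5-minute averaging

print(peak_1s, round(peak_1m, 2), round(peak_5m, 2))
```

In this made-up series the burst is obvious at 1-second resolution, muted at 1 minute, and nearly invisible at 5 minutes, which is the argument for keeping metrics sufficiently granular.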
  • 19. 3. Tagged & Filterable
  • 28. How many requests per second… • ... On my nodes running application:postgresql • ... That are part of role:accounting-app • ... In region:us-east-1 • ... By availability-zone? • And show me any that aren’t acting like the others
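The question built up on the tagging slides can be read as a filter on key:value tags, then a group-by, then an outlier check. The Python sketch below models that over in-memory data. The hosts, values, helper names, and the median-based outlier rule are all hypothetical; this is not Datadog's query engine or its outlier algorithm.

```python
from collections import defaultdict
from statistics import median


def point(host, rps, application, role, region, az):
    """Build one hypothetical sample: requests per second plus its tags."""
    tags = {"application": application, "role": role,
            "region": region, "availability-zone": az}
    return {"host": host, "rps": rps, "tags": tags}


samples = [
    point("db1", 100.0, "postgresql", "accounting-app", "us-east-1", "us-east-1a"),
    point("db2", 110.0, "postgresql", "accounting-app", "us-east-1", "us-east-1a"),
    point("db3", 105.0, "postgresql", "accounting-app", "us-east-1", "us-east-1b"),
    point("db4", 20.0, "postgresql", "accounting-app", "us-east-1", "us-east-1b"),
    point("db5", 95.0, "postgresql", "accounting-app", "us-west-2", "us-west-2a"),
    point("web1", 300.0, "nginx", "accounting-app", "us-east-1", "us-east-1a"),
]


def filter_by_tags(points, wanted):
    """Keep points whose tags contain every key:value pair in `wanted`."""
    return [p for p in points
            if all(p["tags"].get(k) == v for k, v in wanted.items())]


def sum_by_tag(points, key):
    """Sum requests per second, grouped by the value of one tag."""
    totals = defaultdict(float)
    for p in points:
        totals[p["tags"].get(key, "untagged")] += p["rps"]
    return dict(totals)


def unlike_the_others(points, tolerance=0.5):
    """Hosts whose value differs from the group median by more than `tolerance` (a fraction)."""
    mid = median(p["rps"] for p in points)
    return [p["host"] for p in points if abs(p["rps"] - mid) > tolerance * mid]


selected = filter_by_tags(samples, {"application": "postgresql",
                                    "role": "accounting-app",
                                    "region": "us-east-1"})
by_zone = sum_by_tag(selected, "availability-zone")
odd_hosts = unlike_the_others(selected)
print(by_zone, odd_hosts)
```

Each extra tag narrows the selection without changing how the metric is collected, which is why tagging at the source matters.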
  • 30. How long? • AWS CloudWatch: 3 hours at 1-second granularity; 15 days at 1-minute granularity; 63 days at 5-minute granularity; 15 months at 1-hour granularity • Datadog: 15 months at 1-second granularity
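The retention figures above amount to a lookup from data age to the finest granularity still available. A small Python sketch of that lookup, using the tiers exactly as listed on the slide. The `finest_granularity` helper is hypothetical, and "15 months" is taken as 455 days here, which is an assumption for the Datadog row.

```python
from datetime import timedelta

# (maximum data age, finest granularity kept), as listed on the slide.
CLOUDWATCH_TIERS = [
    (timedelta(hours=3), timedelta(seconds=1)),
    (timedelta(days=15), timedelta(minutes=1)),
    (timedelta(days=63), timedelta(minutes=5)),
    (timedelta(days=455), timedelta(hours=1)),  # "15 months" taken as 455 days
]
DATADOG_TIERS = [
    (timedelta(days=455), timedelta(seconds=1)),  # same 455-day assumption
]


def finest_granularity(age, tiers):
    """Return the finest granularity available for data of this age, or None if it has expired."""
    for max_age, granularity in tiers:
        if age <= max_age:
            return granularity
    return None


month_old = timedelta(days=30)
print(finest_granularity(month_old, CLOUDWATCH_TIERS))  # 5-minute data
print(finest_granularity(month_old, DATADOG_TIERS))     # 1-second data
```

For a question like "what happened a month ago?", the tiers decide whether a short spike is still visible at all.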
  • 33. [Chart: two work weeks, M T W Th F] What happened on Tuesday? Outage or a holiday?
  • 34. A good fit for Fandango • Integrations with all the technologies at Fandango • Easy to send metrics from on-prem and AWS • Single pane of glass for business and system monitoring • Easily automated
  • 36. Deploying • Datadog agents are deployed via Puppet • Monitors, Alerts, and Timeboards are deployed via Terraform • Not all hosts run agents (CloudWatch metrics cover those)
  • 37. Agent Deployment (puppet code snippet)
      # Pick the platform-specific agent classes.
      class fandango_datadog {
        if $::operatingsystem == 'windows' {
          include fandango_datadog::windows::package
          include fandango_datadog::windows::service
        } else {
          include fandango_datadog::linux::package
          include fandango_datadog::linux::config
          include fandango_datadog::linux::service
        }
      }
  • 38. Agent Deployment (puppet code snippet)
      # Match production Cassandra nodes; dots are escaped so they match literally.
      node /^prd-cass[0-9]{3}\.fandango\.aws$/ inherits default-prd {
        fandango_datadog::integration { 'cassandra':
          erb_template => 'fandango_datadog/cassandra.yaml.erb',
          cluster_name => 'fandango_cassandra',
          port         => '19096',
        }
      }
  • 39. Agent Deployment (puppet code snippet)
      instances:
        - host: localhost
          port: <%= @port %>
          user: <%= @username %>
          password: <%= @password %>
          name: <%= @cluster_name %>
      init_config:
        conf:
          - include:
              domain: org.apache.cassandra.metrics
              ...
  • 43. Agent Deployment (recap) • Next time Puppet runs… • …the Datadog agent will install • …cassandra.yaml will be placed into the proper config directory for the Datadog agent • …dd-agent will receive a HUP from Puppet to start reading the new cassandra.yaml file
  • 44. Monitor Deployment (Terraform code snippet)
      module "datadog_integration" {
        source          = "./modules/datadog"
        datadog_api_key = "${module.secrets.datadog_api_key}"
        datadog_app_key = "${module.secrets.datadog_app_key}"
      }
  • 45. Monitor Deployment (Terraform code snippet)
      resource "datadog_monitor" "cpu_check" {
        name               = "Global - CPU Usage"
        type               = "metric alert"
        message            = "CPU is high on {host}! @slack-fd-alerts"
        escalation_message = "CPU is STILL high on {host}! @slack-fd-alerts"
        query              = "avg(last_1h):avg:system.cpu.user{environment:prd} by {host} > 85"

        thresholds {
          ok       = 0
          warning  = 85
          critical = 95
        }
        ...
  • 46. Monitor Deployment (terraform apply)
      module.base.datadog_monitor.base_services: Creating...
        name:                "" => "Global - CPU Usage"
        message:             "" => "CPU is high on {host}! @slack-fd-alerts"
        escalation_message:  "" => "CPU is STILL high on {host}! @slack-fd-alerts"
        query:               "" => "avg(last_1h):avg:system.cpu.user{environment:prd} by {host} > 85"
        thresholds.warning:  "" => "85"
        thresholds.critical: "" => "95"
      module.base.datadog_monitor.base_services: Creation complete (ID: 3054683)
  • 51. Monitor Deployment (recap) • We wrote some Terraform code… • …the code first set up the API and APP key access to Datadog • …then we wrote code to check for CPU usage above 85% • …executed terraform apply and our monitor is now live!
  • 53. Alerting • Alerts are part of the Terraform code • Separation of “alerts” vs. “notifications” • Anomaly detection • Slack integration • PagerDuty integration • Email & Slack distributions for notifications
  • 55. Alerting
      resource "datadog_monitor" "cpu_check" {
        ...
        message            = "CPU is high on {host}! @slack-fd-alerts"
        escalation_message = "CPU is STILL high on {host}! @slack-fd-alerts"

        thresholds {
          ok       = 0
          warning  = 85
          critical = 95
        }

        notify_no_data    = true
        renotify_interval = 60
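The alerting settings above combine two ideas: thresholds that map a value to a state, and a re-notify interval that repeats the message while the problem persists. The Python sketch below is a simplified model of that behaviour using the slide's numbers (85, 95, 60 minutes). The function names and the exact rules are assumptions for illustration, not Datadog's evaluation logic.

```python
def classify(value, warning=85.0, critical=95.0):
    """Map a metric value to a monitor state; None stands for missing data."""
    if value is None:
        return "no data"
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def should_notify(state, previous_state, minutes_since_last, renotify_interval=60):
    """Notify on any state change, or again once the interval has passed while not ok."""
    if state != previous_state:
        return True
    return state != "ok" and minutes_since_last >= renotify_interval


# Hypothetical CPU readings from successive evaluations.
for cpu in (50.0, 90.0, 96.0, None):
    print(cpu, classify(cpu))
```

Keeping the thresholds and the re-notify interval next to the monitor definition means a change to alerting behaviour goes through the same review as any other code change.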
  • 56. Next steps • Service inheritance dashboards • ChatOps • Increase global dashboards • Additional business KPI metrics