
DevOps at Scale: How Datadog is using AWS and PagerDuty to Keep Pace with Growth and Improve Incident Resolution


Meeting the demands of ever-changing IT management and security requirements means evolving how you both respond to and resolve incidents. It is critical for organizations to adopt a scalable DevOps solution that integrates with their current monitoring systems and enables collaboration across development and operations teams, reducing the mean time to resolution. PagerDuty works with AWS services such as Amazon CloudWatch to provide rapid incident response with rich, contextual details that let you analyze trends and monitor the performance of your applications and AWS environment.

Published in: Technology

  1. PagerDuty+AWS
     • Thomas Robinson, Solutions Architect, AWS
     • Eric Sigler, Head of DevOps, PagerDuty
     • Chris Hoey, Lead Site Reliability Engineer, Datadog
  2. DevOps on the AWS Cloud
     Thomas Robinson, Solutions Architect, AWS
  3. Traditional Development Models are Obsolete
     • Business is increasingly software-driven
     • End users expect both continuous improvement and stability from applications
     • IT needs to be able to provision infrastructure as rapidly as developers demand it
     • An organization's pace of innovation is largely constrained by its ability to develop applications
  4. DevOps Can Help
     DevOps practices enable companies to innovate at a higher velocity for customers
     Increase:
     • Business agility
     • Application stability
     • Ability to meet customer demand
     • Time spent on innovation
     • Security
     Decrease:
     • Length of development cycles
     • Time to market
     • Deployment failures and rollbacks
     • Time to recover upon failure
     • Operational overhead
  5. DevOps on AWS
     AWS provides on-demand infrastructure resources and tooling built to enable common DevOps practices
     • Infrastructure as Code
     • Microservices
     • Logging and Monitoring
     • Continuous Integration/Continuous Delivery
  6. Infrastructure as Code
     Replace traditional infrastructure provisioning and management with code-based techniques
     • Provision the server, storage, and networking capacity you need on demand
     • Deploy independently, as a single service, or a group of services
     • Make configuration changes repeatable and standardized
     • Build custom templates to provision resources in a controlled and predictable way
     • Use version control to keep track of all changes made to your infrastructure and application stack
  7. Microservices
     Build applications as a set of small services that communicate with other services through APIs
     • Build services around the business capabilities you require
     • Scale up and down as required with virtually no notice
     • Make configuration code changes repeatable and standardized
     • API-driven model enables management of infrastructure with languages typically used in application code
     • Free developers from manually configuring operating systems, system applications, and server software
  8. Logging and Monitoring
     Capture, categorize, and analyze data and logs generated by applications and infrastructure
     • Maintain visibility and auditability of activity in your application infrastructure
     • Assess how application and infrastructure performance impact end-user experience
     • Gain insight into the root causes of problems or unexpected changes
     • Support services that must be available 24/7 as a result of continuous integration/continuous delivery
     • Create alerts based on thresholds you define
  9. Continuous Integration and Continuous Delivery
     Rapidly and reliably build, test, and deploy your applications, while improving quality and reducing time to market
     • Model and visualize your own custom release workflow
     • Automate deployments of new code
     • Improve developer productivity and deliver updates faster
     • Find and address bugs more quickly with more frequent and comprehensive testing
     • Store anything from source code to binaries using existing Git tools
  10. Benefits of DevOps on AWS
     • Get started quickly and pay as you go
     • Automate systems operations
     • Scale without infrastructure constraints
     • Improve visibility and security
     • Leverage fully managed services
  11. PagerDuty
     Eric Sigler, Head of DevOps, PagerDuty
  12. PagerDuty At-a-Glance
     • Trusted by over 8,500 organizations
     • 50 of the Fortune 100
     • Global community: 80 countries, 200,000+ users
     • Founded in 2009
     • Based in San Francisco
     • Cloud-based incident resolution
     • 190+ native integrations
  13. 9000+ Customers
  14. PagerDuty Manages the Complexity
     Tools:
     • Collaboration/Resolution
     • Deployment Tools
     • Monitoring Tools (App, System, Log, Web, Mobile App)
     • Ticketing Tools
     • Public Cloud Services
     Process:
     • On-Call Scheduling
     • Automatic Escalations
     • System and User Efficiency
     People:
     • Developer
     • NOC
     • Help Desk
     • IT Ops
  15. PagerDuty is Built on Best Practice Workflows
     • Triage: Identify What's Wrong
     • Notify: Get on it
     • Mobilize: Mobilize the Experts
     • Collaborate: Diagnose the Problem
     • Resolve: Quick Problem Resolution
     • Learn: Optimize and Prevent
     Commercial response that engages the business
     Visibility that leads to operations command
  16. How Can PagerDuty Help?
     • Lower Costs: Leverage your development and operations resources more efficiently
     • Increase Revenue Growth: Deliver customer experiences more readily and reliably
     • Manage Your IT Transition: PagerDuty can help you move to a more agile full-service ownership practice to deliver better results
     • Unleash Your Developers: Our platform helps developers deliver value more quickly and ensures maximum reliability
     • Get More From Your Existing Platforms: PagerDuty provides full stack visibility to help you optimize your toolchain
  17. "With PagerDuty, we spend less time worrying about on-call, and more time creating product to impact lives." - Panasonic
  18. How Can PagerDuty Help?
     • 500% improvement in MTTR - PICNIC
     • Achieved 99.9% uptime - Pantheon
     • 100% on-time delivery for every product - Jepperson
  19. Common AWS-PagerDuty Use Cases
     • Use Case 1: Response Workflows and Orchestration
     • Use Case 2: Operationalize and Monitor AWS Environments
     • Use Case 3: Accelerate Migration to AWS
  20. Use Case 1: Response Workflows and Orchestration
     • Leverage ChatOps Tools
     • Integrate with Ticketing Systems
     • Configure Workflows
  21. Use Case 2: Operationalize and Monitor AWS Environments
     • Identify patterns, trends, and anomalies
     • View data
     • Monitor infrastructure health
  22. Use Case 3: Accelerate Migration to AWS
     • Create alarms to monitor any Amazon CloudWatch
     • Initiate event and suppression rules
     • Automate IT incident response workflows
  23. Datadog
     Chris Hoey, Lead Site Reliability Engineer
  24. Datadog Overview
     • SaaS-based infrastructure and app monitoring
     • Open source Agent with 200+ integrations
     • Time series data (metrics and events) and tracing (APM)
     • Processing trillions of data points per day
     • Intelligent and actionable alerting
     • Insightful dashboards
     • We're hiring! (www.datadoghq.com/careers/)
  25. Challenges
     • Building a product while ramping up the number of people involved in running it
     • Increase in the number of services while shifting incident management responsibility out to teams
     • Alert fatigue
     • Global growth while maintaining high reliability expectations
  26. How Datadog uses AWS
     • API-focused to allow custom tooling
     • Scale up and down as the need for capacity changes
     • Integration dogfooding
  27. Why Datadog Chose AWS
     • Scalability
     • Born in the cloud
     • Leverage breadth and flexibility of AWS
     • Reliability
  28. SRE Team Growth
     • Early days of Datadog (2010): everyone is on-call
     • As the company grows: team leads, directors, and senior engineers are on-call
     • Broader Engineering gets involved: team-based on-call
     • Dedicated SRE team runs stable services in production
  29. Spreading Team Coverage
  30. Taking On-Call Global
  31. Addressing Alert Fatigue
  32. Why Datadog Chose PagerDuty
     Strong APIs
     • Easy to get started, automation friendly
     • Scales with growth of teams, company, customers
     • Makes custom tooling and analytics trivial
     • Great integration partnership
     Extensive alerting (incident resolution lifecycle) capabilities
     • Layering teams
     • No worries about country-specific telco knowledge
     Robust monitoring analytics
     • Allows us to look into patterns and deal with alert fatigue
  33. Benefits
     • Able to efficiently scale operations from tens of engineers to hundreds
     • Improved productivity through custom alerts and escalation policies
     • Reduced alert fatigue
     • Continuously improving customer experience via sanely managed global on-call coverage
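
The slides mention threshold-based alerts, suppression rules, and layered escalation policies without showing how they fit together. The following is a minimal, self-contained sketch of that idea in plain Python. It does not call the PagerDuty, Datadog, or AWS APIs; the names `should_alert`, `EscalationPolicy`, and the layer names and timeouts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


def should_alert(metric_value, threshold, suppressed_sources, source):
    """Fire only when the threshold is breached and the source is not suppressed."""
    if source in suppressed_sources:
        return False  # a suppression rule silences this source (hypothetical rule shape)
    return metric_value > threshold


@dataclass
class EscalationPolicy:
    """Hypothetical policy: ordered on-call layers as (responder, timeout_minutes)."""
    layers: list

    def responder_at(self, minutes_unacknowledged):
        """Return who should be paged once an incident has gone unacknowledged this long."""
        elapsed = 0
        for responder, timeout in self.layers:
            elapsed += timeout
            if minutes_unacknowledged < elapsed:
                return responder
        # Past the last layer: keep paging the final responder.
        return self.layers[-1][0]


# Illustrative layering only; the tiers and timeouts are assumptions, not Datadog's setup.
policy = EscalationPolicy(
    layers=[("primary on-call", 15), ("team lead", 15), ("SRE team", 30)]
)

if should_alert(metric_value=95, threshold=90, suppressed_sources={"staging"}, source="prod"):
    print(policy.responder_at(0))   # first layer is paged immediately
    print(policy.responder_at(20))  # second layer after the first timeout passes
```

Suppressing noisy sources before paging and escalating through layers only when an incident stays unacknowledged is one simple way to limit alert fatigue while keeping coverage.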
