1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Paul Ferguson
Senior Consultant, Professional Services, Amazon Web Services
Chris Kozlowski
Senior Technical Account Manager, Amazon Web Services
Building a Monitoring Plan
2.
Who we are
Paul Ferguson – Senior Consultant, London
Chris Kozlowski – Senior Technical Account Manager, US East Coast
3.
Buzzword Bingo
Observability
Operational Intelligence
‘No Ops’
Composable monitoring
Event correlation
Signal to noise ratio
Alarm fatigue
Single pane of glass
E-bonding
4.
What we’ll discuss…
• Who needs to be involved, and why
• What to Monitor
• What makes for an effective monitoring rule
• What tools to use and when
• Metrics, business outcomes, improvements
5.
Operational Challenges will always exist…
But with proper planning and design, you will be ready for them.
6.
So Why Monitor In the First Place?
To Gain Insights!
• Customer Experience
• Performance & Cost
• Trends
• Troubleshooting & Remediation
• Learning & Improvement
7.
What Goes Into a Monitoring Plan?
• People
• System Knowledge
• Alerts
• Tools
• Actions
8.
People
9.
Roles and Responsibilities
• Operations – First Responders, Triage
• Developers/Engineers – Define normal operation
• Management – Tasked with making business decisions in response to events
10.
Monitoring Plan
11.
System Knowledge
12.
Categories of Insight (FCAPS)
• Faults
• Configuration
• Accounting
• Performance
• Security
13.
Things to Monitor
• Operating Systems, Applications, Databases, Networking
• AWS Foundation Services: Compute, Storage, Database, Networking
• AWS Global Infrastructure: Regions, Availability Zones, Edge Locations
Example workload: Internet Gateway, Elastic Load Balancer, Web Servers (EC2 w/ Auto Scaling), RDS
14.
Monitoring Plan
15.
Crafting Alerts
16.
17.
Anatomy of an Effective Alert
• FCAPS Category: Performance
• Element: Web Server
• Custom Alert: ALARM when site latency >= 2 s for 1 minute (Amazon CloudWatch)
• Owner
• Runbook
• Test Action
[Diagram: Elastic Load Balancing, Auto Scaling, EC2 instances]
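The example alert on this slide (site latency >= 2 s for 1 minute) could be expressed as a CloudWatch alarm. The following is a minimal sketch, not the presenters' configuration: it only builds the parameter dictionary accepted by boto3's `put_metric_alarm`, and it assumes a Classic Load Balancer (namespace `AWS/ELB`, metric `Latency`, reported in seconds). The load balancer name, SNS topic ARN, and runbook URL are hypothetical placeholders.

```python
def latency_alarm_params(load_balancer_name, sns_topic_arn, runbook_url):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call.

    All three arguments are hypothetical placeholders supplied by the caller.
    """
    return {
        "AlarmName": f"{load_balancer_name}-site-latency-high",
        # A descriptive alert: say what happened and where the runbook lives.
        "AlarmDescription": (
            "Average site latency >= 2 s for 1 minute. "
            f"Runbook: {runbook_url}"
        ),
        "Namespace": "AWS/ELB",    # assumption: Classic Load Balancer
        "MetricName": "Latency",   # reported in seconds
        "Dimensions": [
            {"Name": "LoadBalancerName", "Value": load_balancer_name}
        ],
        "Statistic": "Average",
        "Period": 60,              # one 60-second period...
        "EvaluationPeriods": 1,    # ...evaluated once, i.e. "for 1 minute"
        "Threshold": 2.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # who or what gets notified
    }


params = latency_alarm_params(
    "example-web-elb",
    "arn:aws:sns:us-east-1:123456789012:example-ops-topic",
    "https://wiki.example.com/runbooks/site-latency",
)
# With AWS credentials configured, the alarm could then be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the runbook link in the alarm description is one way to tie the Runbook and Owner elements to the alert itself; the actual `put_metric_alarm` call is left as a comment so the sketch runs without AWS access.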
18.
The Drive Towards Achieving Business Insight
• Metrics: CPU Wait %, Disk Queue Depth
• Operational Outcomes: Webpage Response Time, Job Run Length
• Business Insight! Customer Sentiment, SLAs
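To illustrate how a low-level measurement can roll up into something the business cares about, here is a small sketch that turns raw page response times into an SLA-style figure. The 2-second target and the sample values are illustrative assumptions, not numbers from the presentation.

```python
def sla_compliance(response_times_s, target_s=2.0):
    """Return the percentage of requests answered faster than target_s."""
    if not response_times_s:
        raise ValueError("need at least one sample")
    within = sum(1 for t in response_times_s if t < target_s)
    return 100.0 * within / len(response_times_s)


# Hypothetical response-time samples, in seconds.
samples = [0.4, 0.7, 1.1, 2.5, 0.9, 3.2, 0.6, 1.8, 0.5, 1.0]
compliance = sla_compliance(samples)
print(f"{compliance:.1f}% of requests met the 2 s target")
```

The raw samples are the operational outcome; the single percentage is the kind of figure that can be compared against an SLA.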
19.
Alerting Best Practices
• Break alert crafting into batches, highest priority first
• Refine quickly
• Every alert should prompt an action
• Write descriptive alerts to aid prompt resolution
• Don’t rely on email alone
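Several of these practices can be checked mechanically. The sketch below assumes a hypothetical `Alert` record (the field names and the five-word description threshold are invented for illustration) and reports which practices an alert definition misses.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    description: str = ""
    action: str = ""                               # what the responder should do
    channels: list = field(default_factory=list)   # e.g. ["email", "pager"]


def review_alert(alert):
    """Return a list of best-practice problems found in one alert."""
    problems = []
    if not alert.action.strip():
        problems.append("no action defined")
    # Arbitrary threshold: fewer than five words is treated as too terse.
    if len(alert.description.split()) < 5:
        problems.append("description too short to aid resolution")
    # True when channels is empty or contains only "email".
    if set(alert.channels) <= {"email"}:
        problems.append("no notification channel other than email")
    return problems


good = Alert(
    name="site-latency-high",
    description="Site latency at or above 2 s for 1 minute on the web tier",
    action="Follow the site latency runbook",
    channels=["email", "pager"],
)
weak = Alert(name="cpu", description="CPU high", channels=["email"])

for alert in (good, weak):
    print(alert.name, review_alert(alert))
```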
20.
Monitoring Plan
System Knowledge
Component      Area
IGW            Faults
ELB            Faults
ELB            Performance
Web Servers    Faults
Web Servers    Performance
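One way to keep a plan like this reviewable is to hold it as data. The sketch below encodes the component and area rows from the table and adds hypothetical `alert`, `tool`, and `action` fields (assumptions for illustration, mirroring the other sections of the plan) so that incomplete rows can be listed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlanRow:
    component: str
    area: str                      # FCAPS category
    alert: Optional[str] = None    # hypothetical fields from here down
    tool: Optional[str] = None
    action: Optional[str] = None


plan = [
    PlanRow("IGW", "Faults"),
    PlanRow("ELB", "Faults"),
    PlanRow("ELB", "Performance"),
    PlanRow("Web Servers", "Faults"),
    PlanRow(
        "Web Servers",
        "Performance",
        alert="Site latency >= 2 s for 1 minute",
        tool="Amazon CloudWatch",
        action="Follow runbook",
    ),
]


def incomplete_rows(rows):
    """Return (component, area) pairs still missing an alert, tool, or action."""
    return [
        (r.component, r.area)
        for r in rows
        if not (r.alert and r.tool and r.action)
    ]


for component, area in incomplete_rows(plan):
    print(f"TODO: {component} / {area}")
```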
21.
Tools to use
22.
23.
How to select a good tool
• Let your requirements dictate your tools
• Start with the tools you have
• Consider using native tools on the platform
• Integrate tools - ergonomics matter
24.
Example Workload
• AWS CloudTrail – logging of API calls
• AWS Config Rules – configuration
• Amazon CloudWatch – resources
• APM for customer experience / synthetic monitoring
25.
Dashboards
26.
System Knowledge
Component      Area
IGW            Faults
ELB            Faults
ELB            Performance
Web Servers    Faults
Web Servers    Performance
27.
Actions and Improvements
28.
Actions
• Every alert and event should end in an action to be taken
• Escalations to another person should end with an action to be taken by them
• Actions are not only technical. Plan for what business decisions might need to be made
• Runbooks and Playbooks
29.
Improvement and Automation
Whenever alarms and alerts fire, identify whether they can be improved.
• Can the alert be made more accurate, descriptive, or timely?
• Can the remediation be improved?
• Establish processes with people first, then automate
• Identify routine or standard changes as early candidates for automation
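As a rough illustration of spotting automation candidates, the sketch below counts how often each remediation was carried out by hand and flags the frequent ones. The log format, the action names, and the threshold of three occurrences are all assumptions made for the example.

```python
from collections import Counter


def automation_candidates(manual_actions, min_occurrences=3):
    """Return actions performed manually at least min_occurrences times,
    most frequent first."""
    counts = Counter(manual_actions)
    return [
        action
        for action, n in counts.most_common()
        if n >= min_occurrences
    ]


# Hypothetical log of remediations performed by on-call staff.
log = [
    "restart web server",
    "clear temp files",
    "restart web server",
    "fail over database",
    "restart web server",
    "clear temp files",
    "clear temp files",
    "restart web server",
]
candidates = automation_candidates(log)
print(candidates)
```

Routine, frequently repeated actions surface first, which matches the advice to establish the process with people before automating it.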
30.
Monitoring Plan
System Knowledge
Component      Area
IGW            Faults
ELB            Faults
ELB            Performance
Web Servers    Faults
Web Servers    Performance
31.
Summary and Next Steps
- Check your monitoring approach
  - Is it user-centric?
  - Are you measuring the right things?
- Write a monitoring plan
- Start monitoring, test and iterate
The reason operations exists is to support the needs of the business.
32.
Thank you!