2. Two major reasons for monitoring
● Reliability
○ Preventing, detecting and resolving incidents
● Continuous Delivery
○ Building the right thing
3. Monitoring as part of development
● Refinement
○ Who do you expect will use the feature?
○ How do you expect the feature will be used?
○ Performance requirements/expectations?
○ Technical dependencies?
○ What monitoring do we need?
4. Monitoring as part of development
● Implementation
● Monitoring
○ 1st/2nd test env
■ Functional testing
● Errors?
■ Performance testing
● As expected?
○ Production
■ Validate expectations
■ Learn
7. What should we monitor (and alert on)?
1. General availability/health: Can we process requests?
2. Performance and errors: Quickly and successfully?
3. Analytics: Are we achieving our goals? Are the customers achieving theirs?
9. Availability
● “99.8%”
● Traditional definition
○ Server/OS availability
○ Network availability: Users can reach the cloud service
● Customer definition
○ Functional availability: The cloud service “works”
● Our definition
○ Users can reach the cloud service, and critical components and dependencies are healthy
● How can we monitor this?
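To put the “99.8%” figure in concrete terms, the arithmetic below converts an availability target into a downtime budget. This is a minimal sketch: the function name, the 365-day year and the 30-day month are assumptions chosen for illustration.

```python
def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Hours of downtime that still meet the availability target over the period."""
    return period_hours * (1 - availability_percent / 100)


HOURS_PER_YEAR = 365 * 24     # 8760, ignoring leap years
HOURS_PER_30_DAYS = 30 * 24   # 720

yearly_budget = allowed_downtime_hours(99.8, HOURS_PER_YEAR)
monthly_budget = allowed_downtime_hours(99.8, HOURS_PER_30_DAYS)

print(f"99.8% allows about {yearly_budget:.1f} hours of downtime per year")
print(f"or about {monthly_budget * 60:.0f} minutes per 30 days")
```

With these assumptions, 99.8% leaves roughly 17.5 hours of downtime per year, or about 86 minutes per 30 days. Which kind of downtime counts against that budget depends on the definition of availability chosen above.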
21. 1) Synthetic monitoring - Summary
● Quick and easy to set up and use
● 5 lines of Python will be required if you need to authenticate
● Only checks one webpage; doesn’t reflect health of the whole system
● Fragile; just looks for HTTP 200 (unless you use more scripting)
● Can only run every 5 minutes
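As an illustration of what “more scripting” could look like, the sketch below checks both the HTTP status and the presence of an expected piece of text. It uses only the Python standard library and is not the AppDynamics scripting API; the function names and the injectable `fetch` parameter are assumptions made so the logic can be tried without a network.

```python
import urllib.request
from typing import Callable, Tuple

# A fetcher takes a URL and returns (HTTP status code, response body as text).
Fetch = Callable[[str], Tuple[int, str]]


def http_fetch(url: str) -> Tuple[int, str]:
    """Default fetcher: plain GET request with a 10 second timeout."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def page_is_healthy(url: str, expected_text: str, fetch: Fetch = http_fetch) -> bool:
    """True only if the page answers 200 AND contains the expected text."""
    try:
        status, body = fetch(url)
    except OSError:
        # urllib's URLError/HTTPError and socket timeouts are all OSError subclasses.
        return False
    return status == 200 and expected_text in body
```

Checking for expected text catches the case where a login or error page is served with HTTP 200, but it still only reflects the health of a single page.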
22. 2) Smarter heartbeat monitoring
● “Users can reach the cloud service, and critical components and dependencies are healthy”
● What are critical components and dependencies?
○ Database? → Critical!
○ Authorization service? → Critical!
○ Background processing job → ?
○ Zip code lookup service → Not critical
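The critical/not critical distinction above can be sketched as a heartbeat that runs every check but only turns red when a critical one fails. All names here are hypothetical, the lambdas stand in for real checks, and using HTTP 503 for “not healthy” is an assumption.

```python
from typing import Callable, Dict, List, Tuple

# (name, is_critical, check function returning True when healthy)
Check = Tuple[str, bool, Callable[[], bool]]


def heartbeat(checks: List[Check]) -> Tuple[int, Dict[str, bool]]:
    """Run every check; only a failing critical check makes the heartbeat fail."""
    results: Dict[str, bool] = {}
    healthy = True
    for name, is_critical, check in checks:
        try:
            ok = bool(check())
        except Exception:
            # A check that raises counts as unhealthy.
            ok = False
        results[name] = ok
        if is_critical and not ok:
            healthy = False
    return (200 if healthy else 503), results


# Placeholder checks: the non-critical zip code lookup is down.
status, details = heartbeat([
    ("database", True, lambda: True),
    ("authorization_service", True, lambda: True),
    ("zip_code_lookup", False, lambda: False),
])
print(status, details)
```

Here the heartbeat still reports 200 because only a non-critical dependency is failing, while the details still show which check failed. Whether the background processing job is registered as critical is left as a decision per service.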
23. 2) Smarter heartbeat monitoring
● How do we know if they are healthy?
○ Database
■ Connect
■ SELECT 1
■ SELECT id FROM table LIMIT 1
Suitable for heartbeat
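The three database checks above can be sketched as one function that runs them in order and reports the first step that fails. SQLite from the Python standard library stands in for the real database here, and the function name and the `employees` table are assumptions for illustration.

```python
import sqlite3
from typing import Callable, Optional


def first_failing_db_step(connect: Callable[[], sqlite3.Connection], table: str) -> Optional[str]:
    """Run connect, SELECT 1 and a table read; return the first failing step, or None."""
    try:
        conn = connect()
    except sqlite3.Error:
        return "connect"
    # The table name is interpolated, so it must be a trusted constant, never user input.
    steps = [
        ("select_1", "SELECT 1"),
        ("table_read", f'SELECT id FROM "{table}" LIMIT 1'),
    ]
    try:
        for name, sql in steps:
            try:
                conn.execute(sql).fetchone()
            except sqlite3.Error:
                return name
        return None
    finally:
        conn.close()


def demo_connect() -> sqlite3.Connection:
    """In-memory database with a hypothetical employees table, for demonstration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY)")
    return conn


print(first_failing_db_step(demo_connect, "employees"))
```

Each step proves a little more than the previous one: the connection shows the server is reachable, SELECT 1 shows it executes queries, and the table read shows the schema and permissions are in place.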
25. 2) Smarter heartbeat monitoring
● How do we know if they are healthy?
○ Authorization service
■ Make synchronous request?
■ Log and check last successful call, ping only if necessary
○ Background processing job (e.g. calculating wagerun, generating report)
■ Log and check last successful run, trigger test payload if necessary
■ If we expect test payload to process fast, wait for it before returning OK
■ If not, return OK optimistically, then NOT OK on later calls if test payload has timed out
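The two strategies above (check the last successful call and ping only if necessary; return OK optimistically and fail later on timeout) can be sketched as follows. Class names, parameters and the injectable clock are assumptions for illustration, and real code would also need thread safety and shared state across instances.

```python
import time
from typing import Callable, Optional


class LastSuccessCheck:
    """Healthy if a real call succeeded recently; otherwise fall back to an active ping."""

    def __init__(self, max_age_seconds: float, ping: Callable[[], bool],
                 clock: Callable[[], float] = time.monotonic):
        self.max_age = max_age_seconds
        self.ping = ping
        self.clock = clock
        self.last_success: Optional[float] = None

    def record_success(self) -> None:
        """Call this from normal application code after each successful request."""
        self.last_success = self.clock()

    def is_healthy(self) -> bool:
        now = self.clock()
        if self.last_success is not None and now - self.last_success <= self.max_age:
            return True  # recent real traffic succeeded, no ping needed
        if self.ping():
            self.last_success = now
            return True
        return False


class OptimisticJobCheck:
    """For slow jobs: trigger a test payload, report OK until it has timed out."""

    def __init__(self, max_age_seconds: float, payload_timeout_seconds: float,
                 trigger_test_payload: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic):
        self.max_age = max_age_seconds
        self.payload_timeout = payload_timeout_seconds
        self.trigger_test_payload = trigger_test_payload
        self.clock = clock
        self.last_success: Optional[float] = None
        self.payload_sent_at: Optional[float] = None

    def record_success(self) -> None:
        """Call this when any run (real or test payload) completes successfully."""
        self.last_success = self.clock()
        self.payload_sent_at = None

    def is_healthy(self) -> bool:
        now = self.clock()
        if self.last_success is not None and now - self.last_success <= self.max_age:
            return True
        if self.payload_sent_at is None:
            self.trigger_test_payload()
            self.payload_sent_at = now
            return True  # optimistic: give the test payload time to finish
        # A test payload is pending: OK until it exceeds its timeout.
        return now - self.payload_sent_at <= self.payload_timeout
```

The design choice in both classes is to let normal traffic prove health whenever possible, so the heartbeat adds load on the dependency only when there is no recent evidence.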
38. Heartbeat monitoring - Summary
● Synthetic monitoring with AppDynamics
○ Maximum frequency is every 5 minutes
○ From 3 locations
○ $123 per year
○ Quick and easy to get started
○ Superficial health assessment (network avail.)
● Heartbeat monitoring with AWS + AppDynamics
○ Maximum frequency is every 10 seconds
○ From 8 locations
○ $81 per year
○ Some design and implementation effort
○ Holistic health assessment (functional avail.)
39. Takeaways
1. Define availability for your service (may change over time!)
2. Implement holistic heartbeat monitoring (starting simple is OK)
3. Configure alerts (incident detection)
4. Configure dashboards (for reporting/analysis/improvement)
43. Business Transactions
● Examples for Visma.net HRM Employee Management
○ Registering a new employee
○ Saving changes to an employee
○ Getting data for an employee
● Examples for Visma.net HRM Payroll
○ Calculating a wagerun (for an organization)
○ Generating a bank payment file for a wagerun
● Defined by URL pattern, or method in application code (doesn’t have to be web-based or user-facing)
○ URL patterns
■ POST /api/employees
■ PUT /api/employees/<id>
■ GET /api/employees/<id>
○ Methods
■ WageRunManager.RunForOrg
■ GenerateWageRunPayslips.HandleEvent
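Defining a business transaction by URL pattern can be sketched as a small rule table that maps an HTTP method and path to a transaction name. This is not how AppDynamics is configured; the rule table, the regular expressions, and the pairing of each URL with a transaction name (inferred from the order of the examples above) are assumptions.

```python
import re
from typing import Optional

# (HTTP method, path pattern, business transaction name)
TRANSACTION_RULES = [
    ("POST", re.compile(r"^/api/employees$"), "Registering a new employee"),
    ("PUT", re.compile(r"^/api/employees/[^/]+$"), "Saving changes to an employee"),
    ("GET", re.compile(r"^/api/employees/[^/]+$"), "Getting data for an employee"),
]


def business_transaction(method: str, path: str) -> Optional[str]:
    """Return the business transaction a request belongs to, or None if unmatched."""
    for rule_method, pattern, name in TRANSACTION_RULES:
        if method.upper() == rule_method and pattern.match(path):
            return name
    return None


print(business_transaction("PUT", "/api/employees/42"))
```

Grouping requests this way means response times and error rates are reported per business action rather than per individual URL, so `/api/employees/42` and `/api/employees/43` count toward the same transaction.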
46. Refinement v1
● Today, claims can only be deleted by managers
● Managers and Payroll Administrators should be able to reject a claim, which sends it back to the employee
● Monitoring
○ No changes to availability monitoring
○ Add monitoring for performance and errors
65. Takeaways
1. Identify critical business transactions
1.1. Start small, then expand continuously!
2. Configure alerts (for anomaly detection)
2.1. Consider response time and error rate
2.2. Don’t send a critical alert unless human action is required
2.3. Discussing alerts in Slack 💕
3. Configure dashboards (for reporting/analysis/improvement)
3.1. Look at “top 10 lists” to identify possible quick wins
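The alerting points above (consider response time and error rate; only send a critical alert when human action is required) can be sketched as a small rule evaluation. The thresholds, field names and the three alert levels are assumptions, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Measurements for one business transaction over a time window."""
    requests: int
    errors: int
    p95_response_ms: float


@dataclass
class AlertRule:
    transaction: str
    max_p95_ms: float
    max_error_rate: float
    needs_human_action: bool  # should a breach page someone?


def evaluate(rule: AlertRule, stats: WindowStats) -> str:
    """Return "ok", "warning" or "critical" for the given window."""
    if stats.requests == 0:
        return "ok"  # no traffic, nothing to judge
    error_rate = stats.errors / stats.requests
    breached = (stats.p95_response_ms > rule.max_p95_ms
                or error_rate > rule.max_error_rate)
    if not breached:
        return "ok"
    # Only page a human when someone actually has to do something.
    return "critical" if rule.needs_human_action else "warning"


rule = AlertRule("Calculating a wagerun", max_p95_ms=2000,
                 max_error_rate=0.05, needs_human_action=True)
print(evaluate(rule, WindowStats(requests=100, errors=10, p95_response_ms=500)))
```

A “warning” could be routed to a chat channel for discussion, while “critical” is reserved for paging.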
67. Identify goals and relevant metrics
● Visma-oriented
○ Goal: Become the leader in the Danish market
■ Metric: Number of payslips generated per month (for DK customers)
○ Goal: Increase cross-sales
■ Metric: Number of customers who activate the invoicing module
● Customer-oriented
○ Goal: Schools want to enable efficient communication with parents
■ Metric: Messages sent, by user role
■ Metric: Inbox size, by user role
○ Goal: Enterprises want an efficient expense management process
■ Metric: Rejected expenses, by reason
■ Metric: Rejected expenses, by industry
69. Refinement v2
● Today, claims can only be deleted by managers
● Managers and Payroll Administrators should be able to reject a claim, which sends it back to the employee
● Assumptions
○ ~60% of rejections will be by managers, ~40% will be by Payroll Administrators
○ Rejections by managers will often be done on a mobile device, while PAs use PCs
○ Most common reason for rejection will be incorrect or insufficient documentation
● Monitoring
○ No changes to availability monitoring
○ Add monitoring for performance and errors
○ Add analytics: Rejections by role, rejections by device, rejections by reason
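The analytics above can be sketched as counting rejection events by an attribute and comparing the observed shares with the stated assumptions. The event fields and the sample data are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def share_by(events: List[Dict[str, str]], key: str) -> Dict[str, float]:
    """Fraction of events per distinct value of `key` (e.g. role, device, reason)."""
    counts = Counter(event[key] for event in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


# Hypothetical sample of rejection events.
rejections = [
    {"role": "manager", "device": "mobile", "reason": "insufficient documentation"},
    {"role": "manager", "device": "mobile", "reason": "incorrect documentation"},
    {"role": "manager", "device": "pc", "reason": "wrong amount"},
    {"role": "payroll_admin", "device": "pc", "reason": "insufficient documentation"},
    {"role": "payroll_admin", "device": "pc", "reason": "duplicate claim"},
]

by_role = share_by(rejections, "role")
print(by_role)
```

Comparing `by_role` with the assumed ~60% / ~40% split shows whether the feature is used as expected, and the same function covers rejections by device and by reason.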
73. Refinement
● Claims must have a new mandatory currency field
● Assumptions
○ 95% of claims will use NOK, SEK, DKK, EUR, USD
○ Currency support will not affect how many claims are created/approved/rejected/paid
● Monitoring
○ Changes to availability monitoring?
■ Yes, we depend on a 3rd party for exchange rates (but maybe not as part of the main heartbeat?)
○ Add performance and/or error monitoring?
■ Payment errors by currency could be interesting
○ Add analytics
■ New claims by currency
■ Approved claims by currency
■ Rejected claims by currency
■ Paid claims by currency
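The “95% of claims” assumption can be checked directly once claims are counted per lifecycle stage and currency. The sketch below assumes each event is a hypothetical (stage, currency) pair; the sample data is made up.

```python
from collections import Counter
from typing import Dict, List, Tuple

# The five currencies named in the assumption.
EXPECTED_CURRENCIES = {"NOK", "SEK", "DKK", "EUR", "USD"}

ClaimEvent = Tuple[str, str]  # (stage, currency), e.g. ("rejected", "SEK")


def claims_by_stage_and_currency(events: List[ClaimEvent]) -> Dict[ClaimEvent, int]:
    """Count events per (stage, currency) pair."""
    return dict(Counter(events))


def expected_currency_share(events: List[ClaimEvent], stage: str = "new") -> float:
    """Fraction of claims in `stage` that use one of the expected currencies."""
    currencies = [currency for event_stage, currency in events if event_stage == stage]
    if not currencies:
        return 0.0
    covered = sum(1 for currency in currencies if currency in EXPECTED_CURRENCIES)
    return covered / len(currencies)


# Hypothetical sample events.
events = [
    ("new", "NOK"), ("new", "EUR"), ("new", "GBP"), ("new", "SEK"),
    ("approved", "NOK"), ("paid", "NOK"),
]
counts = claims_by_stage_and_currency(events)
share = expected_currency_share(events)
print(counts, share)
```

If the share for new claims stays well below 0.95 in production, the assumption was wrong and the list of supported currencies may need revisiting. Comparing the per-stage counts before and after release tests the second assumption.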
77. Takeaways
1. Identify Visma and customer-oriented goals as part of development
2. Monitor those goals to achieve them
3. Identify relevant assumptions as part of refinement
4. Monitor those assumptions, and use data to decide what to do next
79. Focus on capabilities
● SaaS Compliance Requirements + ArchTech Maturity Index
● Ability to monitor, and alert on
○ general availability and service health over time
○ performance of backend transactions
○ backend transaction errors
○ end-to-end performance of loading web pages and route changes
○ frontend errors
○ business metrics (number of users, number of certain actions, use of functionality, etc.)