3 types of monitoring for 2020

T. Alexander Lystad
Chief Cloud Architect
Visma Enterprise PD

Two major reasons for monitoring
● Reliability
○ Preventing, detecting and resolving incidents
● Continuous Delivery
○ Building the right thing

Monitoring as part of development
● Refinement
○ Who do you expect will use the feature?
○ How do you expect the feature will be used?
○ Performance requirements/expectations?
○ Technical dependencies?
○ What monitoring do we need?

Monitoring as part of development
● Implementation
● Monitoring
○ 1st/2nd test env
■ Functional testing
● Errors?
■ Performance testing
● As expected?
○ Production
■ Validate expectations
■ Learn

What should we monitor (and alert on)?
1. General availability/health: Can we process requests?
2. Performance and errors: Quickly and successfully?
3. Analytics: Are we achieving our goals? Are the customers achieving theirs?

General availability/health
Can we process requests?

Availability
● “99.8%”
● Traditional definition
○ Server/OS availability
○ Network availability: Users can reach the cloud service
● Customer definition
○ Functional availability: The cloud service “works”
● Our definition
○ Users can reach the cloud service, and critical components and dependencies are healthy
● How can we monitor this?
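
To make a figure like “99.8%” concrete, it helps to convert it into a downtime budget. The sketch below does that arithmetic; the 30-day period and the function name are assumptions made for illustration, not something stated on the slide.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    # 99.8% is the figure quoted on the slide; the 30-day period is an assumption.
    print(f"{allowed_downtime_minutes(99.8):.1f} minutes per 30 days")
```

With these assumptions, 99.8% leaves roughly 86 minutes of downtime per 30 days.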

Heartbeat monitoring
● Checking availability/health frequently

1) Synthetic monitoring
[Diagram: network availability confirmed]

1) Synthetic monitoring - Configuration


1) Synthetic monitoring - Results

1) Synthetic monitoring - Summary
● Quick and easy to set up and use
● Authentication requires 5 lines of Python
● Only checks one webpage; doesn’t reflect the health of the whole system
● Fragile; only looks for HTTP 200 (unless you add more scripting)
● Can only run every 5 minutes
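
As a rough illustration of what “more scripting” could mean, the sketch below checks both the HTTP status and, optionally, that an expected text appears on the page. It uses only the Python standard library and is not the AppDynamics script from the talk; the function names and the `expected_text` parameter are assumptions.

```python
import urllib.request


def evaluate_response(status: int, body: str, expected_text: str = "") -> bool:
    """Healthy only if the status is 200 and the expected text (if any) is present."""
    if status != 200:
        return False
    return expected_text in body


def synthetic_check(url: str, expected_text: str = "", timeout: float = 10.0) -> bool:
    """Fetch one page and evaluate it; network errors count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return evaluate_response(response.status, body, expected_text)
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Checking for page content catches the case where the server answers 200 with an error page, but it still only covers a single page.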

2) Smarter heartbeat monitoring
● “Users can reach the cloud service, and critical components and dependencies are healthy”
● What are critical components and dependencies?
○ Database? → Critical!
○ Authorization service? → Critical!
○ Background processing job → ?
○ Zip code lookup service → Not critical

● How do we know if they are healthy?
○ Database
■ Connect
■ SELECT 1
■ SELECT id FROM table LIMIT 1
Suitable for heartbeat
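
A minimal sketch of the database check, assuming a connect-then-`SELECT 1` probe. SQLite from the standard library stands in for the real database here, and `connect` is a hypothetical callable that returns a connection.

```python
import sqlite3


def database_is_healthy(connect) -> bool:
    """Connect and run a trivial query; any database error counts as unhealthy."""
    try:
        connection = connect()
        try:
            row = connection.execute("SELECT 1").fetchone()
            return row is not None and row[0] == 1
        finally:
            connection.close()
    except sqlite3.Error:
        return False


if __name__ == "__main__":
    # An in-memory database is used purely for illustration.
    print(database_is_healthy(lambda: sqlite3.connect(":memory:")))
```

With another database the driver, the exception type and the connection call would differ, but the shape of the check stays the same.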

2) Smarter heartbeat monitoring - Endpoint
http://myservice.com/heartbeats/availability

● How do we know if they are healthy?
○ Authorization service
■ Make synchronous request?
■ Log and check last successful call, ping only if necessary
○ Background processing job (e.g. calculating wagerun, generating report)
■ Log and check last successful run, trigger test payload if necessary
■ If we expect test payload to process fast, wait for it before returning OK
■ If not, return OK optimistically, then NOT OK on later calls if test payload has timed out
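
The two strategies above can be sketched as follows. This is one possible reading of the slide, not the implementation from the talk: the class names, the `ping` and `trigger_test_payload` callables and all time limits are assumptions.

```python
import time


class DependencyHealth:
    """Uses the last successful real call as evidence; pings only if it is stale."""

    def __init__(self, max_age_seconds, ping, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds
        self.ping = ping  # hypothetical callable: True if the dependency answers
        self.clock = clock
        self.last_success = None

    def record_success(self):
        """Call from normal request handling whenever the dependency answered."""
        self.last_success = self.clock()

    def is_healthy(self):
        now = self.clock()
        if self.last_success is not None and now - self.last_success <= self.max_age_seconds:
            return True  # recent real traffic succeeded, no ping needed
        if self.ping():
            self.last_success = now
            return True
        return False


class BackgroundJobHealth:
    """Optimistic check for a slow job: OK at first, NOT OK once the test payload times out."""

    def __init__(self, max_age_seconds, payload_timeout_seconds,
                 trigger_test_payload, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds
        self.payload_timeout_seconds = payload_timeout_seconds
        self.trigger_test_payload = trigger_test_payload  # hypothetical callable
        self.clock = clock
        self.last_success = None
        self.pending_since = None

    def record_success(self):
        """Call when a real run or the test payload finishes successfully."""
        self.last_success = self.clock()
        self.pending_since = None

    def is_healthy(self):
        now = self.clock()
        if self.last_success is not None and now - self.last_success <= self.max_age_seconds:
            return True
        if self.pending_since is None:
            self.trigger_test_payload()
            self.pending_since = now
            return True  # optimistic: give the test payload time to finish
        return now - self.pending_since <= self.payload_timeout_seconds
```

The heartbeat endpoint would return OK only if every critical check reports healthy. The injectable `clock` is there so the timing logic can be tested without waiting.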

2) Heartbeat monitoring - Architecture

2) Heartbeat monitoring - AWS Health Check

Now we can use this metric in dashboards and alerts!

Availability alerting!
But we will look at alerting later

Heartbeat monitoring - Summary

Synthetic monitoring with AppDynamics          | Heartbeat monitoring with AWS + AppDynamics
Maximum frequency is every 5 minutes           | Maximum frequency is every 10 seconds
From 3 locations                               | From 8 locations
$123 per year                                  | $81 per year
Quick and easy to get started                  | Some design and implementation effort
Superficial health assessment (network avail.) | Holistic health assessment (functional avail.)

Takeaways
1. Define availability for your service (may change over time!)
2. Implement holistic heartbeat monitoring (starting simple is OK)
3. Configure alerts (incident detection)
4. Configure dashboards (for reporting/analysis/improvement)

2) Performance and errors
Quickly and successfully?

What’s (most) important?
Business Transactions

Business Transactions
● Examples for Visma.net HRM Employee Management
○ Registering a new employee
○ Saving changes to an employee
○ Getting data for an employee
● Examples for Visma.net HRM Payroll
○ Calculating a wagerun (for an organization)
○ Generating a bank payment file for a wagerun
● Defined by URL pattern, or method in application code (doesn’t have to be web-based or user-facing)
○ POST /api/employees
○ PUT /api/employees/<id>
○ GET /api/employees/<id>
○ WageRunManager.RunForOrg
○ GenerateWageRunPayslips.HandleEvent
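
A small sketch of defining business transactions by URL pattern. The three patterns come from the slide; pairing them with these transaction names is an assumption, and APM tools normally provide this matching through configuration rather than code.

```python
import re

# (HTTP method, path pattern, business transaction name)
BUSINESS_TRANSACTIONS = [
    ("POST", re.compile(r"^/api/employees$"), "Register new employee"),
    ("PUT", re.compile(r"^/api/employees/[^/]+$"), "Save changes to employee"),
    ("GET", re.compile(r"^/api/employees/[^/]+$"), "Get employee data"),
]


def classify_request(method: str, path: str):
    """Return the business transaction name for a request, or None if it is not tracked."""
    for expected_method, pattern, name in BUSINESS_TRANSACTIONS:
        if method.upper() == expected_method and pattern.match(path):
            return name
    return None


if __name__ == "__main__":
    print(classify_request("PUT", "/api/employees/42"))
```

Requests that match no pattern return `None`, which keeps the list of monitored transactions short and deliberate.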

New feature: Rejecting claim
Performance and errors

Refinement v1
● Today, claims can only be deleted by managers
● Managers and Payroll Administrators should be able to reject a claim, which sends it back to the employee
● Monitoring
○ No changes to availability monitoring
○ Add monitoring for performance and errors

Alerting - Notification options
1. Email
2. HTTP (e.g. Slack, OpsGenie, …)
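
For the HTTP option, a minimal sketch of posting an alert to a Slack-style incoming webhook, which accepts a JSON body with a `text` field. The webhook URL is supplied by the caller and the function names are assumptions; a monitoring tool would normally send this request for you.

```python
import json
import urllib.request


def build_alert_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Build the POST request without sending it, so it can be inspected or tested."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url: str, message: str, timeout: float = 10.0) -> bool:
    """Send the alert; returns True if the webhook answered with HTTP 200."""
    request = build_alert_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200
```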

No alerts!
Can we still find something to improve?

Example App - Performance Problem

Performance problem fixed!
But… remember this?

Error fixed!
Quickly and successfully

Takeaways
1. Identify critical business transactions
1.1. Start small, then keep adding continuously!
2. Configure alerts (for anomaly detection)
2.1. Consider response time and error rate
2.2. Don’t send a critical alert unless human action is required
2.3. Discussing alerts in Slack 💕
3. Configure dashboards (for reporting/analysis/improvement)
3.1. Look at “top 10 lists” to identify possible quick wins
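
One way to read takeaways 2.1 and 2.2 is sketched below: the alert level depends on both error rate and response time, and only the condition assumed to need a human is marked critical. The thresholds, and the choice that errors are critical while slowness is a warning, are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class TransactionStats:
    name: str
    requests: int
    errors: int
    avg_response_ms: float


def alert_level(stats: TransactionStats,
                max_error_rate: float = 0.05,
                max_avg_response_ms: float = 2000.0) -> str:
    """Return "critical", "warning" or "ok" for one business transaction."""
    if stats.requests == 0:
        return "ok"  # no traffic in the window, nothing to judge
    error_rate = stats.errors / stats.requests
    if error_rate > max_error_rate:
        return "critical"  # assumed to require human action
    if stats.avg_response_ms > max_avg_response_ms:
        return "warning"  # worth discussing, but not paging anyone
    return "ok"
```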

3) Analytics
Are we achieving our goals? Are customers achieving theirs?

Identify goals and relevant metrics
● Visma-oriented
○ Goal: Become the leader in the Danish market
■ Metric: Number of payslips generated per month (for DK customers)
○ Goal: Increase cross-sales
■ Metric: Number of customers who activate the invoicing module
● Customer-oriented
○ Goal: Schools want to enable efficient communication with parents
■ Metric: Messages sent, by user role
■ Metric: Inbox size, by user role
○ Goal: Enterprises want an efficient expense management process
■ Metric: Rejected expenses, by reason
■ Metric: Rejected expenses, by industry

New feature: Rejecting claim
Analytics

Refinement v2
● Today, claims can only be deleted by managers
● Managers and Payroll Administrators should be able to reject a claim, which sends it back to the employee
● Assumptions
○ ~60% of rejections will be by managers, ~40% will be by Payroll Administrators
○ Rejections by managers will often be done on a mobile device, while PAs use PCs
○ Most common reason for rejection will be incorrect or insufficient documentation
● Monitoring
○ No changes to availability monitoring
○ Add monitoring for performance and errors
○ Add analytics: Rejections by role, rejections by device, rejections by reason
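
A sketch of how the analytics above could be compared with the assumptions. The event fields (`role`, `device`, `reason`) and the sample data are hypothetical; in practice these counts would come from the analytics tool.

```python
from collections import Counter


def share_by(events, key):
    """Return each value's share of the events, grouped by the given field."""
    counts = Counter(event[key] for event in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical rejection events, one dict per rejected claim.
    rejections = [
        {"role": "Manager", "device": "mobile", "reason": "INSUFFICIENT_DOCS"},
        {"role": "Manager", "device": "mobile", "reason": "INCORRECT_ACCOUNT"},
        {"role": "Manager", "device": "pc", "reason": "INSUFFICIENT_DOCS"},
        {"role": "Administrator", "device": "pc", "reason": "INSUFFICIENT_DOCS"},
        {"role": "Administrator", "device": "pc", "reason": "UNAUTHORIZED_SPEND"},
    ]
    print(share_by(rejections, "role"))
    print(share_by(rejections, "reason"))
```

Comparing the role shares with the expected ~60/40 split shows whether the assumption from refinement holds in production.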

Feature dashboard v2
[Charts: rejections by reason (INSUFFICIENT_DOCS, INCORRECT_ACCOUNT, UNAUTHORIZED_SPEND), with the values 450, 331 and 123; rejections by role (Administrator, Manager)]

New requirement: Currency!
Any changes in monitoring?

Refinement
● Claims must have a new mandatory currency field
● Assumptions
○ 95% of claims will use NOK, SEK, DKK, EUR, USD
○ Currency support will not affect how many claims are created/approved/rejected/paid
● Monitoring
○ Changes to availability monitoring?
■ Yes, we depend on 3rd party for exchange rates (but maybe not main heartbeat?)
○ Add performance and/or error monitoring?
■ Payment errors by currency could be interesting
○ Add analytics
■ New claims by currency
■ Approved claims by currency
■ Rejected claims by currency
■ Paid claims by currency
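
The first assumption above can be checked directly once claims are counted by currency. The sketch below assumes a plain list of currency codes, one per claim; the function names and the sample data are hypothetical.

```python
EXPECTED_CURRENCIES = {"NOK", "SEK", "DKK", "EUR", "USD"}


def expected_currency_share(claim_currencies):
    """Return the share of claims using an expected currency, or None if there are no claims."""
    if not claim_currencies:
        return None
    hits = sum(1 for code in claim_currencies if code.upper() in EXPECTED_CURRENCIES)
    return hits / len(claim_currencies)


def assumption_holds(claim_currencies, threshold=0.95):
    """True if at least `threshold` of the claims use an expected currency."""
    share = expected_currency_share(claim_currencies)
    return share is not None and share >= threshold


if __name__ == "__main__":
    # Hypothetical sample: 19 claims in expected currencies, 1 in GBP.
    sample = ["NOK"] * 10 + ["SEK"] * 5 + ["EUR"] * 4 + ["GBP"]
    print(expected_currency_share(sample), assumption_holds(sample))
```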

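Feature dashboard v3
[Charts: rejections by role (Manager, Administrator); rejections by reason (INSUFFICIENT_DOCS, OTHER_REASON, INCORRECT_CURRENCY), with the values 450, 331 and INCORRECT_CURRENCY at 3975]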

Takeaways
1. Identify Visma and customer-oriented goals as part of development
2. Monitor those goals to achieve them
3. Identify relevant assumptions as part of refinement
4. Monitor those assumptions, and use data to decide what to do next

Focus on capabilities
● SaaS Compliance Requirements + ArchTech Maturity Index
● Ability to monitor, and alert on
○ general availability and service health over time
○ performance of backend transactions
○ backend transaction errors
○ end-to-end performance of loading web pages and route changes
○ frontend errors
○ business metrics (number of users, number of certain actions, use of functionality, etc.)

Respect
Reliability
Innovation
Competence
Team spirit
