2. I’m Karl Norling
• Swedish, moved to the U.S. ~13 years ago
• Spend most of my time in Brooklyn with my family
• Lead software engineer at Quartet
3. Quartet
• Healthcare technology company focused on improving
Behavioral Healthcare
• 1 in 4 Americans experienced a mental disorder in the last
year; most were moderate to severe
• The behavioral health side is often ignored, leading to poor
outcomes (medication non-adherence & ER visits)
4. Quartet
• Quartet delivers scalable behavioral health integration for
our partners, leading to better patient care and cost
savings
• Quartet’s product is a marriage of our three pillars: Data &
Analytics, Collaborative Platform, and Engagement and
Support.
5. Engineering at Quartet
• Work with highly sensitive and regulated information that
demands high reliability (PHI/HIPAA)
• Develop for very distinct users with very different
challenges (BHP, PCP, Patients, Quartet Users)
• Need to deliver a robust solution
8. How do I know that something is wrong?
Identify critical "this should never happen, but if" cases, and log them
if (user.isAuthenticated) {
  …
} else {
  this.logger.log('warn', 'Unauthorized user trying to access ..', {
    user,
  });
}
9. How do I know that something is wrong?
Wrap code in try/catch statements
try {
  const patient = new Patient(metrics, logger);
  patient.hydrate(rawPatient);
} catch (error) {
  this.logger.log('error', 'Failed to hydrate patient', {
    patientId: rawPatient.id,
  });
}
10. How do I know that something is wrong?
Measure events with metrics
const res = request.authenticate(email, pw);
if (res.status === 200) {
  this.metrics('user_login_success');
} else if (res.status === 403) {
  this.metrics('user_login_unauthorized');
} else if (res.status >= 500) {
  this.metrics('user_login_error');
}
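The slide does not show what this.metrics does. As a minimal sketch, assuming a simple in-memory counter (createMetrics, loginMetricName, and recordLogin are hypothetical names, not Quartet's code), the same branching could be isolated and tested like this. A real setup would forward the counts to a metrics backend.

```javascript
// Hypothetical in-memory metrics recorder; a stand-in for a real metrics client.
function createMetrics() {
  const counts = new Map();
  return {
    increment(name) {
      counts.set(name, (counts.get(name) || 0) + 1);
    },
    count(name) {
      return counts.get(name) || 0;
    },
  };
}

// Map the HTTP status of a login request to a metric name,
// mirroring the branches on the slide.
function loginMetricName(status) {
  if (status === 200) return 'user_login_success';
  if (status === 403) return 'user_login_unauthorized';
  if (status >= 500) return 'user_login_error';
  return null; // other statuses are not measured in this sketch
}

// Record the metric for one login attempt and return the name used (or null).
function recordLogin(metrics, status) {
  const name = loginMetricName(status);
  if (name !== null) metrics.increment(name);
  return name;
}
```

Keeping the status-to-name mapping in one function makes it easy to see which outcomes are measured and which are silently dropped.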
11. How do I know that something is wrong?
Measure everything
• Environment alerts: CPU usage, disk space, etc.
• External reporting: a customer or employee reports an issue.
13. Create alerts
Create search queries in logging software
(e.g. Kibana, Sumo Logic, Splunk)
that will alert on a specific log message, level, or threshold.
At Quartet we’re using ElastAlert.
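To make "alert on a specific log message, level, or threshold" concrete, here is a rough sketch of the check such a query performs. The rule shape (level, messageContains, threshold, windowMs) and the shouldAlert function are assumptions for illustration. This is not ElastAlert's configuration format or API.

```javascript
// Hypothetical rule check: fire when at least `threshold` log entries with
// the given level and message fragment occur within the time window.
// entries: [{ level, message, timestamp }] with timestamps in epoch ms.
function shouldAlert(entries, rule, now) {
  const windowStart = now - rule.windowMs;
  const matches = entries.filter((entry) =>
    entry.timestamp >= windowStart &&
    entry.timestamp <= now &&
    entry.level === rule.level &&
    entry.message.includes(rule.messageContains));
  return matches.length >= rule.threshold;
}
```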
14. Create alerts
Metrics are good for detecting trends.
Example:
If we haven’t had any logins in the last 24 hours,
it’s time to investigate.
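The "no logins in the last 24 hours" example is an absence check rather than a threshold on errors. A minimal sketch, assuming login times are available as epoch-millisecond timestamps (noRecentLogins and its parameters are hypothetical):

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns true when no login falls inside the window ending at `now`,
// which is the signal that it is time to investigate.
// loginTimestamps: epoch-millisecond times of successful logins.
function noRecentLogins(loginTimestamps, now, windowMs = DAY_MS) {
  const windowStart = now - windowMs;
  return !loginTimestamps.some((t) => t > windowStart && t <= now);
}
```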
15. Create alerts
There should be a way for employees and customers to report
issues, either from the website or via an email address.
Example:
An employee using an internal tool cannot change the shipping
address for an order.
17. Organize alerts
Add tags to log messages.
Then, search queries are easier to group, delegate, and report
on.
18. Organize alerts
Define a naming convention for your tags.
Prefix them with either functional areas or team names.
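One way a tag convention could look in practice: a sketch assuming tags of the form "<team>.<area>" in lowercase, such as "core.auth". The pattern and the helper names (isValidTag, teamOf, groupByTeam) are illustrative assumptions, not a convention stated in the talk.

```javascript
// Hypothetical convention: "<team>.<area>", lowercase, e.g. "core.auth".
const TAG_PATTERN = /^[a-z][a-z0-9-]*\.[a-z][a-z0-9-]*$/;

function isValidTag(tag) {
  return TAG_PATTERN.test(tag);
}

// The team prefix is everything before the first dot.
function teamOf(tag) {
  return tag.split('.')[0];
}

// Group log messages by the team prefix of their tags so queries can be
// delegated per team. Tags that break the convention are skipped.
function groupByTeam(entries) {
  const groups = new Map();
  for (const entry of entries) {
    const teams = new Set(entry.tags.filter(isValidTag).map(teamOf));
    for (const team of teams) {
      if (!groups.has(team)) groups.set(team, []);
      groups.get(team).push(entry.message);
    }
  }
  return groups;
}
```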
19. Organize alerts
Alerts should create tickets.
When an alert gets triggered, a ticket should be generated in
whichever tool is used to track work (e.g., JIRA).
Tickets should be created within the project associated with the
team that owns the service.
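A sketch of the routing step, assuming each alert carries the owning team's name. The project keys, the TRIAGE fallback, and ticketFromAlert are hypothetical. The function only builds the ticket data and does not call JIRA or any other tracker.

```javascript
// Hypothetical mapping from owning team to ticket project key.
const PROJECT_BY_TEAM = { 'app-devs': 'APP', core: 'CORE', infra: 'INFRA' };

// Build the ticket payload for an alert. Alerts from unknown teams go to a
// fallback project so that nothing is dropped.
function ticketFromAlert(alert, projectByTeam = PROJECT_BY_TEAM, fallbackProject = 'TRIAGE') {
  const known = Object.prototype.hasOwnProperty.call(projectByTeam, alert.team);
  return {
    project: known ? projectByTeam[alert.team] : fallbackProject,
    summary: '[' + alert.name + '] ' + alert.message,
    labels: alert.tags.slice(),
  };
}
```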
21. Communicate
Choose the right tool for communicating the alert to the person
on call (e.g. Slack, Hipchat, email, JIRA).
At Quartet we’re using Slack.
22. Communicate
Make sure the tool can be configured to send alerts via different
channels depending on the alert, so the correct team or on-call
person sees it.
At Quartet we’re using PagerDuty.
25. Who is on-call
The on-call is the employee who responds to alerts. Other terms
include red-hot, on-duty, etc.
26. On-call acknowledges the issue
An on-call schedule should be created, rotating weekly (depending
on the number of employees).
You may also have a secondary on-call, in case the primary is
unavailable (e.g., on the subway ride home).
At Quartet we have one for app devs, core, and infra.
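A weekly rotation with a secondary can be computed from an ordered roster. This is a sketch under two assumptions not stated in the talk: the secondary is simply the next person in the roster, and the rotation advances every seven days from a fixed start time. In practice a scheduling tool handles this, including overrides.

```javascript
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// roster: ordered list of engineer names.
// rotationStart: epoch ms at which roster[0] began a week as primary.
function onCallFor(roster, rotationStart, now) {
  if (roster.length < 2) {
    throw new Error('need at least two people for primary and secondary');
  }
  const weeksElapsed = Math.floor((now - rotationStart) / WEEK_MS);
  // Normalize so that times before rotationStart still map to a valid index.
  const primaryIndex = ((weeksElapsed % roster.length) + roster.length) % roster.length;
  return {
    primary: roster[primaryIndex],
    secondary: roster[(primaryIndex + 1) % roster.length],
  };
}
```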
28. On-call acknowledges the issue
The primary on-call receives the alert and acknowledges it through
the same channel within a defined time window.
If the time expires, the issue escalates to the secondary on-call.
Make sure to set a time window that makes sense for your
organization.
At Quartet, we use 15 minutes.
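The escalation rule reduces to a small decision: who should be notified right now? A sketch assuming an alert records when it was triggered and, once acknowledged, when. The whoToNotify function and the alert shape are hypothetical. The 15-minute default comes from the slide.

```javascript
const ACK_TIMEOUT_MS = 15 * 60 * 1000; // 15 minutes, as on the slide

// alert: { triggeredAt, acknowledgedAt } in epoch ms;
// acknowledgedAt is null until someone acknowledges the alert.
function whoToNotify(alert, now, timeoutMs = ACK_TIMEOUT_MS) {
  if (alert.acknowledgedAt !== null) return 'none';
  return now - alert.triggeredAt >= timeoutMs ? 'secondary' : 'primary';
}
```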
30. How to respond to the issue
Alerts need to be investigated to determine how urgently they
need to be addressed.
For critical issues, the on-call should be empowered to reach out and
involve the owner of the code causing the issue, even if it’s after hours.
At Quartet, alerts create tickets automatically. For all issues, the
on-call will make sure tickets are assigned to the right team.
31. How to respond to the issue
You need to define a process for marking issues resolved that
makes sense for your organizational model.
It’s helpful if there’s a link to a handbook in the reported issue.
The handbook should contain steps for how to investigate and
possibly resolve the issue.
32. How to respond to the issue
If it’s an employee or customer filing the issue, there needs to be
an established process for communicating externally, e.g., by internal
email or by involving customer service.
Depending on the SLA, this needs to happen within a set timeframe.
34. Establish process and guidelines
At Quartet, we have a doc that details our process and best
practices.
New employees shadow the primary on-calls for a month before
they get added to the rotation.
35. Example guidelines
• Each engineering team member will have the PagerDuty app
installed on their phone.
• If your PagerDuty schedule overlaps with planned vacation,
arrange a schedule override in PagerDuty.
• Each team is responsible for creating PagerDuty alerts for the
services that they are responsible for.
36. Ensure information is maintained
To ensure continuity from one on-call to the next,
there should be a hand-off meeting.
The outgoing on-call is responsible for walking the next on-call through the
weekly report for the previous week.
37. Evolve
How did we get here?
Stop the noise: if an error happens over and over again, dedupe
it. Investigate why.
Downgrade: is the error actually an error, or should we measure
it via a metric?
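Deduping can be as simple as collapsing identical errors into one entry with a count, so a repeated failure produces one alert instead of hundreds. A sketch assuming entries are identified by level and message (the key choice and the dedupe function are assumptions):

```javascript
// Collapse repeated log entries into one record per (level, message) pair,
// keeping a count so the volume is still visible when investigating.
function dedupe(entries) {
  const byKey = new Map();
  for (const entry of entries) {
    const key = entry.level + '|' + entry.message;
    const existing = byKey.get(key);
    if (existing) {
      existing.count += 1;
    } else {
      byKey.set(key, { level: entry.level, message: entry.message, count: 1 });
    }
  }
  return [...byKey.values()];
}
```

A high count on one record is also a hint for the "downgrade" question: if it fires constantly and nobody acts on it, it may belong in a metric.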
38. On-call
The on-call dedicates 100% of their time to investigating bugs. This
makes sense where we’re at, shipping a lot of code. More code
naturally generates more bugs.
39. Tools
Tools ❤ Process
The tooling will not solve your issues; you need a process
for how to use the tools.
“If you only have a hammer, everything looks like a nail”