Incident Management Framework

Preparing for System Failure
Our Approach at Rentman

About Me
- Software Architect at Gapstars / Rentman
- ~15 years of experience, mistakes and learning
- Primarily APIs and Web tech
- I sporadically blog on randomcoding.com
- I tweet as @jomanlk

About Rentman
- Provides resource management and planning for the AV & Event industry
- Industry leader in rentals management for the events industry
- 10+ years in the events space
- Customers across 75 countries
- 70+ employees spread across NA, Europe and Sri Lanka
- Tech stack primarily on AWS
- Most services multi region / multi AZ
- Primarily running on top of AWS ECS
- Heavily use Atlassian products

Agenda
- Introduction ←
- Approach
- Learnings

Why Now, for Rentman?
- Increase our ‘bus factor’
- Reduce loss of institutional knowledge
- Increase active monitoring coverage
- Growing pains. Reduce stress & panic

What Is An Incident Response Plan?
- Well-defined framework to deal with incidents
- No ambiguity
- Clear command structure
- Refer to the Incident Command System (ICS)
“An incident response plan is a document that outlines an organization's procedures, steps, and responsibilities of its incident response program.”

Goals: The 3 Cs
- Coordinate response effort.
- Communicate between incident responders, within the organization, and to the outside world.
- Maintain control over the incident response.

Step 0
- Documentation
- Documentation
- Create a Playbook
- Set up Teams & Organizational support
- Tiered teams. T1, T2

Incident Response Phases
- Triage
- Coordinate
- Mitigate
- Resolve
- Learnings

Common Terms
- IC: Incident Commander
- CL: Comms lead
- LI: Lead investigator
- DS: Domain specialist

Triage
- What’s going on?
- How bad is it?
- Depends on
  - Monitoring
  - User reports
- P3, P2?
  - Not great, but it can wait
- P1
  - BIG problem
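
The triage questions above can be sketched as a small classifier. This is only an illustration: the `Signals` fields and every threshold below are assumptions made for the example, not values from the talk, and a real playbook would define its own criteria.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    P1 = 1  # big problem: respond now
    P2 = 2  # not great, but it can wait
    P3 = 3  # not great, but it can wait (lowest urgency)


@dataclass
class Signals:
    # All fields are hypothetical inputs chosen for this sketch.
    core_service_down: bool  # from monitoring
    error_rate: float        # fraction of failing requests, from monitoring
    user_reports: int        # number of distinct user reports


def triage(s: Signals) -> Priority:
    """Map monitoring and user-report signals to a priority.

    The thresholds are illustrative assumptions only.
    """
    if s.core_service_down or s.error_rate >= 0.25:
        return Priority.P1
    if s.error_rate >= 0.05 or s.user_reports >= 5:
        return Priority.P2
    return Priority.P3
```

For example, `triage(Signals(core_service_down=False, error_rate=0.01, user_reports=7))` falls into the P2 branch because of the number of user reports, even though monitoring looks mostly healthy.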

Coordinate
- Use tooling
  - Scheduling
  - Alerting
- Who needs to be involved?
  - Small incident?
  - Big incident?
- Who’s available?
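
One way to picture the coordination step is a function that fills the roles from Common Terms (IC, CL, LI, DS) from whoever is available. This is a minimal sketch under stated assumptions: the `Responder` type, the roster, and the choice that a small incident needs only an IC and an LI are all hypothetical; in practice a scheduling and alerting tool would do this.

```python
from dataclasses import dataclass


@dataclass
class Responder:
    name: str
    roles: set[str]   # roles this person can fill: "IC", "CL", "LI", "DS"
    available: bool


def assign_roles(roster: list[Responder], big_incident: bool) -> dict[str, str]:
    """Fill each needed role with the first available, not yet assigned responder.

    Assumption for this sketch: small incidents need only IC and LI,
    big incidents need all four roles. Roles nobody can fill are left out.
    """
    needed = ["IC", "CL", "LI", "DS"] if big_incident else ["IC", "LI"]
    assigned: dict[str, str] = {}
    taken: set[str] = set()
    for role in needed:
        for person in roster:
            if person.available and role in person.roles and person.name not in taken:
                assigned[role] = person.name
                taken.add(person.name)
                break
    return assigned
```

A role missing from the returned dictionary signals that someone else has to be paged or that one person must cover two roles.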

Mitigate
- STOP THE BLEED!
- Goal ≠ Finding and fixing issue
- Goal = Get things working
- Collaborate
- Keep it DRY
- Keep it documented
Slide callouts (example responder messages):
- “Reviewing recent releases”
- “Disabling demo creation”
- “Support is asking me for an update, do we have anything?”
- “Joining the incident response! Where are we at?”

Resolve
- Make sure the root cause is addressed
- This could be days or sometimes weeks after incident
Slide callouts (example responder messages):
- “Creating hotfix branch”
- “Added extra logs for this specific issue”

Follow Up
- Document the JIRA issue timeline
- Psychological Safety
- Learn from the experience
- Failure is in the process, not the individual
- Blame free / Owned by team
- Review the process
  - What went well / not well?
  - What was missing?
Slide callouts:
- Improvements to process
- Additional logging added
- Learnings
- Create RCA

Learnings
- Leverage existing workflows / tools
- Practice. Practice. Practice.
  - Breakathons
  - Simulations

Learnings Continued
- Plan. Do. Review. Improve.
- Incorporate Organizational Requirements Early
  - Compensation for on-call
  - Uptime guarantees
  - SLA with customers

Fin.
- Questions: Stay tuned for the panel discussion
- Want to reach out?
  - @jomanlk on Twitter
  - linkedin.com/in/jnxpereira on LinkedIn
  - john@jnx.me on Email
