Preparing for System Failure
Our Approach at Rentman
About Me
- Software Architect at Gapstars / Rentman
- ~15 years of experience, mistakes and learning
- Primarily APIs and Web tech
- I sporadically blog on randomcoding.com
- I tweet as @jomanlk
About Rentman
- Provides resource management and planning for the AV & Event industry
- Industry leader in rentals management for the events industry
- 10+ years in the events space
- Customers across 75 countries
- 70+ employees spread across NA, Europe and Sri Lanka
- Tech stack primarily on AWS
- Most services are multi-region / multi-AZ
- Primarily running on top of AWS ECS
- Heavily use Atlassian products
Agenda
- Introduction ←
- Approach
- Learnings
Why Now, for Rentman?
- Increase our ‘bus factor’
- Reduce loss of institutional knowledge
- Increase active monitoring coverage
- Growing pains. Reduce stress & panic
What Is An Incident Response Plan?
- Well defined framework to deal with incidents
- No ambiguity
- Clear command structure
- Refer to the Incident Command System (ICS)
“An incident response plan is a document that outlines an organization's procedures, steps, and responsibilities of its incident response program.”
Goals: The 3 Cs
- Coordinate response effort.
- Communicate between incident responders, within the organization, and to the outside world.
- Maintain control over the incident response.
The Approach
Step 0
- Documentation
- Documentation
- Create a Playbook
- Set up Teams & Organizational support
- Tiered teams. T1, T2
Incident Response Phases
Triage → Coordinate → Mitigate → Resolve → Learnings
Common Terms
- IC: Incident Commander
- CL: Comms lead
- LI: Lead investigator
- DS: Domain specialist
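The four roles above can be written down as a small data structure so a playbook or chat bot can check that nobody is missing. This is a minimal sketch, not Rentman's tooling: the `IncidentTeam` class, its method names, and the choice of which roles are required are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    """The abbreviations from the Common Terms slide."""
    IC = "Incident Commander"
    CL = "Comms lead"
    LI = "Lead investigator"
    DS = "Domain specialist"


@dataclass
class IncidentTeam:
    # Maps each role to the people currently holding it.
    assignments: dict[Role, list[str]] = field(default_factory=dict)

    def assign(self, role: Role, person: str) -> None:
        self.assignments.setdefault(role, []).append(person)

    def missing_roles(self, required=(Role.IC, Role.CL, Role.LI)) -> list[Role]:
        # Assumption: IC, CL and LI are mandatory; domain specialists
        # are pulled in only when needed.
        return [role for role in required if not self.assignments.get(role)]
```

Calling `missing_roles()` right after an incident is declared gives the Incident Commander a quick list of seats that still need filling.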
Triage
- What’s going on?
- How bad is it?
- Depends on
- Monitoring
- User reports
- P3, P2?
- Not great, but it can wait
- P1
- BIG problem
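The triage questions above ("What's going on?", "How bad is it?") can be reduced to a small classification rule. The sketch below is only an illustration: the inputs, the 25% and 5% thresholds, and the function name `triage` are hypothetical and are not taken from the talk. A real playbook would define its own criteria.

```python
from enum import Enum


class Priority(Enum):
    P1 = 1  # BIG problem
    P2 = 2  # not great, but it can wait
    P3 = 3  # not great, but it can wait


def triage(core_feature_down: bool, affected_customers: int,
           total_customers: int) -> Priority:
    """Combine monitoring and user-report signals into a priority.

    All thresholds here are assumptions chosen for the example.
    """
    share = affected_customers / total_customers if total_customers else 0.0
    if core_feature_down or share >= 0.25:
        return Priority.P1
    if share >= 0.05:
        return Priority.P2
    return Priority.P3
```

Writing the rule down, even roughly, removes ambiguity at the moment when people are least able to debate it.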
Coordinate
- Use tooling
- Scheduling
- Alerting
- Who needs to be involved?
- Small incident?
- Big incident?
- Who’s available?
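The "who needs to be involved?" decision can be sketched as a filter over an on-call roster. This is a hedged example under stated assumptions: the `Responder` type, the reading of T1 as the first line of response, and the rule "small incident pages one T1 responder, P1 pages everyone available" are illustrative choices, not the process described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Responder:
    name: str
    tier: int        # 1 = T1, 2 = T2 (assumed meaning of the tiers)
    available: bool  # e.g. taken from the scheduling tool


def pick_responders(priority: str, roster: list[Responder]) -> list[Responder]:
    """Return the people to alert for an incident of the given priority."""
    available = [r for r in roster if r.available]
    if priority == "P1":
        # Big incident: alert every available responder in both tiers.
        return available
    # Small incident: one available T1 responder, else whoever is available.
    first_t1 = [r for r in available if r.tier == 1][:1]
    return first_t1 or available[:1]
```

In practice the roster and availability flags would come from the scheduling and alerting tooling rather than a hand-built list.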
Mitigate
- STOP THE BLEED!
- Goal ≠ Finding and fixing issue
- Goal = Get things working
- Collaborate
- Keep it DRY
- Keep it documented
Example updates:
- “Reviewing recent releases”
- “Disabling demo creation”
- “Support is asking me for an update, do we have anything?”
- “Joining the incident response! Where are we at?”
Resolve
- Make sure the root cause is addressed
- This could be days or sometimes weeks after the incident
Example updates:
- “Creating hotfix branch”
- “Added extra logs for this specific issue”
Follow Up
- Document the JIRA issue timeline
- Psychological Safety
- Learn from the experience
- Failure is in the process, not the individual
- Blame free / Owned by team
- Review the process
- What went well / not well?
- What was missing?
Example outputs:
- Improvements to process
- Additional logging added
- Learnings
- Create RCA (root cause analysis)
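Documenting the timeline is easier if updates are captured in a structured form while the incident is still running. The following is a minimal sketch of such a log. The `Timeline` class and its text format are assumptions for illustration and do not use or represent the Jira API. The rendered text could simply be pasted into the issue afterwards.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Timeline:
    # Each entry is (timestamp, author, note).
    entries: list[tuple[datetime, str, str]] = field(default_factory=list)

    def log(self, author: str, note: str, at: Optional[datetime] = None) -> None:
        # Default to "now" in UTC so entries from different regions line up.
        self.entries.append((at or datetime.now(timezone.utc), author, note))

    def render(self) -> str:
        # Sort by timestamp so late-added notes still appear in order.
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(
            f"{ts:%Y-%m-%d %H:%M} UTC  {author}: {note}"
            for ts, author, note in ordered
        )
```

Keeping entries factual (who did what, and when) supports the blame-free review: the timeline describes the process, not the people.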
The Learnings
Learnings
- Leverage existing workflows / tools
- Practice. Practice. Practice.
- Breakathons
- Simulations
Learnings Continued
- Plan. Do. Review. Improve.
- Incorporate Organizational Requirements Early
- Compensation for on-call
- Uptime guarantees
- SLA with customers
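An uptime guarantee translates directly into a downtime budget, which is useful when deciding how fast an on-call rotation has to react. The helper below is a small worked calculation. The function name and the 30-day period are assumptions, and the percentages used with it are illustrative rather than figures from the talk.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed per period for a given uptime guarantee."""
    total_minutes = period_days * 24 * 60
    # Rounded to two decimals to hide floating-point noise.
    return round(total_minutes * (100 - uptime_percent) / 100, 2)
```

For example, a 99.9% guarantee over 30 days leaves 43.2 minutes of downtime, while 99.99% leaves 4.32 minutes, which is less than most teams need just to acknowledge an alert.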
Fin.
- Questions: Stay tuned for the panel discussion
- Want to reach out?
- @jomanlk on Twitter
- linkedin.com/in/jnxpereira on LinkedIn
- john@jnx.me on Email

