Incident Management Framework

Preparing for System Failure
Our Approach at Rentman
About Me
- Software Architect at Gapstars / Rentman
- ~15 years of experience, mistakes and learning
- Primarily APIs and Web tech
- I sporadically blog on randomcoding.com
- I tweet as @jomanlk
About Rentman
- Provides resource management and planning for the AV & Event industry
- Industry leader in rentals management for the events industry
- 10+ years in the events space
- Customers across 75 countries
- 70+ employees spread across NA, Europe and Sri Lanka
- Tech stack primarily on AWS
- Most services multi region / multi AZ
- Primarily running on top of AWS ECS
- Heavily use Atlassian products
Agenda
- Introduction ←
- Approach
- Learnings
Why Now, for Rentman?
- Increase our ‘bus factor’
- Reduce loss of institutional knowledge
- Increase active monitoring coverage
- Growing pains. Reduce stress & panic
What Is An Incident Response Plan?
- Well-defined framework to deal with incidents
- No ambiguity
- Clear command structure
- Refer to the Incident Command System (ICS)
“An incident response plan is a document that outlines an organization's procedures, steps, and responsibilities of its incident response program.”
Goals: The 3 Cs
- Coordinate response effort.
- Communicate between incident responders, within the organization, and to the outside world.
- Maintain control over the incident response.
The Approach
Step 0
- Documentation
- Documentation
- Create a Playbook
- Set up teams & organizational support
- Tiered teams: T1, T2
Incident Response Phases
Triage → Coordinate → Mitigate → Resolve → Learnings
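The five phases can be modelled as an ordered enumeration. This is a minimal sketch, not Rentman's tooling: the `Phase` enum and the `next_phase` helper are hypothetical names, and the strictly linear progression is an assumption read off the order on the slide.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    # Order follows the slide: Triage, Coordinate, Mitigate, Resolve, Learnings.
    TRIAGE = 1
    COORDINATE = 2
    MITIGATE = 3
    RESOLVE = 4
    LEARNINGS = 5


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase after `current`, or None once Learnings is reached.

    Assumption: phases advance strictly in slide order. A real incident
    may loop back, for example from Mitigate to Triage.
    """
    members = list(Phase)  # Enum iteration preserves definition order
    index = members.index(current)
    return members[index + 1] if index + 1 < len(members) else None
```

Calling `next_phase(Phase.TRIAGE)` returns `Phase.COORDINATE`, and the last phase returns `None`.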
Common Terms
- IC: Incident Commander
- CL: Comms lead
- LI: Lead investigator
- DS: Domain specialist
Triage
- What’s going on?
- How bad is it?
- Depends on
- Monitoring
- User reports
- P3, P2?
- Not great, but it can wait
- P1
- BIG problem
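The triage questions above can be sketched as a small classification function. Everything specific here is an assumption for illustration: the `Signal` fields, the `core_flow_down` flag, and the threshold of 10 affected users are hypothetical and do not come from the talk. Only the P1 / P2 / P3 labels and the two input sources (monitoring, user reports) are from the slide.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    source: str           # "monitoring" or "user_report" (the two inputs on the slide)
    users_affected: int   # rough estimate at triage time (hypothetical field)
    core_flow_down: bool  # hypothetical flag: is a critical workflow unusable?


def triage(signal: Signal) -> str:
    """Map a signal to a priority label.

    Assumed rules, for illustration only:
    - P1 ("BIG problem") when a critical workflow is down.
    - P2 when many users are affected but work can continue.
    - P3 otherwise ("not great, but it can wait").
    """
    if signal.core_flow_down:
        return "P1"
    if signal.users_affected >= 10:  # hypothetical threshold
        return "P2"
    return "P3"
```

In practice the thresholds would come from the playbook, so that the answer to "How bad is it?" does not depend on who happens to be triaging.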
Coordinate
- Use tooling
  - Scheduling
  - Alerting
- Who needs to be involved?
  - Small incident?
  - Big incident?
- Who’s available?
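The "who needs to be involved, and who's available" questions can be sketched as a lookup from priority to required roles, filled from an on-call rota. The role codes (IC, CL, LI, DS) are the ones from the Common Terms slide. The mapping of priorities to roles, the rota shape, and the `assign_roles` name are assumptions made for this example, not the process described in the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping: a big incident staffs every role,
# a small one only needs someone to investigate.
ROLES_BY_PRIORITY: Dict[str, List[str]] = {
    "P1": ["IC", "CL", "LI", "DS"],
    "P2": ["IC", "LI"],
    "P3": ["LI"],
}


def assign_roles(
    priority: str, on_call: Dict[str, List[str]]
) -> Tuple[Dict[str, str], List[str]]:
    """Pick the first available person for each required role.

    `on_call` maps a role code to the people currently available for it,
    in rota order. Returns (assignments, unfilled_roles) so that gaps are
    visible and can be escalated instead of being silently ignored.
    """
    assignments: Dict[str, str] = {}
    unfilled: List[str] = []
    for role in ROLES_BY_PRIORITY[priority]:
        candidates = on_call.get(role, [])
        if candidates:
            assignments[role] = candidates[0]
        else:
            unfilled.append(role)
    return assignments, unfilled
```

Returning the unfilled roles separately is a deliberate choice in this sketch: an empty slot for the Incident Commander is itself something the alerting tool should page on.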
Mitigate
- STOP THE BLEED!
- Goal ≠ Finding and fixing issue
- Goal = Get things working
- Collaborate
- Keep it DRY
- Keep it documented
“Reviewing recent releases”
“Disabling demo creation”
“Support is asking me for an update, do we have anything?”
“Joining the incident response! Where are we at?”
Resolve
- Make sure the root cause is addressed
- This could be days or sometimes weeks after incident
“Creating hotfix branch”
“Added extra logs for this specific issue”
Follow Up
- Document the JIRA issue timeline
- Psychological Safety
- Learn from the experience
- Failure is in the process, not the individual
- Blame free / Owned by team
- Review the process
- What went well / not well?
- What was missing?
“Improvements to process”
“Additional logging added”
“Learnings”
“Create RCA”
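Documenting the timeline is easier if updates are captured as structured entries during the incident and rendered afterwards. The sketch below assumes a hypothetical `TimelineEntry` record and `render_timeline` helper. It produces plain text that could be pasted into the issue, and it does not call any Jira API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    at: datetime  # when the update was posted
    author: str   # who posted it, e.g. a role code such as "LI"
    note: str     # the update itself


def render_timeline(entries: List[TimelineEntry]) -> str:
    """Return the entries in chronological order, one line per entry."""
    ordered = sorted(entries, key=lambda entry: entry.at)
    return "\n".join(
        f"{entry.at:%H:%M} {entry.author}: {entry.note}" for entry in ordered
    )
```

Sorting by timestamp rather than by insertion order means late additions, such as a note reconstructed from chat history, still land in the right place for the RCA.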
The Learnings
Learnings
- Leverage existing workflows / tools
- Practice. Practice. Practice.
  - Breakathons
  - Simulations
Learnings Continued
- Plan. Do. Review. Improve.
- Incorporate Organizational Requirements Early
  - Compensation for on-call
  - Uptime guarantees
  - SLA with customers
Fin.
- Questions: Stay tuned for the panel discussion
- Want to reach out?
  - @jomanlk on Twitter
  - linkedin.com/in/jnxpereira on LinkedIn
  - john@jnx.me on Email