A presentation on how we support and proactively resolve issues in our cloud-based applications. We will share the tools we use and how we track issues.
Speakers: Derek, Nurul Zaman
10. How do we do that?
• Supporting a growing list of products
• From 1 country to 8 countries within 12 months
Processes
Exponential Growth in Support Demands
11. Tools for Support
Zendesk
- Ticketing system for L1, L2, L3, Help Center
Bugzilla
- Defect Management Tool used by Support, Dev, UX & QA
Kibana
- Log search and analysis tool for Developers & Support staff
WordPress
- Manage Knowledgebase for sharing
- Release Notes
14. Incident Management
• Resolving the issue within the shortest possible
time.
• Root cause is not the main concern. Get the
service back online as soon as possible.
• The resolution could be permanent or temporary
(for example, providing a workaround).
15. Incident Management
• Code Blue situation - a system-wide fault
that affects all or most of the
customers.
• Resolution of the code blue will become the
utmost priority for the support & technical
teams.
• Proper communication & follow-up with
multiple clients is required.
16. Incident Management
• Two key measurement metrics
  • Time To Respond
    • Within the same working day
  • Time for Resolution
    • P1 defect: 2 working days
    • Resolution could be a workaround or a permanent solution
    • Resolution could be advice or mini training
    • For P2 and P3 defects with patches, customers are notified with
      proper follow-up
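The two metrics above can be checked mechanically. The following is a minimal sketch, assuming that working days are Monday to Friday and ignoring public holidays; the function names are hypothetical and not part of any tool named in this deck.

```python
from datetime import date, timedelta
from typing import Optional


def add_working_days(start: date, days: int) -> date:
    """Return the date that is `days` working days after `start`.

    Assumption: working days are Monday to Friday, no public holidays.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def responded_in_time(reported: date, responded: date) -> bool:
    """Time To Respond: the first response is on the day the ticket was reported."""
    return responded == reported


def resolution_due(reported: date, priority: str) -> Optional[date]:
    """Time for Resolution: P1 defects are due within 2 working days.

    Other priorities have no fixed due date in this sketch, so None is returned.
    """
    if priority == "P1":
        return add_working_days(reported, 2)
    return None
```

For example, a P1 defect reported on a Friday would be due on the following Tuesday under these assumptions.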
18. Problem Management
• Proper tracking of issues and defects is important
• The entire technical team could be involved:
• Support (Level 2 and Level 3)
• Developers
• Business Analysts
• QA Team
• Infrastructure
• Release / DevOps
19. Problem Management
• Known Error Record (KER) - While the problem is being resolved,
a workaround or temporary solution is used to circumvent the
issue in production
• There are 2 types of Problem Management scenarios:
• Re-Active Problem Management
• Pro-Active Problem Management
20. Problem Management (Re-active)
• These are issues that arise from incidents and are directly
reported by the customers.
• Re-Active problems are given a higher priority to be resolved.
21. Problem Management (Pro-active)
• Issues or problems do not always come
from the end-users or customers.
• Problems could be identified internally
• Continuous improvement initiatives
• Communication to the end users
22. Problem Management (Defect Prioritisation)
Priority: P1
Definition:
1. These are defects that affect multiple users.
2. A crucial feature (or the entire system) is not usable and there is no suitable workaround.
Resolution:
1. Need to be analysed and resolved as soon as possible.
2. Typically fixed & deployed to production within 2 working days.

Priority: P2
Definition:
1. These are defects that affect only some users.
2. The affected features are not used often or there is an easy workaround available.
Resolution:
1. Will be fixed as part of the monthly release cycle.

Priority: P3
Definition:
1. These are minor defects (e.g. UI, wording).
Resolution:
1. There is no fixed time for resolution for these.
2. Will normally be attended to after P1 and P2 defects are fixed.
3. Priority may be increased or decreased accordingly.
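The prioritisation rules above can be written as a small triage helper. This is only a sketch: the `Defect` fields and the `classify` function are hypothetical names, and it assumes that both P1 conditions must hold together and that any defect that is neither P1 nor purely cosmetic is treated as P2.

```python
from dataclasses import dataclass


@dataclass
class Defect:
    affects_multiple_users: bool
    crucial_feature_unusable: bool
    has_workaround: bool
    cosmetic_only: bool = False  # e.g. UI or wording issues


def classify(defect: Defect) -> str:
    """Map a defect to P1, P2 or P3 (assumed reading of the definitions)."""
    if defect.cosmetic_only:
        return "P3"
    if (
        defect.affects_multiple_users
        and defect.crucial_feature_unusable
        and not defect.has_workaround
    ):
        return "P1"
    return "P2"


# Resolution targets as stated for each priority.
RESOLUTION_TARGET = {
    "P1": "fix and deploy to production, typically within 2 working days",
    "P2": "fix as part of the monthly release cycle",
    "P3": "no fixed time; normally after P1 and P2 defects are fixed",
}
```

A triage step might call `RESOLUTION_TARGET[classify(defect)]` to decide what to communicate to the customer.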
24. Problem Management (Root Cause Analysis)
• Finding the root cause and coming up with a viable solution.
• Requires monitoring of the production system.
• Collecting and interpreting logs.
25. Problem Management (Root Cause Analysis)
• Successful RCA:
• The problem can be eliminated completely.
• The risk of recurrence is lowered.
• Sometimes we might have to contact vendors
(for example, Microsoft or Telerik) to provide
solutions.
26. Problem Management (Root Cause Analysis)
Define
• What is the Problem
• Determine the Scope and Goal
Analyse
• Analyse the causes
• Why does it happen
Prevent
• Develop appropriate solution
• Implement solution
27. Problem Management (5 Whys)
[Diagram: five successive "Why?" questions leading from the Problem to the Revelation]
Do:
• Respect
• Trust
• Learn from the past
• Look positively towards the future
Avoid:
• Being defensive
• Blame game
• Being disrespectful & pessimistic
30. Problem Manager Role
• Keeps track of the problems in the
system and facilitates the resolution.
• Organises meetings and works with the
teams.
• Hosts Root Cause Analysis sessions.
• Reports to management and
stakeholders.
31. Problem Manager Role
• Aids in finding systemic issues, technical issues and process
issues in the product and its supporting structure.
• Works closely with the Incident Manager on analysing the
incident trends.
32. Software Design Strategy for Better Support
• Proper error logging mechanism.
• Send out critical alerts when errors are detected.
• Easy to understand error messages with detailed information
(e.g. Error Codes) for a speedier response and troubleshooting.
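The three points above can be sketched with Python's standard `logging` module. This is a minimal illustration, not the mechanism used in the products discussed here: the error codes (`E1001`, `E2001`), the `AlertHandler` class and the helper names are hypothetical, and the alert handler only collects messages where a real system might send an email or page the on-call engineer.

```python
import logging


class AlertHandler(logging.Handler):
    """Receives only CRITICAL records and stores them as outgoing alerts."""

    def __init__(self):
        super().__init__(level=logging.CRITICAL)
        self.sent = []

    def emit(self, record):
        # A real handler might email or page someone here.
        self.sent.append(self.format(record))


def build_logger(name="app"):
    """Create a logger with a normal log stream plus a critical-alert handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False

    stream = logging.StreamHandler()
    stream.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(stream)

    alerts = AlertHandler()
    logger.addHandler(alerts)
    return logger, alerts


def log_error(logger, code, message, critical=False):
    """Log an error with an error code so support can look it up quickly."""
    level = logging.CRITICAL if critical else logging.ERROR
    logger.log(level, "[%s] %s", code, message)
```

With this layout, every error reaches the normal log, while only critical ones also reach the alert handler; the error code at the start of each message gives support staff a stable key to search for.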
35. Our Support Structure
Development / Engineering / Vendor - PM, CM
1. Final point for technical resolution
2. Perform RCA activities
Level 3 - IM, PM, CM
1. Able to conduct more in-depth technical investigation
2. Perform RCA activities
Level 2 - IM
1. Able to handle more technical tasks
2. Database investigation
3. Hardware support & remote support
Level 1 (Customer Service / Helpdesk) - IM
1. Initial point of contact
2. Provide solutions to simple and known issues
3. Perform straightforward tasks
4. Give advice and suggestions
39. Conclusion
• Keep users happy by providing great support.
• Come up with a good support process:
• Incident Management
• Problem Management
• Proper Communication
• Encourage Level 3 Support Team to develop in-house checking tools to
improve support.
Example of a code blue:
Users were unable to log in to the system because of a memory issue.
The support team monitored the IIS logs to see where the memory leak was taking place and was able to identify exactly what was causing the issue.
It was discovered that the issue occurred because of dirty data in the system, which most likely originated during data migration.
The dirty data was found and corrected with data patches.
The code was also immediately fixed and hot-patched (to handle this type of dirty data).
Communication was sent out during and after successful resolution of the situation.
As part of the continuous improvement initiative, pro-active problems need to be tracked and resolved in a similar fashion to the re-active problems.
Example of the 5 Whys (problem: the vehicle will not start):
Why? - The battery is dead. (first why)
Why? - The alternator is not functioning. (second why)
Why? - The alternator belt has broken. (third why)
Why? - The alternator belt was well beyond its useful service life and not replaced. (fourth why)
Why? - The vehicle was not maintained according to the recommended service schedule. (fifth why, a root cause)