2020114-SEV-Namibia Flexcube and Digital channels were unavailable IM5183170-PM2275.pdf
SEV – Namibia Flexcube and Digital channels unavailable
IM5183170 / PM2275
Attendance register:
Name Department SEV (14-Nov-20, 09h10) Post Mortem (17-Nov-20, 13h00)
Wessel Pieterse Service Recovery √
Mo Turkey Problem Management √ √
Douglas Gilliland Incident Management √ √
Boris Berovic Problem Management √ √
Fayyaz Dindar Problem Management
Kubendren Govender Group Risk
Florina Patil GT Risk
Rudi Bennett Group Risk Monitors √
Marcia Lopes GT Risk √
Freddy Ambani SQL Support √ √
Naga Bolisetty NAR Tech Business Enablement √ √
Ian Castelyn Application Support √ √
Indren Chinnappen Hybrid Cloud Operations √
Nitesh Chiba I&O Cloud Infrastructure √ √
Amos Doboro Service Delivery Manager √
John Drotsky Information & Communication Technology (ICT) √ √
Milin Jumberi OMNI Spare 4 √ √
Sam Maraya UNIX DB2 and ORACLE Support √
Tumelo Malete I&O Systems Optimization √
Thabang Mokoena NAR Tech Interphases and Infrastructure √
Sello Moloi UNIX DB2 and ORACLE Support √ √
Stephen McTigue NEDBANK NAMIBIA LTD √ √
Herman Munsamy NAR Tech Leadership √
Nicholas Namacha OMNI Spare 4 √ √
Ramana Potlolla NAR Tech Applications Support √
Stefanus Shivute Namibia Technical Supervisor – Core Banking √ √
Stoney Steenkamp Omni Channel Management √ √
Lionel Valentine NAR Tech Interphases and Infrastructure √ √
Suvesh Varghese NAR Tech Business Enablement √
Casper Wolmarans UNIX DB2 and ORACLE Support √ √
Sibusiso Zama Service Delivery Management √
DISTRIBUTION
Barry van Huyssteen GT Exec
Alfie De Bruyn Application Development & Maintenance
Nico Naidoo CIO: Nedbank Africa Regions (NAR) √
1. Problem Description:
Namibia Flexcube and Digital channels were unavailable.
2. Impact:
Namibia Flexcube, Icon and Digital channels were unavailable for clients to transact during the outage. Namibia ATM and POS channels were available in stand-in mode.
3. Duration:
Date and time of occurrence: Saturday 14th November 2020 at 02h00
Time problem resolved: Saturday 14th November 2020 at 12h30
Impacted downtime: 10 hours 30 minutes
SLA downtime: 10 hours 30 minutes
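Illustrative check of the downtime figure: the short sketch below recomputes the impacted downtime from the occurrence and resolution times stated above. The variable names are illustrative only and are not part of any bank tooling.

```python
from datetime import datetime

# Timestamps taken from the Duration section of this report.
occurred = datetime(2020, 11, 14, 2, 0)    # Saturday 14 November 2020, 02h00
resolved = datetime(2020, 11, 14, 12, 30)  # Saturday 14 November 2020, 12h30

downtime = resolved - occurred

# Split the elapsed time into whole hours and remaining minutes.
hours, remainder = divmod(int(downtime.total_seconds()), 3600)
minutes = remainder // 60

print(f"Impacted downtime: {hours} hours {minutes} minutes")
# prints "Impacted downtime: 10 hours 30 minutes"
```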
4. Communication:
Fred Swanepoel
Barry van Huyssteen
Nico Naidoo
GT Severity SMS Group
GT Executive Communication SMS Group
GT Business Severity SMS Group
5. Sequence of Events:
Date/ Time Event
14-Nov-20 00h00 Maintenance pages were brought up as part of the Middleware change.
14-Nov-20 00h07 The Oracle database change to rename duplicate datafiles was started – Change number CC20201111/1207.
14-Nov-20 00h35 While the files were being copied, an error indicated insufficient space on the filesystem. The DBA then redirected the 5 files to another filesystem, which corrupted the headers of the datafiles.
14-Nov-20 01h14 The Change was completed, and the database was opened.
14-Nov-20 03h00 The Oracle standby team was phoned to alert them of issues regarding connecting to the database.
14-Nov-20 03h24 The Oracle standby team escalated the outage to Management.
14-Nov-20 04h00 Maintenance pages for Namibia Flexcube and Digital channels were brought up.
14-Nov-20 04h42 A Focus call was started with technical teams.
14-Nov-20 05h28 The restoration of the 5 corrupt files (Datafiles 40, 215, 248, 278, 289) started.
14-Nov-20 06h54 SMS was sent out indicating that the restoration process was in progress, and the estimated on-line time was between 09h00 and 10h00.
14-Nov-20 08h05 The restoration process was completed.
14-Nov-20 08h05 The ORADBA team identified that only 1 file had been fully restored; the other 4 had not been restored in parallel during the restoration process.
14-Nov-20 08h30 The ORADBA team created separate scripts for the outstanding 4 files so that their restores could run concurrently.
14-Nov-20 08h54 A Service Request was logged with ORACLE support as a Severity 1.
14-Nov-20 09h14 SMS update was sent indicating that the restoration process was unsuccessful, and the estimated on-line time was between 11h00 and 12h00.
14-Nov-20 09h14 The Focus call was upgraded to a Severity.
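Illustrative note on the 08h30 step: the idea of running one restore job per outstanding datafile concurrently, rather than one after the other, can be sketched as follows. This is a minimal sketch only. The function `restore_datafile` is a hypothetical placeholder and does not represent the ORADBA team's actual scripts or any Oracle command; the datafile numbers are the four that remained outstanding after datafile 40.

```python
from concurrent.futures import ThreadPoolExecutor

# The four datafiles still outstanding after datafile 40 (from the event log).
OUTSTANDING_DATAFILES = [215, 248, 278, 289]


def restore_datafile(file_id: int) -> str:
    """Hypothetical placeholder for one per-file restore job.

    A real job would invoke the separate restore script for this datafile;
    here it only returns a status string so the sketch stays runnable.
    """
    return f"datafile {file_id} restored"


def restore_concurrently(file_ids, job=restore_datafile):
    """Run one job per datafile at the same time and collect the results in order."""
    with ThreadPoolExecutor(max_workers=max(1, len(file_ids))) as pool:
        return list(pool.map(job, file_ids))


if __name__ == "__main__":
    for status in restore_concurrently(OUTSTANDING_DATAFILES):
        print(status)
```

Running the jobs side by side means the total elapsed time is driven by the slowest single file rather than by the sum of all four, provided the storage and backup media can sustain the combined load.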
6. Actions: 14-Nov-20
No. | Action/Comment | Status | Date/Time Expected | Date/Time Completed | Responsible Person/Area | Follow-up and feedback
1 | An alternative solution was recommended by Nitesh Chiba to copy the files from the reporting database server to the production directory and recover the files. | Done | 14-Nov-20 | 14-Nov-20 | Nitesh Chiba | The suggestion was made at 08h13 in order to speed up the restoration of the service.
2 | Oracle was requested to join a Webex call. | Done | 14-Nov-20 | 14-Nov-20 | Sello Moloi | The workaround to copy files from the Reporting DB worked so Oracle’s input was not needed after this point in time.
3 | The workaround worked and datafile 40 was recovered. The process to copy and recover the remaining 4 files started at 09h55. | Done | 14-Nov-20 | 14-Nov-20 | Ian Castelyn | The file copy was successful, and the health checks confirmed that there was no data corruption.
4 | The restore-from-backup scripts that were running for the outstanding datafiles were stopped at 09h59. | Done | 14-Nov-20 | 14-Nov-20 | Ian Castelyn | This was done in order to start the file copy process.
5 | All datafiles were reported to be on-line at 10h49. | Info | 14-Nov-20 | 14-Nov-20 | Ian Castelyn |
6 | The datafile dates were checked and confirmed to be valid with the correct date being the 14th November 2020. | Done | 14-Nov-20 | 14-Nov-20 | Suvesh Varghese |
7 | The scheduled reporting batches were initiated at 10h52. | Done | 14-Nov-20 | 14-Nov-20 | Naga Bolisetty |
8 | The FCUBS Weblogic Application server was restarted at 10h53. | Done | 14-Nov-20 | 14-Nov-20 | Stefanus Shivute |
9 | The preparation and health checks of the EOD (End of Day) scheduling process were started at 11h02. | Done | 14-Nov-20 | 14-Nov-20 | Sandile Molefe |
10 | The scheduling jobs that were previously held were now released. | Done | 14-Nov-20 | 14-Nov-20 | Sam Maraya |
11 | ATMs and POS were taken out of stand-in at 11h30. | Done | 14-Nov-20 | 14-Nov-20 | Sandile Molefe |
12 | The scheduling jobs that were flagged as aborted were force-completed at 11h27. | Done | 14-Nov-20 | 14-Nov-20 | Stefanus Shivute | This was done to complete the previous day's batch run.
13 | The FCDB (Flexcube Data Base) and onboarding applications were restarted by the Middleware team at 11h35. | Done | 14-Nov-20 | 14-Nov-20 | Stefanus Shivute | Completed at 11h39.
14 | The Gateway service was restarted at 11h44. | Done | 14-Nov-20 | 14-Nov-20 | Nicholas Namacha |
15 | An error was reported on the FCDB and MoneyApp applications when trying to log on. | Info | 14-Nov-20 | 14-Nov-20 | Milin Jumberi | The error message read “Request cannot be processed please try again latter”. The action was resolved, and the error message didn’t reoccur.
16 | The system was online at 11h45. | Info | 14-Nov-20 | 14-Nov-20 | Stefanus Shivute | The branches were able to start operating.
17 | The EOD scheduling jobs completed at 12h07. | Info | 14-Nov-20 | 14-Nov-20 | Stefanus Shivute | Flexcube application was online at 12h07.
18 | The post EOD jobs were running as per the schedule. | Info | 14-Nov-20 | 14-Nov-20 | Stefanus Shivute |
19 | Onboarding was reported to be up and running at 12h17. | Info | 14-Nov-20 | 14-Nov-20 | |
20 | The Internet Application servers and FCDB were restarted at 12h29. | Done | 14-Nov-20 | 14-Nov-20 | Sandile Molefe | John Drotsky tested the login to MoneyApp and FCDB Webpage confirming that all was in order.
21 | The maintenance pages were brought down at 12h29. | Done | 14-Nov-20 | 14-Nov-20 | Tumelo Malete |
22 | All services were fully restored at 12h30. | Info | 14-Nov-20 | 14-Nov-20 | |
7. Actions: 17-Nov-20
No. | Action/Comment | Status | Date/Time Expected | Date/Time Completed | Responsible Person/Area | Follow-up and feedback
23 | Ensure that the Namibia Command Centre is made aware of the changes relating to Namibia. | Done | 17-Nov-20 | 17-Nov-20 | Stephen McTigue |
24 | Investigate and review the current 90% file systems threshold alerting in order to mitigate issues related to insufficient space availability. | Sev action PM2275-001 | 15-Dec-20 | | Casper Wolmarans |
25 | ORADBA team to engage with ORACLE support to establish if a step can be built in as part of health checks for any datafile corruption within the database relating to relevant database changes. | Sev action PM2275-002 | 15-Dec-20 | | Casper Wolmarans |
26 | Investigate and review reducing the database size by archiving data to reduce the time it takes to restore the database. | Sev action PM2275-003 | 15-Dec-20 | | Lionel Valentine |
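Illustrative note on action 24: a percentage-used threshold alert of the kind under review could be sketched as below. This is an assumption-laden sketch, not the monitoring tooling in use. The function names and the example mount points are hypothetical; only the 90% figure comes from the action above.

```python
import shutil

ALERT_THRESHOLD_PCT = 90.0  # the threshold named in action 24


def used_percent(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def filesystems_over_threshold(paths, threshold=ALERT_THRESHOLD_PCT, usage_fn=used_percent):
    """Return the paths whose usage is at or above the alert threshold.

    `usage_fn` is injectable so the logic can be exercised without real mounts.
    """
    return [p for p in paths if usage_fn(p) >= threshold]


if __name__ == "__main__":
    # Hypothetical mount points, for illustration only.
    for path in filesystems_over_threshold(["/"]):
        print(f"ALERT: {path} is at or above {ALERT_THRESHOLD_PCT}% used")
```

A percentage threshold on its own does not show whether a specific change will fit: a filesystem below 90% can still lack room for a large set of datafiles, which is why a size-based check before the change remains necessary.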
8. Root Cause:
Process: Datafiles were moved to a new location that did not have sufficient space because the pre-checks were not done as required by the process.
Findings:
1. A change was executed to rename duplicate file names.
2. This change was initially scheduled for the 7th November 2020 at which point the necessary actions were taken to ensure that sufficient space was available.
3. In the planning and execution of the change that was rescheduled, the process to ensure that sufficient space was available was omitted.
Could the incident have been prevented? Yes, if sufficient space had been allocated for the change.
• Is the root cause the result of a process and/or procedure that was not followed (compliance)? Yes. The action to verify that sufficient space was available before the change was not done.
• Could that procedure be automated? No
• Is there a process or procedure deficiency which could have prevented the incident (adequacy)? No
• Could that procedure be automated? No
Is the root cause due to a project? No
Is the root cause due to a change? Yes Change number CC20201111/1207.
Was there Executive override for the change? N/A
Does this environment have DR? Yes
Does this environment have Resiliency? Yes
Does this environment have infrastructure, application and availability monitoring and alerting? Yes
Is the monitoring / alerting adequate? Yes
9. Severity owner:
1. Department – Executive: (Hybrid Cloud Operations)
2. Individual – Indren Chinnappen
10. Risk:
1. Financial Risk: Financial – Financial losses were reported to be N$21 635.70 (R21 635.70) based on the average NIR that would have been generated on the Saturday for user-driven, in-branch transactions (Herman Munsamy).
2. Reputational Risk: High – All core banking, Internet banking and NAR Money App services were unavailable to Nedbank Namibia clients during the outage period. Automated Teller Machines and Point of Sale devices were in stand-in during the outage period. Branches remained open primarily to take in deposits. No media coverage of the event has been noted (Herman Munsamy).
11. Actions to restore service:
1. Datafile 40 was restored from the backup directory.
2. The 4 remaining datafiles were copied from the reporting database server to the production database server.
3. The datafiles were then recovered on the production database server using the archived files.
4. The recovered datafiles were put in an online status.
5. The database was opened and confirmed to be online.
6. The EOD scheduling jobs from the previous day were completed.
7. Application Internet server was restarted.
8. The maintenance pages were brought down.
12. Lessons Learnt:
1. Application and functionality testing must be done immediately after the change to minimise the risks of prolonged downtime.
2. Pre-tasks for all changes must be reviewed and strictly followed.
3. Consider reducing the database size by archiving data to reduce the time it takes to restore the database.
4. Review the backup and restore strategy.
13. Actions to prevent a reoccurrence:
1. Ensure that if a change is postponed, all pre-change checks are redone for the new date.