Emergency Handling for Recovery of SAP System Landscapes



Best Practice for Solution Management
Version Date: May 2008
The newest version of this Best Practice can always be obtained through the SAP Solution Manager.
© 2008 SAP AG

Table of contents

1 Introduction
  1.1 Goal of Document
  1.2 What Is a Disaster?
  1.3 Course of a Recovery
  1.4 Organization
  1.5 Dos and Don'ts in Case of a Disaster
2 Flowchart for Emergency Handling
3 From Incident to Disaster (Steps 1 to 5)
4 Error Categorization (Step 6)
  4.1 Technical Failure
  4.2 Logical Error
  4.3 Cross-System Inconsistency
5 Activating Alternate Procedures (Steps 7 to 8)
  5.1 Switchover after Technical Failures
  5.2 Workarounds after Logical Errors and Data Inconsistencies
6 Preparations for Recovery (Step 9)
7 Executing Recovery (Step 10)
  7.1 Overview of Recovery Phases
  7.2 Technical Recovery (Recovery-Phase 1)
  7.3 Data Repair (Recovery-Phase 2)
  7.4 Business Recovery (Recovery-Phase 3)
  7.5 Data Re-entry (Recovery-Phase 4)
8 Returning to Normal Operation (Steps 11 to 16)
9 Examples
  9.1 Example 1: Media Failure
  9.2 Example 2: Media Failure and Database Recovery Failure
  9.3 Example 3: Lost Data
    9.3.1 Example 3a: All Data Can be Recovered
    9.3.2 Example 3b: Remaining Data Loss is Only Locally Relevant
    9.3.3 Example 3c: Remaining Data Loss Causes Cross-system Inconsistencies
  9.4 Example 4: Database Block Corruptions
Appendix
  A - Flowcharts for Printout
1 Introduction

1.1 Goal of Document

Disruptions of core business functions put the success of a company at risk. When business operations are disrupted, a standardized procedure can help to return to regular operations in a timely manner. Meanwhile, activating technical switchover solutions or business-level workarounds can provide an interim solution to keep up operations (at least at some minimum level) while regular functionality is being restored.

Using flowcharts, this document outlines a general procedure to be followed in case of serious business disruptions. Starting with the escalation of an incident to a disaster, it depicts the main phases and steps of the recovery procedure, allowing incidents to be classified and providing details on the recovery options available in the different phases.

The purpose of this document is twofold:
  1. Support the handling of an acute emergency
  2. Provide input for Business Continuity Planning

Emergency Situation

Following this document, SAP support employees and customer support organizations will be able to follow a structured approach to support the recovery of a customer's system environment within a reasonable timeframe.
This document provides information for each phase of a recovery procedure. Examples of typical error situations and recovery approaches illustrate the course of the recovery process.

The procedure outlined here can also serve as the basis for an action plan for coordinating an emergency situation.

For a customer, this document is helpful if no disaster recovery plan is available; if there is one, the document can support the emergency handling process by supplementing the customer-specific plan.

As such, this document is intended for:
  • A disaster recovery team executing a recovery
  • Support employees assisting customers in a business-down situation
  • Duty managers / escalation managers accompanying a recovery

Business Continuity Planning

Business continuity planning has the task of preparing a company for a disaster situation by creating detailed recovery plans for different contingencies. The general flow of an emergency procedure described here can be adopted in a customer-specific recovery plan. The error categorization and recovery options listed in this document can provide input for the detailed recovery instructions to be defined.

As such, this document is intended for:
  • The business continuity project manager
  • Members of a business continuity project

Business Continuity Planning is also part of another Best Practice provided by SAP: “Business Continuity Management for SAP System Landscapes”, which covers the project steps of establishing a business continuity concept; see http://service.sap.com/solutionmanagerbp.

The recovery procedures created by a business continuity project are intended to guide emergency handling in a business-down situation.
1.2 What Is a Disaster?

Within the scope of this document, a disaster is any event that seriously disrupts business operation beyond the acceptable outage time. An incident that cannot be resolved within a predefined time limit needs to be escalated to a disaster that requires further recovery procedures. This understanding of the term 'disaster' is often used in the context of Business Continuity Management and goes beyond the more restrictive notion of a 'physical disaster' like a fire, flooding or explosion.

Business disruptions or 'disasters' can be caused by technical failure or logical failure.
  • Technical failures of a system component usually affect all business processes that use the affected component(s). They range from crashes of individual hardware components and database block corruptions to building fires or the flooding of an entire computer center.
  • Logical failures, on the other hand, often affect only a single or a few business processes while the systems are still up. They range from partial data loss or data corruption inside a single system to inconsistencies in data exchanged between multiple systems of an environment.

1.3 Course of a Recovery

Disaster recovery handling, as described in this document, starts with the escalation of an incident that is seriously disrupting operations to a disaster.

It is important to identify the type of error causing the disruption as early as possible, since the required recovery phases and applicable activities mainly depend on the error type.

When a disaster is declared, the following main phases of recovery may be applied in this order:
  A. Activate possible alternatives to stay in business. This can be a technical switchover to a standby system, or the activation of alternate business processing using workarounds or emergency plans. Which options are possible or applicable depends on the solutions in place and the actual type of error.
     Activating a workaround will be easier and faster if the workaround is already documented in a recovery plan.
  B. Prepare the systems for the recovery.
  C. If a system or component is down, system recovery (technical recovery) has to reestablish technical availability of the system as a first step by fixing any technical error causing the disruption. This can be done, for example, by replacing a defective hardware component, by activating a standby system or by restoring a database from a backup.
  D. If all components are up (or were recovered in the previous step), logical errors inside each system have to be removed to restore the integrity of each system in itself. This requires in-depth application knowledge and is a prerequisite for the next step.
  E. If data consistency between systems of the environment was affected, it has to be re-established. This again requires in-depth application knowledge and time.
  F. If data was lost and could not be recovered so far, the next effort should aim at re-entering this data into the systems.
  G. Once all recovery phases are finished, the systems and business functionality should be checked as a prerequisite for resuming regular operations.

More details on the different phases can be found in section 2 and following.
1.4 Organization

The organization that executes the recovery plans consists of:
  • The support desk (incident management) staff that report a possible disaster case
  • The business continuity manager
  • The recovery team with representatives from business (key business user/business process champion) and IT (application management/SAP technical operations/business process operations)
  • A crisis team of senior managers from business (business process champion) and IT (application management) that need to be consulted for critical decisions like the activation of a disaster recovery plan
  • Key users familiar with emergency workarounds for critical business functionality

A standardized method is recommended to support a complex business continuity approach. This approach involves all parties that can contribute to the resolution of a disaster.

The focus of the approach should be the analysis and resolution of top issues, that is, the issues with the highest impact on the operation of productive solutions in case of a disaster.

The benefits of a consolidated approach are:
  • Standardized approach that has proven to be most effective and efficient
  • Fast access to all experts needed
  • Close collaboration and communication with all involved parties
  • High transparency on current issue status
  • Continuous reporting on the progress, up to management level

1.5 Dos and Don'ts in Case of a Disaster

This section lists some general pitfalls during a recovery.

  Don'ts                                                           Instead do
  Do not apply point-in-time recovery of a single system           Do repair inconsistencies / logical errors of the single system
  Do not apply point-in-time recovery for the system landscape     Do identify and correct inconsistencies / missing data
  Do not chase temporary differences                               Do make sure that it's a real inconsistency, not a temporary difference
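The last rule in the list above (confirm a real inconsistency before correcting anything) can be sketched in code. The following Python snippet is only an illustration under stated assumptions: each system is represented by a plain dictionary snapshot of object keys and values, and the two snapshots are taken once before and once after all pending updates and queues have been processed (the distinction between differences and inconsistencies is described in section 4.3). The names `mismatches` and `classify` are hypothetical and do not refer to any SAP tool.

```python
from typing import Dict, Hashable, Set

# Sentinel marking "object not present in this system".
_MISSING = object()

Snapshot = Dict[Hashable, object]  # object key -> value as seen in one system


def mismatches(system_a: Snapshot, system_b: Snapshot) -> Set[Hashable]:
    """Keys that are missing in one system or carry different values."""
    all_keys = system_a.keys() | system_b.keys()
    return {
        key for key in all_keys
        if system_a.get(key, _MISSING) != system_b.get(key, _MISSING)
    }


def classify(before_drain: Set[Hashable], after_drain: Set[Hashable]) -> Dict[str, Set[Hashable]]:
    """Split mismatches into temporary differences and real inconsistencies.

    before_drain: mismatches observed while interfaces were still processing.
    after_drain:  mismatches observed after all pending processing finished
                  (assumption: the caller has verified this).
    Mismatches that appear only in the second comparison are not classified
    here; they would need a further check.
    """
    return {
        "temporary_differences": before_drain - after_drain,
        "inconsistencies": before_drain & after_drain,
    }
```

For example, an order that exists in one system and arrives in the other only after the queues are processed ends up in `temporary_differences`, while an order whose values still differ afterwards ends up in `inconsistencies`. Only the latter is a candidate for correction.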
2 Flowchart for Emergency Handling

This section describes the general flow of activities for handling a situation that impacts continuity of business operations. The following figure provides a flowchart that can also serve as the basis for a specific action plan in an emergency situation.

Note: If available, a customer's business continuity plan and corresponding recovery plans need to be factored in when performing a recovery.

Figure 1: Flowchart 1: “Emergency Handling”

The different steps of the emergency handling process will be discussed in more detail in the following sections:

  Step     Section   Title
  1-5      3         From Incident to Disaster
  6        4         Error Categorization
  7-8      5         Activating Alternate Procedures
  9        6         Preparations for Recovery
  10       7         Executing Recovery
  11-16    8         Returning to Normal Operation
3 From Incident to Disaster (Steps 1 to 5)

Incident

A business disruption is usually detected by end users who trigger an incident at the support organization. Incident management (support desk), as the primary addressee for any type of business disruption, analyzes and tries to resolve the error by answering the following questions:
  • What has happened?
  • Which processes are affected?
  • How many users are affected?
  • Is the error reproducible?
  • Is the error business critical?

Involved Organization: Support desk

Escalation

The situation has to be escalated to business continuity management:
  • If error resolution is not successful within a given time frame defined in the SLAs, or
  • As soon as it becomes clear that the error is of high criticality and complexity and has a serious impact on business operations

Business continuity management is responsible for the further handling of such major incidents. Before escalating to a disaster, further analysis has to answer the following questions:
  • What is the impact on business operations?
  • Is the incident endangering the business of the customer (production down)?
  • What is the root cause of the problem?
  • What is the estimated time required for recovery?
  • Should I invoke the disaster recovery plan (escalate)?

If a serious business disruption (one that prevents critical core business functions from operating) is determined, a disaster situation will be declared and the business continuity plan will be invoked.

Involved Organization: Support desk, BC Manager, Senior Management

Disaster

In a disaster situation, the following questions need to be addressed by the disaster recovery team:
  • Who do I have to call first?
  • Is the incident of an isolated nature or will the consequences deteriorate over time (for example data inconsistencies that may spread if work continues in the system)?
  • Is it possible to maintain partial functionality in the system or must the system be taken out of operation completely?
  • Which workarounds are available; which are possible? See section 5.
  • What needs to be done before starting recovery actions? See section 6.
  • Which recovery options are available; which are possible? See section 7.
  • Who else will be required in the recovery team (BC team)?

Involved Organization: BC Team
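The two escalation criteria from the "Escalation" step of this section can be expressed as a simple rule. The following Python sketch is an illustration only; the `Incident` fields, in particular `sla_resolution_time` and `business_critical`, are hypothetical names chosen for this example and do not come from any SAP incident management tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    opened_at: datetime
    resolved: bool
    # Hypothetical flag: the error is of high criticality and complexity
    # with a serious impact on business operations.
    business_critical: bool
    # Hypothetical field: resolution time frame defined in the SLAs.
    sla_resolution_time: timedelta


def must_escalate(incident: Incident, now: datetime) -> bool:
    """Return True if the incident should be escalated to business
    continuity management according to the two criteria in the text."""
    if incident.resolved:
        return False
    sla_exceeded = now - incident.opened_at > incident.sla_resolution_time
    return sla_exceeded or incident.business_critical
```

Escalation to business continuity management is not yet the declaration of a disaster; that decision still depends on the further analysis questions listed above.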
4 Error Categorization (Step 6)

The type of error (error category) determines not only the entry point for recovery execution (see step 10), but also the options for activating alternate procedures (business continuity solutions, see steps 7-8). Therefore, the first requirement now is to determine the type of error leading to the disruption.

We can distinguish:
  • Technical failures
  • Logical errors inside a system
  • Cross-system inconsistencies affecting data exchanged between systems of a landscape

Involved Organization: BC Team

4.1 Technical Failure

A technical failure is usually caused by a hardware or system software fault. The system is usually unavailable and thus all users and business processes relying on that system are affected.

Technical errors can be of the following nature:
  • System / subsystem crash
  • Infrastructure failure (network, power, telephony, …)
  • Failure of service provider
  • Physical disaster (fire, flooding, …)

Error causes

Technical failures can, for example, result from:
  • Hardware failure (memory, CPU, controller, …)
  • Storage media or storage system failure
  • Software bug (firmware, operating system, filesystem, database, …)
  • Database block corruptions

Block corruptions

Database block corruptions are a special kind of technical error. The content of storage blocks used by the database is corrupted, so the data stored in these blocks can no longer be used. The impact of block corruptions can range from SQL statements or transactions failing when accessing the corrupt block, up to crashes of the complete database instance.
If system data of the database management system was corrupted, it may no longer be possible to restart the database instance.

Block corruptions can have multiple causes: hardware failures such as a defective disk controller, memory errors, or low-level software bugs that damage the data.

Although block corruptions are sometimes regarded as logical errors (since the data stored in the blocks is logically corrupt), we categorize them as technical failures because the phases that are applicable to recover from block corruptions start at the hardware / technology level (see section 7.2). Fixing a corruption on the technical layer may sometimes result in missing or incorrect data (a logical error).
4.2 Logical Error

A logical error usually affects only parts of a system or its data and thus only a few business processes and a limited number of users. Since all data is consistent from a database and “SAP Basis” viewpoint, the systems are up and 'only' some business functionality is disrupted or faulty.

Logical errors can be of the following nature:
  • Some business data is lost, ranging from complete tables to single table rows or single fields of table rows. If data is lost, the application context will be corrupted since related data still exists in other tables
  • Business data is falsified, ranging from single table rows to the contents of specific table fields
  • Reports or other software processes are inoperable

Error causes

Logical errors can result from software error or human error (business user or administrator error) like:
  • Data deletion or table drops on SQL, database administration or SAP level
  • Transport-induced error (wrong destination, wrong transport buffer, …)
  • Faulty customizing
  • Introduction and execution of bad code
  • Incorrect usage of an application component, incorrect data entry
  • Incorrect data transfer / incorrect data processing through interfaces

4.3 Cross-System Inconsistency

In a system landscape where business processes use and modify data in various systems, data consistency is vital for correct business operation. A business object that is exchanged between two systems and should thus be available in both systems is inconsistent (between the two systems) if:
  • The object does not exist in one of the systems
  • The two instances of the same object have different values in the two systems

A special type of inconsistency in this respect is an inconsistency between an IT system and the real world.

Difference or inconsistency

When talking about inconsistencies, it is important to differentiate between Differences and Inconsistencies.
A Difference is a mismatch between data that will always occur in connected running systems (due to the processing times of asynchronous update tasks, IDocs, BDocs and other interfaces, or different scheduling frequencies between systems). An Inconsistency is a mismatch that does not disappear when all system activities have been processed successfully. Before attempting a correction, it is therefore necessary to investigate whether an Inconsistency or a temporary Difference is observed.

Error causes

Inconsistencies between two (or more) systems can be caused by:
  • Software errors
      o No clear leading system
      o Program bugs
      o Non-transactional interfaces, for example, synchronous communication used for data manipulations
      o Incorrect error handling
  • User errors / Manual intervention
      o Incorrect data entry
      o Incorrect error handling
      o Deletion of queues
      o Direct access to data
  • Messages in error states
  • Simplified commit protocol (as used between APO and liveCache)
  • Data loss in one of the systems
      o Incomplete recovery of a system
      o Technical disaster recovery (data replication) method that does not adhere to data consistency
  • Tolerated data loss, for example, with asynchronous replication
  • Missing consistency technology
  • System failure or failover
      o Non-transactional interfaces with non-SAP components may be affected by data loss

5 Activating Alternate Procedures (Steps 7 to 8)

Depending on the type of error, different possibilities may be available to continue operations during recovery:
  • Technical failure: A technical continuity solution may allow you to switch over operations to alternate hardware.
  • Logical errors or inconsistencies: Workaround processes may be available to continue the most critical business functions.

5.1 Switchover after Technical Failures

Involved Organization: BC Team, IT

Server-side failure

Failover on the server side using a cluster solution for database or SAP central services can be done without limitations.

Storage-side failure

If data is replicated to a second facility, a switchover to the alternate storage system must always be considered carefully. The decision must be made by the disaster recovery team after weighing the benefits against the impact. Since a switchover may involve some data loss, business recovery may afterwards be needed to remove cross-system data inconsistencies.

The amount of data loss during switchover depends on the implemented replication technology.
  • A standby database may incur a relatively high data loss if the most recent logfiles from production cannot be applied.
  • Asynchronous replication generally incurs data loss according to the allowed replication lag.
  • Even with synchronous replication, some data loss may occur if the primary location continued to operate while the replication was already interrupted (“rolling disaster”).

Block Corruptions

If a standby database is available and complete recovery of this standby database can be done using the most recent logs from the production system, switchover to the standby database can be a very quick solution to enable continuity of business operation, because block corruptions caused by a technical failure usually do not transfer into a standby database.

If complete recovery is not possible on the standby database, other possibilities of resolving the corrupted blocks should be analyzed in more detail (see 7.2), because a switchover would result in cross-system inconsistencies whose resolution might be more complex.

Note: Switching to a standby database will not solve the problem if the corruption was also transferred to the standby database, for example, if the block corruption was caused by bugs in the database software.

5.2 Workarounds after Logical Errors and Data Inconsistencies

Involved Organization: BC Team, Key users

Alternate Processing

If a business process becomes unusable due to logical errors or inconsistent data, the business process can only be re-activated after the error has been sufficiently resolved. In the meantime, it might be possible to “stay in business” using a workaround procedure.

At best, such workarounds were already documented in a business continuity plan and can be activated according to this plan.
If this is not the case, it should be analyzed whether any such workarounds are possible and applicable to continue operations on a reduced scale.

The following types of workarounds may be considered:
  • Manual, paper-based processing
  • Operation based on the remaining systems of a system landscape
  • Working with reduced functionality
  • A combination of the above

Since a workaround always implies some limitations and usually requires more or less expensive post-processing when normal operations are reestablished, the activation of a workaround should be under the control of the disaster recovery team.

6 Preparations for Recovery (Step 9)

Before starting the actual recovery process, some preparations may be required to avoid unintended side effects. Depending on the actual situation, the affected system may need to be shut down or isolated from other systems of the landscape before error resolution can continue. Isolation from other systems may be required, for example, to prevent the exchange of messages before data inconsistencies have been resolved.

Consider the following preparations before starting with the recovery:
  • Notify users
  • Stop user access to production system(s)
  • Isolate affected system
  • Salvage possibly helpful information
  • Ensure you are able to revert the system to the point before you start the recovery

Involved Organization: BC Team, IT

Notify Users

Business users need to know about the disruption and must be given guidelines on how to proceed. If a workaround will be activated, the users must be instructed to use it.

Stop User Access

While recovery actions are performed, normal users must not be allowed to work with the affected system or the affected business processes. This is mainly important during the resolution of logical errors or data inconsistencies. It can be achieved by disabling user logon to the system or by locking the affected transactions. After a technical system recovery, user logon should be prevented until it has been verified that recovery has really completed and that no further recovery on the logical level is required.

Possible actions:
  • Lock users
  • Lock affected transactions
  • Lock system

Isolate Affected System

As long as the state of the affected system is not completely clear, the system should be isolated from its environment. Any automatic actions should be disabled; message exchange with other systems of the environment should be avoided, especially when cross-system inconsistencies are expected after data loss or incomplete database recovery.

Possible actions:
  • Disable communication from other systems
      o In other systems: disable connections, deregister outbound queues/destinations, disable automatic data requests
      o In affected system: lock RFC user for incoming messages from other systems
  • Disable communications to other systems
      o In other systems: lock RFC user for incoming messages from affected system
      o In affected system: disable connections, deregister outbound queues/destinations, disable automatic data requests
  • Disable transports
  • Disable print-outs

Salvage Possibly Helpful Information

Prevent data that may be helpful for recovery or analysis on application level from being deleted (for example by automatic reorganization jobs).
This mainly comprises information on messages that were exchanged between the affected system and other systems of the landscape.
  • Do not delete the contents of message queues unless you are sure that this information is available in the target system (also see section 7.4)
  • Unschedule message reorganization in XI
  • Unschedule BDoc reorganization in CRM
  • Avoid deletion of ALE data
Enable Reverting to the State Before Recovery

If recovery is unsuccessful or even increases the damage, it should be possible to revert to the point before recovery started. This can be ensured by taking a backup before starting recovery, or by noting the exact point in time when recovery was started so that a database restore and log recovery can return the system to that point. If available, other technologies like savepoints or storage-based snapshots can provide even better solutions for this requirement.

7 Executing Recovery (Step 10)

At this stage, the actual recovery from the failure will be performed, using previously documented recovery procedures from a business continuity plan, if available. The available options depend on the type of error and will lead to the flowchart in figure 2.

7.1 Overview of Recovery Phases

The recovery procedure can be divided into four different phases.
  1. Technical recovery: Fix technical errors to get the system up and running
  2. Data repair: Fix logical errors inside the affected system
  3. Business recovery: Fix cross-system inconsistencies
  4. Data re-entry: Re-enter data that might have been lost

The following flowchart depicts the sequence of recovery actions, coming from step 10 of flowchart 1 (figure 1). The entry point into this flowchart is determined by the error type (see section 4). Which phases actually need to be executed depends on the type of error and the outcome of the previous phase. In section 9, you can find examples showing different paths for traversing the flowchart.
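The dependency between error type, entry point and subsequent phases can be sketched as a small routing function. The following Python snippet is a simplified illustration, not a transcription of the flowchart: the mapping of error types to entry phases follows the phase descriptions above, while the boolean parameters (`logical_errors_remain`, `inconsistencies_remain`, `data_lost`) are hypothetical inputs standing for the outcome of the analysis after each phase.

```python
from enum import Enum
from typing import List


class ErrorType(Enum):
    TECHNICAL_FAILURE = "technical failure"
    LOGICAL_ERROR = "logical error"
    CROSS_SYSTEM_INCONSISTENCY = "cross-system inconsistency"


PHASE_NAMES = {
    1: "Technical recovery",
    2: "Data repair",
    3: "Business recovery",
    4: "Data re-entry",
}

# Assumed entry point per error category (see section 4).
ENTRY_PHASE = {
    ErrorType.TECHNICAL_FAILURE: 1,
    ErrorType.LOGICAL_ERROR: 2,
    ErrorType.CROSS_SYSTEM_INCONSISTENCY: 3,
}


def planned_phases(
    error_type: ErrorType,
    logical_errors_remain: bool = False,
    inconsistencies_remain: bool = False,
    data_lost: bool = False,
) -> List[int]:
    """Return the recovery phases to execute, in order."""
    entry = ENTRY_PHASE[error_type]
    phases = [entry]
    # Later phases are added only if the previous outcome requires them.
    if entry < 2 and logical_errors_remain:
        phases.append(2)
    if entry < 3 and inconsistencies_remain:
        phases.append(3)
    if data_lost:
        phases.append(4)
    return phases
```

Under these assumptions, a technical failure that is fixed without any data loss yields only phase 1, whereas a technical failure followed by an incomplete database recovery (`inconsistencies_remain=True`, `data_lost=True`) yields phases 1, 3 and 4.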
Figure 2: Flowchart 1.1: “Recovery Phases”

7.2 Technical Recovery (Recovery-Phase 1)

Goal: Repair defective hardware or software components to get the database and SAP system up and running

Involved Organization: BC Team, IT

Strategy for Technical Recovery

Systems or components may not be operational due to hardware failure, corrupted filesystems, lost system files, lost or corrupted database files, misconfiguration, corrupted software or software bugs, where software can be any kind of low-level system software, operating system, database or application software. These failures can be the consequence of defective hardware but may have various other causes.

Coming from phase 1 of flowchart 1.1 (figure 2), the following flowchart depicts the main steps to get the systems back online.
Figure 3: Flowchart 1.1.1: “Technical Recovery”

Options for Technical Recovery

Although error and root cause analysis may not be easy, the measures to fix technical failures are quite straightforward and can require any of the following hardware-, system- or database-related activities:
  • Fix hardware
      o Exchange defective hardware components
      o Switch to an alternate data center (in case of physical disasters)
  • Fix software and filesystems
      o Check and repair filesystems
      o Restore filesystems (see below)
      o Restore or reinstall affected software, which can be system software, drivers, operating system, database software, application software, and so on
      o Install error-free software patches
      o Fix configuration errors
  • Fix database
      o Restore database or database files (see below)
      o Resolve database block corruptions (see below)

Next step: If the failure or the recovery procedure did not cause any loss or falsification of database or application data, recovery can finish at this stage and proceed with step 11, returning to normal operation. Otherwise, further applicable branches of the flowchart need to be traversed.
Filesystem Restore

A restore of storage volumes containing non-database managed filesystems with any kind of software components or any other kind of data is always incomplete, which means that any changes made after the last backup are lost. Unlike databases, filesystems have no concept of logfiles that would allow changes made after the backup to be reapplied.

Since an SAP system does not store application data in filesystems (with very few exceptions for special kinds of data like logfiles, TREX indexes or external content in CM), no impact on application data consistency is to be expected after a filesystem restore. The loss of information may thus affect software, configuration files, transport files or these special kinds of data. Subsequently, analysis should be conducted to find out what exactly was lost, and any activities that allow the previous state to be reconstructed should be repeated (repeat installation of software patches, repeat configuration changes, repeat export of transports). Strictly speaking, this is already an activity that belongs to section 7.5 (Data Re-entry), but for simplicity we leave it here.

Database Restore and Recovery

A database restore from a backup may be required in case of:
  • Media (disk) failures
    Since nowadays all productive installations implement some form of RAID protection, failure of a single disk no longer requires a restore. Only if more than one disk of a single RAID group fails at the same time will a restore of the data residing on the RAID group be inevitable. Due to striping, which is implemented for performance reasons, a restore may affect multiple tablespaces residing on that RAID group or even the complete database.
  • Block corruptions (see below)
  • Deletion of data files or misconfiguration of raw devices, for example, due to an administrator fault

A restore always consists of three phases:
  1. Restore of a database backup.
Depending on the backup strategy, the restore can be done from different sources (tape, virtual tape, local disk, remote disk, standby database), yielding a different restore performance. 2. Restore of database logfiles from the backup medium 3. Application of database logfiles to the restored database (log recovery)Log recovery is a very important step to roll forward the database and to apply changes that weredone after the backup. During log recovery, all archived logfiles should be applied, followed by thecurrent online logs that were being written when the failure occurred. The goal should always be toperform a complete recovery including the latest committed transaction. Only a complete recoveryavoids data loss, which is important to maintain data consistency in a system landscape. Only byapplying all available logfiles (archive and online logs), can a complete recovery be achieved.Note: Despite of the status of aborted messages being exchanged between the systems, complete recovery maintains data consistency in an SAP system landscape because of the transactional concept deployed for message exchange. Doing a complete recovery, all committed message states are restored exactly as they were at the time of the failure. The asynchronous messaging protocol used for data exchange in SAP environments (tRFC, qRFC, EO or EOIO messaging) then ensures that message exchange can be continued without losing or duplicating messages.An incomplete recovery (point-in-time recovery) of the database needs to be avoided since thisintroduces the need to remove cross-system inconsistencies that are caused by the data loss in theaffected system (see below).Example: For an example for a complete recovery after a technical failure see section 9.1.Next step: Following a complete database recovery, recovery can finish at this stage and proceed with step 11 returning to normal operation.© 2008 SAP AG - 16 -
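The difference between a complete and an incomplete recovery comes down to whether the chain of logfiles from the backup up to the failure is unbroken. The following is only a minimal sketch of that reasoning, not a replacement for the database's own recovery logic; the function name and the use of plain integer log sequence numbers are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def last_recoverable_sequence(
    first_needed_seq: int,
    available_seqs: Iterable[int],
    current_seq: int,
) -> Tuple[int, bool]:
    """Walk the log chain and report how far log recovery can get.

    first_needed_seq: hypothetical sequence number of the log that was in
                      use when the backup was taken.
    available_seqs:   sequence numbers of all error-free logs on hand
                      (archived logs plus online logs).
    current_seq:      sequence number of the online log that was being
                      written when the failure occurred.

    Returns (last sequence that can be applied, True if recovery is complete).
    If even the first log is missing, the first element is first_needed_seq - 1.
    """
    have = set(available_seqs)
    last_applied = first_needed_seq - 1
    for seq in range(first_needed_seq, current_seq + 1):
        if seq not in have:
            # A gap in the chain: everything after it cannot be applied.
            return last_applied, False
        last_applied = seq
    return last_applied, True


if __name__ == "__main__":
    # Unbroken chain: complete recovery is possible.
    print(last_recoverable_sequence(100, [100, 101, 102, 103], 103))
    # Log 102 is missing or corrupt: recovery stops after log 101.
    print(last_recoverable_sequence(100, [100, 101, 103], 103))
```

A single missing log in the middle of the chain is enough to turn the recovery into a point-in-time recovery, which is why replicas of archived and online logs matter so much.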
  17. 17. Incomplete Database Recovery / Data Loss

In very rare situations, an incomplete database recovery (point-in-time recovery) may be inevitable following a database restore. Complete database recovery is generally not possible if:

- A required logfile is corrupt and no error-free copy of the logfile is available
- The tapes storing the logfiles are destroyed
- The most current online logs cannot be accessed and applied to the database and there is no other replica of these online logs available

After an incomplete recovery, the database itself is in a consistent, error-free state and the affected system can be started (unless other errors are still present). However, the data loss has an impact on cross-system data consistency in a system landscape, and as a next step the business recovery phase needs to address these issues.

Example: For an example of an incomplete recovery, see section 9.2.

Next step: Following an incomplete database recovery, data consistency between systems needs to be re-established and subsequently the lost data needs to be re-entered in the system. Therefore, recovery phases 3 and 4 are required.

Block Corruptions

If you detect block corruptions in your database, always check your hardware, because block corruptions are mostly caused by layers below the database management system. To determine the real extent of the damage, also check your entire database.

The actions that need to be taken as a consequence of block corruptions depend on the affected database areas. The analysis should thus clearly identify the objects that the corrupted blocks belong to.

On Oracle, SAP note 365481 describes different options to proceed depending on the type of object. Additional options are available as follows. Please note that some options require expert knowledge!

- Restore and recover single corrupt blocks from an error-free backup with Oracle RMAN. This is possible even if RMAN was not used to perform the backup. This recovery is possible online.
- Restore and recover database from an error-free backup
- Additional options of SAP Support
  o Rebuild from redundant data (from other table, from the indexes, from other system, and so on)
  o Workaround with partial data loss (transformation of technical error to logical error)

Example: For an example of handling database block corruptions, see section 9.4.

Next step: Depending on the result and the method used to recover from block corruptions, recovery may finish at this stage or may require traversing additional recovery phases 2, 3 or 4.

7.3 Data Repair (Recovery-Phase 2)

Goal: Remove logical errors or inconsistent data inside a single system

Involved Organization: BC Team, Business, IT
  18. 18. Do not Perform Point-in-Time Recovery

Up to now, a commonly accepted method to remove logical errors from a system was to restore and recover the database to a point before the error occurred (database point-in-time recovery). The data loss that came along with this procedure was accepted in favor of the ease of reestablishing logical correctness of the system.

But nowadays, with business processes spanning federated system landscapes, this method is no longer appropriate in most cases!

While point-in-time recovery removes the logical errors, it introduces a new problem: cross-system data inconsistencies (see error category 3 described in section 4.3). So instead of having to repair logical errors inside a single system, cross-system inconsistencies have to be dealt with, which in most cases is an even more challenging task (see “Business Recovery” in section 7.4) than repairing the logical error directly inside the affected system.

The following table provides a comparison of “data repair” versus “point-in-time recovery plus business recovery”:

Criterion | Data repair of logical errors inside single system | Database point-in-time recovery followed by repair of cross-system inconsistencies
Required knowledge | Experts from affected application | Database administrators, experts from all application areas that exchange data with affected system
Outage | Outage of affected business processes | Outage of complete system and many cross-system processes
Duration of outage | Time required to fix error | Time required for database restore and recovery plus time required to fix cross-system inconsistencies

In situations where a point-in-time recovery was the easiest solution in the past, more investment into logical error resolution is advisable with federated system landscapes (see “Options for Data Repair”).

Strategy for Dealing with Logical Errors

To avoid point-in-time recovery of the production system, logical errors should be carefully analyzed and possible options to repair the errors should be evaluated. If the effort turns out to be very high, this should be compared against the effort and implications imposed by a point-in-time recovery (including all follow-up activities).

If a point-in-time recovery is nonetheless identified as the lesser evil, data consistency between systems needs to be re-established subsequently and the lost data needs to be re-entered in the system. Thus recovery phases 3 and 4 will be required.

The following flowchart depicts the steps for recovering from logical errors, coming from phase 2 of flowchart 1.1 (figure 2).
  19. 19. Figure 4: Flowchart 1.1.2: “Data Repair”

Options for Data Repair

The following general approaches are available to fix logical errors and will be described in more detail later:

- Reverse Engineering
- Recovery of lost data
- Check tools
- Doing nothing

Reverse Engineering

A typical method to resolve logical errors is reverse engineering, which means reverting the error step by step with the help of the experts (application, development, and so on). Reverse engineering can be supported by an analysis system that allows you to track back to the state when the error was not yet in the system.

Such an analysis system could be provided by different means:

- Perform a point-in-time recovery onto alternate hardware to the state before the error occurred (not a restore onto production)
- A standby database that has a sufficient delay to production can be rolled forward to the point before the error occurred
- Use the state of a (recently copied) test system to compare with the corrupted production system

As described in SAP note 434645, there are various possibilities that may be applicable to repair logical errors, for instance if:

- Data was corrupted by a malicious report
  o Develop a report to fix the data
  20. 20.   o Provide an analysis system and reconstruct original data from there
- An index is corrupted
  o Recreate index
- Wrong transports were imported into the system
  o Create and apply correcting transports
  o In case wrong table data was transported, reconstruct former table contents following the options above (see table deletion)
  o Reconstruct former ABAP sources

Recovery of Lost Data

If data was accidentally deleted (by deletion of a database table, drop of a table, deletion of table rows or attributes by a malicious report or human error), there may be several options for getting this data back without restoring the production system itself (as also listed in SAP note 434645).

- Provide an analysis system (as described above) and reconstruct original data or database table from there
- Reconstruct original data or database table from a standby database that is rolled forward to the point before the error occurred
- Oracle: Flashback table to SCN (if Undo-information is still available)
- Reconstruct table from redundant data in other tables
- Reconstruct table from redundant data in other systems
- Do without the data (for example, performance data of table MONI)

Example: For an example of recovering a deleted table from an analysis system, see section 9.3.

Check Tools for Specific Applications

SAP offers several tools or reports to check and repair data consistency of business data used by different applications. Checks are, for example, available for:

- Documents in SD and LE
- Inconsistencies in MM
- Inconsistencies between MM and FI
- Processes involving WM
- Processes involving PP
- Processes involving PS

For more information see the Best Practice “Data Consistency Monitoring within SAP Logistics” that will be available at http://service.sap.com/solutionmanagerbp.

Doing Nothing

In some special situations, logical errors may not require any further action, for example if the affected data is not vital for business operations.

- If non-critical data (like logfiles or monitoring data) was deleted, there is no need for recovery
- If non-critical data was corrupted, it could just be deleted

Further Steps

When finishing the data repair phase, the data of the affected system should be correct from a business process point of view.
  21. 21. However, in some situations, repairing a logical error may not have been able to recover all lost or corrupted information completely. If data repair came along with some data loss, further analysis will have to show:

- If the lost data has an impact on data consistency between the systems. In this case, data repair has to be followed by a phase of business recovery
- How important this lost data is and how the lost data can be re-entered into the system by a subsequent phase of data re-entry

Next step: Depending on the outcome and success of data repair, further recovery may require traversing the additional recovery phases 3 or 4.

7.4 Business Recovery (Recovery-Phase 3)

Goal: Remove inconsistencies between the systems of a system landscape

Strategy for Dealing with Cross-System Inconsistencies

The task of this phase is to deal with inconsistencies that occur between systems of the system landscape. Similarly, inconsistencies between systems and the real world need to be handled in this phase as well.

When dealing with cross-system inconsistencies, we need to know:

- Which systems are affected
- Which business processes are affected
- Which data objects are affected
- What is the impact on each business process

The tasks for removing such data inconsistencies include:

- Identifying inconsistencies by comparing possibly affected objects between possibly affected systems
- Filtering out temporary differences which do not constitute real inconsistencies
- Determining a strategy to fix the identified inconsistencies

Involved Organization: BC Team, Business

The following flowchart depicts the steps during business recovery, coming from phase 3 of flowchart 1.1 (figure 2).
  22. 22. Figure 5: Flowchart 1.1.3: “Business Recovery”

Options for Business Recovery

The following general approaches are available to remove inconsistencies between systems and will be described in more detail later:

- Application- or object-level options: addressing inconsistencies by comparing and fixing business objects
- Initial load: addressing inconsistencies by retransferring inconsistent data from a leading system
- Message-based approaches: addressing inconsistencies by repeating the message transfer

The most suitable approach must be determined for each business object and may thus be different for each type of inconsistency.

Dealing with Pending Messages

Handling of non-processed messages (pending messages) contained in the message queues of the involved systems plays an important role during business recovery because:

- They influence the distinction between real inconsistencies and temporary differences
- On the one hand, they may contain data that can be salvaged
- On the other hand, they may contain data that would lead to duplicates or logical errors if they were processed

Depending on the state of each of the systems involved in business recovery, it must be determined how each of these message queues has to be handled. The options are:

- Delete pending messages because the related data objects will be handled completely by the compare and resynchronization process
- Delete pending messages because they contain data that is already available in the other systems
  23. 23. For example, after an incomplete recovery, the outbound queues of the recovered system may be deleted because that data is already available in the connected systems (unless these queues were stopped and the messages were thus not processed).
- Process pending messages because they contain important information that can be rebuilt that way
  For example, after an incomplete recovery:
  o The inbound queues of the recovered system may contain valuable data that may need to be processed to recover that data.
  o The outbound messages in all other systems should be processed because they contain data that is not yet available in the recovered system. To preserve the correct order of these messages, however, it might be necessary to postpone their processing until all data objects have been compared and fixed.

Application- / Object-level Options

SAP offers a number of tools or reports to compare data objects between different systems. Many of these tools also allow fixing (deleting or re-transferring) inconsistent objects. The following tools are currently available from SAP:

- CRM: Data Integrity Manager (DIMa) to check and correct business objects between CRM and ERP as well as CRM and CDB (consolidated database for mobile applications)
- CRM: data exchange toolbox to check and correct one order documents (SAP note 718322)
- SCM: Tools to check internal and external consistency of business objects between APO and liveCache, and between APO and ERP

For more information see the documentation for each of these tools and the overview that can be found in the Best Practice “Data Consistency Monitoring within SAP Logistics” that will be available at http://service.sap.com/solutionmanagerbp.

If business objects are affected by inconsistencies for which no SAP tools are available, the following options may be evaluated:

- Compare objects with customer-developed tools
- Check for the availability of not officially released SAP “developer tools”
- Compare and fix objects manually
- Identify possibly affected objects by:
  o Evaluating creation or change date
  o Comparing mapping tables in both systems
  o Analyzing logfiles providing hints whether data was exchanged in the period in question (for example logfiles written by the Communication Station about data exchanged with CRM Mobile Clients)
  o Analyzing information about exchanged documents (for example BDoc message store in CRM)

Example: For an example of executing application-level business recovery as a consequence of data loss in a system of a federated landscape, see section 9.2.

Initial Load

By reloading inconsistent objects from a leading system, inconsistencies can be removed in one go. Initial load may thus be an option for specific business objects, for example master data, provided this does not result in the loss of important additional attributes that are maintained in the target system.

Message-based Approaches

Apart from comparing and correcting inconsistent business objects with object-specific methods, some cases may allow you to identify and correct inconsistencies by analyzing the message transfer that took place between the systems. A prerequisite is that the messages that were transferred during the
  24. 24. period in question are still available in the systems or system logs. The idea is to fix inconsistencies that are caused by lost data by re-sending previously processed messages. Note, however, that the maintenance of number ranges and newly assigned numbers may be an issue with this approach.

Whether information on messages that were exchanged is still available depends on the type of communication being used between the systems in question:

- ALE (IDOC)
  The “ALE Recovery Tool” (Transactions BDRL, BDRC) allows you to analyze and resend messages. For more information see: http://help.sap.com/saphelp_erp2005vp/helpdata/en/26/29d829213e11d2a5710060087832f8/frameset.htm
- RFC
  For RFC communication, an analysis of past message transfer is not possible since no logs are kept in the systems.
- BDoc
  For BDocs between CRM and ERP, information on the data exchange can be retrieved from the BDoc Message Store (transactions SMW01 and SMW02). For data transfer between CRM and mobile clients, the Mobile Client Log could be evaluated.
- Communication via SAP XI
  SAP XI keeps track of all messages from and to sending systems. In case of EO or EOIO messaging, information on the messages can also be found in the sending and receiving systems. For RFCs that were routed via XI, information will only be available in XI but not in the sending or receiving systems. Currently there is no tool available to support the analysis or re-sending of messages.
- File interfaces
  Data that was uploaded via file interfaces could be recreated by repeating the file upload, if the file is still available. Thus, data consistency with external applications could be reestablished.

Further Steps / Remaining Data Loss

If business recovery became necessary due to some data loss in a system of the landscape, business recovery using the above options might have been able to bring back some of the lost data by transferring it from another system that holds a second copy of the data. But usually, not all lost data can be recovered that way, for example:

- Data objects that were not exchanged with other systems
- Special attributes of data objects that do not exist in the other systems where the object was replicated from

Next step: If business recovery was not able to completely recover all lost data, further recovery will proceed with phase 4, trying to re-enter lost data into the system.

7.5 Data Re-entry (Recovery-Phase 4)

Goal: Get back lost data in a single system that could not be recovered by previous phases

Description

In phase 1, an incomplete database recovery might have caused the loss of all data for a complete period of time, whereas the resolution of database block corruptions might have caused a very isolated, partial loss of objects.

Data repair for logical errors in phase 2 might also have caused some loss of data objects or attributes.
  25. 25. Business recovery in phase 3 might have been able to get back some lost data from other systems of the environment, but in most cases not all of it.

At this stage, the data available in the systems should be consistent within and between the systems. So what is left now is the task of re-entering any information into the system(s) that is still lost after (or due to) the previous recovery phases.

Involved Organization: BC Team, Business, Key users

Procedure

In general, the knowledge about which data is lost should be quite exact at this point in time, or at least the period that is affected by the data loss can be narrowed down very well. So, now, any such data should be re-entered into the system as comprehensively as possible.

There are two options to approach the data re-entry phase:

- Key users get access to the system and re-enter lost data before the system is returned to regular operation
- Data re-entry is postponed and done by some key users or by the normal business users after handover to production

The best time and method to re-enter lost data depends on the nature of the affected data, and both approaches may be taken, depending on the business object.

The following options may apply to get back lost information:

- Users enter the data from written notes or from memory
- Data is re-entered from an external input stream (batch input, file input, upload tools, transports, and so on)
- Data is recovered from a copy of the old, corrupt production system
- Data is recovered from a (recently copied) test system

Next step: Recovery has finished and checks should verify that recovery was indeed successful and that the system is ready to return to productive operation.

8 Returning to Normal Operation (Steps 11 to 16)

Checks

When recovery and error resolution have finished, checks are needed to verify whether the system has really reached an error-free state that allows returning to normal operations.

Checks should verify:

- Functional operability of business processes
- Correctness and consistency of business data

The decision whether the system is ready for production is based on the results of these checks. It should be taken by the business continuity manager, together with application management.

If recovery quality was not sufficient to return to production, recovery must be continued until a satisfactory state is reached.

In specific situations, it may be possible to “partially” hand over the system, which means that some business processes might be excluded for a limited time. This could be the case, for example, if the system in general has reached a sound state that allows continuing business operations with the exception of some business processes that need further fixing. A prerequisite for partial handover is a clear separation of the released and the non-released processes and their data.
  26. 26. Involved Organization: BC Team, Key users, Senior Management

Handover to Production

After handover to production, regular users are allowed to log on to the system and work with their usual functionality. All established workarounds will be called off.

Completing Data Integrity

After handover to production, some follow-up activities may still be required, for example to:

- Re-enter lost data into the system that was purposefully not yet covered during recovery phase 4 (section 7.5).
- Allow business users to identify and re-enter lost data that was not identified by the recovery team. To enable users to check their data, they need to be informed in detail about the critical period of recovery.
- Integrate data that was created while using the workaround processes back into the regular system. Depending on the nature of the workaround or alternate process, such data may be available on paper or in other systems.

Involved Organization: BC Manager, Users

Leave Disaster Status

When users signal that all data has been recovered to the best of their knowledge, and when the remains of workarounds have been successfully integrated back into the productive system, this disaster case can be closed.

Lessons Learned

Having left the disaster status, follow-up activities should further investigate the root cause of the disaster with the goal of avoiding similar situations in the future. To learn for future emergencies, the complete disaster handling process should be reviewed to identify possible areas of improvement in the business continuity plan.

Involved Organization: BC Manager
  27. 27. 9 Examples

9.1 Example 1: Media Failure

Error Scenario:

The SAP system runs on an Oracle database located on a RAID-protected storage system. Two disks out of the same RAID group fail. Multiple Oracle tablespaces are located on this RAID group. A backup containing the lost files and the complete change logs (online and offline redo logs) is available for restore and recovery.

Figure 6: Recovery Flow for Example 1

Recovery Phase 1:

Execute restore and complete DB recovery

The Oracle database can only be mounted, not opened, because datafiles are missing. By accessing the view v$recover_file in mount status, you can find out which datafiles need to be restored from a backup.

The latest backup taken before the crash containing the missing datafiles is identified in the directory /oracle/<SID>/sapbackup. The missing datafiles are restored with SAP's tool BRRESTORE.

The backup logfile contains information about the redo log in use when the backup was taken. The database view v$log shows which redo log file is the current one.

All redo log files that are no longer available on disk are restored from tape with SAP's tool BRRESTORE.

After making all files available on disk, a recovery of the database is started with “recover database;” in the Oracle tool sqlplus.
  28. 28. Details of this procedure can be found in note 4161.

As a rough estimate, the restore time for datafiles and redo logs can be assumed to be approximately the same as the backup runtime.

During log recovery, approximately 50-500 MB of redo log volume can be recovered per minute. The actual rate depends on the hardware and can be estimated much better after the first redo logs have been applied.

Recovery Phase 2:

There are no logical errors in the application data of this system and recovery phase 2 is not needed.

Recovery Phase 3:

There is no data loss. Since the database was recovered completely (including the latest committed transaction), all messages that were being exchanged between the systems when the crash occurred reflect exactly the state at that point in time. All messages are restartable and can be processed as before. Recovery phase 3 is not needed.

Recovery Phase 4:

No data was lost; recovery phase 4 is not needed.

9.2 Example 2: Media Failure and Database Recovery Failure

Error Scenario:

This example assumes the same error scenario as example 1. However, in this case, complete recovery is not possible. During database log recovery, it turns out that one logfile needed for recovery is defective and cannot be applied to the database. Therefore, recovery needs to be aborted.

Analysis of the timestamps of the logfiles shows that the recovered state of the database lies 2 hours before the time when the database crashed. This means that 2 hours of business data are lost and cannot be recovered by technical means.
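The rule of thumb from example 1 (restore takes about as long as the backup ran, log recovery proceeds at roughly 50-500 MB per minute) can be turned into a quick best-case/worst-case calculation. The sketch below is only illustrative: the function name and the input figures are assumptions, and the real rate should be re-measured once the first redo logs have been applied.

```python
from typing import Tuple


def estimate_recovery_minutes(
    backup_runtime_min: float,
    redo_volume_mb: float,
    slow_mb_per_min: float = 50.0,
    fast_mb_per_min: float = 500.0,
) -> Tuple[float, float]:
    """Return (best case, worst case) duration of restore plus log recovery.

    Restore time is approximated by the backup runtime; log recovery time is
    the redo volume divided by the assumed recovery rate.
    """
    if slow_mb_per_min <= 0 or fast_mb_per_min <= 0:
        raise ValueError("recovery rates must be positive")
    best = backup_runtime_min + redo_volume_mb / fast_mb_per_min
    worst = backup_runtime_min + redo_volume_mb / slow_mb_per_min
    return best, worst


if __name__ == "__main__":
    # Hypothetical figures: a 2-hour backup and 10,000 MB of redo to apply.
    best, worst = estimate_recovery_minutes(120, 10_000)
    print(f"best case: {best:.0f} min, worst case: {worst:.0f} min")
```

With these hypothetical inputs the estimate spans 140 to 320 minutes, which shows how strongly the outage duration depends on the achievable log recovery rate.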
  29. 29. Figure 7: Recovery Flow for Example 2

Recovery Phase 1:

Database recovery ended with an incomplete recovery of the database. The database and the SAP system can be started, but 2 hours of data are lost. Nonetheless, the system is in a consistent state (as it was 2 hours before the media failure).

Recovery Phase 2:

There are no logical errors in the application data of this system and recovery phase 2 is not needed.

Recovery Phase 3:

The data loss has an impact on data consistency between the systems of the landscape. Therefore, business recovery is required.

Example business scenario: The ERP system is the leading system for material master records. Most material master records are created or changed in ERP and subsequently loaded to the CRM system. However, inside CRM the users also create competitive materials to track similar products of competitors for statistics about lost sales opportunities.

The CRM system had an incomplete recovery and now has a state that is two hours older than the ERP system, implying that some material updates are now missing.

As part of the recovery, most material masters can be recreated by a repeated load from ERP to CRM. For instance, you could use report RSSCD100 to evaluate all change documents for ERP materials, creating a list of materials for a corrective load. First, check whether recently changed materials are still in their messaging phase (that is, ERP outbound queue, CRM inbound queue or CRM inbound BDoc in validation error) to identify temporary differences. With the remaining list of materials, you can create a so-called request download definition in CRM Middleware to extract materials from ERP and load them into CRM again.
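The selection logic described for the corrective load can be sketched as follows. This is only an illustration of the reasoning, not SAP code: it assumes that the change documents have already been exported into a mapping of material number to last change time, and that the material numbers still sitting in queues or in BDocs with validation errors have been collected separately. All names and sample values are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Set


def corrective_load_candidates(
    last_change: Dict[str, datetime],
    recovered_state: datetime,
    pending: Set[str],
) -> List[str]:
    """Return the materials that should be reloaded from the leading system.

    last_change:     material number -> time of the last change in ERP
    recovered_state: point in time to which the CRM database was recovered
    pending:         materials still in their messaging phase; these are
                     temporary differences, not real inconsistencies
    """
    return sorted(
        material
        for material, changed_at in last_change.items()
        if changed_at >= recovered_state and material not in pending
    )


if __name__ == "__main__":
    crash = datetime(2008, 5, 5, 14, 0)
    recovered = datetime(2008, 5, 5, 12, 0)  # two hours before the crash
    changes = {
        "MAT-001": datetime(2008, 5, 5, 11, 30),  # before the lost period
        "MAT-002": datetime(2008, 5, 5, 12, 45),  # lost in CRM
        "MAT-003": datetime(2008, 5, 5, 13, 50),  # still in a queue
    }
    print(corrective_load_candidates(changes, recovered, {"MAT-003"}))
```

Materials that are excluded as pending will reach CRM once their messages are processed, so reloading them as well would only risk duplicates or overwrites.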
  30. 30. Recovery Phase 4:

Some data is still lost and now needs to be re-entered manually.

The competitive materials were not available in the ERP system and therefore could not be recreated automatically. You would have to manually recreate the lost competitive materials.

Checks:

Afterwards, it is recommended to run a full comparison of material masters between ERP and CRM with the CRM DIMa tool (Data Integrity Manager).

9.3 Example 3: Lost Data

Error Scenario:

This example illustrates the handling of a logical error. We assume that due to a user fault, the data of a complete application table was deleted.

Imagine the CRM table CRMM_TERRITORY was dropped by a user error. Now, the complete CRM Territory Management application becomes unusable.

Recovery Phase 1:

This phase is not applicable. Since the error is categorized as a “logical error”, recovery starts with phase 2.

To demonstrate that logical errors can require very different measures during the course of recovery, the example is now separated into cases 3a, 3b and 3c.

9.3.1 Example 3a: All Data Can be Recovered

Figure 8: Recovery Flow for Example 3a
  31. 31. Recovery Phase 2:

Recovery is done using the following procedure:

Figure 9: Handling of Logical Errors Using an Analysis System

Steps:

1. Block user access
2. Unload / export new data (if applicable)
3. Restore database to an “analysis system”
4. Recover analysis system close to the error
5. Unload / export data from analysis system
6. Insert data into production (repair)
7. Merge rescued data with repaired data

Attention: Details depend on many factors (like error type, affected objects, and so on), so careful analysis and application knowledge is required.

Result of recovery phase 2 in example 3a: Recovery of data in table CRMM_TERRITORY from the analysis system is completely successful, no data was lost. Thus, recovery phases 3 and 4 are not required.

9.3.2 Example 3b: Remaining Data Loss is Only Locally Relevant

Recovery Phase 2:

Recovery of table data could not bring back the table completely. Further analysis of the data loss shows that this does not impact objects or attributes that are exchanged with other systems. Thus, recovery phase 3 is not required, but phase 4 needs to be applied to re-enter the lost information.

As in the example above, the CRM table CRMM_TERRITORY was dropped by a user error and the Territory Management application becomes unusable on the CRM system. However, a complete recovery failed and only a part of the table was restored to its original state.
  32. 32. Figure 10: Recovery Flow for Example 3b

Recovery Phase 3:

Not required for example 3b.

Recovery Phase 4:

Now, the remaining missing table entries of CRMM_TERRITORY need to be recreated manually. The complete list of key fields (territory GUIDs) is still available in related tables, for example in the territory structure table CRMM_TERRSTRUCT or in the territory validity table CRMM_TERRITORY_V. With good knowledge of the data model, there is a chance of reconstructing the missing entries of the main table.

9.3.3 Example 3c: Remaining Data Loss Causes Cross-system Inconsistencies

Recovery Phase 2:

Recovery of table data could not bring back the table completely. Further analysis of the data loss shows that the lost objects are relevant for data exchange with other systems. Thus, recovery phase 3 (business recovery) is required to reestablish data consistency between the systems.

We again take the example above with the loss of CRM table CRMM_TERRITORY. As in example 3b, a complete recovery failed and only a part of the table was restored to its original state. However, the situation is more complex now, because in the new scenario the Territory Management application runs not only on the CRM server but also on CRM Mobile Clients (laptop application). We thus need to reestablish data consistency between the CRM Server and the CRM Mobile Clients.
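For the manual re-entry in example 3b, the first task is to find out which territory keys are missing from the partially restored main table. The sketch below shows one way to derive that list, assuming the key columns of the main table and of the related tables have already been extracted into plain lists; the function name and the sample GUIDs are hypothetical, and only the table names come from the example.

```python
from typing import Dict, Iterable, List


def find_missing_keys(
    main_keys: Iterable[str],
    related_keys: Dict[str, Iterable[str]],
) -> Dict[str, List[str]]:
    """Map each key missing from the main table to the related tables that
    still reference it (and can therefore help to reconstruct the entry)."""
    present = set(main_keys)
    missing: Dict[str, List[str]] = {}
    for table, keys in related_keys.items():
        # Deduplicate per table so that each table is listed once per key.
        for key in sorted(set(keys)):
            if key not in present:
                missing.setdefault(key, []).append(table)
    return missing


if __name__ == "__main__":
    restored = ["GUID-A", "GUID-B"]  # rows that survived in CRMM_TERRITORY
    related = {
        "CRMM_TERRSTRUCT": ["GUID-A", "GUID-B", "GUID-C"],
        "CRMM_TERRITORY_V": ["GUID-B", "GUID-C", "GUID-D"],
    }
    for key, tables in find_missing_keys(restored, related).items():
        print(key, "is referenced in", ", ".join(tables))
```

The resulting list only identifies which entries have to be recreated; the attribute values themselves still have to come from application knowledge or from another source.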
  33. 33. Figure 11: Recovery Flow for Example 3c

Recovery Phase 3:
As a first important step, we need to identify the affected systems. In our example, the connected CRM Mobile Clients get a periodic update of the territory structure by a regular background run of program CRM_TERRMAN_DOWNLOAD. This updates the CDB (consolidated database for the mobile scenario) and sends delta messages to the Mobile Clients.

As an emergency step, it is advisable to deactivate this delta update job to prevent uncontrolled distribution of incomplete table entries to the Mobile Clients.

On the other hand, the CDB database can be a source for reconstructing missing entries of the CRM table. It has all entries up to the last delta update in a comparable data model (table SMOTERR and others). Therefore, the keys and attributes of the territories can be collected there and serve as a source for manual recreation of the missing territories. For a larger number of missing records, it is also feasible to develop a small ABAP program for this task.

Recovery Phase 4:
As a last step, we need to identify which territories were not recreated. Records newly created after the last delta update run have to be recreated manually without any reference system.

9.4 Example 4: Database Block Corruptions

Error Scenario:
During normal system operation, corrupt table blocks in an Oracle database are discovered.

A full consistency check on all database files is triggered (as described in SAP note 23345), returning no further corruptions.

No regular consistency check was done in the past, so it cannot be guaranteed that there is a backup available containing an old, uncorrupted version of the blocks. A consistency check now performed
  34. 34. on the oldest backup available shows the same corruptions. Restore and recovery of single corrupt datablocks or datafiles is therefore not an option.

The hardware partner of the customer and the hardware partner's SAP Competence Center are involved to check the hardware.

Figure 12: Recovery Flow for Example 4

Recovery Phase 1:
Trying to access data in non-corrupt blocks fails when a corrupt block has already been read. To restore full access, at least to the non-corrupt data, a new version of the table needs to be created containing all data from the non-corrupt blocks, but without any corrupt blocks. In general, this can be achieved by reading “around” the corrupt blocks and copying everything from the corrupt table to a “clean” table. After renaming, the “clean” table finally becomes the original table and contains the readable data from the corrupt table minus the data from the corrupt blocks. The new table is now corruption-free, but some data was lost.

This example shows that solving a technical issue can imply the need for corrections in further recovery phases.

Block corruptions are different from other technical failures because the actions in the HW- and DB-related phase may depend on the possible actions in a later recovery phase. Therefore, it is necessary to involve application experts already in the first phase, to decide how each corrupt table should be dealt with. Depending on the specific table, options other than the above procedure of copying the table and losing the corrupt rows may be possible:
- If all columns of the table are in at least one index, retrieve the data from the indexes without reading the table blocks
- If the table contains redundant data from another table, create a new, empty version of the table and refill it with application reports (for example, for tables BSIS, BSAS and so on) or on
  35. 35. the fly, during normal system operation (for example, for ABAP-Load or ABAP-Dynpro tables)
- Recreate the table empty if it contains data that may be less important and will not cause any harm by being deleted. Candidates for this category are tables containing log information, statistics, and so on (for example, MONI).

In the above scenario, we assumed that no special handling for the table was possible. As much data as possible needs to be copied to the clean table. As much information as possible has to be gathered about the lost rows (for example, the primary keys of the lost rows), so that in later recovery phases, the logical inconsistencies can be repaired efficiently by application experts.

While preparing for the copy of the corrupt table, the system is still in use. Only the business processes that access exactly the corrupted data of the corrupt table cannot be executed.

Copying the data to the clean table requires system downtime, because during the copy, concurrent updates to non-corrupt areas of the table need to be avoided. The duration of the downtime can be estimated by a test copy (copied data is deleted afterwards) while the system is still in use.

The duration of the copy largely depends on the hardware and on unforeseen problems caused by the corruption. In the worst case, the copy gets stuck and Oracle support has to be involved. Therefore, as always in such cases, SAP insists on performing a test copy. After the test run, an appropriate downtime is planned as soon as possible.

For copying the data, creating the indexes on the clean table and gathering the information about the lost rows, the SAP internal tool Clean Copy for Oracle (note 796399) is used.

Once the physical recovery of the table is done, the application experts can continue with the next recovery phase. They receive information on how many table rows were lost, together with a current list of the key values of the lost rows.
They may now decide that further downtime is required during the next recovery phases, or that it is possible to go live with some restrictions after finishing phase 1 and to execute the following recovery phases while the system is already released for production.

Recovery Phases 2, 3 and 4:
Depending on the achievable quality of error resolution resulting from phase 1, it may be required to proceed with further recovery phases. In comparison to example 3, cases similar to 3a), 3b) and 3c) might be considered. This possible flow of error handling is indicated as “possible” in figure 12. Since the handling of these errors is generally done in the same way as described in example 3, we do not repeat it here.
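The idea behind phase 1 of example 4 (copy everything readable to a clean table and record the keys of the lost rows) can be illustrated with a small in-memory model in Python. This is a sketch only and does not represent the Clean Copy for Oracle tool or any Oracle interface. The block model, the function name and the assumption that an intact index supplies the full list of primary keys are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# A block is a list of (primary key, payload) rows; None marks a corrupt,
# unreadable block in this simplified model.
Block = Optional[List[Tuple[str, str]]]


def clean_copy(
    blocks: List[Block],
    index_keys: Set[str],
) -> Tuple[Dict[str, str], List[str]]:
    """Copy all rows from readable blocks and list the keys that were lost.

    index_keys is assumed to come from an intact primary key index.
    """
    clean: Dict[str, str] = {}
    for block in blocks:
        if block is None:
            continue  # read "around" the corrupt block
        for key, payload in block:
            clean[key] = payload
    # Keys known to the index but absent from the clean copy are lost rows
    lost_keys = sorted(index_keys - set(clean))
    return clean, lost_keys


# Hypothetical sample: the second block is corrupt and held row K3
blocks = [[("K1", "a"), ("K2", "b")], None, [("K4", "d")]]
clean, lost_keys = clean_copy(blocks, {"K1", "K2", "K3", "K4"})
```

The list of lost keys corresponds to the information that the application experts receive for the following recovery phases.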
  36. 36. Appendix
A - Flowcharts for Printout
This appendix repeats the flowcharts provided in this document for print-out.
  37. 37. Flowchart 1: “Emergency Handling”
Flowchart 1.1: “Recovery Phases”
  38. 38. Flowchart 1.1.1: “Technical Recovery”
Flowchart 1.1.2: “Data Repair”
  39. 39. Flowchart 1.1.3: “Business Recovery”
  40. 40. © Copyright 2008 SAP AG. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice.

Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors.

Microsoft®, WINDOWS®, NT®, EXCEL®, Word®, PowerPoint® and SQL Server® are registered trademarks of Microsoft Corporation.

IBM®, DB2®, OS/2®, DB2/6000®, Parallel Sysplex®, MVS/ESA®, RS/6000®, AIX®, S/390®, AS/400®, OS/390®, and OS/400® are registered trademarks of IBM Corporation.

ORACLE® is a registered trademark of ORACLE Corporation.

INFORMIX®-OnLine for SAP and Informix® Dynamic Server™ are registered trademarks of Informix Software Incorporated.

UNIX®, X/Open®, OSF/1®, and Motif® are registered trademarks of the Open Group.

HTML, DHTML, XML, XHTML are trademarks or registered trademarks of W3C®, World Wide Web Consortium, Massachusetts Institute of Technology.

JAVA® is a registered trademark of Sun Microsystems, Inc. JAVASCRIPT® is a registered trademark of Sun Microsystems, Inc., used under license for technology invented and implemented by Netscape.

SAP, SAP Logo, R/2, RIVA, R/3, ABAP, SAP ArchiveLink, SAP Business Workflow, WebFlow, SAP EarlyWatch, BAPI, SAPPHIRE, Management Cockpit, mySAP.com Logo and mySAP.com are trademarks or registered trademarks of SAP AG in Germany and in several other countries all over the world. All other products mentioned are trademarks or registered trademarks of their respective companies.

Disclaimer: SAP AG assumes no responsibility for errors or omissions in these materials.
These materials are provided “as is” without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.

SAP shall not be liable for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. SAP does not warrant the accuracy or completeness of the information, text, graphics, links or other items contained within these materials. SAP has no control over the information that you may access through the use of hot links contained in these materials and does not endorse your use of third party Web pages nor provide any warranty whatsoever relating to third party Web pages.