  • 1. Office of the University Auditor Old Dominion University Glenn R. Wilson, IT Audit Manager BUSINESS CONTINUITY – DISASTER RECOVERY Planning and Methodology Guide For Developing Departmental Business Continuity – Recovery Programs Oct 2005 Rev B Plans are nothing… Planning is everything… - Dwight D. Eisenhower
  • 2. TABLE OF CONTENTS
Section Page
I. Introduction 3
II. Overview: Business Continuity – Recovery Program 3-4
III. Business Analysis 5-6
IV. Business Analysis Activities 6
V. Plan Initiation Activities 6-7
VI. Scope of Plan 8
VII. Planning Assumptions 9
VIII. Scaling Levels of Disaster 10
IX. Collecting Data and Information 10-11
X. Planning Objectives 11
XI. Setting Realistic Goals 11
XII. Determining Critical Needs 12
XIII. Priority of Processing 13
XIV. Basic Items of Focus 13
XV. Know The Cost Of Downtime 13
XVI. How Much Risk Is Acceptable? 14
XVII. Events to Consider 14
XVIII. Information Technology Considerations, Classification of Data 15
XIX. Records Retention 16-17
XX. Insurance Considerations 17
XXI. Strategy and Plan Development 18
XXII. Response Teams 19
XXIII. List of Deliverables 20
XXIV. The Plan Document 20-22
XXV. Document Distribution 22
XXVI. Maintenance and Support 22-23
XXVII. Testing and Training 23
XXVIII. Critical Success Factors 24
XXIX. Best Practices 24
XXX. Stepping Through The Creation Of The Plan 24-28
XXXI. Summary and Conclusion 28
APPENDICES
A Business Continuity Glossary 24-42
B Online Sources of Additional Information 42

I Introduction
  • 3. The goal of business continuity and disaster recovery is to mitigate financial, operational, and business impacts to a business unit and to ensure its survivability under various scenarios. It assures that core processes will either be continued or effectively and efficiently restored in accordance with the business mission, which directly assures the overall success of the organization.

Owing largely to increased reliance on information technologies (IT), Contingency Management has been supplanted by the concept of Business Continuity, which focuses on the resiliency of people, processes, workspace, systems, safety and communication, and on new planning scenarios: loss of life, lack of decision makers, interruption of transportation, building evacuation, loss of physical assets and workspace, lack of communications, crisis command centers, terrorism, bioterrorism, etc.

Traditionally, data centers or offices of computing services alone have borne the responsibility for providing contingency planning. Frequently, this has led to the development of recovery plans to restore computer resources in a manner that is not fully responsive to the needs of the University and/or its customers supported by those resources. Contingency planning is a business issue rather than strictly an IT issue. Long-term operations outages often result in impacts of catastrophic proportions. The development of a viable continuity and recovery strategy for the University must be a product of the collective planning of not only the University’s data center, communications and operations centers, but also the users and customers of those services who directly support the success of the University, and management personnel who determine acceptable levels of risk and bear responsibility for protecting the University’s assets.
This guide is meant to facilitate the planning, development and maintenance of a comprehensive Business Continuity – Disaster Recovery Plan (BCRP) at the departmental level. The collection of these, anchored to OCCS’ own IT asset focused DRP program will serve to ensure the success of the University. When used herein, “Agency” refers to the individual College, Office or Department, “Unit” refers to the various areas comprising the University and “University” refers to Old Dominion University. Note that Old Dominion, as an Agency, is now responsible for reporting to the Governors office. Participation in this BCRP program is essential to the University's success. II Overview: Business Continuity – Recovery Program For most agencies, services to their customers would effectively cease if the core processes supported by key systems, resources and assets were inaccessible or unavailable for an unacceptable period of time. Each agency should establish risk management and disaster recovery planning processes for identifying, assessing, and responding to the risks associated with loss of ability to execute its core processes. The University’s requirements for continuity and recovery planning should be addressed through University wide efforts to develop and maintain a Business Continuity Recovery Program. The Business Continuity Recovery Plan is one of the main deliverables of this program. Other constituents of a comprehensive program include the Risk Assessment Analysis (RAA), Business Impact Analysis (BIA), Testing, Training and Support. Page 3 of 42
  • 4. Risk: Assessment And Analysis Risk may be defined as the potential for loss; man-made or natural. Potential is measured in probability of occurrence, which may vary with activities, geographic region, locale, world events or seasonal change. The Risk Assessment Analysis considers a range of possible disasters and applies a risk factor to each type. Each area is analyzed to determine the degree of risk associated with the various types of disasters, such as natural, technical, human and societal threats. Based on the core functions and processes necessary to continue business operations, assets are identified, categorized as being either critical, essential or administrative in nature and prioritized in order of their contributory importance to their respective functions and processes. Existing controls that serve to reduce the exposure of these are delineated. The cost for each control, the level of risk and the degree to which they effectively mitigate against known risks are analyzed, resulting in recommendations for new measures/controls or updates to process design. Planning steps for the risk assessment/analysis should include: • Evaluation of exposures and existing controls, identification of potential improvements and creation of new procedures or measures to mitigate against exposure thereby reducing residual risk to an acceptable level. Evaluation of mission critical business applications. • Systems, business applications, datasets, networks, user processes, dependencies between functions, departments, systems and applications and reliance on relationships with other agencies or external services. A proper Risk Assessment Analysis will contain a projected timeline for implementing recommended measures or list the constraints and additional resources that would need to be acquired to accomplish such. Asset: Definition For continuity and recovery planning purposes an asset is defined as anything of material value or usefulness. 
This definition may be refined to “anything of material value or usefulness that is required for the major functional and business activities of the University”, including staff, equipment, facilities, IT resources, office furniture, etc. The Risk Assessment is an analysis that carefully considers the degree of value or usefulness of a given asset and categorizes it as Critical, Essential or Administrative. Critical assets have the greatest potential business impact, and measures against their loss are implemented commensurate with the nature and degree of their exposure to loss or damage.

Critical Asset: Definition

An asset is critical to the degree that a core process depends upon it for its function and output. Critical assets may include people, computers, datasets, paper documents, specialized equipment, or other resources which are not commodity items. They may or may not be readily replaceable, but it is certain that in their absence the process would not continue and the agency’s (not necessarily the University’s) ability to fulfill its business mission would cease or be significantly degraded. The critical classification may be extended to include items which do not affect day-to-day operations and yet are of critical importance because of the impact of their loss, which may include a damaged reputation resulting in decreased revenue, as might be the case with a very service-oriented mission.
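The risk factor described under "Risk: Assessment And Analysis" can be illustrated with a small ranking exercise. The sketch below is only one possible reading: the guide does not give a formula, so the product of probability and impact, the 1 to 5 impact scale, and the sample threats and figures are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    name: str
    probability: float  # assumed annual probability of occurrence, 0.0 to 1.0
    impact: int         # assumed impact rating, 1 (minor) to 5 (catastrophic)

    @property
    def risk_factor(self) -> float:
        # Assumption: risk factor = probability x impact (not defined in the guide).
        return self.probability * self.impact


def rank_threats(threats):
    """Return the threats ordered from highest to lowest risk factor."""
    return sorted(threats, key=lambda t: t.risk_factor, reverse=True)


# Hypothetical sample data, for illustration only.
threats = [
    Threat("hurricane", 0.10, 5),
    Threat("power outage", 0.50, 2),
    Threat("server failure", 0.25, 3),
]

ranked = rank_threats(threats)
for threat in ranked:
    print(f"{threat.name}: {threat.risk_factor:.2f}")
```

A ranking like this is a starting point for discussion with management, not a substitute for the cost and control analysis the guide describes.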
  • 5. Practically speaking, an asset is something a department cannot do without. The Business Impact Assessment The BIA delineates the business, operational and financial impacts and other consequences of potential losses and discontinuities. These most often include revenue, productivity, public image/trust, regulatory compliance, contractual obligations and represent the adverse fulfillment of the risks previously identified. It identifies the operational (qualitative) and financial (quantitative) impact of a disrupted or inaccessible core process on an agency’s ability to conduct its critical business processes. This analysis forms the basis for the formulation of viable continuity and recovery strategies that will be activated when necessary to restore operations within the required time frames. Where appropriate and warranted a business impact analysis, operational impact analysis, and financial impact analysis are developed for each core business process. The actual procedures to restore the core process in part or in whole are written into the deliverable of the Business Continuity and Recovery Plan. The procedures, when activated should reasonably assure achievement of recovery objectives. Practically speaking, a B.I.A. is the process of determining how damaged operations would be over time if the assets were not available. In summation, the Business Continuity Recovery Plan is the actual set of detailed procedures that would be followed or deferred to under sudden or impending conditions that threaten or damage the ability of the University to carry on one or more of its core functions. It includes details about core processes, inputs, outputs, information flow, personnel roles and responsibilities, contingency arrangements, emergency response, communication and coordination. The Risk Assessment and Business Impact Analyses would be attached as supporting documents to the plan. 
III The Business Analysis The Business Analysis identifies and describes critical, essential and administrative core processes, and the high-level resources that support these functions. This analysis enables us to confirm the managers’ description of their operations and highlight functional dependencies and single points of failure. The analysis may be in the form of a separate attachment or included within the body of the Business Continuity and Recovery Plan, but it is essential to the overall success of the plan. Most agencies are structured along functional boundaries (accounting, information technology) and the core processes within those units (payroll, accounting). In reality, however, an agency’s business is conducted through one or more business processes. A business process describes a set of recurring activities- a flow of information and/or materials-that produce an output-something of value for the end user or customer. A process usually contains multiple functions. The most straight forward approach is to analyze the University in terms of its core processes. It is critical to understand the relationships between those core processes and the end user or customer’s level of expectations in order to analyze the impact of an interruption of a given function. Page 5 of 42
  • 6. For each core process, define the Maximum Acceptable Outage (MAO)1; the point at which resource(s) and functions must be restored. Describe the qualitative and quantitative impact for an outage of each core function assigned MAO. Decide whether that level of impact is acceptable or if the MAO needs to be adjusted. The MAO is based on the University’s mission, not whether there are current resources and/ or procedures to achieve it. If there is a gap between the MAO and projected outcome, a gap analysis with a plan and timeline to close it should be presented. Core business processes should be divided into one of three categories: • Critical • Essential • Supportive or Administrative Refer to Section XIII Priority of Processing for detailed definitions of the above categories. IV Business Analysis Activities The high level approach to the Business Analysis consists of gathering information about core processes, documenting business flows, identifying customers, and gaining confirmation of the information. The first step of the analysis is to identify the core business processes performed by the University, and understand the high level flow of information, materials, and services through these core processes. The specific approach to understanding these core processes and business flows is: 1. Review relevant documentation (critical success factors, strategic plans, budgets, performance, measurements, IT plans, division goals, organizational charts) to build an understanding of organization purpose and structure. 2. Conduct interviews with agency leadership members to collect information on their “first hand” perspectives on how the University operates. It is important to note that these interviews will serve as data-gathering opportunities for all steps of the Business Impact Analysis. 3. Compile the results of your interviews in the form of business flows. 
These flows should describe each core process and the flow of information, services, or goods into and out of the process, and include considerations for the end user or customer.
4. Develop descriptions of support functions. Some functions within the University may perform important roles that contribute indirectly to the University’s ability to implement its assigned programs. These are classified as support or administrative functions.
5. Confirm understanding of the University, its core processes, and its business flows with appropriate management through review of the descriptions of the core processes performed.
6. Document, stepwise or as a flowchart: Business Priorities and Drivers => Business Processes => Applications and Infrastructure => Operations and Management

V Plan Initiation Activities

First, create a steering committee whose members represent key owners and managers of the critical and essential operations of the University. Identify the individual(s) who will form the Project Team responsible for the development and implementation of the University’s Business Continuity Recovery Plan.

1 Other risk analysis literature may identify this as the Maximum Allowable Downtime (MAD).

  • 7. The authority to make high-level decisions on behalf of the University should be defined and bestowed upon these persons. It is preferable to designate one or more of the University’s top-level managers for this responsibility. Hold a kickoff meeting to present to all staff the goal and importance of participating in the creation of the plan.

Why Business Continuity Recovery Planning Is Important:
• Minimizing potential economic loss
• Decreasing potential exposures
• Reducing the probability of occurrence
• Reducing disruptions to operations
• Ensuring agency stability and survivability
• Providing for an orderly recovery
• Minimizing insurance premiums
• Reducing reliance on certain key individuals
• Protecting the assets of the University
• Ensuring the safety of personnel and visitors
• Minimizing decision-making during a disastrous event
• Minimizing both legal and regulatory liability

The Steering Committee should work closely with the Project Team to:
• Review operations
• Identify business functions within the University
• Document the level of information, goods and services that relate to the business functions
• Determine the users served by each business function
• Identify legal and regulatory requirements
• Review special and unusual requirements
• Identify the maximum acceptable outage period
• Determine the consequences of not processing
• Determine critical equipment requirements
• Document dependencies
• Analyze work flow
• Evaluate the security of vital records and data
• Analyze record retention policies and procedures
• Determine the recovery timeframe required for each process
• Identify maximum acceptable lengths of service interruption
• Assess and document current file backup and recovery capabilities
• Determine and document calculated assumptions

Agencies must also complete the following operational activities to ensure a comprehensive Business Continuity Program exists:
  • 8. • Completion of an emergency response plan, information technology risk survey, and a security analysis with plans to close any identified gaps. • Identify new or modified operating procedures to increase continuity. • Review and modification of data backup and off-site storage procedures. • New or modified restoration procedures. • Development of alternate procedures for use during a disaster. • Negotiating and implementing contracts and other provisions as needed. • The development of alternate facilities and equipment. • Developing step by step recovery scripts, which guide an employee through the procedures necessary to recover a given service, resource or system. • Standards, forms, and guidelines for standard procurement procedures, available from the University’s procurement group or state procurement office. • Procedures to recreate or recapture information that may be lost during disaster (records, recent transactions, work in progress) • Detailed team definition and procedures including responsibilities and time line oriented task definitions. • Organizational information (Organizational charts, job descriptions etc). VI Scope of Plan Although most continuity and disaster recovery plans are weighted towards information technology, a comprehensive plan will also include areas of operation outside data processing. The plan should have a broad scope if it is to effectively address the many events and scenarios that could affect the University. A “worst case scenario” should be the bottom line basis for developing the plan. The worst-case scenario is the near or total destruction of the main or primary facility. When this is used as a baseline premise, less critical and detrimental situations can be handled by using only the needed portions of the plan, with only minor (if any) modifications required. 
The scope will be determined by the University’s business drivers and priorities, its size and degree of infrastructure, and the level of detail addressed by the procedures developed for business and asset continuity and recovery. These plans should be clearly and concisely written but with sufficient detail, to accurately implement them with little additional guidance to personnel in an emergency situation. Core policies and procedures found in Business Continuity and Recovery Plans include: 1. Business continuity policies 2. Emergency response procedures 3. Emergency evacuation procedures 4. Damage impact assessment procedures 5. Disaster declaration and escalation procedures 6. Command center activation procedures 7. Personnel notification procedures 8. Resumption of normal operations 9. Physical and security assessments Other more agency specific policies and procedures may include: Page 8 of 42
  • 9. 1. Administration
2. Media management
3. Employee crisis management
4. Vendor communications management
5. Client communications management
6. Salvage operations procedures
7. Travel and lodging coordination
8. Recovery expense control and reporting
9. Plan exercise project management
10. Plan maintenance management

VII Planning Assumptions

Every viable disaster recovery plan includes assumptions. The assumptions limit the circumstances that the plan addresses and provide a foundation to support its procedures. Limits may be imposed upon the magnitude of disaster. Premises may be stated declaring known dependencies and expected levels of services provided by others, such as reliance on the Office Of Computing Services to restore network availability within an acceptable time frame. Raising and exploring the following questions helps to identify assumptions.
• What equipment/facilities have been destroyed?
• What is the timing of the disruption?
• What records, files and materials were protected from destruction?
• What resources are available following the disaster:
o Staffing?
o Equipment?
o Communications?
o Transportation?
o Hot site/alternate site?

Typical planning assumptions included in Business Continuity Recovery Plans include:
• The main facility of the organization has been destroyed.
• Staff is available to perform critical functions defined within the plan.
• Staff can be notified and can report to the backup site(s) to perform critical processing, recovery and restoration activities.
• Off-site storage facilities and materials survive the event.
• The disaster recovery plan is maintained with regard to training, testing and updating.
• Subsets of the overall plan can be used to recover from minor interruptions.
• An alternate facility is available or can be secured as necessary.
• An adequate supply of critical forms and supplies is stored off-site, either at an alternate facility or in off-site storage, or is readily available from an external source.
• A backup site is available for processing the organization’s work.
• Cell phones, pagers, email and other auxiliary forms of communication will be available.
• Surface transportation in the local area is possible.
  • 10. • Vendors will perform according to their general commitments to support the organization in a disaster.

This list of assumptions is not all-inclusive, but is intended to provoke thought in the beginning stage of planning. The assumptions themselves dictate the actual plan’s procedures and hence should be carefully reviewed by the Steering Committee for relevance and validity.

VIII Scaling Levels of Disaster

It is sometimes advantageous to define levels of disaster for given scenarios, for which a standard set of response procedures can be written. The sub-procedures can then be referenced or called from within other top-level procedures using the designated level of disaster.
Level 0- No interruption in operations.
Level 1- Operations can be resumed within eight hours.
Level 2- Operations can be resumed within 8-48 hours. Users may need to implement manual or alternate processing procedures.
Level 3- Operations cannot be restored for over 48 hours. All functions and personnel to be moved to an alternate site(s). Users need to implement manual processing.

Alternate disaster scale:
Level 0- The disaster can be handled by the personnel of the organization alone.
Level 1- The disaster will require some outside intervention for recovery such as police, fire, or other professional services.
Level 2- The disaster will require assistance from multiple external organizations.

IX Collecting Data and Information

Collecting accurate data is vital to developing a successful plan. This point cannot be overstated. Effective data collection involves the use of questionnaires and interviews with key personnel and managers, as well as review of existing policies, operating manuals and procedures. Preprinted forms and questionnaires are particularly useful and efficient.
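The first, time-based scale in Section VIII lends itself to a simple lookup that other procedures could call. The following is a minimal sketch; the function name is hypothetical, and treating an outage of exactly eight hours as Level 1 is an assumption, because the guide's Level 1 and Level 2 ranges both touch that boundary.

```python
def disaster_level(expected_outage_hours: float) -> int:
    """Map an expected outage duration to the Level 0-3 scale of Section VIII."""
    if expected_outage_hours < 0:
        raise ValueError("outage duration cannot be negative")
    if expected_outage_hours == 0:
        return 0  # no interruption in operations
    if expected_outage_hours <= 8:
        return 1  # assumption: exactly 8 hours still counts as Level 1
    if expected_outage_hours <= 48:
        return 2  # manual or alternate processing may be needed
    return 3      # move functions and personnel to an alternate site


# Short reminders of the response associated with each level, paraphrased from the guide.
RESPONSES = {
    0: "No interruption in operations.",
    1: "Resume operations within eight hours.",
    2: "Resume within 8-48 hours; consider manual or alternate processing.",
    3: "Relocate to alternate site(s); implement manual processing.",
}

for hours in (0, 4, 24, 72):
    level = disaster_level(hours)
    print(f"{hours} h -> Level {level}: {RESPONSES[level]}")
```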
Comprehensive data collection should include the following: • Asset and Equipment Inventory • Personnel Roster • Notification Checklist • Master Communications List • Vendor and External Agency Contact List • Computer Hardware Inventory Page 10 of 42
  • 11. • Computer Software Inventory
• Documentation Inventory
• Forms Inventory
• Office Supply Inventory
• Insurance Policies Inventory
• Office Equipment Inventory
• Records and Data Inventory
• Offsite Storage Inventory
• Telecommunications Inventory
• Important Telephone Numbers
• Business Process Questionnaires
• Service Level Agreement Questionnaires
• Regulatory Compliance Questionnaires
• Security Questionnaires

X Planning Objectives

The primary objective of Continuity Recovery planning is to enable the University, at a minimum, to survive a disaster and, preferably, to continue normal business operations. In order to survive, the University must assure that critical operations can resume or continue normal processing before the Maximum Acceptable Outage is reached. [Reference Appendix A Glossary of Terms] The plan must establish clear lines of authority and prioritize work efforts. The key objectives of the Continuity and Recovery plan are to:
• Ensure the safety and well being of personnel.
• Continue core business processes and functions.
• Minimize the duration of disruption to operations.
• Minimize immediate damage and losses.
• Establish chain of command and responsibility.
• Facilitate effective coordination and communication.
• Assure restoration of normal operations.

Statistically, the probability of a major disaster is low, but the consequences of an occurrence may be catastrophic, both in terms of operational impact and agency reputation. Management should assign ongoing responsibility for continuity recovery planning to personnel with top-level responsibility or those possessing intimate knowledge of a process or service.

XI Setting Realistic Goals

The goal of any Business Continuity and Recovery Plan is implicit in its title: Continuity and Recovery. For Old Dominion, a realistic goal is to maintain the "integrity of a semester", meaning that if some sort of disaster occurs the University should avoid stopping a semester or declaring it a loss.
The objective of Continuity is little or no disruption to a process. Recovery may be appropriately divided into two objectives. The Recovery Time Objective (RTO) is the period of time within which systems, applications, or functions must be recovered after disruption to ensure the viability of business operations. RTOs are often used as the basis for the development of recovery strategies, and as a determinant as to whether or not to implement the recovery strategies during a disaster situation.
  • 12. The Maximum Allowable Downtime or Maximum Acceptable Outage (MAO) is the elapsed time interval between the disruption’s commencement and the achievement of the RTO. The Recovery Point Objective (RPO) is the point in time to which systems and data must be recovered after a disruption in order to ensure the viability of business operations. This includes restoration of manually processed hard copies and application datasets. RPOs are often used as the basis for the development of backup strategies, and as a determinant of the amount of data that may need to be recreated after the systems or functions have been recovered in order to preserve the viability of business operations.

XII Determining Critical Needs

To determine the critical needs of the University, each unit should document all the functions performed within its area. An analysis over a period of two weeks to one month can indicate the principal functions performed inside and outside the University, and assist in identifying the necessary data requirements for the University to conduct its daily operations. Some useful diagnostic questions that may be raised towards this end are:
• How long could the department function without certain existing equipment and normal departmental organization?
• What, if any, state reporting requirements for education must be met for integrity (medically related education may experience this issue)?
• What are the high priority tasks, including critical manual functions and processes, in the department?
• How often are these tasks performed, e.g., daily, weekly, monthly, or on a semester boundary?
• What staffing, equipment, forms and supplies would be necessary to perform the high priority tasks?
• How would the critical equipment, forms and supplies be replaced in a disaster situation?
• Does any of the above information require long lead times for replacement? • What reference manuals and operating procedure manuals are used in the department? How would these be replaced in the event of a disaster? • Should any forms, supplies, equipment, procedure manuals or reference manuals from the department be stored in an off-site location? • Identify the storage and security of original documents. How would this information be replaced in the event of a disaster? Should any of this information be in a more protected location? • What are the current microcomputer backup procedures? Have the backups been restored? Should any critical backup copies be stored off-site? • What would the temporary operating procedures be in the event of a disaster? • How would other agencies be affected by an interruption in the department? • How might the University be affected by disruption within others during the same event? • What effect would a disaster at the data center have on the University? • What outside services/vendors are relied on for normal operation? • Would a disaster in the University jeopardize any legal requirements for reporting? • Are job descriptions available and current for the department? • Are department personnel cross-trained in accordance with the plan? • Who will be responsible for maintaining the University’s plan? • Are there other concerns related to planning for Continuity and Recovery? Page 12 of 42
  • 13. • What obstacles must be overcome to develop a viable working plan?

Critical needs can be ascertained by first using questionnaires, then correlating the consistency and range of the responses.

XIII Priority of Processing

Once the critical needs have been documented, priorities within the University’s units should be established for the overall recovery of the University. In the University setting, the activities of each unit may be prioritized as:
• Critical
o A disruption of this function exceeding one day would seriously impair the operation of the University. The impact is direct and immediate upon the University and its customers.
• Essential
o A disruption of this function exceeding one week would seriously impair the operation of the University and would likely damage the integrity of a "semester".
• Supportive
o The function is related to internal administration but its extended absence would not seriously impair the operation of the University.

XIV Basic Items of Focus

Regardless of a unit’s level of criticality to the University or the complexity of its operations, there are common basic items its plan should accommodate.
• Crisis management plan: ensuring the safety of employees, continuity of decision making, and the view from the outside world. Includes personnel communication tree, facilities diagrams and inspection plans.
• Asset list and external dependency contact information
• Secure data storage and retrieval, both on and off site.
• Prioritization of resources for most critical business processes based on the business impact analysis.
• Detailed process flow descriptions and procedures to recover them under all scenarios considered in accordance with the plan’s scope, assumptions and premises.
• Work-at-home programs or alternate plan for workspace recovery
• Contingency planning: mitigating the risks of external events
• Calling trees: lists of people who call other people and keep communications flowing.
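The three categories of Section XIII can be turned into a recovery ordering once each function's tolerable outage is known. The sketch below assumes the outage thresholds of one day and one week stated there; the function names, the sample business functions and their figures are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    CRITICAL = 1    # serious impairment if disrupted for more than one day
    ESSENTIAL = 2   # serious impairment if disrupted for more than one week
    SUPPORTIVE = 3  # extended absence would not seriously impair operations


def classify(tolerable_outage_days: Optional[float]) -> Priority:
    """Classify a function by the longest outage it can tolerate.

    None means no outage length would seriously impair the University.
    """
    if tolerable_outage_days is None:
        return Priority.SUPPORTIVE
    if tolerable_outage_days < 0:
        raise ValueError("tolerable outage cannot be negative")
    if tolerable_outage_days <= 1:
        return Priority.CRITICAL
    if tolerable_outage_days <= 7:
        return Priority.ESSENTIAL
    return Priority.SUPPORTIVE


# Hypothetical functions and tolerable outages (days), for illustration only.
functions = {
    "office supply ordering": None,
    "course registration": 5,
    "payroll": 1,
}

# Recover the most time-sensitive functions first.
recovery_order = sorted(functions, key=lambda name: classify(functions[name]))
print(recovery_order)
```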
XV Know The Cost Of Downtime

The true cost of downtime is the sum of the expenditures and losses that result from a given adverse event. The total often exceeds what can be immediately calculated. The monetary cost is a function of lost productivity, lost revenue and the expenses of recovery. Other costs that emerge during and after events include decreased financial performance, damaged reputation, loss of personnel, penalties for noncompliance, etc. All of these must be taken into account for a valid analysis of cost effective mitigation measures.

The cost of downtime is a complex issue that deserves a considerable portion of the planning stage. It is the basis for establishing acceptable levels of risk. Depending on the University’s activities, it may not be sufficient to consider only the endpoint summation of the costs of downtime as the basis of accepted risk. While an accounts receivable department will eventually reach its objective of posting and depositing all outstanding payments, the time needed to reach this recovery point may exceed the time that the business’s cash reserves will last, leading to a secondary crisis and potential losses. Additional costs incurred by other agencies or customers dependent upon your services should also be considered for a thorough downtime cost analysis.

XVI How Much Risk Is Acceptable?

The theoretical goal of the BCRP is ‘zero’ downtime and ‘instantaneous’ recovery. Due to limitations of time, resources and funding, these are not practically achievable. There is always residual risk that will lead to outages of quantifiable duration. Officers and upper management generally bear the responsibility of determining acceptable levels of residual risk. The decision is complex and approaches a non-linear system of probabilities of occurrence and outcome. Events of interruption are often not isolated, and the number of permutations is great. Under the threat of a hurricane, will the power also be interrupted? Will the facility’s roof be breached? If so, will important paper documents or computer systems be damaged? Will evacuation be necessary? How many employees will decide to leave early to be with family?
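The downtime arithmetic discussed above, including the accounts receivable example, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the idea of a constant daily cash outflow, and all figures a caller might pass in are hypothetical and are not part of the guide.

```python
def downtime_cost(lost_productivity, lost_revenue, recovery_expenses,
                  other_costs=0.0):
    """Sum the monetary components of downtime for one event.

    other_costs stands in for harder-to-quantify items such as
    noncompliance penalties (an assumption for this sketch).
    """
    return lost_productivity + lost_revenue + recovery_expenses + other_costs


def days_of_cash_runway(cash_reserves, daily_outflow):
    """Days the reserves last if no receipts are posted (assumes a
    constant daily outflow, which is a simplification)."""
    if daily_outflow <= 0:
        return float("inf")
    return cash_reserves / daily_outflow


def secondary_crisis_likely(recovery_days, cash_reserves, daily_outflow):
    """True when recovery takes longer than the cash reserves last."""
    return recovery_days > days_of_cash_runway(cash_reserves, daily_outflow)
```

For example, `secondary_crisis_likely(10, 50_000, 10_000)` is true because the reserves cover only five days of a ten-day recovery, even though every payment would eventually be posted.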
A resilient plan takes into account all possibilities and the probabilities of success and failure of any preceding assumptions and preventive measures. Decisions are based on outcome probabilities and the corresponding cost of downtime. Risks may be enumerated in order of probability and potential impact.

XVII Events To Consider

The potential threats to any given agency are so numerous that any viable plan must consider only the most probable or most catastrophic events that can still be reasonably mitigated. Hence, a nuclear attack or major east coast earthquake can be eliminated. The following is a comprehensive list of both natural and man-made disasters that threaten the University and can reasonably and effectively be planned for.

Natural Disasters
• Flooding, Tornadoes, Severe Thunderstorms, Hurricanes, Winter Storms.

Man-Made or resulting from the above
• Fires with and without equipment loss and/or facility damage, loss of vital documents, data or software, loss of power, loss of communications, loss of environmental controls (heat, A/C), internal/external vandalism or theft, loss of equipment, loss of personnel, breach of physical security, loss of facilities access, sabotage, bomb or terrorist threat or similar event, cyber event such as a virus or hacker, construction including exposure to hazards such as asbestos.
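One way to enumerate the events listed above in order of probability and potential impact is to rank them by expected loss, that is, probability multiplied by the cost of downtime. The sketch below assumes that scoring rule; the `Risk` class, the `rank_risks` function and every probability and cost figure are hypothetical placeholders, not data from the guide.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    event: str
    annual_probability: float  # hypothetical estimate, 0.0 to 1.0
    downtime_cost: float       # hypothetical cost if the event occurs

    @property
    def expected_loss(self) -> float:
        # Probability-weighted impact; a common but simplified score.
        return self.annual_probability * self.downtime_cost


def rank_risks(risks):
    """Return risks ordered from highest to lowest expected loss."""
    return sorted(risks, key=lambda r: r.expected_loss, reverse=True)


# Placeholder figures for illustration only.
EXAMPLE_RISKS = [
    Risk("Loss of power", 0.5, 20_000),
    Risk("Hurricane", 0.125, 240_000),
    Risk("Virus or hacker", 0.25, 60_000),
]

for risk in rank_risks(EXAMPLE_RISKS):
    print(f"{risk.event}: {risk.expected_loss:,.0f}")
```

A single score like this ignores the fact that events are often not isolated, so it is best treated as a starting point for discussion.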
The following events have been excluded on the basis of extremely low probability of occurrence or because they are beyond the scope of the authority of individual agencies.

• Chemical or nearby industrial accident, civil disobedience, employee strike, hostage or terrorist situation, war and military conflicts including nuclear attacks, plane or aircraft crashes, earthquakes.

Refer to Appendix A for detailed information about these events from the perspective of Business Continuity and Recovery.

XVIII Information Technology Considerations

Information technology dependencies exist extensively throughout the University due to its distributed computing environment. Distributed computing involves multiple computers, located remotely from one another, each with a role in computational problems or information processing. In the typical transaction using a tiered model, user interface processing is done at the desktop computer at the user's location, business processing is done on a remote computer, and database access and processing is done on another computer that provides centralized access for many business processes. The process depends upon the ability of each tier to fulfill its role. Agency process dependencies are often described in Business Continuity and Recovery literature as vertical, and IT dependencies as horizontal, corresponding to them.

[Figure: IT Service Delivery. Business priorities and drivers, business processes, and business activities and tasks shown against transactional applications, departmental applications, knowledge bases, reporting systems and data sets, with ratings ranging from Medium to Critical.]

The bulk of the responsibility for managing and maintaining the campus distributed computing environment is borne by OCCS. Through extensive commitment of resources and aggressive planning, OCCS can continue or rapidly recover data and computing services following most events, including extensive damage to the data center facilities.
This high degree of certainty and availability should be one of the assumptions in the University’s plan. However, residual risk is never zero, even for OCCS, and therefore some plan or procedures should be in place in the event that OCCS services are unavailable or only partly available, such as loss of email functionality or limited Banner access. Doing so will augment the resiliency of the University.

Information Technology statistics are helpful when setting priorities for allocating resources and developing preventive strategies.

Leading Causes of Computer Downtime (Contingency Planning Research, Livingston, NJ)
• Power Outage: 31%
• Storm Damage: 20%
• Burst Pipe-Flooding: 16%
• Fire and Bombing: 9%
• Earthquake: 7%
• Other: 4%

Leading Causes of Data Loss (Ontrack Data Recovery Inc.)
• Hardware or System Malfunction: 44%
• Human Error: 32%
• Software Application Malfunction: 4%
• Viruses: 7%
• Natural Disaster: 3%

In considering your on and off site data backup strategies, it may be advantageous to classify data as a function of the Recovery Time Objective for the process or function(s) it supports.

Data Classification [2] and Recovery Time Objective (RTO)
• Product or Service Related Data: required to support the core products and services of the business unit, e.g. Order Entry, Inventory, Shipping. RTO: usually less than 24 hours.
• Business Support Data: required to run the business, e.g. Financials, Payables, Data Warehouse. RTO: usually less than 48 hours.
• Deferrable Data: required to support the business, e.g. Fixed asset accounting. RTO: usually 72 hours or longer.

[2] Note that the University has an established policy on Data Classification, number 3512.

XIX Records Retention
An organized, systematic approach to records management is an important part of a comprehensive disaster recovery plan. In fact, the state legislature has regulated records retention and the commonwealth librarian has created a set of schedules for records retention. Effective records management can free up administrative resources and optimize the processes that depend upon records before the occurrence of an event. Benefits include:

• Reduced storage costs.
• Expedited customer service.
• Efficient process flow.
• Federal and state regulatory compliance.

Records are retained not only as proof of transactions, but also to verify compliance with legal and regulatory requirements. These records are used for independent examination and verification of business practices. Federal and State requirements for records retention must be analyzed by each agency individually before a retention and management policy can be created. In addition, the University should employ specific handling and salvage techniques and procedures for the media types it uses. Commonly used media types include:

• Paper
• Magnetic
• Microfilm/Microfiche
• Photographic
• Compact Disc

XX Insurance Considerations

Adequate insurance coverage is a key consideration when developing a business recovery plan and performing a risk analysis. Having a disaster plan and testing it regularly may, in itself, lower insurance rates. Good planning reduces risks and addresses many concerns of the underwriter, in addition to affecting the cost or availability of the insurance. To assist in planning for emergency funding, most insurance agencies specializing in business interruption coverage can provide the organization with an estimate of anticipated business interruption costs. Many organizations that have experienced a disaster indicate that their costs in sustaining temporary operations during recovery were significantly higher than expected.
To provide adequate proof of loss to an insurance company, the University should document its assets thoroughly. Asset inventories become extremely important as the adjustment process takes place. Photos of specialized or expensive equipment can prove valuable in expediting reimbursement for losses. Types of insurance coverage to be considered may include: computer hardware replacement, extra expense coverage, business interruption coverage, valuable paper and records coverage, errors and omissions coverage, fidelity coverage, media transportation coverage.
With a good handle on the costs of downtime, the costs to replace assets and the increased operating costs incurred during recovery, management can make reasonable decisions as to the type and amount of insurance to carry and to what extent the University should self-insure against certain losses.

[University Note: It is recommended that a copy of, or pertinent portions of, your plan be submitted to the appropriate Office of the Vice President for review of insurance coverage and strategies]

XXI Strategy and Plan Development

The methodology used to develop Business Continuity and Recovery Plans should emphasize the following:

• Defining recovery requirements from the perspective of business functions.
• Documenting the impact of an extended loss to operations and key business functions.
• Focusing appropriately on disaster prevention and impact minimization, as well as orderly recovery.
• Selecting project teams that ensure the proper balance required for plan development.
• Designing a continuity plan that is understandable, easy to use and easy to maintain.
• Developing methods to integrate continuity and recovery planning into ongoing business planning and system development life cycles to sustain plan viability over time.

The successful and cost effective completion of such a project requires the close cooperation of management from all areas, ranging from strictly IT areas to dedicated business areas supported by information systems. The kickoff meeting should include the department leadership and those who have firsthand knowledge of core processes. Describe the project’s goals and its importance to the ongoing continuity of the University. Answer any questions and clearly define the roles and responsibilities of each member. Obtain commitment from appropriate management to support and participate in the effort. Senior personnel from all units must be closely involved throughout the project for the planning process to be successful.
Cooperation and awareness… Your keys to success!

The objectives are to identify alternatives for specific continuity requirements, evaluate those alternatives, and recommend a business continuity and recovery strategy for management’s approval. The strategy development builds upon the MAOs identified for each core process in the BIA by defining the specific resources necessary for the performance of that process and setting a recommended strategy for the recovery of those resources in an outage. These strategies are documented and compiled into a comprehensive plan for the University. This is a critical decision-making step in the development of a Business Continuity Program, because the general strategy provides the specific guidelines by which the program will be implemented.

The plan development builds upon the strategies selected for each agency’s core business processes. The following four phases must be specifically addressed within the plan:
• Response: the reactions to an incident or emergency in order to assess the level of containment and control activities required.
• Resumption: the process of planning for and/or implementing the recovery of critical business operations immediately following an interruption or disaster.
• Recovery: the process of planning for and/or implementing recovery of less time sensitive business operations and processes after critical business functions have resumed.
• Restoration: the process of planning for and/or implementing full scale business operations, which allow the organization to return to a normal service level.

Selecting alternative strategies is a very basic activity when developing the plan. The following should be considered:

• Alternate procedures for carrying out the process, either to its output stage (completion) or to an intermediate stage that may be easier to recover from or that provides some customer service as opposed to shutting down the function completely.
• Manual processing abilities and related costs
• Suspending the function for some period of time
• Mitigation of insurance (replace rather than try to salvage)
• Outsourcing and vendor services: temporary personnel agencies, cellular phone rental, etc.
• Process redesign
• Single points of failure
• Ability to recreate information
• Data backup versus real time replication
• Business cycles, seasonal or otherwise
• Work schedule modification to maximize resource use
• Internal resource capabilities
• The option to do nothing

XXII Response Teams

The structure of the contingency organization need not be the same as the existing organization chart. The team approach is used in developing a plan as well as in responding to an event. The purpose of creating teams is to assign specific responsibilities for a smooth recovery. Each team must have the authority to carry out the procedures contained in its section of the plan. Within each team, a leader and an alternate should be designated.
These persons provide the necessary leadership and discretion in carrying out responsibilities at the time of disaster. For smaller agencies, a team may consist of one individual, and that individual may head or be a member of other teams concurrently. The following is a comprehensive list of potential teams. The scope and complexity of your plan will determine the types of teams that should be created.

• Emergency Response Team
• Management Team
• Damage Assessment Team
• Safety and Security Team
• Facilities Management Team
• Administrative Support Team
• Logistics Support Team
• User Support Team
• Department Recovery Team
• Computer Backup Team
• Off-Site Storage Team
• Software Recovery Team
• Communications Team
• Production Team
• Computer Restoration Team
• Human Relations Team
• Public Relations/Customer Service Team
• Business Recovery Team
• Departmental Recovery Team

Note: Various combinations of the above teams may be used, resulting in the creation of fewer teams, each with broader responsibilities. It is imperative that all team members both accept their roles and responsibilities and are fully qualified and trained to perform them.

XXIII List of Deliverables

The basic list of deliverables for a Business Continuity Recovery Program includes the following:

• Detailed Project Schedule
• Risk Assessment and Business Impact Analyses
• Data Collection Forms and Questionnaires
• General Recovery Strategies and Approach Report
• Recovery Team Structure, Roles and Responsibilities
• Business Continuity Plan (including all detailed recovery plans, policies and procedures)
• Test Exercises Schedules and Procedures
• Maintenance, Support and Training Policies and Procedures

XXIV The Plan Document

• Format

Use a common format in preparing the actual detailed procedures and documenting other information. This will help assure consistency and conformity throughout the plan and will facilitate ongoing maintenance. Standardization is especially important if several people write the procedures.

• Basic Requirements
The University name, address, and primary and secondary contact information for the Business Continuity plan must be identified. Revision status should be clearly stated on the cover page at minimum, but may also be displayed in the footer of each page. Supporting documentation that should be inserted into or attached to the master copies of the plan document includes logs of Review and Update, Personnel Training, and Test Exercise Procedures and Results. Provide inventory lists of all assets necessary to support the University’s operations and those items required to carry out the plan. This might include not only office equipment, computers and software but also supplies such as paper forms.

• Listing of Potential Sections
-- Executive Summary
-- Purpose and Scope
-- General Assumptions
-- Statement of Compliance
-- Types of Events Considered
-- Business Process Analysis
-- Business Continuity Agency Structure
-- Recovery Strategies
-- Reporting and Communications Structure
-- Responsibilities of the Recovery Teams
-- Emergency Response Procedures
-- Damage Assessment and Restoration Procedures
-- Media Notification and Public Relations Control During a Disaster
-- Employee Notification, Information and Communication Systems
-- Recovery Team(s) Procedures
-- Team Assignments With Designated Backups
-- Procedures For Establishing a Command and Control Center
-- Contingency Administrative Procedures
-- Offsite Storage and Retrieval Procedures
-- Emergency Funding and Accounting Procedures
-- Identification of Extra Expenses During a Disaster
-- Computer Network and Communications Configurations
-- Critical Applications and Data Sets
-- Application and Process Priorities
-- Technical Recovery Procedures
-- Restoration Procedures
-- External and Inter-Agency Dependencies
-- Inventories: personnel, storage, skills, teams, vendors, hardware, software, data-com, documents, forms, equipment, office supplies, records, critical inventories
-- Internal Communication Plan
-- External Party and Customer Notification
-- Employee Assistance
-- Implementation Procedures
-- Disaster Declaration Policies
-- Special Security Procedures
-- Procedures and Policies For Plan Review and Update
-- Maintenance documents including revision logs, training logs, exercise plans and test results

An organized Business Continuity Recovery Plan can be followed step by step by all internal and external personnel and will result in achievement of the expected continuity and recovery objectives under a given scenario.

• Writing Methods
-- Procedures should be clearly written.
-- Be specific. Write the plan with the assumption that it will be implemented and carried out by personnel completely unfamiliar with the plan and the details of the University’s operations.
-- Use short, direct sentences, and keep them simple. Long sentences can be overwhelming.
-- Use topic sentences to start each paragraph.
-- Use short paragraphs. Long paragraphs can diminish the reader’s comprehension.
-- Present one idea at a time and in logical sequence.
-- Try to avoid technical jargon, even if explained elsewhere in the document.
-- With the exception of the personnel contact list, use position titles (rather than names of individuals) to reduce maintenance and revision requirements.
-- Avoid gender specific nouns and pronouns that may lead to unnecessary revision.
-- Develop uniformity in procedures to simplify the training process and minimize exceptions to conditions and actions.
-- Identify actions that occur in parallel and those that must occur sequentially.
-- Use descriptive verbs. Examples of descriptive verbs are: Acquire, Count, Log, Activate, Create, Move, Advise, Declare, Pay, Answer, Deliver, Print, Assist, Enter, Record, Back Up, Explain, Replace, Balance, File, Report, Compare, Inform, Review, Compile, List, Store, Contact, Locate, Type

XXV Document Distribution

A master copy of the plan should be stored under the control of the recovery coordinator. At least one off-site backup copy, either in print or in a readily available electronic format, is recommended.
The often sensitive and proprietary details of the University’s operations contained within a plan sometimes pose security issues, particularly where extensive controls and segregation of duties exist. At a minimum, each copy in any format should be monitored and tracked using a distribution log, which should be attached to all master copies. Personnel who do not play a top level key role in the plan may be given an abbreviated plan that includes only what supports their assigned roles and responsibilities. Abbreviated versions should be clearly designated as such and tracked along with all other copies. Mark all plan copies as Sensitive and Confidential. A strict statement of use is appropriate and recommended.

XXVI Maintenance and Support
A plan that is not regularly reviewed and updated will not remain viable through changes to people, process and technology. A change management program with regard to review, education, dissemination and testing must be implemented. Major considerations in this process include:

• Frequency
• Change factors
• Procedural changes
• University and Agency changes
• Personnel changes
• Physical changes
• Technology
• Continuity and Recovery requirements
• Testing issues

XXVII Testing and Training

As a critical factor in its success, the plan should be tested and evaluated on a regular basis, at least annually. Procedures to test the plan should be documented. Only testing will provide the University with the assurance that the policies and procedures will work to achieve the necessary objectives. Reasons for testing include:

• Determining the feasibility and compatibility of backup facilities and procedures.
• Identifying areas in the plan that need modification.
• Providing training to the team managers and team members.
• Demonstrating the ability of the University to meet its continuity recovery objectives.
• Providing motivation for maintaining and updating the Business Continuity Plan.

[Testing is also good training]

There are five (5) main types of BCRP testing.

• Checklist
• Structured walk-through
• Simulation
• Active simulation
• Full interruption

A structured walk-through is the best place to start testing a BCRP. It is usually done in a conference room by people who are familiar with the plan but did not actually write it. Written procedures for the structured walk-through test should include:

• Test scenario
• Description of event
• Test assumptions
• Test constraints
• Time, day, month that the disaster was reported
• Method of discovery of the event
• Immediate damage assessment
• Specific forms and reports to be used from the plan
• Specific Teams involved and other participants
• Have a moderator not directly participating in the test log the event
• Document the results and findings in the Continuity Recovery Plan

XXVIII Critical Success Factors

Many factors contribute to producing a successful BCRP. Some examples are listed below.

• Know your risks and determine acceptable residual levels
• Validate all planning assumptions
• Consider a spectrum of likely and catastrophic events
• Gain a detailed understanding of core processes
• Create detailed procedures and sub-procedures
• Effectively train and instill awareness in all personnel
• Match personnel qualifications closely with their team and role
• Test for compliance with Recovery Time and Point Objectives

XXIX Best Practices

1. Establish a service-level classification scheme for availability and business continuity, and define standard, repeatable development, infrastructure and operations architectures to meet it.
2. If comprehensive testing is not practical, perform walk-through testing and ensure that external dependencies are addressed.
3. The Continuity Coordinator and key personnel review the business continuity program at least annually or as required in response to changes within the University.
4. Business Continuity spans the entire agency, organization or department.
5. Business continuity planning is a continuous process within the organization.
6. Strategies are in place based upon the impact the loss of a business process would have on the University, department or organization.
7. A program to exercise the plan exists and is implemented. Results are analyzed and reviewed for compliance and to determine if plan modifications are required.
8. A quality assurance program is in place. Appropriate triggers to update the plan are included in the University’s change management policies and procedures.
9. Project teams and timelines are established to implement all recommended measures to mitigate risks as defined within the Risk Assessment Analysis.
10. Establish or arrange for off-site storage of backups and copies of critical files and data.

XXX Stepping Through The Creation Of The Plan

1. Obtain Top Management Commitment

Top management must support and be involved in the development of the disaster recovery planning process. Management should be responsible for coordinating the plan and ensuring its continued viability. Adequate time and resources must be committed to the development of an effective plan. Resources should include both financial considerations and the effort of all personnel involved.

2. Establish a planning committee

A planning or steering committee should be appointed to oversee the development and implementation of the plan. Representatives from all functional areas of the organization should be included. The committee will define the scope and objectives of the plan.

3. Perform a risk assessment

The planning or steering committee should prepare a risk analysis and business impact analysis that includes a range of possible events, including natural, technical and man-made threats. Each functional area is to be analyzed to determine the consequences and impact of both likely [power failure] and catastrophic [tornado] events. Evaluate the safety of critical documents and vital records. Fire poses one of the greatest threats. Intentional human destruction or sabotage, however, should also be considered. The plan must provide for the “worst case” situation: destruction of the main facilities. Impacts and consequences resulting from loss of information and services should be addressed. Cost effective risk mitigation planning is also the committee’s responsibility.

4. Establish priorities for core processes and functions of the University’s operation

The critical requirements of each area within the University should be carefully and thoroughly evaluated:

• Functional operations
• Key personnel
• Information and data
• Processing systems
• Customer service
• Documentation
• Vital records
• Policies and procedures

Processing and operations should be analyzed to determine the maximum amount of time that the department and organization can operate without each critical system. Critical needs are defined as the procedures and equipment required to continue operations should an area, main facility, or key resources, or any combination of these, be destroyed or become unavailable. One method of determining the critical needs of a department is to document all the functions performed by each area. Once the primary functions have been identified, the operations and processes should be ranked in order of priority: Critical, Essential, or Administrative (supportive).

5. Determine Recovery Strategies

The most practical and cost effective alternatives for processing in case of a disaster need to be researched and evaluated. Alternatives depend upon the evaluation of a given function and may include:

• Relocation to a backup site (a “warm” site will already have suitable equipment and operating environment)
• Reciprocal agreements or vendor service level agreements
• Manual processing with specific follow-up “return to normal” restoration procedures
• Home or remote processing (the facility is inaccessible but the computer equipment is fully operational)

Written agreements with vendors or other agencies for the specific recovery alternatives selected should be prepared. Be sure to consider:

• Cost of contingency arrangement
• Special security procedures
• Notification of systems changes
• Required hours of operation
• Specific hardware and other equipment required for processing
• Personnel requirements, including possible temporary staff to accelerate recovery
• Circumstances constituting an emergency
• Issues of availability and terms of use

6. Perform Data Collection

Recommended data gathering materials and documentation include:

• Backup position listing
• Critical telephone numbers (work, cell, home, pager)
• Communications inventory, including work and alternate email addresses
• Distribution log
• Records inventory
• Equipment inventory
• Forms inventory
• Insurance policies in effect
• Computer hardware/software inventory
• Office equipment inventory
• Master call list/communication plan
• Master vendor and external agency contact list
• Notification checklist
• Office supply inventory
• Off-site storage location inventory
• Software and data files backup/retention schedules
• Temporary location specifications, potential or existing backup sites

It is advantageous to develop standardized forms to facilitate the data gathering process.

7. Organize and document a written plan

An outline is very useful to guide the development of the detailed procedures. An outline:

• Helps to organize the detailed procedures
• Identifies all major steps before the writing begins
• Identifies redundant procedures that only need to be written once and defines sub-procedures

The planning committee should review and approve the proposed plan. The plan should be thoroughly developed, including all detailed procedures to be used before, during and after a disaster. It may not be practical to develop detailed procedures until backup alternatives have been defined. Procedures should include methods for maintaining and updating the plan to reflect any significant internal, external or systems changes and, just as important, allow for a regular review of the plan by key personnel within the organization.

The disaster recovery plan is best structured using a team approach. Specific roles and responsibilities should be assigned to the appropriate team for each functional area of the organization. General team categories include administrative functions, facilities, logistics, user support, computer backup, restoration and other important areas in the organization. The Management Team is especially important because it coordinates and accomplishes the actual continuity-recovery process.
The Damage Assessment Team should first assess the disaster followed by activation the recovery plan by the team or the Continuity Coordinator, and contact other team leaders. The Management Team also documents the efforts and recovery processes during the event. Management Team members should sit on the Planning Committee to assist in final decisions, setting priorities, policies and procedures. 8. Develop testing criteria and procedures It is essential that the plan be thoroughly tested and evaluated on a regular basis (at least annually). Procedures to test the plan should be documented. The tests will provide the organization with the Page 27 of 42
  • 28. assurance that all necessary steps are included in the plan. Other reasons for testing include: • Determining the feasibility and compatibility of backup facilities and alternate processing methods • Identifying areas in the plan that require clarification or modification • Providing training to all staff and personnel • Demonstrating the ability of the University to meet the anticipated recovery objective in time and degree • Providing motivation for maintaining and updating the Business Continuity Recovery Program 9. Test the Plan After testing procedures have been completed, test the plan initially by conducting a structured walk-through test. The test will provide additional information regarding any further steps that may need to be included, changes in procedures that are not effective, and other appropriate adjustments. It is recommended that initial testing of the plan be done in sections and during off-peak business hours to minimize disruptions to the overall operations of the University. 10. Approve the plan Once the plan has been written and tested, it must be approved by top-level management. Top management is ultimately responsible for ensuring that the University has a current, documented and tested plan. Additional responsibilities include: • Reviewing and approving the plan at least annually, and documenting such reviews in writing • Ensuring that the plan is compatible with those of the University and other University Agencies. XXXI Summary and Conclusion Continuity and recovery planning traditionally has information technology roots, but involves more than off-site storage or backup processing. Agencies need to develop written, comprehensive continuity recovery plans that address all the critical operations and functions of their business operations. 
The plan should include documented and tested procedures, which, if followed, will ensure either the ongoing availability of critical resources and continuity of operations or the efficient and timely recovery of such. Since the probability of occurrence for any given event is highly uncertain, the plan is not dissimilar to liability insurance; it represents an ongoing investment in return for a certain level of protection from financial disaster. In fact, the plan is better protection, because insurance alone may not compensate for the incalculable loss of business during the interruption or the long-term losses due to damage of reputation. Effective documentation and procedures are extremely important in a continuity recovery plan. Considerable effort and time are necessary to develop a working plan. Barring sweeping agency changes, a well-organized plan requires relatively little maintenance and, with proper testing and training, provides the type of core stability that cannot be matched by external arrangements or contracts alone.
  • 29. APPENDIX A- Business Continuity Glossary Provided by the Disaster Recovery Journal A ACTIVATION: The implementation of business continuity capabilities, procedures, activities, and plans in response to an emergency or disaster declaration; the execution of the recovery plan. ALERT: Notification that a potential disaster situation exists or has occurred; direction for recipient to stand by for possible activation of disaster recovery plan. ALTERNATE SITE: An alternate operating location to be used by business functions when the primary facilities are inaccessible. 1) Another location, computer center or work area designated for recovery. 2) Location, other than the main facility, that can be used to conduct business functions. 3) A location, other than the normal facility, used to process data and/or conduct critical business functions in the event of a disaster. SIMILAR TERMS: Alternate Processing Facility, Alternate Office Facility, Alternate Communication Facility, Backup Location, Recovery Site. ALTERNATE WORK AREA: Office recovery environment complete with necessary office infrastructure (desk, telephone, workstation, and associated hardware, communications, etc.); also referred to as Work Space or Alternative work site. APPLICATION RECOVERY: The component of Disaster Recovery that deals specifically with the restoration of business system software and data, after the processing platform has been restored or replaced. SIMILAR TERMS: Business System Recovery. B BACKUP GENERATOR: An independent source of power, usually fueled by diesel or natural gas. Page 29 of 42
  • 30. BUSINESS CONTINUITY PLANNING (BCP): Process of developing advance arrangements and procedures that enable an organization to respond to an event in such a manner that critical business functions continue with planned levels of interruption or essential change. SIMILAR TERMS: Contingency Planning, Disaster Recovery Planning. BUSINESS CONTINUITY PROGRAM: An ongoing program supported and funded by executive staff to ensure business continuity requirements are assessed, resources are allocated, and recovery and continuity strategies and procedures are completed and tested. BUSINESS CONTINUITY STEERING COMMITTEE: A committee of decision makers, business owners, technology experts and continuity professionals, tasked with making strategic recovery and continuity planning decisions for the organization. BUSINESS IMPACT ANALYSIS (BIA): The process of analyzing all business functions and the effect that a specific disaster may have upon them. 1) Determining the type or scope of difficulty caused to an organization should a potential event identified by the risk analysis actually occur. The BIA should quantify, where possible, the loss impact from both a business interruption (number of days) and a financial standpoint. SIMILAR TERMS: Business Exposure Assessment, Risk Analysis BUSINESS INTERRUPTION: Any event, whether anticipated (e.g., a public service strike) or unanticipated (e.g., a blackout), which disrupts the normal course of business operations at an organization location. BUSINESS INTERRUPTION COSTS: The costs or lost revenue associated with an interruption in normal business operations. BUSINESS INTERRUPTION INSURANCE: Insurance coverage for disaster related expenses that may be incurred until operations are fully recovered after a disaster. BUSINESS RECOVERY COORDINATOR: An individual or group designated to coordinate or control designated recovery processes or testing. 
SIMILAR TERMS: Disaster Recovery Coordinator BUSINESS RECOVERY TIMELINE: The chronological sequence of recovery activities, or critical path, that must be followed to resume an acceptable level of operations following a business interruption. This timeline may range from minutes to weeks, depending upon the recovery requirements and methodology. BUSINESS RESUMPTION PLANNING (BRP): The operations piece of business continuity planning. 1) A specific segment of the overall recovery process focusing on those items between the recovered environment and the actual processing of business in recovery mode. SIMILAR TERMS: Business Continuity Planning, Disaster Recovery Planning BUSINESS RESUMPTION PLANNING: An all-encompassing "umbrella" term covering both disaster recovery planning and business resumption planning. 1) Process of developing advance arrangements and procedures that enable an organization to respond to an event that lasts for an unacceptable period of time. The process typically addresses all activities from the event to performing its critical business Page 30 of 42
  • 31. functions after an interruption and may include steps indicating how to return home. 2) Frequently used to refer to a business department recovery rather than technology elements. SIMILAR TERMS: Disaster Recovery Planning, Business Resumption Planning BUSINESS RECOVERY TEAM: A group of individuals responsible for maintaining the business recovery procedures and coordinating the recovery of business functions and processes. SIMILAR TERMS: Disaster Recovery Team BUSINESS UNIT RECOVERY: The component of Disaster Recovery which deals specifically with the relocation of a key function or department in the event of a disaster, including personnel, essential records, equipment supplies, work space, communication facilities, work station computer processing capability, fax, copy machines, mail services, etc. SIMILAR TERMS: Work Group Recovery. C CALL TREE: A document that graphically depicts the calling responsibilities and the calling order used to contact management, employees, customers, vendors, and other key contacts in the event of an emergency, disaster, or severe outage situation. CERTIFIED BUSINESS CONTINUITY PROFESSIONAL (CBCP): The Disaster Recovery Institute International (DRI International), a not-for-profit corporation, certifies CBCPs and promotes credibility and professionalism in the business continuity industry. Also offers MBCP (Master Business Continuity Professional) and ABCP (Associate Business Continuity Professional). CHECKLIST EXERCISE: A method used to exercise a completed disaster recovery plan. This type of exercise is used to determine if the information such as phone numbers, manuals, equipment, etc. in the plan is accurate and current. COLD SITE: An alternate facility that already has in place the environmental infrastructure required to recover critical business functions or information systems, but does not have any pre-installed computer hardware, telecommunications equipment, communication lines, etc. 
These must be provisioned at time of disaster. SIMILAR TERMS: Shell Site; Backup Site; Recovery Site; Alternate Site COMMUNICATIONS RECOVERY: The component of Disaster Recovery which deals with the restoration or rerouting of an organization's telecommunication network, or its components, in the event of loss. SIMILAR TERMS: Telecommunications Recovery, Data Communications Recovery COMPUTER RECOVERY TEAM: A group of individuals responsible for assessing damage to the original system, processing data in the interim, and setting up the new system. CONSORTIUM AGREEMENT: An agreement made by a group of organizations to share processing facilities and/or office facilities, if one member of the group suffers a disaster. SIMILAR TERMS: Reciprocal Agreement. Page 31 of 42
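The CALL TREE entry above describes a calling order that fans out from one person to the rest of the organization. As a rough illustration only, the sketch below stores a call tree as a mapping from each caller to the people that caller contacts, and derives the calling order level by level. The role names and the `calling_order` helper are hypothetical and are not taken from the guide.

```python
from collections import deque

# Hypothetical call tree: each key calls the people listed under it.
# Role names are illustrative assumptions, not prescribed by the guide.
CALL_TREE = {
    "Continuity Coordinator": ["Management Team Lead", "Damage Assessment Lead"],
    "Management Team Lead": ["Facilities Lead", "User Support Lead"],
    "Damage Assessment Lead": ["Computer Backup Lead"],
}


def calling_order(tree, root):
    """Return (caller, callee) pairs in breadth-first calling order."""
    calls = []
    queue = deque([root])
    seen = {root}
    while queue:
        caller = queue.popleft()
        for callee in tree.get(caller, []):
            if callee in seen:
                continue  # never schedule the same person twice
            seen.add(callee)
            calls.append((caller, callee))
            queue.append(callee)
    return calls


for caller, callee in calling_order(CALL_TREE, "Continuity Coordinator"):
    print(f"{caller} calls {callee}")
```

Keeping the tree as data rather than prose makes it straightforward to regenerate the printed call list whenever the contact list changes.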
  • 32. COMMAND CENTER: Facility separate from the main facility and equipped with adequate communications equipment from which initial recovery efforts are manned and media-business communications are maintained. The management team uses this facility temporarily to begin coordinating the recovery process and its use continues until the alternate sites are functional. CONTACT LIST: A list of team members and/or key players to be contacted including their backups. The list will include the necessary contact information (i.e. home phone, pager, cell, etc.) and in most cases be considered confidential. CONTINGENCY PLANNING: Process of developing advance arrangements and procedures that enable an organization to respond to an event that could occur by chance or unforeseen circumstances. CONTINGENCY PLAN: A plan used by an organization or business unit to respond to a specific systems failure or disruption of operations. A contingency plan may use any number of resources including workaround procedures, an alternate work area, a reciprocal agreement, or replacement resources. CONTINUITY OF OPERATIONS PLAN (COOP): A COOP provides guidance on the system restoration for emergencies, disasters, mobilization, and for maintaining a state of readiness to provide the necessary level of information processing support commensurate with the mission requirements/priorities identified by the respective functional proponent. This term traditionally is used by the Federal Government and its supporting agencies to describe activities otherwise known as Disaster Recovery, Business Continuity, Business Resumption, or Contingency Planning. CRATE & SHIP: A strategy for providing alternate processing capability in a disaster, via contractual arrangements with an equipment supplier, to ship replacement hardware within a specified time period. SIMILAR TERMS: Guaranteed Replacement, Drop Ship, Quick Ship. 
CRISIS: A critical event, which, if not handled in an appropriate manner, may dramatically impact an organization's profitability, reputation, or ability to operate. CRISIS MANAGEMENT: The overall coordination of an organization's response to a crisis, in an effective, timely manner, with the goal of avoiding or minimizing damage to the organization's profitability, reputation, or ability to operate. CRISIS MANAGEMENT TEAM: A crisis management team will consist of key executives as well as key role players (i.e. media representative, legal counsel, facilities manager, disaster recovery coordinator, etc.) and the appropriate business owners of critical organization functions CRISIS SIMULATION: The process of testing an organization's ability to respond to a crisis in a coordinated, timely, and effective manner, by simulating the occurrence of a specific crisis. CRITICAL FUNCTIONS: Business activities or information that could not be interrupted or unavailable for several business days without significantly jeopardizing operation of the organization. Page 32 of 42
  • 33. CRITICAL INFRASTRUCTURE: Systems whose incapacity or destruction would have a debilitating impact on the economic security of an organization, community, nation, etc. CRITICAL RECORDS: Records or documents that, if damaged or destroyed, would cause considerable inconvenience and/or require replacement or recreation at considerable expense. D DAMAGE ASSESSMENT: The process of assessing damage, following a disaster, to computer hardware, vital records, office facilities, etc. and determining what can be salvaged or restored and what must be replaced. DATA BACKUPS: The backup of system, application, program and/or production files to media that can be stored onsite and/or offsite. Data backups can be used to restore corrupted or lost data or to recover entire systems and databases in the event of a disaster. Data backups should be considered confidential and should be kept secure from physical damage and theft. DATA BACKUP STRATEGIES: Those actions and backup processes determined by an organization to be necessary to meet its data recovery and restoration objectives. Data backup strategies will determine the timeframes, technologies, media and offsite storage of the backups, and will ensure that recovery point and time objectives can be met. DATA CENTER RECOVERY: The component of Disaster Recovery which deals with the restoration, at an alternate location, of data center services and computer processing capabilities. SIMILAR TERMS: Mainframe Recovery, Technology Recovery. DATA RECOVERY: The restoration of computer files from backup media to restore programs and production data to the state that existed at the time of the last safe backup. DATABASE REPLICATION: The partial or full duplication of data from a source database to one or more destination databases. 
Replication may use any of a number of methodologies including mirroring or shadowing, and may be performed synchronously, asynchronously, or point-in-time depending on the technologies used, recovery point requirements, distance and connectivity to the source database, etc. Replication can, if performed remotely, function as a backup for disasters and other major outages. SIMILAR TERMS: File Shadowing, Disk Mirroring. DISK MIRRORING: Disk mirroring is the duplication of data on separate disks in real time to ensure its continuous availability, currency and accuracy. Disk mirroring can function as a disaster recovery solution by performing the mirroring remotely. True mirroring will enable a zero recovery point objective. Depending on the technologies used, mirroring can be performed synchronously, asynchronously, semi-synchronously, or point-in-time. SIMILAR TERMS: File Shadowing, Data Replication, Journaling.
  • 34. DECLARATION: A formal announcement by pre-authorized personnel that a disaster or severe outage is predicted or has occurred and that triggers pre-arranged mitigating actions (e.g. a move to an alternate site). DECLARATION FEE: A one-time fee, charged by an Alternate Facility provider, to a customer who declares a disaster. NOTE: Some recovery vendors apply the declaration fee against the first few days of recovery. 1) An initial fee or charge for implementing the terms of a recovery agreement or contract. SIMILAR TERMS: Notification Fee. DESK CHECK: One method of testing a specific component of a plan. Typically, the owner or author of the component reviews it for accuracy and completeness and signs off. DISASTER: A sudden, unplanned calamitous event causing great damage or loss. 1) Any event that creates an inability on an organization's part to provide critical business functions for some predetermined period of time. 2) In the business environment, any event that creates an inability on an organization’s part to provide the critical business functions for some predetermined period of time. 3) The period when company management decides to divert from normal production responses and exercises its disaster recovery plan. Typically signifies the beginning of a move from a primary to an alternate location. SIMILAR TERMS: Business Interruption; Outage; Catastrophe. DISASTER RECOVERY: Activities and programs designed to return the entity to an acceptable condition. 1) The ability to respond to an interruption in services by implementing a disaster recovery plan to restore an organization's critical business functions. DISASTER RECOVERY OR BUSINESS CONTINUITY COORDINATOR: The Disaster Recovery Coordinator may be responsible for overall recovery of an organization or unit(s). SIMILAR TERMS: Business Recovery Coordinator. 
DISASTER RECOVERY INSTITUTE INTERNATIONAL (DRI INTERNATIONAL): A not-for-profit organization that offers certification and educational offerings for business continuity professionals. DISASTER RECOVERY PLAN: The document that defines the resources, actions, tasks and data required to manage the business recovery process in the event of a business interruption. The plan is designed to assist in restoring the business process within the stated disaster recovery goals. DISASTER RECOVERY PLANNING: The technological aspect of business continuity planning. The advance planning and preparations that are necessary to minimize loss and ensure continuity of the critical business functions of an organization in the event of disaster. SIMILAR TERMS: Contingency Planning; Business Resumption Planning; Corporate Contingency Planning; Business Interruption Planning; Disaster Preparedness. DISASTER RECOVERY SOFTWARE: An application program developed to assist an organization in writing a comprehensive disaster recovery plan.
  • 35. DISASTER RECOVERY TEAMS (Business Recovery Teams): A structured group of teams ready to take control of the recovery operations if a disaster should occur. E ELECTRONIC VAULTING: Electronically forwarding backup data to an offsite server or storage facility. Vaulting eliminates the need for tape shipment and therefore significantly shortens the time required to move the data offsite. EMERGENCY: A sudden, unexpected event requiring immediate action due to potential threat to health and safety, the environment, or property. EMERGENCY PREPAREDNESS: The discipline that ensures an organization, or community's readiness to respond to an emergency in a coordinated, timely, and effective manner. EMERGENCY PROCEDURES: A plan of action to commence immediately to prevent the loss of life and minimize injury and property damage. EMERGENCY OPERATIONS CENTER (EOC): A site from which response teams/officials (municipal, county, state and federal) exercise direction and control in an emergency or disaster. ENVIRONMENT RESTORATION: Recreation of the critical business operations in an alternate location, including people, equipment and communications capability. EXECUTIVE / MANAGEMENT SUCCESSION: A predetermined plan for ensuring the continuity of authority, decision-making, and communication in the event that key members of senior management suddenly become incapacitated, or in the event that a crisis occurs while key members of senior management are unavailable. EXERCISE: An activity that is performed for the purpose of training and conditioning team members, and improving their performance . Types of exercises include: Table Top Exercise, Simulation Exercise, Operational Exercise, and Mock Disaster. F FILE SHADOWING: The asynchronous duplication of the production database on separate media to ensure data availability, currency and accuracy. 
File shadowing can be used as a disaster recovery solution if performed remotely, to improve both the recovery time and recovery point objectives. SIMILAR TERMS: Data Replication, Journaling, Disk Mirroring. FINANCIAL IMPACT: An operating expense that continues following an interruption or disaster, which as a result of the event cannot be offset by income and directly affects the financial position of the organization.
  • 36. FORWARD RECOVERY: The process of recovering a database to the point of failure by applying active journal or log data to the current backup files of the database. G H HAZARD OR THREAT IDENTIFICATION: The process of identifying situations or conditions that have the potential to cause injury to people, damage to property, or damage to the environment. HIGH AVAILABILITY: Systems or applications requiring a very high level of reliability and availability. High availability systems typically operate 24x7 and usually require built-in redundancy to minimize the risk of downtime due to hardware and/or telecommunication failures. HIGH-RISK AREAS: Heavily populated areas, particularly susceptible to high-intensity earthquakes, floods, tsunamis, or other disasters, for which emergency response may be necessary in the event of a disaster. HOTSITE: An alternate facility that already has in place the computer, telecommunications, and environmental infrastructure required to recover critical business functions or information systems. HUMAN THREATS: Possible disruptions in operations resulting from human actions (e.g., disgruntled employee, terrorism, blackmail, job actions, riots, etc.). I INCIDENT COMMAND SYSTEM (ICS): Combination of facilities, equipment, personnel, procedures, and communications operating within a common organizational structure with responsibility for management of assigned resources to effectively direct and control the response to an incident. Intended to expand, as situation requires larger resources, without requiring new, reorganized command structure. (NEMA Term) INCIDENT MANAGER: Commands the local EOC reporting up to senior management on the recovery progress. Has the authority to invoke the local recovery plan. INCIDENT RESPONSE: The response of an organization to a disaster or other significant event that may significantly impact the organization, its people, or its ability to function productively. 
An incident response may include evacuation of a facility, initiating a disaster recovery plan, performing damage assessment, and any other measures necessary to bring an organization to a more stable status.
  • 37. INTEGRATED TEST: A test conducted on multiple components of a plan, in conjunction with each other, typically under simulated operating conditions INTERIM SITE: A temporary location used to continue performing business functions after vacating a recovery site and before the original or new home site can be occupied. Move to an interim site may be necessary if ongoing stay at the recovery site is not feasible for the period of time needed or if the recovery site is located far from the normal business site that was impacted by the disaster. An interim site move is planned and scheduled in advance to minimize disruption of business processes; equal care must be given to transferring critical functions from the interim site back to the normal business site. INTERNAL HOTSITE: A fully equipped alternate processing site owned and operated by the organization. J JOURNALING: The process of logging changes or updates to a database since the last full backup. Journals can be used to recover previous versions of a file before updates were made, or to facilitate disaster recovery, if performed remotely, by applying changes to the last safe backup. SIMILAR TERMS: File Shadowing, Data Replication, Disk Mirroring. K L LAN RECOVERY: The component of business continuity that deals specifically with the replacement of LAN equipment and the restoration of essential data and software in the event of a disaster. SIMILAR TERM: Client/Server Recovery. LINE REROUTING: A short-term change in the routing of telephone traffic, which can be planned and recurring, or a reaction to an outage situation. Many regional telephone companies offer service that allows a computer center to quickly reroute a network of dedicated lines to a backup site. LOSS REDUCTION: The technique of instituting mechanisms to lessen the exposure to a particular risk. Loss reduction involves planning for, and reacting to, an event to limit its impact. 
Examples of loss reduction include sprinkler systems, insurance policies, and evacuation procedures. LOST TRANSACTION RECOVERY: Recovery of data (paper within the work area and/or system entries) destroyed or lost at the time of the disaster or interruption. Paper documents may need to be requested or re-acquired from original sources. Data for system entries may need to be recreated or reentered.
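The JOURNALING and FORWARD RECOVERY entries above both rest on the same idea: changes logged since the last safe backup are replayed onto that backup to bring the data forward to the point of failure. The sketch below is a minimal illustration of that idea only. The in-memory dictionary "database", the `(operation, key, value)` journal format, and the `forward_recover` name are all assumptions made for the example and do not describe any particular database product.

```python
def forward_recover(backup, journal):
    """Replay journal entries onto a copy of the last safe backup.

    backup  -- dict holding the state captured by the last safe backup
    journal -- ordered list of (operation, key, value) changes logged since
               that backup; this entry format is a hypothetical one
    """
    state = dict(backup)  # work on a copy so the backup itself stays intact
    for operation, key, value in journal:
        if operation == "set":
            state[key] = value
        elif operation == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown journal operation: {operation}")
    return state


# Illustrative data only.
last_safe_backup = {"acct-1": 100, "acct-2": 250}
journal = [
    ("set", "acct-1", 80),       # update made after the backup
    ("set", "acct-3", 40),       # record created after the backup
    ("delete", "acct-2", None),  # record removed after the backup
]
recovered = forward_recover(last_safe_backup, journal)
print(recovered)
```

Any change that never reached the journal cannot be replayed, which is where the LOST TRANSACTION RECOVERY entry applies: that data has to be recreated or reentered from original sources.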
  • 38. M MISSION-CRITICAL APPLICATION: An application that is essential to the organization’s ability to perform necessary business functions. Loss of the mission-critical application would have a negative impact on the business, as well as legal or regulatory impacts. MOBILE RECOVERY: A mobilized resource purchased or contracted for the purpose of business recovery. The mobile recovery center might include: computers, workstations, telephone, electrical power, etc. MOCK DISASTER: One method of exercising teams in which participants are challenged to determine the actions they would take in the event of a specific disaster scenario. Mock disasters usually involve all, or most, of the applicable teams. Under the guidance of exercise coordinators, the teams walk through the actions they would take per their plans, or simulate performance of these actions. Teams may be at a single exercise location, or at multiple locations, with communication between teams simulating actual ‘disaster mode’ communications. A mock disaster will typically operate on a compressed timeframe representing many hours, or even days. N NATURAL THREATS: Events caused by nature that have the potential to impact an organization. NETWORK OUTAGE: An interruption in system availability resulting from a communication failure affecting a network of computer terminals, processors, and/or workstations. O OFF-SITE STORAGE: Alternate facility, other than the primary production site, where duplicated vital records and documentation may be stored for use during disaster recovery. OPERATIONAL EXERCISE: One method of exercising teams in which participants perform some or all of the actions they would take in the event of plan activation. Operational exercises, which may involve one or more teams, are typically performed under actual operating conditions at the designated alternate location, using the specific recovery configuration that would be available in a disaster. 
OPERATIONAL IMPACT ANALYSIS: Determines the impact of the loss of an operational or technological resource. The loss of a system, network or other critical resource may affect a number of business processes. OPERATIONAL TEST: A test conducted on one or more components of a plan under actual operating conditions. P
  • 39. PLAN ADMINISTRATOR: The individual responsible for documenting recovery activities and tracking recovery progress. PEER REVIEW: One method of testing a specific component of a plan. Typically, the component is reviewed for accuracy and completeness by personnel (other than the owner or author) with appropriate technical or business knowledge. PLAN MAINTENANCE PROCEDURES: Maintenance procedures outline the process for the review and update of business continuity plans. R RECIPROCAL AGREEMENT: Agreement between two organizations (or two internal business groups) with basically the same equipment/same environment that allows each one to recover at each other’s site. RECOVERY: Process of planning for and/or implementing expanded operations to address less time-sensitive business operations immediately following an interruption or disaster. 1) The start of the actual process or function that uses the restored technology and location. RECOVERY PERIOD: The time period between a disaster and a return to normal functions, during which the disaster recovery plan is employed. RECOVERY SERVICES CONTRACT: A contract with an external organization guaranteeing the provision of specified equipment, facilities, or services, usually within a specified time period, in the event of a business interruption. A typical contract will specify a monthly subscription fee, a declaration fee, usage costs, method and amount of testing, termination options, penalties and liabilities, etc. RECOVERY STRATEGY: An approach by an organization that will ensure its recovery and continuity in the face of a disaster or other major outage. Plans and methodologies are determined by the organization's strategy. There may be more than one methodology or solution for an organization's strategy. 
Examples of methodologies and solutions include contracting for Hotsite or Coldsite, building an internal Hotsite or Coldsite, identifying an Alternate Work Area, a Consortium or Reciprocal Agreement, contracting for Mobile Recovery or Crate and Ship, and many others. RECOVERY POINT OBJECTIVE (RPO): The point in time to which systems and data must be recovered after an outage (e.g. end of previous day's processing). RPOs are often used as the basis for the development of backup strategies, and as a determinant of the amount of data that may need to be recreated after the systems or functions have been recovered. RECOVERY TIME OBJECTIVE (RTO): The period of time within which systems, applications, or functions must be recovered after an outage (e.g. one business day). RTOs are often used as the basis
  • 40. for the development of recovery strategies, and as a determinant as to whether or not to implement the recovery strategies during a disaster situation. SIMILAR TERMS: Maximum Allowable Downtime. RESPONSE: The reaction to an incident or emergency to assess the damage or impact and to ascertain the level of containment and control activity required. In addition to addressing matters of life safety and evacuation, Response also addresses the policies, procedures and actions to be followed in the event of an emergency. 1) The step or stage that immediately follows a disaster event where actions begin as a result of the event having occurred. SIMILAR TERMS: Emergency Response, Disaster Response, Immediate Response, and Damage Assessment. RESTORATION: Process of planning for and/or implementing procedures for the repair or relocation of the primary site and its contents, and for the restoration of normal operations at the primary site. RESUMPTION: The process of planning for and/or implementing the restarting of defined business operations following a disaster, usually beginning with the most critical or time-sensitive functions and continuing along a planned sequence to address all identified areas required by the business. 1) The step or stage after the impacted infrastructure, data, communications and environment has been successfully re-established at an alternate location. RISK: Potential for exposure to loss. Risks, either man-made or natural, are constant. The potential is usually measured by its probability in years. RISK ASSESSMENT / ANALYSIS: Process of identifying the risks to an organization, assessing the critical functions necessary for an organization to continue business operations, defining the controls in place to reduce organization exposure and evaluating the cost for such controls. Risk analysis often involves an evaluation of the probabilities of a particular event. 
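The RECOVERY POINT OBJECTIVE and RECOVERY TIME OBJECTIVE entries above can be checked against the timeline of a test or an actual outage. The sketch below is an illustration under stated assumptions: the RPO is expressed as a maximum tolerable gap between the last good backup and the outage, the RTO as a maximum tolerable downtime, and the `check_objectives` function and all timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def check_objectives(last_backup, outage_start, service_restored, rpo, rto):
    """Compare one outage timeline against RPO and RTO targets.

    rpo -- assumed here to be the longest acceptable span of lost work,
           measured from the last good backup to the outage
    rto -- the longest acceptable downtime
    """
    data_loss_window = outage_start - last_backup
    downtime = service_restored - outage_start
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Illustrative timeline only: a nightly backup, a morning outage,
# and service restored the following morning.
result = check_objectives(
    last_backup=datetime(2005, 10, 3, 23, 0),
    outage_start=datetime(2005, 10, 4, 9, 30),
    service_restored=datetime(2005, 10, 5, 8, 0),
    rpo=timedelta(hours=24),
    rto=timedelta(hours=8),
)
print(result)
```

In this made-up timeline the backup is recent enough to satisfy the RPO, but the downtime exceeds the RTO, which is the kind of gap a plan test is meant to expose.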
RISK MITIGATION: Implementation of measures to deter specific threats to the continuity of business operations, and/or respond to any occurrence of such threats in a timely and appropriate manner. S SALVAGE & RESTORATION: The process of reclaiming or refurbishing computer hardware, vital records, office facilities, etc., following a disaster. SIMULATION EXERCISE: One method of exercising teams in which participants perform some or all of the actions they would take in the event of plan activation. Simulation exercises, which may involve one or more teams, are performed under conditions that at least partially simulate ‘disaster mode’. They may or may not be performed at the designated alternate location, and typically use only a partial recovery configuration. STANDALONE TEST: A test conducted on a specific component of a plan, in isolation from other components, typically under simulated operating conditions.
  • 41. STRUCTURED WALKTHROUGH: One method of testing a specific component of a plan. Typically, a team member makes a detailed presentation of the component to other team members (and possibly non-members) for their critique and evaluation. SUBSCRIPTION: Contract commitment that provides an organization with the right to utilize a vendor recovery facility for processing capability in the event of a disaster declaration. SYSTEM DOWNTIME: A planned or unplanned interruption in system availability. T TABLE TOP EXERCISE: One method of exercising teams in which participants review and discuss the actions they would take per their plans, but do not perform any of these actions. The exercise can be conducted with a single team, or multiple teams, typically under the guidance of exercise facilitators. TEST: An activity that is performed to evaluate the effectiveness or capabilities of a plan relative to specified objectives or measurement criteria. Types of tests include: Desk Check, Peer Review, Structured Walkthrough, Standalone Test, Integrated Test, and Operational Test. TEST PLAN: A document designed to periodically exercise specific action tasks and procedures to ensure viability in a real disaster or severe outage situation. U UNINTERRUPTIBLE POWER SUPPLY (UPS): A backup supply that provides continuous power to critical equipment in the event that commercial power is lost. V VITAL RECORD: A record that must be preserved and available for retrieval if needed. W WARM SITE: An alternate processing site equipped with some hardware, communications interfaces, and electrical and environmental conditioning, which is capable of providing backup only after additional provisioning, software, or customization is performed.
WORKAROUND PROCEDURES: Interim procedures that may be used by a business unit to enable it to continue to perform its critical functions during temporary unavailability of specific application systems, electronic or hard copy data, voice or data communication systems, specialized equipment, office facilities, personnel, or external services. SIMILAR TERMS: Interim Contingencies.
  • 42. APPENDIX B - Online Sources of Additional Information 1. 2. 3.