Real liferecoverypaper



Real-Life Recovery: Perspective, Preparation and Performance

Daniel W. Fink

In an anonymous office building downtown, an Oracle database administrator checks the log files to verify that the backup processes ran properly last night. Then he grabs a 2nd cup of coffee and moves on to the day's tasks, emergencies and assorted events. Users, managers and developers are concerned with data accuracy, proper security and blinding performance, not with the mundane task of testing recovery. Little does he or the company know that the past month's backups are useless. Since the recent operating system update, the tape drive, the only tape drive on site, writes without error, but cannot read a single bit that is on the tape.

A few miles away, 2 DBAs and a System Administrator are completing a 12-hour system and database recovery. No one has slept in over 36 hours. Mistakes were made, but they were caught well before the right pinkie was poised over the <ENTER> key. When the final recovery command was issued, the three felt suddenly refreshed…success is a great stimulant. For the 3rd time in as many months, they have successfully performed a recovery with no perceivable loss of data. The Operations Manager walks in with a pot of coffee and a piece of paper. "Pretty good, no data loss," she says, "but you could have been done 2 hours ago. I'll grade it a B+. Let's go for an A next time! Go home, get some rest and we'll see you tomorrow."

These two tales illustrate several important points regarding Oracle Backup and Recovery. Most sites focus on backups and other issues before addressing recovery. Single points of failure are shortcuts to disaster, both business and career. The only measure of a successful backup plan is how completely recovery can be performed. The time to practice a recovery is not when a production system is down and the Order Entry clerks are waiting. Personnel can be the single greatest weakness in recovering a database, but they can also be the single greatest strength.

Perspective

Most Oracle documentation gives equal billing to backup and recovery. The routine task of setting up, monitoring and maintaining backup processes consumes valuable resource time. Once the process has been established, other issues, such as security, tuning and troubleshooting, move to the top of the DBA task list. Usually, a crisis is the time that the recovery process is performed. That is also the worst possible time to find out that the backup process is flawed and the SYSTEM tablespace file cannot be read from the tape.

This situation is a result of the attitude that the DBA is responsible for performing database backup. The actual responsibility is to restore or recover the database to the point in time and within the downtime window determined by the business needs.

Rather than the backup process dictating the recovery options, the recovery requirements should drive the backup process. A backup strategy is determined by the answer to a single question, "What is the best method of backup and archiving that meets all of the business needs of data loss, downtime and cost?" As with many decisions, the question appears to be simple, but is complex when defining 'best method', 'business needs', 'data loss', 'downtime' and 'cost'. In an attempt to define these variables, the Cost v. Lost Revenue Model is used to assist in answering this question.

• Restore, Recover, Rebuild

To restore is to return to a former state. In database terms, this means to return the database file(s) to a previous point in time by copying the file from backup to online media. The database or particular file may be missing transactions. Restoration is required before database recovery can be performed after a media failure or other similar failure. If the database is not running in archivelog mode, the restored state is probably the state in which the database will be opened.
To recover is to return to a normal state. In database terms, this means to return the database to a point in time determined by the situation. It may be a complete or incomplete recovery, depending upon the circumstances. Files are restored, and then archived redo logs are applied to bring the database to the desired state. Some transactions may be lost, but they are usually few.

To rebuild is to recreate the state. In database terms, this usually involves reusing the data and process that populated the database in the most recent iteration. In some environments, particularly data warehouse or nonproduction, the raw data exists outside of the database. There may be no actual transactions since the last load, therefore nothing is lost.

• Base Knowledge and Training

Transaction Architecture. The reason for a database is to support transactions, whether they are entering a $.99 cable order or running a 3-day Regional Sales data warehouse query. The understanding of how a transaction works, and what happens when it does not work, is the cornerstone of database knowledge.

Recovery Structures and Processes. A DBA must be able to answer the question "What is the difference between a rollback segment and redo log?" in 25 words or less. Know the recovery structures, what they do and how to recover from the loss of each one. It is important to know the performance implications of running a database in archivelog mode, both from a processing standpoint and for archive log destination cleanup. Intimate knowledge of each recovery type is critical to success. Minimum areas of understanding include the pros and cons of each type, impact on business and downtime and required steps, i.e. which files to restore, what steps to perform during and after recovery, etc.

Business Requirements. Each database is different: hours of operation, failure tolerance, data integrity rules, etc. A database is a dynamic system, constantly changing within its environment, which is also changing. The requirements in place 6 months ago may have changed drastically. Define your recovery goals for each system and database. Decisions are driven by the amount of time allowable to perform a recovery and the amount of data allowed to be lost. These decisions are not set in concrete; they will change over the lifetime of the database.

Training and Education. The responsibility for performing backup and recovery operations should never fall on the shoulders of a person who is not well trained and educated in the ways of Oracle. Backup development and practice recovery can be done in a mentor-student relationship, until the student is confident enough to attempt these tasks unassisted. The process is similar to learning to fly a plane: put in the hours before attempting to go solo. There are several excellent books and white papers available on this subject. One of the best is Rama Velpuri's Oracle Backup & Recovery Handbook from Oracle Press.

• Points of Failure

'Never have a single point of failure' is the Oracle Backup & Recovery Maxim. The basic foundation is that a single failure should never prevent a backup or recovery process from succeeding. If day-old archive logs are deleted after the cold backup, the backup tape is a single point of failure. If the logfiles are unable to be restored, the previous backup cannot be used to recover the database because the archived logs are unavailable.

Identify the 'show stoppers' and have at least 2 options for bypassing each. A single bad backup or command should never be the cause of a failed recovery. Taking an extra 2 hours to recover a lost datafile from a 2-day-old backup is much easier than explaining why yesterday's sales were lost because the archived redo log was lost.

Prevention of failures is a proactive approach to recovery. Recovery from a dropped table is possible, but implementing and enforcing database security, access and maintenance procedures can prevent the situation. Using an appropriate RAID configuration can minimize the impact of disk loss. There are expenses associated with prevention, but they may be minimal compared with actual recovery costs. Even if proactive solutions are implemented, there will still be areas of weakness. While certain RAID configurations will protect against the loss of a single disk, they cannot protect against the delete command.

• Failure Prone Components

Personnel. The most failure prone component is the person issuing the commands. Document the steps required for each recovery type. When documenting, assume little, if any, practice. The recovery manual should address the areas that pose the greatest chance of total loss of service. Use signposts like "Proceed with Caution", "Stop & Review", etc. and decision trees. Update the manual periodically, at least at every major change. If you are using Oracle Support assistance, make certain that you are talking to someone who has performed recoveries.

Redo Logs (online, offline and archived). The loss of a single redo log can result in total recovery failure. This is one reason that Oracle recommends using mirrored redo log groups. Using hardware level mirroring is not sufficient: you are protected from media failure, i.e. disk loss, but you are not protected from the delete command. Archived logs should exist in their natural state on at least one backup tape. Once there, they may be compressed and kept online for several days, and placed on several other backup tapes. This strategy offers good protection from loss and assists in recovery speed by having the logs online. Although rare, the compression/uncompression cycle may result in corrupted files, so use with caution. If disk space allows, have at least two uncompressed archive log copies on different tapes.

Control Files. If all control files are lost, Oracle offers several methods for performing recovery. However, these tasks are risky and easily avoidable. Another Oracle maxim is to have 3 copies of your controlfile, because a corrupt controlfile will always, well almost always, be copied over a good controlfile. If you have a 3rd controlfile, the erroneous copy will still happen, but only you will know…the 3rd copy is used to save the day and your job.

Initialization Parameter Files. These files are often overlooked. As with controlfiles, their loss can be overcome, but at a substantial effort. Data is not lost, but downtime is increased. A copy of all parameter files and a dump of v$parameter add several layers of backup.

System Tablespace datafile(s). These datafiles may be the most important in the database, for they hold the blueprint. Without the data dictionary, Oracle is blind to all of the data that is residing in the datafiles.

Data Tablespace datafile(s). Under the OFA standard, data and indexes are separated into distinct tablespaces. This allows backup and recovery plans to be flexible. If the system and data tablespace datafiles are recoverable, all other tablespaces can be built from scratch, if sufficient documentation exists. For example, indexes can be recreated if the most current DDL scripts are available. For Oracle8, data tablespaces include those objects defined as index-only tables.

Rollback Segment Tablespace datafile(s). Although these tablespaces are used by transactions to provide transaction recovery and read consistency, the loss of this tablespace is not necessarily a 'death sentence'. The key factor is the state of the database at the point of the backup. If it is in a consistent state, i.e. no uncommitted transactions, there should be no rollback entries. As such, the segments can be dropped and rebuilt without data loss.

Index Tablespace datafile(s). As stated above, if the most current DDL scripts exist, indexes can be recreated from existing data. There is a substantial recovery performance penalty incurred as the indexes are being rebuilt, but no data loss occurs.

Temporary Tablespace datafile(s). Once the database is shut down, the temporary segments are 'clean'. If the temporary tablespace(s) are lost, recreation is a fairly simple process. As with indexes, there is a recovery performance penalty, but no data loss occurs.

Misc Files. File maps, object recreation scripts, backup scripts and other assorted files should be part of the backup process. Performing a media recovery is greatly simplified if an up-to-date tablespace-file-device map is available. While these files are not required for recovery, they can shorten and simplify the recovery process.

Preparation – the Cost v. Lost Revenue (CLR) Model
In a perfect world, the CIO would hand the DBA and SA a blank check for backup hardware/software and provide an unlimited operations budget. In the real world, there is a balance between the costs associated with backup and the benefits of recovery. Although the business needs drive the Cost-Benefit Analysis, it is the responsibility of the technical staff to educate the users and support the decision-making process. It is not the responsibility of the technical staff to make the decision.

The basic concept is to compare the cost of backup and recovery with lost revenue. This is balanced so that the frequency of backup v. the frequency of recovery is appropriately weighted. In a production environment, a downed Order Entry system can be quantified. In a development environment, the quantification is related more to lost Developer/User time.

The model should be generated for 3 scenarios: worst, best and anticipated. The anticipated case should be somewhere between worst and best. Best case should not be a fantasy; assume some level of recovery requirements.

There are five basic steps to the CLR Model process:

1) Educate the decision makers

Discuss issues in nontechnical terms that are clear to everyone. The first step is to lay the foundation by educating the decision-makers in basic knowledge of backup and recovery. Although an Accounting Manager does not need to understand the intimate details of archive logging, they do need to understand that transaction recovery is not possible without it...and what the business implications are of adopting this strategy.

2) Assign costs to each resource

This phase is primarily devoted to assigning actual or opportunity costs to each resource. The resources are hardware, software and peopleware. The backup process will consume disk space and/or tapes, CPU cycles and memory on the host platform. Backup software must be purchased or written. Administrators of various experience levels must monitor backups, perform tests, document and perform the occasional recovery.

For hardware and software, acquisition and operation costs can be quantified or estimated. The operation costs are defined within the scope of time. The actual time is not important, but it must be consistent. Hardware operation costs for a year are not comparable to backup operation costs for 6 months.

For personnel, the actual or opportunity cost is determined. For most production environments, personnel cost is actual, i.e. the sum of salary and benefits or hourly charges. For development environments, the cost may be classified as opportunity, i.e. the hourly rate that the client is being charged.

Although they are not assigned at this point, the intangible costs should be discussed. These costs include impact on project timelines and goodwill with users, management and technical personnel. If a project is approaching a critical junction, the cost of downtime may escalate dramatically. Users may tire of experiencing excessive downtime; Managers may tire of user complaints; Developers may tire of reloading the past month's data; Administrators may tire of 36-hour days. The short-term impact may be less money spent on the systems, but the long-term impact may be frustration, high turnover and added costs.

3) Assign costs to each backup strategy

There are three types of backup strategy: hot, cold with archiving and cold without archiving. A cost for each scenario is calculated in two areas: fixed and variable.

Fixed costs are the one-time costs for each step during the operational period. Software needs to be purchased or written once. Disk space required for Archived Logs or backup datafiles is another one-time investment. These costs are incurred regardless of the frequency of backups, assuming the frequency is greater than 0.

Variable costs are incurred each time the backup process occurs. Logfiles must be monitored, tapes must be written and stored and scripts must be maintained.

At this point, the first business decision is made. The anticipated frequency of the database backups is determined. This decision is very preliminary; it is not set in stone. The less frequent the backup, the more costly the recovery, but a final decision is premature. It is important to use the preliminary decision as a baseline. Once the recovery costs are determined, the type of backup can be revisited.

Another alternative is to complete this phase of the model for each type and frequency of backup. While time consuming, it may be simpler and quicker when busy schedules and diverse audiences are involved.

4) Assign costs to each recovery scenario

The most common types of recoveries are object, file and disk, each with a distinct strategy. Cost determination is common among all of the strategies, and is slightly different from determining backup cost.

The first area of recovery cost is practice cost. The major fixed expense is for a practice platform. An ideal situation is to have a 'sandbox' for System and Database administrators to test upgrades and recoveries. In many situations, an older development platform may be adequate. Added to the cost is the variable personnel cost for each practice.

The second area of cost is restoration. This includes the time to detect and correct/bypass the cause of failure. Then the backup media is retrieved and objects or files are restored. If archiving is being used, only the affected object or file needs restoration, which can significantly reduce the restoration time.

A final, but optional, area is the cost of recovery. If archiving is being used, Oracle recovery processes can be used. This is usually the quickest method of recovery. Other methods are transaction reentry and data reload, which are both time and personnel resource intensive. For a small decision support system, it may be more efficient to recreate the database from tested scripts and reload the previous period's data.

The best method for estimating recovery time is to actually practice each step. An alternative method is to use estimations for each step. Regardless, a qualified guess is better than nothing, but remember that these decisions are among the most important that may be made in your career.

5) Support the business user(s) in making an educated decision

Once all of the costs are determined, the process of finding an acceptable balance is begun. Unless the company can afford hot-standby, fault-tolerant, fail-over systems, compromises must be made. Simply put, it is time to make an educated, documented decision that is supported by the business. The decisions that are made during this step may result in lost data, missed sales, overtime and tough explanations to management. If it is determined that the decision requires additional resources, such as disk drives or a tape library, they can be purchased before they are needed. Recovery planning is part of the system design phase.

Although it may not be easy to persuade management to invest $100,000 in a system that is rarely used, it is better than explaining why $1,000,000 in sales were lost yesterday because the Order Entry system is not archiving transactions.

Performance

Oracle database recovery is one area where the line between success and failure is clear. If the database is recovered according to the defined business requirements, the recovery was successful; if not, the recovery was a failure. Trying to recover is not acceptable. In the words of Yoda, "Do or Do Not. There is no Try."

The only valid method for testing a backup process is to perform a test recovery. This test should be repeated on a regularly scheduled basis and after any major change, such as a new tape device or system upgrade. The time to find out that the tape drive is not functioning properly or that a new database file is not being backed up is during a practice run, not a live recovery situation. Practice runs will also expose gaps in procedures, documentation, training and, most importantly, confidence.

• Practice, Practice, Practice
Practice recoveries are the only method for gaining confidence in the process and exposing weaknesses. Being able to quote chapter and verse from recovery theory is not a substitute for drawing upon actual experience, especially at 4:00 in the morning after a 20-hour day. At that time, you may be one keystroke away from success or failure. It is also the time when your mental faculties are at their most vulnerable.

Mistakes during a practice recovery cause no harm. In fact, they are a tremendous learning experience. If you do not have the time and training to perform a recovery 100% right the first time, there may be no next time…at least not at the current company.

Practice will expose the 'bridge burning' steps. These points are critical decisions where returning to a previous step is difficult or impossible, unless certain precautions are taken. Copying a backup datafile in place of an existing datafile is such a step. These steps must be documented and well understood. It may be appropriate to back up a 'bad' datafile prior to restoration. This allows for a retreat should it be needed. Treat these situations as 'show stoppers'.

Oracle8 introduced the Recovery Manager tool, which joins a number of 3rd party products. If a tool is used, it needs to be part of the practice. You must also plan for the possibility that the tool is unavailable, so always practice manual forms of recovery. Tool availability is not a substitute for basic foundational knowledge.

Practicing recovery is not a minor expense, but it pales in comparison to losing a substantial amount of data for a production environment. It requires professional dedication to the business. As recoveries are practiced, knowledge and skill are developed and confidence is increased. In turn, this may assist in redeploying the backup and recovery strategy.

• Executing the Plan

1. Stop panicking. It is imperative to get into the proper frame of mind. Taking an extra 15 minutes to calmly discuss the possible alternatives may make the difference between a successful recovery and a failed attempt that adds 15 hours of downtime.

2. Determine scope and cause of failure. The type of failure will have serious impact on the types of recovery to be considered. If the failure will be repeated shortly after the recovery is performed, the recovery is wasted. If a table has been dropped, performing a complete recovery will return the database to the point AFTER the table was dropped.

3. Correct/Bypass failure. When a disk fails, the datafiles should be restored to another disk or the disk should be replaced. If the failure is not corrected or bypassed, a recovery may be wasted.

4. Identify Plan of Attack. Once the failure is understood and corrected, the recovery options are defined and a plan is determined. If a datafile for an index-only tablespace has been damaged, it may be more efficient to drop the tablespace and rebuild the affected indexes. Depending upon the plan, the next step may not be required.

5. Restore affected data. After failure correction, the process of restoration can begin. A tape or other backup media is retrieved and the affected data and log files are restored to the system. If redo log files are not being archived, the complete database must be restored.

6. Perform recovery. The actual recovery strategy that was determined in the plan is performed. In the above example, this may involve dropping the tablespace, recreating it and then rebuilding the indexes and recreating constraints. Depending upon the backup plan, there may be substantial resource requirements, especially if the data must be reloaded or transactions reentered. If time allows, a full cold backup should be performed after recovery but before the database is opened for general use.

7. Postmortem. Debrief, document and determine improvements. Determine if the failure situation could be prevented. Losing data and incurring downtime due to a preventable failure is rarely acceptable in a business critical system, unless the business requirements have accounted for and accepted this possibility.

• Conclusion
Among the myriad of DBA tasks and responsibilities, the most important to the business is the ability to properly recover from a failure. There are many factors that determine the type and scope of recovery, primarily the balance between what the business can afford to spend and what it can afford to lose. These decisions are not the responsibility of the DBA or SA; rather, they are to be made by key members of the user community. The technical staff functions as a support organization by educating the users and assisting in the decision-making process using a Cost v. Lost Revenue Model.

Once the business has determined the backup and recovery needs, the technical staff becomes responsible for ensuring that these operations are properly executed. As a technical administrator, you are the greatest strength and greatest weakness for the recovery process. Training and practice are the paths to success. The recovery process is too critical to depend on less than 100% effort and ability.

• Biographical Information

Daniel Fink has worked with relational databases for 11 years, the last 6 as a production database administrator for Oracle systems. He specializes in production system management, configuration, tuning, backup and, of course, recovery. He has also developed and taught Oracle training classes for clients nationwide.
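• A Worked Sketch of the Cost v. Lost Revenue Comparison

The Cost v. Lost Revenue Model described under Preparation can be illustrated with a few lines of Python. This is only a sketch under stated assumptions, not a formula given in the paper: every dollar figure, the hourly revenue, the staff rate, the failure counts per scenario and the names Strategy, total_cost and cheapest_by_scenario are hypothetical, and lost data is simply priced at the hourly revenue rate. It follows the outline of steps 2 through 4: fixed and variable backup costs, practice cost and a per-failure restoration and recovery cost, all over one consistent period, evaluated for the best, anticipated and worst scenarios.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    """One backup strategy, costed over a single consistent period (one year here)."""
    name: str
    fixed_cost: float          # one-time costs in the period (software, disk space)
    cost_per_backup: float     # variable cost incurred each time a backup runs
    backups_per_period: int    # anticipated backup frequency
    practice_cost: float       # practice platform plus personnel time
    downtime_hours: float      # restore + recover time per failure
    data_loss_hours: float     # hours of transactions that cannot be recovered


def total_cost(s, failures, revenue_per_hour, staff_rate_per_hour):
    """Backup cost plus practice cost plus the cost of the given number of failures."""
    backup = s.fixed_cost + s.cost_per_backup * s.backups_per_period
    # Assumption: downtime costs lost revenue plus staff time; lost data costs revenue.
    per_failure = (
        s.downtime_hours * (revenue_per_hour + staff_rate_per_hour)
        + s.data_loss_hours * revenue_per_hour
    )
    return backup + s.practice_cost + failures * per_failure


def cheapest_by_scenario(strategies, scenarios, revenue_per_hour, staff_rate_per_hour):
    """Return {scenario name: name of the lowest-cost strategy}."""
    return {
        label: min(
            strategies,
            key=lambda s: total_cost(s, failures, revenue_per_hour, staff_rate_per_hour),
        ).name
        for label, failures in scenarios.items()
    }


# Hypothetical figures for illustration only.
STRATEGIES = [
    Strategy("hot", 40_000, 50, 365, 10_000, 2, 0),
    Strategy("cold with archiving", 25_000, 80, 52, 10_000, 6, 0),
    Strategy("cold without archiving", 5_000, 80, 52, 5_000, 8, 40),
]

# Failures per period; the best case still assumes some recovery is needed.
SCENARIOS = {"best": 1, "anticipated": 2, "worst": 4}

if __name__ == "__main__":
    for label, failures in SCENARIOS.items():
        for s in STRATEGIES:
            cost = total_cost(s, failures, 5_000, 150)
            print(f"{label:12} {s.name:24} {cost:>12,.0f}")
    print(cheapest_by_scenario(STRATEGIES, SCENARIOS, 5_000, 150))
```

With these made-up inputs the cheapest strategy changes as the expected number of failures rises, which is the balance the business users are asked to weigh in step 5. The technical staff supplies the numbers; the decision remains with the business.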