Piotr Kołodziej - Oracle Database High Availability Essentials
Published on

18.05.2011 r.


Published in: Technology

  • Define service-level agreements (SLAs) in terms of high availability for critical aspects of the business. These can be categorized into high availability tiers:
    - Tier 1 processes have maximum business impact. They have the most stringent high availability requirements, with RTO and RPO close to zero, and require continuously available supporting systems.
    - Tier 2 processes have slightly relaxed high availability, RTO, and RPO requirements.
    - Tier 3 processes may be related to internal development and quality assurance. Systems supporting these processes need not meet the rigorous high availability requirements of the other tiers.
The business impact analysis categorizes the business processes based on the severity of the impact of IT-related outages. A complete business impact analysis provides the insight needed to quantify the cost of unplanned and planned downtime. Understanding this cost is essential because it helps prioritize your high availability investment and directly influences the high availability technologies that you choose to minimize the downtime risk.
RTO (recovery time objective) - the maximum amount of time that an IT-based business process can be down before the organization starts suffering unacceptable consequences (financial losses, customer dissatisfaction, damage to reputation, and so on). RTO indicates the downtime tolerance of a business process or of an organization in general.
RPO (recovery point objective) - the maximum amount of data that an IT-based business process may lose without harm to the organization. RPO indicates the data-loss tolerance of a business process or of an organization in general. This data loss is often measured in terms of time, for example, 5 hours or 2 days of data loss.
Determine the backup retention policy (onsite, offsite, long-term). The backup retention policy establishes a limit on the length of time backups are maintained, and it encourages a distinction between the purposes and practices of backing up data and those of retrieving or archiving data.
The manageability goal is more subjective than either the RPO or the RTO. It results from an objective evaluation of the skill sets and management resources available in an organization, and of the degree to which the organization can successfully manage all elements of a high availability architecture.
Understanding total cost of ownership (TCO) and return on investment (ROI) is essential to selecting a high availability architecture that also achieves the business goals of your organization. TCO includes all costs (such as acquisition, implementation, systems, networks, facilities, staff, training, and support) over the useful life of the chosen solution. Likewise, the ROI calculation captures all of the financial benefits that accrue to a given high availability architecture.
  • Oracle Database High Availability is fundamentally different from traditional third-party solutions implemented external to Oracle. The traditional approach is to monitor at a physical bits-and-bytes level. For example, storage remote-mirroring replicates all writes to every file. Cold cluster failover monitors against server failure, requiring idle servers and a cold restart at failover time. These are just two examples; the list goes on. They all share the same shortcoming: they have no internal knowledge of the Oracle database or of the very transactions that they are attempting to protect.
Oracle High Availability capabilities are integrated within the Oracle Database, enabling intimate knowledge of the transactions and data being protected. This enables:
    - Superior protection against data corruptions
    - Faster recovery from any outage
    - Zero or minimal downtime for planned maintenance
    - The ability to utilize your investment in systems and software for productive purposes at all times
  • This slide is useful in two critical respects. First, it shows how Oracle Database's HA solutions are available in an integrated manner, as built-in features of the Oracle Database. So if you have an Oracle Database, you have all of these capabilities available to you. No extra integration is required. Second, it gives you some guidance on a step-by-step approach to HA. If you are looking at re-architecting your HA configuration, it's probably not a good idea to configure all these solutions at the same time. Maybe minimizing planned downtime is a higher priority for your data center than safeguarding against unplanned failures. If that is the case, start with the solutions in the planned downtime category. Once that is tested and implemented, you can start looking at other areas.
  • The result is what we call the Oracle Maximum Availability Architecture (MAA). Walk through the major components that address unplanned and planned downtime. This slide builds from a single instance to a full MAA deployment. If the standby is local, then the backups can be taken off the standby.
  • Site Failure - For Data Guard and MAA architectures, the recovery time indicated applies to database and existing connection failover. Network connection changes and other site-specific failover activities may lengthen overall recovery time. For Data Guard non-RAC and MAA it is the time it takes to fail over.
Computer Failure - For the Oracle DB architecture this is the time it takes to restore the system. For Data Guard non-RAC it is the time it takes to fail over.
Storage Failure - Storage failures are prevented by using Oracle ASM with mirroring and its automatic rebalance capability.
Human Error - Recovery time for human errors depends primarily on detection time. If it takes seconds to detect a malicious DML or DDL transaction, it typically requires only seconds to flash back the appropriate transactions. Longer detection time usually leads to longer recovery time to repair the appropriate transactions. An exception is undropping a table, which is literally instantaneous regardless of detection time.
Data Corruption - For DB and RAC architectures, recovery time depends on the age of the backup used for recovery and the number of log changes scanned to make the corrupt data consistent with the database. Automatic block repair allows corrupt blocks on the primary database or physical standby database to be repaired automatically, as soon as they are detected, by transferring good blocks from the other destination. In addition, RECOVER BLOCK is enhanced to restore blocks from a physical standby database. The physical standby database must be in real-time query mode. This feature reduces the time during which production data cannot be accessed because of block corruption.
It also reduces block recovery time by using up-to-date good blocks from a real-time, synchronized physical standby database as opposed to disk or tape backups or flashback logs. With automatic block repair in place, this should be the most common block corruption repair. Some corruptions cannot be addressed by automatic block repair, and for those we can rely on Data Guard failover, which takes seconds to minutes.
Application Node Failure - Using multiple application nodes front-ended by a load balancer masks any application node failures.
  • Streams is not included in this table but could be used.
Storage Migration - Oracle ASM automatically rebalances stored data when disks are added or removed while the database remains online. During a storage migration, Oracle ASM must temporarily use both storage arrays.
Database one-off patch - On RAC and MAA this applies to qualified one-off patches only (RAC Rolling Upgradeable).
Database patch set and version upgrade.
Application Changes - With the zero downtime upgrade process and multiple application nodes, no downtime is required.
  • The preferred way to store the Oracle Clusterware files (OCR / Voting Disks) in Oracle Database 11.2 is ASM. However, one can still choose to install the OCR and Voting Disks on a shared file system (a cluster file system other than ACFS, or an NFS mount). In this case, Oracle Clusterware still maintains redundant OCRs / Voting Disks, using more checks to enforce the required redundancy on more than one file system (a single file system can act as a SPOF). New in Oracle Clusterware 11.2 is the ability to have 5 OCRs (3 is the default). The management of redundant Voting Disks has not changed. Note also that the OUI no longer supports RAW or block devices; this is deprecated in Oracle Clusterware 11.2. For upgrades, RAW and block device support remains unchanged, but for fresh installations it is not supported. The respective command line tools can be used to place OCRs and Voting Disks on those devices, or to move them from those devices into ASM.
  • The following describes in detail the use of Oracle Clusterware as general-purpose clusterware. Oracle Clusterware is the best clusterware for Oracle RAC. However, more and more customers have asked for support of non-Oracle applications. The reason: even in a RAC cluster, one will find a few applications that are meant to run close to the RAC database or the RAC instances. For those applications a failover solution is desirable in order to make full use of the high availability of the database provided by RAC. It seems an unnecessary effort, however, to use another third-party software solution underneath Oracle Clusterware in a RAC environment in order to maintain those applications. Therefore, customers asked for Oracle Clusterware to support those applications. This support was established in Oracle Clusterware 10.2 and was extended in Oracle Clusterware 11.1 to include non-RAC and non-Oracle application environments. The new support was fairly widely accepted, as one can see from the list of applications that can now be managed by Oracle Clusterware. Thus, the functionality for those scenarios has been extended in Oracle Clusterware 11.2.
  • New are the support for multiple networks and the use of ACFS as a shared location for files that need to be available cluster-wide.
  • While writing action scripts is simple, as shown, Oracle provides some pre-configured action scripts and agents. Action scripts and agents can be found at http://otn.oracle.com/clusterware. Note: the support of those action scripts and agents is described in Metalink Note 790189.1. Oracle provides fully supported agents only for the applications listed under "Pre-configured agents for Oracle Clusterware".
  • Note: The original text from the licensing guide has been modified for simplification. Licensing Guide text:
Grid Infrastructure. The Grid Infrastructure is the foundation of the Oracle Grid. It includes Oracle Clusterware, Oracle Automatic Storage Management (Oracle ASM), ASM Cluster File System (ACFS), and ASM Dynamic Volumes. Oracle Clusterware provides cluster membership and high availability services. It provides the cluster membership for features such as Oracle Real Application Clusters and Oracle ASM. Oracle ASM is a storage manager for Oracle database files and file systems. ACFS is a shared file system enabling simultaneous access to files from multiple nodes. ASM Dynamic Volumes provides volume manager services, enabling Oracle ASM to host ACFS and other file systems.
The Grid Infrastructure can be installed and used on any server where any of the following conditions are met:
    1. The server OS is supported by a valid Oracle Unbreakable Linux support contract.
    2. At least one machine involved in the cluster is licensed using the appropriate metric for either Oracle Database Enterprise Edition or Oracle Database Standard Edition.
Oracle Clusterware is used to protect a software product from failures. Oracle Clusterware can be used to protect a software product in cases that do not meet conditions 1 or 2 above if either of the following conditions is met:
    - The software being protected is from Oracle.
    - The software being protected uses an Oracle database.
A cluster is defined to include all the machines that share the same Oracle Cluster Registry (OCR) and Voting Disk.
Oracle Grid Infrastructure includes the following features:
    - Oracle Clusterware: application monitoring, restart, and failover; cluster membership services; server monitoring and fencing; Single Client Access Name (SCAN); Server Pools; Grid Naming Services
    - Oracle Automatic Storage Management: dynamic storage rebalancing; storage mirroring; online disk add/drop
    - ASM Cluster File System: shared file system with concurrent access to files from multiple nodes; local file system
    - ASM Dynamic Volumes: volume management services for ACFS; volume management services for 3rd-party file systems
  • The performance impact needs to be assessed on both the primary and the standby database.
For Oracle Database 11g, set DB_ULTRA_SAFE. This enables:
    - DB_BLOCK_CHECKSUM: detects corruptions in data and redo blocks using checksum validation and prevents propagation to standby databases.
    - DB_BLOCK_CHECKING: detects data block corruptions using semantic checks.
    - DB_LOST_WRITE_PROTECT: using a standby database, detects writes lost by the I/O subsystem.
Protection can be configured for data blocks or for data + index blocks. For Oracle Database 10g, set DB_BLOCK_CHECKING and DB_BLOCK_CHECKSUM.
Primary DB, Logical DB, or Streams DB: set DB_ULTRA_SAFE = DATA_AND_INDEX (sets db_block_checksum=full, db_block_checking=full, db_lost_write_protect=typical).
    * Block checking prevents memory and data corruptions. It adds overhead on every block change. For many applications the "% blocks changed per read" statistic is below 5%, so the overall impact of enabling block checking is small.
    * Redo and data block checksums detect corruptions on the primary and protect the standby. Minimal CPU resource is required.
    * Lost write protection enables a physical standby database to detect lost write corruptions on the primary or standby. The redo increase is minimal.
Physical Standby: set the optimum mode, DB_BLOCK_CHECKSUM=FULL and DB_LOST_WRITE_PROTECT=TYPICAL.
    * Redo and data block checksums protect the standby (and detect new corruptions that can occur on the standby). Minimal CPU resource is required.
    * Lost write protection enables a physical standby database to detect lost write corruptions on the primary or standby. The impact on the standby is negligible.
Note: Ideally, for the most comprehensive corruption protection on the standby, set DB_ULTRA_SAFE = DATA_AND_INDEX. Setting db_block_checking=full on the standby detects corruptions from the primary and prevents new memory and data corruptions on the standby. However, redo apply throughput can drop significantly (50% in some cases) because every redo apply or change operation incurs the additional checks.
  • RMAN & Oracle Secure Backup graphics Move Oracle Secure Backup graphic much more to the right, so to give more space between that graphic and RMAN graphic. Remove ‘Archive to Tape’ label Add ‘Tape’ label beside top tape drive icon (like you’ve done with ‘Disk’ label next to disk icon) Remove brown database icon Remove all blue arrows and their labels, then: Add one bidirectional arrow between RMAN tan background image and OSB tan background image Add bidirectional arrow between blue database icon and disk icons Flashback Technologies graphic Make background image color for Flashback Technologies light brown (contrasting tan backgrounds for RMAN and OSB) Move background image and tables much further down, to give more space between that graphic and RMAN graphic Remove all brackets Highlight ‘Flashback Database’ label as you have done with other Flashback labels (yellow highlighted background) Move ‘Flashback Technologies’ label above ‘Flashback Database’ and center above the other labels Make Flashback labels a bulleted list (still keeping grouping and yellow highlight) Overall Can you provide this entire image where all graphics are in flat layout (instead of angled layout), e.g. tan background image would be rectangle, not rhombus? I want to see if flat layout reads better.
  • Oracle Secure Backup is centralized tape backup management software protecting the Oracle database and file systems in distributed UNIX, Linux, Windows and Network Attached Storage (NAS) environments. From one central console, you can easily manage the distributed servers and tape devices within the backup domain. As Oracle is no longer just a database company, Oracle Secure Backup provides tape backup for application files as well as the database. Integrated with Recovery Manager (RMAN), Oracle Secure Backup provides the media management layer for RMAN backups to tape. While tightly integrated with the Oracle database, Oracle Secure Backup is a standalone product offering with a release schedule and versioning independent of the database. OSB 10.1 was released in April 2006 and OSB 10.2 was released in November 2007.
Third-party media management software can be EXPENSIVE, and can cost thousands per database server for RMAN integration. OSB will save the customer money.
A single-vendor solution with RMAN + OSB reduces complexity for the customer and eliminates the finger-pointing of a multi-vendor solution.
  • Recovery Manager (RMAN) ensures valid backup and restore:
    - Always verifies block checksums on backup and restore
    - Provides optional logical block validation (e.g. missing row piece)
    - Checks on demand for backup / restore corruptions without creating backups / restores (BACKUP VALIDATE / RESTORE VALIDATE)
    - Provides online recovery of individual block corruptions or all identified corruptions with Block Media Recovery (RECOVER BLOCK)
  • RMAN, which is Oracle's backup and recovery solution, fully automates disk-based backup and recovery in Oracle Database 10g, using essentially a tiered storage configuration. You can use your mission-critical, million-dollar storage array to hold the files used for database processing, in an area called the Database Area, while using a more cost-effective but still performant storage array, perhaps composed of ATA/SATA disks, for an area called the Recovery Area. Oracle manages this transparently for you.
With the Oracle-suggested backup strategy, you start by taking a full backup of your database and then set up an automated policy to take nightly incremental backups. These backups can be stored in the Recovery Area. Two things to note here. First, an incremental backup backs up only the changed data blocks, which saves considerable storage space and enables what the storage industry refers to as de-duplication, i.e. eliminating the need to store multiple copies of the same block. In addition, a mechanism called Block Change Tracking tracks the changed blocks very efficiently. Second, with an automated RMAN policy, these nightly incremental backups can be used to roll forward the Recovery Area backup and automatically create a full backup, eliminating the need to take a full backup manually.
Note that while we recommend disk here as a very viable recovery mechanism, we are not discounting the value of tape. Tape remains a very effective backup medium, and in this configuration an automated policy can also be set up to age old data out to tape for archival purposes. In short, this mechanism enables much faster backups (just propagate changes to the Recovery Area) and much faster restores (just copy backup files from the Recovery Area, or simply use the copy in the Recovery Area).
================
An example of block change tracking improving incremental performance by 20x, for a 500 GB database with 25 GB of changes since level 0 and one channel:
    - 9i: 10000 s (~2.8 hrs) for the incremental backup, scanning the full 500 GB of changed + unchanged blocks
    - 10g with BCT: 500 s (~8.3 min) for the incremental backup, scanning only the 25 GB of changed blocks
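The 20x figure in the example above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes a single channel scans blocks at a constant rate, and it derives that rate from the 9i numbers quoted in the note. The function and variable names are illustrative, not part of any Oracle tool.

```python
# Sketch of the arithmetic behind the block change tracking example.
# Assumption: one channel scans at a constant rate, derived from the 9i figures.

def scan_seconds(gb_scanned: float, gb_per_second: float) -> float:
    """Time needed to scan a given volume at a constant rate."""
    return gb_scanned / gb_per_second

DB_SIZE_GB = 500.0           # whole database, scanned when BCT is not used
CHANGED_GB = 25.0            # blocks changed since the level 0 backup
FULL_SCAN_SECONDS = 10_000.0 # 9i incremental backup time from the note

rate = DB_SIZE_GB / FULL_SCAN_SECONDS          # implied scan rate in GB/s
without_bct = scan_seconds(DB_SIZE_GB, rate)   # scans changed + unchanged blocks
with_bct = scan_seconds(CHANGED_GB, rate)      # scans only the changed blocks
speedup = without_bct / with_bct

print(f"without BCT: {without_bct:.0f} s, with BCT: {with_bct:.0f} s, "
      f"speedup: {speedup:.0f}x")
```

Under these assumptions the speedup is simply the ratio of database size to changed data, so a database with a smaller daily change rate would benefit even more.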
  • RMAN compression levels:
    - LOW - corresponds to LZO (11gR2): smallest compression ratio, fastest
    - MEDIUM - corresponds to ZLIB (11gR1): good compression ratio, slower than LOW
    - BASIC (free) - corresponds to BZIP2 (10g-style compression): similar compression ratio to MEDIUM, but slower
    - HIGH - corresponds to unmodified BZIP2 (11gR2): highest compression ratio, slowest
  • Resumable duplicate - see bug 7420868 for details.
  • The Fast Recovery Area (FRA) is one location to hold recovery-related files, e.g. RMAN backups, Flashback logs, archived logs, a multiplexed copy of the online redo logs, and a multiplexed copy of the controlfile. It is specified with a location (directory or ASM disk group) and a space quota (upper limit). The FRA automatically deletes unneeded files when there is space pressure; "unneeded files" are those that are either (1) backed up to tape via RMAN, or (2) obsolete according to the RMAN retention policy.
Calculating an appropriate initial FRA size depends on what the user wants to keep, e.g. one may want just Flashback logs and archived logs, and no RMAN backups. The following guidelines assist with determining the initial size.
When keeping only controlfile autobackups and archived logs, one can find an approximate FRA size by taking the total size of the archived logs generated between successive backups on the busiest days (controlfile autobackups are generally small relative to archived logs). This is because once the archived logs are backed up, they are considered "reclaimable" and will be deleted under space pressure, so you only need enough space to hold the archived logs between two successive backups. This estimate is multiplied by 2 to accommodate unexpected redo spikes.
When keeping both archived logs and Flashback logs, you can multiply the archived log size by 2 to get an initial estimate. This is because Flashback logs are generally created in proportion to the archived redo logs generated during the same retention period. So, multiply the archived log size by 4 (2x for archived logs + 2x for Flashback logs) to accommodate unexpected redo spikes, which also affect Flashback log sizes.
There is no formula to estimate incremental backup sizes; they depend on the amount of change between backups. One can test-run an incremental strategy to determine representative incremental sizes for a period of time, and then include those in the calculation of the production FRA size.
Finally, if an on-disk image copy backup is to be kept in the FRA, add the size of the database minus the size of the temp files (RMAN does not back up temp files). An on-disk image copy backup allows for faster restore than tape, or can simply be used as-is in place of the production data file.
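The sizing guidelines above can be collected into one small calculation. The sketch below is only an illustration of those rules of thumb: the function name, parameter names, and the sample figures are hypothetical, and the incremental backup size is assumed to come from a test run, since the notes give no formula for it.

```python
# Sketch of the initial FRA sizing rules of thumb described above.
# All names and sample numbers are hypothetical; sizes are in GB.

def estimate_fra_gb(
    archived_log_gb: float,
    keep_flashback_logs: bool = False,
    incremental_backup_gb: float = 0.0,
    keep_image_copy: bool = False,
    database_gb: float = 0.0,
    tempfile_gb: float = 0.0,
) -> float:
    """Return an initial FRA size estimate.

    archived_log_gb is the archived log volume generated between two
    successive backups on the busiest days.
    """
    # 2x for archived logs alone, 4x when Flashback logs are kept as well.
    multiplier = 4 if keep_flashback_logs else 2
    estimate = archived_log_gb * multiplier

    # Incremental backup sizes are measured from a test run, not derived.
    estimate += incremental_backup_gb

    # An image copy needs the database size minus temp files.
    if keep_image_copy:
        estimate += database_gb - tempfile_gb
    return estimate


# Archived logs and controlfile autobackups only.
print(estimate_fra_gb(40))
# Archived logs, Flashback logs, measured incrementals, and an image copy.
print(estimate_fra_gb(40, keep_flashback_logs=True, incremental_backup_gb=15,
                      keep_image_copy=True, database_gb=500, tempfile_gb=20))
```

The result is a starting point only; actual usage should be monitored and the quota adjusted once representative workload data is available.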
  • Multiple channels per tape drive (i.e. media management multiplexing) can substantially slow restore performance.
An RMAN channel represents a single backup file stream. A channel can read multiple datafiles or archived logs to back up into a multiplexed backup set, or can read one file at a time for an image copy backup. Increasing the number of RMAN channels increases backup parallelism, thereby potentially reducing overall backup time.
RMAN multiplexing is the number of datafiles or archived logs read by one channel at any time. The default is min(FILESPERSET, MAXOPENFILES). FILESPERSET defaults to 64 and is specified in the BACKUP command. MAXOPENFILES defaults to 8 and is specified in the CONFIGURE CHANNEL command. So multiplexing is 8 by default, which means a maximum of 8 files will be read by one channel at any time. For SAME/ASM storage, MAXOPENFILES should be set to 1: the files are already striped across all available disks, so reading one file per channel already reads optimally from each disk and there is no need for higher multiplexing values.
Do not use media management multiplexing for the Oracle database, because RMAN backup pieces will not be restored efficiently when pieces are interleaved on the same tape volume, i.e. the tape may need to move forward or rewind, depending on which pieces are needed for the restore.
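The multiplexing rule described above is a simple minimum of two settings. The following sketch models it with a hypothetical helper function; the default values are the ones quoted in the note, and nothing here calls RMAN itself.

```python
# Sketch of the RMAN multiplexing rule: min(FILESPERSET, MAXOPENFILES).
# The helper is hypothetical; defaults are the values quoted in the note.

def rman_multiplexing(filesperset: int = 64, maxopenfiles: int = 8) -> int:
    """Number of files one channel reads at any time."""
    return min(filesperset, maxopenfiles)

print(rman_multiplexing())                # defaults
print(rman_multiplexing(maxopenfiles=1))  # recommended for SAME/ASM storage
print(rman_multiplexing(filesperset=4))   # FILESPERSET becomes the limit
```

Whichever of the two settings is smaller is the effective limit, which is why lowering MAXOPENFILES to 1 is enough to make each channel read one file at a time.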
  • Flashback enables easy navigation through time:
    - See all rows at a given time
    - See all changes to a row
    - See all changes made by a transaction
Flashback enables easy correction of errors:
    - Row level
    - Table level
    - Database level
Flashback applies to all types of users:
    - End users
    - Developers
    - Administrators
Flashback is much faster and easier than traditional recovery.
=================
Flashback impact
The environment for the OLTP < 2% overhead figure (actually < 1% in my tests) was a single-instance 11.1.0.6 database running the Swingbench Order Entry workload on 32-bit Linux, RHEL (2.6.9-42.0.3.0.1.ELhugemem). The I/O subsystem was:
    - EMC Clariion CX700, serial #1249: 2 trays, 15 disks per tray, 73 GB disks; 4 GB memory per SP (512 MB read cache / 2631 MB write cache)
    - EMC Clariion CX500, serial #1477: 2 trays, 15 disks per tray, 73 GB disks; 2 GB memory per SP (365 MB read cache / 1107 MB write cache)
The CX700 held the data ASM disk group and the CX500 held the ASM Fast Recovery Area disk group, so the FRA had 28 spindles.
Other qualitative reasons for the low flashback logging impact:
    - Only one before-image is logged per 30-minute interval, regardless of the number of changes to the block.
    - No extra block reads are required when writing to flashback logs (note: except for direct insert loads, which is fixed in 11.1 for single instance).
    - Only data file block changes are tracked, not all database files (e.g. online redo, controlfile, etc.).
    - In general, no process needs to wait for flashback log I/O in a well-configured OLTP system.
  • Oracle9i introduced Flashback Query to provide a simple, powerful and completely non-disruptive mechanism for recovering from human errors. It allows users to view the state of data at a point in time in the past without requiring any structural changes to the database. Oracle Database 10g extended the Flashback Technology to provide fast and easy recovery at the database, table, row, and transaction level. Flashback Technology revolutionizes recovery by operating just on the changed data. The time it takes to recover from an error is now about the same as the time it took to make the mistake.
Flashback Query: Ensure that the database is using an undo tablespace. Setting the UNDO_MANAGEMENT initialization parameter to AUTO specifies this. Set the UNDO_RETENTION initialization parameter (in seconds, default 900) to a value that keeps undo long enough for your longest query back in time to succeed, or for recovery from human errors. To guarantee that unexpired undo will not be overwritten, set the RETENTION GUARANTEE clause for the undo tablespace.
Flashback Version Query pseudocolumns:
    - VERSIONS_STARTSCN, VERSIONS_STARTTIME - Starting System Change Number (SCN) or TIMESTAMP when the row version was created. This identifies the time when the data first took on the values reflected in the row version. You can use this to identify the past target time for a Flashback Table or Flashback Query operation. If this is NULL, then the row version was created before the lower time bound of the query BETWEEN clause.
    - VERSIONS_ENDSCN, VERSIONS_ENDTIME - SCN or TIMESTAMP when the row version expired. This identifies the row expiration time. If this is NULL, then either the row version was still current at the time of the query or the row corresponds to a DELETE operation.
    - VERSIONS_XID - Identifier of the transaction that created the row version.
    - VERSIONS_OPERATION - Operation performed by the transaction: I for insertion, D for deletion, or U for update. The version is that of the row that was inserted, deleted, or updated; that is, the row after an INSERT operation, the row before a DELETE operation, or the row affected by an UPDATE operation.
Note: For user updates of an index key, a Flashback Version Query may treat an UPDATE operation as two operations, DELETE plus INSERT, represented as two version rows with a D followed by an I.
The following statement queries the FLASHBACK_TRANSACTION_QUERY view for transaction information, including the transaction ID, the operation, the operation start and end SCNs, the user responsible for the operation, and the SQL code to undo the operation:
SELECT xid, operation, start_scn, commit_scn, logon_user, undo_sql FROM flashback_transaction_query WHERE xid = HEXTORAW('000200030000002D');
  • Flashback Database - restore the database to a point in time.
Imagine a bad batch job that has impacted your entire database, incorrectly modifying data across many tables and even certain PL/SQL procedures. The only corrective action would be to import a previous good data set, provided you have one, or to restore a good backup and roll the database forward to a point in time prior to the batch job. Both can entail long recovery times, into the hours or days. Oracle Database 10g features Flashback Database, a unique database point-in-time recovery capability, which allows the database to be "rewound" to a prior point in time within seconds or minutes. Because only the affected data blocks are restored and recovered, it is faster than traditional recovery methods. The command is simply "flashback database to my point-in-time".
We accomplish this fast recovery by utilizing flashback logs, which record old block versions. When writes are issued to disk and Flashback Database is enabled, the old block version is written to the flashback log while the new block version is written to the datafiles. When a flashback database command is issued, only the changed blocks are retrieved from the flashback logs and then recovered with the appropriate archived logs to the required point in time. You control how far back to retain flashback logging to support your recovery requirements. For example, you might enable Flashback Database for 24 hours, and then rely on backups for recovery past 24 hours.
Flashback Table - recover the contents of tables to a point in time (undo-based).
Flashback Drop - restore accidentally dropped tables (based on free space in the tablespace).
Flashback Transaction - back out a transaction and all subsequent conflicting transactions (redo-based).
  • As you can see in the diagram, when writes are issued to disk and Flashback Database is enabled, the old block version is written to the flashback log while the new block version is written to the datafiles. When a flashback database command is issued, only the changed blocks are retrieved from the flashback logs and then recovered with the appropriate archived logs to the required point in time. Flashback Database is also critical for fast reinstatement of the new standby after a failover, and for snapshot standby.
  • Some DDL statements that alter the structure of a table, such as dropping or modifying a column, moving a table, dropping a partition, or truncating a table or partition, invalidate any existing undo data for the table. Data cannot be retrieved from a point earlier than the time such DDL was executed; a query that tries fails with error ORA-01466. This restriction does not apply to DDL that alters only the storage attributes of a table, such as PCTFREE, INITRANS, and MAXTRANS. Use the DBMS_FLASHBACK package to set the Flashback Query SCN at session level for all SELECT statements (no ‘as of’ clause needed). This is useful in ETL/DW environments where queries that run during the day may also run during the nightly batch job: setting the pre-batch-job SCN for those sessions ensures application-consistent results.
  • A restore point is a user-defined name associated with a database point in time; it is used with Flashback Database, Flashback Table and RMAN. Restore points can be created in SQL*Plus or EM. A guaranteed restore point (GRP) is a special type of restore point that ensures flashback logs are kept until the restore point is used or deleted.
    • When flashback logging is turned on, flashback logs are generally created in proportion to the archived redo logs generated during the same retention period.
    • When flashback logging is turned off and a GRP is created, each changed block is logged only once to maintain the GRP, instead of the continuous block logging done with a configured flashback retention. For example, create a GRP before the nightly batch job for fast recovery from batch issues, then delete the GRP the next day to reclaim space.
    • Flashback retention should be set to >= 60 minutes (also in a Data Guard Fast-Start Failover environment). Oracle writes a metadata marker (used for Flashback Database operations) into the flashback logs every 30 minutes, so with retention under 60 minutes, space pressure could delete a needed marker and render Flashback Database unusable for some period of time. Setting retention >= 60 minutes guarantees that at least 2 markers are always available in the flashback logs.
    • Maintaining flashback logs imposes comparatively limited overhead on an Oracle database instance. Changed blocks are written from memory to the flashback logs at relatively infrequent, regular intervals, to limit processing and I/O overhead.
    • To achieve good performance for large production databases with Flashback Database enabled, Oracle recommends the following:
      • Use a fast file system for your Fast Recovery Area, preferably without operating system file caching. Files the database creates in the Fast Recovery Area, including flashback logs, are typically large. Operating system file caching is typically not effective for these files and may add CPU overhead for reading from and writing to them. It is therefore recommended to use a file system that avoids operating system file caching, such as ASM.
      • Configure enough disk spindles for the file system that will hold the Fast Recovery Area. For large production databases, multiple disk spindles may be needed to support the disk throughput required to write the flashback logs effectively.
      • If the storage system that holds the Fast Recovery Area does not have non-volatile RAM, try to configure the file system on top of striped storage volumes with a relatively small stripe size, such as 128K. Each write to the flashback logs is then spread across multiple spindles, improving performance.
      • For large production databases, set the init.ora parameter LOG_BUFFER to at least 8MB. This makes sure the database allocates maximum memory (typically 16MB) for writing flashback database logs.
    • The overhead of turning on logging for Flashback Database depends on the mixture of reads and writes in the database workload. The more write-intensive the workload, the higher the overhead. (Queries do not change data and thus do not contribute to logging activity for Flashback Database.)
    • Flashback-related FRA sizing: http://download-west.oracle.com/docs/cd/B19306_01/backup.102/b14192/rpfbdb003.htm#BABJJCHF
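The retention guidance in this note (a marker every 30 minutes, retention of at least 60 minutes) comes down to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes markers are written at exact multiples of the interval and that the retention window is a closed interval ending "now".

```python
def count_markers(now, retention, interval=30):
    """Count markers (written at 0, interval, 2*interval, ...) that fall
    inside the closed retention window [now - retention, now]."""
    return sum(1 for t in range(0, now + 1, interval) if t >= now - retention)


def guaranteed_markers(retention, interval=30):
    """Worst-case number of markers inside any window of this length."""
    return retention // interval


# Worst case over many possible "now" values, in minutes.
worst_60 = min(count_markers(now, 60) for now in range(60, 600))
worst_45 = min(count_markers(now, 45) for now in range(45, 600))
```

Under these assumptions a 60-minute retention always keeps at least two markers, while a 45-minute retention can drop to one, which matches the reasoning given for the 60-minute minimum.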
  • Data Guard: a standby database offers two options for repairing block corruption on the primary database:
    • Extract the rows from the block that is corrupted on the primary by using Data Pump or other means to select the data from a table. The data is then re-inserted into the primary database.
    • Copy the standby database datafile(s) to the primary database. Once the file is restored on the primary database, archive logs are applied to bring it consistent with the rest of the database.
    If the primary database corruption is widespread due to a bad controller or another hardware/software problem, you may want to switch over to the standby database while repairs to the primary database server are made.
  • Data Recovery Advisor can diagnose failures such as the following:
    • Components that are not accessible because they do not exist, do not have the correct access permissions, have been taken offline, and so on
    • Physical corruptions such as block checksum failures and invalid block header field values
    • Logical corruptions caused by software bugs
    • Incompatibility failures caused by an incorrect version of a component
    • I/O failures such as exceeding the limit on the number of open files, inaccessible channels, and network or I/O errors
    • Configuration errors such as an incorrect initialization parameter value that prevents the database from opening
    • A failure is a persistent data problem that has been diagnosed by the database. A failure can manifest itself as observable symptoms such as error messages and alerts, but it differs from a symptom because it represents a diagnosed problem.
    • Failures are normally detected reactively. A database operation involving corrupted data results in an error, which automatically invokes a data integrity check that searches the database for failures related to the error. Diagnosed failures are recorded in the Automatic Diagnostic Repository (ADR), a directory structure stored outside of the database.
    • You can also invoke a data integrity check proactively through the Health Monitor, which detects and stores failures in the same way as reactive checks. You can also check for block corruption with the RMAN VALIDATE and BACKUP VALIDATE commands.
    • Intelligently determines recovery strategies:
      • Aggregates failures for efficient recovery: proposes a single script with the appropriate commands in order, covering all failures if possible, rather than multiple scripts
      • Presents only feasible recovery options, e.g. proposes an RMAN restore script only if the needed backups are available. Can also check for available flashback logs and a standby database (in 11g Grid Control) as recovery options.
      • Indicates any data loss for each option, e.g. if point-in-time recovery is the only option (due to missing backups or a logical error)
  • Three simple DRA commands in RMAN:
    • LIST FAILURE: lists the results of previously executed failure assessments. Revalidates existing failures and closes them, if possible.
    • ADVISE FAILURE: presents manual and automatic repair options.
    • REPAIR FAILURE: automatically fixes failures by running the optimal repair option suggested by ADVISE FAILURE. Revalidates existing failures when completed.
  • Let’s consider a fairly common external error, 376. This error can represent several problems with different causes. The example shows that knowing a symptom or a root cause of a problem might not be enough to determine the impact on applications or the right repair. This example has only one symptom; usually a DBA faces multiple symptoms when trying to assess a problem and find a solution. Directly mapping a set of symptoms to a set of repair actions is a very complex problem. That is why we need the concept of a failure, which unambiguously describes a database problem in user-friendly terms and can be directly mapped to a set of repair actions. The concept of a failure is very similar to the concept of a disease in medicine. Each disease is associated with symptoms, diagnostic tests, causes/risk factors and treatment. Symptoms and the like are specific objects or processes that you can touch and see. A disease, however, is an abstract concept that links all this specific information together and simplifies communication between doctors and patients. A DRA failure is basically a database disease.
  • Having defined the concept of a failure, let’s consider the particular failures that DRA supports. There are certain physical database components: control file (CF), log file (LF), datafile (DF), SP file, and wallet. Each of them can be inaccessible, missing, corrupt, old, or mutually inconsistent with other components. When a component is inaccessible, the repair depends on the cause of the inaccessibility, so each cause is a separate failure. A combination of component name and failure type, e.g. “control file is corrupt” or “datafile cannot be accessed because of NFS mount problem”, covers pretty much the whole spectrum of physical data failures. There are also failures associated with logical components. Currently, DRA supports only corruptions of the data dictionary, undo segments, and logical block corruptions.
  • Failures are diagnosed by checks. A check is a diagnostic procedure that is executed to assess the health of database components. Checks are executed by Health Monitor (HM). A check execution is triggered by an error. Error -> Check mappings are also registered with DRA framework. A distinctive feature of DRA is that it does not try to match errors directly to failures. Instead an error triggers a comprehensive diagnostic check that examines corresponding components. A check can also be explicitly invoked either by the user via HM or RMAN validate command, or by HM as part of a scheduled database health evaluation.
  • Let’s consider some examples of repairs that DRA recommends. Repairs can be automatic or manual. For automatic repairs, DRA generates a script that can be reviewed by a DBA and executed via the REPAIR command. Automatic repairs include different types of media recovery, flashback database and Data Guard failover. Manual repairs are generated for failures that cannot be fixed by Oracle tools; they have to be executed by the user. For example, an I/O failure might require replacing a faulty disk or cable. In this case, DRA recommends a manual repair instead of generating a repair script. For this class of failures it is sometimes impossible to diagnose a specific failure because the OS or the disk subsystem returns too little information. DRA might then advise the user how to obtain additional information, e.g. “check network connectivity”. Each DRA failure is associated with a set of repairs. Failures, repairs and the mappings between them are registered with the DRA framework via compile-time services.
  • Before presenting a repair to the user, DRA validates it with respect to the current environment, as well as availability of components required to complete the proposed repair. Only feasible options will be presented. The slide shows examples of feasibility checks. Feasibility checks are quick and do not examine every block of every component, so there is still a possibility that a repair proposed by DRA will fail. However, performing feasibility checking minimizes the probability of a failure.
  • Data Guard is foremost about Data Protection, but it is also about High Availability – making sure that data is available for applications regardless of events that can impact the primary database, the data center it is in, the building, campus, or geographic region where it is located. Data Guard provides an extensive set of capabilities for HA/DR that have evolved and matured over many Oracle Database releases. Oracle Database 11g changed the way we think about standby databases with the introduction of Active Data Guard.
  • Challenges of array-based remote mirroring:
    • Zero knowledge of the application or data
    • Must mirror everything (all writes to all files)
    • SYNC mode makes every DB write a synchronous write
    • Mirrors problems just as effectively as it mirrors data
    • The application (target Oracle DB) is off while mirroring is on
    • Target volumes cannot be used while mirroring is active
    • Difficult to test: not sure if it will work when needed
    • Vendor lock-in to costly storage subsystems
    • Distance constraints
    Result: limited protection, expensive, low return on investment
  • This slide makes it clear why the recovery information (redo) is all that Data Guard needs to replicate to the standby site: Oracle executes the apply process on the standby database. The standby is in a continuous state of recovery, Oracle is mounted, and Oracle validation prevents corruptions from being applied to the standby. We don’t need to mirror all writes because Oracle is smart enough to keep the databases synchronized using only redo data.
  • This is the traditional way of thinking about a standby database: as a failover target should the primary database fail. All processing happens at the primary. If you want the physical standby database to always be up to date and ready for failover, you cannot open the standby for user access.
  • Active Data Guard changes things dramatically. It enables the standby database to be open read-only while it continuously applies updates received from its primary database. Reporting, ad-hoc queries, and any read-only transaction can be offloaded from the primary database to the standby. You can also offload fast incremental backups from the primary database to an active standby. New in Oracle Database 11.2, the active standby is used to automatically repair block corruptions detected at the primary database, transparently to the user and application (and vice versa).
    • Active Data Guard is a new Oracle Database 11g capability that builds upon Data Guard:
      • Physical standby open read-only while changes are applied from the primary database
      • RMAN fast incremental backups on the physical standby using Block Change Tracking: 20x faster than traditional incrementals
      • Auto-repair of corrupted data blocks (11.2)
      • Query SLA on the standby database (11.2)
    • Benefits:
      • Improves primary database performance by offloading processing to the standby
      • Makes productive use of existing physical standby databases
    • Licensing:
      • Packaged as a separate database Option for Oracle Enterprise Edition
      • Requires licensing of the production database and all the physical standbys that are used for any of the above capabilities
  • Easily scale read performance:
    • Support for up to 30 active standby databases
    • Guaranteed data consistency between primary and standby databases
    • No restrictions on data types or indexes used
    • Replication keeps pace with peak workload
    • High availability: read access 24x7x365
    • Reader farm transparently transitions to the new primary at failover time
  • Shows the reductions in failover time as new technologies were introduced.
    • Pre-Data Guard:
      • The clock is ticking while the problem is identified, notifications are sent, and DBAs respond
      • Resolution involved custom scripts and complex procedures
    • Data Guard Fast-Start Failover:
      • The Observer identifies the problem (primary not available), responds appropriately (failover or not, depending upon conditions), and resolves the outage in under two minutes (if failover will not result in data loss)
      • Major improvement in availability
    • Database failover is fast; at Amazon.com:
      • Database failover completes in 10 seconds
      • Total time to redirect clients to the new primary database is less than 2 minutes
  • Uses the DBMS_REDEFINITION package
  • The installation of the upgrade into the production database must not perturb live users of the pre-upgrade application: many objects must be changed in concert, and the changes must be made in privacy. Data must persist across a patch or upgrade: transactions done by the users of the pre-upgrade application must be reflected in the post-upgrade application. For hot rollover, the reverse is also needed: transactions done by the users of the post-upgrade application must be reflected in the pre-upgrade application.
  • Transcript

    • 1.  
    • 2. Oracle Database High Availability Essentials. Piotr Kołodziej, Master Principal Sales Consultant, Oracle Polska, Piotr.Kolodziej@oracle.com
    • 3. Determine HA Requirements Analysis Framework
      • Business Impact Analysis
      • Cost of Downtime
      • Recovery Time Objective (RTO)
      • Recovery Point Objective (RPO)
      • Backup Retention
      • Manageability Goal
      • TCO and/or ROI
      • Others...
      • http://download.oracle.com/docs/cd/E11882_01/server.112/e10804/hadesign.htm#i1005920
    • 4. Oracle’s HA Design Principles
      • Complete
        • Minimize all planned and unplanned downtime
        • Offer a standard validated platform for maximum availability
      • Application oriented
        • Protect and recover application objects
        • Enable online application changes
      • Scale-out model
        • Low-cost commodity hardware
        • All components active in a grid infrastructure
      • Integrated and simple
        • Built-in HA with pluggable components
        • Automatic - eliminate manual processes
    • 5. Oracle’s Database HA Solution Set: Database Integration Unique in the Industry!
      • Unplanned Downtime
        • Server Availability: Real Application Clusters
        • Data Availability: Flashback, RMAN & Oracle Secure Backup, ASM, Data Guard, GoldenGate
      • Planned Downtime
        • System Changes: Online Reconfiguration, Rolling Upgrades
        • Data Changes: Online Redefinition
        • App Changes: Edition-based Redefinition
      • Oracle MAA Best Practices
    • 6. Oracle Maximum Availability Architecture Integrated, Fully Active, Lots of Benefits
      • RAC
      • Scalability
      • Server HA
      • Active Data Guard
      • Data Protection, DR
      • Query Offload
      • GoldenGate
      • Active-active
      • Heterogeneous
      • Oracle Secure Backup
      • Backup to tape / cloud
      • Flashback
      • Human error correction
      Production Site Active Replica
      • Online Redefinition, Data Guard, GoldenGate
      • Minimal downtime maintenance, upgrades, migrations
      • ASM
      • Volume Management
      • RMAN & Fast Recovery Area
      • On-disk backups
    • 7. Unplanned Outage Recovery Times
      • Oracle Database High Availability Overview, “Attainable Recovery Times for Unplanned Outages” table http://download.oracle.com/docs/cd/E11882_01/server.112/e17157/architectures.htm#CJABHJAF
      Recovery time by HA architecture (columns 2 to 5):
      Unplanned Outage Type | Oracle Database | Oracle RAC | Oracle Data Guard | MAA (RAC & Data Guard)
      Site Failure | Hours to days | Hours to days | Seconds to a minute | Seconds to a minute
      Computer Failure | Minutes to hours | No downtime | Seconds to a minute | No downtime
      Storage Failure | No downtime | No downtime | No downtime | No downtime
      Human Error | < 30 minutes | < 30 minutes | < 30 minutes | < 30 minutes
      Data Corruption | Potentially hours | Potentially hours | Seconds to a minute (Automatic Block Repair, 11.2) | Seconds to a minute (Automatic Block Repair, 11.2)
      Application Node Failure | No downtime | No downtime | No downtime | No downtime
    • 8. Planned Outage Downtime
      • Oracle Database High Availability Overview, “Attainable Recovery Times for Planned Outages” table http://download.oracle.com/docs/cd/E11882_01/server.112/e17157/architectures.htm#CJABHJAF
      Downtime by HA architecture (columns 2 to 5):
      Planned Outage Type | Oracle Database | Oracle RAC | Oracle Data Guard | MAA (RAC & Data Guard)
      System change - Dynamic Resource Provisioning | No downtime | No downtime | No downtime | No downtime
      System-level upgrade | Minutes to hours | No downtime | Seconds to 5 minutes | No downtime
      Clusterwide or sitewide upgrade | Minutes to hours | Minutes to hours | Seconds to 5 minutes | No downtime
      Storage Migration | No downtime | No downtime | No downtime | No downtime
      Database one-off patch | Minutes to an hour | No downtime | Seconds to 5 minutes | No downtime
      Database patch set and version upgrade | Minutes to hours | Minutes to hours | Seconds to 5 minutes (11.2) | Seconds to 5 minutes (11.2)
      Platform migration | Minutes to hours | Minutes to hours | Minutes to hours | No downtime
      Data change - Online Reorganization & Redefinition | No downtime | No downtime | No downtime | No downtime
      Application Changes | No downtime | No downtime | No downtime | No downtime
    • 9. Agenda
      • Protecting Server Availability
      • Protecting Data Availability
      • How to Live with Changing Environment
    • 10.
      • Scale workloads across multiple low cost servers
      • Consolidate into fewer servers and databases
      • Runs all Oracle database applications
      • Built-in HA to support mission critical workloads
      HR SALES ERP Real Application Clusters Virtualize Low-cost Servers
    • 11.
      • A virtualized single instance database
        • Omotion - live migration of instances across servers
          • Move services, then shutdown transactional
        • Built-in cluster failover for high availability
      • Better than OS level virtualization
        • Rolling database patches
        • Manage fewer Operating Systems
          • 10 DBs on a node does not mean 10 Operating Systems to manage
        • Rolling OS upgrades
      RAC One Node Virtualization Benefits for Oracle Databases Ref. http://www.oracle.com/technetwork/database/clustering/overview/ds-rac-one-node-11gr2-185089.pdf New in 11.2
    • 12.
      • Oracle Grid Infrastructure (OGI) is a combined home of
        • Oracle Clusterware
        • Oracle Automatic Storage Management (ASM)
      • OGI provides infrastructure software (storage management, cluster software), typically managed by System Administrators
      • There is only one active version of the OGI on a system
      • OGI comes in two versions:
        • Grid Infrastructure for a Cluster
          • Includes Oracle Clusterware, ASM
        • Grid Infrastructure for a Standalone Server
          • Includes Oracle Restart, ASM
      Oracle Grid Infrastructure
    • 13. What is Oracle Clusterware?
      • Oracle Clusterware is
        • a major part of Oracle’s Private Cloud
        • integrated with Oracle Automatic Storage Management (ASM)
        • the foundation for the Oracle ASM Cluster File System (ACFS)
        • the foundation for Oracle Real Application Clusters (RAC)
        • an infrastructure for the management of all kinds of applications
      Node 1 Node 2 Node ... Node n Consolidated Pool of Storage with Automatic Storage Management (ASM) Oracle Clusterware Oracle ASM / ACFS Oracle RAC Protected App A Protected App B
    • 14. OCR / Voting Disk In Oracle ASM No RAW device support (in OUI) anymore – upgrades support RAW devices.
    • 15. 3rd Party FS Application Automatic Storage Management (ASM) ASM Cluster & Single Node File System (ACFS) Database ACFS Snapshot ASM Disk Group DB Datafiles, OCR and Voting Files Oracle Binaries 3rd Party File Systems Dynamic Volume Manager
      • ASM supports ALL data - database files, file systems, Clusterware files (OCR, Voting Disk)
      • Built-in mirroring protects from disk failures
      • Enables auto-repair from corrupt blocks using a valid mirror copy
      ASM Instance Managing Oracle DB Files New in 11.2 Automatic Storage Management (ASM) Stores & Manages All Data
    • 16. Oracle Clusterware – A complete Solution
      • Oracle Clusterware protects Applications A, B on nodes 3 & 4
      • It also provides resources for the RAC database running on nodes 1 & 2
      • It provides the basis for the Oracle ACFS (optional) – all data in ASM
      • You do not need any third party cluster software anymore
      Node 1 Node 2 Node 3 Node 4 Consolidated Pool of Storage with Automatic Storage Management (ASM) Oracle Clusterware Oracle ASM / ACFS Oracle RAC Protected App A Protected App B
    • 17.
      • Most customers use Oracle Clusterware for RAC
      • More and more customers want to protect other applications (in a RAC cluster or totally different clusters)
      • Therefore, Oracle Clusterware provides HA for applications:
        • Restart on application failure
        • Relocate on node failure
      • Examples:
        • Oracle Clusterware may protect SAP, Hyperion, Oracle TimesTen, Oracle VM and other Components
      Why does the HA Framework exist?
    • 18.
      • Network Location
        • Clients need a node independent way of connecting to an Application
      • Dependencies
        • Components may need to be started in a certain order
        • Components may need to be started in relation to each other
      • Configuration files
        • Applications typically need configuration files stored on disk
      What do Applications Need?
    • 19. What does Oracle Clusterware provide?
      • VIP resource
        • Provides Application VIPs in multiple networks
      • HA-API and HA-Framework
        • Allows Oracle Clusterware to protect all kinds of applications
        • Provides extended dependency settings to model real-world relations between components
        • The interface allows changing, at run time, how Oracle Clusterware manages any application
        • Agents run frequent checks, ensuring fast recovery
      • ACFS
        • Oracle ASM-based Cluster File System
    • 20. Where to Find Action Scripts and Agents http://www.oracle.com/technetwork/database/clusterware Metalink Note 790189.1 – Oracle Clusterware and Application Failover Management
    • 21.
      • The Grid Infrastructure can be installed and used on any server where any of the following conditions are met:
        • The server OS is supported by a valid Oracle Unbreakable Linux support contract.
        • At least one machine is licensed using the appropriate metric for either Oracle Database Enterprise Edition or Oracle Database Standard Edition.
        • Oracle Clusterware can be used free of charge to protect a software product in cases that do not meet conditions 1 or 2 above, if either of the following conditions are met:
          • The software being protected is from Oracle
          • The software being protected uses an Oracle database
      Oracle Clusterware licensing See: Oracle Database Licensing information ( Part Number E10594-01 )
    • 22. Agenda
      • Protecting Server Availability
      • Protecting Data Availability
      • How to Live with Changing Environment
    • 23. Maximum Data Availability Flashback RMAN Oracle Secure Backup ASM Data Guard GoldenGate Server Availability Data Availability Unplanned Downtime Protection from Data Corruptions Protection from Storage / Site Failures Enabling Active-Active Data Centers Protection from Human Errors
    • 24.
      • Any component in the systems stack can fail and cause data corruptions*
        • Software – applications, middleware, database, …
        • Hardware – disk drives, disk controllers, HBAs, memory, …
        • Network – routers, switches, cables, …
        • Operational – human errors, bad installs & upgrades, …
      • Data corruptions can be disastrous
      • Very hard to debug and diagnose
      * “Hard Disk Drives – the Good, the Bad & the Ugly”, ACM Queue, Sep/Oct 2007, http://queue.acm.org/detail.cfm?id=1317403 Data Corruptions Severe Impact on Data Availability
    • 25.
      • Oracle Database has checks to detect and repair corruptions
        • Detects corruptions in data and redo blocks using checksum validation
        • Detects data block corruptions using semantic checks
        • Detects writes acknowledged, but actually lost by the I/O subsystem
      • Various levels of checks can be configured by the administrator
        • Choose the desired protection level
        • Can be configured for data blocks / data + index blocks
      • Specific technologies provide additional validation
        • RMAN – validating while doing backup & recovery
        • ASM – validating using mirrored copies
        • Data Guard – validating while synchronizing standby database
      Built-in Protection from Data Corruptions Comprehensive Data Validation
    • 26. Oracle Backup & Recovery Technologies
      • Efficient and reliable database backup and recovery
      • Fine-grained recovery at row, transaction, table, tablespace, database level
      • Fastest database backup and restore to/from tape
      • Recovery Manager and Flashback Technologies are included with database
      • Oracle Secure Backup offers low cost, fixed pricing per tape drive
      Best Data Repair at Lowest Cost
    • 27.
      • Backup & Recovery
    • 28. Backup & Recovery Foundation Complete Oracle Solution from Disk to Tape
      • Oracle backup and recovery for your entire IT environment
      • Multiple media options available to meet the most stringent SLAs
        • Local disk, remote Cloud storage, physical and virtual tape
      File System Data UNIX Linux Windows NAS Oracle Databases Oracle Recovery Manager (RMAN) Fast Recovery Area Tape Backup Oracle Secure Backup (OSB) Oracle Secure Backup (OSB) Cloud Module Amazon S3 Cloud Storage
    • 29. Oracle Recovery Manager (RMAN) Oracle-integrated Backup & Recovery Engine Oracle Enterprise Manager RMAN Database Fast Recovery Area Tape Drive Oracle Secure Backup
      • Intrinsic knowledge of database file formats and recovery procedures
        • Block validation
        • Online block-level recovery
        • Tablespace/data file recovery
        • Online, multi-streamed backup
        • Unused block compression
        • Native encryption
      • Integrated disk, tape & cloud backup leveraging the Fast Recovery Area and Oracle Secure Backup
      Cloud
    • 30. Oracle Fast Recovery Area Automatic Disk-to-Disk (D2D) Backup & Recovery
      • Fast Recovery Area – Integrated D2D backup and recovery
        • Favorable disk economics – low-cost disks used for recovery area
        • Oracle makes it even better with ‘restore-free recovery’:
          • switch datafile 4 to copy;
          • recover datafile 4;
      • Fast incremental backups
        • Backs up only changed blocks
        • Changed blocks are tracked using a very efficient algorithm (block change tracking), making incremental backups e.g. 20X faster
      • Nightly incremental backup rolls forward recovery area backup
        • No need to do full backups
          • recover copy of database with tag ‘ORCL’;
      Fast Recovery Area Nightly Apply Validated Incremental Weekly Archive To Tape Database Area Integrated backup-storage tiering
    • 31.
      • Backup compression: popular way to save on storage costs
      • Multiple RMAN backup compression levels
        • Choose compression levels & backup throughput
          • [BASIC] | HIGH | MEDIUM | LOW
          • HIGH – reduces backup size by 40%+ depending on data type
          • LOW – least impact on backup throughput
          • MEDIUM – best balance between compression and throughput
          • HIGH | MEDIUM | LOW require Advanced Compression Option
      RMAN New Features Oracle Database 11g Release 2
    • 32. Additional RMAN New Features Feature Benefit Backup Fast Recovery Area to disk location
      • Protect Fast Recovery Area with on-disk backup of its RMAN backups, archived logs, and controlfiles.
      Extended tablespace point-in-time recovery (TSPITR) capabilities
      • Recover a dropped tablespace.
      • Perform multiple tablespace point-in-time recoveries, without requiring recovery catalog
      Resumable DUPLICATE
      • DUPLICATE can resume processing from most points of failure, reducing overall time.
      CONVERT DATABASE can skip unneeded datafiles
      • Reduces overall conversion time by only processing the required UNDO-containing data files.
      SET NEWNAME FOR TABLESPACE | DATABASE
      • Simplifies renaming of datafiles for RESTORE , DUPLICATE , and TSPITR operations.
    • 33. RMAN Best Practices
      • Fast Recovery Area (FRA) guidelines
        • Place the FRA on separate storage and store backups there, along with a copy of the control file, redo logs, and archived logs, so that all recovery-related files are protected from production outages.
        • When estimating FRA size, if you want to keep:
          • Control file backups and archived logs
            • Estimate archived logs generated between successive backups on the busiest days and multiply total size by 2 to account for activity spikes.
          • Archived logs and Flashback logs
            • Multiply the archived log size between backups by 4, assuming Flashback retention = time between archived log backups.
          • Incremental backups
            • Add in their estimated sizes
          • On-disk image copy backup
            • Add in size of the database minus the size of temp files
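The sizing rules of thumb above can be combined into a small calculator. This is a minimal sketch, assuming all sizes are given in GB; the function name `estimate_fra_size_gb` and its parameters are hypothetical and are not part of any Oracle tool.

```python
def estimate_fra_size_gb(archived_log_gb, keep_flashback_logs=False,
                         incremental_backup_gb=0,
                         database_gb=0, temp_files_gb=0,
                         keep_image_copy=False):
    """Rough FRA size estimate (GB) following the slide's rules of thumb.

    archived_log_gb: archived logs generated between successive backups
    on the busiest days.
    """
    # x2 accounts for activity spikes; x4 also covers Flashback logs,
    # assuming Flashback retention equals the time between log backups.
    multiplier = 4 if keep_flashback_logs else 2
    total = archived_log_gb * multiplier

    # Incremental backups: add their estimated sizes.
    total += incremental_backup_gb

    # On-disk image copy: database size minus temp files.
    if keep_image_copy:
        total += database_gb - temp_files_gb
    return total


# Hypothetical example values, for illustration only.
estimate = estimate_fra_size_gb(50, keep_flashback_logs=True,
                                incremental_backup_gb=30,
                                database_gb=500, temp_files_gb=20,
                                keep_image_copy=True)
```

The result is only a starting point; actual space use should be checked against a real workload.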
    • 34. RMAN Performance Factors Balancing Backup and Restore Requirements
      • Incremental backup strategy improves backup performance, with trade-off in recovery performance
      • Enable block change tracking for fast incremental backups
      • Cumulative vs. differential incremental backups
      • ‘Incremental forever’ requires an initial full then incrementals thereafter
        • Fast recovery : Current image copy of database readily available
      • Backup ‘x’ files in parallel per channel, improving backup performance
      • RMAN multiplexing level = min(FILESPERSET , MAXOPENFILES)
      • Exception: Set MAXOPENFILES = 1 for SAME or ASM datafiles
      • Set # of RMAN channels = # of tape drives, so that media management multiplexing is not used for RMAN backups
        • Setting # of RMAN channels > # of tape drives will impact restore, due to interleaved backup pieces on single tape
      • Assess host resources, production disk I/O, HBA/network, tape drive throughput
      • The slowest of these components will be the performance bottleneck
      Table columns: Consideration, Performance Effect. Considerations: Incremental Backup Strategy; Multiplexing; Hardware/Network/Storage
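Two of the rules on this slide are simple enough to express directly. The sketch below assumes throughput figures in MB/s that you have measured yourself; the function names and the example numbers are hypothetical.

```python
def multiplexing_level(filesperset, maxopenfiles):
    """RMAN multiplexing level as stated on the slide: min(FILESPERSET, MAXOPENFILES)."""
    return min(filesperset, maxopenfiles)


def bottleneck(throughputs_mb_s):
    """Return (component, throughput) for the slowest component in the backup path."""
    name = min(throughputs_mb_s, key=throughputs_mb_s.get)
    return name, throughputs_mb_s[name]


# Hypothetical settings and measurements, for illustration only.
level = multiplexing_level(filesperset=64, maxopenfiles=8)
slowest = bottleneck({
    "production disk I/O": 400,
    "HBA/network": 200,
    "tape drive": 120,
})
```

With these example numbers the tape drive limits the overall backup rate, so tuning the other components would not help until the tape throughput improves.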
    • 35. Logical Data Protection Fast ‘Rewind’ of Logical Errors Recovery Manager (RMAN) Physical Data Protection Data Recovery Advisor Logical Data Protection Recovery Analysis Flashback Technologies File System Data UNIX Linux Windows NAS Oracle Databases
    • 36. Flashback Technologies Error Detection & Correction
      • Flashback revolutionizes error recovery
        • View ‘good’ data as of a past point-in-time
        • Simply rewind data changes
        • Time to correct error equals time to make error
      • Low impact
      • Excellent tool for configuring QA, Dev and Training databases
      • Flashback is easy – simple commands, no complex procedure
      Recovery Time Traditional Recovery Flashback
    • 37. Error Investigation with Flashback
      • Flashback Query
        • Query all data at point in time
      Tx 1, Tx 2, Tx 3 (transaction timeline)
      select * from Salary AS OF ‘12:00 P.M.’ where …
      select * from Salary VERSIONS BETWEEN ‘12:00 PM’ and ‘2:00 PM’ where …
      select * from FLASHBACK_TRANSACTION_QUERY where xid = HEXTORAW(‘000200030000002D’);
      • Flashback Transaction Query
        • See all changes made by a transaction
      • Flashback Version Query
        • See all versions of a row between times
        • See transactions that changed the row
      • All above are based on available UNDO
    • 38.
      • Flashback Database – restore database to any point in time
      • Flashback Table – restore contents of tables to any point in time (undo-based)
      • Flashback Drop – restore accidentally dropped tables (based on free space in tablespace)
      • Flashback Transaction – back out transaction and all subsequent conflicting transactions (redo-based)
      Error Correction with Flashback Order Database Customer
    • 39.
      • Fast point-in-time recovery strategy
      • Eliminate the need to restore a whole database backup
      • Continuous data protection for database
        • Optimized, before-change block logging
        • Restores just changed blocks
        • Replay log to restore DB to desired time
      • It’s fast - recover in minutes, not hours
      • It’s easy - single command restore
        • Flashback Database to ‘2:05 PM’
      Flashback Database: Continuous Data Protection (CDP), “Rewind” button for the Database
      Diagram labels: Data Files, Flashback Log, New Block Version, Disk Write, Old Block Version
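The before-change block logging idea can be illustrated with a toy model. This is a simplified sketch, not how Oracle implements Flashback Database: it assumes every write logs the old block version, that writes arrive in time order and only touch existing blocks, and it leaves out the redo replay step mentioned on the slide. The class name `ToyDatabase` is hypothetical.

```python
class ToyDatabase:
    """Toy block store that logs the old block version before each change."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.flashback_log = []  # entries: (time, block_id, old_value)

    def write(self, time, block_id, new_value):
        # Record the before-image, then apply the change.
        self.flashback_log.append((time, block_id, self.blocks[block_id]))
        self.blocks[block_id] = new_value

    def flashback_to(self, target_time):
        # Undo changes made after target_time, newest first.
        # Only blocks that changed are touched.
        while self.flashback_log and self.flashback_log[-1][0] > target_time:
            _, block_id, old_value = self.flashback_log.pop()
            self.blocks[block_id] = old_value


db = ToyDatabase({"orders": "v1", "customers": "v1"})
db.write(1, "orders", "v2")
db.write(2, "customers", "v2")
db.write(3, "orders", "bad update")  # the logical error
db.flashback_to(2)                   # rewind to just before the error
```

The rewind restores only the block changed after time 2, which is why this style of recovery scales with the amount of change rather than with database size.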
    • 40. Flashback Technologies New Features Oracle Database 11g Release 2
      • Increased Availability
        • Enable Flashback Database while database is open
          • Test Flashback without having to take downtime
      • Better Manageability
        • Monitor Flashback Database progress with v$session_longops
          • Progress percentage can be calculated as (SOFAR / TOTALWORK) × 100
      • Minimize System Impact
        • Optimized Flashback logging for batch/insert intensive loads
          • Potentially reduce Flashback logging impact to ~2%
      • Extended Dependency Tracking
        • Flashback Transaction supports foreign key dependency tracking
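The progress calculation mentioned above is a single ratio. The sketch below assumes the SOFAR and TOTALWORK values have already been read from v$session_longops; fetching them from the database is not shown, and the function name `progress_percent` is hypothetical.

```python
def progress_percent(sofar, totalwork):
    """Return completion percentage from SOFAR and TOTALWORK values."""
    if totalwork <= 0:
        # Avoid division by zero when no total has been reported yet.
        return 0.0
    return 100.0 * sofar / totalwork


# Hypothetical values, for illustration only.
pct = progress_percent(sofar=250, totalwork=1000)
```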
    • 41. Best Practices – Undo-based Flashback Flashback Query, Flashback Table
      • Use Undo Advisor (available through Enterprise Manager) to get recommendations on available undo retention for various sizes.
      • Use fixed size undo
        • Undo retention automatically tuned for best possible retention based on tablespace size and current system load.
      • Be aware of DDL restrictions – not possible to query in the past if table structure is modified (e.g. drop/modify column, move table, etc.)
      • Further details: http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14251/adfns_flashback.htm#sthref1496
    • 42.
      • Tune FRA storage
        • Use ASM, configure enough disk spindles, etc.
      • Use physical standby database to test Flashback logging
      • Use V$FLASHBACK_DATABASE_LOG to size log space, after running the workload for longer than the Flashback retention period.
      • Create Guaranteed Restore Point (GRP) without enabling Flashback logging
        • Saves disk space for workloads where same blocks are repeatedly updated
        • Drop GRP to immediately reclaim space
      • Further details: Metalink Note 565535.1 Flashback Database Best Practices & Performance
      Best Practices – Flashback Database
    • 43.
      • Recovery Advisor
    • 44. Data Recovery Advisor The Motivation
      • Oracle provides robust tools for data repair:
        • RMAN – physical media loss or corruptions
        • Flashback – logical errors
        • Data Guard – physical or logical problems
      • However, problem diagnosis and choosing the right solution can be error prone and time consuming
        • Errors more likely during emergencies
      Recovery Investigation & Planning
    • 45. Data Recovery Advisor
      • Oracle Database tool that automatically diagnoses data failures, presents repair options, and executes repairs at the user's request
      • Determines failures based on symptoms
        • E.g. an “open failed” because datafiles f045.dbf and f003.dbf are missing
        • Failure information recorded in Automatic Diagnostic Repository (ADR)
        • Flags problems before user discovers them, via health monitoring checks (EM, DBMS_HM package , RMAN VALIDATE )
      • Intelligently determines recovery strategies
        • Consolidates repairs for efficient recovery
        • Presents only feasible recovery options
        • Indicates any data loss, if applicable
      • Automatically performs selected recovery steps
      Reduces downtime by eliminating confusion
    • 46. DRA High-level Process Flow
      • Example: ORA-1110 -> db structure check -> missing datafiles -> restore/recover
      • Commands: “LIST FAILURE”, “ADVISE FAILURE”, “REPAIR FAILURE”
      • Diagram labels: DIAGNOSIS, Error, Flood Control, Check Mapping, checks (C), ADR, Advice Generation, Repair Execution, DB, REPAIR
    • 47. Error – Root Cause Failure Example
      • Symptom: ORA-376 – “File cannot be read at this time”
      • Root Causes:
        • File brought offline by DBA
        • Disk Failure
        • File accidentally Deleted / Renamed / Moved
        • Software Bug 1 (bit in SGA incorrectly set)
        • OS Problem (file open limit reached)
        • Transient Network Problem
        • Software Bug 2 (code corrupted file header)
      • Possible FAILURES
        • Datafile is corrupt
        • Datafile is missing (does not exist at the OS level)
        • File system is not mounted properly
        • File resides in an inaccessible file system
        • OS limit on the number of open files is reached
        • Datafile is offline (failure and root cause are the same)
    • 48. Failure Categories
      • Physical DB Components
        • Control file
        • Log file
        • Data file
        • SP file
        • Wallet
      • Inaccessibility Failures
        • ASM or NFS mount problem
        • Not enough kernel memory
        • Limit on open files reached
        • Invalid volume partitioning
        • No file access permissions
        • File is locked
        • Device IO Error
      • Component Failure Types
        • Inaccessible
        • Missing
        • Corrupt
        • Old
        • Inconsistent
      • Logical Corruptions
        • Data Dictionary
        • Data Blocks
        • Undo Segments
    • 49. Diagnostic Checks
      • Physical Corruption Checks
        • Database Cross-Check
        • Control File Check
        • Data Block Integrity Check
        • Redo Integrity Check
        • I/O Check
      • Logical Corruption Checks
        • Data Dictionary Check
        • Undo Segment Integrity Check
        • Data Block Logical Check
    • 50. Repair Examples
      • AUTOMATIC
          • Database Media Recovery (complete / point-in-time)
          • Data File Media Recovery
          • Block Media Recovery
          • Flashback Database
          • Data Guard Failover
      • MANUAL
          • Drop / Recreate Object
          • Increase OS Limit
          • Offline Data File
          • Check Network
          • Rename File
    • 51. Feasibility Checks Examples
      • Is a data file backup available?
      • Are archived logs available?
      • Is this a backup control file?
      • Is flashback enabled?
      • Is database open?
      • Is log file inactive?
      • Many more…
      • Feasibility checks are quick and do not examine every block of every component, so there is still a possibility that a repair proposed by DRA will fail.
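The idea of presenting only feasible repairs can be sketched as a filter over repair options. This is a toy illustration, not how DRA works internally: the mapping from each repair to its required conditions and all condition names are assumptions made up for the example.

```python
def feasible_repairs(repairs, state):
    """Keep only the repairs whose required conditions all hold in `state`."""
    return [name for name, required in repairs.items()
            if all(state.get(condition, False) for condition in required)]


# Hypothetical mapping of repairs to the checks they depend on.
repairs = {
    "Data File Media Recovery": ["datafile_backup_available",
                                 "archived_logs_available"],
    "Flashback Database": ["flashback_enabled"],
    "Data Guard Failover": ["standby_available"],
}

# Hypothetical results of the quick feasibility checks.
state = {
    "datafile_backup_available": True,
    "archived_logs_available": True,
    "flashback_enabled": False,
}

options = feasible_repairs(repairs, state)
```

As the slide notes, passing such quick checks does not guarantee success, since they do not examine every block of every component.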
    • 52.
      • Data Guard
    • 53. Data Guard Essential for HA
      • Data Guard Capabilities
      • Built-in Oracle integration: ensures transactional consistency
      • Extremely high performance
      • Transparent operation, supports all Oracle features and data types
      • Application-integrated failover
      • Combined HA/DR solution
      • Loosely coupled architecture: ensures fault isolation
      • Protection from data corruptions
      • Ensures zero data loss
      • DR servers can be utilized for testing while providing DR
      • Addresses both planned and unplanned downtime
      • No vendor lock-in for storage
      • Minimal network consumption
      • No distance limitation
      LAN & MAN deployments provide local HA and DR. Extend to a Wide Area Network and add remote DR.
    • 54. Data Protection: Why Not With Storage Mirroring? Significant Network Overhead
      • Diagram labels: Production DBMS, Standby Files, Updates, Network I/O
      • Mirrored files:
        • Control Files
        • Online Logs
        • Archive Logs
        • Flashback Logs
        • Data Files: SYSTEM, USER, TEMP, UNDO
      Also: database block corruptions propagated to target
    • 55. Data Protection with Data Guard: Database & Network Optimized Transport & Apply
      • Diagram labels: Production DBMS, Standby DBMS, Updates, Network I/O, Oracle apply, Oracle validation
      • Database files shown: Control Files, Online Logs, Archive Logs, Flashback Logs, Data Files (SYSTEM, USER, TEMP, UNDO)
      • 7X less volume*
      • 27X fewer network I/Os*
      Also: standby database protected from primary database corruptions
      * www.oracle.com/technology/deploy/availability/htdocs/DataGuardRemoteMirroring.html
    • 56. Data Guard Standby Database: Failover Target Production Database Continuous redo shipping, validation & apply Fast Incremental Backups Physical Standby Database Read-write Workload Real-time Reporting
    • 57. Active Data Guard Standby Database: Offload Production + Failover Target Production Database Continuous redo shipping, validation & apply Fast Incremental Backups Active Standby Database (physical standby open read-only) Read-write Workload Real-time Reporting
    • 58.
      • New Oracle Database 11g capability that builds upon Data Guard
        • Physical standby open read-only while changes applied from primary database
        • RMAN fast incremental backups on physical standby using Block Change Tracking
        • Auto-repair corrupted data blocks
      • Benefits
        • Improves primary database performance by offloading processing to standby
        • Makes productive use of existing physical standby databases
      • Licensing
        • Packaged as a separate database Option for Oracle Enterprise Edition
        • Requires licensing of the production database and all the physical standbys that are used for any of the above capabilities
      Oracle Active Data Guard – What Is It?
    • 59. Apple Inc: Reader Farm Scale Out using Active Data Guard (Oracle Database 11g Release 1)
      • Diagram labels: App 1, App 2, App 3, App n; Load Balancer; Primary Database; Data Guard Standby Database; ADG 1, ADG 2, ADG 3, ADG 8, ADG 9; Max Availability Mode; SYNC; ASYNC
    • 60. Amazon.com High Availability Integrated with Disaster Recovery Before Data Guard Data Guard Automatic Failover With Data Guard HA/DR Database failover: 20 secs Apps redirected: 2 mins Standby site distance: 15 miles
    • 61. Putting It All Together: Customer Example for Data Availability
      • Requirement (table rows, top to bottom)
        • RPO
        • RTO: Tier 3, Tier 2, Tier 1
        • Disaster Recovery
        • Retention Policy
        • Backup Redundancy
      • Service Level Agreement
        • Any point in time within recovery window
        • <1 hour for tablespace/datafile recovery
        • <3 hours for full database recovery
        • <30 min for row/table recovery (within last 3 hrs)
        • <1 hour for database recovery from logical errors (within last 2 hrs)
        • <15 min for any database outage
        • Failover to standby database at secondary site
        • Backups sent offsite
        • Onsite backups - 3 day recovery window
        • Offsite backups - 1 year tape retention
        • Two backup copies on tape
      • Oracle Solution
        • Archived Log Mode
        • RMAN, OSB, DRA
        • Flashback Table
        • Flashback Database
        • Data Guard
        • Data Guard
        • OSB
        • Fast Recovery Area, OSB
        • OSB
    • 62. Agenda
      • Protecting server availability
      • Protecting data availability
      • How to live with a changing environment
    • 63. Best Online Planned Maintenance At Lowest Cost Server Availability Data Availability System Changes Data Changes Unplanned Downtime Planned Downtime App Changes Online Reconfiguration Rolling Upgrades Online Redefinition Edition-based Redefinition
    • 64.
      • Servers
        • Add/Remove RAC nodes online
        • No data movement needed
      • Storage
        • Add/Remove ASM disks or arrays online
        • Automatically rebalance after storage change
      • Clusterware, ASM
        • Upgrade Oracle Clusterware and ASM (11g) in an online manner
      Database Storage Online Reconfiguration Scaling on Demand
    • 65. Online Patching and Upgrades
      • Most one-off patches can be applied to a running Oracle instance
        • Linux-x86, Solaris 10, HP-UX 11i
        • [New in 11.2] Windows 32-bit and Windows 64-bit, AIX v6.1 [TL2 SP1]
      • More complex one-off patches can be deployed online using RAC rolling patches (available 10g onwards)
      • Database release/patchset upgrades, operating system upgrades, platform migrations can be applied in rolling fashion using Data Guard / GoldenGate
      • Data Center moves / SAN migration / Technology Refresh etc. can be done with minimal downtime using Data Guard / GoldenGate
    • 66.
      • All index changes can be done online
      • Tables can be Reorganized & Redefined online with the DBMS_REDEFINITION package
        • Allows changing location, table type, partitioning, columns, column types
        • Contents can be transformed as they are copied
      Source Table Update Tracking Transform Copy Table Transform Updates Result Table Continuous Queries & Updates Store Updates Online Index & Table Redefinition
    • 67.
      • Maintains logical versions of changed database objects, through:
        • Edition
        • Editioning View
        • Crossedition Trigger
      • Code changes installed in the privacy of a new edition
      • New data changes made to new columns/tables not seen by old edition
      • Editioning view exposes a private projection of a table into each edition
      • Crossedition trigger propagates changes made by old edition into new edition’s columns, or vice-versa
      • Capabilities primarily used by application developers
      Pre-upgrade Edition New in 11.2 Post-upgrade Edition Crossedition Triggers Edition-based Redefinition Enabling Online Application Upgrades
    • 68. Last but not Least
      • Change control is also relevant for HA
        • Use Database Vault, or at least database triggers, to mitigate the risk of human / application errors
        • Audit changes and analyze schema / configuration drift
      • Performance management is also relevant for HA
        • Identify bottlenecks
        • Identify pain points for the configuration or application
        • Practice tuning when necessary and when possible
        • Practice resource management
      • Security is also relevant for HA
        • Keep user/application provisioning up to date
        • Do not overprivilege the users
        • There is no availability when you have to switch the system off to stop a cyberattack
    • 69.