Adventures in Dataguard


Published in: Technology

  • Adventures in Dataguard

    1. 1. Adventures in Dataguard Dr. Jason Arneil
    2. 2. Why Dataguard Motivation
    3. 3. AGENDA <ul><li>Introduction </li></ul><ul><li>The Motivation </li></ul><ul><li>Dataguard Architecture & Features </li></ul><ul><li>Creating a Physical Standby </li></ul><ul><li>Maintaining your standby </li></ul><ul><li>Using your Standby </li></ul><ul><li>Performing a Switchover </li></ul>
    4. 4. Health Warning Introduction
    5. 5. About Me Introduction <ul><li>Jason Arneil </li></ul><ul><li>System Administrator/DBA </li></ul><ul><li>Using Oracle since 1998 </li></ul><ul><li>At Nominet since 2001 </li></ul>
    6. 6. About Nominet Introduction <ul><li>Nominet is the internet registry for .uk domain names </li></ul><ul><li>Nominet has been in existence for over 11 years </li></ul><ul><li>Nominet is run as a not-for-profit company </li></ul><ul><li>Nominet is owned by its members </li></ul><ul><li>There are over 6 Million .uk domain names </li></ul>
    7. 7. Why Dataguard Motivation <ul><li>Big push on a Nominet Business Continuity Plan </li></ul><ul><li>Dataguard is the Oracle solution for disaster recovery </li></ul><ul><li>Physical Standby was the obvious option </li></ul><ul><li>Maximum Availability Architecture (MAA) </li></ul>
    8. 8. Business Continuity Site Motivation
    9. 9. Dataguard Processes Architecture & Features [Diagram of Dataguard processes. Labels: Primary Database, Transactions, LGWR, LNS, Online Redo Logs, ARCH, Archived Redo Logs, Oracle Net, RFS, FAL, Standby Redo Logs, MRP/LSP, Transform Redo to SQL for SQL Apply, Physical/Logical Standby Database, Backup / Reports]
    10. 10. Dataguard Features Architecture & Features <ul><li>Several Protection Modes </li></ul><ul><ul><li>Maximum Protection </li></ul></ul><ul><ul><li>Maximum Availability </li></ul></ul><ul><ul><li>Maximum Performance </li></ul></ul><ul><li>Several Transport Modes </li></ul><ul><ul><li>LGWR SYNC </li></ul></ul><ul><ul><li>LGWR ASYNC </li></ul></ul><ul><ul><li>ARCH </li></ul></ul>
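As a rough aid to the protection and transport modes on slide 10, the sketch below encodes which redo transport settings can support each protection mode. The lookup table and the function name `transport_allowed` are illustrative assumptions, not an Oracle interface. It reflects the general rule that Maximum Protection and Maximum Availability need synchronous LGWR transport, while Maximum Performance works with any of the three.

```python
# Illustrative lookup only; the names are assumptions, not an Oracle API.
ALLOWED_TRANSPORT = {
    "MAXIMUM PROTECTION": {"LGWR SYNC"},
    "MAXIMUM AVAILABILITY": {"LGWR SYNC"},
    "MAXIMUM PERFORMANCE": {"LGWR SYNC", "LGWR ASYNC", "ARCH"},
}


def transport_allowed(protection_mode: str, transport: str) -> bool:
    """Return True if the transport mode can support the protection mode."""
    modes = ALLOWED_TRANSPORT[protection_mode.strip().upper()]
    return transport.strip().upper() in modes


if __name__ == "__main__":
    for mode in ALLOWED_TRANSPORT:
        usable = [t for t in ("LGWR SYNC", "LGWR ASYNC", "ARCH")
                  if transport_allowed(mode, t)]
        print(f"{mode}: {', '.join(usable)}")
```

A check like this is only a planning aid; the database itself enforces the real rules when the protection mode is set.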
    11. 11. Prepare Primary & Standby Creating a Standby <ul><li>Prepare Primary Database </li></ul><ul><ul><li>Enable Force Logging </li></ul></ul><ul><ul><li>SQL> alter database force logging; </li></ul></ul><ul><ul><li>Modify initialization parameters </li></ul></ul><ul><li>Prepare Standby Database </li></ul><ul><ul><li>Setup directory structure </li></ul></ul><ul><ul><li>Create spfile with correct parameters </li></ul></ul><ul><ul><li>Start database in nomount </li></ul></ul>
    13. 13. ssh tunnels Creating a Standby <ul><li>You may not want your redo data sent unencrypted across the internet to your standby. You can use ssh tunnels to avoid this </li></ul><ul><ul><li>ssh -N -L 3333:standby:1521 oracle@standby </li></ul></ul><ul><li>Now the tnsnames entry points to localhost </li></ul><ul><li>STANDBYARC = </li></ul><ul><li> (DESCRIPTION = </li></ul><ul><li> (SDU = 32767) </li></ul><ul><li> (ADDRESS_LIST = </li></ul><ul><li> (ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT=3333))) </li></ul><ul><li> (CONNECT_DATA = </li></ul><ul><li>(SERVICE_NAME = STANDBY))) </li></ul>
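The tunnel command and the tnsnames entry on slide 13 have to agree on the local port. A minimal sketch that generates both from the same values is shown below. The helper names `ssh_tunnel_command` and `tns_entry` are hypothetical; the defaults (port 1521, user `oracle`, SDU 32767) are taken from the slide.

```python
def ssh_tunnel_command(local_port, standby_host, listener_port=1521, user="oracle"):
    """Build the ssh port-forward command shown on the slide."""
    return f"ssh -N -L {local_port}:{standby_host}:{listener_port} {user}@{standby_host}"


def tns_entry(alias, local_port, service_name, sdu=32767):
    """Build a tnsnames.ora entry that points at the local end of the tunnel."""
    return (
        f"{alias} =\n"
        f"  (DESCRIPTION =\n"
        f"    (SDU = {sdu})\n"
        f"    (ADDRESS_LIST =\n"
        f"      (ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT = {local_port})))\n"
        f"    (CONNECT_DATA =\n"
        f"      (SERVICE_NAME = {service_name})))\n"
    )


if __name__ == "__main__":
    print(ssh_tunnel_command(3333, "standby"))
    print(tns_entry("STANDBYARC", 3333, "STANDBY"))
```

Generating both pieces from one port value avoids the easy mistake of changing the tunnel port and forgetting the tnsnames entry.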
    14. 14. Some Other Parameters Creating a Standby <ul><li>FAL_SERVER </li></ul><ul><li>FAL_CLIENT </li></ul><ul><li>ARCHIVE_LAG_TARGET </li></ul><ul><li>STANDBY_FILE_MANAGEMENT </li></ul><ul><li>DB_FILE_NAME_CONVERT </li></ul><ul><li>LOG_FILE_NAME_CONVERT </li></ul>
    15. 15. backup your primary Creating a Standby <ul><li>Backup primary - rman is good </li></ul><ul><ul><li>rman> backup format '/backup/%U' database plus archivelog; </li></ul></ul><ul><ul><li>rman> backup format '/backup/%U' current controlfile for standby; </li></ul></ul><ul><li>Recover backup on standby node </li></ul><ul><ul><li>I like using rman duplicate to create standby: </li></ul></ul><ul><li>(oracle$) rman target sys/password@PRIMARY auxiliary / </li></ul><ul><li>rman> duplicate target database for standby; </li></ul>
    16. 16. Start applying redo Creating a Standby <ul><li>Create standby redo log files on both primary and standby: </li></ul><ul><ul><li>sql> alter database add standby logfile thread 2 group 42 ('PATH_TO_DATA/standbyredo01.log') size 512M; </li></ul></ul><ul><li>Now you can start the physical standby recovering logs: </li></ul><ul><ul><li>sql> alter database recover managed standby database disconnect from session; </li></ul></ul><ul><li>Or if you prefer real time apply: </li></ul><ul><ul><li>sql> alter database recover managed standby database using current logfile disconnect from session; </li></ul></ul>
    17. 17. Monitoring the Standby Maintaining your standby <ul><li>You have to ensure your standby is keeping up with your primary </li></ul><ul><li>You can check which log was the last to be applied to your standby: </li></ul><ul><ul><li>sql> SELECT MAX(SEQUENCE#), THREAD# </li></ul></ul><ul><ul><ul><li> FROM V$ARCHIVED_LOG </li></ul></ul></ul><ul><ul><ul><li> where APPLIED='YES' </li></ul></ul></ul><ul><ul><ul><li> GROUP BY THREAD#; </li></ul></ul></ul><ul><li>MAX(SEQUENCE#) THREAD# </li></ul><ul><li>-------------- ---------- </li></ul><ul><li> 2976 1 </li></ul><ul><li> 1888 2 </li></ul>
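The applied-sequence query on slide 17 only becomes meaningful when compared with the primary. The sketch below shows that comparison per redo thread. The standby figures are the ones on the slide; the primary figures and the function name `apply_gap` are made up for illustration.

```python
def apply_gap(primary_max, standby_applied):
    """Per-thread count of log sequences the primary has that the standby has not applied.

    Both arguments map THREAD# to the highest SEQUENCE# seen on that side.
    A thread missing on the standby is treated as having applied nothing.
    """
    return {
        thread: seq - standby_applied.get(thread, 0)
        for thread, seq in primary_max.items()
    }


if __name__ == "__main__":
    standby_applied = {1: 2976, 2: 1888}  # values from the slide
    primary_max = {1: 2977, 2: 1889}      # hypothetical primary values
    for thread, gap in sorted(apply_gap(primary_max, standby_applied).items()):
        print(f"thread {thread}: {gap} log(s) behind")
```

In practice the two dictionaries would be filled from the same query run against each database.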
    18. 18. Monitoring Standby Progress Maintaining your standby <ul><li>A good way of checking what the background processes of your standby are up to is using v$managed_standby </li></ul><ul><ul><li>SQL> select process, sequence#, status </li></ul></ul><ul><li> from V$managed_standby; </li></ul><ul><li>PROCESS SEQUENCE# STATUS </li></ul><ul><li> -------- ---------- ------------ </li></ul><ul><li> ARCH 2967 CLOSING </li></ul><ul><li> ARCH 2974 CLOSING </li></ul><ul><li> RFS 2977 IDLE </li></ul><ul><li> MRP0 1889 APPLYING_LOG </li></ul><ul><li> RFS 1889 IDLE </li></ul><ul><li> RFS 2977 IDLE </li></ul>
    19. 19. Monitoring Your Standby Maintaining your standby <ul><li>You have to ensure your standby is keeping up with your primary </li></ul><ul><li>V$DATAGUARD_STATS provides useful information </li></ul><ul><ul><li>SQL> select name, value from v$dataguard_stats; </li></ul></ul><ul><li>NAME VALUE </li></ul><ul><li>-------------------------------- ------------------------------------ </li></ul><ul><li>apply finish time +00 00:00:00 </li></ul><ul><li>apply lag +00 00:00:11 </li></ul><ul><li>estimated startup time 41 </li></ul><ul><li>standby has been open N </li></ul><ul><li>transport lag +00 00:00:03 </li></ul>
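The lag values from V$DATAGUARD_STATS on slide 19 come back as interval strings such as `+00 00:00:11`. A small sketch of turning them into durations and checking them against a threshold is shown below. The function names and the 10-second limit are assumptions for illustration; fractional seconds, if present, are ignored.

```python
import re
from datetime import timedelta

# Matches day-to-second interval text like "+00 00:00:11" (optional fraction ignored).
_INTERVAL = re.compile(r"^([+-])(\d+) (\d{2}):(\d{2}):(\d{2})(?:\.\d+)?$")


def parse_lag(value: str) -> timedelta:
    """Convert an interval string from v$dataguard_stats into a timedelta."""
    match = _INTERVAL.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised interval: {value!r}")
    sign, days, hours, minutes, seconds = match.groups()
    delta = timedelta(days=int(days), hours=int(hours),
                      minutes=int(minutes), seconds=int(seconds))
    return -delta if sign == "-" else delta


def lagging(stats: dict, limit: timedelta) -> list:
    """Return which of 'apply lag' / 'transport lag' exceed the limit."""
    return [name for name in ("apply lag", "transport lag")
            if name in stats and parse_lag(stats[name]) > limit]


if __name__ == "__main__":
    stats = {"apply lag": "+00 00:00:11", "transport lag": "+00 00:00:03"}
    print(lagging(stats, timedelta(seconds=10)))
```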
    20. 20. Monitoring Your Standby Maintaining your standby <ul><li>A way of finding out what has been happening to your standby over a period of time is to look at the v$dataguard_status view </li></ul><ul><ul><li>Log Apply Services 01-AUG-07 Media Recovery Waiting for thread 1 sequence 2977 (in transit) </li></ul></ul><ul><ul><li>Log Apply Services 01-AUG-07 Media Recovery Waiting for thread 1 sequence 2977 (in transit) </li></ul></ul><ul><ul><li>Log Apply Services 01-AUG-07 Media Recovery Waiting for thread 2 sequence 1889 (in transit) </li></ul></ul><ul><ul><li>Remote File Server 01-AUG-07 Primary database is in MAXIMUM PERFORMANCE mode </li></ul></ul><ul><ul><li>Remote File Server 01-AUG-07 RFS[53]: Successfully opened standby log 14: '+DATA2/standby/standbyredo02.log' </li></ul></ul>
    21. 21. Oracle can’t divide by 0 Maintaining your standby <ul><li>Standby was happily working away </li></ul><ul><ul><li>ORA-07445: exception encountered: core dump [kcrarmb()+152] [SIGFPE] [Integer divide by zero] [0x00085C300 </li></ul></ul><ul><li>MRP process crashes </li></ul><ul><ul><li>No redo gets applied from this point </li></ul></ul><ul><li>Logs after the one that caused the ORA-07445 still being shipped </li></ul><ul><li>A simple restart of the managed recovery process does a FAL and the standby is back up-to-date </li></ul>
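Since a crashed MRP leaves redo arriving but not being applied, it is worth checking for the process explicitly. The sketch below inspects rows shaped like the v$managed_standby output on slide 18 and, if no MRP process is listed, returns the restart statement from slide 16. The function names are hypothetical, and actually running the statement is left to whatever tooling is in use.

```python
# Restart statement as shown on the "Start applying redo" slide.
RESTART_MRP = "alter database recover managed standby database disconnect from session;"


def mrp_running(rows) -> bool:
    """rows: iterable of (process, sequence#, status) tuples from v$managed_standby."""
    return any(process.upper().startswith("MRP") for process, _seq, _status in rows)


def recovery_action(rows):
    """Return the statement needed to restart managed recovery, or None if it is running."""
    return None if mrp_running(rows) else RESTART_MRP


if __name__ == "__main__":
    healthy = [("ARCH", 2967, "CLOSING"), ("RFS", 2977, "IDLE"),
               ("MRP0", 1889, "APPLYING_LOG")]
    crashed = [("ARCH", 2967, "CLOSING"), ("RFS", 2977, "IDLE")]
    print(recovery_action(healthy))
    print(recovery_action(crashed))
```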
    22. 22. kcrfr_resize2 Maintaining your standby <ul><li>Lots of problems after upgrade to </li></ul><ul><ul><li>Recovery of Online Redo Log: Thread 2 Group 23 Seq 999 Reading mem 0 </li></ul></ul><ul><ul><li>Mem# 0: +DATA3/standby/standbyredo11.log </li></ul></ul><ul><ul><li>ORA-00600: internal error code, arguments: [kcrfr_resize2], [652614828032], [268423168], [], [], [], [], [] </li></ul></ul><ul><li>Perhaps caused by the following: </li></ul><ul><ul><li>Bug 3306010 OERI[kcrfr_resize2] possible in MEDIA recovery </li></ul></ul><ul><li> Media recovery may fail with ORA-600 [kcrfr_resize2] when </li></ul><ul><li> the number of redo strands is set to a high value using </li></ul><ul><li> log_parallelism. </li></ul>
    23. 23. kcrfr_resize2 Maintaining your standby <ul><li>This issue has recently been published as Note:453259.1 </li></ul><ul><ul><li>Triggered by having a large log_buffer </li></ul></ul><ul><li>This bug affects and potentially </li></ul><ul><li>It is related to the size of the log_buffer parameter </li></ul><ul><li>Fix is included in </li></ul>
    24. 24. kcrrupirfs Maintaining your standby <ul><li>ARC processes died on primary: </li></ul><ul><ul><li>ORA-00600: [kcrrupirfs.20] [4] [368] </li></ul></ul><ul><li>Trace file showed the following: </li></ul><ul><li>Corrupt redo block 479421 detected: bad block number </li></ul><ul><li>Flag: 0x0 Format: 0x0 Block: 0x00000000 Seq: 0x00000000 Beg: 0x0 Cks:0x0 <<<<<<<-- </li></ul><ul><li>----- Dump of Corrupt Redo Buffer -----000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 </li></ul>
    25. 25. kcrrupirfs Maintaining your standby <ul><li>Oracle initially thought this ORA-600 error was hardware related </li></ul><ul><ul><li>There were NO indications of any hardware fault - the primary kept running </li></ul></ul><ul><li>After a couple of weeks it was decided this was a “bug situation” </li></ul><ul><ul><li>This was bug 4767278 which talked about FAL not being able to read from multiple mirror sides when encountering invalid/stale redo in a file. Apparently required for ASM configurations because ASM does not guarantee all mirror sides contain same data after writing. </li></ul></ul><ul><ul><li>We were using ASM, but external redundancy </li></ul></ul><ul><ul><li>Oracle then said “The ASM group is not 100% sure if the patch 4767278 will fix the problem” </li></ul></ul>
    26. 26. log corruption Maintaining your standby <ul><li>The Managed Recovery process crashed complaining about log corruption </li></ul><ul><li>MRP0: Background Media Recovery terminated with error 355 </li></ul><ul><li>ORA-00355: change numbers out of order </li></ul><ul><li>ORA-00353: log corruption near block 2 change 1273622545 time 03/06/2007 08:32:46 </li></ul><ul><li>ORA-00312: online log 13 thread 1: '+DATA2/standby/standbyredo01.log' </li></ul><ul><li>Oracle blame the upgrade process at first. They suggest rebuilding the standby </li></ul><ul><li>Then I notice that trying managed recovery rather than real time apply seems to allow the standby to progress </li></ul>
    27. 27. log corruption Maintaining your standby <ul><li>At this point Oracle say “it looks like a bug” </li></ul><ul><li>Lots of time spent diagnosing the issue </li></ul><ul><ul><li>ALTER SYSTEM DUMP LOGFILE '+DATA2/nom/standby33.log' scn min 865465290 scn max 865465300; </li></ul></ul><ul><li>Eventually Oracle produced a patch 5746174 </li></ul><ul><ul><li>MRP HANGS WITH ASYNC LNS AND PARALLEL ARCHIVAL </li></ul></ul>
    28. 28. Utilize those cpu cycles Using Your Standby <ul><li>A Standby can be considered an insurance policy </li></ul><ul><li>Several ways to utilize your standby </li></ul><ul><ul><li>Run your backups from your standby </li></ul></ul><ul><ul><li>Open your standby read only for reporting </li></ul></ul><ul><ul><li>Flashback standby to look at old data </li></ul></ul><ul><ul><li>Open your standby read write for testing purposes </li></ul></ul>
    29. 29. Open for Reports Using Your Standby <ul><li>You need to cancel managed recovery </li></ul><ul><ul><li>sql> alter database recover managed standby database cancel; </li></ul></ul><ul><li>Then simply open the standby </li></ul><ul><ul><li>sql> alter database open; </li></ul></ul><ul><li>Redo is still transported to your standby </li></ul><ul><li>To transition back to applying redo, shut down the open standby, startup mount and restart the recovery process </li></ul>
    30. 30. Open for read write Using Your Standby <ul><li>You must have flashback database enabled for this </li></ul><ul><li>Stop redo apply on standby </li></ul><ul><li>Create a restore point </li></ul><ul><li>Activate the Standby & perform read/write testing </li></ul><ul><li>Flashback to restore point </li></ul><ul><li>Start redo apply on the Standby again </li></ul>
    31. 31. Open for read write Using Your Standby [Diagram labels: Physical Standby, Restore Point, Activate standby, Physical Standby read write, Flashback Database]
    32. 32. Flashback Database in a Nutshell Using Your Standby <ul><li>Set up Flashback Database </li></ul><ul><ul><li>alter system set db_recovery_file_dest_size = 8G; </li></ul></ul><ul><ul><li>alter system set db_recovery_file_dest = 'your flashback destination'; </li></ul></ul><ul><ul><li>alter system set db_flashback_retention_target = 1440 ; </li></ul></ul><ul><ul><li>alter database flashback on; </li></ul></ul><ul><li>Once you have cancelled the standby recovery create a guaranteed restore point </li></ul><ul><ul><li>create guaranteed restore point before_activate; </li></ul></ul>
    33. 33. Open for read write Using Your Standby <ul><li>Activate your Standby </li></ul><ul><ul><li>SQL> ALTER DATABASE ACTIVATE STANDBY DATABASE; </li></ul></ul><ul><li>You can open the Standby for business </li></ul><ul><ul><li>SQL> ALTER DATABASE OPEN; </li></ul></ul><ul><li>To become a Standby again shutdown and startup in mount </li></ul><ul><ul><li>SQL> FLASHBACK DATABASE TO RESTORE POINT BEFORE_ACTIVATE; </li></ul></ul><ul><ul><li>SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY; </li></ul></ul>
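The read/write testing cycle spans several slides, and the order of the statements matters. The sketch below gathers them into two ordered lists, one for opening the standby for testing and one for returning it to standby duty. The statements are the ones shown on slides 29, 32 and 33; the function name is hypothetical, and `shutdown immediate` is an assumption, since the slide only says "shutdown". Restarting redo apply afterwards is not included.

```python
def snapshot_test_cycle(restore_point="before_activate"):
    """Return (open_for_testing, return_to_standby) statement lists, in order."""
    open_for_testing = [
        "alter database recover managed standby database cancel;",
        f"create guaranteed restore point {restore_point};",
        "alter database activate standby database;",
        "alter database open;",
    ]
    return_to_standby = [
        "shutdown immediate",  # assumption: the slide does not name the shutdown mode
        "startup mount",
        f"flashback database to restore point {restore_point};",
        "alter database convert to physical standby;",
    ]
    return open_for_testing, return_to_standby


if __name__ == "__main__":
    opening, returning = snapshot_test_cycle()
    for statement in opening + returning:
        print(statement)
```

Keeping the restore point name in one place makes it harder to flash back to the wrong point.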
    34. 34. Open for read write Using Your Standby <ul><li>However things never go according to plan </li></ul><ul><ul><li>ORA-00600: internal error code, arguments: [3705], [1], [8], [3], [8], [], [] </li></ul></ul><ul><li>This was bug 4479323 which is a bug with recovery (not standby specific) and only occurs in a RAC environment </li></ul><ul><li>This is fixed in </li></ul>
    35. 35. It’s good to test Doing a Switchover <ul><li>A business continuity plan is no good unless it’s been tested </li></ul><ul><li>It’s not all about the database </li></ul><ul><li>Good to think in terms of services </li></ul>
    36. 36. Database Switchover Doing a Switchover <ul><li>Make sure your standby is up-to-date </li></ul><ul><li>Check your primary database switchover status: </li></ul><ul><ul><li>primary> SELECT SWITCHOVER_STATUS FROM V$DATABASE; </li></ul></ul><ul><li>Switchover primary database </li></ul><ul><ul><li>primary> ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY with session shutdown; </li></ul></ul><ul><li>Switchover the standby </li></ul><ul><ul><li>standby> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY with session shutdown; </li></ul></ul>
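The SWITCHOVER_STATUS check on slide 36 is the gate for the whole procedure. A minimal sketch of interpreting it is shown below. The status values listed are the ones I understand V$DATABASE to report when a switchover can proceed, so verify them against the documentation for your release; the function name `can_switch` is hypothetical.

```python
# Status values assumed to mean "switchover may proceed" for each role.
PRIMARY_READY = {"TO STANDBY", "SESSIONS ACTIVE"}
STANDBY_READY = {"TO PRIMARY", "SESSIONS ACTIVE"}


def can_switch(role: str, switchover_status: str) -> bool:
    """Return True if the reported status allows the switchover step for this role."""
    role = role.strip().lower()
    if role == "primary":
        ready = PRIMARY_READY
    elif role == "standby":
        ready = STANDBY_READY
    else:
        raise ValueError(f"unknown role: {role!r}")
    return switchover_status.strip().upper() in ready


if __name__ == "__main__":
    print(can_switch("primary", "TO STANDBY"))
    print(can_switch("primary", "NOT ALLOWED"))
```

A status of SESSIONS ACTIVE is the case the `with session shutdown` clause on the slide is there to handle.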
    37. 37. DNS Primer Doing a switchover <ul><li>DNS allows translation from hostname to IP address </li></ul><ul><ul><li> IN A </li></ul></ul><ul><li>Our principle is all services are accessed through a CNAME </li></ul><ul><ul><li> 5M IN CNAME </li></ul></ul><ul><li>relocation of the service is just a case of changing where the CNAME points </li></ul>
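Relocating a service by repointing its CNAME can be sketched as generating a new zone-file record. The host names below are placeholders under example.com, not the names from the original slide (which are not preserved in this transcript); the 5M TTL is the one shown on the slide, and `cname_record` is a hypothetical helper.

```python
def cname_record(alias: str, target: str, ttl: str = "5M") -> str:
    """Build a zone-file CNAME line pointing a service alias at a host."""
    return f"{alias} {ttl} IN CNAME {target}"


if __name__ == "__main__":
    # Before switchover: the service alias points at the primary site (placeholder names).
    print(cname_record("db.example.com.", "primary-host.example.com."))
    # After switchover: only the target changes.
    print(cname_record("db.example.com.", "standby-host.example.com."))
```

A short TTL such as five minutes limits how long clients keep resolving the alias to the old site.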
    38. 38. Conclusion Conclusion <ul><li>Dataguard is an efficient DR solution for your primary database </li></ul><ul><li>Dataguard is mostly reliable but is not without its blips </li></ul><ul><li>There are opportunities for gaining added value from your standby </li></ul><ul><li>You can’t test your business continuity plan enough </li></ul>
    39. 39. Questions? Adventures in Dataguard <ul><li>Contact: </li></ul><ul><li>[email_address] </li></ul><ul><li> </li></ul>