CLOUG: Troubleshooting Oracle 11g RAC 101 Tips and Tricks
Oracle 11g RAC (Real Application Clusters) presentation that I gave at CLOUG in Chile this year.

Transcript

  • 1. Session: Troubleshooting Oracle 11g Real Application Clusters 101: Insider Tips and Tricks. Ben Prusinski, Ben Prusinski and Associates, http://www.ben-oracle.com, [email_address]. CLOUG, Santiago, Chile, Tuesday 14 April 2009
  • 2. Speaker Qualifications Ben Prusinski
    • Oracle ACE and Oracle Certified Professional with 14 plus years of real world experience with Oracle since version 7.3.4
    • Author of two books on Oracle database technology
  • 3. Agenda
  • 4. Agenda: Troubleshooting Oracle 11g RAC
    • Proactive checks to keep Oracle 11g RAC happy and healthy
    • Common RAC problems and solutions
    • Root cause analysis for RAC
    • Understanding Clusterware problems
    • Solving critical tuning issues for RAC
    • DBA 101 Toolkit for RAC problem solving
  • 5. Checks and Balances for 11g RAC
  • 6. Proactive checks to keep Oracle 11g RAC happy and healthy
    • Setup monitoring system to automate checks before major problems occur!
    • Verify status for RAC processes and Clusterware
    • Check for issues with ASM
    • Check status for hardware, network, OS
  • 7. Monitoring Systems for 11g RAC
    • Oracle Grid Control provides monitoring alerts for Oracle 11g RAC
    • System level OS scripts to monitor Clusterware and Oracle 11g RAC processes
    • Check for 11g ASM processes and 11g RAC database processes
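    • A minimal sketch of such an OS-level check script, assuming a cron-driven Bash script on Linux (the CRS home path below is the one used throughout this presentation and should be adjusted to your environment):
    • #!/bin/bash
    • # Hypothetical health check: flag missing Clusterware/ASM daemons and an unhealthy CRS stack
    • CRS_HOME=/u01/app/oracle/product/11.1.0/crs
    • for proc in crsd.bin ocssd.bin evmd.bin asm_pmon; do
    •   pgrep -f "$proc" > /dev/null || echo "$(hostname): $proc is not running"
    • done
    • # crsctl check crs prints CSS/CRS/EVM health; only lines not reporting healthy are shown
    • $CRS_HOME/bin/crsctl check crs | grep -v "appears healthy"
    • Wire the output of a script like this into your paging or mail system so problems surface before users notice them.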
  • 8. Verification 11g RAC Processes
    • First, check at the operating system level that all 11g RAC Clusterware processes are up and running:
    • Oracle Metalink Note # 761259.1 How to Check the Clusterware Processes
    • [oracle@sdrac01 11.1.0]$ ps -ef|grep crsd
    • root 2853 1 0 Apr04 ? 00:00:00
    • /u01/app/oracle/product/11.1.0/crs/bin/crsd.bin reboot
    • [oracle@sdrac01 11.1.0]$ ps -ef|grep cssd
    • root 2846 1 0 Apr04 ? 00:03:15 /bin/sh /etc/init.d/init.cssd fatal
    • root 3630 2846 0 Apr04 ? 00:00:00 /bin/sh /etc/init.d/init.cssd daemon
    • /u01/app/oracle/product/11.1.0/crs/bin/ocssd.bin
    • [oracle@sdrac01 11.1.0]$ ps -ef|grep evmd
    • oracle 3644 2845 0 Apr04 ? 00:00:00
    • /u01/app/oracle/product/11.1.0/crs/bin/evmd.bin
    • oracle 9595 29413 0 23:59 pts/3 00:00:00 grep evmd
  • 9. Verify 11g RAC Processes
    • oprocd: Runs on Unix when vendor Clusterware is not running. On Linux, only starting with 10.2.0.4.
    • oclsvmon.bin: usually runs when a third-party clusterware is used
    • oclsomon.bin: check program for ocssd.bin (starting in 10.2.0.2)
    • diskmon.bin: new 11.1.0.7 process for the Oracle Exadata Machine
    • oclskd.bin: new 11.1.0.6 process used to reboot nodes when 11g RAC RDBMS instances are in a hang condition
    • There are three fatal processes, i.e. processes whose abnormal halt or kill will provoke a node reboot (Metalink Note 265769.1): ocssd.bin, oprocd.bin, and oclsomon.bin
    • The other processes are automatically restarted when they go away.
  • 10. Scripts for RAC monitoring
    • Metalink Note 135714.1 provides the racdiag.sql script to collect health status for 11g RAC environments.
    • TIME
    • --------------------
    • FEB-11-2009 10:06:36
    • 1 row selected.
    • INST_ID INSTANCE_NAME HOST_NAME VERSION STATUS STARTUP_TIME
    • ------- --------- --------- -------- ------- ----------
    • 1 rac01 sdrac01 11.1.0.7 OPEN FEB-01-2009
    • 2 rac02 sdrac02 11.1.0.7 OPEN FEB-01-2009
    • 2 rows selected
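    • The instance status shown above can also be pulled straight from GV$INSTANCE without racdiag.sql; a minimal sketch from the shell (connecting as SYSDBA is an assumption):
    • $ sqlplus -s / as sysdba <<'EOF'
    • SET LINESIZE 120 PAGESIZE 100
    • SELECT inst_id, instance_name, host_name, version, status, startup_time
    • FROM   gv$instance
    • ORDER  BY inst_id;
    • EOF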
  • 11. Check Status 11g RAC Clusterware
    • CRSCTL is your friend
    • [oracle@sdrac01 11.1.0]$ crsctl
    • Usage: crsctl check crs - checks the viability of the CRS stack
    • crsctl check cssd - checks the viability of CSS
    • crsctl check crsd - checks the viability of CRS
    • crsctl check evmd - checks the viability of EVM
    • Worked Example of using CRSCTL for 11g RAC
    • [oracle@sdrac01 11.1.0]$ crsctl check crs
    • CSS appears healthy
    • CRS appears healthy
    • EVM appears healthy
  • 12. More Checks for 11g RAC
    • Use srvctl to get quick status check for 11g RAC:
    • [oracle@sdrac01]$ srvctl
    • Usage: srvctl <command> <object> [<options>]
    • command: enable|disable|start|stop|relocate|status|add|remove|modify|getenv|setenv|unsetenv|config
    • objects: database|instance|service|nodeapps|asm|listener
  • 13. Using SRVCTL with 11g RAC
    • Using SRVCTL to Check Database and Instances for 11g RAC
    • 11g RAC Database Status:
    • srvctl status database -d <database-name> [-f] [-v] [-S <level>]
    • srvctl status instance -d <database-name> -i <instance-name>[,<instance-name-list>]
    • [-f] [-v] [-S <level>]
    • srvctl status service -d <database-name> -s <service-name>[,<service-name-list>]
    • [-f] [-v] [-S <level>]
    • srvctl status nodeapps [-n <node-name>]
    • srvctl status asm -n <node_name>
  • 14. SRVCTL for 11g RAC- Syntax
    • Status of the database, all instances and all services.
    • $ srvctl status database -d ORACLE -v
    • Status of named instances with their current services.
    • $srvctl status instance -d ORACLE -i RAC01, RAC02 -v
    • Status of a named service.
    • $srvctl status service -d ORACLE -s ERP -v
    • Status of all nodes supporting database applications.
    • $srvctl status nodeapps –n {nodename}
  • 15. SRVCTL Worked Examples 11g RAC
    • Database and Instance Status Checks
    • $ srvctl status database -d RACDB -v
    • Instance RAC01 is not running on node sdrac01
    • Instance RAC02 is not running on node sdrac02
    • Node Application Checks
    • $ srvctl status nodeapps -n sdrac01
    • VIP is not running on node: sdrac02
    • GSD is running on node: sdrac01
    • Listener is not running on node: sdrac01
    • ONS daemon is running on node: sdrac01
    • ASM Status Check for 11g RAC
    • $ srvctl status asm -n sdrac01
    • ASM instance +ASM1 is not running on node sdrac01.
  • 16. Don’t forget about CRS_STAT
    • CRS_STAT useful for quick check for 11g RAC!
    • $ crs_stat -t
    • Name Type Target State Host
    • ----------------------------------------------------------
    • ora....B1.inst application ONLINE OFFLINE
    • ora....B2.inst application ONLINE OFFLINE
    • ora....ux1.gsd application ONLINE ONLINE sdrac01
    • ora....ux1.ons application ONLINE ONLINE sdrac01
    • ora....ux1.vip application ONLINE OFFLINE
    • ora....t1.inst application ONLINE OFFLINE
    • ora.test.db application OFFLINE OFFLINE
    • ora....t1.inst application ONLINE OFFLINE
  • 17. 11g Checks for ASM with RAC
    • 11g ASM has new features but still mostly the same as far as monitoring is concerned.
    • Check at the operating system level to ensure all critical 11g ASM processes are up and running:
    • $ ps -ef|grep asm
    • oracle 23471 1 0 01:46 ? 00:00:00 asm_pmon_+ASM1
    • oracle 23483 1 1 01:46 ? 00:00:00 asm_diag_+ASM1
    • oracle 23485 1 0 01:46 ? 00:00:00 asm_psp0_+ASM1
    • oracle 23494 1 1 01:46 ? 00:00:00 asm_lmon_+ASM1
    • oracle 23496 1 1 01:46 ? 00:00:00 asm_lmd0_+ASM1
    • oracle 23498 1 1 01:46 ? 00:00:00 asm_lms0_+ASM1
    • oracle 23534 1 0 01:46 ? 00:00:00 asm_mman_+ASM1
    • oracle 23536 1 1 01:46 ? 00:00:00 asm_dbw0_+ASM1
    • oracle 23546 1 0 01:46 ? 00:00:00 asm_lgwr_+ASM1
    • oracle 23553 1 0 01:46 ? 00:00:00 asm_ckpt_+ASM1
    • oracle 23561 1 0 01:46 ? 00:00:00 asm_smon_+ASM1
    • oracle 23570 1 0 01:46 ? 00:00:00 asm_rbal_+ASM1
    • oracle 23572 1 0 01:46 ? 00:00:00 asm_gmon_+ASM1
    • oracle 23600 1 0 01:47 ? 00:00:00 asm_lck0_+ASM1
  • 18. More checks for 11g ASM
    • Use the ASMCMD command to check status for 11g ASM with RAC
    • The ls and lsdg commands provide a summary of the ASM configuration
    • $ asmcmd
    • ASMCMD> ls
    • MY_DG1/
    • MY_DG2/
    • ASMCMD> lsdg
    • State Type Rebal Unbal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Name
    • MOUNTED EXTERN N N 512 4096 1048576 3920 1626 0 1626 0 MY_DG1/
    • MOUNTED EXTERN N N 512 4096 1048576 3920 1408 0 1408 0 MY_DG2/
  • 19. SQL*Plus with 11g ASM
    • Useful query to check status for 11g ASM with RAC from SQL*PLUS:
    • SQL> select name, path, state from v$asm_disk;
    • NAME PATH STATE
    • ------------------------- -------------------- ----------
    • MY_DG1_0001 /dev/raw/raw12 NORMAL
    • MY_DG1_0000 /dev/raw/raw11 NORMAL
    • MY_DG1_0002 /dev/raw/raw13 NORMAL
    • MY_DG2_0000 /dev/raw/raw15 NORMAL
    • MY_DG2_0001 /dev/raw/raw16 NORMAL
    • MY_DG1_0003 /dev/raw/raw14 NORMAL
  • 20. Healthchecks- OCR and Votedisk for 11g RAC
  • 21. Quick Review- 11g RAC Concepts OCR and Vote Disk
    • What is the OCR?
    • The Oracle Cluster Registry (OCR) holds cluster and database configuration information for RAC and Cluster Ready Services (CRS), such as the cluster node list, the cluster database instance-to-node mapping, and CRS application resource profiles.
    • The OCR must be stored on either shared raw devices or OCFS/OCFS2 (Oracle Cluster Filesystem)
    • What is the Voting Disk?
    • The Voting disk manages cluster node membership and must be stored on either shared raw disk or OCFS/OCFS2 cluster filesystem.
  • 22. OCR and Vote Disk Health Check
    • Without the OCR and Vote Disk 11g RAC will fail!
    • Useful health checks for OCR with OCRCHECK command:
    • $ ocrcheck
    • Status of Oracle Cluster Registry is as follows :
    • Version : 2
    • Total space (kbytes) : 297084
    • Used space (kbytes) : 3848
    • Available space (kbytes) : 293236
    • ID : 2007457116
    • Device/File Name : /dev/raw/raw5
    • Device/File integrity check succeeded
    • Device/File Name : /dev/raw/raw6
    • Device/File integrity check succeeded
    • Cluster registry integrity check succeeded
  • 23. Healthcheck for Vote Disk
    • Use the CRSCTL command:
    • $ crsctl query css votedisk
    • 0. 0 /dev/raw/raw7
    • 1. 0 /dev/raw/raw8
    • 2. 0 /dev/raw/raw9
    • located 3 votedisk(s).
  • 24. Problems and Solutions: 11g RAC
  • 25. 11g RAC Problems and Solutions
    • Missing Clusterware resources offline
    • Failed or corrupted vote disk
    • Failed or corrupted OCR disks
    • RAC node reboot issues
    • Hardware, Storage, Network problems with RAC
  • 26. Root Cause Analysis 11g RAC
    • First step- locate and examine 11g RAC log files.
    • Metalink Note 781632.1 and 311321.1 are useful
    • CRS_HOME Log Files
    • $CRS_HOME/log/<nodename>/racg contains log files for the VIP and ONS resources
    • RDBMS_HOME log files are under ORACLE_HOME/log/<nodename>/racg
    • Example:
    • /u01/app/oracle/product/11.1.0/db_1/log/sdrac01/racg
    • Errors are reported to imon<DB_NAME>.log files
    • $ view imon.log
    • 2009-03-15 21:39:38.497: [ RACG][3002129328] [13876][3002129328][ora.RACDB.RACDB2.inst]: clsrfdbe_enqueue: POST_ALERT() failed: evttypname='down' type='1' resource='ora.RACDB.RACDB2.inst' node='sdrac01' time='2009-03-15 21:39:36.0 -05:00' card=0
    • 2009-03-15 21:40:08.521: [ RACG][3002129328] [13876][3002129328][ora.RACDB.RACDB2.inst]: CLSR-0002: Oracle error encountered while executing DISCONNECT
    • 2009-03-15 21:40:08.521: [ RACG][3002129328] [13876][3002129328][ora.RACDB.RACDB2.inst]: ORA-03114: not connected to ORACLE
  • 27. 11g RAC Log Files
    • ASM Log Files for 11g RAC root cause analysis
    • ASM_HOME/log/<nodename>/racg if ASM runs from its own home; otherwise these logs are located under the RDBMS_HOME
    • ASM log files for 11g RAC analysis follow the naming convention ora.<nodename>.asm.log
    • $ view ora.sdrac01.ASM1.asm.log
  • 28. 11g RAC ASM Log File
    • $ view ora.sdrac01.ASM1.asm.log
    • 2009-03-15 21:40:03.725: [ RACG][3086936832] [11200][3086936832][ora.sdrac01.ASM1.asm]: Real Application Clusters, Oracle Label Security, OLAP
    • and Data Mining Scoring Engine options
    • SQL> ASM instance shutdown
    • SQL> Disconnected from Oracle Database 10g Enterprise Edition Release 11.1.0.6.0 – Production
    • With the Partitioning, Real Application
    • 2009-03-15 21:40:03.725: [ RACG][3086936832] [11200][3086936832][ora.sdrac01.ASM1.asm]: Clusters, Oracle Label Security, OLAP
    • and Data Mining Scoring Engine options
  • 29. Missing Clusterware resources offline
    • A common problem is being unable to start Clusterware resources
    • The crs_stat -t output shows the VIP is OFFLINE, and trying to start it gives the error CRS-0215: Could not start resource 'ora.dbtest2.vip'. Example:
    • crs_stat -t
    • Name           Type        Target   State    Host
    • ------------------------------------------------------------
    • ora....st2.gsd application ONLINE   ONLINE   rac01
    • ora....st2.ons application ONLINE   ONLINE   rac01
    • ora....st2.vip application ONLINE   OFFLINE
  • 30. Offline Clusterware Resources
    • [root@sdrac01]# ./srvctl start nodeapps -n sdrac01
    • sdrac01:ora.sdrac01.vip:Interface eth0 checked failed (host=sdrac01.ben.com)
    • sdrac01:ora.sdrac01.vip:Invalid parameters, or failed to bring up VIP (host=sdrac01.ben.com)
    • sdrac01:ora.sdrac01.vip:Interface eth0 checked failed (host=sdrac01.ben.com)
    • sdrac01:ora.sdrac01.vip:Invalid parameters, or failed to bring up VIP (host=sdrac01.ben.com)
    • CRS-1006: No more members to consider
    • CRS-0215: Could not start resource 'ora.sdrac01.vip'.
    • sdrac01:ora.sdrac01.vip:Interface eth0 checked failed (host=sdrac01.ben.com)
    • sdrac01:ora.sdrac01.vip:Invalid parameters, or failed to bring up VIP (host=sdrac01.ben.com)
    • sdrac01:ora.sdrac01.vip:Interface eth0 checked failed (host=sdrac01.ben.com)
    • sdrac01:ora.sdrac01.vip:Invalid parameters, or failed to bring up VIP (host=sdrac01.ben.com)
    • CRS-1006: No more members to consider
    • CRS-0215: Could not start resource 'ora.sdrac01.LISTENER_SDRAC01.lsnr'.
  • 31. Solution for Offline Clusterware Resources
    • Metalink Notes 781632.1 and 356535.1 have some good troubleshooting advice with failed CRS resources.
    • First, we need to diagnose current settings for VIP:
    • [root@sdrac011 bin]# ./srvctl config nodeapps -n sdrac01 -a -g -s -l
    • VIP exists.: /sdrac01-vip.ben.com/192.168.203.111/255.255.255.0/eth0
    • GSD exists.
    • ONS daemon exists.
    • Listener exists.
    • Start debugging the failed resources by either setting the environment variable _USR_ORA_DEBUG=1 in the script $ORA_CRS_HOME/bin/racgvip or using the crsctl debug command shown in the example below:
    • # ./crsctl debug log res "ora.sdrac01.vip:5"
    • Set Resource Debug Module: ora.sdrac01.vip Level: 5
  • 32. Useful debug output 11g RAC VIP Issue
    • # ./srvctl start nodeapps -n sdrac01
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:43 EDT 2009 [ 27550 ] Broadcast = 192.168.203.255
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:43 EDT 2009 [ 27550 ] Checking interface existance
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:43 EDT 2009 [ 27550 ] Calling getifbyip
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:43 EDT 2009 [ 27550 ] getifbyip: started for 192.168.203.111
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:43 EDT 2009 [ 27550 ] Completed getifbyip
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:43 EDT 2009 [ 27550 ] Calling getifbyip -a
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:43 EDT 2009 [ 27550 ] getifbyip: started for 192.168.203.111
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:43 EDT 2009 [ 27550 ] Completed getifbyip
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:44 EDT 2009 [ 27550 ] Completed with initial interface test
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:44 EDT 2009 [ 27550 ] Interface tests
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:44 EDT 2009 [ 27550 ] checkIf: start for if=eth0
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:44 EDT 2009 [ 27550 ] /sbin/mii-tool eth0 error
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:44 EDT 2009 [ 27550 ] defaultgw: started
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:44 EDT 2009 [ 27550 ] defaultgw: completed with 192.168.203.2
    • sdrac01:ora.sdrac01.vip:Wed Apr 8 01:12:47 EDT 2009 [ 27550 ] checkIf: ping and RX packets checked if=eth0 failed
    • sdrac01:ora.sdrac01.vip:Interface eth0 checked failed (host=sdrac01.us.oracle.com)
  • 33. Failed VIP Resource 11g RAC
    • Start the VIP again with srvctl start nodeapps. This writes a log for the VIP startup problem under $ORA_CRS_HOME/log/<nodename>/racg (*vip.log)
    • Review the log files
    • # cd /u01/app/oracle/product/11.1.0/crs/log/sdrac01/racg
    • [root@sdrac01 racg]# ls
    • evtf.log ora.sdrac01.ons.log ora.test.db.log racgmain
    • ora.RACDB.db.log ora.sdrac01.vip.log racgeut
    • ora.sdrac01.gsd.log ora.target.db.log racgevtf
    • Turn off debugging with the command:
    • # ./crsctl debug log res "ora.sdrac01.vip:0"
  • 34. Example: 11g RAC Resource Offline
    • # view ora.sdrac01.vip.log
    • 2009-04-08 00:45:36.447: [ RACG][3086936832] [22614][3086936832][ora.sdrac01.vip]: clsrcexecut: rc = 1, time = 6.210s
    • 2009-04-08 00:45:42.765: [ RACG][3086936832] [22614][3086936832][ora.sdrac01.vip]: Interface eth0 checked failed (host=sdrac01.us.oracle.com)
    • Invalid parameters, or failed to bring up VIP (host=sdrac01.us.oracle.com)
    • 2009-04-08 00:45:42.765: [ RACG][3086936832] [22614][3086936832][ora.sdrac01.vip]: clsrcexecut: env ORACLE_CONFIG_HOME=/u01/app/oracle/product/11.1.0/crs
    • 2009-04-08 00:45:42.765: [ RACG][3086936832] [22614][3086936832][ora.sdrac01.vip]: clsrcexecut: cmd = /u01/app/oracle/product/11.1.0/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /u01/app/oracle/product/11.1.0/crs/bin/racgvip check sdrac01
    • 2009-04-08 00:45:42.765: [ RACG][3086936832] [22614][3086936832][ora.sdrac01.vip]: clsrcexecut: rc = 1, time = 6.320s
    • 2009-04-08 00:45:42.765: [ RACG][3086936832] [22614][3086936832][ora.sdrac01.vip]: end for resource = ora.sdrac01.vip, action = start, status = 1, time = 12.560s
  • 35. Solution for Offline VIP Resource
    • Stop nodeapps with srvctl stop nodeapps -n sdrac01
    • Log in as root and edit $ORA_CRS_HOME/bin/racgvip
    • Change the value of the variable FAIL_WHEN_DEFAULTGW_NOT_FOUND to 0
    • Start nodeapps with srvctl start nodeapps -n sdrac01 and you should see the resources ONLINE (a consolidated sketch follows below)
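    • A sketch of that fix run as root, assuming the CRS home used earlier in this presentation and GNU sed on Linux:
    • srvctl stop nodeapps -n sdrac01
    • cp $ORA_CRS_HOME/bin/racgvip $ORA_CRS_HOME/bin/racgvip.bak     # keep a backup of the script
    • sed -i 's/^FAIL_WHEN_DEFAULTGW_NOT_FOUND=.*/FAIL_WHEN_DEFAULTGW_NOT_FOUND=0/' $ORA_CRS_HOME/bin/racgvip
    • srvctl start nodeapps -n sdrac01
    • crs_stat -t | grep vip     # the VIP resource should now show ONLINE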
  • 36. Failed or Corrupted Vote Disk
    • Best practice- multiple copies of vote disk on different disk volumes to eliminate single point of failure (SPOF).
    • Metalink Note 279793.1 has tips on vote disk for RAC
    • Make sure you take backups with dd utility (UNIX/Linux) or ocopy utility (Windows)
    • Take frequent backups; when using dd on Linux/UNIX, use a 4k block size to ensure complete voting disk blocks are backed up (see the sketch below)
    • Without a backup you must re-install CRS!
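    • A hedged sketch of such a dd backup on Linux, using the raw voting devices listed earlier by crsctl query css votedisk (device and destination paths are examples):
    • # list the voting disks, then copy each one with a 4k block size
    • crsctl query css votedisk
    • dd if=/dev/raw/raw7 of=/backup/votedisk_raw7.bak bs=4k
    • dd if=/dev/raw/raw8 of=/backup/votedisk_raw8.bak bs=4k
    • dd if=/dev/raw/raw9 of=/backup/votedisk_raw9.bak bs=4k
    • On Windows, ocopy performs the equivalent copy of the voting disk.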
  • 37. Failed or corrupted OCR disks
    • Best practice- maintain frequent backups of OCR on separate disk volumes to avoid single point of failure (SPOF)
    • OCRCONFIG utility to perform recovery
    • Metalink Notes 220970.1 , 428681.1 and 390880.1 are useful
    • Find backup for OCR
  • 38. Recover OCR from backup
    • # ./ocrconfig
    • Name:
    • ocrconfig - Configuration tool for Oracle Cluster Registry.
    • Synopsis:
    • ocrconfig [option]
    • option:
    • -export <filename> [-s online]
    • - Export cluster register contents to a file
    • -import <filename> - Import cluster registry contents from a file
    • -upgrade [<user> [<group>]]
    • - Upgrade cluster registry from previous version
    • -downgrade [-version <version string>]
    • - Downgrade cluster registry to the specified version
    • -backuploc <dirname> - Configure periodic backup location
    • -showbackup - Show backup information
    • -restore <filename> - Restore from physical backup
    • -replace ocr|ocrmirror [<filename>] - Add/replace/remove a OCR device/file
    • -overwrite - Overwrite OCR configuration on disk
    • -repair ocr|ocrmirror <filename> - Repair local OCR configuration
    • -help - Print out this help information
    • Note:
    • A log file will be created in
    • $ORACLE_HOME/log/<hostname>/client/ocrconfig_<pid>.log. Please ensure
    • you have file creation privileges in the above directory before
    • running this tool.
  • 39. Using OCRCONFIG to recover a lost OCR
    • First we need to find our backups of the OCR with the ocrconfig utility
    • # ./ocrconfig -showbackup
    • rac01 2009/04/07 23:01:40 /u01/app/oracle/product/11.1.0/crs/cdata/crs
    • rac01 2009/04/07 19:01:39 /u01/app/oracle/product/11.1.0/crs/cdata/crs
    • rac01 2009/04/07 01:40:31 /u01/app/oracle/product/11.1.0/crs/cdata/crs
    • rac01 2009/04/06 21:40:30 /u01/app/oracle/product/11.1.0/crs/cdata/crs
    • rac01 2009/04/03 14:12:46 /u01/app/oracle/product/11.1.0/crs/cdata/crs
  • 40. Recovering a lost or corrupt OCR
    • We check the status of OCR backups:
    • $ ls -l
    • total 24212
    • -rw-r--r-- 1 oracle oinstall 2949120 Aug 29 2008 backup00.ocr
    • -rw-r--r-- 1 oracle oinstall 2949120 Aug 21 2008 backup01.ocr
    • -rw-r--r-- 1 oracle oinstall 2949120 Aug 20 2008 backup02.ocr
    • -rw-r--r-- 1 root root 2949120 Apr 4 19:26 day_.ocr
    • -rw-r--r-- 1 oracle oinstall 2949120 Aug 29 2008 day.ocr
    • -rw-r--r-- 1 root root 4116480 Apr 7 23:01 temp.ocr
    • -rw-r--r-- 1 oracle oinstall 2949120 Aug 29 2008 week_.ocr
    • -rw-r--r-- 1 oracle oinstall 2949120 Aug 19 2008 week.ocr
    • Next we use OCRCONFIG -restore to recover the lost or corrupted OCR from a valid backup
    • $ ocrconfig -restore backup00.ocr
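    • Putting the pieces together, a sketch of a complete restore, run as root with Clusterware stopped on every node first (the backup path matches the -showbackup output above):
    • # stop CRS on all nodes, restore the registry, then restart and verify
    • crsctl stop crs
    • ocrconfig -restore /u01/app/oracle/product/11.1.0/crs/cdata/crs/backup00.ocr
    • crsctl start crs
    • ocrcheck                  # integrity check should succeed
    • cluvfy comp ocr -n all    # optional cluster-wide verification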
  • 41. 11g RAC node reboot issues
    • What causes node reboots in 11g RAC?
    • Root cause can be difficult to diagnose
    • Can be due to network and storage issues
    • Metalink Note 265769.1 is a good reference point for node reboot issues and provides a useful decision tree for these issues with RAC.
    • If there is an ocssd.bin problem or failure, the oprocd daemon detects a scheduling problem, or some other fatal problem occurs, a node will reboot in a RAC cluster. This functionality is used for I/O fencing, to ensure that writes from I/O-capable clients can be cleared, avoiding potential corruption scenarios in the event of a network split, node hang, or some other fatal event.
  • 42. 11g RAC Clusterware Processes – Node Reboot Issues
    • When the ocssd.bin process dies, it notifies the oprocd process to shoot the node in the head and cause the node to reboot (STONITH).
    • OCSSD (aka CSS daemon) - This process is spawned in init.cssd. It runs in both vendor clusterware and non-vendor clusterware environments and is armed with a node kill via the init script. OCSSD's primary job is internode health monitoring and RDBMS instance endpoint discovery. It runs as the Oracle user.
    • INIT.CSSD - In a normal environment, init spawns init.cssd, which in turn spawns OCSSD as a child. If ocssd dies or is killed, the node kill functionality of the init script will kill the node.
    • OPROCD - This process is spawned in any non-vendor clusterware environment, except on Windows, where Oracle uses a kernel driver to perform the same actions, and on Linux prior to version 10.2.0.4. If oprocd detects problems, it will kill a node via C code. It is spawned in init.cssd and runs as root. This daemon is used to detect hardware and driver freezes on the machine. If a machine were frozen for long enough that the other nodes evicted it from the cluster, it needs to kill itself to prevent any IO from getting reissued to the disk after the rest of the cluster has remastered locks.
    • OCLSOMON (10.2.0.2 and above) - This process monitors the CSS daemon for hangs or scheduling issues and can reboot a node if there is a perceived hang.
    • Data collection is vital
    • OSWatcher tool: Metalink Notes 301137.1 and 433472.1 have the details on how to set up this diagnostic tool for Linux/UNIX and Windows
  • 43. Root Cause: Node Reboots 11g RAC
    • Find the process that caused the node to reboot
    • Review all log and trace files to determine failed process for 11g RAC
    • Messages file locations: Sun /var/adm/messages; HP-UX /var/adm/syslog/syslog.log; Tru64 /var/adm/messages; Linux /var/log/messages; IBM /bin/errpt -a > messages.out
    • CSS log locations: 11.1 and 10.2 under <CRS_HOME>/log/<node name>/cssd; 10.1 under <CRS_HOME>/css/log
    • Oprocd log locations: /etc/oracle/oprocd or /var/opt/oracle/oprocd, depending on version/platform
  • 44. 11g RAC Log Files for Troubleshooting
    • For 10.2 and above, all files are under <CRS_HOME>/log. For 10.1: <CRS_HOME>/crs/log, <CRS_HOME>/crs/init, <CRS_HOME>/css/log, <CRS_HOME>/css/init, <CRS_HOME>/evm/log, <CRS_HOME>/evm/init, <CRS_HOME>/srvm/log
    • Useful tool called RAC DDT to collect all 11g RAC log and trace files
    • Metalink Note 301138.1 covers use of RAC DDT
    • Also important to collect OS and network information:
    • netstat, iostat, vmstat, and ping outputs from the 11g RAC cluster nodes (a simple collection loop is sketched below)
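    • If OSWatcher cannot be installed, a lightweight loop such as the sketch below captures the same class of OS data around a reboot (the output directory and peer interconnect address are hypothetical):
    • #!/bin/bash
    • # sample OS and network statistics once a minute into timestamped files
    • OUTDIR=/tmp/rac_osstats; PEER=10.10.10.2     # example private address of the other node
    • mkdir -p $OUTDIR
    • while true; do
    •   ts=$(date +%Y%m%d_%H%M%S)
    •   { vmstat 1 5; iostat -x 1 5; netstat -s; ping -c 3 $PEER; } > $OUTDIR/os_$ts.log 2>&1
    •   sleep 60
    • done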
  • 45. OCSSD Reboots and 11g RAC
    • Network failure or latency between nodes. It would take at least 30 consecutive missed checkins to cause a reboot, with heartbeats issued once per second. Example of missed checkins in the CSS log:
    • WARNING: clssnmPollingThread: node <node> (1) at 50% heartbeat fatal, eviction in 29.100 seconds
    • Review messages file to determine root cause for OCSSD failures.
    • If the messages file reboot time < missed checkin time then the node eviction was likely not due to these missed checkins. If the messages file reboot time > missed checkin time then the node eviction was likely a result of the missed checkins.
    • Problems writing to or reading from the CSS voting disk. Check the CSS logs:
    • ERROR: clssnmDiskPingMonitorThread: voting device access hanging (160008 miliseconds)
    • High load averages due to lack of CPU resources.
    • Misconfiguration of CRS (see Metalink Note 265769.1 for the possible misconfigurations). A CSS log grep sketch for the symptoms above follows this list.
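    • To spot the heartbeat and voting-disk symptoms quickly, grep the CSS daemon log; a sketch using the 10.2/11.1 log layout shown earlier in this session (the node directory is the short hostname):
    • grep -i "heartbeat fatal" $ORA_CRS_HOME/log/$(hostname -s)/cssd/ocssd.log
    • grep -i "voting device access hanging" $ORA_CRS_HOME/log/$(hostname -s)/cssd/ocssd.log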
  • 46. OPROCD Failure and Node Reboots
    • Four things can cause OPROCD to fail and the node to reboot with 11g RAC:
    • 1) An OS scheduler problem.
    • 2) The OS is getting locked up in a driver or hardware issue.
    • 3) Excessive amounts of load on the machine, thus preventing the scheduler from behaving reasonably.
    • 4) An Oracle bug such as Bug 5015469
  • 47. OCLSOMON- RAC Node Reboot
    • Four root causes of OCLSOMON process failure can lead to an 11g RAC node reboot:
    • 1) Hung threads within the CSS daemon.
    • 2) OS scheduler problems
    • 3) Excessive amounts of load on the machine
    • 4) Oracle bugs
  • 48. Hardware, Storage, Network problems
    • Check the certification matrix on Metalink for supported network driver, storage, and firmware releases with 11g RAC.
    • Develop close working relationship with system and network team. Educate them on RAC.
    • System utilities such as ifconfig, netstat, ping, and traceroute are essential for diagnosis and root cause analysis.
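    • A quick triage sketch using those utilities (the interconnect interface name and peer address are examples only):
    • ifconfig bond0                   # example private interconnect interface
    • netstat -i                       # watch for RX/TX errors and drops
    • ping -c 5 -s 1472 10.10.10.2     # largest payload that fits a 1500-byte MTU without fragmentation
    • traceroute 10.10.10.2            # the interconnect should be a single direct hop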
  • 49. Summary
    • What happened to my 11g RAC Clusterware?
    • Failed resources in 11g RAC Clusterware
    • Upgrade and Migration issues for 11g RAC and Clusterware
    • Patch Upgrade issues with Clusterware
    • Node eviction issues
  • 50. Tuning 11g RAC
  • 51. Solving critical tuning issues for RAC
    • Tune for single instance first and then RAC
    • Interconnect Performance Tuning
    • Cluster related wait issues
    • Lock/Latch Contention
    • Parallel tuning tips for RAC
    • ASM Tuning for RAC
  • 52. Interconnect Tuning for 11g RAC
    • Invest in best network for 11g RAC Interconnect
    • Infiniband offers robust performance
    • The majority of performance problems in 11g RAC are due to a poorly sized interconnect network (see the GV$CLUSTER_INTERCONNECTS check below)
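    • To confirm which interface and address each instance actually uses for the interconnect, query GV$CLUSTER_INTERCONNECTS; a minimal sketch (connecting as SYSDBA is an assumption):
    • $ sqlplus -s / as sysdba <<'EOF'
    • -- one row per instance: interface name, IP, whether it is public, and where the setting came from
    • SELECT inst_id, name, ip_address, is_public, source
    • FROM   gv$cluster_interconnects
    • ORDER  BY inst_id;
    • EOF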
  • 53. DBA Toolkit for 11g RAC
  • 54. DBA 101 Toolkit for 11g RAC
    • Oracle 11g DBA Tools:
    • Oracle 11g ADDM
    • Oracle 11g AWR
    • Oracle 11g Enterprise Manager/Grid Control
    • Operating System Tools
  • 55. Using Oracle 11g Tools
    • ADDM and AWR now provide RAC specific monitoring checks and reports
    • AWR Report Sample for 11g RAC via OEM Grid Control or awrrpt.sql
    • SQL> @?/rdbms/admin/awrrpt.sql
    • WORKLOAD REPOSITORY report for
    • DB Name DB Id Instance Inst Num Startup Time Release RAC
    • ---- ------ --------- ----- --------------- -------- -----
    • RACDB 2057610071 RAC01 1 20-Jan-09 20:50 11.1.0.7.0 YES
    • Host Name Platform CPUs Cores Sockets Memory(GB)
    • ------- ---------- ---- ----- ------- ----------
    • sdrac01 Linux x86 64-bit 8 8 4 31.49
    • Snap Id Snap Time Sessions Curs/Sess
    • --------- ------------------- -------- ---------
    • Begin Snap: 12767 21-Jan-09 00:00:06 361 25.9
    • End Snap: 12814 21-Jan-09 08:40:09 423 22.0
    • Elapsed: 520.05 (mins)
    • DB Time: 102,940.70 (mins)
  • 56. Using AWR with 11g RAC
    • We want to examine the following areas from AWR for 11g RAC Performance:
    • RAC Statistics DB/Inst: RACDB/RAC01 Snaps: 12767-12814
    • Begin End
    • ----- -----
    • Number of Instances: 3 3
    • Global Cache Load Profile
    • ~~~~~~~~~~~~~~~~~~~~~~~~~ Per Second Per Transaction
    • --------------- ---------------
    • Global Cache blocks received: 88.89 2.41
    • Global Cache blocks served: 92.32 2.51
    • GCS/GES messages received: 906.54 24.63
    • GCS/GES messages sent: 755.21 20.52
    • DBWR Fusion writes: 5.56 0.15
    • Estd Interconnect traffic (KB) 1,774.22
  • 57. AWR for 11g RAC (Continued)
    • Global Cache Efficiency Percentages (Target local+remote 100%)
    • ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    • Buffer access - local cache %: 99.59
    • Buffer access - remote cache %: 0.12
    • Buffer access - disk %: 0.29
    • Global Cache and Enqueue Services - Workload Characteristics
    • ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    • Avg global enqueue get time (ms): 2.7
    • Avg global cache cr block receive time (ms): 3.2
    • Avg global cache current block receive time (ms): 1.1
    • Avg global cache cr block build time (ms): 0.0
    • Avg global cache cr block send time (ms): 0.0
    • Global cache log flushes for cr blocks served %: 11.3
    • Avg global cache cr block flush time (ms): 29.4
    • Avg global cache current block pin time (ms): 11.6
    • Avg global cache current block send time (ms): 0.1
    • Global cache log flushes for current blocks served %: 0.3
    • Avg global cache current block flush time (ms): 61.8
    • Global Cache and Enqueue Services - Messaging Statistics
    • ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    • Avg message sent queue time (ms): 4902.6
    • Avg message sent queue time on ksxp (ms): 1.2
    • Avg message received queue time (ms): 0.1
    • Avg GCS message process time (ms): 0.0
    • Avg GES message process time (ms): 0.0
    • % of direct sent messages: 70.13
    • % of indirect sent messages: 28.36
    • % of flow controlled messages: 1.51
    • -------------------------------------------------------------
    • Cluster Interconnect
    • ~~~~~~~~~~~~~~~~~~~~
    • Begin End
    • -------------------------------------------------- -----------
    • Interface IP Address Pub Source IP Pub Src
    • ---------- --------------- --- ------------------------------ --- --- ---
    • bond0 10.10.10.1 N Oracle Cluster Repository
  • 58. Interconnect Performance 11g RAC
    • Interconnect performance is key to identifying performance issues with 11g RAC!
    • Interconnect Throughput by Client DB/Inst: RACDB/RAC01 Snaps: 12767-12814
    • -> Throughput of interconnect usage by major consumers.
    • -> All throughput numbers are megabytes per second
    • Send Receive
    • Used By Mbytes/sec Mbytes/sec
    • ---------------- ----------- -----------
    • Global Cache .72 .69
    • Parallel Query .01 .01
    • DB Locks .16 .17
    • DB Streams .00 .00
    • Other .02 .02
    • -------------------------------------------------------------
    • Interconnect Device Statistics DB/Inst: RACDB/RAC01 Snaps: 12767-12814
    • -> Throughput and errors of interconnect devices (at OS level).
    • -> All throughput numbers are megabytes per second
    • Device Name IP Address Public Source
    • --------------- ---------------- ------ -------------------------------
    • Send Send
    • Send Send Send Buffer Carrier
    • Mbytes/sec Errors Dropped Overrun Lost
    • ----------- -------- -------- -------- --------
    • Receive Receive
    • Receive Receive Receive Buffer Frame
    • Mbytes/sec Errors Dropped Overrun Errors
    • ----------- -------- -------- -------- --------
    • bond0 10.10.10.1 NO Oracle Cluster Repository
    • 1.43 0 0 0 0
    • 1.44 0 0 0 0
    • -------------------------------------------------------------
    • End of Report
  • 59. ADDM for 11g RAC
    • ADDM has a nicer interface than AWR and is available via OEM Grid Control or the addmrpt.sql script.
    • SQL> @?/rdbms/admin/addmrpt.sql
    • ----------------------------------
    • Analysis Period
    • ---------------
    • AWR snapshot range from 12759 to 12814.
    • Time period starts at 20-JAN-09 10.40.17 PM
    • Time period ends at 21-JAN-09 08.40.10 AM
    • Analysis Target
    • ---------------
    • Database 'RACDB' with DB ID 2057610071.
    • Database version 11.1.0.7.0.
    • ADDM performed an analysis of instance RAC01, numbered 1 and hosted at
    • sdrac01
    • Activity During the Analysis Period
    • -----------------------------------
    • Total database time was 7149586 seconds.
    • The average number of active sessions was 198.64.
    • Summary of Findings
    • -------------------
    • Description Active Sessions Recommendations
    • Percent of Activity
    • ---------------------------- ------------------- ---------------
    • 1 Unusual "Network" Wait Event 192.91 | 97.12 3
  • 60. Operating System Tools for 11g RAC
    • Strace for Linux
    • # ps -ef|grep crsd
    • root 2853 1 0 Apr05 ? 00:00:00 /u01/app/oracle/product/11.1.0/crs/bin/crsd.bin reboot
    • root 20036 2802 0 01:53 pts/3 00:00:00 grep crsd
    • [root@sdrac01 bin]# strace -p 2853
    • Process 2853 attached - interrupt to quit
    • futex(0xa458bbf8, FUTEX_WAIT, 7954, NULL
    • Truss for Solaris
    • Both are excellent OS-level tracing tools for finding out exactly what a specific Oracle 11g RAC process is doing.
  • 61. Questions?
    • Are there any questions?
    • I’ll also be available in the Oracle ACE lodge
    • You can also send me your questions:
    • Email: ben@ben-oracle.com
  • 62. Conclusion
    • Thank you very much!
    • Please complete your evaluation form
    • Ben Prusinski [email_address]
    • Oracle 11g Real Application Clusters 101: Insider Tips and Tricks
    • My company- Ben Prusinski and Associates
    • http://www.ben-oracle.com
    • Oracle Blog
    • http://oracle-magician.blogspot.com/