Under the Hood of Oracle Clusterware | Miracle OpenWorld 2010 | 15-Apr-2010 | Alex Gorbachev, The Pythian Group
Alex Gorbachev    • CTO, The Pythian Group    • Blogger    • OakTable Network member    • Oracle ACE Director    • BattleA...
Why Companies Trust Pythian    • Recognized Leader:    •   Global industry-leader in remote database administration servic...
Agenda    • Place of Clusterware in Oracle RAC    • Node membership and evictions    • Clusterware startup sequence    • O...
Architecture [diagram: cluster nodes, each with an OS layer and a VIP ...]
[Diagram, built up over several slides: the Clusterware stack on top of the OS. Components shown: CSSD (Cluster Synchronization Services), CRSD (Cluster Ready Services), EVMD (Event Manager), RACG (HA Framework scripts), VIP ...]     © 2009/2010 Pythian
[Diagram sequence: two nodes, each with OS and Clusterware layers, alternating with single-node views that include a VIP; the labels "Shoot" and "Ask" appear during the build ...]
[Diagram: CSSD on each node, connected over the interconnect]
Evictions     • Network heartbeat lost     • Voting disk access lost     • CSSD is not healthy     • OS is not healthy ...
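The first trigger in the list, a lost network heartbeat, can be pictured as a simple timeout check. The sketch below is an illustration only, not CSSD's actual logic: the names `HeartbeatMonitor`, `record`, and `suspects`, and the `misscount_s` value of 30 seconds, are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy model of network-heartbeat (NHB) tracking on one node."""
    misscount_s: float                      # hypothetical timeout, in seconds
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Remember when a heartbeat from `node` last arrived."""
        self.last_seen[node] = now

    def suspects(self, now: float) -> list:
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.misscount_s
        )


monitor = HeartbeatMonitor(misscount_s=30.0)
monitor.record("node1", now=0.0)
monitor.record("node2", now=0.0)
monitor.record("node1", now=25.0)   # node1 keeps sending heartbeats
print(monitor.suspects(now=40.0))   # node2 has been silent for 40 s
```

With these made-up timings only `node2` exceeds the timeout, so it is the only candidate for eviction.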
DEMO     NHB failure       • Simulate with “ifconfig eth1 down”       • Both nodes notice the loss       • Racing to evict...
NHB failure symptoms     • NHB    failure on several nodes      •   ocssd.log     • Evicted    node can contain other trac...
DEMO     CSSD is not healthy     • Simulate using kill -STOP <cssd.bin pid>     • Another node observes NHB loss      •   ...
OCSSD sick - symptoms     • Error in OCLSOMON.log     • OCSSD log might be clean on evicted node     • syslog might contai...
DEMO     host sick - CPU stalled     • Simulate     by pausing OPROCD      •   kill -STOP <oprocd pid>      •   sleep 1 or...
Killed by OPROCD - symptoms     • Hard to confirm (nothing in oprocd.log)     • Console output often helps      •   “SysRq...
10g on Linux - hangcheck-timer     • Replaced  by OPROCD in 11g and 10.2.0.4+     • Most of the time useless and inactive!...
Killed by hangcheck-timer     • Rarely   can be confirmed      •   “Hangcheck: hangcheck is restarting the machine”     • ...
Clusterware startup     • Linux    & UNIX inittab      •   init.cssd      •   init.evmd      •   init.crsd     • Linux    ...
Daemons startup sequence      Third-party      clusterware                    CSSD                              • Triggere...
Startup in Linux & Unix     [gorby@dime ~]$ ps -fe | grep init. | grep -v grep     root      6352      1   0 10:24 ... /bi...
Startup flow [diagram; labels recovered from the build]: /etc/oracle/scls_scr/{host}/root/cssrun; init.cssd fatal; init.evmd run; init.crsd run; init.crs start; /etc/oracle/scls_sc... ...
DEMO     Startup troubleshooting     • Check processes using “ps -fe | grep init”     • Check syslog (/var/log/messages)  ...
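The process check in this demo can be scripted. The sketch below assumes the three wrapper names shown on the startup slides (init.cssd, init.evmd, init.crsd); the function name `missing_wrappers` and the sample `ps` excerpt, including its PIDs and paths, are invented for illustration.

```python
import re

EXPECTED_WRAPPERS = ("init.cssd", "init.evmd", "init.crsd")


def missing_wrappers(ps_output: str) -> list:
    """Return the init.* wrapper scripts that do not appear in `ps` output."""
    found = set(re.findall(r"init\.(?:cssd|evmd|crsd)\b", ps_output))
    return [name for name in EXPECTED_WRAPPERS if name not in found]


# Hypothetical `ps -fe` excerpt; PIDs, times, and paths are invented.
sample = """\
root  6352     1  0 10:24 ?  00:00:00 /bin/sh /etc/init.d/init.evmd run
root  6353     1  0 10:24 ?  00:00:00 /bin/sh /etc/init.d/init.cssd fatal
"""
print(missing_wrappers(sample))
```

On a live node the same function could be fed the text returned by `subprocess.run(["ps", "-fe"], capture_output=True, text=True).stdout`.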
Log files     • log/{host}/cssd/ocssd.log     • log/{host}/cssd/oclsomon/oclsomon.log      •   oclsomon.ba1, oclsomon.ba2,... ...
Windows world     • OPROCD  = OraFenceService     • EVMD = OracleEVMService     • CRSD = OracleCRService     • CSSD = Orac...
OS     Clusterware                       VIP                             • Passing    clusterware events                  ...
OS                                                        EVMD     Clusterware                       VIP                  ...
OS     Clusterware                       VIP                   RACG      EVMD                   CRSD                   CSS...
VIP     OS                                                         CRSD     Clusterware                         RACG      ...
CRSD startup     • After CSSD and EVMD     • Re-spawned on failure      •   No eviction     • Runs as root      •   V...
Oracle Cluster Registry                      • Repository      for all configuration data                         •   Exce...
CRS resources     • Standard       Oracle resources      •   ASM      •   Listener      •   VIP      •   Database and Inst...
CRS resource internals     • Unique name     • Associated action script      •   stop / start / check functions     • Othe...
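The stop / start / check contract of an action script can be sketched as a small dispatcher. This is an illustration, not a real Oracle action script: the resource is simulated with an in-memory flag, the class and function names are made up, and the convention that exit code 0 means success while non-zero means failure is the assumption the sketch relies on.

```python
class DemoResource:
    """In-memory stand-in for something Clusterware would manage."""

    def __init__(self):
        self.running = False

    def start(self) -> int:
        self.running = True
        return 0

    def stop(self) -> int:
        self.running = False
        return 0

    def check(self) -> int:
        # 0 means "resource is up"; any other value means "down".
        return 0 if self.running else 1


def run_action(resource: DemoResource, action: str) -> int:
    """Dispatch one of the three actions and return its exit code."""
    handlers = {
        "start": resource.start,
        "stop": resource.stop,
        "check": resource.check,
    }
    if action not in handlers:
        return 2          # unknown action: report failure
    return handlers[action]()


res = DemoResource()
print(run_action(res, "check"))   # resource not started yet
print(run_action(res, "start"))
print(run_action(res, "check"))
```

The caller only ever sees the exit code, which is why a check function that returns the wrong code makes a healthy resource look failed.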
DEMO     Resource profiles     • Use crs_stat [-t] to check status     • Use crs_stat -p to check attributes     • crs_* vs...
DEMO     OCR internals     • ocrcheck     • ocrconfig      •   used during install/upgrade      •   backup OCR      •   rec...
DEMO     racgvip case study     • Check the script     • Set env. vars and simulate the call     • Use _USR_ORA_DEBUG=1 in...
Resources hierarchy                                              CS                                                       ...
Resources and Oracle homes                                              CS                   DB Home           DB         ...
DEMO     troubleshooting resources     • {home}/log/{host}/racg/{resource_name}.log     • Old   way - edit racgwrap      •...
Troubleshooting summary     • crsctl check crs | crsd | cssd | evmd     • crs_stat [-t]     • crs_stat -p [{res_name}]    ...
Troubleshooting flow     • Is Clusterware up?     • Are Oracle resources up?      •   Listener & VIP      •   Database & ASM...
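The resource check in this flow can be partly automated by scanning `crs_stat` output. The sketch assumes the long listing format with `NAME=`, `TARGET=`, and `STATE=` lines and a blank line between resources; the function names, the sample text, and the resource names are invented for illustration.

```python
def parse_crs_stat(text: str) -> list:
    """Split NAME=/TARGET=/STATE= blocks into one dict per resource."""
    resources = []
    for block in text.strip().split("\n\n"):
        attrs = {}
        for line in block.splitlines():
            key, sep, value = line.partition("=")
            if sep:
                attrs[key.strip()] = value.strip()
        if attrs:
            resources.append(attrs)
    return resources


def not_where_expected(resources: list) -> list:
    """Names of resources whose target is ONLINE but whose state is not."""
    return [
        r.get("NAME", "?") for r in resources
        if r.get("TARGET") == "ONLINE"
        and not r.get("STATE", "").startswith("ONLINE")
    ]


# Invented sample; real output depends on the cluster.
sample = """\
NAME=ora.node1.vip
TYPE=application
TARGET=ONLINE
STATE=ONLINE on node1

NAME=ora.demo.db
TYPE=application
TARGET=ONLINE
STATE=OFFLINE
"""
print(not_where_expected(parse_crs_stat(sample)))
```

Comparing target with state, rather than looking at state alone, avoids flagging resources that were stopped on purpose.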
Enter the 11gR2 World - Grid Infrastructure     • Oracle Clusterware Administration and Deployment Guide     • My Oracle Support Note 1053147.1 ...
11g Grid Infrastructure Documentation     • Oracle Clusterware Administration and Deployment Guide     • MOS Note 1053147.1...
11gR2 Node Evictions     • Same      as in 10g + member kill escalation      •   LMON process may request CSS to remove an...
Questions?       Thank you!   http://www.pythian.com/   gorbachev@pythian.com
  • - Successful, growing business for more than 10 years
    - Served many customers with complex requirements/infrastructure just like yours
    - Operates globally with 24x7 “always awake” services
  • Clusterware is generic, with customizations for Oracle resources.
    Only Clusterware accesses the OCR and the voting disks (VD).
    Only DB instances access shared database files.
    The OCR is accessed by almost every Clusterware component; configuration is read from the OCR.
    VIP is part of Oracle Clusterware (OC).
    Emphasize shared access to data!
  • OPROCD (before 10.2.0.4: hangcheck-timer)
  • Node membership, and group membership for instances and ASM diskgroups
  • CSSD daemons cannot talk to each other -> operations are not synchronized -> uncoordinated access to shared data -> corruption
  • In addition to the network heartbeat (NHB), Oracle introduced the disk heartbeat (DHB).
    IO fencing is needed on split brain to keep the evicted node from doing any further IO.
    Oracle doesn't rely on any specific hardware; it needs to be compatible with all platforms and hardware.
  • Oracle can't shoot another node without remote control and can't rely on one type of IO fencing (HBA/SCSI reservations).
    What's left: beg the other node, "please shoot yourself!"
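The "please shoot yourself" idea in this note can be sketched as a request left in shared storage that the target node reads and obeys. This is a toy model under stated assumptions: `VotingDisk`, `request_eviction`, and `should_self_fence` are hypothetical names, and a Python dict stands in for the shared disk; it does not reflect the real voting disk format.

```python
class VotingDisk:
    """Stand-in for shared storage that every node can read and write."""

    def __init__(self):
        self._kill_requests = {}   # target node -> requesting node

    def request_eviction(self, requester: str, target: str) -> None:
        """Leave a message asking `target` to fence itself."""
        self._kill_requests[target] = requester

    def should_self_fence(self, node: str) -> bool:
        """A node polls this to learn whether it was asked to leave."""
        return node in self._kill_requests


disk = VotingDisk()
disk.request_eviction(requester="node1", target="node2")

for node in ("node1", "node2"):
    if disk.should_self_fence(node):
        print(f"{node}: eviction requested, rebooting myself")
    else:
        print(f"{node}: still a cluster member")
```

The design point is that the requester never touches the target directly; it only needs write access to storage that both nodes share.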
  • What if CSSD is not healthy? It's quite possible that the problem is not the network and that CSSD just doesn't reply for some reason. OCLSOMON comes onto the scene.
  • Worse yet, the whole node is sick and even OCLSOMON can't function properly, for example when CPU execution is stalled.
  • On losing access to the voting disks, CSSD commits suicide.
    Why? The cluster must have two communication paths, and the voting disk is the medium for IO fencing.
  • All nodes can reboot if the voting disk is lost.
    Good time to discuss voting disk redundancy: 1 vs 2 vs 3?
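For the 1 vs 2 vs 3 discussion, a small calculation helps. The sketch assumes the commonly described rule that a node must reach a strict majority of the configured voting disks to stay in the cluster; the function names are made up for illustration.

```python
def has_majority(accessible: int, configured: int) -> bool:
    """True if a node sees more than half of the configured voting disks."""
    return 2 * accessible > configured


def tolerated_failures(configured: int) -> int:
    """Largest number of lost voting disks that still leaves a majority."""
    return (configured - 1) // 2


for configured in (1, 2, 3):
    print(configured, "voting disk(s): survives loss of",
          tolerated_failures(configured))
```

Under that assumption two voting disks tolerate no more failures than one, while three tolerate the loss of a single disk.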
  • diagwait -> not set by default (assumed 0)
    reboottime -> 3 seconds
    margin = reboottime - diagwait

    See init.cssd for more details
  • When Clusterware autostart is disabled (crsstart -> disable), “init.cssd autostart” doesn't do anything. In this case a DBA can initiate the start later using “init.crs start” (10.1+) or crsctl start crs (10.2+).
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • When Clusterware autostart is disabled (crsstart -&gt; disable) then &amp;#x201C;init.cssd autostart&amp;#x201D; doesn&amp;#x2019;t do anything. In this case a DBA can initiate the start later using &amp;#x201C;init.crs start&amp;#x201D; (10.1+) or crsctl start crs (10.2+).\n
  • Configuration data: voting disks, ports, resource profiles (ASM, instances, listeners, VIPs, etc.).
  • DEMO - existing dependencies
  • DB is in CRS Home. Log files would be in the appropriate Oracle home: {home}/log/{host}/racg/{resource_name}.log. DEMO - log files and action script home match! DEMO - IMON logs
  • DEMO - stop DB + rename spfile + start DB (the old way, with the .cap file, if time permits)
  • DEMO - lsmodules

    1. Under the Hood of Oracle Clusterware. Miracle OpenWorld 2010, 15-Apr-2010. Alex Gorbachev, The Pythian Group
    2. Alex Gorbachev • CTO, The Pythian Group • Blogger • OakTable Network member • Oracle ACE Director • BattleAgainstAnyGuess.com • Vice-president, Oracle RAC SIG. © 2009/2010 Pythian
    3. Why Companies Trust Pythian • Recognized Leader: global industry leader in remote database administration services and consulting for Oracle, Oracle Applications, MySQL and SQL Server; works with over 150 multinational companies such as Forbes.com, Fox Interactive Media, and MDS Inc. to help manage their complex IT deployments • Expertise: one of the world’s largest concentrations of dedicated, full-time DBA expertise • Global Reach & Scalability: 24/7/365 global remote support for DBA and consulting, systems administration, special projects or emergency response
    4. Agenda • Place of Clusterware in Oracle RAC • Node membership and evictions • Clusterware startup sequence • Oracle Cluster Registry • Resources Management and troubleshooting • 11gR2 Grid Infrastructure
    5. Agenda [chart] "The more you understand, the less you need to memorize." Y-axis: Need to memorize (Low to High). X-axis: Understanding (Shallow to In-depth).
    6. Architecture [diagram, slides 6-7]: three nodes, each running VIP, Listener, Service, Instance, ASM and Clusterware on its own OS; the nodes are linked by the interconnect and reach the shared storage (OCR and voting disk) through storage access.
    8. [Diagram build, slides 8-15] The Clusterware stack inside one OS, introduced one component at a time: CSSD (Cluster Synchronization Services), CRSD (Cluster Ready Services), RACG and VIP (HA Framework scripts), EVMD (Event Manager), OPROCD (Oracle Process Monitor).
    16. [Diagram, slides 16-18] Two nodes, each running VIP, RACG, EVMD, CRSD, CSSD and OPROCD; the two CSSD daemons are linked by the interconnect.
    19. [Diagram, slides 19-20] The voting disk is added to the two-node picture.
    21. [Diagram] Caption: "Shoot The Other Node In The Head".
    22. [Diagram, slides 22-26] The same two nodes with the voting disk; on slide 26 only the CSSD of the second node remains.
    27. [Diagram] Caption: "Ask The Other Node To Reboot Itself" (c) known quote.
    28. [Diagram, slides 28-30] OCLSOMON appears next to CSSD on slide 29; slide 30 shows a single node.
    31. [Diagram, slides 31-35] One node plus the CSSD of its peer; a second OPROCD appears on slides 34-35.
    36. [Diagram, slides 36-39] Two nodes, interconnect and voting disk; on slide 39 only the CSSD of the second node remains.
    40. [Diagram, slides 40-43] Two nodes and the interconnect, with the voting disk shown only on slide 40; slide 43 is reduced to the two CSSD daemons and the interconnect.
    44. Evictions [built up over slides 44-48] • Network heartbeat lost • Voting disk access lost • CSSD is not healthy • OS is not healthy: OPROCD (Unix, Windows, 11g Linux), hangcheck-timer (10g Linux)
    49. DEMO: NHB (network heartbeat) failure • Simulate with "ifconfig eth1 down" • Both nodes notice the loss • They race to evict each other through the voting disk => 2 equal sub-clusters • The sub-cluster with the lowest leader # survives; the leader is the node with the lowest # in its sub-cluster • The winner evicts the other node by setting a kill-block in the voting disk • CSSD and OCLSOMON race to suicide
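The tie-break described on the NHB failure slide can be sketched in a few lines of Python. This is an illustrative model only, not CSSD code: `surviving_subcluster` is a hypothetical name, and the rule that a larger sub-cluster beats a smaller one is an assumption added here (the slide only covers the case of two equal sub-clusters).

```python
from typing import Iterable, List, Set


def surviving_subcluster(subclusters: Iterable[Set[int]]) -> Set[int]:
    """Return the sub-cluster (a set of node numbers) that survives a split."""
    groups: List[Set[int]] = [set(g) for g in subclusters if g]
    if not groups:
        raise ValueError("no sub-clusters given")
    # Assumption (not on the slide): a larger sub-cluster wins outright.
    # From the slide: among equal sub-clusters, the one whose leader
    # (the lowest node number in the group) is lowest survives.
    return min(groups, key=lambda g: (-len(g), min(g)))


if __name__ == "__main__":
    # Two-node cluster split into two equal halves: node 1 survives.
    print(sorted(surviving_subcluster([{2}, {1}])))  # [1]
```

In the two-node demo both sub-clusters have size one, so the outcome depends only on the node numbers.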
    50. NHB failure symptoms • NHB failure reported on several nodes in ocssd.log • The evicted node can contain other traces: maybe syslog (Linux: /var/log/messages), maybe oclsomon.log, almost always the console • The network is only a *possible* root cause: check syslog, ifconfig, netstat; ask network engineering for the switch logs
    51. DEMO: CSSD is not healthy • Simulate using kill -STOP <cssd.bin pid> • The other node observes NHB loss • After misscount seconds it attempts eviction, but the frozen CSSD can’t commit suicide • OCLSOMON detects the CSSD timeout and commits suicide
    52. OCSSD sick: symptoms • Error in oclsomon.log • ocssd.log might be clean on the evicted node • syslog might contain the OCLSOMON diagnostic error • The console often contains the diagnostic error, depending on syslogd settings • Set diagwait to more than 3 for better diagnosability: 3 seconds is reboottime; increases the risk of corruption
    53. DEMO: host sick, CPU stalled • Simulate by pausing OPROCD: kill -STOP <oprocd pid>; sleep 1 or 2; kill -CONT <oprocd pid> • oprocd.log: usually nothing if the node is reset • Immediate reboot • The console might contain a diagnostic message
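Why pausing OPROCD for a second or two triggers a reboot is easier to see with a toy watchdog. The sketch below only illustrates the general "sleep, then check how late I woke up" principle; it is not OPROCD's implementation, the function names are hypothetical, and the interval and margin values are made-up numbers rather than Oracle defaults.

```python
import time


def would_reset(expected_wake: float, actual_wake: float, margin: float) -> bool:
    """True if the wake-up happened later than the allowed margin."""
    return (actual_wake - expected_wake) > margin


def watchdog_tick(interval: float, margin: float) -> bool:
    """Sleep for one interval and report whether the wake-up was too late."""
    start = time.monotonic()
    time.sleep(interval)
    return would_reset(start + interval, time.monotonic(), margin)


if __name__ == "__main__":
    # Hypothetical values: 0.2 s interval, 0.5 s margin.
    # On a healthy host this should report False; a process stopped with
    # kill -STOP for longer than the margin would report True after kill -CONT.
    print(watchdog_tick(interval=0.2, margin=0.5))
```

A real monitor would reset the node instead of returning a flag, which is why the demo usually leaves nothing in oprocd.log.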
    54. Killed by OPROCD: symptoms • Hard to confirm (nothing in oprocd.log) • Console output often helps • "SysRq: resetting" could be in syslog as well • Root cause: faulty hardware or drivers, problems caused by IO/network, kernel bugs, NTP bugs • Investigate syslog messages • Margin can be tuned: diagwait and reboottime CSSD parameters
    55. 10g on Linux: hangcheck-timer • Replaced by OPROCD in 11g and 10.2.0.4+ • Most of the time useless and inactive! • Metalink Note 726833.1 (updated 21-JUL-08!): Oracle suggests keeping both; I would leave only OPROCD • Metalink Note 567730.1: OPROCD in 10.2.0.4
    56. Killed by hangcheck-timer • Can rarely be confirmed • "Hangcheck: hangcheck is restarting the machine" • Can set hangcheck_dump_tasks to dump state • See source code...
    57. Clusterware startup • Linux & UNIX inittab: init.cssd, init.evmd, init.crsd • Linux & UNIX init.d: init.crs • Windows Services
    58. Daemons startup sequence [diagram]: third-party clusterware, CSSD, then EVMD and CRSD • Triggered by init.crs from the init.d sequence, or manually
    59. Startup in Linux & Unix
        [gorby@dime ~]$ ps -fe | grep init. | grep -v grep
        root 6352    1 0 10:24 ... /bin/sh /etc/init.d/init.evmd run
        root 6353    1 0 10:24 ... /bin/sh /etc/init.d/init.cssd fatal
        root 6354    1 0 10:24 ... /bin/sh /etc/init.d/init.crsd run
        root 7356 6353 0 10:25 ... /bin/sh /etc/init.d/init.cssd oprocd
        root 7364 6353 0 10:25 ... /bin/sh /etc/init.d/init.cssd oclsomon
        root 7383 6353 0 10:25 ... /bin/sh /etc/init.d/init.cssd daemon
        [gorby@dime ~]$ tail -3 /etc/inittab
        h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
        h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
        h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null
        [gorby@dime ~]$ ls -l /etc/rc3.d/S96init.crs
        lrwxrwxrwx 1 root root 20 Aug 1 23:51 /etc/rc3.d/S96init.crs -> /etc/init.d/init.crs
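The inittab lines in this listing follow the standard SysV `id:runlevels:action:process` layout, so they are easy to check from a script. Below is a minimal sketch; the function name `clusterware_inittab_entries` and the dictionary keys are hypothetical, and the sample text is copied from the `tail -3 /etc/inittab` output on the slide.

```python
from typing import Dict, List

CLUSTERWARE_SCRIPTS = ("init.cssd", "init.evmd", "init.crsd")


def clusterware_inittab_entries(inittab_text: str) -> List[Dict[str, str]]:
    """Return the inittab entries that launch the Clusterware init scripts."""
    entries: List[Dict[str, str]] = []
    for raw in inittab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":", 3)  # id:runlevels:action:process
        if len(fields) != 4:
            continue
        ident, runlevels, action, process = fields
        words = process.split()
        if not words:
            continue
        script = words[0].rsplit("/", 1)[-1]
        if script in CLUSTERWARE_SCRIPTS:
            entries.append({
                "id": ident,
                "runlevels": runlevels,
                "action": action,
                "script": script,
                "argument": words[1] if len(words) > 1 else "",
            })
    return entries


SAMPLE = """\
h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null
"""

if __name__ == "__main__":
    for entry in clusterware_inittab_entries(SAMPLE):
        print(entry["id"], entry["script"], entry["argument"], entry["action"])
```

A missing entry, or an action other than `respawn`, would be a first hint when a daemon does not come back after a failure.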
    60. Startup flow [timeline diagram, built up over slides 60-67] • From inittab: "init.cssd fatal", "init.evmd run", "init.crsd run" • /etc/oracle/scls_scr/{host}/root/cssrun • /etc/oracle/scls_scr/{host}/root/crsstart: enable or disable • "init.crs start", then "init.cssd autostart" • "init.cssd fatal" spawns "init.cssd oprocd" (oprocd), "init.cssd oclsomon" (oclsomon.bin), "init.cssd oclsvmon" (oclsvmon.bin) and "init.cssd daemon" (ocssd.bin) • "init.evmd run" starts evmd.bin • "init.crsd run" starts crsd.bin
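The role of the crsstart flag file in this flow can be illustrated with a small sketch. It assumes, based on the slide, that the file holds the single word "enable" or "disable"; the helper names are hypothetical and the real init scripts are shell, not Python.

```python
from pathlib import PurePosixPath

SCLS_ROOT = PurePosixPath("/etc/oracle/scls_scr")


def crsstart_path(host: str) -> PurePosixPath:
    """Location of the autostart flag file for a given host."""
    return SCLS_ROOT / host / "root" / "crsstart"


def autostart_proceeds(flag_content: str) -> bool:
    """True if the flag content allows 'init.cssd autostart' to continue.

    Assumption: anything other than the word 'enable' is treated as disabled.
    """
    return flag_content.strip() == "enable"


if __name__ == "__main__":
    print(crsstart_path("dime"))
    print(autostart_proceeds("disable\n"))  # False
```

With the flag set to disable, the stack stays down after boot until someone starts it explicitly.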
    68. DEMO: Startup troubleshooting • Check processes using "ps -fe | grep init" • Check syslog (/var/log/messages), which can point to /tmp/crsctl.##### • Remember the boot sequence • Check the Clusterware log files if the *.bin processes are already running • crsctl check crs/cssd/crsd/evmd
    69. Log files • log/{host}/cssd/ocssd.log • log/{host}/cssd/oclsomon/oclsomon.log (older copies: oclsomon.ba1, oclsomon.ba2, ...) • /etc/oracle/oprocd/{host}.oprocd.log (older copies: {host}.oprocd.log.{timestamp}) • syslog: Linux /var/log/messages, Solaris /var/adm/messages • Console logs
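When collecting evidence after an eviction it helps to expand these path patterns for a concrete node. A minimal sketch follows; it assumes the relative `log/{host}/...` paths live under the Clusterware home (as the later `$ORA_CRS_HOME/log/{host}/crs/crsd.log` reference suggests), uses the oclsomon.log spelling from the NHB symptoms slide, and the home directory `/u01/crs` is a hypothetical example.

```python
from typing import List

# Patterns taken from the "Log files" slide; {crs_home} is an assumption.
LOG_PATTERNS = [
    "{crs_home}/log/{host}/cssd/ocssd.log",
    "{crs_home}/log/{host}/cssd/oclsomon/oclsomon.log",
    "/etc/oracle/oprocd/{host}.oprocd.log",
]


def eviction_log_paths(crs_home: str, host: str) -> List[str]:
    """Expand the log path patterns for one node."""
    home = crs_home.rstrip("/")
    return [p.format(crs_home=home, host=host) for p in LOG_PATTERNS]


if __name__ == "__main__":
    for path in eviction_log_paths("/u01/crs", "dime"):
        print(path)
```

The syslog and console logs are OS-specific and are deliberately left out of the list.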
    70. Windows world • OPROCD = OraFenceService • EVMD = OracleEVMService • CRSD = OracleCRService • CSSD = OracleCSService • OPMD (Oracle Process Manager Daemon): start trigger like init.crs in *nix; registered with the Windows Service Control Manager (WSCM) and delays the start by 60 seconds
    71. EVMD [diagram, slides 71-72] • Passes clusterware events • Usually not a problem • Verify with evmwatch -A and evmpost -u "my message"
    73. CRSD [diagram, slides 73-74] • CRSD manages cluster resources: stop / start, failover, VIP management, new resources, etc. • RACG helper scripts
    75. CRSD startup • After CSSD and EVMD • Re-spawned on failure, no eviction • Runs as root: VIP control, OCR management; root ulimits are in place! • Can run resources owned by any user: the owner is a property of a resource
    76. Oracle Cluster Registry • Repository for all configuration data, except the OCR location itself • OCR is accessed mostly read-only: every component reads OCR • OCR is written only by CRS, and only from a single OCR master node • crsd.log: 2008-08-02 22:23:50.958: [ OCRMAS] [3065154448]th_master:13:I AM THE NEW OCR MASTER at incar 12. Node Number 1
    77. CRS resources • Standard Oracle resources: ASM, Listener, VIP, Database and Instance, etc.; srvctl manages Oracle resources • Custom user resources • crs_% commands manage any resources
    78. CRS resource internals • Unique name • Associated action script: stop / start / check functions • Other attributes: check frequency, prerequisites, restart retries, etc. • All info stored in OCR
    79. DEMO: Resource profiles • Use crs_stat [-t] to check status • Use crs_stat -p to check attributes • crs_* vs srvctl (like srvctl config ... -a) • Standard action scripts: racgimon, racgwrap / racgmain, racgvip, racgons, usrvip
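Status checks like these are often scripted. The sketch below parses text in the KEY=VALUE block layout that plain `crs_stat` (without -t) is assumed to print, one block per resource separated by blank lines; that layout, the sample resource names and the helper names are all assumptions for illustration, so verify them against real output before relying on this.

```python
from typing import Dict, List


def parse_crs_stat(output: str) -> List[Dict[str, str]]:
    """Split KEY=VALUE blocks (one per resource) into dictionaries."""
    resources: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            if current:
                resources.append(current)
                current = {}
            continue
        key, sep, value = line.partition("=")
        if sep:
            current[key] = value
    if current:
        resources.append(current)
    return resources


def not_online(resources: List[Dict[str, str]]) -> List[str]:
    """Names of resources whose STATE does not start with ONLINE."""
    return [r.get("NAME", "?") for r in resources
            if not r.get("STATE", "").startswith("ONLINE")]


# Hypothetical sample, not captured from a real cluster.
SAMPLE = """\
NAME=ora.dime.vip
TYPE=application
TARGET=ONLINE
STATE=ONLINE on dime

NAME=ora.demo.db
TYPE=application
TARGET=ONLINE
STATE=OFFLINE
"""

if __name__ == "__main__":
    print(not_online(parse_crs_stat(SAMPLE)))  # ['ora.demo.db']
```

Comparing TARGET with STATE in the same way would separate resources that are down on purpose from those that failed.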
    80. DEMO: OCR internals • ocrcheck • ocrconfig: used during install/upgrade; back up OCR; recover OCR • ocrdump: txt or xml
    81. DEMO: racgvip case study • Check the script • Set env. vars and simulate the call • Use _USR_ORA_DEBUG=1 in the script
    82. Resources hierarchy [diagram]: CS (Collective Service), DB, Service, Instance, ASM, Listener, and the nodeapps GSD, ONS and VIP; one dependency is marked "Only 10.1 and 10.2.0.1" • 10.2.0.2 (?) released the dependency of ASM and Instance on VIP • If the DB is registered manually with srvctl, the ASM dependency is missing
    83. Resources and Oracle homes [same diagram, grouped by home] • DB home: CS (Collective Service), DB, Service, Instance • ASM home: ASM; the Listener can be in the ASM home, and the ASM home can be the Oracle home • CRS home: nodeapps GSD, ONS, VIP • Logs are in the appropriate home
    84. DEMO: troubleshooting resources • {home}/log/{host}/racg/{resource_name}.log • Old way: edit racgwrap and uncomment _USR_ORA_DEBUG=1 • crsctl debug log res '{res_name}:{0|1}' • crs_stat -p | grep DEBUG • Run "srvctl start ..." manually with SRVM_TRACE=TRUE
    85. Troubleshooting summary • crsctl check crs | crsd | cssd | evmd • crs_stat [-t] • crs_stat -p [{res_name}] • crsctl debug log css | crs | evm | res • crsctl lsmodules css | crs | evm • crs_stop {res_name} [-f] (force-stop a resource) • ocrdump • See scripts
    86. Troubleshooting flow • Is Clusterware up? • Are Oracle resources up? Listener & VIP, Database & ASM instance, Services • Did any nodes get rebooted? • Did any resources get restarted? • $ORA_CRS_HOME/log/{host}/crs/crsd.log • $ORA_CRS_HOME/log/{host}/alert{host}.log • MOS Note 265769.1 "Troubleshooting 10g and 11.1 Clusterware Reboots"
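The first two questions of the flow map directly onto commands from the troubleshooting summary, so they can be wrapped in a small checklist runner. This is a sketch under assumptions: the pairing of questions to commands is my reading of the slides, `run_checks` is a hypothetical helper, and the output is returned as raw text because the exit codes of these tools are not described in the talk.

```python
import shutil
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

# (question, command) pairs; the commands come from the summary slide.
CHECKS: List[Tuple[str, List[str]]] = [
    ("Is Clusterware up?", ["crsctl", "check", "crs"]),
    ("Are Oracle resources up?", ["crs_stat", "-t"]),
]


def run_checks(
    checks: Sequence[Tuple[str, List[str]]] = CHECKS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[Tuple[str, Optional[str]]]:
    """Run each check in order; None means the tool is not on PATH."""
    results: List[Tuple[str, Optional[str]]] = []
    for question, argv in checks:
        if which(argv[0]) is None:
            results.append((question, None))
            continue
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append((question, proc.stdout))
    return results


if __name__ == "__main__":
    for question, output in run_checks():
        print(question)
        print(output if output is not None else "  (command not found on this host)")
```

The remaining questions (reboots, restarted resources) are answered from crsd.log and the alert log rather than from a command.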
    87. Enter the 11gR2 World: Grid Infrastructure [slides 87-89] • Oracle Clusterware Administration and Deployment Guide • My Oracle Support Note 1053147.1
    90. 11g Grid Infrastructure Documentation • Oracle Clusterware Administration and Deployment Guide • MOS Note 1053147.1: 11gR2 Clusterware and Grid Home - What You Need to Know • MOS Note 1050908.1: How to Troubleshoot Grid Infrastructure Startup Issues • MOS Note 1053970.1: Troubleshooting 11.2 Grid Infrastructure Installation Root.sh Issues • MOS Note 1050693.1: Troubleshooting 11.2 Clusterware Node Evictions (Reboots)
    91. 11gR2 Node Evictions • Same as in 10g + member kill escalation: the LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism; if this times out, it could escalate to a node kill • Evicting processes: CSSD, CSSDAGENT, CSSDMONITOR
    92. Questions? Thank you! http://www.pythian.com/ gorbachev@pythian.com