Oracle Clusterware Node Management and Voting Disks
Document Transcript

  • 1. Node Management in Oracle Clusterware. Markus Michalewicz, Senior Principal Product Manager, Oracle RAC and Oracle RAC One Node
  • 2. The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remain at the sole discretion of Oracle.
    Agenda:
    – Oracle Clusterware 11.2.0.1 Processes
    – Node Monitoring Basics
    – Node Eviction Basics
    – Re-bootless Node Fencing (restart)
    – Advanced Node Management
    – The Corner Cases
    – More Information / Q&A
  • 3. Oracle Clusterware 11g Rel. 2 Processes. Most are not important for node management; the focus here is on:
    – OHASD
    – CSSD (ora.cssd)
    – CSSDMONITOR (was: oprocd; ora.cssdmonitor)
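    These processes and their resources can be inspected on a running 11.2 node; a minimal sketch (not from the slides), using standard commands:
      [GRID]> crsctl stat res -t -init    # lists the lower-stack resources, including ora.cssd and ora.cssdmonitor
      [GRID]> ps -ef | grep -E 'ohasd|ocssd|cssdmonitor' | grep -v grep    # the corresponding OS processes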
  • 4. Node Monitoring Basics. Basic hardware layout for Oracle Clusterware: node management is hardware independent. (Diagram: public LAN, private LAN / interconnect, three cluster nodes each running CSSD, redundant SAN / storage networks, and a shared Voting Disk.)
  • 5. What does CSSD do? CSSD monitors and evicts nodes.
    – It monitors nodes using two communication channels: the private interconnect (network heartbeat) and Voting Disk based communication (disk heartbeat).
    – It evicts nodes (forcibly removes them from the cluster) depending on heartbeat feedback (failures).
    Network heartbeat / interconnect basics:
    – Each node in the cluster is "pinged" every second.
    – Nodes must respond within the css_misscount time (defaults to 30 secs.); reducing the css_misscount time is generally not supported.
    – Network heartbeat failures will lead to node evictions. CSSD log example:
      [date / time] [CSSD][1111902528]clssnmPollingThread: node mynodename (5) at 75% heartbeat fatal, removal in 6.770 seconds
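    For reference, the effective network heartbeat timeout can be checked with crsctl; a minimal sketch (default value noted as a comment, not to be changed without guidance from Oracle Support):
      [GRID]> crsctl get css misscount    # network heartbeat timeout; defaults to 30 seconds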
  • 6. Disk Heartbeat. Voting Disk basics – Part 1:
    – Each node in the cluster "pings" (reads/writes) the Voting Disk(s) every second.
    – Nodes must receive a response within the (long / short) diskTimeout time; I/O errors indicate clear accessibility problems, so the timeout is irrelevant in that case.
    – Disk heartbeat failures will lead to node evictions. CSSD log example:
      … [CSSD] [1115699552] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
    Voting Disk Structure. Voting Disk basics – Part 2:
    – Voting Disks contain dynamic data (disk heartbeat logging) and static data (information about the nodes in the cluster).
    – With 11.2.0.1 Voting Disks got an "identity", e.g. a Voting Disk serial number:
      [GRID]> crsctl query css votedisk
      1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
    – Voting Disks must therefore no longer be copied using "dd" or "cp".
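    The disk heartbeat timeout can be inspected the same way; a minimal sketch (defaults noted as comments):
      [GRID]> crsctl get css disktimeout    # long disk I/O timeout; defaults to 200 seconds
      [GRID]> crsctl query css votedisk     # lists each configured voting disk and its identity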
  • 7. "Simple Majority Rule". Voting Disk basics – Part 3:
    – Oracle supports redundant Voting Disks for disk failure protection.
    – The "Simple Majority Rule" applies: each node must "see" the simple majority of the configured Voting Disks at all times in order not to be evicted (to remain in the cluster): trunc(n/2+1) with n = number of voting disks configured and n >= 1.
    Insertion 1: the "Simple Majority Rule" in extended Oracle clusters:
    – See http://www.oracle.com/goto/rac – "Using standard NFS to support a third voting file for extended cluster configurations" (PDF).
    – The same principles apply; the Voting Disks are just geographically dispersed.
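    A quick sanity check of that formula for the usual configurations; a plain shell sketch, only illustrating the arithmetic on the slide:
      # trunc(n/2+1): voting disks a node must see to remain in the cluster
      for n in 1 3 5; do
        echo "configured: $n  must see: $(( n / 2 + 1 ))"
      done
      # prints 1 -> 1, 3 -> 2, 5 -> 3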
  • 8. Insertion 2: Voting Disks in Oracle ASM. The way of storing Voting Disks doesn't change their use:
      [GRID]> crsctl query css votedisk
      1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
      2. 2 aafab95f9ef84f03bf6e26adc2a3b0e8 (/dev/sde5) [DATA]
      3. 2 28dd4128f4a74f73bf8653dabd88c737 (/dev/sdd6) [DATA]
      Located 3 voting disk(s).
    – Oracle ASM auto-creates 1/3/5 Voting Files, based on External/Normal/High redundancy and on the Failure Groups in the Disk Group.
    – Per default there is one failure group per disk; ASM will enforce the required number of disks.
    – New failure group type: Quorum Failgroup.
    (Section: Node Eviction Basics)
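    Relocating the voting files into an ASM disk group is a single crsctl call; a minimal sketch, assuming a disk group named +DATA already exists:
      [GRID]> crsctl replace votedisk +DATA    # moves the voting files into the +DATA disk group
      [GRID]> crsctl query css votedisk        # verify: ASM places 1/3/5 voting files depending on the redundancy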
  • 9. Why are nodes evicted? To prevent worse things from happening…
    – Evicting (fencing) nodes is a preventive measure (a good thing)!
    – Nodes are evicted to prevent the consequences of a split brain: shared data must not be written by independently operating nodes, and the easiest way to prevent this is to forcibly remove a node from the cluster.
    How are nodes evicted in general? "STONITH like", or node eviction basics – Part 1:
    – Once it is determined that a node needs to be evicted, a "kill request" is sent to the respective node(s), using all (remaining) communication channels.
    – A node (CSSD) is requested to "kill itself"; this is "STONITH like" ("STONITH" foresees that a remote node kills the node to be evicted).
  • 10. How are nodes evicted? EXAMPLE: heartbeat failure.
    – The network heartbeat between nodes has failed.
    – It is determined which nodes can still talk to each other.
    – A "kill request" is sent to the node(s) to be evicted, using all (remaining) communication channels, i.e. the Voting Disk(s).
    – A node is requested to "kill itself"; the executer is typically CSSD.
    How can nodes be evicted? Using IPMI, node eviction basics – Part 2:
    – Oracle Clusterware 11.2.0.1 and later supports IPMI (optional); Intelligent Platform Management Interface (IPMI) drivers are required.
    – IPMI allows remote shutdown of nodes using additional hardware; a Baseboard Management Controller (BMC) per cluster node is required.
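    IPMI-based fencing is configured per node with crsctl once the BMC and its driver are in place; a sketch with a hypothetical BMC account and address:
      [GRID]> crsctl set css ipmiadmin bmcadmin        # BMC administrator account for this node
      [GRID]> crsctl set css ipmiaddr 192.168.10.45    # address of this node's BMC
      [GRID]> crsctl query css ipmidevice              # checks that the IPMI device/driver is accessible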
  • 11. Insertion: node eviction using IPMI. EXAMPLE: heartbeat failure.
    – The network heartbeat between the nodes has failed.
    – It is determined which nodes can still talk to each other.
    – IPMI is used to remotely shut down the node to be evicted.
    Which node is evicted? Node eviction basics – Part 3:
    – Voting Disks and heartbeat communication are used to determine the node.
    – In a 2-node cluster, the node with the lowest node number should survive.
    – In an n-node cluster, the biggest sub-cluster should survive (votes based).
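    The node numbers that decide the 2-node tie-break can be listed with olsnodes; a minimal sketch with hypothetical host names:
      [GRID]> olsnodes -n    # node names with their node numbers
      node1   1
      node2   2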
  • 12. (Section: Re-bootless Node Fencing (restart))
    Re-bootless Node Fencing (restart): fence the cluster, do not reboot the node.
    – Until Oracle Clusterware 11.2.0.2, fencing meant "re-boot".
    – With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because re-boots affect applications that might run on a node but are not protected. The customer requirement was: prevent a reboot, just stop the cluster. This has been implemented.
  • 13. Re-bootless Node Fencing (restart) – how it works:
    – With Oracle Clusterware 11.2.0.2, re-boots will be seen less: instead of fast re-booting the node, a graceful shutdown of the stack is attempted.
    – It starts with a failure, e.g. a network heartbeat or interconnect failure.
  • 14. Re-bootless Node Fencing (restart) – how it works (continued):
    – Then the I/O issuing processes are killed; it is made sure that no I/O process remains. For a RAC DB, mainly the log writer and the database writer are of concern.
    – Once all I/O issuing processes are killed, the remaining processes are stopped. IF the check for a successful kill of the I/O processes fails → reboot.
  • 15. Re-bootless Node Fencing (restart) – how it works (continued):
    – Once all remaining processes are stopped, the stack stops itself with a "restart flag".
    – OHASD will finally attempt to restart the stack after the graceful shutdown.
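    After such a restart, the stack status can be confirmed with the usual checks; a minimal sketch (not from the slides):
      [GRID]> crsctl check crs             # CRS, CSS and EVM status on the local node
      [GRID]> crsctl check cluster -all    # the same check across all cluster nodes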
  • 16. Re-bootless Node Fencing (restart) – EXCEPTIONS. With Oracle Clusterware 11.2.0.2, re-boots will be seen less, unless:
    – IF the check for a successful kill of the I/O processes fails → reboot
    – IF CSSD gets killed during the operation → reboot
    – IF cssdmonitor (the oprocd replacement) is not scheduled → reboot
    – IF the stack cannot be shut down in "short_disk_timeout" seconds → reboot
    (Section: Advanced Node Management)
  • 17. Determine the Biggest Sub-Cluster. Voting Disk basics – Part 4:
    – Each node in the cluster is "pinged" every second (network heartbeat).
    – Each node in the cluster "pings" (reads/writes) the Voting Disk(s) every second.
    – In an n-node cluster, the biggest sub-cluster should survive (votes based).
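    A worked illustration of the vote count (not Oracle code, just the arithmetic): if a 3-node cluster splits into sub-clusters of 2 and 1 nodes, the larger sub-cluster keeps the cluster:
      # hypothetical split of a 3-node cluster into {node1,node2} and {node3}
      a=2; b=1
      [ "$a" -gt "$b" ] && echo "the 2-node sub-cluster survives; node3 is evicted"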
  • 18. Redundant Voting Disks – why an odd number? Voting Disk basics – Part 5:
    – Redundant Voting Disks means Oracle managed redundancy.
    – Assume for a moment that only 2 voting disks were supported: without the "Simple Majority Rule", what would we do?
    – Even with the "Simple Majority Rule" in place, advanced scenarios need to be considered: if each node can see only one voting disk, no node sees a majority, which would lead to an eviction of all nodes.
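    The arithmetic behind the odd number, as a short sketch (same trunc(n/2+1) formula as before):
      n=2; echo "must see $(( n / 2 + 1 )) of $n voting disks"    # -> 2 of 2: no disk failure tolerated
      n=3; echo "must see $(( n / 2 + 1 )) of $n voting disks"    # -> 2 of 3: one disk failure tolerated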
  • 19. Redundant Voting Disks – why an odd number? Voting Disk basics – Part 5 (diagram-only slides: three nodes and three voting disks, showing which voting disks each node can still see).
  • 20. (Section: The Corner Cases)
    Case 1: Partial failures in the cluster – when somebody uses a pair of scissors in the wrong way…
    – A properly configured cluster with 3 voting disks, as shown in the diagram.
    – What happens if there is a storage network failure as shown (lost remote access)?
  • 21. Case 1: Partial failures in the cluster (continued):
    – There will be no node eviction!
    – IF storage mirroring is used (for data files), the respective mirroring solution must handle this case.
    – Covered in Oracle ASM 11.2.0.2: _asm_storagemaysplit = TRUE (backported to 11.1.0.7).
    Case 2: CSSD is stuck – CSSD cannot execute the request:
    – A node is requested to "kill itself", BUT CSSD is "stuck" or "sick" (does not execute), e.g. CSSD failed for some reason, or CSSD is not scheduled within a certain margin.
    – CSSDMONITOR (was: oprocd) will take over and execute the kill.
  • 22. Case 2: CSSD is stuck (continued); the diagram shows cssdmonitor stepping in on the affected node.
    Case 3: Node eviction escalation – members of a cluster can escalate kill requests:
    – Cluster members (e.g. Oracle RAC instances) can request Oracle Clusterware to kill a specific member of the cluster.
    – Oracle Clusterware will then attempt to kill the requested member.
  • 23. Case 3: Node eviction escalation (continued):
    – If the requested member kill is unsuccessful, a node eviction escalation can be issued, which leads to the eviction of the node on which that member currently resides.
  • 24. Case 3: Node eviction escalation (continued); the diagram shows only DB instance 1 remaining after the escalated eviction.
    (Section: More Information)
  • 25. More information:
    – My Oracle Support note ID 294430.1 – CSS Timeout Computation in Oracle Clusterware
    – My Oracle Support note ID 395878.1 – Heartbeat/Voting/Quorum Related Timeout Configuration for Linux, OCFS2, RAC Stack to Avoid Unnecessary Node Fencing, Panic and Reboot
    – http://www.oracle.com/goto/clusterware – Oracle Clusterware 11g Release 2 Technical Overview
    – http://www.oracle.com/goto/asm
    – http://www.oracle.com/goto/rac