Successfully reported this slideshow.
Your SlideShare is downloading. ×

了解Oracle rac brain split resolution

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
了解 Oracle RAC Brain
 Split Resolution


      by Maclean.liu
            liu.maclean@gmail.com
        www.oracledatabase1...
About Me

l Email:liu.maclean@gmail.com
l Blog:www.oracledatabase12g.com
l Oracle Certified Database Administrator Master ...
大约是一周前,一位资深的 Oracle 工程师向我和客户介绍 RAC 中脑裂的处理过程,据他介
绍脑裂发生时通过各节点对 voting disk(投票磁盘)的抢夺,那些争抢到(n/2+1)数量 voting disk
的节点就可以 surviv...
Advertisement
Advertisement
Advertisement
Loading in …3
×

Check these out next

1 of 18 Ad

More Related Content

Slideshows for you (20)

Viewers also liked (20)

Advertisement

Similar to 了解Oracle rac brain split resolution (20)

More from maclean liu (20)

Advertisement

Recently uploaded (20)

了解Oracle rac brain split resolution

  1. 1. 了解 Oracle RAC Brain Split Resolution by Maclean.liu liu.maclean@gmail.com www.oracledatabase12g.com
  2. 2. About Me l Email:liu.maclean@gmail.com l Blog:www.oracledatabase12g.com l Oracle Certified Database Administrator Master 10g and 11g l Over 6 years experience with Oracle DBA technology l Over 7 years experience with Linux technology l Member Independent Oracle Users Group l Member All China Users Group l Presents for advanced Oracle topics: RAC, DataGuard, Performance Tuning and Oracle Internal.
  3. 3. 大约是一周前,一位资深的 Oracle 工程师向我和客户介绍 RAC 中脑裂的处理过程,据他介 绍脑裂发生时通过各节点对 voting disk(投票磁盘)的抢夺,那些争抢到(n/2+1)数量 voting disk 的节点就可以 survive(幸存)下来,而没有争抢到 voting disk 的节点则被 evicted 踢出节点。 不得不说以上这番观点,来得太过随意了,一位从 Oracle 6 就开始从事维护工作的老工程师 也会犯这样的概念性错误,只能说 Oracle 技术的更新过于日新月异了。 在理解脑裂(Brain Split)处理过程前,有必要介绍一下 Oracle RAC Css(Cluster Synchronization Services)的工作框架: Oracle RAC CSS 提供 2 种后台服务包括群组管理(Group Managment 简称 GM)和节点监控 (Node Monitor 简称 NM),其中 GM 管理组(group)和锁(lock)服务。在集群中任意时刻总有一 个节点会充当 GM 主控节点(master node)。集群中的其他节点串行地将 GM 请求发送到主控 节点(master node),而 master node 将集群成员变更信息广播给集群中的其他节点。组成员关 系(group membership)在每次发生集群重置(cluster reconfiguration)时发生同步。每一个节点独 立地诠释集群成员变化信息。
  4. 4. 而节点监控 NM 服务则负责通过 skgxn(skgxn-libskgxn.a,提供节点监控的库)与其他厂商的集 群软件保持节点信息的一致性。此外 NM 还提供对我们熟知的网络心跳(Network heartbeat) 和磁盘心跳(Disk heartbeat)的维护以保证节点始终存活着。当集群成员没有正常 Network heartbeat 或 Disk heartbeat 时 NM 负责将成员踢出集群,被踢出集群的节点将发生节点重启 (reboot)。 NM 服务通过 OCR 中的记录(OCR 中记录了 Interconnect 的信息)来了解其所需要监听和交互 的端点,将心跳信息通过网络发送到其他集群 成员。同时它也监控来自所有其他集群成员 的网络心跳 Network heartbeat,每一秒钟都会发生这样的网络心跳,若某个节点的网络心跳 在 misscount(by the way:10.2.0.1 中 Linux 上默认 misscount 为 60s,其他平台为 30s,若使用了 第三方 vendor clusterware 则为 600s,但 10.2.0.1 中未引入 disktimeout;10.2.0.4 以后 misscount 为 60s,disktimeout 为 200s;11.2 以后 misscount 为 30s:CRS-4678: Successful get misscount 30 for Cluster Synchronization Services,CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services)指定的秒数中都没有被收到的话,该节点被认为已经”死 亡”了。NM 还负责当其他节点加入或离开集群时初始化集群的重置 (Initiates cluster reconfiguration)。 在解决脑裂的场景中,NM 还会监控 voting disk 以了解其他的竞争子集群(subclusters)。关于 子集群我们有必要介绍一下,试想我们的环境中存在大量的节点,以 Oracle 官方构建过 的 128 个节点的环境为我们的想象空间,当网络故障发生时存在多种的可能性,一种可能性是 全局的网络失败,即 128 个节点中每个节点都不能互相发生网络心 跳,此时会产生多达 128 个的信息”孤岛”子集群。另一种可能性是局部的网络失败,128 个节点中被分成多个部分, 每个部分中包含多于一个的节点,这些部 分就可以被称作子集群(subclusters)。当出现网络 故障时子集群内部的多个节点仍能互相通信传输投票信息(vote mesg),但子集群或者孤岛节点 之间已经无法通过常规的 Interconnect 网络交流了,这个时候 NM Reconfiguration 就需要用到 voting disk 投票磁盘。 因为 NM 要使用 voting disk 来解决因为网络故障造成的通信障碍,所以需要保证 voting disk 在任意时刻都可以被正常访问。在正常状态下,每个节点都会进行磁盘心跳活动,具体来说 就是会到投票磁盘的某个块上写入 disk 心跳信息,这种活动 每一秒钟都会发生,同时 CSS 还会每秒读取一种称作”kill block”的”赐死块”,当”kill block”的内容表示本节点被驱逐出集群 时,CSS 会主动重启节点。
  5. 5. 为了保证以上的磁盘心跳和读取”kill block”的活动始终正常运作 CSS 要求保证至少(N/2+1)个 投票磁盘要被节点正常访问,这样就保证了每 2 个节点间总是至少有一个投票磁盘是它们都 可以正常访问的,在正常情况下(注意是风平浪静的正常情况)只要节点所能访问的在线 voting disk 多于无法访问的 voting disk,该节点都能幸福地活下去,当无法访问的 voting disk 多于正常的 voting disk 时,Cluster Communication Service 进程将失败并引起节点重启。所以 有一种说法认为 voting disk 只要有 2 个足以保证冗余度就可以了,没有必要有 3 个或以上 voting disk,这种说法是错误的。Oracle 推荐集群中至少要有 3 个 voting disks。 当实际的 NM Reconfiguration 集群重置情况发生时所有的 active 节点和正在加入集群的节点 都会参与到 reconfig 中,那些没有应答(ack)的节点都将不再被归入新的集群关系中。实际上 reconfig 重置包括多个阶段: 1.初始化阶段 — reconfig manager(由集群成员号最低的节点担任)向其他节点发送启动 reconfig 的信号 2.投票阶段 — 节点向 reconfig manager 发送该节点所了解的成员关系 3.脑裂检查阶段 — reconfig manager 检查是否脑裂 4.驱逐阶段 — reconfig manager 驱逐非成员节点 5.更新阶段 — reconfig manager 向成员节点发送权威成员关系信息 在脑裂检查阶段 Reconfig Manager 会找出那些没有 Network Heartbeat 而有 Disk Heartbeat 的 节点,并通过 Network Heartbeat(如果可能的话)和 Disk Heartbeat 的信息来计算所有竞争子集 群(subcluster)内的节点数目,并依据以下 2 种因素决定哪个子集群应当存活下去: 1. 拥有最多节点数目的子集群(Sub-cluster with largest number of Nodes) 2. 若子集群内数目相等则为拥有最低节点号的子集群(Sub-cluster with lowest node number),举例来说在一个 2 节点的 RAC 环境中总是 1 号节点会获胜。
  6. 6. 在完成脑裂检查后进入驱逐阶段,被驱逐节点会收到发送给它们的驱逐信息(如果网络可用 的话),若无法发送信息则会通过写出驱逐通知到 voting disk 上的”kill block”来达到驱逐通知 的目的。同时还会等待被驱逐节点表示其已收到驱逐通知,这种表示可能是通过网络通信的 方式也可能是投票磁盘上的状态信息。 可以看到 Oracle CSS 中 Brain Split Check 时会尽可能地保证最大的一个子集群存活下来以保 证 RAC 系统具有最高的可用性,而并不如那位资深工程师所说的在 Cluster Reconfiguration 阶段会通过节点对投票磁盘的抢占来保证哪个节点存活下来。 以下为一个三节点 RAC 环境中的 2 个示例场景: 1.1 号节点网络失败,2,3 号节点形成子集群;2,3 节点通过 voting disk 向 1 号节点发起驱 逐: 以下为 1 号节点的 ocssd.log 日志: [ CSSD]2011-04-23 17:11:42.943 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 50 3.280308e-268artbeat fatal, eviction in 29.610 seconds [ CSSD]2011-04-23 17:11:42.943 [3042950032] >TRACE: clssnmPollingThread: node vrh2 (2) is impending reconfig, flag 1037, misstime 30390 [ CSSD]2011-04-23 17:11:42.943 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 50 3.280308e-268artbeat fatal, eviction in 29.150 seconds 对 2,3 号节点发起 misscount 计时 [ CSSD]2011-04-23 17:11:42.943 [3042950032] >TRACE: clssnmPollingThread: node vrh3 (3) is impending reconfig, flag 1037, misstime 30850 [ CSSD]2011-04-23 17:11:42.943 [3042950032] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1) [ CSSD]2011-04-23 17:11:44.368 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 50 3.280308e-268artbeat fatal, eviction in 28.610 seconds [ CSSD]2011-04-23 17:12:04.778 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 75 3.280308e-268artbeat fatal, eviction in 14.580 seconds [ CSSD]2011-04-23 17:12:04.779 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 75 3.280308e-268artbeat fatal, eviction in 14.120 seconds [ CSSD]2011-04-23 17:12:06.207 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 75 3.280308e-268artbeat fatal, eviction in 13.580 seconds [ CSSD]2011-04-23 17:12:17.719 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 5.560 seconds [ CSSD]2011-04-23 17:12:17.719 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 5.100 seconds [ CSSD]2011-04-23 17:12:19.165 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 4.560 seconds [ CSSD]2011-04-23 17:12:19.165 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 4.100 seconds [ CSSD]2011-04-23 17:12:20.642 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 3.560 seconds [ CSSD]2011-04-23 17:12:20.642 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 3.100 seconds [ CSSD]2011-04-23 17:12:22.139 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 2.560 seconds
  7. 7. [ CSSD]2011-04-23 17:12:22.139 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 2.100 seconds [ CSSD]2011-04-23 17:12:23.588 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 1.550 seconds [ CSSD]2011-04-23 17:12:23.588 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 1.090 seconds 2 号节点的 ocssd.log 日志: [ CSSD]2011-04-23 17:11:53.054 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 50 8.910601e-269artbeat fatal, eviction in 29.800 seconds [ CSSD]2011-04-23 17:11:53.054 [3053439888] >TRACE: clssnmPollingThread: node vrh1 (1) is impending reconfig, flag 1037, misstime 30200 [ CSSD]2011-04-23 17:11:53.054 [3053439888] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1) [ CSSD]2011-04-23 17:11:54.516 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 50 8.910601e-269artbeat fatal, eviction in 28.790 seconds [ CSSD]2011-04-23 17:12:14.826 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 75 8.910601e-269artbeat fatal, eviction in 14.800 seconds [ CSSD]2011-04-23 17:12:16.265 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 75 8.910601e-269artbeat fatal, eviction in 13.800 seconds [ CSSD]2011-04-23 17:12:27.755 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 5.800 seconds [ CSSD]2011-04-23 17:12:29.197 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 4.800 seconds [ CSSD]2011-04-23 17:12:30.658 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 3.800 seconds [ CSSD]2011-04-23 17:12:32.133 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 2.800 seconds [ CSSD]2011-04-23 17:12:33.602 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 1.790 seconds [ CSSD]2011-04-23 17:12:35.126 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 0.800 seconds [ CSSD]2011-04-23 17:12:35.399 [117574544] >TRACE: clssnmHandleSync: diskTimeout set to (57000)ms [ CSSD]2011-04-23 17:12:35.399 [117574544] >TRACE: clssnmHandleSync: Acknowledging sync: src[3] srcName[vrh3] seq[21] sync[10] clssnmHandleSyn 应答 3 号节点发送的同步信息 [ CSSD]2011-04-23 17:12:35.399 [5073104] >USER: NMEVENT_SUSPEND [00][00] [00][0e] 发生 Node Monitoring SUSPEND 事件 [ CSSD]2011-04-23 17:12:35.405 [117574544] >TRACE: clssnmSendVoteInfo: node(3) syncSeqNo(10) 通过 clssnmSendVoteInfo 向 3 号节点发送投票信息 Vote mesg [ CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new) [ CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE: clssnmUpdateNodeState: node 1, state (3/0) unique (1303592601/1303592601) prevConuni(0) birth (9/9) (old/new) [ CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE: clssnmDiscHelper: vrh1, node(1) connection failed, con (0xb7e80ae8), probe((nil)) [ CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE: clssnmDeactivateNode: node 1 (vrh1) left cluster 确认 1 号节点离开了集群 cluster
  8. 8. [ CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE: clssnmUpdateNodeState: node 2, state (3/3) unique (1303591210/1303591210) prevConuni(0) birth (2/2) (old/new) [ CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE: clssnmUpdateNodeState: node 3, state (3/3) unique (1303591326/1303591326) prevConuni(0) birth (3/3) (old/new) [ CSSD]2011-04-23 17:12:35.415 [117574544] >USER: clssnmHandleUpdate: SYNC(10) from node(3) completed [ CSSD]2011-04-23 17:12:35.416 [117574544] >USER: clssnmHandleUpdate: NODE 2 (vrh2) IS ACTIVE MEMBER OF CLUSTER [ CSSD]2011-04-23 17:12:35.416 [117574544] >USER: clssnmHandleUpdate: NODE 3 (vrh3) IS ACTIVE MEMBER OF CLUSTER [ CSSD]2011-04-23 17:12:35.416 [117574544] >TRACE: clssnmHandleUpdate: diskTimeout set to (200000)ms [ CSSD]2011-04-23 17:12:35.416 [3021970320] >TRACE: clssgmReconfigThread: started for reconfig (10) [ CSSD]2011-04-23 17:12:35.416 [3021970320] >USER: NMEVENT_RECONFIG [00] [00][00][0c] [ CSSD]2011-04-23 17:12:35.417 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock crs_version type 2 [ CSSD]2011-04-23 17:12:35.417 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(crs_version) birth(9/9) [ CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_FAILOVER type 3 [ CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2 [ CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(EVMDMAIN) birth(9/9) [ CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2 [ CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(CRSDMAIN) birth(9/9) [ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh1 type 3 [ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh2 type 3 [ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh3 type 3 [ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock ocr_crs type 2 [ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(ocr_crs) birth(9/9) [ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock #CSS_CLSSOMON type 2 [ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(#CSS_CLSSOMON) birth(9/9) [ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmEstablishConnections: 2 nodes in cluster incarn 10 [ CSSD]2011-04-23 17:12:35.419 [3063929744] >TRACE: clssgmPeerDeactivate: node 1 (vrh1), death 10, state 0x80000000 connstate 0xa [ CSSD]2011-04-23 17:12:35.419 [3063929744] >TRACE: clssgmPeerListener: connects done (2/2) [ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmEstablishMasterNode: MASTER for 10 is node(2) birth(2) [ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmMasterCMSync: Synchronizing group/lock status [ CSSD]2011-04-23 17:12:35.428 [3021970320] >TRACE: clssgmMasterSendDBDone: group/lock status synchronization complete [ CSSD]CLSS-3000: reconfiguration successful, incarnation 10 with 2 nodes [ CSSD]CLSS-3001: local node number 2, master node number 2
  9. 9. 完成 reconfiguration [ CSSD]2011-04-23 17:12:35.440 [3021970320] >TRACE: clssgmReconfigThread: completed for reconfig(10), with status(1) 以下为 3 号节点的 ocssd.log: [ CSSD]2011-04-23 17:12:36.303 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 50 1.867300e-268artbeat fatal, eviction in 29.220 seconds [ CSSD]2011-04-23 17:12:36.303 [3053439888] >TRACE: clssnmPollingThread: node vrh1 (1) is impending reconfig, flag 1037, misstime 30780 [ CSSD]2011-04-23 17:12:36.303 [3053439888] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1) [ CSSD]2011-04-23 17:12:57.889 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 75 1.867300e-268artbeat fatal, eviction in 14.220 seconds [ CSSD]2011-04-23 17:13:10.674 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 5.220 seconds [ CSSD]2011-04-23 17:13:12.115 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 4.220 seconds [ CSSD]2011-04-23 17:13:13.597 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 3.210 seconds [ CSSD]2011-04-23 17:13:15.024 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 2.220 seconds [ CSSD]2011-04-23 17:13:16.504 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 1.220 seconds [ CSSD]2011-04-23 17:13:17.987 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 0.220 seconds [ CSSD]2011-04-23 17:13:18.325 [3053439888] >TRACE: clssnmPollingThread: Eviction started for node vrh1 (1), flags 0x040d, state 3, wt4c 0 [ CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE: clssnmDoSyncUpdate: Initiating sync 10 [ CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (57000)ms [ CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (11) [ CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ALIVE [ CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ALIVE [ CSSD]2011-04-23 17:13:18.327 [3032460176] >TRACE: clssnmSendSync: syncSeqNo(10) [ CSSD]2011-04-23 17:13:18.329 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(2) [ CSSD]2011-04-23 17:13:18.329 [89033616] >TRACE: clssnmHandleSync: diskTimeout set to (57000)ms [ CSSD]2011-04-23 17:13:18.329 [89033616] >TRACE: clssnmHandleSync: Acknowledging sync: src[3] srcName[vrh3] seq[21] sync[10] [ CSSD]2011-04-23 17:13:18.330 [8136912] >USER: NMEVENT_SUSPEND [00][00] [00][0e] [ CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(11) [ CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE: clssnmDoSyncUpdate: Terminating node 1, vrh1, misstime(60010) state(5) [ CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (13) [ CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE [ CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ACTIVE [ CSSD]2011-04-23 17:13:18.334 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(2) [ CSSD]2011-04-23 17:13:18.335 [89033616] >TRACE: clssnmSendVoteInfo:
  10. 10. node(3) syncSeqNo(10) [ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(13) 以上完成了 2-3 节点间的 Vote mesg 通信,这些信息包含 Node identifier,GM peer to peer listening endpoint 以及 View of cluster membership。 [ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmCheckDskInfo: Checking disk info... 开始检测 voting disk 上的信息 [ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmCheckDskInfo: node 1, vrh1, state 5 with leader 1 has smaller cluster size 1; my cluster size 2 with leader 2 发现其他子集群,包含 1 号节点且 1 号节点为该子集群的 leader,为最小子集群;3 号与 2 号节点组成最大子 集群,2 号节点为 leader 节点 [ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmEvict: Start [ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmEvict: Evicting node 1, vrh1, birth 9, death 10, impendingrcfg 1, stateflags 0x40d 发起对 1 号节点的驱逐 [ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmSendShutdown: req to node 1, kill time 443294 [ CSSD]2011-04-23 17:13:18.339 [3032460176] >TRACE: clssnmDiscHelper: vrh1, node(1) connection failed, con (0xb7eaf220), probe((nil)) [ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmWaitOnEvictions: Start [ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmWaitOnEvictions: node 1, vrh1, undead 1 [ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmCheckKillStatus: Node 1, vrh1, down, LATS(443144),timeout(150) clssnmCheckKillStatus 检查 1 号节点是否 down 了 [ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (15) [ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE [ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ACTIVE [ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmSendUpdate: syncSeqNo(10) [ CSSD]2011-04-23 17:13:18.341 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(15), ackCount(2) [ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new) [ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmUpdateNodeState: node 1, state (5/0) unique (1303592601/1303592601) prevConuni(1303592601) birth (9/9) (old/new) [ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmDeactivateNode: node 1 (vrh1) left cluster [ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmUpdateNodeState: node 2, state (3/3) unique (1303591210/1303591210) prevConuni(0) birth (2/2) (old/new)
  11. 11. [ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmUpdateNodeState: node 3, state (3/3) unique (1303591326/1303591326) prevConuni(0) birth (3/3) (old/new) [ CSSD]2011-04-23 17:13:18.342 [89033616] >USER: clssnmHandleUpdate: SYNC(10) from node(3) completed [ CSSD]2011-04-23 17:13:18.342 [89033616] >USER: clssnmHandleUpdate: NODE 2 (vrh2) IS ACTIVE MEMBER OF CLUSTER [ CSSD]2011-04-23 17:13:18.342 [89033616] >USER: clssnmHandleUpdate: NODE 3 (vrh3) IS ACTIVE MEMBER OF CLUSTER [ CSSD]2011-04-23 17:13:18.342 [89033616] >TRACE: clssnmHandleUpdate: diskTimeout set to (200000)ms [ CSSD]2011-04-23 17:13:18.347 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(15) [ CSSD]2011-04-23 17:13:18.348 [3032460176] >TRACE: clssnmDoSyncUpdate: Sync 10 complete! [ CSSD]2011-04-23 17:13:18.350 [3021970320] >TRACE: clssgmReconfigThread: started for reconfig (10) [ CSSD]2011-04-23 17:13:18.350 [3021970320] >USER: NMEVENT_RECONFIG [00] [00][00][0c] [ CSSD]2011-04-23 17:13:18.351 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock crs_version type 2 [ CSSD]2011-04-23 17:13:18.352 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(crs_version) birth(9/9) [ CSSD]2011-04-23 17:13:18.353 [3063929744] >TRACE: clssgmDispatchCMXMSG(): got message type(7) src(2) incarn(10) during incarn(9/9) [ CSSD]2011-04-23 17:13:18.354 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_FAILOVER type 3 [ CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2 [ CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(EVMDMAIN) birth(9/9) [ CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2 [ CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(CRSDMAIN) birth(9/9) [ CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh1 type 3 [ CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh2 type 3 [ CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh3 type 3 [ CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock ocr_crs type 2 [ CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(ocr_crs) birth(9/9) [ CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock #CSS_CLSSOMON type 2 [ CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(#CSS_CLSSOMON) birth(9/9) [ CSSD]2011-04-23 17:13:18.357 [3021970320] >TRACE: clssgmEstablishConnections: 2 nodes in cluster incarn 10 [ CSSD]2011-04-23 17:13:18.366 [3063929744] >TRACE: clssgmPeerDeactivate: node 1 (vrh1), death 10, state 0x80000000 connstate 0xa [ CSSD]2011-04-23 17:13:18.367 [3063929744] >TRACE: clssgmHandleDBDone(): src/dest (2/65535) size(68) incarn 10 [ CSSD]2011-04-23 17:13:18.367 [3063929744] >TRACE: clssgmPeerListener: connects done (2/2) [ CSSD]2011-04-23 17:13:18.369 [3021970320] >TRACE: clssgmEstablishMasterNode: MASTER for 10 is node(2) birth(2) 更新阶段
  12. 12. [ CSSD]CLSS-3000: reconfiguration successful, incarnation 10 with 2 nodes [ CSSD]CLSS-3001: local node number 3, master node number 2 [ CSSD]2011-04-23 17:13:18.372 [3021970320] >TRACE: clssgmReconfigThread: completed for reconfig(10), with status(1) 2.另一场景为 1 号节点未加入集群,2 号节点的网络失败,因 2 号节点的 member number 较 小故其通过 voting disk 向 3 号节点发起驱逐 以下为 2 号节点的 ocssd.log 日志 [ CSSD]2011-04-23 17:41:48.643 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 50 8.910601e-269artbeat fatal, eviction in 29.890 seconds [ CSSD]2011-04-23 17:41:48.643 [3053439888] >TRACE: clssnmPollingThread: node vrh3 (3) is impending reconfig, flag 1037, misstime 30110 [ CSSD]2011-04-23 17:41:48.643 [3053439888] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1) [ CSSD]2011-04-23 17:41:50.132 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 50 8.910601e-269artbeat fatal, eviction in 28.890 seconds [ CSSD]2011-04-23 17:42:10.533 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 75 8.910601e-269artbeat fatal, eviction in 14.860 seconds [ CSSD]2011-04-23 17:42:11.962 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 75 8.910601e-269artbeat fatal, eviction in 13.860 seconds [ CSSD]2011-04-23 17:42:23.523 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 5.840 seconds [ CSSD]2011-04-23 17:42:24.989 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 4.840 seconds [ CSSD]2011-04-23 17:42:26.423 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 3.840 seconds [ CSSD]2011-04-23 17:42:27.890 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 2.840 seconds [ CSSD]2011-04-23 17:42:29.382 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 1.840 seconds [ CSSD]2011-04-23 17:42:30.832 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 0.830 seconds [ CSSD]2011-04-23 17:42:32.020 [3053439888] >TRACE: clssnmPollingThread: Eviction started for node vrh3 (3), flags 0x040d, state 3, wt4c 0 [ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmDoSyncUpdate: Initiating sync 13 [ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (57000)ms [ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (11) [ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ALIVE [ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmSendSync: syncSeqNo(13) [ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(1) [ CSSD]2011-04-23 17:42:32.021 [117574544] >TRACE: clssnmHandleSync: diskTimeout set to (57000)ms [ CSSD]2011-04-23 17:42:32.021 [117574544] >TRACE: clssnmHandleSync: Acknowledging sync: src[2] srcName[vrh2] seq[13] sync[13] [ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(11) [ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmDoSyncUpdate: Terminating node 3, vrh3, misstime(60000) state(5) [ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (13) [ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE
  13. 13. [ CSSD]2011-04-23 17:42:32.021 [5073104] >USER: NMEVENT_SUSPEND [00][00] [00][0c] [ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1) [ CSSD]2011-04-23 17:42:32.022 [117574544] >TRACE: clssnmSendVoteInfo: node(2) syncSeqNo(13) [ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(13) [ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmCheckDskInfo: Checking disk info... [ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmCheckDskInfo: node 3, vrh3, state 5 with leader 3 has smaller cluster size 1; my cluster size 1 with leader 2 检查 voting disk 后发现子集群 3 为最小"子集群"(3 号节点的 node number 较 2 号大);2 号节点为最大 子集群 [ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmEvict: Start [ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmEvict: Evicting node 3, vrh3, birth 3, death 13, impendingrcfg 1, stateflags 0x40d [ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmSendShutdown: req to node 3, kill time 1643084 发起对 3 号节点的驱逐和 shutdown request [ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmDiscHelper: vrh3, node(3) connection failed, con (0xb7e79bb0), probe((nil)) [ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmWaitOnEvictions: Start [ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmWaitOnEvictions: node 3, vrh3, undead 1 [ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmCheckKillStatus: Node 3, vrh3, down, LATS(1642874),timeout(210) [ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (15) [ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE [ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmSendUpdate: syncSeqNo(13) [ CSSD]2011-04-23 17:42:32.024 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(15), ackCount(1) [ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new) [ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmUpdateNodeState: node 1, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new) [ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmUpdateNodeState: node 2, state (3/3) unique (1303591210/1303591210) prevConuni(0) birth (2/2) (old/new) [ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmUpdateNodeState: node 3, state (5/0) unique (1303591326/1303591326) prevConuni(1303591326) birth (3/3) (old/new) [ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmDeactivateNode: node 3 (vrh3) left cluster [ CSSD]2011-04-23 17:42:32.024 [117574544] >USER: clssnmHandleUpdate: SYNC(13) from node(2) completed [ CSSD]2011-04-23 17:42:32.024 [117574544] >USER: clssnmHandleUpdate: NODE 2 (vrh2) IS ACTIVE MEMBER OF CLUSTER [ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmHandleUpdate: diskTimeout set to (200000)ms [ CSSD]2011-04-23 17:42:32.024 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(15) [ CSSD]2011-04-23 17:42:32.024 [3032460176] >TRACE: clssnmDoSyncUpdate:
  14. 14. Sync 13 complete! [ CSSD]2011-04-23 17:42:32.024 [3021970320] >TRACE: clssgmReconfigThread: started for reconfig (13) [ CSSD]2011-04-23 17:42:32.024 [3021970320] >USER: NMEVENT_RECONFIG [00] [00][00][04] [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock crs_version type 2 [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(crs_version) birth(3/3) [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_FAILOVER type 3 [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(_ORA_CRS_FAILOVER) birth(3/3) [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2 [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(EVMDMAIN) birth(3/3) [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2 [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(CRSDMAIN) birth(3/3) [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh1 type 3 [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(_ORA_CRS_MEMBER_vrh1) birth(3/3) [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh3 type 3 [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(_ORA_CRS_MEMBER_vrh3) birth(3/3) [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock ocr_crs type 2 [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(ocr_crs) birth(3/3) [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock #CSS_CLSSOMON type 2 [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(#CSS_CLSSOMON) birth(3/3) [ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmEstablishConnections: 1 nodes in cluster incarn 13 [ CSSD]2011-04-23 17:42:32.026 [3063929744] >TRACE: clssgmPeerDeactivate: node 3 (vrh3), death 13, state 0x0 connstate 0xf [ CSSD]2011-04-23 17:42:32.026 [3063929744] >TRACE: clssgmPeerListener: connects done (1/1) [ CSSD]2011-04-23 17:42:32.026 [3021970320] >TRACE: clssgmEstablishMasterNode: MASTER for 13 is node(2) birth(2) [ CSSD]2011-04-23 17:42:32.026 [3021970320] >TRACE: clssgmMasterCMSync: Synchronizing group/lock status [ CSSD]2011-04-23 17:42:32.026 [3021970320] >TRACE: clssgmMasterSendDBDone: group/lock status synchronization complete [ CSSD]CLSS-3000: reconfiguration successful, incarnation 13 with 1 nodes [ CSSD]CLSS-3001: local node number 2, master node number 2 完成 reconfiguration [ CSSD]2011-04-23 17:42:32.027 [3021970320] >TRACE: clssgmReconfigThread: completed for reconfig(13), with status(1)
  15. 15. 以下为 3 号节点的 ocssd.log 日志: [ CSSD]2011-04-23 17:42:33.204 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 50 1.867300e-268artbeat fatal, eviction in 29.360 seconds [ CSSD]2011-04-23 17:42:33.204 [3053439888] >TRACE: clssnmPollingThread: node vrh2 (2) is impending reconfig, flag 1039, misstime 30640 [ CSSD]2011-04-23 17:42:33.204 [3053439888] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1) [ CSSD]2011-04-23 17:42:55.168 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 75 1.867300e-268artbeat fatal, eviction in 14.330 seconds [ CSSD]2011-04-23 17:43:08.182 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 5.310 seconds [ CSSD]2011-04-23 17:43:09.661 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 4.300 seconds [ CSSD]2011-04-23 17:43:11.144 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 3.300 seconds [ CSSD]2011-04-23 17:43:12.634 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 2.300 seconds [ CSSD]2011-04-23 17:43:14.053 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 1.300 seconds [ CSSD]2011-04-23 17:43:15.467 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 0.300 seconds [ CSSD]2011-04-23 17:43:15.911 [3053439888] >TRACE: clssnmPollingThread: Eviction started for node vrh2 (2), flags 0x040f, state 3, wt4c 0 [ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmDoSyncUpdate: Initiating sync 13 [ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (57000)ms [ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (11) [ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ALIVE [ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmSendSync: syncSeqNo(13) [ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(1) [ CSSD]2011-04-23 17:43:15.912 [89033616] >TRACE: clssnmHandleSync: diskTimeout set to (57000)ms [ CSSD]2011-04-23 17:43:15.912 [89033616] >TRACE: clssnmHandleSync: Acknowledging sync: src[3] srcName[vrh3] seq[29] sync[13] [ CSSD]2011-04-23 17:43:15.912 [8136912] >USER: NMEVENT_SUSPEND [00][00] [00][0c] [ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(11) [ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmDoSyncUpdate: Terminating node 2, vrh2, misstime(60010) state(5) [ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (13) [ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ACTIVE [ CSSD]2011-04-23 17:43:15.913 [89033616] >TRACE: clssnmSendVoteInfo: node(3) syncSeqNo(13) [ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1) [ CSSD]2011-04-23 17:43:15.913 [3032460176] >TRACE: clssnmCheckDskInfo: Checking disk info... [ CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain. [ CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR: : my node(3), Leader(3), Size(1) VS Node(2), Leader(2), Size(1) 读取 voting disk 后发现 kill block,为避免 split brain,自我 aborting!
  16. 16. [ CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR: ################################### [ CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR: clssscExit: CSSD aborting [ CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR: ################################### [ CSSD]--- DUMP GROCK STATE DB --- [ CSSD]---------- [ CSSD] type 2, Id 4, Name = (crs_version) [ CSSD] flags: 0x1000 [ CSSD] grant: count=0, type 0, wait 0 [ CSSD] Member Count =2, master 2 [ CSSD] . . . . . [ CSSD] memberNo =2, seq 2 [ CSSD] flags = 0x0, granted 0 [ CSSD] refCnt = 1 [ CSSD] nodeNum = 3, nodeBirth 3 [ CSSD] privateDataSize = 0 [ CSSD] publicDataSize = 0 [ CSSD] . . . . . [ CSSD] memberNo =1, seq 12 [ CSSD] flags = 0x1000, granted 0 [ CSSD] refCnt = 1 [ CSSD] nodeNum = 2, nodeBirth 2 [ CSSD] privateDataSize = 0 [ CSSD] publicDataSize = 0 [ CSSD]---------- [ CSSD]---------- [ CSSD] type 3, Id 11, Name = (_ORA_CRS_FAILOVER) [ CSSD] flags: 0x0 [ CSSD] grant: count=1, type 3, wait 1 [ CSSD] Member Count =1, master -3 [ CSSD] . . . . . [ CSSD] memberNo =0, seq 0 [ CSSD] flags = 0x12, granted 1 [ CSSD] refCnt = 1 [ CSSD] nodeNum = 3, nodeBirth 3 [ CSSD] privateDataSize = 0 [ CSSD] publicDataSize = 0 [ CSSD]---------- [ CSSD]---------- [ CSSD] type 2, Id 2, Name = (EVMDMAIN) [ CSSD] flags: 0x1000 [ CSSD] grant: count=0, type 0, wait 0 [ CSSD] Member Count =2, master 2 [ CSSD] . . . . . [ CSSD] memberNo =2, seq 1 [ CSSD] flags = 0x0, granted 0 [ CSSD] refCnt = 1 [ CSSD] nodeNum = 2, nodeBirth 2 [ CSSD] privateDataSize = 508 [ CSSD] publicDataSize = 504 [ CSSD] . . . . . [ CSSD] memberNo =3, seq 2 [ CSSD] flags = 0x0, granted 0 [ CSSD] refCnt = 1 [ CSSD] nodeNum = 3, nodeBirth 3 [ CSSD] privateDataSize = 508 [ CSSD] publicDataSize = 504 [ CSSD]---------- [ CSSD]---------- [ CSSD] type 2, Id 5, Name = (CRSDMAIN) [ CSSD] flags: 0x1000 [ CSSD] grant: count=0, type 0, wait 0
  17. 17. [ CSSD] Member Count =1, master 3 [ CSSD] . . . . . [ CSSD] memberNo =3, seq 2 [ CSSD] flags = 0x0, granted 0 [ CSSD] refCnt = 1 [ CSSD] nodeNum = 3, nodeBirth 3 [ CSSD] privateDataSize = 128 [ CSSD] publicDataSize = 128 [ CSSD]---------- [ CSSD]---------- [ CSSD] type 3, Id 12, Name = (_ORA_CRS_MEMBER_vrh1) [ CSSD] flags: 0x0 [ CSSD] grant: count=1, type 3, wait 1 [ CSSD] Member Count =1, master -3 [ CSSD] . . . . . [ CSSD] memberNo =0, seq 0 [ CSSD] flags = 0x12, granted 1 [ CSSD] refCnt = 1 [ CSSD] nodeNum = 3, nodeBirth 3 [ CSSD] privateDataSize = 0 [ CSSD] publicDataSize = 0 [ CSSD]---------- [ CSSD]---------- [ CSSD] type 3, Id 12, Name = (_ORA_CRS_MEMBER_vrh3) [ CSSD] flags: 0x0 [ CSSD] grant: count=1, type 3, wait 1 [ CSSD] Member Count =1, master -3 [ CSSD] . . . . . [ CSSD] memberNo =0, seq 0 [ CSSD] flags = 0x12, granted 1 [ CSSD] refCnt = 1 [ CSSD] nodeNum = 3, nodeBirth 3 [ CSSD] privateDataSize = 0 [ CSSD] publicDataSize = 0 [ CSSD]---------- [ CSSD]---------- [ CSSD] type 2, Id 3, Name = (ocr_crs) [ CSSD] flags: 0x1000 [ CSSD] grant: count=0, type 0, wait 0 [ CSSD] Member Count =2, master 3 [ CSSD] . . . . . [ CSSD] memberNo =3, seq 2 [ CSSD] flags = 0x0, granted 0 [ CSSD] refCnt = 1 [ CSSD] nodeNum = 3, nodeBirth 3 [ CSSD] privateDataSize = 0 [ CSSD] publicDataSize = 32 [ CSSD] . . . . . [ CSSD] memberNo =2, seq 12 [ CSSD] flags = 0x1000, granted 0 [ CSSD] refCnt = 1 [ CSSD] nodeNum = 2, nodeBirth 2 [ CSSD] privateDataSize = 0 [ CSSD] publicDataSize = 32 [ CSSD]---------- [ CSSD]---------- [ CSSD] type 2, Id 1, Name = (#CSS_CLSSOMON) [ CSSD] flags: 0x1000 [ CSSD] grant: count=0, type 0, wait 0 [ CSSD] Member Count =2, master 2 [ CSSD] . . . . . [ CSSD] memberNo =2, seq 1 [ CSSD] flags = 0x1000, granted 0 [ CSSD] refCnt = 1
  18. 18. [ CSSD] nodeNum = 2, nodeBirth 2 [ CSSD] privateDataSize = 0 [ CSSD] publicDataSize = 0 [ CSSD] . . . . . [ CSSD] memberNo =3, seq 2 [ CSSD] flags = 0x1000, granted 0 [ CSSD] refCnt = 1 [ CSSD] nodeNum = 3, nodeBirth 3 [ CSSD] privateDataSize = 0 [ CSSD] publicDataSize = 0 [ CSSD]---------- [ CSSD]--- END OF GROCK STATE DUMP --- [ CSSD]------- Begin Dump ------- © 2011, www.oracledatabase12g.com. 版权所有.文章允许转载,但必须以链接方式注明源地址, 否则追求法律责任.

×