Oracle 11g RAC 节点异常重启问题分析

一、背景

在国庆期间巡检的时候,发现数据库alert日志中出现了异常重启的信息,当即对该报错进行分析处理。

二、处理过程

(1)数据库告警日志分析

node1 alert:
Sat Oct 05 13:05:14 2024
Thread 1 advanced to log sequence 6981 (LGWR switch)
  Current log# 11 seq# 6981 mem# 0: +DATA/ybqddb/onlinelog/group_11.302.1144593261
Sat Oct 05 13:05:15 2024
Archived Log entry 12130 added for thread 1 sequence 6980 ID 0x8d497377 dest 1:
Sat Oct 05 14:50:48 2024
Reconfiguration started (old inc 27, new inc 29)
List of instances:
 1 (myinst: 1) 
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE 
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Sat Oct 05 14:50:48 2024
Sat Oct 05 14:50:48 2024
 LMS 3: 1 GCS shadows cancelled, 1 closed, 0 Xw survived
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:50:48 2024
 LMS 1: 1 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:50:48 2024
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Sat Oct 05 14:50:48 2024
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
 parallel recovery started with 32 processes
Started redo scan
Completed redo scan
 read 76 KB redo, 16 data blocks need recovery
Started redo application at
 Thread 2: logseq 5168, block 86069
Sat Oct 05 14:50:53 2024
Setting Resource Manager plan SCHEDULER[0x32DE]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Recovery of Online Redo Log: Thread 2 Group 10 Seq 5168 Reading mem 0
  Mem# 0: +DATA/ybqddb/onlinelog/group_10.300.1144593255
Completed redo application of 0.01MB
Completed instance recovery at
 Thread 2: logseq 5168, block 86222, scn 245107035
 16 data blocks read, 16 data blocks written, 76 redo k-bytes read
Sat Oct 05 14:50:53 2024
minact-scn: master found reconf/inst-rec before recscn scan old-inc#:29 new-inc#:29
Thread 2 advanced to log sequence 5169 (thread recovery)
Redo thread 2 internally disabled at seq 5169 (SMON)
Sat Oct 05 14:50:54 2024
Archived Log entry 12131 added for thread 2 sequence 5168 ID 0x8d497377 dest 1:
Sat Oct 05 14:50:54 2024
ARC2: Archiving disabled thread 2 sequence 5169
Archived Log entry 12132 added for thread 2 sequence 5169 ID 0x8d497377 dest 1:
minact-scn: master continuing after IR
minact-scn: Master considers inst:2 dead
Sat Oct 05 14:51:49 2024
Decreasing number of real time LMS from 4 to 0
Sat Oct 05 14:52:11 2024
Reconfiguration started (old inc 29, new inc 31)
List of instances:
 1 2 (myinst: 1) 
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Sat Oct 05 14:52:11 2024
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:52:11 2024
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:52:11 2024
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:52:11 2024
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
Sat Oct 05 14:52:11 2024
minact-scn: Master returning as live inst:2 has inc# mismatch instinc:0 cur:31 errcnt:0
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Sat Oct 05 14:53:26 2024
Increasing number of real time LMS from 0 to 4
Sat Oct 05 17:05:20 2024
ALTER SYSTEM ARCHIVE LOG
Sat Oct 05 17:05:21 2024
Thread 1 advanced to log sequence 6982 (LGWR switch)
  Current log# 5 seq# 6982 mem# 0: +DATA/ybqddb/onlinelog/group_5.290.1144593225
Sat Oct 05 17:05:21 2024
Archived Log entry 12134 added for thread 1 sequence 6981 ID 0x8d497377 dest 1:
Sat Oct 05 21:05:22 2024
ALTER SYSTEM ARCHIVE LOG
Sat Oct 05 21:05:22 2024
Thread 1 advanced to log sequence 6983 (LGWR switch)
  Current log# 7 seq# 6983 mem# 0: +DATA/ybqddb/onlinelog/group_7.294.1144593235
Sat Oct 05 21:05:23 2024
Archived Log entry 12135 added for thread 1 sequence 6982 ID 0x8d497377 dest 1:
Sun Oct 06 01:08:47 2024
ALTER SYSTEM ARCHIVE LOG
Sun Oct 06 01:08:49 2024
Thread 1 advanced to log sequence 6984 (LGWR switch)
  Current log# 9 seq# 6984 mem# 0: +DATA/ybqddb/onlinelog/group_9.298.1144593249
Sun Oct 06 01:08:49 2024
Archived Log entry 12138 added for thread 1 sequence 6983 ID 0x8d497377 dest 1:
Sun Oct 06 05:05:18 2024
ALTER SYSTEM ARCHIVE LOG
Sun Oct 06 05:05:18 2024
Thread 1 advanced to log sequence 6985 (LGWR switch)
  Current log# 11 seq# 6985 mem# 0: +DATA/ybqddb/onlinelog/group_11.302.1144593261
Archived Log entry 12139 added for thread 2 sequence 5173 ID 0x8d497377 dest 1:
Sun Oct 06 05:05:19 2024

node2 alert:

Sat Oct 05 13:05:14 2024
Archived Log entry 12129 added for thread 2 sequence 5167 ID 0x8d497377 dest 1:
Sat Oct 05 14:50:47 2024
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/ybqddb/ybqddb2/trace/ybqddb2_asmb_15097.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 2109 Serial number: 3
Errors in file /u01/app/oracle/diag/rdbms/ybqddb/ybqddb2/trace/ybqddb2_asmb_15097.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 2109 Serial number: 3
ASMB (ospid: 15097): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 15097
Sat Oct 05 14:51:59 2024
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED
 
Total Shared Global Region in Large Pages = 0 KB (0%)
 
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB
 
从节点2日志可以看出,在14:50:47出现与asm实例通信失败报错,并于14:51:59该节点实例开始启动。

(2)查看crs及ocssd日志

node1 crs日志:

2024-07-26 03:02:59.119: 
[client(240419)]CRS-10051:CVU found following errors with Clusterware setup : PRVF-4657 : Name resolution setup check for "nxybj-scan" (IP address: 172.31.122.38) failedPRVF-4664 : Found inconsistent name resolution entries for SCAN name "nxybj-scan"

2024-10-05 14:50:31.281: 
[cssd(10595)]CRS-1612:Network communication with node nxybj2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.200 seconds
2024-10-05 14:50:38.282: 
[cssd(10595)]CRS-1611:Network communication with node nxybj2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 7.200 seconds
2024-10-05 14:50:43.283: 
[cssd(10595)]CRS-1610:Network communication with node nxybj2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.200 seconds
2024-10-05 14:50:45.484: 
[cssd(10595)]CRS-1607:Node nxybj2 is being evicted in cluster incarnation 587229053; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/nxybj1/cssd/ocssd.log.
2024-10-05 14:50:47.994: 
[cssd(10595)]CRS-1625:Node nxybj2, number 2, was manually shut down
2024-10-05 14:50:47.999: 
[cssd(10595)]CRS-1601:CSSD Reconfiguration complete. Active nodes are nxybj1 .
2024-10-05 14:50:48.000: 
[crsd(14356)]CRS-5504:Node down event reported for node 'nxybj2'.
2024-10-05 14:50:51.248: 
[crsd(14356)]CRS-2773:Server 'nxybj2' has been removed from pool 'Generic'.
2024-10-05 14:50:51.249: 
[crsd(14356)]CRS-2773:Server 'nxybj2' has been removed from pool 'ora.orcl'.
2024-10-05 14:50:51.249: 
[crsd(14356)]CRS-2773:Server 'nxybj2' has been removed from pool 'ora.ybqddb'.
2024-10-05 14:50:51.249: 
[crsd(14356)]CRS-2773:Server 'nxybj2' has been removed from pool 'ora.ybzxddb'.
2024-10-05 14:51:13.084: 
[cssd(10595)]CRS-1601:CSSD Reconfiguration complete. Active nodes are nxybj1 nxybj2 .
2024-10-05 14:51:42.117: 
[crsd(14356)]CRS-2772:Server 'nxybj2' has been assigned to pool 'Generic'.
2024-10-05 14:51:42.118: 
[crsd(14356)]CRS-2772:Server 'nxybj2' has been assigned to pool 'ora.ybzxddb'.
2024-10-05 14:51:42.118: 
[crsd(14356)]CRS-2772:Server 'nxybj2' has been assigned to pool 'ora.ybqddb'.
2024-10-05 14:51:42.118: 
[crsd(14356)]CRS-2772:Server 'nxybj2' has been assigned to pool 'ora.orcl'.

node1 ocssd日志:
2024-10-05 14:50:28.553: [    CSSD][2946451200]clssnmSendingThread: sent 5 status msgs to all nodes
2024-10-05 14:50:31.281: [    CSSD][2948028160]clssnmPollingThread: node nxybj2 (2) at 50% heartbeat fatal, removal in 14.200 seconds
2024-10-05 14:50:31.281: [    CSSD][2948028160]clssnmPollingThread: node nxybj2 (2) is impending reconfig, flag 2229260, misstime 15800
2024-10-05 14:50:31.281: [    CSSD][2948028160]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2024-10-05 14:50:31.282: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693547, LATS 2409222280, lastSeqNo 108691578, uniqueness 1708519428, timestamp 1728111030/2407806020
2024-10-05 14:50:31.282: [    CSSD][3164333824]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693549, LATS 2409222280, lastSeqNo 108693378, uniqueness 1708519428, timestamp 1728111030/2407806140
2024-10-05 14:50:31.691: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2409222690
2024-10-05 14:50:31.781: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2409222780
2024-10-05 14:50:31.784: [    CSSD][3169081088]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693551, LATS 2409222790, lastSeqNo 108691579, uniqueness 1708519428, timestamp 1728111031/2407807110
2024-10-05 14:50:31.811: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2409222810
2024-10-05 14:50:32.193: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223200
2024-10-05 14:50:32.282: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693553, LATS 2409223280, lastSeqNo 108693547, uniqueness 1708519428, timestamp 1728111032/2407807530
2024-10-05 14:50:32.291: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223290
2024-10-05 14:50:32.321: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223320
2024-10-05 14:50:32.701: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223700
2024-10-05 14:50:32.783: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693556, LATS 2409223790, lastSeqNo 108693553, uniqueness 1708519428, timestamp 1728111032/2407808040
2024-10-05 14:50:32.793: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223790
2024-10-05 14:50:32.823: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223820
2024-10-05 14:50:33.203: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224200
2024-10-05 14:50:33.282: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693559, LATS 2409224280, lastSeqNo 108693556, uniqueness 1708519428, timestamp 1728111033/2407808540
2024-10-05 14:50:33.301: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224300
2024-10-05 14:50:33.331: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224330
2024-10-05 14:50:33.554: [    CSSD][2946451200]clssnmSendingThread: sending status msg to all nodes
2024-10-05 14:50:33.554: [    CSSD][2946451200]clssnmSendingThread: sent 5 status msgs to all nodes
2024-10-05 14:50:33.711: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224710
2024-10-05 14:50:33.784: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693562, LATS 2409224790, lastSeqNo 108693559, uniqueness 1708519428, timestamp 1728111033/2407809050
2024-10-05 14:50:33.803: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224800
2024-10-05 14:50:33.833: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224840
2024-10-05 14:50:34.213: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225210
2024-10-05 14:50:34.282: [    CSSD][3169081088]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693563, LATS 2409225280, lastSeqNo 108693551, uniqueness 1708519428, timestamp 1728111033/2407809130
2024-10-05 14:50:34.282: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693565, LATS 2409225280, lastSeqNo 108693562, uniqueness 1708519428, timestamp 1728111034/2407809550
2024-10-05 14:50:34.311: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225310
2024-10-05 14:50:34.341: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225340
2024-10-05 14:50:34.721: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225720
2024-10-05 14:50:34.784: [    CSSD][3164333824]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693567, LATS 2409225790, lastSeqNo 108693549, uniqueness 1708519428, timestamp 1728111034/2407809680
2024-10-05 14:50:34.784: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693568, LATS 2409225790, lastSeqNo 108693565, uniqueness 1708519428, timestamp 1728111034/2407810050
2024-10-05 14:50:34.813: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225810
 

                                                                                                                            
node2 crs日志:
[ctssd(11996)]CRS-2409:The clock on host nxybj2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2024-10-03 05:57:32.243: 
[ctssd(11996)]CRS-2409:The clock on host nxybj2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2024-10-05 14:50:31.036: 
[cssd(10538)]CRS-1612:Network communication with node nxybj1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.520 seconds
2024-10-05 14:50:39.038: 
[cssd(10538)]CRS-1611:Network communication with node nxybj1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.520 seconds
2024-10-05 14:50:43.039: 
[cssd(10538)]CRS-1610:Network communication with node nxybj1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.520 seconds
2024-10-05 14:50:45.560: 
[cssd(10538)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:45.560: 
[cssd(10538)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log
2024-10-05 14:50:45.560: 
[cssd(10538)]CRS-1608:This node was evicted by node 1, nxybj1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:45.560: 
[cssd(10538)]CRS-1608:This node was evicted by node 1, nxybj1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:45.577: 
[cssd(10538)]CRS-1652:Starting clean up of CRSD resources.
2024-10-05 14:50:46.038: 
[cssd(10538)]CRS-1608:This node was evicted by node 1, nxybj1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:46.803: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14456)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:50:47.404: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14456)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:50:47.405: 
[cssd(10538)]CRS-1654:Clean up of CRSD resources finished successfully.
2024-10-05 14:50:47.406: 
[cssd(10538)]CRS-1655:CSSD on node nxybj2 detected a problem and started to shutdown.
2024-10-05 14:50:47.434: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14698)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:7:28777} in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2024-10-05 14:50:47.436: 
[ohasd(10322)]CRS-2765:Resource 'ora.crsd' has failed on server 'nxybj2'.
2024-10-05 14:50:47.437: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14456)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:4} in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_grid/oraagent_grid.log.
2024-10-05 14:50:47.437: 
[/u01/app/11.2.0/grid/bin/orarootagent.bin(14460)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:3:2522} in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/orarootagent_root/orarootagent_root.log.
2024-10-05 14:50:47.691: 
[cssd(10538)]CRS-1660:The CSS daemon shutdown has completed
2024-10-05 14:50:47.846: 
[ohasd(10322)]CRS-2765:Resource 'ora.evmd' has failed on server 'nxybj2'.
2024-10-05 14:50:47.848: 
[ohasd(10322)]CRS-2765:Resource 'ora.ctssd' has failed on server 'nxybj2'.
2024-10-05 14:50:48.452: 
[crsd(345044)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in /u01/app/11.2.0/grid/log/nxybj2/crsd/crsd.log.
2024-10-05 14:50:48.598: 
[ohasd(10322)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'nxybj2'.
2024-10-05 14:50:48.603: 
[ohasd(10322)]CRS-2765:Resource 'ora.crsd' has failed on server 'nxybj2'.
2024-10-05 14:50:48.861: 
[ctssd(348486)]CRS-2402:The Cluster Time Synchronization Service aborted on host nxybj2. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/nxybj2/ctssd/octssd.log.
2024-10-05 14:50:49.102: 
[ohasd(10322)]CRS-2765:Resource 'ora.cssd' has failed on server 'nxybj2'.
2024-10-05 14:50:49.105: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:50:49.271: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:50:49.272: 
[ohasd(10322)]CRS-2765:Resource 'ora.asm' has failed on server 'nxybj2'.
2024-10-05 14:50:51.030: 
[cssd(349990)]CRS-1713:CSSD daemon is started in clustered mode
2024-10-05 14:50:52.863: 
[ohasd(10322)]CRS-2878:Failed to restart resource 'ora.ctssd'
2024-10-05 14:50:52.863: 
[ohasd(10322)]CRS-2769:Unable to failover resource 'ora.ctssd'.
2024-10-05 14:51:06.688: 
[cssd(349990)]CRS-1707:Lease acquisition for node nxybj2 number 2 completed
2024-10-05 14:51:07.344: 
[ohasd(10322)]CRS-2765:Resource 'ora.cluster_interconnect.haip' has failed on server 'nxybj2'.
2024-10-05 14:51:07.954: 
[cssd(349990)]CRS-1605:CSSD voting file is online: /dev/asmdisk/ocr02; details in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:51:07.961: 
[cssd(349990)]CRS-1605:CSSD voting file is online: /dev/asmdisk/ocr03; details in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:51:07.968: 
[cssd(349990)]CRS-1605:CSSD voting file is online: /dev/asmdisk/ocr01; details in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:51:13.083: 
[cssd(349990)]CRS-1601:CSSD Reconfiguration complete. Active nodes are nxybj1 nxybj2 .
2024-10-05 14:51:15.026: 
[ctssd(351425)]CRS-2403:The Cluster Time Synchronization Service on host nxybj2 is in observer mode.
2024-10-05 14:51:15.261: 
[ctssd(351425)]CRS-2407:The new Cluster Time Synchronization Service reference node is host nxybj1.
2024-10-05 14:51:15.261: 
[ctssd(351425)]CRS-2401:The Cluster Time Synchronization Service started on host nxybj2.
2024-10-05 14:51:17.018: 
[ohasd(10322)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2024-10-05 14:51:17.018: 
[ohasd(10322)]CRS-2769:Unable to failover resource 'ora.diskmon'.
2024-10-05 14:51:22.920: 
[ohasd(10322)]CRS-2878:Failed to restart resource 'ora.cluster_interconnect.haip'
2024-10-05 14:51:22.920: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:51:23.116: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:51:30.099: 
[ohasd(10322)]CRS-2878:Failed to restart resource 'ora.asm'
2024-10-05 14:51:31.109: 
[ohasd(10322)]CRS-2878:Failed to restart resource 'ora.crsd'
2024-10-05 14:51:36.711: 
[crsd(353461)]CRS-1012:The OCR service started on node nxybj2.
2024-10-05 14:51:36.717: 
[evmd(348478)]CRS-1401:EVMD started on node nxybj2.
2024-10-05 14:51:37.724: 
[crsd(353461)]CRS-1201:CRSD started on node nxybj2.
2024-10-05 14:51:39.350: 

                                                                                                                    
node2 ocssd日志:
2024-10-05 14:50:31.036: [    CSSD][1568802560]clssnmPollingThread: node nxybj1 (1) at 50% heartbeat fatal, removal in 14.520 seconds
2024-10-05 14:50:31.036: [    CSSD][1568802560]clssnmPollingThread: node nxybj1 (1) is impending reconfig, flag 2491406, misstime 15480
2024-10-05 14:50:31.036: [    CSSD][1568802560]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2024-10-05 14:50:31.037: [    CSSD][1584641792]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592072, LATS 2407806370, lastSeqNo 108585015, uniqueness 1708518327, timestamp 1728111030/2409221770
2024-10-05 14:50:31.037: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592073, LATS 2407806370, lastSeqNo 108589502, uniqueness 1708518327, timestamp 1728111030/2409221810
2024-10-05 14:50:31.687: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2407807020
2024-10-05 14:50:31.777: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2407807110
2024-10-05 14:50:31.817: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2407807150
2024-10-05 14:50:32.037: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592076, LATS 2407807370, lastSeqNo 108592073, uniqueness 1708518327, timestamp 1728111031/2409222810
2024-10-05 14:50:32.197: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407807530
2024-10-05 14:50:32.287: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407807620
2024-10-05 14:50:32.327: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407807660
2024-10-05 14:50:32.480: [    CSSD][1567225600]clssnmSendingThread: sending status msg to all nodes
2024-10-05 14:50:32.480: [    CSSD][1567225600]clssnmSendingThread: sent 5 status msgs to all nodes
2024-10-05 14:50:32.539: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592079, LATS 2407807880, lastSeqNo 108592076, uniqueness 1708518327, timestamp 1728111032/2409223320
2024-10-05 14:50:32.698: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407808040
2024-10-05 14:50:32.789: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407808130
2024-10-05 14:50:32.828: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407808170
2024-10-05 14:50:33.038: [    CSSD][1579894528]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592080, LATS 2407808370, lastSeqNo 108591299, uniqueness 1708518327, timestamp 1728111032/2409223700
2024-10-05 14:50:33.038: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592082, LATS 2407808370, lastSeqNo 108592079, uniqueness 1708518327, timestamp 1728111032/2409223820
2024-10-05 14:50:33.207: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407808540
2024-10-05 14:50:33.297: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407808630
2024-10-05 14:50:33.337: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407808670
2024-10-05 14:50:33.540: [    CSSD][1584641792]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592084, LATS 2407808880, lastSeqNo 108592072, uniqueness 1708518327, timestamp 1728111033/2409224300
2024-10-05 14:50:33.546: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592085, LATS 2407808880, lastSeqNo 108592082, uniqueness 1708518327, timestamp 1728111033/2409224330
2024-10-05 14:50:33.709: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407809050
2024-10-05 14:50:33.799: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407809130
2024-10-05 14:50:33.838: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407809170
2024-10-05 14:50:34.038: [    CSSD][1579894528]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592086, LATS 2407809370, lastSeqNo 108592080, uniqueness 1708518327, timestamp 1728111033/2409224710
2024-10-05 14:50:34.038: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592088, LATS 2407809370, lastSeqNo 108592085, uniqueness 1708518327, timestamp 1728111033/2409224840
2024-10-05 14:50:34.217: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407809550
2024-10-05 14:50:34.307: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407809640
2024-10-05 14:50:34.347: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407809680
2024-10-05 14:50:34.540: [    CSSD][1579894528]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592089, LATS 2407809880, lastSeqNo 108592086, uniqueness 1708518327, timestamp 1728111034/2409225210
2024-10-05 14:50:34.540: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592091, LATS 2407809880, lastSeqNo 108592088, uniqueness 1708518327, timestamp 1728111034/2409225340
2024-10-05 14:50:34.718: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407810050
2024-10-05 14:50:34.809: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407810150
2024-10-05 14:50:34.849: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407810180                                                                                                 

通过crs日志可以看出,在14:50:31开始出现节点之间通信失败报错,随后节点2发生了重启。

(3) 查看节点2系统日志

节点2系统日志:
[root@nxybj2 log]# more messages-20241006
Sep 29 03:50:01 nxybj2 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="7519" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Sep 29 03:50:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Sep 30 03:28:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  1 04:16:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  2 03:24:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  3 03:23:03 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  4 03:27:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  5 03:34:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  5 14:50:46 nxybj2 ntpd[394054]: Deleting interface #4 bond0:1, 172.31.122.37#123, interface stats: received=0, sent=0, dropped=0, active_time=6191372 secs
Oct  5 14:51:15 nxybj2 ntpd[394054]: Deleting interface #6 bond1:1, 169.254.209.169#123, interface stats: received=0, sent=0, dropped=0, active_time=6191401 secs
Oct  5 14:51:23 nxybj2 ntpd[394054]: Listen normally on 17 bond1:1 169.254.209.169 UDP 123
Oct  5 14:51:46 nxybj2 ntpd[394054]: Listen normally on 18 bond0:1 172.31.122.37 UDP 123

 系统日志无异常报错。

三、故障原因分析

根据日志信息可以看到,是由于节点之间心跳通信异常,导致节点2发生了重启,根据日志中的关键信息:

2024-10-05 14:50:45.484: 
[cssd(10595)]CRS-1607:Node nxybj2 is being evicted in cluster incarnation 587229053; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/nxybj1/cssd/ocssd.log.
2024-10-05 14:50:47.994: 
[cssd(10595)]CRS-1625:Node nxybj2, number 2, was manually shut down

在mos上查询该报错,通过查询到的结果分析,该报错为bug:

受影响版本为:11.2.0.3,当前生成环境版本为:11.2.0.4,由此可以确定本次节点2重启是bug导致。

进一步查询,该bug在如下版本中进行了修复:

根据查询结果,要想修复此bug,就需要打对应版本的补丁。
 

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.xdnf.cn/news/1556710.html

如若内容造成侵权/违法违规/事实不符,请联系一条长河网进行投诉反馈,一经查实,立即删除!

相关文章

【光追模组】使命召唤7黑色行动光追mod,调色并修改光影,并且支持光追效果,游戏画质大提升

大家好,今天小编我给大家继续引入一款游戏mod,这次这个模组主要是针对使命召唤7黑色行动进行修改,如果你觉得游戏本身光影有缺陷,觉得游戏色彩有点失真的话,或者说你想让使命召唤7这款游戏增加对光线追踪的支持的话&am…

75 华三vlan端口隔离

华三vlan端口隔离 为了实现端口间的二层隔离,可以将不同的端口加入不同的VLAN,但VLAN资源有限。采用端口隔离特性,用户只需要将端口加入到隔离组中,就可以实现隔离组内端口之间二层隔离,而不关心这些端口所属VLAN&…

剖析 Redis:应对雪崩、穿透和击穿的实战秘籍

前言 用户的数据通常存储在数据库中,而数据库的数据存放在磁盘上。 磁盘的读写速度在计算机硬件中可以说是最慢的。 如果用户的所有请求都直接访问数据库,当请求数量增多时,数据库很容易崩溃。因此,为了避免用户直接访问数据库&a…

Java垃圾回收简述

什么是Java的垃圾回收? 自动管理内存的机制,负责自动释放不再被程序引用的对象所占用的内存。 怎么触发垃圾回收? 内存不足时:JVM检测到堆内存不足时,无法为新的对象分配内存时,会自动触发垃圾回收。手动…

Pandas -----------------------基础知识(八)

Pandas内置Matplotlib 加载数据 import pandas as pdanscombe pd.read_csv(/root/pandas_code_ling/data/e_anscombe.csv) anscombe dataset_1 anscombe[anscombe[dataset]I] dataset_1dataset_1.describe() 提供数据 dataset_1 anscombe[anscombe[dataset]I] dataset_2 an…

【C语言】分支和循环(2)

🤔个人主页: 起名字真南 😙个人专栏:【数据结构初阶】 【C语言】 【C】 目录 1 关系操作符2 条件操作符3 逻辑操作符 :|| ,&& ,!3.1 逻辑取反运算符3.2 与运算符3.3 或运算符3.4 练习闰年判断3.5 短…

仪器校准机构不符合项应该怎么签发和整改?

签发不合格项是内审工作之中非常重要的一环,那么如何正确签发不合格项,下列几个方面可以供大家参考: 一、目的 为便于仪器校准机构正确理解不符合项整改要求,特制定本指南,以指导企业规范、有效、高效地处理不符合项。…

旅游管理智能化:SpringBoot框架的应用

第一章 绪论 1.1 研究现状 时代的发展,我们迎来了数字化信息时代,它正在渐渐的改变着人们的工作、学习以及娱乐方式。计算机网络,Internet扮演着越来越重要的角色,人们已经离不开网络了,大量的图片、文字、视频冲击着我…

SpringBoot教程(二十四) | SpringBoot实现分布式定时任务之Quartz

SpringBoot教程(二十四) | SpringBoot实现分布式定时任务之Quartz 简介适用场景Quartz核心概念Quartz 存储方式Quartz 版本类型引入相关依赖方式一:内存方式(MEMORY)存储实现定时任务1. 定义任务类2. 定义任务描述及创建任务触发器3. Quartz的…

强引用、软引用、弱引用、虚引用用法

强引用、软引用、弱引用、虚引用用法 强引用弱引用弱引用虚引用 强引用 强引用是指程序中在程序代码之中类似“Object obj new Object()”的引用关系,无论任何情况下,只要强引用关系还存在,垃圾回收器就不会回收掉被引用的对象。 强引用是我…

日期类(Date)的实现 (C++版)

​ 🌹个人主页🌹:喜欢草莓熊的bear 🌹专栏🌹:C入门 目录 前言 一、Date的头文件,包含函数声明 二、 Date.cpp 2.1 int GetMonthDay(int year, int month) 2.2 bool Check() 2.3 Date& …

【吊打面试官系列-MySQL面试题】什么是基本表?什么是视图?

大家好,我是锋哥。今天分享关于【什么是基本表?什么是视图?】面试题,希望对大家有帮助; 什么是基本表?什么是视图? 基本表是本身独立存在的表,在 SQL 中一个关系就对应一个表。 视图…

【含开题报告+文档+PPT+源码】闲置二手市场小程序的设计与实现

开题报告 闲置二手市场平台的背景可以追溯到互联网的普及和电子商务的兴起。随着互联网技术的不断发展,人们的消费观念也在不断变化,越来越多的人开始关注二手商品的价值和优势。同时,大用户群体也在不断增加,他们对于经济实惠的…

利用顺序栈输出对应的二进制数,找迷宫出口详解(数据结构作业04)

目录 利用顺序栈输出对应的二进制数 代码: 运行结果: 找迷宫出口 代码: 图解: 运行结果: 利用顺序栈输出对应的二进制数 键盘输入一个十进制正整数89,用C语言设计一个算法,利用顺序栈…

MambaAD 实验部分讲解

4 实验 4.1 设置:数据集、指标和细节 数据集(6个) 1.MVTec-AD: 包含5种类型的纹理和10种类型的对象,总共5,354张高分辨率图像。 实验: 3,629张正常图像被指定为训练。 剩下的 1,725 张图像被保留用于测试…

网络基础擅长组建乐队

让我们荡起双桨 来说说网络吧 现有计算机要进行协作,网络的产生是必然的 局域网:计算机数量更多了, 通过交换机和路由器连接在一起 广域网:将远隔千里的计算机都连在一起 交换机路由器等设备就应运而生 计算机是人的工具,人要协…

美国游戏发展趋势

美国拥有一些最大、最具影响力的游戏开发工作室,是游戏行业的全球领导者。凭借丰富地创新历史,美国游戏开发不断发展,受到尖端技术、消费者偏好和市场动态的影响。已经出现了几个趋势,这些趋势定义了该国游戏发展的方向&#xff0…

node高版本报错: digital envelope routines::unsupported

node高版本报错: digital envelope routines::unsupported 解决方案: package.json中,启动命令前加上: set NODE_OPTIONS--openssl-legacy-provider &&

WPF 手撸插件 八 操作数据库一

1、本文将使用SqlSugar创建Sqlite数据库,进行入门的增删改查等操作。擦,咋写着写着凌乱起来了。 SqlSugar官方文档:简单示例,1分钟入门 - SqlSugar 5x - .NET果糖网 2、环境SqlSugar V5.0版本需要.Net Framework 4.6 &#xff0…

Qt源码-Qt多媒体音频框架

Qt 多媒体音频框架 一、概述二、音频设计1. ALSA 基础2. Qt 音频类1. 接口实现2. alsa 插件实现 一、概述 环境详细Qt版本Qt 5.15操作系统Deepin v23代码工具Visual Code源码https://github.com/qt/qtmultimedia/tree/5.15 这里记录一下在Linux下Qt 的 Qt Multimedia 模块的设…