A-A+

如何定位rac节点重启原因(2)私网网卡down掉

2013年01月30日 DG&RAC&OGG, TroubleShooting 暂无评论 阅读 1,715 次

这篇是模拟一个rac中的其中一个节点的私网网卡down掉后,导致rac节点重启前后的日志记录。
小结如下:
1、首先查看节点的主机messages日志,找出时间断点信息确认问题发生时间(此案例为Jan 30 00:27:49左右),以及问题的大概情况。
2、根据1中的时间去数据库实例的alert日志中找相关消息。
3、反查日志ocssd.log文件,确认WARNING或者ERROR发生时间和1中一致,并找出节点驱逐信息。
4、查看crsd.log日志中关于资源超时信息以及crs启动信息。

情景回放2:人为停掉其中一个节点的私网网卡
查看节点状态

[oracle@rac1 ~]$ crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora....SM1.asm application    ONLINE    ONLINE    rac1        
ora....C1.lsnr application    ONLINE    ONLINE    rac1        
ora.rac1.gsd   application    ONLINE    ONLINE    rac1        
ora.rac1.ons   application    ONLINE    ONLINE    rac1        
ora.rac1.vip   application    ONLINE    ONLINE    rac1        
ora....SM2.asm application    ONLINE    ONLINE    rac2        
ora....C2.lsnr application    ONLINE    ONLINE    rac2        
ora.rac2.gsd   application    ONLINE    ONLINE    rac2        
ora.rac2.ons   application    ONLINE    ONLINE    rac2        
ora.rac2.vip   application    ONLINE    ONLINE    rac2        
ora.racdb.db   application    ONLINE    ONLINE    rac2        
ora....b1.inst application    ONLINE    ONLINE    rac1        
ora....b2.inst application    ONLINE    ONLINE    rac2 

人为停掉节点2的私网网卡

[oracle@rac2 ~]$ oifcfg getif
eth0  192.168.137.0  global  public
eth1  192.168.136.0  global  cluster_interconnect
[root@rac2 ~]# ifconfig eth1 down

此时节点发生重启现象

查看日志
主机messages记录如下:

Jan 30 00:13:20 rac2 logger: Cluster Ready Services completed waiting on dependencies.
Jan 30 00:13:20 rac2 last message repeated 2 times
Jan 30 00:13:25 rac2 logger: Running CRSD with TZ =
Jan 30 00:13:26 rac2 logger: Oracle CSS Family monitor starting.
Jan 30 00:13:29 rac2 logger: Oracle CSS restart. 0, 1
Jan 30 00:15:14 rac2 ntpd[3567]: synchronized to LOCAL(0), stratum 10
Jan 30 00:15:14 rac2 ntpd[3567]: kernel time sync enabled 0001
Jan 30 00:18:42 rac2 avahi-daemon[3872]: Registering new address record for 192.168.137.154 on eth0.
Jan 30 00:27:20 rac2 avahi-daemon[3872]: Interface eth1.IPv4 no longer relevant for mDNS.		--此处发现网卡不通问题
Jan 30 00:27:20 rac2 avahi-daemon[3872]: Leaving mDNS multicast group on interface eth1.IPv4 with address 192.168.136.152.
Jan 30 00:27:20 rac2 avahi-daemon[3872]: Withdrawing address record for 192.168.136.152 on eth1.
Jan 30 00:27:29 rac2 kernel:  connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 951329, last ping 956330, now 961330
Jan 30 00:27:29 rac2 kernel:  connection1:0: detected conn error (1011)
Jan 30 00:27:30 rac2 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 30 00:27:47 rac2 kernel: eth1: link up
Jan 30 00:27:48 rac2 avahi-daemon[3872]: New relevant interface eth1.IPv4 for mDNS.
Jan 30 00:27:48 rac2 avahi-daemon[3872]: Joining mDNS multicast group on interface eth1.IPv4 with address 218.30.64.194.
Jan 30 00:27:49 rac2 avahi-daemon[3872]: Registering new address record for 218.30.64.194 on eth1.
Jan 30 00:29:13 rac2 logger: Oracle clsomon failed with fatal status 13.	--进程clsomon发现css异常使主机重启	
Jan 30 00:34:04 rac2 syslogd 1.4.1: restart.

数据库节点的alert记录如下:

Wed Jan 30 00:22:11 CST 2013
Completed: ALTER DATABASE OPEN
Wed Jan 30 00:28:04 CST 2013
ospid 7904: network interface with IP address 192.168.136.152 no longer operational		--此处已经检测数eth1私网存在问题
requested interface 192.168.136.152 not found. Check output from ifconfig command
Wed Jan 30 00:29:04 CST 2013
ospid 7904: network interface with IP address 192.168.136.152 no longer operational
requested interface 192.168.136.152 not found. Check output from ifconfig command
Wed Jan 30 00:42:22 CST 2013		--此处是数据库实例重启记录
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 eth1 192.168.136.0 configured from OCR for use as a cluster interconnect
Interface type 1 eth0 192.168.137.0 configured from OCR for use as  a public interface
Picked latch-free SCN scheme 2
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.5.0.
System parameters with non-default values:
  processes                = 150
  __shared_pool_size       = 155189248
  __large_pool_size        = 4194304
  __java_pool_size         = 4194304
  __streams_pool_size      = 0
  spfile                   = +DATA/racdb/spfileracdb.ora
  sga_target               = 603979776
  control_files            = +DATA/racdb/controlfile/current.276.796865453
  db_block_size            = 8192
  __db_cache_size          = 436207616
  compatible               = 10.2.0.5.0
  log_archive_dest_1       = LOCATION=+DATA/
  log_archive_format       = %t_%s_%r.dbf
  db_file_multiblock_read_count= 16
  cluster_database         = TRUE
  cluster_database_instances= 2
  db_create_file_dest      = +DATA
  thread                   = 2
  instance_number          = 2
  undo_management          = AUTO
  undo_tablespace          = UNDOTBS2
部分未列出

节点ocss日志记录如下:

[root@rac2 cssd]# cat  ocssd.log | grep WARNING 
[    CSSD]2013-01-29 21:49:37.085 [2996820880] >WARNING: clssgmShutDown: Received explicit shutdown request from client.
[    CSSD]2013-01-29 21:49:37.965 [2996820880] >WARNING: clssgmClientShutdown: graceful shutdown completed.
[    CSSD]2013-01-29 21:49:38.484 [2954857360] >WARNING: clssnmPollingThread: state(3) clusterState(1) exit
[    CSSD]2013-01-29 21:49:38.962 [2946464656] >WARNING: clssnmSendingThread: state(3) clusterState(0) exit
[    CSSD]2013-01-29 21:49:38.994 [2996820880] >WARNING: clssscExit: CSSD terminating normally from thread GMClientListener
[    CSSD]2013-01-30 00:14:02.662 [3059497168] >WARNING: clssnmStartClusterJoin: attempt to join domain(0) twice
[    CSSD]2013-01-30 00:27:49.883 [2954861456] >WARNING: clssnmPollingThread: node rac1 (1) at 50% heartbeat fatal, eviction in 29.850 seconds seedhbimpd 0		--节点准备被驱逐
[    CSSD]2013-01-30 00:27:50.898 [2954861456] >WARNING: clssnmPollingThread: node rac1 (1) at 50% heartbeat fatal, eviction in 28.840 seconds seedhbimpd 0
[    CSSD]2013-01-30 00:28:02.786 [3015064464] >WARNING: clssnmDiskPMT: voting device hang at 75% fatal, termination in 14060 ms, disk (0//dev/raw/raw2)		--voting device hang at 75% fatal也反应了messages日志中记录[Jan 30 00:29:13 rac2 logger: Oracle clsomon failed with fatal status 13.]
[    CSSD]2013-01-30 00:28:04.900 [2954861456] >WARNING: clssnmPollingThread: node rac1 (1) at 75% heartbeat fatal, eviction in 14.840 seconds seedhbimpd 0
[    CSSD]2013-01-30 00:28:05.881 [2954861456] >WARNING: clssnmPollingThread: node rac1 (1) at 75% heartbeat fatal, eviction in 13.860 seconds seedhbimpd 0
[    CSSD]2013-01-30 00:28:11.897 [3015064464] >WARNING: clssnmDiskPMT: voting device hang at 90% fatal, termination in 4950 ms, disk (0//dev/raw/raw2)
[    CSSD]2013-01-30 00:28:12.898 [3015064464] >WARNING: clssnmDiskPMT: voting device hang at 90% fatal, termination in 3950 ms, disk (0//dev/raw/raw2)
[    CSSD]2013-01-30 00:28:13.878 [2954861456] >WARNING: clssnmPollingThread: node rac1 (1) at 90% heartbeat fatal, eviction in 5.860 seconds seedhbimpd 0
[    CSSD]2013-01-30 00:28:13.898 [3015064464] >WARNING: clssnmDiskPMT: voting device hang at 90% fatal, termination in 2950 ms, disk (0//dev/raw/raw2)
[    CSSD]2013-01-30 00:28:14.879 [2954861456] >WARNING: clssnmPollingThread: node rac1 (1) at 90% heartbeat fatal, eviction in 4.860 seconds seedhbimpd 0
[    CSSD]2013-01-30 00:28:14.899 [3015064464] >WARNING: clssnmDiskPMT: voting device hang at 90% fatal, termination in 1950 ms, disk (0//dev/raw/raw2)
[    CSSD]2013-01-30 00:28:15.880 [2954861456] >WARNING: clssnmPollingThread: node rac1 (1) at 90% heartbeat fatal, eviction in 3.860 seconds seedhbimpd 0
[    CSSD]2013-01-30 00:28:15.900 [3015064464] >WARNING: clssnmDiskPMT: voting device hang at 90% fatal, termination in 950 ms, disk (0//dev/raw/raw2)
[    CSSD]2013-01-30 00:28:16.880 [2954861456] >WARNING: clssnmPollingThread: node rac1 (1) at 90% heartbeat fatal, eviction in 2.860 seconds seedhbimpd 0
[    CSSD]2013-01-30 00:40:12.858 [2938071952] >WARNING: clssnmLocalJoinEvent: takeover succ

[root@rac2 cssd]# cat  ocssd.log | grep ERROR
[    CSSD]2013-01-30 00:28:16.860 [3015064464] >ERROR:   clssnmDiskPMT: Aborting, 1 of 1 voting disks unavailable		--此处也可看出心跳不通
[    CSSD]2013-01-30 00:28:16.925 [3015064464] >ERROR:   ###################################
[    CSSD]2013-01-30 00:28:16.925 [3015064464] >ERROR:   clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
[    CSSD]2013-01-30 00:28:16.925 [3015064464] >ERROR:   ###################################

crsd.log 日志记录如下:

2013-01-30 00:22:30.294: [  OCRUTL][2940738448]u_freem: mem passed is null
2013-01-30 00:28:37.930: [  CRSEVT][2662259600]0CAAMonitorHandler :: 0:Could not join /opt/ora10g/product/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-01-30 00:28:37.930: [  CRSEVT][2662259600]0CAAMonitorHandler :: 0:Action Script /opt/ora10g/product/crs/bin/racgwrap(check) timed out for ora.rac2.vip! (timeout=60)		--此处超时报错
2013-01-30 00:28:37.931: [  CRSAPP][2662259600]0CheckResource error for ora.rac2.vip error code = -2
2013-01-30 00:36:13.784: [ default][3058538176][ENTER]0		--此处节点已经重启完毕,开始启动crs及启动资源
Oracle Database 10g CRS Release 10.2.0.5.0 Production Copyright 1996, 2004, Oracle.  All rights reserved
2013-01-30 00:36:13.833: [ default][3058538176]0CRS Daemon Starting
2013-01-30 00:36:13.837: [ CRSMAIN][3058538176]0Checking the OCR device
2013-01-30 00:36:13.981: [ CRSMAIN][3058538176]0Connecting to the CSS Daemon
2013-01-30 00:36:14.258: [ COMMCRS][3057961872]clsc_connect: (0xa0d2d80) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_rac2_))

2013-01-30 00:36:14.259: [ CSSCLNT][3058538176]clsssInitNative: connect failed, rc 9

2013-01-30 00:36:14.260: [  CRSRTI][3058538176]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..

2013-01-30 00:36:15.540: [ COMMCRS][3057961872]clsc_connect: (0xa0d2dd8) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_rac2_))

2013-01-30 00:36:15.540: [ CSSCLNT][3058538176]clsssInitNative: connect failed, rc 9
下边省略

以上便是一个模拟私网网卡down掉后节点重启的一个案例。

一般出错信息如下:
1、在messages日志中,相关信息较为好找,一般有如下类似信息:

Jan 30 00:27:20 rac2 avahi-daemon[3872]: Interface eth1.IPv4 no longer relevant for mDNS.		
Jan 30 00:27:20 rac2 avahi-daemon[3872]: Leaving mDNS multicast group on interface eth1.IPv4 with address 192.168.136.152.
Jan 30 00:27:20 rac2 avahi-daemon[3872]: Withdrawing address record for 192.168.136.152 on eth1.
Jan 30 00:27:29 rac2 kernel:  connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 951329, last ping 956330, now 961330
Jan 30 00:27:29 rac2 kernel:  connection1:0: detected conn error (1011)
Jan 30 00:27:30 rac2 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Jan 30 00:27:47 rac2 kernel: eth1: link up
... ...
Jan 30 00:29:13 rac2 logger: Oracle clsomon failed with fatal status 13.
Jan 30 00:34:04 rac2 syslogd 1.4.1: restart.

2、在数据库节点的alert日志中,一般会记录以下信息:

Wed Jan 30 00:28:04 CST 2013
ospid 7904: network interface with IP address 192.168.136.152 no longer operational
requested interface 192.168.136.152 not found. Check output from ifconfig command

3、在节点ocss日志中,主要查看WARNING和ERROR关键字的信息,一般就是节点驱逐信息,如下:

[    CSSD]2013-01-30 00:27:49.883 [2954861456] >WARNING: clssnmPollingThread: node rac1 (1) at 50% heartbeat fatal, eviction in 29.850 seconds seedhbimpd 0	
[    CSSD]2013-01-30 00:27:50.898 [2954861456] >WARNING: clssnmPollingThread: node rac1 (1) at 50% heartbeat fatal, eviction in 28.840 seconds seedhbimpd 0
[    CSSD]2013-01-30 00:28:02.786 [3015064464] >WARNING: clssnmDiskPMT: voting device hang at 75% fatal, termination in 14060 ms, disk (0//dev/raw/raw2)		--voting device hang at 75% fatal也反应了messages日志中记录[Jan 30 00:29:13 rac2 logger: Oracle clsomon failed with fatal status 13.]
[    CSSD]2013-01-30 00:28:04.900 [2954861456] >WARNING: clssnmPollingThread: node rac1 (1) at 75% heartbeat fatal, eviction in 14.840 seconds seedhbimpd 0
[    CSSD]2013-01-30 00:28:05.881 [2954861456] >WARNING: clssnmPollingThread: node rac1 (1) at 75% heartbeat fatal, eviction in 13.860 seconds seedhbimpd 0
[    CSSD]2013-01-30 00:28:11.897 [3015064464] >WARNING: clssnmDiskPMT: voting device hang at 90% fatal, termination in 4950 ms, disk (0//dev/raw/raw2)
... ...
[    CSSD]2013-01-30 00:28:16.925 [3015064464] >ERROR:   clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread

4、在节点crsd.log 日志记录的消息一般是资源vip超时的信息,如下:

category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

2013-01-30 00:28:37.930: [  CRSEVT][2662259600]0CAAMonitorHandler :: 0:Action Script /opt/ora10g/product/crs/bin/racgwrap(check) timed out for ora.rac2.vip! (timeout=60)
2013-01-30 00:28:37.931: [  CRSAPP][2662259600]0CheckResource error for ora.rac2.vip error code = -2

给我留言

Copyright © YallonKing 保留所有权利.   Theme  Ality

用户登录

分享到: