RHEL 5.5 two-node cluster: node powers off instead of rebooting when fenced after a network-disconnect test
Problem: two IBM servers form a two-node cluster. During a network-disconnect test, the fenced node simply powers off and never reboots. The logs show no mutual "fence failed" messages either.
Hardware: two IBM System x3850 X5 servers running RHEL 5.5, with two Cisco 3560 switches. Each server has four NICs and two fiber cards; the fence device is the IBM IMM.
eth0/eth1 each connect to one switch and together form bond0. eth4/eth5 are the two directly-cabled heartbeat links and form bond1. The two fiber cards are eth2/eth3.
Previously, after rebooting a server, MAC addresses drifted among the six NICs; we fixed that by pinning the MAC in each NIC's config file. Could that be related to the cluster problem?
Another eight x3650 servers with the same configuration as the 3850s have already passed failover testing without issue.
My configuration:
Hostnames:
ynrhzf-db1: bond0 192.168.141.11, bond1 192.168.142.11
ynrhzf-db2: bond0 192.168.141.12, bond1 192.168.142.12
# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
192.168.141.11 db1.anypay.yn ynrhzf-db1
192.168.141.12 db2.anypay.yn ynrhzf-db2
192.168.141.10 ynrhzf-db # floating IP
192.168.142.11 pri-db1
192.168.142.12 pri-db2
192.168.141.103 imm-db1
192.168.141.104 imm-db2
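The cluster resolves node names through this file, so a node name that appears on more than one line (for example, carried as an alias on a second address) can confuse fencing after the network is cut. A quick sanity check can be sketched with an awk pipeline; this is illustrative only and runs against an inline copy of the entries, with a deliberate duplicate added to show what the check reports:

```shell
# Sketch: flag any hostname that appears on more than one line of a
# hosts file. The sample data deliberately repeats db1.anypay.yn.
hosts_file=$(mktemp)
cat > "$hosts_file" <<'EOF'
192.168.141.11 db1.anypay.yn ynrhzf-db1
192.168.141.12 db2.anypay.yn ynrhzf-db2
192.168.142.11 db1.anypay.yn
EOF
# Emit "name address" pairs, one per line, then report names seen twice.
awk '!/^#/ {for (i = 2; i <= NF; i++) print $i, $1}' "$hosts_file" \
  | awk '{print $1}' | sort | uniq -d
rm -f "$hosts_file"
```

With the sample data above, the pipeline prints `db1.anypay.yn`; a clean hosts file produces no output.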
# service cman start
Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing... done
[ OK ]
The cluster.conf file:
[root@ynrhzf-db1 network-scripts]# cat /etc/cluster/cluster.conf
<?xml version="1.0" ?>
<cluster config_version="3" name="db-cluster">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="db1.anypay.yn" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="imm-db1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="db2.anypay.yn" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="imm-db2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1">
    <multicast addr="227.0.0.10"/>
  </cman>
  <fencedevices>
    <fencedevice agent="fence_rsa" ipaddr="192.168.141.103" login="USERID" name="imm-db1" passwd="PASSW0RD"/>
    <fencedevice agent="fence_rsa" ipaddr="192.168.141.104" login="USERID" name="imm-db2" passwd="PASSW0RD"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="db-failover" ordered="1" restricted="1">
        <failoverdomainnode name="db1.anypay.yn" priority="1"/>
        <failoverdomainnode name="db2.anypay.yn" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="192.168.141.10/24" monitor_link="1"/>
    </resources>
    <service autostart="1" domain="db-failover" name="db-services">
      <ip ref="192.168.141.10/24"/>
    </service>
  </rm>
</cluster>
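One thing worth double-checking in a config like this is that every fence `<device name="...">` referenced under a node exactly matches a `<fencedevice name="...">` definition; a silent typo there can produce odd fence behavior. A rough grep-based cross-check, sketched here against an inline excerpt rather than the live file (not a cluster tool, just an illustration):

```shell
# Sketch: cross-check fence device references against definitions.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<device name="imm-db1"/>
<device name="imm-db2"/>
<fencedevice agent="fence_rsa" ipaddr="192.168.141.103" login="USERID" name="imm-db1" passwd="PASSW0RD"/>
<fencedevice agent="fence_rsa" ipaddr="192.168.141.104" login="USERID" name="imm-db2" passwd="PASSW0RD"/>
EOF
# Collect referenced names and defined names, then compare the sets.
refs=$(grep '<device ' "$conf" | grep -o 'name="[^"]*"' | cut -d'"' -f2 | sort -u)
defs=$(grep '<fencedevice ' "$conf" | grep -o ' name="[^"]*"' | cut -d'"' -f2 | sort -u)
if [ "$refs" = "$defs" ]; then
  echo "fence device names consistent"
else
  echo "mismatch between references and definitions"
fi
rm -f "$conf"
```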
Test results:
[root@ynrhzf-db1 network-scripts]# fence_rsa -a 192.168.141.103 -l USERID -p PASSW0RD -o status
Status: ON
# fence_rsa -a 192.168.141.104 -l USERID -p PASSW0RD -o status
Status: ON
Telnet to the remote management port also works: after logging in to both IMMs and running the reset command, both servers reboot as expected.
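Since `-o status` works, the next useful step is to exercise the actual fence action by hand and watch what the machine does. Assuming fence_rsa accepts the same `-o` action flag shown above with `reboot` as the action (this invocation is a sketch, not verified against this exact agent version), something like the following should power-cycle the node. Caution: this really restarts the server, so run it only in a maintenance window:

```
# Warning: this will actually restart db1 (illustrative command).
fence_rsa -a 192.168.141.103 -l USERID -p PASSW0RD -o reboot
```

If this manual call reboots the node cleanly but the cluster-triggered fence powers it off, the difference lies outside the agent invocation itself.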
This is the log on db2 during the db1 network-disconnect test:
Mar 12 05:32:04 db2 openais: Members Joined:
Mar 12 05:32:04 db2 openais: r(0) ip(192.168.141.11)
Mar 12 05:32:04 db2 openais: This node is within the primary component and will provide service.
Mar 12 05:32:04 db2 openais: entering OPERATIONAL state.
Mar 12 05:32:04 db2 openais: got nodejoin message 192.168.141.11
Mar 12 05:32:04 db2 openais: got nodejoin message 192.168.141.12
Mar 12 05:32:04 db2 openais: got joinlist message from node 1
Mar 12 05:32:21 db2 kernel: dlm: Using TCP for communications
Mar 12 05:32:21 db2 kernel: dlm: got connection from 1
Mar 12 05:32:22 db2 clurgmgrd: <notice> Resource Group Manager Starting
Mar 12 05:36:50 db2 dhclient: DHCPREQUEST on usb0 to 169.254.95.118 port 67
Mar 12 05:36:51 db2 dhclient: DHCPACK from 169.254.95.118
Mar 12 05:36:51 db2 dhclient: bound to 169.254.95.120 -- renewal in 294 seconds.
Mar 12 05:36:57 db2 openais: The token was lost in the OPERATIONAL state.
Mar 12 05:36:57 db2 openais: Receive multicast socket recv buffer size (320000 bytes).
Mar 12 05:36:57 db2 openais: Transmit multicast socket send buffer size (320000 bytes).
Mar 12 05:36:57 db2 openais: entering GATHER state from 2.
Mar 12 05:37:17 db2 openais: entering GATHER state from 0.
Mar 12 05:37:17 db2 openais: Creating commit token because I am the rep.
Mar 12 05:37:17 db2 openais: Saving state aru 3b high seq received 3b
Mar 12 05:37:17 db2 openais: Storing new sequence id for ring 18
Mar 12 05:37:17 db2 openais: entering COMMIT state.
Mar 12 05:37:17 db2 openais: entering RECOVERY state.
Mar 12 05:37:17 db2 openais: position member 192.168.141.12:
Mar 12 05:37:17 db2 openais: previous ring seq 20 rep 192.168.141.11
Mar 12 05:37:17 db2 openais: aru 3b high delivered 3b received flag 1
Mar 12 05:37:17 db2 openais: Did not need to originate any messages in recovery.
Mar 12 05:37:17 db2 openais: Sending initial ORF token
Mar 12 05:37:17 db2 openais: CLM CONFIGURATION CHANGE
Mar 12 05:37:17 db2 openais: New Configuration:
Mar 12 05:37:17 db2 kernel: dlm: closing connection to node 1
Mar 12 05:37:17 db2 fenced: db1.anypay.yn not a cluster member after 0 sec post_fail_delay
Mar 12 05:37:17 db2 openais: r(0) ip(192.168.141.12)
Mar 12 05:37:17 db2 fenced: fencing node "db1.anypay.yn"
Mar 12 05:37:17 db2 openais: Members Left:
Mar 12 05:37:17 db2 openais: r(0) ip(192.168.141.11)
Mar 12 05:37:17 db2 openais: Members Joined:
Mar 12 05:37:17 db2 openais: CLM CONFIGURATION CHANGE
Mar 12 05:37:17 db2 openais: New Configuration:
Mar 12 05:37:17 db2 openais: r(0) ip(192.168.141.12)
Mar 12 05:37:17 db2 openais: Members Left:
Mar 12 05:37:17 db2 openais: Members Joined:
Mar 12 05:37:17 db2 openais: This node is within the primary component and will provide service.
Mar 12 05:37:17 db2 openais: entering OPERATIONAL state.
Mar 12 05:37:17 db2 openais: got nodejoin message 192.168.141.12
Mar 12 05:37:17 db2 openais: got joinlist message from node 2
Mar 12 05:37:28 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:37:28 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:37:28 db2 kernel: bonding: bond1: making interface eth5 the new active one.
Mar 12 05:37:29 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:37:29 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:37:29 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:38:03 db2 ccsd: Attempt to close an unopened CCS descriptor (3180).
Mar 12 05:38:03 db2 ccsd: Error while processing disconnect: Invalid request descriptor
Mar 12 05:38:03 db2 fenced: fence "db1.anypay.yn" success
Mar 12 05:38:04 db2 clurgmgrd: <notice> Taking over service service:db-services from down member db1.anypay.yn
Mar 12 05:38:06 db2 avahi-daemon: Registering new address record for 192.168.141.10 on bond0.
Mar 12 05:38:07 db2 clurgmgrd: <notice> Service service:db-services started
Mar 12 05:41:11 db2 kernel: usb 8-1: new low speed USB device using uhci_hcd and address 2
Mar 12 05:41:12 db2 kernel: usb 8-1: configuration #1 chosen from 1 choice
Mar 12 05:41:12 db2 kernel: input: USB Keyboard as /class/input/input1
Mar 12 05:41:12 db2 kernel: input: USB HID v1.10 Keyboard [ USB Keyboard] on usb-0000:00:1d.2-1
Mar 12 05:41:12 db2 kernel: input: USB Keyboard as /class/input/input2
Mar 12 05:41:12 db2 kernel: input: USB HID v1.10 Device [ USB Keyboard] on usb-0000:00:1d.2-1
Mar 12 05:41:32 db2 kernel: usb 8-1: USB disconnect, address 2
Mar 12 05:41:44 db2 dhclient: DHCPREQUEST on usb0 to 169.254.95.118 port 67
Mar 12 05:41:45 db2 dhclient: DHCPACK from 169.254.95.118
Mar 12 05:41:45 db2 dhclient: bound to 169.254.95.120 -- renewal in 252 seconds.
Mar 12 05:42:04 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:04 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:42:04 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:42:04 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:04 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:05 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:06 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:42:06 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:42:06 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:42:06 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:42:06 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:42:08 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:08 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:08 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:42:08 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:42:08 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:08 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:10 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:42:10 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:42:10 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:42:10 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:42:10 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:42:11 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:11 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:11 db2 kernel: bonding: bond1: making interface eth5 the new active one.
Mar 12 05:42:11 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:11 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:11 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:42:48 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:42:48 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:42:48 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:42:48 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:42:48 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:42:48 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:42:49 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:49 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:49 db2 kernel: bonding: bond1: making interface eth5 the new active one.
Mar 12 05:42:49 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:49 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:49 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:43:27 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:43:27 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:43:27 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:43:27 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:43:27 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:43:29 db2 kernel: igb: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:43:29 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:43:29 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:43:29 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:43:30 db2 kernel: igb: eth5 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:43:30 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
I'm not sure how much this config file matters:
# cat /etc/modprobe.conf
alias scsi_hostadapter megaraid_sas
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 lpfc
alias eth0 bnx2
alias eth1 bnx2
alias bond0 bonding
options bond0 miimon=100 mode=1
alias bond1 bonding
options bond1 miimon=100 mode=0
### BEGIN UltraPath Driver Comments ###
remove upUpper if [ -d /proc/mpp ] && [ `ls -a /proc/mpp | wc -l` -gt 2 ]; then echo -e "Please Unload Physical HBA Driver prior to unloading upUpper."; else /sbin/modprobe -r --ignore-remove upUpper; fi
# Additional config info can be found in /opt/mpp/modprobe.conf.mppappend.
# The Above config info is needed if you want to make mkinitrd manually.
# Edit the '/etc/modprobe.conf' file and run 'upUpdate' to create Ramdisk dynamically.
### END UltraPath Driver Comments ###
options qla2xxx qlport_down_retry=5
options lpfc lpfc_nodev_tmo=30
alias eth2 e1000e
alias eth3 e1000e
alias eth4 igb
alias eth5 igb
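Note that bond0 uses `mode=1` (active-backup) while the heartbeat bond1 uses `mode=0` (balance-rr). In balance-rr all slaves transmit with the same MAC address, which a switch can perceive as MAC-address flapping — plausibly related to the MAC drift seen earlier. If round-robin bandwidth is not strictly needed for heartbeat traffic (an assumption, not a confirmed fix), an active-backup alternative for bond1 would look like:

```
alias bond1 bonding
# mode=1 (active-backup): only one slave transmits at a time, so the
# switch sees a single stable MAC; miimon=100 keeps link monitoring.
options bond1 miimon=100 mode=1
```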
This problem has been bugging me for many days now. Experts, please help me analyze it!
《Solution》
Neither the configuration nor the logs shows any major problem.
However, in /etc/hosts, change this:
192.168.141.11 db1.anypay.yn ynrhzf-db1
192.168.141.12 db2.anypay.yn ynrhzf-db2
to this:
192.168.141.11 db1.anypay.yn
192.168.141.12 db2.anypay.yn
to avoid fence errors after the network is cut. Also, the heartbeat links must not be direct-cabled between the servers; they should go through a switch. As for fencing powering the server off instead of rebooting it, that probably has to be investigated in the server's BIOS settings.
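The hosts-file change above can be scripted. A minimal sed sketch, shown here against a temporary copy (adapt and verify the pattern before touching the real /etc/hosts):

```shell
# Sketch: strip the ynrhzf-* alias from the two cluster-node lines,
# leaving only the cluster node name on each.
hosts_file=$(mktemp)
cat > "$hosts_file" <<'EOF'
192.168.141.11 db1.anypay.yn ynrhzf-db1
192.168.141.12 db2.anypay.yn ynrhzf-db2
EOF
sed -i 's/\(db[12]\.anypay\.yn\) ynrhzf-db[12]$/\1/' "$hosts_file"
cat "$hosts_file"
rm -f "$hosts_file"
```

After the edit, each line carries exactly one name, e.g. `192.168.141.11 db1.anypay.yn`.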
《Solution》
Thanks! I'll try that tomorrow.
《Solution》
I looked through the BIOS and couldn't find anything that needed changing. Frustrating.
《Solution》
The cluster's default action when it calls fence_rsa is reboot, so this shouldn't be something configurable at the OS level. Keep digging.