A question about RHCS (Red Hat Enterprise Linux 5)
I ran service cman start on one of the machines (LINUX1); before that, cman was not running on either machine.
The log then showed the following:
Jun 23 18:49:13 LINUX1 openais: CMAN 2.0.60 (built Jan 23 2007 12:42:29) started
Jun 23 18:49:13 LINUX1 openais: Not using a virtual synchrony filter.
Jun 23 18:49:13 LINUX1 openais: Creating commit token because I am the rep.
Jun 23 18:49:13 LINUX1 openais: Saving state aru 0 high seq received 0
Jun 23 18:49:13 LINUX1 openais: entering COMMIT state.
Jun 23 18:49:13 LINUX1 openais: entering RECOVERY state.
Jun 23 18:49:13 LINUX1 openais: position member 192.168.10.31:
Jun 23 18:49:13 LINUX1 openais: previous ring seq 0 rep 192.168.10.31
Jun 23 18:49:13 LINUX1 openais: aru 0 high delivered 0 received flag 0
Jun 23 18:49:13 LINUX1 openais: Did not need to originate any messages in recovery.
Jun 23 18:49:13 LINUX1 openais: Storing new sequence id for ring 4
Jun 23 18:49:13 LINUX1 openais: Sending initial ORF token
Jun 23 18:49:13 LINUX1 openais: CLM CONFIGURATION CHANGE
Jun 23 18:49:13 LINUX1 openais: New Configuration:
Jun 23 18:49:13 LINUX1 openais: Members Left:
Jun 23 18:49:13 LINUX1 openais: Members Joined:
Jun 23 18:49:13 LINUX1 openais: This node is within the primary component and will provide service.
Jun 23 18:49:13 LINUX1 openais: CLM CONFIGURATION CHANGE
Jun 23 18:49:13 LINUX1 openais: New Configuration:
Jun 23 18:49:13 LINUX1 openais: r(0) ip(192.168.10.31)
Jun 23 18:49:13 LINUX1 openais: Members Left:
Jun 23 18:49:13 LINUX1 openais: Members Joined:
Jun 23 18:49:13 LINUX1 openais: r(0) ip(192.168.10.31)
Jun 23 18:49:13 LINUX1 openais: This node is within the primary component and will provide service.
Jun 23 18:49:13 LINUX1 openais: entering OPERATIONAL state.
Jun 23 18:49:13 LINUX1 openais: quorum regained, resuming activity
Jun 23 18:49:13 LINUX1 openais: got nodejoin message 192.168.10.31
Jun 23 18:49:13 LINUX1 ccsd: Initial status:: Quorate
Jun 23 18:49:19 LINUX1 fenced: LINUX2 not a cluster member after 3 sec post_join_delay
Jun 23 18:49:19 LINUX1 fenced: fencing node "LINUX2"
Jun 23 18:49:19 LINUX1 fence_manual: Node LINUX2 needs to be reset before recovery can procede. Waiting for LINUX2 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n LINUX2)
I then ran fence_ack_manual -n LINUX2, and the log showed:
Jun 23 18:50:43 LINUX1 fenced: fence "LINUX2" success
Jun 23 18:50:48 LINUX1 ccsd: Attempt to close an unopened CCS descriptor (180).
Jun 23 18:50:48 LINUX1 ccsd: Error while processing disconnect: Invalid request descriptor
If I start cman on the other machine instead, the same log appears, only with LINUX2 replaced by LINUX1.
With cman started on both machines, each machine reports the following:
On LINUX1:
# clustat -l
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
LINUX1 1 Online, Local
LINUX2 2 Offline
On LINUX2:
# clustat -l
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
LINUX1 1 Offline
LINUX2 2 Online, Local
Could anyone tell me what the problem is and how to fix it?
[Last edited by txl829 on 2008-6-28 14:32]
《Solution》
Post your topology, plus
your config files /etc/cluster/cluster.conf and /etc/hosts, and /var/log/messages, so we can take a look.
My guess is that the configuration file is wrong.
《Solution》
# cat cluster.conf
<?xml version="1.0" ?>
<cluster config_version="2" name="_cluster">
<fence_daemon post_fail_delay="0" post_join_delay="3"/>
<clusternodes>
<clusternode name="LINUX1" nodeid="1" votes="1">
<fence>
<method name="1">
<device name="svr_ip" nodename="LINUX1"/>
</method>
</fence>
</clusternode>
<clusternode name="LINUX2" nodeid="2" votes="1">
<fence>
<method name="1">
<device name="svr_ip" nodename="LINUX2"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
<fencedevices>
<fencedevice agent="fence_manual" name="svr_ip"/>
</fencedevices>
<rm>
<failoverdomains>
<failoverdomain name="" ordered="0" restricted="0">
<failoverdomainnode name="LINUX1" priority="1"/>
<failoverdomainnode name="LINUX2" priority="1"/>
</failoverdomain>
</failoverdomains>
<resources/>
<service autostart="1" domain="" name="serv_ip" recovery="relocate">
<ip address="192.168.10.32" monitor_link="1"/>
</service>
</rm>
</cluster>
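One thing worth double-checking in the config above is the two-node special case: two_node="1" is what lets a single node stay quorate (which is why each node logs "quorum regained" alone), and cman only accepts it together with expected_votes="1". A minimal sketch of that consistency check, run here against a copy of the posted <cman> line (on a real node you would grep /etc/cluster/cluster.conf instead):

```shell
# Sanity check: two_node="1" in cluster.conf is only valid when
# expected_votes="1". CONF here is a temp copy of the posted <cman>
# line; point it at /etc/cluster/cluster.conf on a live system.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
<cman expected_votes="1" two_node="1"/>
EOF
cman_line=$(grep '<cman' "$CONF")
if echo "$cman_line" | grep -q 'two_node="1"' &&
   echo "$cman_line" | grep -q 'expected_votes="1"'; then
    result="two_node config OK"
else
    result="ERROR: two_node=1 requires expected_votes=1"
fi
echo "$result"   # prints "two_node config OK" for the posted config
rm -f "$CONF"
```

The posted config passes this check, so the two_node stanza itself is not the problem here.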
The hosts file:
# cat hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
192.168.10.11 1
192.168.10.13 2
192.168.10.31 LINUX1
192.168.10.33 LINUX2
192.168.10.32 svr_ip
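For a cluster like this, each clusternode name in cluster.conf must resolve to the address of the interface that carries the heartbeat (here LINUX1 -> 192.168.10.31 and LINUX2 -> 192.168.10.33). On a live node you would verify that with getent hosts LINUX1; the sketch below just demonstrates the expected mapping against a copy of the two relevant lines above:

```shell
# Check that each cluster node name maps to its heartbeat address.
# HOSTS is a temp copy of the relevant /etc/hosts lines posted above.
HOSTS=$(mktemp)
cat > "$HOSTS" <<'EOF'
192.168.10.31 LINUX1
192.168.10.33 LINUX2
EOF
# lookup <file> <name>: print the address on the line whose second
# field is <name> (a simplified stand-in for getent hosts <name>).
lookup() { awk -v n="$2" '$2 == n { print $1 }' "$1"; }
addr1=$(lookup "$HOSTS" LINUX1)
addr2=$(lookup "$HOSTS" LINUX2)
echo "LINUX1 -> $addr1, LINUX2 -> $addr2"
rm -f "$HOSTS"
```

If getent on either node returns a different address for these names (for example 127.0.0.1 from a stray localhost alias), openais will bind to the wrong interface and the nodes will never see each other.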
The log:
Jun 23 22:24:42 LINUX2 ccsd: Starting ccsd 2.0.60:
Jun 23 22:24:42 LINUX2 ccsd: Built: Jan 23 2007 12:42:25
Jun 23 22:24:42 LINUX2 ccsd: Copyright (C) Red Hat, Inc. 2004 All rights reserved.
Jun 23 22:24:42 LINUX2 ccsd: cluster.conf (cluster name = _cluster, version = 2) found.
Jun 23 22:24:45 LINUX2 openais: AIS Executive Service RELEASE 'subrev 1324 version 0.80.2'
Jun 23 22:24:45 LINUX2 openais: Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Jun 23 22:24:45 LINUX2 openais: Copyright (C) 2006 Red Hat, Inc.
Jun 23 22:24:45 LINUX2 openais: AIS Executive Service: started and ready to provide service.
Jun 23 22:24:45 LINUX2 openais: Using default multicast address of 239.192.88.13
Jun 23 22:24:45 LINUX2 openais: openais component openais_cpg loaded.
Jun 23 22:24:45 LINUX2 openais: Registering service handler 'openais cluster closed process group service v1.01'
Jun 23 22:24:45 LINUX2 openais: openais component openais_cfg loaded.
Jun 23 22:24:45 LINUX2 openais: Registering service handler 'openais configuration service'
Jun 23 22:24:45 LINUX2 openais: openais component openais_msg loaded.
Jun 23 22:24:45 LINUX2 openais: Registering service handler 'openais message service B.01.01'
Jun 23 22:24:45 LINUX2 openais: openais component openais_lck loaded.
Jun 23 22:24:45 LINUX2 openais: Registering service handler 'openais distributed locking service B.01.01'
Jun 23 22:24:45 LINUX2 openais: openais component openais_evt loaded.
Jun 23 22:24:45 LINUX2 openais: Registering service handler 'openais event service B.01.01'
Jun 23 22:24:45 LINUX2 openais: openais component openais_ckpt loaded.
Jun 23 22:24:45 LINUX2 openais: Registering service handler 'openais checkpoint service B.01.01'
Jun 23 22:24:45 LINUX2 openais: openais component openais_amf loaded.
Jun 23 22:24:45 LINUX2 openais: Registering service handler 'openais availability management framework B.01.01'
Jun 23 22:24:45 LINUX2 openais: openais component openais_clm loaded.
Jun 23 22:24:45 LINUX2 openais: Registering service handler 'openais cluster membership service B.01.01'
Jun 23 22:24:45 LINUX2 openais: openais component openais_evs loaded.
Jun 23 22:24:45 LINUX2 openais: Registering service handler 'openais extended virtual synchrony service'
Jun 23 22:24:45 LINUX2 openais: openais component openais_cman loaded.
Jun 23 22:24:45 LINUX2 openais: Registering service handler 'openais CMAN membership service 2.01'
Jun 23 22:24:45 LINUX2 openais: Token Timeout (10000 ms) retransmit timeout (495 ms)
Jun 23 22:24:45 LINUX2 openais: token hold (386 ms) retransmits before loss (20 retrans)
Jun 23 22:24:45 LINUX2 openais: join (60 ms) send_join (0 ms) consensus (4800 ms) merge (200 ms)
Jun 23 22:24:45 LINUX2 openais: downcheck (1000 ms) fail to recv const (50 msgs)
Jun 23 22:24:45 LINUX2 openais: seqno unchanged const (30 rotations) Maximum network MTU 1500
Jun 23 22:24:45 LINUX2 openais: window size per rotation (50 messages) maximum messages per rotation (17 messages)
Jun 23 22:24:45 LINUX2 openais: send threads (0 threads)
Jun 23 22:24:45 LINUX2 openais: RRP token expired timeout (495 ms)
Jun 23 22:24:45 LINUX2 openais: RRP token problem counter (2000 ms)
Jun 23 22:24:45 LINUX2 openais: RRP threshold (10 problem count)
Jun 23 22:24:45 LINUX2 openais: RRP mode set to none.
Jun 23 22:24:45 LINUX2 openais: heartbeat_failures_allowed (0)
Jun 23 22:24:45 LINUX2 openais: max_network_delay (50 ms)
Jun 23 22:24:45 LINUX2 openais: HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0
Jun 23 22:24:45 LINUX2 openais: Receive multicast socket recv buffer size (262142 bytes).
Jun 23 22:24:45 LINUX2 openais: Transmit multicast socket send buffer size (262142 bytes).
Jun 23 22:24:45 LINUX2 openais: The network interface is now up.
Jun 23 22:24:45 LINUX2 openais: Created or loaded sequence id 0.192.168.10.33 for this ring.
Jun 23 22:24:45 LINUX2 openais: entering GATHER state from 15.
Jun 23 22:24:45 LINUX2 openais: Initialising service handler 'openais extended virtual synchrony service'
Jun 23 22:24:45 LINUX2 openais: Initialising service handler 'openais cluster membership service B.01.01'
Jun 23 22:24:45 LINUX2 openais: Initialising service handler 'openais availability management framework B.01.01'
Jun 23 22:24:45 LINUX2 openais: Initialising service handler 'openais checkpoint service B.01.01'
Jun 23 22:24:45 LINUX2 openais: Initialising service handler 'openais event service B.01.01'
Jun 23 22:24:45 LINUX2 openais: Initialising service handler 'openais distributed locking service B.01.01'
Jun 23 22:24:45 LINUX2 openais: Initialising service handler 'openais message service B.01.01'
Jun 23 22:24:45 LINUX2 openais: Initialising service handler 'openais configuration service'
Jun 23 22:24:45 LINUX2 openais: Initialising service handler 'openais cluster closed process group service v1.01'
Jun 23 22:24:45 LINUX2 openais: Initialising service handler 'openais CMAN membership service 2.01'
Jun 23 22:24:45 LINUX2 openais: CMAN 2.0.60 (built Jan 23 2007 12:42:29) started
Jun 23 22:24:45 LINUX2 openais: Not using a virtual synchrony filter.
Jun 23 22:24:45 LINUX2 openais: Creating commit token because I am the rep.
Jun 23 22:24:45 LINUX2 openais: Saving state aru 0 high seq received 0
Jun 23 22:24:45 LINUX2 openais: entering COMMIT state.
Jun 23 22:24:45 LINUX2 openais: entering RECOVERY state.
Jun 23 22:24:45 LINUX2 openais: position member 192.168.10.33:
Jun 23 22:24:45 LINUX2 openais: previous ring seq 0 rep 192.168.10.33
Jun 23 22:24:45 LINUX2 openais: aru 0 high delivered 0 received flag 0
Jun 23 22:24:45 LINUX2 openais: Did not need to originate any messages in recovery.
Jun 23 22:24:45 LINUX2 openais: Storing new sequence id for ring 4
Jun 23 22:24:45 LINUX2 openais: Sending initial ORF token
Jun 23 22:24:45 LINUX2 openais: CLM CONFIGURATION CHANGE
Jun 23 22:24:45 LINUX2 openais: New Configuration:
Jun 23 22:24:45 LINUX2 openais: Members Left:
Jun 23 22:24:45 LINUX2 openais: Members Joined:
Jun 23 22:24:45 LINUX2 openais: This node is within the primary component and will provide service.
Jun 23 22:24:45 LINUX2 openais: CLM CONFIGURATION CHANGE
Jun 23 22:24:45 LINUX2 openais: New Configuration:
Jun 23 22:24:45 LINUX2 openais: r(0) ip(192.168.10.33)
Jun 23 22:24:45 LINUX2 openais: Members Left:
Jun 23 22:24:45 LINUX2 openais: Members Joined:
Jun 23 22:24:45 LINUX2 openais: r(0) ip(192.168.10.33)
Jun 23 22:24:45 LINUX2 openais: This node is within the primary component and will provide service.
Jun 23 22:24:45 LINUX2 openais: entering OPERATIONAL state.
Jun 23 22:24:45 LINUX2 openais: quorum regained, resuming activity
Jun 23 22:24:45 LINUX2 openais: got nodejoin message 192.168.10.33
Jun 23 22:24:45 LINUX2 ccsd: Initial status:: Quorate
Jun 23 22:24:50 LINUX2 fenced: LINUX1 not a cluster member after 3 sec post_join_delay
Jun 23 22:24:50 LINUX2 fenced: fencing node "LINUX1"
Jun 23 22:24:50 LINUX2 fence_manual: Node LINUX1 needs to be reset before recovery can procede. Waiting for LINUX1 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n LINUX1)
Jun 23 22:25:42 LINUX2 fenced: fence "LINUX1" success
Jun 23 22:25:47 LINUX2 ccsd: Attempt to close an unopened CCS descriptor (180).
Jun 23 22:25:47 LINUX2 ccsd: Error while processing disconnect: Invalid request descriptor
[Last edited by txl829 on 2008-6-28 14:33]
《Solution》
1. Did you configure the cluster with Conga or with system-config-cluster? Verify that all the cluster components installed successfully and that the key services started.
2. The clustat -l output shows that each host only recognizes itself, the other host is not seen, and neither node is under rgmanager control. It looks like heartbeat name resolution is broken. Judging from the hosts file, I'd suggest moving to a different IP range and using fully qualified host names.
《Solution》
It is basically the problem described above: at startup, the second host cannot see the first host's heartbeat, and the first host has the same problem.
The logs show no join messages from any host other than the local one, and in that situation the cluster cannot really form a quorum.
So:
First, what does your hardware look like? Physically, the link carrying the heartbeat traffic (the 10.31 and 10.33 addresses) must be up; also check whether a firewall is interfering with the heartbeat.
Second, since this is a RHEL 5 cluster, which kernel are you running? If it is the Xen kernel, switch to the regular kernel; xend interferes with the cluster network configuration.
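One more thing to check along these lines: the heartbeat is multicast, not unicast, so the nodes pinging each other proves nothing about it. The log shows the default group 239.192.88.13, and cman/openais on RHEL 5 uses UDP ports 5404/5405 by default. A diagnostic sketch (the interface name eth0 is an assumption; adjust to your heartbeat NIC):

```shell
# The heartbeat is multicast; a successful unicast ping between the
# nodes does NOT prove it works. Watch for the peer's totem traffic
# (run on one node while the other has cman started):
#   tcpdump -i eth0 -n host 239.192.88.13 and udp port 5405
# If a firewall is active, allow the traffic explicitly:
#   iptables -I INPUT -d 239.192.88.13 -p udp --dport 5405 -j ACCEPT
# Quick arithmetic check that the logged group is a valid IPv4
# multicast address (first octet in 224..239):
GROUP=239.192.88.13
first=${GROUP%%.*}
if [ "$first" -ge 224 ] && [ "$first" -le 239 ]; then
    mcast_ok=yes
else
    mcast_ok=no
fi
echo "$GROUP multicast: $mcast_ok"
```

If tcpdump never shows the other node's packets, the switch is likely dropping multicast (IGMP snooping with no querier is a common culprit on this class of setup), which would produce exactly the "each node forms its own one-node cluster and fences the other" pattern in the logs above.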
《Solution》
Reply to post #1 by txl829
1. I configured it with system-config-cluster.
2. The 10.31 and 10.33 addresses are reachable; the two machines can ping each other, and the firewall is disabled on both.
3. Because the clustat -l output looked wrong, I did not start rgmanager.
4. The kernel should not be the Xen one; I installed from the Red Hat discs and never changed the kernel.
5. I later reconfigured everything with Conga and hit the same problem.
6. I have also seen a case where clustat -l on both machines shows both nodes as Online, but "Local" appears on each machine's own node: viewed from node1, Local is node1; viewed from node2, Local is node2.
《Solution》
6. I have also seen a case where clustat -l on both machines shows both nodes as Online, but "Local" appears on each machine's own node: viewed from node1, Local is node1; viewed from node2, Local is node2.
That part is normal: "Local" just marks the node you are running clustat on. As for the other symptoms, from what you describe the setup itself should be fine.