
RHCS Configuration Problem



I built an RHCS cluster on two VMware virtual machines running RHEL 5.3.
The contents of the /etc/hosts file are:
172.2.9.220 rhcs1
172.2.9.221  rhcs2
172.2.9.222  rhcs_vip

The contents of the cluster.conf file are:

<?xml version="1.0" ?>
<cluster config_version="5" name="rhcs">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="rhcs1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rhcs_fence" nodename="rhcs1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="rhcs2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rhcs_fence" nodename="rhcs2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="2" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_manual" name="rhcs_fence"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="rhcs_fd" ordered="0" restricted="1">
                                <failoverdomainnode name="rhcs1" priority="1"/>
                                <failoverdomainnode name="rhcs2" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <fs device="/dev/hdb1" force_fsck="0" force_unmount="1" fsid="41630" fstype="ext3" mountpoint="/ha" name="ha" options="" self_fence="0"/>
                        <ip address="172.2.9.222" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="rhcs_fd" name="rhcs_svr" recovery="relocate">
                        <fs ref="ha"/>
                        <ip ref="172.2.9.222"/>
                </service>
        </rm>
</cluster>
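
As a quick sanity check of this file on RHEL 5 (a sketch, not from the original thread; ccs_tool ships with the cluster packages and xmllint with libxml2):

# xmllint --noout /etc/cluster/cluster.conf   # confirm the XML is well-formed
# ccs_tool lsnode                             # list the nodes cman will see
# ccs_tool lsfence                            # list the configured fence devices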


The fence device is managed manually, following 史應生's 《紅帽集群 (高可用性) 配置,管理和維護-最強版》.
The problem now: when I start the first machine the cluster comes up, but as soon as I start the second machine the cluster falls apart. Here is the log:
Aug 28 02:18:08 rhcs1 openais: entering GATHER state from 11.
Aug 28 02:18:08 rhcs1 openais: Creating commit token because I am the rep.
Aug 28 02:18:08 rhcs1 openais: Saving state aru 1d high seq received 1d
Aug 28 02:18:08 rhcs1 openais: Storing new sequence id for ring 80
Aug 28 02:18:08 rhcs1 openais: entering COMMIT state.
Aug 28 02:18:08 rhcs1 openais: entering RECOVERY state.
Aug 28 02:18:08 rhcs1 openais: position member 172.2.9.220:
Aug 28 02:18:08 rhcs1 openais: previous ring seq 124 rep 172.2.9.220
Aug 28 02:18:08 rhcs1 openais: aru 1d high delivered 1d received flag 1
Aug 28 02:18:08 rhcs1 openais: position member 172.2.9.221:
Aug 28 02:18:08 rhcs1 openais: previous ring seq 120 rep 172.2.9.221
Aug 28 02:18:08 rhcs1 openais: aru a high delivered a received flag 1
Aug 28 02:18:08 rhcs1 openais: Did not need to originate any messages in recovery.
Aug 28 02:18:08 rhcs1 openais: Sending initial ORF token
Aug 28 02:18:08 rhcs1 openais: CLM CONFIGURATION CHANGE
Aug 28 02:18:08 rhcs1 openais: New Configuration:
Aug 28 02:18:08 rhcs1 openais:     r(0) ip(172.2.9.220)  
Aug 28 02:18:08 rhcs1 openais: Members Left:
Aug 28 02:18:08 rhcs1 openais: Members Joined:
Aug 28 02:18:08 rhcs1 openais: CLM CONFIGURATION CHANGE
Aug 28 02:18:08 rhcs1 openais: New Configuration:
Aug 28 02:18:08 rhcs1 openais:     r(0) ip(172.2.9.220)  
Aug 28 02:18:08 rhcs1 openais:     r(0) ip(172.2.9.221)  
Aug 28 02:18:08 rhcs1 openais: Members Left:
Aug 28 02:18:08 rhcs1 openais: Members Joined:
Aug 28 02:18:08 rhcs1 openais:     r(0) ip(172.2.9.221)  
Aug 28 02:18:08 rhcs1 openais: This node is within the primary component and will provide service.
Aug 28 02:18:08 rhcs1 openais: entering OPERATIONAL state.
Aug 28 02:18:08 rhcs1 openais: cman killed by node 2 because we rejoined the cluster without a full restart
Aug 28 02:18:08 rhcs1 gfs_controld: groupd_dispatch error -1 errno 11
Aug 28 02:18:08 rhcs1 gfs_controld: groupd connection died
Aug 28 02:18:08 rhcs1 gfs_controld: cluster is down, exiting
Aug 28 02:18:33 rhcs1 ccsd: Unable to connect to cluster infrastructure after 30 seconds.
Aug 28 02:19:03 rhcs1 ccsd: Unable to connect to cluster infrastructure after 60 seconds.
Aug 28 02:19:33 rhcs1 ccsd: Unable to connect to cluster infrastructure after 90 seconds.
Aug 28 02:20:03 rhcs1 ccsd: Unable to connect to cluster infrastructure after 120 seconds.


Sometimes the cluster does come up normally, but when I run clusvadm -r rhcs1 e -m rhcs2 to relocate the service to the second machine, it fails with an error saying the rhcs2 service does not exist.
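
As an aside, the tool is spelled clusvcadm, and -r expects the service name from cluster.conf (rhcs_svr in the file above) rather than a node name, so the relocation would presumably look like:

# clusvcadm -r rhcs_svr -m rhcs2   # relocate the service to member rhcs2
# clustat                          # check which node is now running the service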

Then there is the fence device configuration: in system-config-cluster I added a fence device, set it to manual fence, and then attached it to both nodes. After doing that, is there anything else that needs to be done?
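
One thing worth noting about manual fencing (the log further down shows this too): after a node failure, fenced blocks recovery until a human acknowledges that the failed node has been reset, along the lines of:

# fence_ack_manual -n rhcs1   # run on the surviving node, only after verifying rhcs1 is really down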

Any pointers from the experts would be appreciated.
《Solution》

Reply to post #1 by tanyangxf

You have misunderstood that part: it is about using a VM fence device on the physical host.
《Solution》

Use a VM fence device on the physical host? I don't follow; could you explain in more detail? Thanks.
《Solution》

Aug 28 02:18:08 rhcs1 openais: cman killed by node 2 because we rejoined the cluster without a full restart

Look into this: why was there no full restart?
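
A full restart here means cycling the entire cluster stack on the node, not just cman; a sketch for RHEL 5 (include the gfs script only if GFS is actually used):

# service rgmanager stop
# service gfs stop
# service cman stop
# service cman start
# service gfs start
# service rgmanager start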
《Solution》

Everything does come up when I start node 1, but then as soon as I start node 2, node 1's cluster goes down.

Here is the log from starting node 2:
Aug 28 13:50:03 rhcs2 openais: Sending initial ORF token
Aug 28 13:50:03 rhcs2 openais: CLM CONFIGURATION CHANGE
Aug 28 13:50:03 rhcs2 openais: New Configuration:
Aug 28 13:50:03 rhcs2 openais: Members Left:
Aug 28 13:50:03 rhcs2 openais: Members Joined:
Aug 28 13:50:03 rhcs2 openais: CLM CONFIGURATION CHANGE
Aug 28 13:50:03 rhcs2 openais: New Configuration:
Aug 28 13:50:03 rhcs2 openais:     r(0) ip(172.2.9.221)  
Aug 28 13:50:03 rhcs2 openais: Members Left:
Aug 28 13:50:03 rhcs2 openais: Members Joined:
Aug 28 13:50:03 rhcs2 openais:     r(0) ip(172.2.9.221)  
Aug 28 13:50:03 rhcs2 openais: This node is within the primary component and will provide service.
Aug 28 13:50:03 rhcs2 openais: entering OPERATIONAL state.
Aug 28 13:50:03 rhcs2 openais: quorum regained, resuming activity
Aug 28 13:50:03 rhcs2 openais: got nodejoin message 172.2.9.221
Aug 28 13:50:03 rhcs2 openais: entering GATHER state from 11.
Aug 28 13:50:03 rhcs2 openais: Saving state aru a high seq received a
Aug 28 13:50:03 rhcs2 openais: Storing new sequence id for ring 94
Aug 28 13:50:03 rhcs2 openais: entering COMMIT state.
Aug 28 13:50:03 rhcs2 openais: entering RECOVERY state.
Aug 28 13:50:03 rhcs2 openais: position member 172.2.9.220:
Aug 28 13:50:03 rhcs2 openais: previous ring seq 4 rep 172.2.9.220
Aug 28 13:50:03 rhcs2 openais: aru e high delivered e received flag 1
Aug 28 13:50:03 rhcs2 openais: position member 172.2.9.221:
Aug 28 13:50:03 rhcs2 openais: previous ring seq 144 rep 172.2.9.221
Aug 28 13:50:03 rhcs2 ccsd: Cluster is not quorate.  Refusing connection.
Aug 28 13:50:03 rhcs2 openais: aru a high delivered a received flag 1
Aug 28 13:50:03 rhcs2 ccsd: Error while processing connect: Connection refused
Aug 28 13:50:03 rhcs2 openais: Did not need to originate any messages in recovery.
Aug 28 13:50:03 rhcs2 openais: CLM CONFIGURATION CHANGE
Aug 28 13:50:03 rhcs2 openais: New Configuration:
Aug 28 13:50:03 rhcs2 openais:     r(0) ip(172.2.9.221)  
Aug 28 13:50:03 rhcs2 openais: Members Left:
Aug 28 13:50:03 rhcs2 openais: Members Joined:
Aug 28 13:50:03 rhcs2 openais: CLM CONFIGURATION CHANGE
Aug 28 13:50:03 rhcs2 openais: New Configuration:
Aug 28 13:50:03 rhcs2 openais:     r(0) ip(172.2.9.220)  
Aug 28 13:50:03 rhcs2 openais:     r(0) ip(172.2.9.221)  
Aug 28 13:50:03 rhcs2 openais: Members Left:
Aug 28 13:50:03 rhcs2 openais: Members Joined:
Aug 28 13:50:03 rhcs2 openais:     r(0) ip(172.2.9.220)  
Aug 28 13:50:03 rhcs2 openais: This node is within the primary component and will provide service.
Aug 28 13:50:03 rhcs2 openais: entering OPERATIONAL state.
Aug 28 13:50:03 rhcs2 openais: Killing node rhcs1 because it has rejoined the cluster with existing state
Aug 28 13:50:04 rhcs2 ccsd: Initial status:: Quorate
Aug 28 13:50:13 rhcs2 openais: The token was lost in the OPERATIONAL state.
Aug 28 13:50:13 rhcs2 openais: Receive multicast socket recv buffer size (262142 bytes).
Aug 28 13:50:13 rhcs2 openais: Transmit multicast socket send buffer size (262142 bytes).
Aug 28 13:50:14 rhcs2 openais: entering GATHER state from 2.
Aug 28 13:50:18 rhcs2 openais: entering GATHER state from 0.
Aug 28 13:50:18 rhcs2 openais: Creating commit token because I am the rep.
Aug 28 13:50:18 rhcs2 openais: Saving state aru 3 high seq received 3
Aug 28 13:50:18 rhcs2 openais: Storing new sequence id for ring 98
Aug 28 13:50:18 rhcs2 openais: entering COMMIT state.
Aug 28 13:50:18 rhcs2 openais: entering RECOVERY state.
Aug 28 13:50:18 rhcs2 openais: position member 172.2.9.221:
Aug 28 13:50:18 rhcs2 openais: previous ring seq 148 rep 172.2.9.220
Aug 28 13:50:18 rhcs2 openais: aru 3 high delivered 3 received flag 1
Aug 28 13:50:18 rhcs2 openais: Did not need to originate any messages in recovery.
Aug 28 13:50:18 rhcs2 openais: Sending initial ORF token
Aug 28 13:50:18 rhcs2 openais: CLM CONFIGURATION CHANGE
Aug 28 13:50:18 rhcs2 openais: New Configuration:
Aug 28 13:50:18 rhcs2 openais:     r(0) ip(172.2.9.221)  
Aug 28 13:50:18 rhcs2 openais: Members Left:
Aug 28 13:50:18 rhcs2 openais:     r(0) ip(172.2.9.220)  
Aug 28 13:50:18 rhcs2 openais: Members Joined:
Aug 28 13:50:18 rhcs2 openais: CLM CONFIGURATION CHANGE
Aug 28 13:50:18 rhcs2 openais: New Configuration:
Aug 28 13:50:18 rhcs2 openais:     r(0) ip(172.2.9.221)  
Aug 28 13:50:18 rhcs2 openais: Members Left:
Aug 28 13:50:18 rhcs2 openais: Members Joined:
Aug 28 13:50:18 rhcs2 openais: This node is within the primary component and will provide service.
Aug 28 13:50:18 rhcs2 openais: entering OPERATIONAL state.
Aug 28 13:50:18 rhcs2 openais: got nodejoin message 172.2.9.221
Aug 28 13:50:18 rhcs2 openais: got nodejoin message 172.2.9.221
Aug 28 13:50:18 rhcs2 openais: Can't find cluster node at r(0) ip(172.2.9.220)  
Aug 28 13:50:54 rhcs2 fenced: rhcs1 not a cluster member after 3 sec post_join_delay
Aug 28 13:50:54 rhcs2 fenced: fencing node "rhcs1"
Aug 28 13:50:54 rhcs2 fence_manual: Node rhcs1 needs to be reset before recovery can procede.  Waiting for rhcs1 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n rhcs1)


The log posted earlier was from node 1.
《Solution》

First make sure your environment is sound.
Turn off the firewall and SELinux, then set all cluster services to not start at boot (disabled).
Reboot both machines.
Then start cman on the first and second machines one after the other and see what happens; for example:
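
A minimal sketch of those steps on RHEL 5 (using the service names mentioned later in this thread):

# service iptables stop && chkconfig iptables off
# setenforce 0                # and set SELINUX=disabled in /etc/selinux/config
# chkconfig cman off
# chkconfig rgmanager off
# chkconfig modclusterd off
# reboot

After both nodes are back up, on rhcs1 first and then on rhcs2:

# service cman start
# cman_tool nodes             # both nodes should eventually show status M (member)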
《Solution》

I have set the modclusterd, cman, and rgmanager services to not start at boot, and the firewall and SELinux are disabled as well.

Starting node 1 works fine, but starting node 2 hits the same old problem: node 1's cluster goes down. Here is node 1's log:
Aug 28 18:39:14 rhcs1 openais: entering GATHER state from 11.
Aug 28 18:39:14 rhcs1 openais: Creating commit token because I am the rep.
Aug 28 18:39:14 rhcs1 openais: Saving state aru 10 high seq received 10
Aug 28 18:39:14 rhcs1 openais: Storing new sequence id for ring a0
Aug 28 18:39:14 rhcs1 openais: entering COMMIT state.
Aug 28 18:39:14 rhcs1 openais: entering RECOVERY state.
Aug 28 18:39:14 rhcs1 openais: position member 172.2.9.220:
Aug 28 18:39:14 rhcs1 openais: previous ring seq 152 rep 172.2.9.220
Aug 28 18:39:14 rhcs1 openais: aru 10 high delivered 10 received flag 1
Aug 28 18:39:14 rhcs1 openais: position member 172.2.9.221:
Aug 28 18:39:14 rhcs1 openais: previous ring seq 156 rep 172.2.9.221
Aug 28 18:39:14 rhcs1 openais: aru a high delivered a received flag 1
Aug 28 18:39:14 rhcs1 openais: Did not need to originate any messages in recovery.
Aug 28 18:39:14 rhcs1 openais: Sending initial ORF token
Aug 28 18:39:14 rhcs1 openais: CLM CONFIGURATION CHANGE
Aug 28 18:39:14 rhcs1 openais: New Configuration:
Aug 28 18:39:14 rhcs1 openais:     r(0) ip(172.2.9.220)  
Aug 28 18:39:14 rhcs1 openais: Members Left:
Aug 28 18:39:14 rhcs1 openais: Members Joined:
Aug 28 18:39:14 rhcs1 openais: CLM CONFIGURATION CHANGE
Aug 28 18:39:14 rhcs1 openais: New Configuration:
Aug 28 18:39:14 rhcs1 openais:     r(0) ip(172.2.9.220)  
Aug 28 18:39:14 rhcs1 openais:     r(0) ip(172.2.9.221)  
Aug 28 18:39:14 rhcs1 openais: Members Left:
Aug 28 18:39:14 rhcs1 openais: Members Joined:
Aug 28 18:39:14 rhcs1 openais:     r(0) ip(172.2.9.221)  
Aug 28 18:39:14 rhcs1 openais: This node is within the primary component and will provide service.
Aug 28 18:39:14 rhcs1 openais: entering OPERATIONAL state.
Aug 28 18:39:14 rhcs1 openais: cman killed by node 2 because we rejoined the cluster without a full restart
Aug 28 18:39:14 rhcs1 groupd: cman_get_nodes error -1 104
Aug 28 18:39:14 rhcs1 gfs_controld: cluster is down, exiting
Aug 28 18:39:14 rhcs1 fenced: cman_get_nodes error -1 104
Aug 28 18:39:39 rhcs1 ccsd: Unable to connect to cluster infrastructure after 30 seconds.


My environment: two VMware virtual machines running RHEL 5.3.
The network is bridged, and the two hosts can ping each other. I set up a shared disk in VMware that both virtual machines share.

When configuring RHCS I did the following: after adding the two nodes, I added a fence device and set it to manual fence, then attached the fence device to each node. Next I added a failover domain containing both nodes, then the resources, and then the service. (All of the above was done on node 1.) Then I copied the cluster.conf file to node 2. Starting cman on node 1 is fine, but starting cman on node 2 produces the problem above.
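
For reference, a minimal sketch of that copy-and-start sequence (assuming root ssh access between the nodes; bump config_version whenever the file changes):

# scp /etc/cluster/cluster.conf rhcs2:/etc/cluster/cluster.conf
# service cman start          # on rhcs1 first
# service cman start          # then on rhcs2
# clustat                     # both nodes should show as Online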


Most other threads I've seen use three virtual machines, with one of them serving as the fence device. If this same two-machine setup were done with RoseHA or IBM's HACMP, it should work without problems. Could the fence device configuration be at fault? After setting the fence device to manual fence in system-config-cluster and adding it to both nodes, is that enough, or are other steps required?

[ This post was last edited by tanyangxf on 2009-8-28 10:54 ]
《Solution》

From the configuration file, the fencing looks fine. You should post your NIC configuration file and your /etc/hosts file.
《Solution》

The contents of my /etc/hosts are:
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
172.2.9.220  rhcs1
172.2.9.221  rhcs2
172.2.9.222  rhcs_vip

There is a single NIC; eth0 uses a static configuration:
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Advanced Micro Devices 79c970
DEVICE=eth0
BOOTPROTO=none
HWADDR=00:0c:29:69:d5:80
ONBOOT=yes
NETMASK=255.255.255.0
IPADDR=172.2.9.220
TYPE=Ethernet
USERCTL=no
IPV6INIT=no
PEERDNS=yes
《Solution》

In that case, post an sosreport and we will take a look.
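
On RHEL 5 that report comes from the sosreport utility in the sos package, roughly:

# yum install sos             # if it is not already installed
# sosreport                   # writes a tarball under /tmp that can be attached here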
