
RHCS Configuration Problem



I built an RHCS cluster on two VMware virtual machines running RHEL 5.3.
The contents of the /etc/hosts file are:
172.2.9.220 rhcs1
172.2.9.221  rhcs2
172.2.9.222  rhcs_vip

The contents of the cluster.conf file are:

<?xml version="1.0" ?>
<cluster config_version="5" name="rhcs">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="rhcs1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rhcs_fence" nodename="rhcs1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="rhcs2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rhcs_fence" nodename="rhcs2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="2" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_manual" name="rhcs_fence"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="rhcs_fd" ordered="0" restricted="1">
                                <failoverdomainnode name="rhcs1" priority="1"/>
                                <failoverdomainnode name="rhcs2" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <fs device="/dev/hdb1" force_fsck="0" force_unmount="1" fsid="41630" fstype="ext3" mountpoint="/ha" name="ha" options="" self_fence="0"/>
                        <ip address="172.2.9.222" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="rhcs_fd" name="rhcs_svr" recovery="relocate">
                        <fs ref="ha"/>
                        <ip ref="172.2.9.222"/>
                </service>
        </rm>
</cluster>
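
As a quick sanity check of this file on RHEL 5 (a sketch, not from the original thread; ccs_tool ships with the cluster packages and xmllint with libxml2):

# xmllint --noout /etc/cluster/cluster.conf   # confirm the XML is well-formed
# ccs_tool lsnode                             # list the nodes cman will see
# ccs_tool lsfence                            # list the configured fence devices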


The fence device is managed manually, following 史應生's 《紅帽集群 (高可用性) 配置,管理和維護-最強版》.
The problem now: when I start the first machine the cluster comes up, but as soon as I start the second machine the cluster falls apart. Here is the log:
Aug 28 02:18:08 rhcs1 openais: entering GATHER state from 11.
Aug 28 02:18:08 rhcs1 openais: Creating commit token because I am the rep.
Aug 28 02:18:08 rhcs1 openais: Saving state aru 1d high seq received 1d
Aug 28 02:18:08 rhcs1 openais: Storing new sequence id for ring 80
Aug 28 02:18:08 rhcs1 openais: entering COMMIT state.
Aug 28 02:18:08 rhcs1 openais: entering RECOVERY state.
Aug 28 02:18:08 rhcs1 openais: position member 172.2.9.220:
Aug 28 02:18:08 rhcs1 openais: previous ring seq 124 rep 172.2.9.220
Aug 28 02:18:08 rhcs1 openais: aru 1d high delivered 1d received flag 1
Aug 28 02:18:08 rhcs1 openais: position member 172.2.9.221:
Aug 28 02:18:08 rhcs1 openais: previous ring seq 120 rep 172.2.9.221
Aug 28 02:18:08 rhcs1 openais: aru a high delivered a received flag 1
Aug 28 02:18:08 rhcs1 openais: Did not need to originate any messages in recovery.
Aug 28 02:18:08 rhcs1 openais: Sending initial ORF token
Aug 28 02:18:08 rhcs1 openais: CLM CONFIGURATION CHANGE
Aug 28 02:18:08 rhcs1 openais: New Configuration:
Aug 28 02:18:08 rhcs1 openais:     r(0) ip(172.2.9.220)  
Aug 28 02:18:08 rhcs1 openais: Members Left:
Aug 28 02:18:08 rhcs1 openais: Members Joined:
Aug 28 02:18:08 rhcs1 openais: CLM CONFIGURATION CHANGE
Aug 28 02:18:08 rhcs1 openais: New Configuration:
Aug 28 02:18:08 rhcs1 openais:     r(0) ip(172.2.9.220)  
Aug 28 02:18:08 rhcs1 openais:     r(0) ip(172.2.9.221)  
Aug 28 02:18:08 rhcs1 openais: Members Left:
Aug 28 02:18:08 rhcs1 openais: Members Joined:
Aug 28 02:18:08 rhcs1 openais:     r(0) ip(172.2.9.221)  
Aug 28 02:18:08 rhcs1 openais: This node is within the primary component and will provide service.
Aug 28 02:18:08 rhcs1 openais: entering OPERATIONAL state.
Aug 28 02:18:08 rhcs1 openais: cman killed by node 2 because we rejoined the cluster without a full restart
Aug 28 02:18:08 rhcs1 gfs_controld: groupd_dispatch error -1 errno 11
Aug 28 02:18:08 rhcs1 gfs_controld: groupd connection died
Aug 28 02:18:08 rhcs1 gfs_controld: cluster is down, exiting
Aug 28 02:18:33 rhcs1 ccsd: Unable to connect to cluster infrastructure after 30 seconds.
Aug 28 02:19:03 rhcs1 ccsd: Unable to connect to cluster infrastructure after 60 seconds.
Aug 28 02:19:33 rhcs1 ccsd: Unable to connect to cluster infrastructure after 90 seconds.
Aug 28 02:20:03 rhcs1 ccsd: Unable to connect to cluster infrastructure after 120 seconds.


Sometimes the cluster does come up normally, but when I run clusvadm -r rhcs1 e -m rhcs2 to relocate the service to the second machine, it fails with an error saying the rhcs2 service does not exist.
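
As an aside, the tool is spelled clusvcadm, and -r expects the service name from cluster.conf (rhcs_svr in the file above) rather than a node name, so the relocation would presumably look like:

# clusvcadm -r rhcs_svr -m rhcs2   # relocate the service to member rhcs2
# clustat                          # check which node is now running the service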

Then there is the fence device configuration: in system-config-cluster I added a fence device, set it to manual fence, and then attached it to both nodes. After doing that, is there anything else that needs to be done?
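
One thing worth noting about manual fencing (the log further down shows this too): after a node failure, fenced blocks recovery until a human acknowledges that the failed node has been reset, along the lines of:

# fence_ack_manual -n rhcs1   # run on the surviving node, only after verifying rhcs1 is really down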

Any pointers from the experts would be appreciated.
《Solution》

Reply to post #1 by tanyangxf

You have misunderstood that part: it is about using a VM fence device on the physical host.
《Solution》

Use a VM fence device on the physical host? I don't follow; could you explain in more detail? Thanks.
《Solution》

Aug 28 02:18:08 rhcs1 openais: cman killed by node 2 because we rejoined the cluster without a full restart

Look into this: why was there no full restart?
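
A full restart here means cycling the entire cluster stack on the node, not just cman; a sketch for RHEL 5 (include the gfs script only if GFS is actually used):

# service rgmanager stop
# service gfs stop
# service cman stop
# service cman start
# service gfs start
# service rgmanager start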
《Solution》

Everything does come up when I start node 1, but then as soon as I start node 2, node 1's cluster goes down.

Here is the log from starting node 2:
Aug 28 13:50:03 rhcs2 openais: Sending initial ORF token
Aug 28 13:50:03 rhcs2 openais: CLM CONFIGURATION CHANGE
Aug 28 13:50:03 rhcs2 openais: New Configuration:
Aug 28 13:50:03 rhcs2 openais: Members Left:
Aug 28 13:50:03 rhcs2 openais: Members Joined:
Aug 28 13:50:03 rhcs2 openais: CLM CONFIGURATION CHANGE
Aug 28 13:50:03 rhcs2 openais: New Configuration:
Aug 28 13:50:03 rhcs2 openais:     r(0) ip(172.2.9.221)  
Aug 28 13:50:03 rhcs2 openais: Members Left:
Aug 28 13:50:03 rhcs2 openais: Members Joined:
Aug 28 13:50:03 rhcs2 openais:     r(0) ip(172.2.9.221)  
Aug 28 13:50:03 rhcs2 openais: This node is within the primary component and will provide service.
Aug 28 13:50:03 rhcs2 openais: entering OPERATIONAL state.
Aug 28 13:50:03 rhcs2 openais: quorum regained, resuming activity
Aug 28 13:50:03 rhcs2 openais: got nodejoin message 172.2.9.221
Aug 28 13:50:03 rhcs2 openais: entering GATHER state from 11.
Aug 28 13:50:03 rhcs2 openais: Saving state aru a high seq received a
Aug 28 13:50:03 rhcs2 openais: Storing new sequence id for ring 94
Aug 28 13:50:03 rhcs2 openais: entering COMMIT state.
Aug 28 13:50:03 rhcs2 openais: entering RECOVERY state.
Aug 28 13:50:03 rhcs2 openais: position member 172.2.9.220:
Aug 28 13:50:03 rhcs2 openais: previous ring seq 4 rep 172.2.9.220
Aug 28 13:50:03 rhcs2 openais: aru e high delivered e received flag 1
Aug 28 13:50:03 rhcs2 openais: position member 172.2.9.221:
Aug 28 13:50:03 rhcs2 openais: previous ring seq 144 rep 172.2.9.221
Aug 28 13:50:03 rhcs2 ccsd: Cluster is not quorate.  Refusing connection.
Aug 28 13:50:03 rhcs2 openais: aru a high delivered a received flag 1
Aug 28 13:50:03 rhcs2 ccsd: Error while processing connect: Connection refused
Aug 28 13:50:03 rhcs2 openais: Did not need to originate any messages in recovery.
Aug 28 13:50:03 rhcs2 openais: CLM CONFIGURATION CHANGE
Aug 28 13:50:03 rhcs2 openais: New Configuration:
Aug 28 13:50:03 rhcs2 openais:     r(0) ip(172.2.9.221)  
Aug 28 13:50:03 rhcs2 openais: Members Left:
Aug 28 13:50:03 rhcs2 openais: Members Joined:
Aug 28 13:50:03 rhcs2 openais: CLM CONFIGURATION CHANGE
Aug 28 13:50:03 rhcs2 openais: New Configuration:
Aug 28 13:50:03 rhcs2 openais:     r(0) ip(172.2.9.220)  
Aug 28 13:50:03 rhcs2 openais:     r(0) ip(172.2.9.221)  
Aug 28 13:50:03 rhcs2 openais: Members Left:
Aug 28 13:50:03 rhcs2 openais: Members Joined:
Aug 28 13:50:03 rhcs2 openais:     r(0) ip(172.2.9.220)  
Aug 28 13:50:03 rhcs2 openais: This node is within the primary component and will provide service.
Aug 28 13:50:03 rhcs2 openais: entering OPERATIONAL state.
Aug 28 13:50:03 rhcs2 openais: Killing node rhcs1 because it has rejoined the cluster with existing state
Aug 28 13:50:04 rhcs2 ccsd: Initial status:: Quorate
Aug 28 13:50:13 rhcs2 openais: The token was lost in the OPERATIONAL state.
Aug 28 13:50:13 rhcs2 openais: Receive multicast socket recv buffer size (262142 bytes).
Aug 28 13:50:13 rhcs2 openais: Transmit multicast socket send buffer size (262142 bytes).
Aug 28 13:50:14 rhcs2 openais: entering GATHER state from 2.
Aug 28 13:50:18 rhcs2 openais: entering GATHER state from 0.
Aug 28 13:50:18 rhcs2 openais: Creating commit token because I am the rep.
Aug 28 13:50:18 rhcs2 openais: Saving state aru 3 high seq received 3
Aug 28 13:50:18 rhcs2 openais: Storing new sequence id for ring 98
Aug 28 13:50:18 rhcs2 openais: entering COMMIT state.
Aug 28 13:50:18 rhcs2 openais: entering RECOVERY state.
Aug 28 13:50:18 rhcs2 openais: position member 172.2.9.221:
Aug 28 13:50:18 rhcs2 openais: previous ring seq 148 rep 172.2.9.220
Aug 28 13:50:18 rhcs2 openais: aru 3 high delivered 3 received flag 1
Aug 28 13:50:18 rhcs2 openais: Did not need to originate any messages in recovery.
Aug 28 13:50:18 rhcs2 openais: Sending initial ORF token
Aug 28 13:50:18 rhcs2 openais: CLM CONFIGURATION CHANGE
Aug 28 13:50:18 rhcs2 openais: New Configuration:
Aug 28 13:50:18 rhcs2 openais:     r(0) ip(172.2.9.221)  
Aug 28 13:50:18 rhcs2 openais: Members Left:
Aug 28 13:50:18 rhcs2 openais:     r(0) ip(172.2.9.220)  
Aug 28 13:50:18 rhcs2 openais: Members Joined:
Aug 28 13:50:18 rhcs2 openais: CLM CONFIGURATION CHANGE
Aug 28 13:50:18 rhcs2 openais: New Configuration:
Aug 28 13:50:18 rhcs2 openais:     r(0) ip(172.2.9.221)  
Aug 28 13:50:18 rhcs2 openais: Members Left:
Aug 28 13:50:18 rhcs2 openais: Members Joined:
Aug 28 13:50:18 rhcs2 openais: This node is within the primary component and will provide service.
Aug 28 13:50:18 rhcs2 openais: entering OPERATIONAL state.
Aug 28 13:50:18 rhcs2 openais: got nodejoin message 172.2.9.221
Aug 28 13:50:18 rhcs2 openais: got nodejoin message 172.2.9.221
Aug 28 13:50:18 rhcs2 openais: Can't find cluster node at r(0) ip(172.2.9.220)  
Aug 28 13:50:54 rhcs2 fenced: rhcs1 not a cluster member after 3 sec post_join_delay
Aug 28 13:50:54 rhcs2 fenced: fencing node "rhcs1"
Aug 28 13:50:54 rhcs2 fence_manual: Node rhcs1 needs to be reset before recovery can procede.  Waiting for rhcs1 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n rhcs1)


The log posted earlier was from node 1.
《Solution》

First make sure your environment is sound.
Turn off the firewall and SELinux, then set all cluster services to not start at boot (disabled).
Reboot both machines.
Then start cman on the first and second machines one after the other and see what happens; for example:
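
A minimal sketch of those steps on RHEL 5 (using the service names mentioned later in this thread):

# service iptables stop && chkconfig iptables off
# setenforce 0                # and set SELINUX=disabled in /etc/selinux/config
# chkconfig cman off
# chkconfig rgmanager off
# chkconfig modclusterd off
# reboot

After both nodes are back up, on rhcs1 first and then on rhcs2:

# service cman start
# cman_tool nodes             # both nodes should eventually show status M (member)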
《Solution》

I have set the modclusterd, cman, and rgmanager services to not start at boot, and the firewall and SELinux are disabled as well.

Starting node 1 works fine, but starting node 2 hits the same old problem: node 1's cluster goes down. Here is node 1's log:
Aug 28 18:39:14 rhcs1 openais: entering GATHER state from 11.
Aug 28 18:39:14 rhcs1 openais: Creating commit token because I am the rep.
Aug 28 18:39:14 rhcs1 openais: Saving state aru 10 high seq received 10
Aug 28 18:39:14 rhcs1 openais: Storing new sequence id for ring a0
Aug 28 18:39:14 rhcs1 openais: entering COMMIT state.
Aug 28 18:39:14 rhcs1 openais: entering RECOVERY state.
Aug 28 18:39:14 rhcs1 openais: position member 172.2.9.220:
Aug 28 18:39:14 rhcs1 openais: previous ring seq 152 rep 172.2.9.220
Aug 28 18:39:14 rhcs1 openais: aru 10 high delivered 10 received flag 1
Aug 28 18:39:14 rhcs1 openais: position member 172.2.9.221:
Aug 28 18:39:14 rhcs1 openais: previous ring seq 156 rep 172.2.9.221
Aug 28 18:39:14 rhcs1 openais: aru a high delivered a received flag 1
Aug 28 18:39:14 rhcs1 openais: Did not need to originate any messages in recovery.
Aug 28 18:39:14 rhcs1 openais: Sending initial ORF token
Aug 28 18:39:14 rhcs1 openais: CLM CONFIGURATION CHANGE
Aug 28 18:39:14 rhcs1 openais: New Configuration:
Aug 28 18:39:14 rhcs1 openais:     r(0) ip(172.2.9.220)  
Aug 28 18:39:14 rhcs1 openais: Members Left:
Aug 28 18:39:14 rhcs1 openais: Members Joined:
Aug 28 18:39:14 rhcs1 openais: CLM CONFIGURATION CHANGE
Aug 28 18:39:14 rhcs1 openais: New Configuration:
Aug 28 18:39:14 rhcs1 openais:     r(0) ip(172.2.9.220)  
Aug 28 18:39:14 rhcs1 openais:     r(0) ip(172.2.9.221)  
Aug 28 18:39:14 rhcs1 openais: Members Left:
Aug 28 18:39:14 rhcs1 openais: Members Joined:
Aug 28 18:39:14 rhcs1 openais:     r(0) ip(172.2.9.221)  
Aug 28 18:39:14 rhcs1 openais: This node is within the primary component and will provide service.
Aug 28 18:39:14 rhcs1 openais: entering OPERATIONAL state.
Aug 28 18:39:14 rhcs1 openais: cman killed by node 2 because we rejoined the cluster without a full restart
Aug 28 18:39:14 rhcs1 groupd: cman_get_nodes error -1 104
Aug 28 18:39:14 rhcs1 gfs_controld: cluster is down, exiting
Aug 28 18:39:14 rhcs1 fenced: cman_get_nodes error -1 104
Aug 28 18:39:39 rhcs1 ccsd: Unable to connect to cluster infrastructure after 30 seconds.


My environment: two VMware virtual machines running RHEL 5.3.
The network is bridged, and the two hosts can ping each other. I set up a shared disk in VMware that both virtual machines share.

When configuring RHCS I did the following: after adding the two nodes, I added a fence device and set it to manual fence, then attached the fence device to each node. Next I added a failover domain containing both nodes, then the resources, and then the service. (All of the above was done on node 1.) Then I copied the cluster.conf file to node 2. Starting cman on node 1 is fine, but starting cman on node 2 produces the problem above.
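
For reference, a minimal sketch of that copy-and-start sequence (assuming root ssh access between the nodes; bump config_version whenever the file changes):

# scp /etc/cluster/cluster.conf rhcs2:/etc/cluster/cluster.conf
# service cman start          # on rhcs1 first
# service cman start          # then on rhcs2
# clustat                     # both nodes should show as Online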


Most other threads I've seen use three virtual machines, with one of them serving as the fence device. If this same two-machine setup were done with RoseHA or IBM's HACMP, it should work without problems. Could the fence device configuration be at fault? After setting the fence device to manual fence in system-config-cluster and adding it to both nodes, is that enough, or are other steps required?

[ This post was last edited by tanyangxf on 2009-8-28 10:54 ]
《Solution》

From the configuration file, the fencing looks fine. You should post your NIC configuration file and your /etc/hosts file.
《Solution》

The contents of my /etc/hosts are:
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
172.2.9.220  rhcs1
172.2.9.221  rhcs2
172.2.9.222  rhcs_vip

There is a single NIC; eth0 uses a static configuration:
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Advanced Micro Devices 79c970
DEVICE=eth0
BOOTPROTO=none
HWADDR=00:0c:29:69:d5:80
ONBOOT=yes
NETMASK=255.255.255.0
IPADDR=172.2.9.220
TYPE=Ethernet
USERCTL=no
IPV6INIT=no
PEERDNS=yes
《Solution》

In that case, post an sosreport and we will take a look.
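
On RHEL 5 that report comes from the sosreport utility in the sos package, roughly:

# yum install sos             # if it is not already installed
# sosreport                   # writes a tarball under /tmp that can be attached here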
