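A problem with RHCS floating IPs
I have two HP DL580 servers, both running RHEL 5.1 with the RHCS RPMs that ship with it.
The resources are allocated as follows:
floating1: ip: 10.82.114.4
floating2: ip: 10.30.7.185
script1: the stock vsftpd init script
script2: the my_sample script (it starts a simple Perl script that creates a socket, binds it to floating IP 1, and listens for connections; it exists only to demonstrate the problem)
In the cluster management interface I created two services:
service1: script1, floating IP 1, floating IP 2
service2: script2, floating IP 1
The two are never enabled at the same time. I first enabled service1: vsftpd came up normally and I could connect through the floating IPs, so that test succeeded.
I then disabled service1 and enabled service2. The script could not bind to the floating IP and exited with an error.
Comparing the /var/log/messages output for the two services, I found that when the vsftpd service starts, avahi-daemon first registers the two floating IPs on their respective interfaces, whereas nothing of the kind happens when the sample service starts. As a result the floating IP does not exist when the my_sample script is invoked, so the subsequent bind is bound to fail. I also tried mysqld and tomcat services, and they failed for the same reason: the floating IP was not registered.
My question: why does RHCS get avahi-daemon to register the floating IPs only for the vsftpd service?
I have compared cluster.conf over and over and cannot see where the difference is.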
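The bind failure described in this post can be reproduced without a cluster. The following is a minimal sketch in Python (not the Perl used in this thread), assuming a Linux host where `net.ipv4.ip_nonlocal_bind` is left at its default of 0: binding to an address that is not currently assigned to any local interface fails, typically with EADDRNOTAVAIL. `try_bind` is a hypothetical helper written for illustration and is not part of RHCS. As far as I know, the avahi-daemon log lines only show avahi noticing that an address has appeared on an interface, so they are a symptom of the address being assigned rather than the mechanism that assigns it.

```python
import errno
import socket


def try_bind(ip, port=0):
    """Try to bind a TCP socket to ip:port.

    Returns None on success, or the symbolic errno name (for example
    'EADDRNOTAVAIL') when bind() fails. Port 0 lets the kernel pick a
    free port, so the check does not collide with a running service.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((ip, port))
        except OSError as e:
            return errno.errorcode.get(e.errno, str(e.errno))
    return None


if __name__ == "__main__":
    # 127.0.0.1 is always local; 10.82.114.4 is the floating IP from this
    # thread and is only bindable while it is assigned to an interface.
    for ip in ("127.0.0.1", "10.82.114.4"):
        result = try_bind(ip)
        print(ip, "bind ok" if result is None else "bind failed: " + result)
```

Running this on a node before and after enabling a service gives a quick check of whether the floating IP is actually present at the moment the script would start.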
《Solution》
Reply to post #1 by PinkOrient
Addendum 1:
/etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="32" name="eadb_cluster">
<quorumd interval="2" label="quodisk" min_score="1" tko="10" votes="3">
<heuristic interval="2" program="ping 10.82.114.1 -c1 -t1" score="1"/>
</quorumd>
<fence_daemon post_fail_delay="0" post_join_delay="3"/>
<clusternodes>
<clusternode name="eadb01.smartone.com" nodeid="1" votes="1">
<fence>
<method name="1">
<device name="eadb01_ilo"/>
</method>
</fence>
</clusternode>
<clusternode name="eadb02.smartone.com" nodeid="2" votes="1">
<fence>
<method name="1">
<device name="eadb02_ilo"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman/>
<fencedevices>
<fencedevice agent="fence_ilo" hostname="192.168.0.10" login="Administrator" name="eadb01_ilo" passwd="password"/>
<fencedevice agent="fence_ilo" hostname="192.168.0.11" login="Administrator" name="eadb02_ilo" passwd="password"/>
</fencedevices>
<rm>
<failoverdomains>
<failoverdomain name="eadb_fd" ordered="0" restricted="0">
<failoverdomainnode name="eadb01.smartone.com" priority="1"/>
<failoverdomainnode name="eadb02.smartone.com" priority="1"/>
</failoverdomain>
<failoverdomain name="eadb_service" ordered="0" restricted="0">
<failoverdomainnode name="eadb01.smartone.com" priority="1"/>
<failoverdomainnode name="eadb02.smartone.com" priority="1"/>
</failoverdomain>
</failoverdomains>
<resources>
<ip address="10.82.114.4" monitor_link="1"/>
<script file="/etc/rc.d/init.d/vsftpd" name="vsftpd"/>
<clusterfs device="/dev/VolGroup00/LogVol00" force_unmount="1" fsid="53287" fstype="gfs" mountpoint="/gfs1" name="gfs_db_space" options=""/>
<clusterfs device="/dev/VolGroup01/LogVol01" force_unmount="1" fsid="1552" fstype="gfs" mountpoint="/gfs2" name="gfs_eadb_space" options=""/>
<mysql config_file="/etc/my.cnf" listen_address="0.0.0.0" mysql_options="" name="mysql_eadb" shutdown_wait="0"/>
<script file="/etc/init.d/tomcat_eadb" name="tomcat_eadb"/>
<tomcat-5 catalina_base="eadbmgr" catalina_options="/gfs2/eadb/jakarta-tomcat-5.0.30/conf" config_file="eadb_tomcat" name="/gfs2/eadb/jakarta-tomcat-5.0.30/conf/Catalina" shutdown_wait="" tomcat_user=""/>
<script file="/etc/init.d/tomcat5" name="tomcat"/>
<ip address="10.30.7.185" monitor_link="1"/>
<script file="/etc/init.d/my_sample" name="my_sample"/>
</resources>
<service autostart="1" domain="eadb_fd" name="vsftpd">
<script ref="vsftpd"/>
<ip ref="10.82.114.4"/>
<ip ref="10.30.7.185"/>
</service>
<service autostart="1" domain="eadb_service" exclusive="1" name="eadb">
<ip ref="10.82.114.4"/>
<clusterfs ref="gfs_db_space"/>
<clusterfs ref="gfs_eadb_space"/>
<mysql ref="mysql_eadb"/>
<script ref="tomcat_eadb"/>
<ip ref="10.30.7.185"/>
</service>
<service autostart="1" domain="eadb_service" name="tc">
<ip ref="10.82.114.4"/>
<script ref="tomcat"/>
</service>
<service autostart="1" exclusive="1" name="mysql">
<ip ref="10.82.114.4"/>
<ip ref="10.30.7.185"/>
<clusterfs ref="gfs_db_space"/>
<mysql ref="mysql_eadb"/>
</service>
<service autostart="1" domain="eadb_fd" name="sample" recovery="restart">
<ip ref="10.82.114.4"/>
<script ref="my_sample"/>
</service>
</rm>
</cluster>
==================
Addendum 2:
The Perl script invoked by the my_sample service script, eadb_sm.pl
#!/usr/bin/perl
use strict;
use warnings;
use Socket;

# make the socket
socket(SERVER, PF_INET, SOCK_STREAM, getprotobyname('tcp'))
    or die "Couldn't create socket: $!\n";
# so we can restart our server quickly
setsockopt(SERVER, SOL_SOCKET, SO_REUSEADDR, 1);
# build up my socket address (the floating IP, not INADDR_ANY)
my $ip   = "10.82.114.4";
my $port = 9999;
#my $my_addr = sockaddr_in($port, INADDR_ANY);
my $my_addr = sockaddr_in($port, inet_aton($ip));
bind(SERVER, $my_addr)
    or die "Couldn't bind to $ip:$port : $!\n";
# establish a queue for incoming connections
listen(SERVER, SOMAXCONN)
    or die "Couldn't listen on port $port : $!\n";
# accept and process connections, one message per connection
while (accept(CLIENT, SERVER)) {
    print STDERR "Someone connected\n";
    # sysread returns 0 at EOF and undef on error; in both cases just close
    my $buff;
    my $bs = sysread(CLIENT, $buff, 2048);
    if ($bs) {
        # message format: 16-bit big-endian length followed by the payload
        my ($l, $s) = unpack("na*", $buff);
        print STDERR "Received: $l bytes, content:\n $s\n";
        my $tmp    = "Received!!!";
        my $tosend = pack("na*", length($tmp), $tmp);
        if (syswrite(CLIENT, $tosend)) {
            print STDERR "Sent: $tmp\n";
            print STDERR "=======================================\n";
        } else {
            print STDERR "failed to send, exit\n";
        }
    }
    close(CLIENT);
    print STDERR "Disconnected\n";
}
close(SERVER);
==================
Addendum 3:
Output from /var/log/messages
The floating IPs are registered before vsftpd starts, but not before the other service starts.
=====================
Nov 20 17:26:52 eadb01 clurgmgrd: <notice> Starting disabled service service:vsftpd
Nov 20 17:26:52 eadb01 avahi-daemon: Registering new address record for 10.82.114.4 on bond0.
Nov 20 17:26:53 eadb01 avahi-daemon: Registering new address record for 10.30.7.185 on bond1.
Nov 20 17:26:55 eadb01 clurgmgrd: <notice> Service service:vsftpd started
Nov 20 17:26:57 eadb01 snmpd: Connection from UDP: :33631
Nov 20 17:27:28 eadb01 last message repeated 3 times
Nov 20 17:27:45 eadb01 last message repeated 2 times
Nov 20 17:28:00 eadb01 snmpd: Connection from UDP: :33632
Nov 20 17:28:00 eadb01 snmpd: Received SNMP packet(s) from UDP: :33632
Nov 20 17:28:10 eadb01 clurgmgrd: <notice> Stopping service service:vsftpd
Nov 20 17:28:11 eadb01 avahi-daemon: Withdrawing address record for 10.30.7.185 on bond1.
Nov 20 17:28:15 eadb01 snmpd: Connection from UDP: :33632
Nov 20 17:28:16 eadb01 snmpd: Connection from UDP: :33632
Nov 20 17:28:21 eadb01 avahi-daemon: Withdrawing address record for 10.82.114.4 on bond0.
Nov 20 17:28:31 eadb01 clurgmgrd: <notice> Service service:vsftpd is disabled
Nov 20 17:28:31 eadb01 snmpd: Connection from UDP: :33632
Nov 20 17:28:35 eadb01 clurgmgrd: <notice> Starting disabled service service:sample
Nov 20 17:28:35 eadb01 clurgmgrd: <notice> Service service:sample started
Nov 20 17:28:43 eadb01 clurgmgrd: : <err> script:my_sample: status of /etc/init.d/my_sample failed (returned 2)
Nov 20 17:28:43 eadb01 clurgmgrd: <notice> status on script "my_sample" returned 1 (generic error)
Nov 20 17:28:43 eadb01 clurgmgrd: <notice> Stopping service service:sample
Nov 20 17:28:43 eadb01 clurgmgrd: <notice> Service service:sample is recovering
Nov 20 17:28:44 eadb01 clurgmgrd: <notice> Service service:sample is now running on member 2
Nov 20 17:28:46 eadb01 snmpd: Connection from UDP: :33632
Nov 20 17:29:17 eadb01 last message repeated 3 times
Nov 20 17:29:33 eadb01 last message repeated 2 times
Nov 20 17:29:43 eadb01 clurgmgrd: <notice> Stopping service service:sample
Nov 20 17:29:43 eadb01 clurgmgrd: <notice> Service service:sample is disabled
Nov 20 17:29:48 eadb01 snmpd: Connection from UDP: :33632
Nov 20 17:29:49 eadb01 snmpd: Connection from UDP: :33632
Nov 20 17:30:04 eadb01 snmpd: Connection from UDP: :33634
Nov 20 17:30:04 eadb01 snmpd: Received SNMP packet(s) from UDP: :33634
Nov 20 17:30:05 eadb01 clurgmgrd: <notice> Starting disabled service service:sample
Nov 20 17:30:05 eadb01 clurgmgrd: <notice> Service service:sample started
Nov 20 17:30:13 eadb01 clurgmgrd: : <err> script:my_sample: status of /etc/init.d/my_sample failed (returned 2)
Nov 20 17:30:13 eadb01 clurgmgrd: <notice> status on script "my_sample" returned 1 (generic error)
Nov 20 17:30:13 eadb01 clurgmgrd: <notice> Stopping service service:sample
Nov 20 17:30:13 eadb01 clurgmgrd: <notice> Service service:sample is recovering
Nov 20 17:30:14 eadb01 clurgmgrd: <notice> Service service:sample is now running on member 2
Nov 20 17:30:19 eadb01 snmpd: Connection from UDP: :33634
Nov 20 17:30:20 eadb01 snmpd: Connection from UDP: :33634
Nov 20 17:30:27 eadb01 clurgmgrd: <notice> Stopping service service:sample
Nov 20 17:30:27 eadb01 clurgmgrd: <notice> Service service:sample is stopped
Nov 20 17:30:27 eadb01 clurgmgrd: <notice> Starting stopped service service:sample
Nov 20 17:30:27 eadb01 clurgmgrd: <notice> Service service:sample started
Nov 20 17:30:30 eadb01 clurgmgrd: <notice> Stopping service service:sample
Nov 20 17:30:30 eadb01 clurgmgrd: <notice> Service service:sample is disabled
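《Solution》
Reply to post #1 by PinkOrient
Given what you are trying to achieve, you need to think about the resource group.
status on script "my_sample" returned 1 (generic error)
This shows that the script itself has a problem and failed to start.
Also check whether the host's firewall rules are blocking the packets.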
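《Solution》
Reply to post #3 by kns1024wh
Thanks for the reply. Please spare a little more time to look at my /var/log/messages.
Outside RHCS the script tests fine with service my_sample start/stop/status; of course, in that case the IP it binds to is a static IP.
If I make it bind to 0.0.0.0 inside RHCS, it also runs without problems, but then I cannot connect to my sample service through the floating IP.
The error you point out occurs because I make the service bind to the floating IP, yet RHCS does not register the floating IP on the interface via avahi-daemon before starting my script. So once the process is started, the bind fails and it dies. RHCS then monitors the process via status, finds that it is gone, and reports this error.
What I want to know is: why does RHCS register the floating IP via avahi-daemon only for the vsftpd service, and skip that step for the other services, so that binding to the floating IP fails?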
《Solution》
Could someone explain how RHCS floating IPs work and how to configure them?
Still unsolved; bumping so the thread does not sink.
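《Solution》
Bumping again...
Yesterday I repeated the experiment on two VMware machines and the result was the same: only the vsftpd service triggers the floating IP registration.
Then I tried something else: on top of the vsftpd service I added the script for the service I actually want, for example httpd, and to my surprise it worked.
Very strange. Does RHCS really need vsftpd along for the ride? I am completely baffled.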
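《Solution》
Solved this time. A colleague found a write-up that showrun posted earlier, and after some experimenting the problem went away.
The lesson is that the resources added to a Service should be hierarchical. My mistake was treating the IP and script resources as being at the same level and adding all of them with "Add a Shared Resource to this service".
What worked was to add the script with that button first, then select the script and click "Attach a Shared/Private Resource to the selection",
attaching the other resources the script needs to the script so that they form a tree.
That way, when RHCS starts the service, it allocates or starts the resources beginning from the leaves.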
Nov 21 08:05:05 node0 clurgmgrd: <notice> Starting disabled service service:httpd
Nov 21 08:05:10 node0 avahi-daemon: Registering new address record for 192.168.0.80 on eth0.
Nov 21 08:05:12 node0 avahi-daemon: Registering new address record for 192.168.0.15 on eth0.
Nov 21 08:05:13 node0 clurgmgrd: <notice> Service service:httpd started
Nov 21 08:05:32 node0 clurgmgrd: <notice> Stopping service service:httpd
Nov 21 08:05:32 node0 avahi-daemon: Withdrawing address record for 192.168.0.15 on eth0.
Nov 21 08:05:43 node0 avahi-daemon: Withdrawing address record for 192.168.0.80 on eth0.
Nov 21 08:05:56 node0 clurgmgrd: <notice> Service service:httpd is disabled
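My understanding is the opposite of showrun's. In his example the floating IP is the first resource added to the service, and GFS and the script are then attached to it.
But when I attached the script under the floating IP, RHCS did not register the IP; the other way round it worked.
[ Last edited by PinkOrient on 2008-11-25 23:30 ]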
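《Solution》
Attached are the cluster.conf file and the test results.
I configured four services for the experiment:
httpd works: RHCS registered the two floating IPs and then ran the script
httpd2 works
httpd3 does not work: the floating IP is not registered
httpd4 does not work: the floating IP is not registered
=========================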
<?xml version="1.0"?>
<cluster alias="new_cluster" config_version="40" name="new_cluster">
<fence_daemon post_fail_delay="0" post_join_delay="3"/>
<clusternodes>
<clusternode name="node0" nodeid="1" votes="1">
<fence/>
</clusternode>
<clusternode name="node1" nodeid="2" votes="1">
<fence/>
</clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
<fencedevices>
<fencedevice agent="fence_manual" name="mf"/>
</fencedevices>
<rm>
<failoverdomains>
<failoverdomain name="ftpd" ordered="1" restricted="0">
<failoverdomainnode name="node0" priority="1"/>
<failoverdomainnode name="node1" priority="2"/>
</failoverdomain>
</failoverdomains>
<resources>
<ip address="192.168.0.15" monitor_link="1"/>
<script file="/etc/init.d/httpd" name="httpd"/>
<ip address="192.168.0.80" monitor_link="1"/>
</resources>
<service autostart="1" domain="ftpd" exclusive="1" name="httpd" recovery="relocate">
<script ref="httpd">
<ip ref="192.168.0.80"/>
<ip ref="192.168.0.15"/>
</script>
</service>
<service autostart="1" domain="ftpd" name="httpd2" recovery="restart">
<script ref="httpd">
<ip address="192.168.0.200" monitor_link="1"/>
</script>
</service>
<service autostart="1" domain="ftpd" name="httpd3" recovery="restart">
<ip ref="192.168.0.80"/>
<script ref="httpd"/>
</service>
<service autostart="1" name="httpd4">
<ip ref="192.168.0.15">
<script ref="httpd"/>
</ip>
</service>
</rm>
</cluster>
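The four services above differ only in how the ip and script resources are nested, which is hard to see in flat XML. The sketch below (Python, standard library only) prints each service's resource nesting so the layouts can be compared side by side. `service_trees` and `format_trees` are hypothetical helper names, and the script only displays the nesting as written in the file; it makes no claim about the order in which rgmanager starts the resources.

```python
import sys
import xml.etree.ElementTree as ET


def service_trees(conf_xml):
    """Return {service name: [(depth, tag, identifier), ...]}.

    The identifier is the element's ref, address, or name attribute,
    whichever is found first. Depth 0 is a direct child of <service>.
    """
    root = ET.fromstring(conf_xml)
    trees = {}
    for service in root.iter("service"):
        rows = []

        def walk(node, depth):
            # depth-first, in document order
            for child in node:
                ident = (child.get("ref") or child.get("address")
                         or child.get("name"))
                rows.append((depth, child.tag, ident))
                walk(child, depth + 1)

        walk(service, 0)
        trees[service.get("name")] = rows
    return trees


def format_trees(trees):
    """Render the result of service_trees() as indented text."""
    lines = []
    for name, rows in trees.items():
        lines.append("service " + name)
        for depth, tag, ident in rows:
            lines.append("  " * (depth + 1) + f"{tag} {ident}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Usage: python show_tree.py /etc/cluster/cluster.conf
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as fh:
            print(format_trees(service_trees(fh.read())))
```

For the file above, httpd and httpd2 show the ip entries one level below the script, httpd3 shows ip and script at the same level, and httpd4 shows the script one level below the ip.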
《Solution》
Your script produces no status output.
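The status action matters here because, as the log earlier in the thread shows, rgmanager calls `status` on /etc/init.d/my_sample and treats a non-zero return as a failure. The my_sample init script itself is not shown in the thread, so the following is only a sketch of what a status check could look like, written in Python and following the LSB init-script convention for status exit codes (0 running, 1 dead but pid file exists, 3 not running). It assumes the daemon writes its PID to a pid file; the function name and any pid file path are hypothetical.

```python
import os

# LSB init-script exit codes for the "status" action
RUNNING = 0
DEAD_WITH_PIDFILE = 1
NOT_RUNNING = 3


def status_exit_code(pid_file):
    """Map the state of the process named in pid_file to an LSB status code."""
    try:
        with open(pid_file) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return NOT_RUNNING
    except ValueError:
        # pid file exists but does not contain a number
        return DEAD_WITH_PIDFILE
    if pid <= 0:
        return DEAD_WITH_PIDFILE
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending anything
    except ProcessLookupError:
        return DEAD_WITH_PIDFILE
    except PermissionError:
        # the process exists but belongs to another user
        return RUNNING
    return RUNNING
```

A status action built on this would print a short message and exit with the returned code, so that both a human running `service my_sample status` and the cluster manager see a consistent result.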