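Linux AS 3.0 Cluster HA: Oracle service not working, help needed!
1. Hardware: 2 x HP rx4640 (2 x 1.3 GHz CPUs, 4 GB RAM), 1 x HP MSA1000 array with four 146 GB disks in RAID 5
2. System: Red Hat Linux AS 3.0 + Cluster HA + Oracle 9020 for Linux IA64
(The operating system and the Oracle software are on the local disks; the Oracle data files in /oradata are on the array.)
Current problems: 1. The two-node cluster configures fine, but when the oracle service starts, the database does not come up automatically; only the listener starts on its own. Why? Is my oracle service script wrong?
2. At noon today the database shut down for no apparent reason. The cluster log shows that the cluster oracle service was restarted automatically, but the database could not bring itself back up. Could anyone with experience take a look?
The scripts are as follows: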
oraserver.sh
----------------------
#!/bin/sh
# description: Oracle auto start-stop script.
# chkconfig: - 20 80
#
# Set ORA_HOME to be equivalent to the $ORACLE_HOME
# from which you wish to execute dbstart and dbshut;
#
# Set ORA_OWNER to the user id of the owner of the
# Oracle database in ORA_HOME.
ORA_HOME=/oracle/product/9.2.0.4.0
ORA_OWNER=oracle
if [ ! -f $ORA_HOME/bin/dbstart ]
then
echo "Oracle startup: cannot start"
exit
fi
case "$1" in
'start')
# Start the Oracle databases:
# The following command assumes that the oracle login
# will not prompt the user for any values
su - $ORA_OWNER -c "$ORA_HOME/bin/lsnrctl start"
sh /oracle/dbstart
;;
'stop')
# Stop the Oracle databases:
# The following command assumes that the oracle login
# will not prompt the user for any values
sh /oracle/dbshut
su - $ORA_OWNER -c "$ORA_HOME/bin/lsnrctl stop"
;;
esac
---------------
dbstart.sh
---------------
su - oracle <<EOF
sqlplus /nolog
connect SYS/change_on_install as SYSDBA
shutdown immediate
exit
----------
dbshut
---------
su - oracle <<EOF
sqlplus /nolog
connect SYS/change_on_install as SYSDBA
shutdown immediate
exit
-----------
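Is there anything wrong with these scripts? Will they work?
3. I am also posting the host's /var/log/messages so everyone can help analyze it. Is my cpu0 faulty? It keeps reporting errors.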
Aug 31 12:19:51 localhost last message repeated 15 times
Aug 31 12:31:09 localhost kernel: oracle(27266): floating-point assist fault at ip 4000000004174822
Aug 31 12:31:09 localhost last message repeated 3 times
Aug 31 12:33:33 localhost kernel: oracle(1544): floating-point assist fault at ip 4000000004174822
Aug 31 12:33:33 localhost last message repeated 3 times
Aug 31 12:50:41 localhost modprobe: modprobe: Can't locate module
Aug 31 12:50:41 localhost clusvcmgrd: : <err> service error: IP address 172.17.116.175 missing
Aug 31 12:50:41 localhost clusvcmgrd: : <err> service error: : error fetching interface information: Device not found
Aug 31 12:50:41 localhost clusvcmgrd: : <err> service error: Check status failed on IP addresses for oracle
Aug 31 12:50:41 localhost clusvcmgrd: <warning> Restarting locally failed service oracle
Aug 31 12:50:42 localhost clusvcmgrd: : <notice> service notice: Stopping service oracle ...
Aug 31 12:50:42 localhost clusvcmgrd: : <notice> service notice: Running user script '/oracle/dbserver stop'
Aug 31 12:50:42 localhost su(pam_unix): session opened for user oracle by (uid=0)
Aug 31 12:54:45 localhost su(pam_unix): session closed for user oracle
Aug 31 12:54:45 localhost su(pam_unix): session opened for user oracle by (uid=0)
Aug 31 12:54:46 localhost su(pam_unix): session closed for user oracle
Aug 31 12:54:46 localhost modprobe: modprobe: Can't locate module
Aug 31 12:54:47 localhost clusvcmgrd: : <notice> service notice: Stopped service oracle ...
Aug 31 12:54:47 localhost clusvcmgrd: <notice> Starting stopped service oracle
Aug 31 12:54:47 localhost clusvcmgrd: : <notice> service notice: Starting service oracle ...
Aug 31 12:54:47 localhost kernel: kjournald starting. Commit interval 5 seconds
Aug 31 12:54:47 localhost kernel: EXT3 FS 2.4-0.9.19, 19 August 2002 on sd(8,34), internal journal
Aug 31 12:54:47 localhost kernel: EXT3-fs: mounted filesystem with ordered data mode.
Aug 31 12:54:47 localhost clusvcmgrd: : <notice> service notice: Running user script '/oracle/dbserver start'
Aug 31 12:54:47 localhost su(pam_unix): session opened for user oracle by (uid=0)
Aug 31 12:54:47 localhost su(pam_unix): session closed for user oracle
Aug 31 12:54:47 localhost clusvcmgrd: : <notice> service notice: Started service oracle ...
Aug 31 13:04:56 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:12:27 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:16:12 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:18:12 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:18:46 localhost login(pam_unix): session opened for user root by (uid=0)
Aug 31 13:18:46 localhost -- root: ROOT LOGIN ON pts/2 FROM 172.17.113.200
Aug 31 13:18:50 localhost su(pam_unix): session opened for user oracle by root(uid=0)
Aug 31 13:19:13 localhost su(pam_unix): session closed for user oracle
Aug 31 13:19:45 localhost su(pam_unix): session opened for user oracle by root(uid=0)
Aug 31 13:20:12 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:20:24 localhost kernel: oracle(26511): floating-point assist fault at ip 40000000041743e2
Aug 31 13:20:24 localhost kernel: oracle(26511): floating-point assist fault at ip 4000000004174822
Aug 31 13:20:24 localhost kernel: oracle(26511): floating-point assist fault at ip 40000000041743e2
Aug 31 13:20:24 localhost kernel: oracle(26511): floating-point assist fault at ip 4000000004174822
Aug 31 13:20:30 localhost kernel: oracle(26511): floating-point assist fault at ip 40000000041743e2
Aug 31 13:20:30 localhost kernel: oracle(26511): floating-point assist fault at ip 4000000004174822
Aug 31 13:20:30 localhost kernel: oracle(26511): floating-point assist fault at ip 40000000041743e2
Aug 31 13:20:30 localhost kernel: oracle(26511): floating-point assist fault at ip 4000000004174822
Aug 31 13:21:09 localhost kernel: oracle(26506): floating-point assist fault at ip 4000000004174822
Aug 31 13:21:09 localhost last message repeated 3 times
Aug 31 13:21:17 localhost su(pam_unix): session closed for user oracle
Aug 31 13:21:29 localhost su(pam_unix): session opened for user oracle by root(uid=0)
Aug 31 13:22:12 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:22:36 localhost kernel: oracle(26504): floating-point assist fault at ip 40000000041743e2
Aug 31 13:22:36 localhost kernel: oracle(26504): floating-point assist fault at ip 400000000418a0b1
Aug 31 13:22:36 localhost kernel: oracle(26504): floating-point assist fault at ip 4000000004b3d261
Aug 31 13:22:36 localhost kernel: oracle(26504): floating-point assist fault at ip 40000000041743e2
Aug 31 13:22:43 localhost su(pam_unix): session closed for user oracle
Aug 31 13:23:39 localhost su(pam_unix): session opened for user oracle by root(uid=0)
Aug 31 13:23:46 localhost kernel: oracle(26508): floating-point assist fault at ip 4000000004174822
Aug 31 13:23:46 localhost last message repeated 3 times
Aug 31 13:24:12 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:26:12 localhost kernel: mca: CPU 0 SAL log contains CPE error record
In the log, 172.17.116.175 is my service IP. The database went down right when the "missing" message appeared!
What is the cause?
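To line up the three kinds of events in a log like the one above, a small filter can help. The following is only a minimal sketch: `summarize_log` is a hypothetical helper name, and the patterns are simply copied from the messages shown in this post. It counts matching lines only, so syslog's "last message repeated N times" lines are not added in.

```shell
#!/bin/sh
# Hypothetical helper: count the three kinds of messages seen in this thread.
# Only literal matching lines are counted; "last message repeated" is ignored.
summarize_log() {
    log=$1
    echo "cluster errors: $(grep -c 'clusvcmgrd.*<err>' "$log")"
    echo "CPE records: $(grep -c 'SAL log contains CPE error record' "$log")"
    echo "FP assist faults: $(grep -c 'floating-point assist fault' "$log")"
}
```

It would be called as `summarize_log /var/log/messages`. To see what the cluster manager itself reported around the time of the restart, `grep clusvcmgrd /var/log/messages` shows those lines with their timestamps.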
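《Solution》
You posted in the wrong board. Doesn't the Linux section have a cluster board? I have moved this thread here from System Administration. If it had stayed there, you would probably have had a hard time getting an answer.
[ Last edited by nntp on 2006-9-1 01:22 ]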
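《Solution》
1. Run your script by hand and watch the log with tail in another console. I suspect that sh /oracle/dbstart is failing: when execution reaches that line, the script's permissions or the environment variables it runs with leave the startup conditions unmet.
There are standard methods and procedures for debugging this kind of failover Oracle cluster. Your problem should not be serious. Debug it carefully: build a checklist, then verify the items one by one.
2. CPU errors like these clearly mean a race condition has been triggered. This is a 2.4 kernel bug. /proc contains the salinfo handling code, and the kernel in the OS release you are running has a flaw that produces a race condition on Itanium 2 servers. There are three ways to fix it.
a) Apply a salinfo patch to the kernel. This needs a real specialist; without kernel patching and debugging experience it is very hard to get right.
b) Upgrade only the kernel on both nodes to the latest version (currently U8).
c) Upgrade Red Hat on both nodes to the latest update (currently Update 8).
My personal opinion: if circumstances allow, go with option c.
Unfortunately, all three fixes require the person doing the work to have enough experience and skill with OS-level problems in a cluster environment, especially if your cluster is in production and cannot be down for long.
Also, back up your data and system before you start, so that you can roll back.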
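The checklist idea from point 1 could start with something like the sketch below. It is only an illustration under assumptions: `check_script` is a hypothetical helper, and the three checks (exists, readable, executable) are examples rather than a complete list for an Oracle failover service.

```shell
#!/bin/sh
# Hypothetical helper: report basic preconditions for a script that the
# cluster manager (or the service script) will call. Not exhaustive.
check_script() {
    script=$1
    if [ ! -f "$script" ]; then
        echo "MISSING  $script"
        return 1
    fi
    if [ ! -r "$script" ]; then
        echo "UNREADABLE  $script"
        return 1
    fi
    if [ ! -x "$script" ]; then
        # "sh script" still works without the execute bit, but calling the
        # script directly (for example via su - oracle -c) does not.
        echo "NOEXEC  $script"
        return 1
    fi
    echo "OK  $script"
}
```

For example, `check_script /oracle/dbstart` and `check_script /oracle/dbshut`. For the manual run itself, `sh -x /oracle/dbserver start` prints each command as it executes, while `tail -f /var/log/messages` runs in the second console.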
Good Luck,
[ Last edited by nntp on 2006-9-1 01:28 ]
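《Solution》
The cause is that the /oracle/dbstart and /oracle/dbshut scripts cannot successfully start the database.
PS: a small question.
Why not use the same mechanism as
su - $ORA_OWNER -c "$ORA_HOME/bin/lsnrctl stop"
to start the database?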
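《Solution》
Many thanks to everyone above for the patient answers!
In that case, how should I write my oracle service script?
《Solution》
Your dbstart.sh script is wrong, isn't it? It shuts the database down instead of starting it.
《Solution》
Originally posted by xujian200412 on 2006-9-1 16:18
Your dbstart.sh script is wrong, isn't it? It shuts the database down instead of starting it.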
case "$1" in
'start')
# Start the Oracle databases:
# The following command assumes that the oracle login
# will not prompt the user for any values
su - $ORA_OWNER -c "$ORA_HOME/bin/lsnrctl start"
sh /oracle/dbstart
;;
Ahem....
《Solution》
/oracle/dbstart
------------------------------
sqlplus /nolog <<eof
conn / as sysdba
startup
eof
-----------------------------
/oracle/dbshut
------------------------------
sqlplus /nolog <<eof
conn / as sysdba
shutdown immediate
eof
-----------------------------
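1. As the oracle user, run /oracle/dbstart and /oracle/dbshut and confirm that the two scripts start and shut down the database correctly.
2. As root, run su - oracle -c "/oracle/dbstart" and su - oracle -c "/oracle/dbshut" to check that the oracle environment variables are set correctly.
3. Add su - oracle -c "/oracle/dbstart" and su - oracle -c "/oracle/dbshut" to the cluster script and test whether they work correctly inside the cluster.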
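Step 3 might end up looking roughly like the sketch below. This is not a tested cluster script: it assumes /oracle/dbstart and /oracle/dbshut already pass steps 1 and 2 and are executable, and it reuses ORA_HOME and ORA_OWNER from the original post. The names `run_as_owner` and `oracle_service`, and the `RUN_AS` variable, are hypothetical and exist only so the dispatch logic can be dry-run without touching the database.

```shell
#!/bin/sh
# Sketch of the service logic, assuming /oracle/dbstart and /oracle/dbshut
# work when run as the oracle user. Values below come from the original post.
ORA_HOME=/oracle/product/9.2.0.4.0
ORA_OWNER=oracle

# RUN_AS is a hypothetical dry-run hook: with RUN_AS=echo the commands are
# printed instead of executed. By default each command runs in a login
# shell of $ORA_OWNER, so the oracle environment is loaded.
run_as_owner() {
    if [ -n "$RUN_AS" ]; then
        $RUN_AS "$1"
    else
        su - "$ORA_OWNER" -c "$1"
    fi
}

oracle_service() {
    case "$1" in
    start)
        # Listener first, then the database.
        run_as_owner "$ORA_HOME/bin/lsnrctl start"
        run_as_owner "/oracle/dbstart"
        ;;
    stop)
        # Database first, then the listener.
        run_as_owner "/oracle/dbshut"
        run_as_owner "$ORA_HOME/bin/lsnrctl stop"
        ;;
    *)
        echo "Usage: oracle_service {start|stop}" >&2
        return 1
        ;;
    esac
}
```

In a real service script the last line would be `oracle_service "$1"`. Setting `RUN_AS=echo` before calling `oracle_service start` only prints the two commands, which is a cheap way to check the ordering before testing inside the cluster.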
《Solution》
Does anyone know where to find other HA software, something other than Red Hat Linux Cluster HA?
《Solution》
There is some in the pinned thread:
http://bbs.chinaunix.net/viewthread.php?tid=817188&extra=page%3D1