Rocks cluster parallel script problem, experts please take a look

火星人 @ 2014-03-04, reply: 0

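When the requested resources exceed 8 CPUs (each compute node has 8 cores), that is, when the job needs CPUs from more than one compute node, the script does not run properly. Details below: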

On Rocks cluster 5.3, I am using SGE to submit a script that tests mpi-ring (the script comes from the Rocks SGE examples). With 8 or fewer cores it runs normally, e.g.:
$ qsub -pe orte 2 mpi-ring.qsub

With more than 8 cores it does not run properly. The job state is first "r" and then changes to "dr", e.g.:
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
     25 0.55500 mpi-test.q xuwenyue     r     04/20/2010 00:30:22 all.q@compute-0-17.local          16        
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
     25 0.55500 mpi-test.q xuwenyue     dr    04/20/2010 00:30:22 all.q@compute-0-17.local          16        
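
To follow the "r" to "dr" transition shown above without reading the whole table, the job ID and state columns can be pulled out of the qstat output. This is only a rough sketch: `job_states` is a made-up helper name, and it assumes the default qstat column layout shown in this post (job ID in column 1, state in column 5).

```shell
# Hypothetical helper: print "<job-ID> <state>" for each job line of qstat output.
# Reads qstat-style text on stdin; header and separator lines are skipped
# because their first field is not a number.
job_states() {
    awk '$1 ~ /^[0-9]+$/ { print $1, $5 }'
}

# Assumed usage on the frontend:
#   qstat | job_states
```

`qstat -j 25` (with the job ID from the table) prints the detailed information SGE holds for that job, which may say more than the one-letter state.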

The error output generated is as follows:
$ cat mpi-test.qsub.o25
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
--------------------------------------------------------------------------
A daemon (pid 8970) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished


The qlogin command works, but qrsh does not, e.g.:
# qrsh -verbose
Your job 62 ("QRLOGIN") has been submitted
waiting for interactive job to be scheduled ...(several minutes)error:
error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
《Solution》

Does the cluster allow rsh logins?
《Solution》

Does rsh need to be enabled? That shouldn't be the case, should it?
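
One way to check how SGE starts remote and interactive sessions, rather than guessing whether rsh is involved, is to look at the `qlogin_*`, `rlogin_*` and `rsh_*` entries in the cluster configuration (described in sge_conf(5)). The sketch below only filters those lines out of `qconf -sconf` output. `remote_startup_settings` is a made-up helper name, and nothing in this thread confirms that these settings are the cause of the qrsh failure.

```shell
# Hypothetical helper: keep only the remote-startup entries from
# `qconf -sconf` style output read on stdin.
remote_startup_settings() {
    grep -E '^[[:space:]]*(qlogin|rlogin|rsh)_(command|daemon)[[:space:]]'
}

# Assumed usage on the frontend (global configuration):
#   qconf -sconf | remote_startup_settings
```

If qlogin and qrsh are configured with different commands or daemons, that difference would be a reasonable place to start looking, since qlogin works here and qrsh does not.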
《Solution》

This post was last edited by numdisp on 2010-04-21 09:50


This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
Posted by 涸澤而漁 on 2010-04-19 17:47

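Isn't the hint right there? Are the required runtime libraries present on the compute nodes?
ssh to any compute node and run the program locally on that node (without submitting it to SGE). Does it run correctly?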
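
For the library check suggested above, `ldd` run on a compute node marks unresolved libraries with "not found". A minimal sketch, assuming the binary is called `./mpi-ring` (the real path is not given in this thread) and using a made-up helper name `missing_libs`:

```shell
# Hypothetical helper: print the names of shared libraries that ldd
# reports as "not found". Reads ldd-style output on stdin.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Assumed usage, after ssh-ing to a compute node:
#   ldd ./mpi-ring | missing_libs
```

Empty output would suggest the dynamic linker can resolve everything on that node, in which case the cause is more likely in how the job is launched than in the libraries themselves.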

As for rsh, only some programs need it (mainly those built against old runtime libraries). Most current applications should no longer need it.

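That said, Rocks really is full of bugs. I once studied some of their source code, and many parts are simply appalling. The documentation is not great either: docs for new versions are often mixed with extremely outdated information, which is thoroughly misleading.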

http://coctec.com/docs/service/show-post-5728.html