Rocks cluster parallel script problem, experts please take a look

火星人 @ 2014-03-04, reply: 0

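When the job requests more than 8 CPUs (each compute node has 8 cores), that is, when it needs CPUs from more than one compute node, the script does not run properly. Details follow: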

Rocks cluster 5.3, submitting a script through SGE to test mpi-ring (the script comes from the Rocks SGE examples). With 8 or fewer cores it runs normally, for example:
$ qsub -pe orte 2 mpi-ring.qsub
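To make the 8-core boundary concrete, here is a minimal sketch of the arithmetic. `nodes_needed` is a hypothetical helper, not part of Rocks or SGE, and it assumes every compute node offers the same number of cores. It gives the minimum number of nodes a slot request has to span; the scheduler may spread the slots over more nodes.

```shell
# Hypothetical helper (not part of Rocks or SGE): minimum number of compute
# nodes a slot request must span, assuming uniform cores per node.
nodes_needed() {
    local slots="$1"
    local cores_per_node="$2"
    # Ceiling division using integer arithmetic.
    echo $(( (slots + cores_per_node - 1) / cores_per_node ))
}

nodes_needed 2 8     # the working request: fits on a single node
nodes_needed 16 8    # the failing request: has to span at least two nodes
```

Any request above 8 slots therefore forces SGE to start processes on a second node, which is the point where remote startup comes into play.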

With more than 8 cores it does not run normally. The job state is first "r" and then changes to "dr", for example:
$ qstat
job-ID  prior name    user       state submit/start at    queue                         slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   25 0.55500 mpi-test.q xuwenyue    r    04/20/2010 00:30:22 all.q@compute-0-17.local       16
$ qstat
job-ID  prior name    user       state submit/start at    queue                         slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   25 0.55500 mpi-test.q xuwenyue    dr   04/20/2010 00:30:22 all.q@compute-0-17.local       16
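For readers unfamiliar with the state column: qstat combines single-letter flags, so "dr" reads as a running job that is marked for deletion. The sketch below spells the letters out. `describe_state` is a hypothetical helper that covers only a few common letters and reports anything else as unknown.

```shell
# Hypothetical helper: spell out the letters of an SGE qstat state string.
# Only a handful of common letters are handled here.
describe_state() {
    local rest="$1"
    local out=""
    local letter word
    while [ -n "$rest" ]; do
        letter=${rest%"${rest#?}"}   # first character of what is left
        rest=${rest#?}               # drop that character
        case $letter in
            d) word="deletion" ;;
            r) word="running" ;;
            q) word="queued" ;;
            w) word="waiting" ;;
            h) word="hold" ;;
            E) word="error" ;;
            t) word="transferring" ;;
            *) word="unknown($letter)" ;;
        esac
        out="${out:+$out }$word"
    done
    echo "$out"
}

describe_state r
describe_state dr
```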

The error output generated is as follows:
$ cat mpi-test.qsub.o25
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
--------------------------------------------------------------------------
A daemon (pid 8970) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

The qlogin command works normally, but qrsh does not, for example:
# qrsh -verbose
Your job 62 ("QRLOGIN") has been submitted
waiting for interactive job to be scheduled ...(several minutes)error:
error: ending connection before all data received
error:
error reading job context from "qlogin_starter"

《Solution》

Does the cluster allow rsh login?

《Solution》

Does rsh need to be enabled? That shouldn't be the case, should it?

《Solution》

Last edited by numdisp on 2010-04-21 09:50

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
涸澤而漁 posted on 2010-04-19 17:47

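Isn't the hint right there? Do the compute nodes have the required runtime libraries?
ssh into any compute node and run the program locally on that node (without submitting it to SGE). Does it execute correctly?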
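One way to carry out that library check is to run ldd on the compute node and look for entries marked "not found". The sketch below is only illustrative: `missing_libs` is a hypothetical filter, the binary path in the usage comment is an assumption, and the demonstration feeds it canned ldd-style text rather than output from the cluster in question.

```shell
# Hypothetical filter: print the names of libraries that ldd reports as "not found".
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Usage on a compute node (the path to the binary is an assumption):
#   ldd "$HOME/mpi-ring" | missing_libs
#
# Demonstration with canned ldd-style text, not real output from this cluster:
printf '\tlibmpi.so.0 => not found\n\tlibc.so.6 => /lib64/libc.so.6 (0x0)\n' | missing_libs
```

If the filter prints nothing, every library the binary links against was resolved on that node. Since the error message talks about the launch daemon, the same check could also be pointed at the MPI runtime's own daemon binary.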

As for rsh, only some programs need it (mainly ones built against old runtime libraries). Most current applications should no longer need it.
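Since qlogin works while qrsh fails, it may be worth comparing how SGE itself is configured to start remote sessions. `qconf -sconf` prints the global configuration, which includes the `qlogin_*`, `rlogin_*` and `rsh_*` entries. The sketch below filters for just those lines. `remote_startup_settings` is a hypothetical helper, and the sample text in the demonstration is made up for illustration, not taken from this cluster.

```shell
# Hypothetical filter: keep only the remote-startup entries of an SGE configuration dump.
remote_startup_settings() {
    grep -E '^(qlogin|rlogin|rsh)_(command|daemon)[[:space:]]'
}

# Usage on the frontend:
#   qconf -sconf | remote_startup_settings
#
# Demonstration with made-up sample text (values are illustrative only):
printf 'execd_spool_dir  /some/spool/dir\nqlogin_command  builtin\nrsh_daemon  builtin\n' | remote_startup_settings
```

Differences between the qlogin and rsh entries would be a natural place to start looking, though this is only a guess at where the problem lies.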

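That said, Rocks really does have a lot of bugs. I once studied some of their source code, and many parts of it are simply appalling. The documentation is not much better: the docs for new versions are often still mixed with extremely outdated information, which is completely misleading.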
