Troubleshooting Lotus Domino hangs and crashes

火星人 @ 2014-03-05, reply: 0

Level: Intermediate
Kiran Bellari, Software Engineer, IBM
27 March 2006

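Quick: what is the difference between a server hang and a server crash? More importantly, how do you fix them? In this article, we explain how to identify Lotus Domino server hangs and crashes, and how to analyze and correct them.
Lotus Domino is built to be very reliable. But even the best-built products can run into problems that cause them to hang or crash. When that happens, the faster you isolate, analyze, and fix the problem, the sooner your users are back up and running, and the sooner you can turn your attention to other things.

This article offers some ideas for fixing Notes/Domino problems. We start by defining the difference between a server hang and a server crash, with examples of how to resolve each. We conclude with an overview of the new troubleshooting features in Notes/Domino 7, the latest release of the product. We assume you are an experienced Domino administrator familiar with basic Notes/Domino concepts and terminology.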

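What are server hangs and crashes?

Before getting into the technical details, let's define two commonly used terms, crash and hang, so that we share the same understanding.

Server crash
A Domino server crash is a situation in which the server program has terminated and is no longer running. You can often determine which task the server was performing when it terminated by looking at the crash screen or at the NSD/RIP log file (depending on which release of Domino you are running).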

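Common symptoms of a Domino server crash include:

    * The Domino server is no longer running, but other programs on the system are still running.
    * The Domino server console does not appear, even when tasks appear to be loaded.
    * The Domino server loads and then abruptly comes down without doing anything.
    * A panic error appears on the console or in Log.nsf, and the system comes down.
    * NSD/RIP runs automatically and generates a file, and the server comes down and/or restarts by itself.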

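There are several different types of server crashes. A one-time crash, as the name implies, may occur once and never appear again. It can happen when a process touches bad memory or a corrupted document and brings Domino down. For example, suppose a document deposited in Mail.box is corrupted. When the Domino router accesses Mail.box to route that document to its destination, the server crashes. A similar situation may or may not occur again. In general, one-time crashes are the most difficult to analyze.

A reproducible crash is one that can be repeated by following a sequence of steps. One example is a form containing a badly coded button that causes a crash every time it is pressed.

Repetitive crashes occur on a regular schedule. They do not seem to be associated with any specific action; instead, they happen, for example, at the same time every day. In such situations, you need to identify exactly what is running on the server during the period in which the problem occurs. For instance, imagine that a Domino server has a scheduled agent enabled that runs once a month. This agent may be causing the server crash. In that case, first disable the agent, then investigate why it causes the problem (and fix it).

An ABEND is a special form of server crash. The term ABEND is a contraction of "abnormal end." ABEND crashes do not produce RIP or NSD files.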

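Causes of crashes include:

    * A software problem in the code (either on the server or on the client).
    * Corruption in a database.
    * A software problem in a third-party application accessing Domino.
    * Insufficient memory.
    * Restricted operations caused by customized code.
    * A memory leak.
    * An incomplete request.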

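Server hang

A Domino server hang is a situation in which the Domino server is still running, but one or more tasks on the server are not responding to requests. These tasks may still be active, but they are not doing what they are supposed to do. The term "hang" also describes a state that can occur when a program does not run as designed. Most of the time, a hang is caused by a low-level loop or by a resource that stays unavailable, and the result is a serious performance problem. (Server hangs are most commonly attributed to resource issues, so they are sometimes treated as performance problems.)

During a hang, the program seems paralyzed: no error messages are displayed, and the screen freezes or the application does not respond to user actions. Keyboard input and mouse clicks have no effect, regardless of where the cursor is placed, yet the program is still running. Unlike an ABEND or crash, a hang sometimes resolves itself, and the application resumes normal execution without your intervention. Such a case is better regarded as a performance issue than as a hang.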

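Symptoms of a Domino server hang include:

    * Domino is still running, but is not responsive to clients. In this case, users often report that they are receiving "Server not responding" messages.
    * The console behaves as if it is disconnected and won't accept any commands, not even a simple command such as quit.
    * Clients accessing the server (for example, opening databases) experience slow response times.
    * Semaphore timeouts are occurring. The "show stat" command reports semaphore timeout information. The following is an example of semaphore timeouts recorded in Statrep.nsf: Sem.Timeouts = 430D: 58 0A13:42 030B:28 0116:26 0A12:21. In this example, 430D is the semaphore name, and 58 is the number of timeouts. Note that semaphore timeouts do not always indicate a performance problem; they are common on a busy server. The statistic Sem.timeouts does not appear in Statrep.nsf if the server has not experienced any semaphore timeouts.
    * Performance-related error messages are reported, such as:
      Insufficient memory.
      Insufficient memory. NSF Folder Pool is full.
      Maximum number of memory segments that Notes can support has been exceeded.
      Network operation did not complete in a reasonable amount of time.
      Server not responding.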
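
The Sem.Timeouts value packs several semaphore/count pairs into one string. The following is a minimal Python sketch for splitting it apart, assuming every entry follows the ID:count shape of the sample above. The function name parse_sem_timeouts is illustrative and not part of Domino.

```python
import re


def parse_sem_timeouts(value: str) -> dict[str, int]:
    """Split a Sem.Timeouts statistic into {semaphore_id: timeout_count}.

    Assumes each entry is a four-digit hexadecimal semaphore ID, a colon,
    optional spaces, and a decimal count, as in the Statrep.nsf sample.
    """
    pairs = re.findall(r"([0-9A-Fa-f]{4}):\s*(\d+)", value)
    return {sem_id.upper(): int(count) for sem_id, count in pairs}


sample = "430D: 58 0A13:42 030B:28 0116:26 0A12:21"
timeouts = parse_sem_timeouts(sample)

# List semaphores from most to fewest timeouts.
for sem_id, count in sorted(timeouts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{sem_id}: {count}")
```

Sorting by count makes it easier to see which semaphore dominates, though, as noted above, timeouts on a busy server are not necessarily a problem.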

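Note that in a server hang situation, an NSD/RIP is never generated automatically.

Causes of server hangs include resource problems (insufficient resources), third-party application conflicts, and hardware problems. In general, server hangs are more difficult to analyze than server crashes. One final note: crashes and hangs occur not only on the Domino server but also on the Notes client.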

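Troubleshooting

In this section, we examine some general approaches to troubleshooting server crashes and server hangs.

Troubleshooting Domino server crashes

If Domino has crashed and cannot restart, remove tasks from the Notes.ini variable ServerTasks to narrow down and identify the task causing the crash. When you suspect a particular task, open the server console and narrow down the error messages that task may be generating. For example, if the router crashed while accessing mail in Mail.box, rename Mail.box and let the server recreate it.

If you suspect the problem is caused by a corrupted database, run offline maintenance tasks on that database. If the crash occurs on a regular schedule, review the actions performed on the server at the time of the crash.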

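Consider the following questions:

    * Is the Domino server reporting error messages to the console or the log file?
    * What is the exact wording of the error message?
    * Where is the error message being generated, on the Domino server or on the Notes client?
    * When did the problem first appear?
    * Were any changes made shortly before the problem started appearing?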

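Troubleshooting Notes client crashes

First, find out whether the problem is specific to a single user. If so, check that user's configuration and compare it with the configurations of other users. Also determine whether the problem occurs when a specific application is accessed. If so, review the application with a developer.

If you suspect the problem is caused by a corrupted database or document, run the maintenance tasks Updall, Fixup, and Compact (with appropriate switches). If you think the problem is due to a bad index, also try to recreate the database's full-text index, if possible.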

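Troubleshooting Domino server hangs

If semaphore problems appear constantly on the server console, check whether task schedules conflict. If the system is responding slowly, check whether your non-Domino applications are also running slowly. As a general rule, also make sure your operating system is updated with all the latest patches.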

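NSD analysis
Determining which process crashed the server is often the first step in resolving a server crash. In Domino 6 and later, the NSD file is a good place to start. NSD gives you all current information about the state of the server (call stacks for all threads, memory information, and so on). When a crash occurs, the Domino server automatically generates an NSD log file and stores it in the data\IBM_TECHNICAL_SUPPORT directory. The file name of an NSD log carries a time stamp showing when the NSD was generated. For example, [email protected]_17_18.log indicates that this NSD was created on January 17, 2006. When NSD runs, it attaches to each process and thread to dump the call stacks. This can help you determine the cause of a server or workstation crash.

The heart of an NSD file is the stack trace section. This section provides a breakdown of the code path that each thread in a currently existing process traversed to reach its current state. This is very helpful when examining hang or crash situations on a server. By examining the NSD file, you can also find any core files generated in the Domino data directory and do a basic analysis to trace the final stack of calls made by the process that died and left the core file behind. In a product as complex as Domino, stack traces of the same type of action on two different servers can produce different results.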

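In the NSD file, you can identify the executable in the failing process by searching for the words "fatal," "panic," or "segmentation." Having found the process, you can see what preceded it and, with luck, determine how the crash occurred. When neither "panic" nor "fatal" is found, a core dump sometimes contains a reference to a "segmentation fault" in a function. This indicates that the process tried to access a shared memory segment that was corrupted for some reason, and that it crashed without calling "fatal_error" or "panic."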
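
The keyword search described above can be sketched in Python as follows. This is only an illustration: the function name find_crash_markers is made up, and the first line of the short excerpt is a hypothetical placeholder, while the other two lines are taken from the sample in this article.

```python
import re

# Keywords the article suggests searching for in an NSD file.
KEYWORDS = re.compile(r"fatal|panic|segmentation", re.IGNORECASE)


def find_crash_markers(nsd_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every line mentioning a crash keyword."""
    hits = []
    for number, line in enumerate(nsd_text.splitlines(), start=1):
        if KEYWORDS.search(line):
            hits.append((number, line.strip()))
    return hits


# Short excerpt for demonstration; a real NSD log is far longer.
excerpt = """thread 12: idle
### FATAL THREAD 39/83 [ nSERVER:07c0: 2764]
Exception code: c0000005 (ACCESS_VIOLATION)"""

for number, line in find_crash_markers(excerpt):
    print(number, line)
```

Keeping the line numbers makes it easy to jump to the surrounding stack frames in the full log.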

The following is a sample excerpt from an NSD file in which a server process is involved in a crash:

### FATAL THREAD 39/83 [ nSERVER:07c0: 2764]
### FP=0743f548, PC=60197cf3, SP=0743ebd0, stksize=2424
Exception code: c0000005 (ACCESS_VIOLATION)
############################################################
@[ 1] 0x60197cf3 [email protected]+483 (7430016,496dae76,0,496dace8)
@[ 2] 0x600018a4 [email protected]+148 (1153f38,2000000,743f608,1)
@[ 3] 0x6000bd92 [email protected]+610 (0,743fc74,f,0)
@[ 4] 0x600626cc [email protected]+2860 (4c5440e8,4cfb8dba,800f,1)
@[ 5] 0x600b9f6f [email protected]+351 (0,4cfb8dba,800f,1)
@[ 6] 0x10032d40 [email protected]+1424 (0,8d0c0035,4b64b5bc,4ae46dd6)
@[ 7] 0x100191fc [email protected]+2284 (41b0383,cb740064,0,23696f8)
@[ 8] 0x1002b8c8 [email protected]+1576 (4711d68,0,3,563fb10)
@[ 9] 0x100016cb [email protected]+763 (0,563fb10,0,10ec334)
@ 0x6011e5e4 [email protected]+212 (0,10ec334,563fb10,0)
0x77e887dd KERNEL32.GetModuleFileNameA+465

Once the failing process has been identified, you can focus on troubleshooting that particular process.
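
As a rough illustration of pulling the failing process out of such an excerpt, the Python sketch below reads the FATAL THREAD header and the exception line. The field names thread_index and thread_total are assumptions (the header presumably means thread 39 of 83), the two trailing header values are returned uninterpreted because the article does not explain them, and the header layout is assumed to match the sample above.

```python
import re
from typing import Optional

HEADER = re.compile(
    r"### FATAL THREAD\s+(\d+)/(\d+)\s+\[\s*(\w+):\s*([0-9A-Fa-f]+):\s*(\d+)\s*\]"
)
EXCEPTION = re.compile(r"Exception code:\s*([0-9A-Fa-f]+)\s*\((\w+)\)")


def summarize_fatal_thread(nsd_text: str) -> Optional[dict[str, str]]:
    """Extract the process name and exception from the first FATAL THREAD block."""
    header = HEADER.search(nsd_text)
    if header is None:
        return None
    summary = {
        "thread_index": header.group(1),   # assumed meaning of "39"
        "thread_total": header.group(2),   # assumed meaning of "83"
        "process": header.group(3),
        "raw_ids": f"{header.group(4)}:{header.group(5)}",  # left uninterpreted
    }
    # Look for the exception line only after the header that was found.
    exception = EXCEPTION.search(nsd_text, header.end())
    if exception:
        summary["exception_code"] = exception.group(1)
        summary["exception_name"] = exception.group(2)
    return summary


sample = """### FATAL THREAD 39/83 [ nSERVER:07c0: 2764]
### FP=0743f548, PC=60197cf3, SP=0743ebd0, stksize=2424
Exception code: c0000005 (ACCESS_VIOLATION)"""

print(summarize_fatal_thread(sample))
```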

ServerTasks

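If a server is crashing continuously (for example, every five minutes), a useful troubleshooting step is to temporarily remove the ServerTasks= line from the server's Notes.ini file. The server can then be restarted and the tasks loaded one at a time to determine which process is causing the crash.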
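
The procedure above can be sketched in Python. The sketch works on the text of a Notes.ini file rather than on the file itself (in practice, keep a backup before editing), and the sample settings are hypothetical. The function name split_server_tasks is illustrative.

```python
def split_server_tasks(ini_text: str) -> tuple[str, list[str]]:
    """Remove the ServerTasks= line and return (new_ini_text, task_names)."""
    kept_lines = []
    tasks: list[str] = []
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "servertasks":
            # ServerTasks holds a comma-separated list of task names.
            tasks = [name.strip() for name in value.split(",") if name.strip()]
        else:
            kept_lines.append(line)
    return "\n".join(kept_lines), tasks


# Hypothetical Notes.ini fragment; real files hold many more settings.
ini = "KeyFilename=server.id\nServerTasks=Update,Replica,Router,AMgr,HTTP\nTimezone=5"
stripped_ini, tasks = split_server_tasks(ini)

for task in tasks:
    # "load <task>" is the console command for starting one task by hand.
    print(f"load {task}")
```

Loading the printed commands one by one at the console, and watching for the crash after each, narrows the problem to a single task.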

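Panic messages
When Domino detects an internal consistency error, or a condition that may lead to data corruption or some other problem, it immediately calls a subroutine named Panic. This is a special construct used to continually monitor critical parts of the code as it runs. It helps catch problems as early as possible, before they escalate and possibly destroy data. When a panic takes place, it brings the system to a stop (and can therefore be considered a controlled crash). Panics generate messages, sometimes in English and sometimes as a code (for example, PANIC: 04:3C). You can give this code to Lotus Software Technical Support for further troubleshooting.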

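Troubleshooting tools

This section reviews some of the troubleshooting tools available to you when you encounter a Domino server crash or hang. Before using any of these tools, be sure to consult the Domino administration documentation. The Domino self-help support page is also a good resource for troubleshooting information.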

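RIP (Domino R5)

A RIP file is generated when a server crashes. The file contains information about what the server was doing when it crashed. It reports any crash on the system, not just those related to Domino. RIP files are generated only in Domino 5.x. In Domino 6 and later, NSD replaces RIP and adds capabilities that RIP did not have.

For a RIP file to be generated, QNC.EXE needs to be loaded on the Domino server. The QNC.EXE program (often called "quincy") is the default debugger that ships with Domino. It is usually located in the \Domino directory. To enable QNC.EXE, type "qnc -I" at the operating system's command prompt. You can also start QNC.EXE by typing "qnc nserver" at server launch. If RIP files are not generated when the server crashes, check whether QNC.EXE is enabled. Normally, RIP files are created in the data directory.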

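NSD (Domino 6 and later)

As mentioned previously, Domino 6 and later provide the NSD feature. The NSD file contains information about the state of the server at the time of a crash. For more information, see the section "NSD analysis" earlier in this article.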

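Memory dump (Domino 6 and later)

In Domino 6 and later, you can use the command "sh memory dump" on the server console to create a memory dump file. A memory dump contains information about the memory Domino is currently using. This is very useful when troubleshooting performance problems and memory leaks. Normally, memory dump files are stored in the data\IBM_TECHNICAL_SUPPORT directory. A memory dump file name includes a time stamp showing when the dump was generated. For example:

memory_ [email protected]_50_08.dmp

Note: To record the available memory to a file instead of viewing it on the server console, enter the following server console command: sh memory dump >memory.txt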

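HTTP request logs

To troubleshoot issues related to Domino Web server crashes and hangs, Lotus Software Technical Support will often ask you to create an HTTP request log. To enable the default settings for request logs, edit the server's Notes.ini file and add the line HTTPEnableThreadDebug=1. This sets HTTP request logging to the default level. (To set the logging level to record more detail, see the Domino administration documentation.) You can also enable HTTP request logging dynamically by entering "tell http debug thread on | off" at the Domino server console. With HTTP request logging enabled, Domino creates a series of files named htthr*.log, for example [email protected]

HTTP request logging should be used only for troubleshooting specific issues, usually at the direction of and with assistance from Lotus Software Technical Support. Do not use request logging for other purposes, such as general administration. These log files grow over time, so do not leave the setting enabled for long periods, or it may consume all available disk space.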
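
Because these logs keep growing, it can help to check how much space they occupy while logging is enabled. The Python sketch below totals the size of htthr*.log files in a directory. The directory is a parameter because this article does not say where the files are written, and the function name request_log_usage is illustrative.

```python
from pathlib import Path


def request_log_usage(directory: str) -> tuple[int, int]:
    """Return (file_count, total_bytes) for htthr*.log files in a directory."""
    sizes = [
        path.stat().st_size
        for path in Path(directory).glob("htthr*.log")
        if path.is_file()
    ]
    return len(sizes), sum(sizes)


# Example: inspect the current directory (substitute the server's log location).
count, total = request_log_usage(".")
print(f"{count} request log file(s), {total} bytes in total")
```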

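Automatic Data Collection

Notes/Domino 6.0.1 introduced the automatic diagnostic data collection tool, also known as Automatic Data Collection, or ADC for short. Automatic Data Collection simply means that when a Notes client or Domino server crashes, the program gathers all the data needed to debug the crash and sends it to a mail-in database when the client or server restarts. Administrators then have one location per domain in which they can see all the crashes that have occurred on all clients and servers. This helps eliminate cases in which an administrator or user fails to capture the proper data after a client or server crash.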

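Notes.ini settings

To troubleshoot performance and crash issues, you can enable the following Notes.ini debugging parameters:

    * Debug_threadid=1 logs the process and thread ID for each server operation.
    * Debug_show_timeout=1 turns on semaphore timeout messages to the console, and creates a semaphore text file named semdebug.txt.
    * Debug_capture_timeout=10 time stamps each semaphore timeout message.
    * CONSOLE_LOG_ENABLED=1 (Domino 6 and later) enables Domino console logging.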
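
As a small illustration, the Python sketch below merges the four settings listed above into the text of a Notes.ini file, replacing a line if the key already exists and appending it otherwise. It assumes key names can be compared case-insensitively. The helper name apply_settings and the sample input are hypothetical.

```python
# The four debugging parameters listed above.
DEBUG_SETTINGS = {
    "Debug_threadid": "1",
    "Debug_show_timeout": "1",
    "Debug_capture_timeout": "10",
    "CONSOLE_LOG_ENABLED": "1",
}


def apply_settings(ini_text: str, settings: dict[str, str]) -> str:
    """Set each key in ini_text, replacing an existing line or appending one."""
    pending = {key.lower(): (key, value) for key, value in settings.items()}
    result = []
    for line in ini_text.splitlines():
        name = line.partition("=")[0].strip().lower()
        if "=" in line and name in pending:
            key, value = pending.pop(name)
            result.append(f"{key}={value}")
        else:
            result.append(line)
    # Keys that were not already present are appended at the end.
    result.extend(f"{key}={value}" for key, value in pending.values())
    return "\n".join(result)


print(apply_settings("ServerTasks=Router,HTTP\nDebug_threadid=0", DEBUG_SETTINGS))
```

Remember to remove or disable the debugging parameters again once the investigation is finished.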


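Fault recovery for server crashes

You can set up fault recovery to handle Domino server crashes automatically. When the server crashes, it shuts itself down and then restarts, without any administrator intervention. Domino records crash information in the data directory. When the server restarts, Domino checks whether it is restarting after a crash. If so, an email is automatically sent to the person or group in the "Mail Fault Notification to" field.

A fatal error (such as an operating system exception or an internal panic) terminates each Domino process and releases all associated resources. The startup script detects the situation and restarts the server. If you are using multiple server partitions and a failure occurs in a single partition, only that partition is terminated and restarted.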


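New troubleshooting features in Domino 7

This section briefly discusses some new Domino 7 features that can help you analyze and correct server hangs and crashes.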

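Domino Domain Monitoring

One of the most significant and useful server maintenance and troubleshooting features in Domino 7 is Domino Domain Monitoring (DDM). It provides one central location for monitoring all the servers in a domain (or in multiple domains). DDM uses programs called probes to gather information from individual servers and report it back to a special database (DDM.nsf), where you can view the collected data. This lets you monitor, analyze, and troubleshoot a large number of servers from a single Domino Administrator console.

Activity Trends

The Activity Trends feature analyzes "historical" server data to help spot trends that can only be identified over an extended period of time. You can review this data to help predict and avoid future problems. The data is collected from the log file (Log.nsf) and the Catalog task, and stored in the Activity Trends database (Activity.nsf). The Activity Trends Collector task processes this data and produces "trended" data that you can use for charting and resource balancing.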

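Writing status bar history to a log file

You can now have the Notes client log status bar messages to the local log file (Log.nsf) or to an external file that you designate. This can help you troubleshoot Notes client crashes. Use the Notes.ini setting logstatusbar=1 to log status bar messages to Log.nsf. To view the logged messages, open Log.nsf and click the Miscellaneous Events view. Status bar messages are followed by Status Msg. To write the status bar messages to an external file, use the Notes.ini setting Debug_Outfile=<path to file> together with the Notes.ini setting logstatusbar=1. For example:
logstatusbar=1
Debug_Outfile=c:\temp\StatusBarLogging.txt

This logs status bar messages to the file StatusBarLogging.txt.

The Log.nsf file can also provide a snapshot of the actions logged in the status bar before the Notes client crashed.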

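Fault Analyzer

Fault Analyzer is a new server feature that processes all new crashes as they are delivered to the Automatic Data Collection mail-in database. The Fault Analyzer task searches the database configured for Fault Report documents and determines whether the stack matches a crash that a user or server has already seen. It extends the Automatic Data Collection feature by evaluating the call stacks in the Fault Report mail-in database to determine whether there are other instances of the same problem.

Fault Analyzer is configured at the same time that you set up Automatic Data Collection (see figure 1). Use the Server Configuration document to set up Automatic Data Collection on the server and to enable or disable Fault Analyzer.

Figure 1. Configuring Fault Analyzer
http://www-128.ibm.com/developerworks/lotus/library/domino-server-crashes/fig1.jpg

If Fault Analyzer finds duplicate fault reports, the new crash is reported as a response to the original crash, and the attachments are either removed from the response document to save space in the database or saved with the response document.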
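
The article does not describe how Fault Analyzer compares stacks, so the following Python sketch only illustrates the general idea of spotting duplicate crashes: reduce each call stack to the names of its top frames, ignoring addresses and arguments, and group reports with the same signature. The frame layout is assumed to resemble the NSD excerpt shown earlier, and all report IDs, module names, and function names are hypothetical.

```python
import re

# One frame: an address, then "module.function+offset"; arguments are ignored.
FRAME = re.compile(r"0x[0-9a-fA-F]+\s+([\w.@]+)\+\d+")


def stack_signature(stack_text: str, depth: int = 5) -> tuple[str, ...]:
    """Return the symbol names of the top `depth` frames of a call stack."""
    return tuple(FRAME.findall(stack_text)[:depth])


def group_by_signature(reports: dict[str, str]) -> dict[tuple[str, ...], list[str]]:
    """Group report IDs whose stacks share the same signature."""
    groups: dict[tuple[str, ...], list[str]] = {}
    for report_id, stack in reports.items():
        groups.setdefault(stack_signature(stack), []).append(report_id)
    return groups


# Hypothetical fault reports; addresses and arguments differ between crashes.
reports = {
    "crash-001": "@[ 1] 0x60197cf3 modA.funcX+483 (1,2)\n@[ 2] 0x600018a4 modA.funcY+148 (3,4)",
    "crash-002": "@[ 1] 0x70197cf3 modA.funcX+483 (9,9)\n@[ 2] 0x700018a4 modA.funcY+148 (0,0)",
    "crash-003": "@[ 1] 0x60197cf3 modB.funcZ+12 (1,2)",
}

for signature, ids in group_by_signature(reports).items():
    print(" > ".join(signature), "->", ids)
```

Comparing symbol names rather than raw addresses is the design choice that matters here, since load addresses and arguments vary from one crash to the next even when the code path is the same.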

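Automatic Data Collection enhancements

When you use the Automatic Data Collection tool to collect information about a server crash, the server is now first checked to see whether it is running under the Domino Controller; if so, the Controller log is used. If not, the server is checked to see whether console logging is enabled; if so, the console output is used. Finally, if neither the Domino Controller nor console logging is set up, the data is extracted from Log.nsf.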
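
The order of checks described above amounts to a simple fallback chain. The following Python sketch expresses it, with a hypothetical function name and return labels chosen only for illustration.

```python
def console_data_source(runs_under_controller: bool, console_log_enabled: bool) -> str:
    """Pick the source of console data in the order the article describes."""
    if runs_under_controller:
        return "Domino Controller log"
    if console_log_enabled:
        return "console log output"
    # Neither the Controller nor console logging is available.
    return "Log.nsf"


for controller, console_log in [(True, True), (False, True), (False, False)]:
    print(controller, console_log, "->", console_data_source(controller, console_log))
```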

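You can now choose which files (using wildcards) the Automatic Data Collection tool gathers when it runs on a client or server. On the Notes client, this is configured in the Desktop Policy Settings document (see figure 2).

Figure 2. Configuring Automatic Data Collection on a Notes client
http://www-128.ibm.com/developerworks/lotus/library/domino-server-crashes/fig2.jpg

On the Domino server, this is configured in the Server Configuration document (see figure 3).

Figure 3. Configuring Automatic Data Collection on a Domino server
http://www-128.ibm.com/developerworks/lotus/library/domino-server-crashes/fig3.jpg

This lets you collect diagnostic files from other IBM products and from third-party add-ins.

In some cases, the output sent by Automatic Data Collection can be very large. If this becomes a problem, you can configure Automatic Data Collection to limit the size of the NSD attachments it sends and of the console log recorded in the Fault Reports database (see figure 3).

Shutdown Monitor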

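After you issue a command to quit or restart the server, the Domino server can take a long time to actually shut down. To avoid this delay, the Shutdown Monitor task ensures that Domino terminates when requested. If the server does not terminate within a specified time, it is forced to terminate, and an NSD log is generated before termination. The time limit is specified in the Server Shutdown Timeout field in the Automatic Server Restart section of the Server document, on the Basics tab (see figure 4).

Figure 4. Setting Server Shutdown Timeout
http://www-128.ibm.com/developerworks/lotus/library/domino-server-crashes/fig4.jpg

The default Server Shutdown Timeout setting is 5 minutes. You can disable this feature with the Notes.ini setting shutdown_monitor_disabled=1.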
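
The general pattern behind such a shutdown timeout can be sketched in Python. This is not how Shutdown Monitor is implemented; it only illustrates waiting for a process to stop and forcing it once a deadline passes. The callables is_running and force_stop are hypothetical stand-ins for checking the server and for the forced termination (which, in Domino, is preceded by an NSD log).

```python
import time
from typing import Callable


def wait_for_shutdown(
    is_running: Callable[[], bool],
    force_stop: Callable[[], None],
    timeout_seconds: float = 300.0,  # 5 minutes, the default cited above
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the process stopped by itself, False if it was forced."""
    deadline = clock() + timeout_seconds
    while is_running():
        if clock() >= deadline:
            force_stop()
            return False
        sleep(poll_seconds)
    return True


# Demonstration: a pretend process that stops after two polls.
states = iter([True, True, False])
stopped_cleanly = wait_for_shutdown(
    is_running=lambda: next(states),
    force_stop=lambda: print("forcing termination"),
    poll_seconds=0.01,
)
print("stopped cleanly:", stopped_cleanly)
```

Passing the clock and sleep functions as parameters keeps the sketch easy to test without waiting for real time to elapse.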

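Process Monitor (Windows platforms only)

The Process Monitor task monitors the processes that should be running as part of the Domino server environment. (This task runs only on Microsoft Windows platforms; the same functionality is already implemented in Domino for Unix platforms without a separate server task.) If any of these processes is missing, or if a process terminates unexpectedly without completing the usual Domino termination routines, the task causes the server to panic and identifies which process terminated prematurely. The Process Monitor task works together with Nprocmon.exe, which monitors the Nserver.exe process for abnormal termination.

This feature can greatly reduce the number of abnormal termination problems, which are difficult to analyze (because it is often hard to determine which process terminated and caused the server problem). To disable the Process Monitor task, set the variable process_monitor_disabled=1 in the server's Notes.ini file.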

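Conclusion

In this article, we defined the difference between a Domino server hang and a crash, discussed some troubleshooting procedures and tools you can use when analyzing and fixing Notes/Domino problems, and looked at some of the new troubleshooting features introduced in Notes/Domino 7. You can refer back to this article whenever a Notes client or Domino server hangs or crashes, although we hope that does not happen to you often.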




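Resources

Learn

You can read the English original of this article on the developerWorks worldwide site.

The developerWorks Lotus article "New features in Lotus Domino 7.0" gives an overview of all the new server features introduced in Domino 7.

Before using any of the troubleshooting tools mentioned in this article, consult the Domino administration documentation.
The Domino self-help support page is also a good resource for troubleshooting information.

Discuss

Participate in developerWorks blogs and join the developerWorks community.
Or visit the Lotus forum of the www.chinaunix.net community.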

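About the author

Kiran Bellari joined IBM in 2004. Before joining the technical support team for Domino server crashes and performance, Kiran first worked on the Domino Document Manager team. Kiran is a dual CLP and holds a bachelor's degree from the National Institute of Technology.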

[ Last edited by plumlee on 2006-4-14 09:44 ]

English original attached for comparison

Troubleshooting Lotus Domino hangs and crashes

Level: Intermediate

Kiran Bellari, Software Engineer, IBM

17 Jan 2006

    Quick -- what's the difference between a server hang and a crash? More important, how do you go about fixing them? In this article, we explain how you can identify Lotus Domino server hangs and crashes, and what you can do to analyze and correct them.

       


Lotus Domino is built to be very reliable. But even the best-built products may encounter problems that cause them to hang or crash. When this happens, the quicker you can isolate, analyze, and fix the problem, the quicker your user community will be happily up and running -- and the quicker you can go back to worrying about other things.

This article offers some ideas you can use to fix Notes/Domino problems. We start by defining the differences between a server hang and a server crash, and how you can go about solving examples of each. We conclude with an overview of new troubleshooting features included in Notes/Domino 7, the latest release of the product. We assume you're an experienced Domino administrator, and are familiar with basic Notes/Domino concepts and terminology.

What are server hangs and crashes?

Before we get into the technical details, let's define two commonly used terms, crash and hang, to ensure we're all on the same page.

Server crash

A Domino server crash is a situation where the server program has terminated and it is no longer running. You can often determine the task that the server was performing when it terminated by looking at the crash screen, or from the NSD/RIP log file (depending on which release of Domino you are running).

Common symptoms of a Domino server crash include:

    * The Domino server is no longer running, but other programs on the system are still running.
    * The Domino server console does not appear, even when tasks appear to be loaded.
    * The Domino server loaded and abruptly came down without doing anything.
    * A panic error appears on the console or in Log.nsf, and the system comes down.
    * NSD/RIP automatically runs and generates a file, and the server comes down and/or restarts by itself.

There are several different types of server crashes. For example, a one-time crash, as the name implies, may occur once and never appear again. A one-time crash may be caused by bad memory or a corrupted document accessed by a process that resulted in Domino crashing. For example, suppose a document deposited in Mail.box is corrupted. When the Domino router accesses Mail.box to route the document to its destination, this produces a Domino server crash. A similar situation may or may not occur in the future. In general, one-time crashes are the most difficult to analyze.

A reproducible crash is one that can be repeated by following a sequence of steps. One example is a form that includes a badly coded button that always results in a crash when pressed.

Repetitive crashes occur on a particular schedule. They don't seem to be associated with any specific actions; instead, they may happen at the same time every day. In such situations, you need to identify exactly what is getting executed on the server at that time that may be causing the problem. For instance, imagine that a Domino server has a scheduled agent enabled that runs every month. This agent may be producing the server crash. In such scenarios, you need to first disable the agent creating the problem and then review why the agent is causing the problem (and fix it).

An ABEND is a special form of server crash. The term ABEND is a combination of the words "abnormal end." ABEND crashes do not produce RIP or NSD files.

Causes of crashes include:

    * A software problem in the code (either on the server or on the client).
    * Corruption in a database.
    * A software problem in a third-party application accessing Domino.
    * Insufficient memory.
    * Restricted operations caused by customized code.
    * A memory leak.
    * An incomplete request.

Server hang

A Domino server hang is a situation where the Domino server is still running, but one or more tasks on the server are not responding to requests. These tasks may still be active, but they are not doing what they are supposed to do. The term "hang" also defines a state that sometimes occurs when computer programs do not run as designed. Most of the time, a hang occurs due to a low-level loop or a permanent unavailability of a resource, causing serious performance issues. (Server hangs are most commonly attributed to resource issues, so they are sometimes considered performance problems.)

During a hang, the program seems to be paralyzed, no error messages are displayed, and the screen freezes or the application does not respond to users' actions. Keyboard input or mouse clicking has no effect, regardless of where the cursor is placed, but the program is still running. Unlike an ABEND or crash, sometimes a hang will resolve itself, and the application resumes its normal execution without your involvement. Such a case might be considered more of a performance issue than a hang.

Symptoms of a Domino server hang include:

    * Domino is still running, but is not responsive to clients. In this case, users often report that they are receiving "Server not responding" messages.
    * The console behaves as if it is disconnected and won't accept any commands, not even a simple command such as quit.
    * Clients accessing the server (for example, opening databases) are experiencing slow response times.
    * Semaphore timeouts are occurring. The 'show stat' command will record semaphore timeout information. The following is an example of semaphore timeouts recorded in Statrep.nsf: Sem.Timeouts = 430D: 58 0A13:42 030B:28 0116:26 0A12:21. In this example, 430D is the semaphore name, and 58 is the number of timeouts. Note that semaphore timeouts do not always indicate a performance problem. It is common for semaphore timeouts to occur on a busy server. The statistic Sem.timeouts will not appear in Statrep.nsf if the server has not experienced any semaphore timeouts.
    * Performance-related error messages are reported, such as:
      Insufficient memory.
      Insufficient memory. NSF Folder Pool is full.
      Maximum number of memory segments that Notes can support has been exceeded.
      Network operation did not complete in a reasonable amount of time.
      Server not responding.

Note that in a server hang situation, an NSD/RIP is never generated automatically.

Causes of server hangs include resource problems (insufficient resources), third-party application conflicts, and hardware problems. In general, server hangs are more difficult to analyze than server crashes. One final note: crashes and hangs not only occur on the Domino server, they can also happen on the Notes client.

Troubleshooting

In this section, we examine some general approaches to troubleshooting server crashes and server hangs.

Troubleshooting Domino server crashes

If Domino has crashed and is not able to restart, remove tasks from the Notes.ini variable ServerTasks and attempt to narrow down and identify the task causing the crash. When you suspect a particular task is causing the problem, open the server console and narrow down the possible error messages generated by the task. For example, if the router crashed while accessing mail in Mail.box, rename Mail.box and allow the server to recreate Mail.box.

If you suspect the problem is caused by a corrupted database, run offline maintenance tasks on this database. If the crash is occurring on a scheduled basis, review the actions performed on the server at the time of the crash.

Consider the following questions:

    * Is the Domino server reporting error messages to the console or the log file?
    * What is the exact syntax of the error message?
    * Where is the error message being generated, on the Domino server or on the Notes client?
    * When did this problem first appear?
    * Did you implement any recent changes before the problem started appearing?

Troubleshooting Notes client crashes

First, find out whether or not the problem is specific to a single user. If so, check the configuration of that user and compare it to configurations for other users. Also, determine whether or not the problem happens due to a specific application being accessed. If so, review the application with a developer.

If you suspect the problem is caused by a corrupted database or document, run the maintenance tasks Updall, Fixup, and Compact (with appropriate switches). Also, try to recreate the database's full-text index, if possible, if you think the problem is due to a bad index.

Troubleshooting Domino server hangs

If constant semaphore problems appear on the server console, check whether or not the tasks' schedule is conflicting. If the system is responding slowly, check your non-Domino applications to see whether or not they are also performing slowly. Additionally, as a general rule, make sure your operating system is updated with all the latest patches.

NSD analysis
Determining the process that crashed the server is often the first step in resolving a server crash. In Domino 6 and later, the NSD file can be a good place to start. NSD gives you all current information about the state of the server (call stacks for all threads, memory information, and so on). In the event of a crash, an NSD log file will automatically be generated by the Domino server and stored in the data\IBM_TECHNICAL_SUPPORT directory. An NSD log will have a file name with a time stamp showing the time when the NSD was generated. For example: [email protected]_17_18.log indicates this NSD was created on January 17, 2006. When NSD runs, it attaches to each process and thread to dump the call stacks. This can help you determine the cause of a server or workstation crash.

The "heart" of an NSD file is the stack trace section. This section provides a breakdown of the code path each thread in a currently existing process traversed to put it in its current state. This is very helpful in examining hang or crash situations on a server. Also, by examining the NSD file, you can find any core files generated in a Domino data directory, and can do a base-level analysis to trace the final stack of calls that were made by the process that died and left behind the core. In a complex product such as Domino, a stack trace of the same type of action on two different servers can produce different results.

In the NSD file, you can identify the executable in the failing process by performing a word search for "fatal," "panic," or "segmentation." By finding the process, we can see what preceded it, and hopefully determine how the crash occurred. When neither "panic" nor "fatal" is found, sometimes a core dump will contain a reference to a "segmentation fault" in a function. This indicates that the process tried to access a shared memory segment that was corrupted for some reason, and that it crashed without calling "fatal_error" or "panic."

The following is a sample excerpt from an NSD file where a server process is involved in a crash:

### FATAL THREAD 39/83 [ nSERVER:07c0: 2764]
### FP=0743f548, PC=60197cf3, SP=0743ebd0, stksize=2424
Exception code: c0000005 (ACCESS_VIOLATION)
############################################################
@[ 1] 0x60197cf3 [email protected]+483 (7430016,496dae76,0,496dace8)
@[ 2] 0x600018a4 [email protected]+148 (1153f38,2000000,743f608,1)
@[ 3] 0x6000bd92 [email protected]+610 (0,743fc74,f,0)
@[ 4] 0x600626cc [email protected]+2860 (4c5440e8,4cfb8dba,800f,1)
@[ 5] 0x600b9f6f [email protected]+351 (0,4cfb8dba,800f,1)
@[ 6] 0x10032d40 [email protected]+1424 (0,8d0c0035,4b64b5bc,4ae46dd6)
@[ 7] 0x100191fc [email protected]+2284 (41b0383,cb740064,0,23696f8)
@[ 8] 0x1002b8c8 [email protected]+1576 (4711d68,0,3,563fb10)
@[ 9] 0x100016cb [email protected]+763 (0,563fb10,0,10ec334)
@ 0x6011e5e4 [email protected]+212 (0,10ec334,563fb10,0)
0x77e887dd KERNEL32.GetModuleFileNameA+465

When the failing process has been determined, you can focus on troubleshooting that particular process.

ServerTasks
If a server is crashing continuously (for example, every five minutes), a useful troubleshooting step is to temporarily remove the ServerTasks= line from the server's Notes.ini file. The server can then be restarted and tasks can be loaded individually to determine which process is causing the crash.

Panic messages
When Domino detects an internal consistency error, or a condition that may lead to corruption of data or some other problem, it immediately calls a subroutine called Panic. This is a special construct used to continually monitor critical parts of the code as it operates. This helps catch problems as early as possible, before they escalate and possibly destroy data. When a panic takes place, it brings the system to a stop (and thus can be considered a controlled crash). Panics generate messages, sometimes in English and sometimes in code (for example: PANIC: 04:3C). You can give this code to Lotus Software Technical Support for further troubleshooting.

Troubleshooting tools

This section reviews some of the troubleshooting tools available to you when you encounter a Domino server crash or hang. Before using any of these tools, be sure to consult the Domino administration documentation. Also, the Domino self-help support page is a good resource for troubleshooting information.

RIP (Domino R5)

A RIP file is generated when a server crashes. This file contains information about what the server was doing when it crashed. It reports any crash on the system, not just ones related to Domino. RIP files are generated only in Domino 5.x. In Domino 6 and later, NSD serves the purpose formerly performed by RIP, and also includes additional capabilities not included in RIP.

For a RIP file to be generated, QNC.EXE needs to be loaded on the Domino server. The QNC.EXE program (often called "quincy") is the default debugger program that ships with Domino. The QNC.EXE program is usually located in the \Domino directory. To enable QNC.EXE, type "qnc -I" at the operating system's command prompt. You can also enable QNC.EXE by typing "qnc nserver" at server launch. If RIP files are not generated when the server crashes, check whether QNC.EXE is enabled. Normally, RIP files get created in the data directory.

NSD (Domino 6 and later)

As mentioned previously, Domino 6 and later provides the NSD feature. This is a file that contains information about the state of the server at the time of a crash. For more information, see the section, "NSD analysis," earlier in this article.

Memory dump (Domino 6 and later)

In Domino 6 and later, you can use the command "sh memory dump" on the server console to create a memory dump file. A memory dump contains information on memory currently used by Domino. This is very useful when troubleshooting performance problems and memory leaks. Normally, memory dump files get collected in the data\IBM_TECHNICAL_SUPPORT directory. A memory dump file name includes a time stamp for the time when the dump was generated. For example:

memory_ [email protected]_50_08.dmp

Note: To record the available memory to a file instead of viewing it on the server console, enter the following server console command: sh memory dump >memory.txt

HTTP request logs

To troubleshoot issues related to Domino Web server crashes and hangs, Lotus Software Technical Support will often ask you to create an HTTP request log. To enable the default settings for request logs, edit the server's Notes.ini file and add the line HTTPEnableThreadDebug=1. This sets HTTP request logging at the default level. (To set the logging level to record more details, see the Domino administration documentation.) You can also enable HTTP request logging dynamically by entering "tell http debug thread on | off" at the Domino server console. With HTTP request logging enabled, Domino creates a series of files with the name htthr*.log. For example: [email protected]

HTTP request logging should be used only for troubleshooting specific issues, and usually at the direction of and with assistance from Lotus Software Technical Support. Do not use request logging for other purposes, such as general administration. These log files grow in size over time, so you should not leave this setting enabled for long periods or you could consume all available drive space.

Automatic Data Collection

Notes/Domino 6.0.1 introduced the automatic diagnostic data collection tool, also known as Automatic Data Collection, or ADC for short. Automatic Data Collection simply means that, when a Notes client or Domino server crashes, the program gathers all the necessary data to debug the crash and sends it to a mail-in database when the client or server restarts. Administrators then have one location per domain in which they can see all the crashes that have occurred for all clients and servers. This will help eliminate the instances where an administrator or user may not be able to capture the proper data on a client or server crash.

Notes.ini settings

To troubleshoot performance and crash issues, you can enable the following Notes.ini debugging parameters:

    * Debug_threadid=1 logs each process and thread ID for each server operation.
    * Debug_show_timeout=1 turns on semaphore timeout messages to the console, and creates a semaphore text file called semdebug.txt.
    * Debug_capture_timeout=10 time stamps each semaphore timeout message.
    * CONSOLE_LOG_ENABLED=1 (Domino 6 and later) enables Domino console logging.

Fault recovery for server crashes

You can set up fault recovery to automatically handle Domino server crashes. When the server crashes, it shuts itself down and then restarts automatically, without any administrator intervention. Domino records crash information in the data directory. When the server restarts, Domino checks to see if it is restarting after a crash. If it is, an email is automatically sent to the person or group in the "Mail Fault Notification to" field.

A fatal error (such as an operating system exception or an internal panic) terminates each Domino process and releases all associated resources. The startup script detects the situation and restarts the server. If you are using multiple server partitions and a failure occurs in a single partition, only that partition is terminated and restarted.

New troubleshooting features in Domino 7

This section briefly discusses some new Domino 7 features that can help you analyze and correct server hangs and crashes.

Domino Domain Monitoring

One of the most significant and useful server maintenance and troubleshooting features in Domino 7 is Domino Domain Monitoring (DDM). This provides one central location for monitoring all the servers in a domain (or multiple domains). DDM uses programs called probes to gather server information from the individual servers, and then report back to a special database (DDM.nsf) where you can view the collected data. This allows you to monitor, analyze, and troubleshoot a large number of servers from a single Domino Administrator console.

Activity Trends

The Activity Trends feature lets you analyze "historical" server data, to help spot trends that can only be identified over an extended period of time. You can review this data to help predict and avoid future issues. This data is collected from the log file (Log.nsf) and the Catalog task, and stored in the Activity Trends database (Activity.nsf). The Activity Trends Collector task processes this data, and produces "trended" data that you can use for charting and resource balancing.

Writing status bar history to a log file

You can now enable Notes client logging of status bar messages to the local log file (Log.nsf) or to an external file that you designate. This can help you troubleshoot Notes client crashes. Use the Notes.ini setting logstatusbar=1 to enable logging of status bar messages to Log.nsf. To view the logged messages, open Log.nsf and then click the Miscellaneous Events view. Status bar messages are appended with Status Msg. To write the status bar messages to an external file, use the Notes.ini setting Debug_Outfile=<path to file> with the Notes.ini setting logstatusbar=1. For example:
logstatusbar=1
Debug_Outfile=c:\temp\StatusBarLogging.txt

This logs status bar messages to the file StatusBarLogging.txt.

The Log.nsf file can also provide a snapshot of actions logged in the status bar before the Notes client crashed.
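If you want to confirm from a script where status bar messages will be written, the two settings above can be read from the text of a Notes.ini file. The following Python sketch assumes the file consists of simple name=value lines and treats setting names as case-insensitive; the function names are hypothetical, and the logic only mirrors the behavior described in this section.

```python
def parse_notes_ini(text):
    """Parse 'name=value' lines into a dict keyed by lower-cased name."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skips blank lines and section headers such as [Notes]
        name, value = line.split("=", 1)
        settings[name.strip().lower()] = value.strip()
    return settings

def status_bar_log_target(settings):
    """Return where status bar messages are logged, or None if disabled."""
    if settings.get("logstatusbar") != "1":
        return None
    # Assumption: with no Debug_Outfile set, messages go to Log.nsf.
    return settings.get("debug_outfile", "Log.nsf")

# Usage with the two settings shown above.
sample = "logstatusbar=1\nDebug_Outfile=c:\\temp\\StatusBarLogging.txt\n"
target = status_bar_log_target(parse_notes_ini(sample))
```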

Fault Analyzer

Fault Analyzer is a new server feature that processes all new crashes as they are delivered to the Automatic Data Collection mail-in database. The Fault Analyzer task searches the database configured for Fault Report documents and determines whether or not the stack matches a crash that has already been seen by a user or server. It adds to the functionality of the Automatic Data Collection feature by analyzing the call stacks that are located in the Fault Report mail-in database, and evaluating them to determine whether or not there are other instances of the same problem.

Fault Analyzer is configured at the same time that you set up Automatic Data Collection (see figure 1). Use the Server Configuration document to set up Automatic Data Collection on the server and to enable or disable Fault Analyzer.

Figure 1. Configuring Fault Analyzer
Configuring Fault Analyzer


If Fault Analyzer locates duplicate fault reports, the new crash is reported as a response to the original crash, and attachments are either removed from the response document to save space in the database, or they are saved with the response document.
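The duplicate-matching idea can be illustrated with a short Python sketch. It is not Fault Analyzer's actual algorithm: the signature rule (the top few frames, compared without regard to case), the report IDs, and the frame names are all hypothetical, chosen only to show how a later crash with the same stack can be filed as a response to the first one.

```python
def stack_signature(stack_frames, depth=5):
    """Build a comparison key from the top frames of a call stack."""
    return tuple(frame.strip().lower() for frame in stack_frames[:depth])

def file_fault_reports(reports):
    """Group reports so later duplicates become responses to the original.

    `reports` is a list of (report_id, stack_frames) in arrival order.
    Returns {original_id: [ids filed as responses]}.
    """
    first_seen = {}  # signature -> id of the original report
    threads = {}     # original id -> ids of duplicate reports
    for report_id, frames in reports:
        sig = stack_signature(frames)
        if sig in first_seen:
            threads[first_seen[sig]].append(report_id)
        else:
            first_seen[sig] = report_id
            threads[report_id] = []
    return threads

# Usage: R3 has the same stack as R1 (apart from letter case).
reports = [
    ("R1", ["panic", "lock_handle", "open_note"]),
    ("R2", ["panic", "route_message", "router_main"]),
    ("R3", ["Panic", "Lock_Handle", "Open_Note"]),
]
threads = file_fault_reports(reports)
```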

Automatic Data Collection enhancements

When you use the Automatic Data Collection tool to gather information about server crashes, the tool now first checks whether the server is running under the Domino Controller and, if so, uses the Controller logs. If not, it checks whether console logging is enabled and, if so, uses the console output. If neither the Domino Controller nor console logging is in use, it extracts the data from Log.nsf.
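The order of precedence can be written as a small decision function. This Python sketch only restates the rule from the paragraph above; the function name and the returned labels are hypothetical.

```python
def choose_log_source(under_controller, console_logging_enabled):
    """Pick the log source in the order described in the article."""
    if under_controller:
        return "controller logs"
    if console_logging_enabled:
        return "console output"
    return "Log.nsf"

# Usage: the Controller takes precedence even when console logging is on.
source = choose_log_source(under_controller=True, console_logging_enabled=True)
```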

Now you can select which files (using wildcards) will be collected by the Automatic Data Collection tool when it runs on clients or servers. On Notes clients, it is configured using a Desktop Policy Settings document (see figure 2).

Figure 2. Configuring Automatic Data Collection on the Notes client

On Domino servers, it is configured using the Server Configuration document (see figure 3).

Figure 3. Configuring Automatic Data Collection on the Domino server

This allows you to collect diagnostic files from other IBM products, as well as third-party add-ins.
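To get a feel for how wildcard selection behaves, you can try patterns against a list of file names with Python's standard `fnmatch` module. This is only an approximation: the matching rules that Automatic Data Collection applies may differ, and the file names and patterns below are hypothetical.

```python
from fnmatch import fnmatch

def select_files(file_names, patterns):
    """Return the names that match at least one wildcard pattern, in order."""
    return [n for n in file_names if any(fnmatch(n, p) for p in patterns)]

# Usage: collect log files plus trace files from a hypothetical add-in.
available = ["nsd_2006_03_27.log", "console.log", "addin_trace.txt", "names.nsf"]
patterns = ["*.log", "addin_*.txt"]
collected = select_files(available, patterns)
```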

There is a possibility that the output sent by Automatic Data Collection could be very large. If this becomes a problem, you can configure Automatic Data Collection to restrict the size of attachments sent by NSD and the console log to the Fault Reports database (see figure 3).

Shutdown Monitor

It often takes a long time for the Domino server to actually shut down after you issue a quit or restart server command. To avoid this delay, the Shutdown Monitor task ensures that Domino terminates when requested to do so. If the server doesn't terminate in the allotted time, the server will forcefully terminate and an NSD log will be generated before termination. The time limit is specified in the Server Shutdown Timeout field of the Automatic Server Restart section of the Server document, on the Basics tab (see figure 4).

Figure 4. Setting the Server Shutdown Timeout

The default Server Shutdown Timeout setting is 5 minutes. This feature can be disabled using the Notes.ini setting shutdown_monitor_disabled=1.
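The timeout behavior can be modeled with a small watchdog loop. The Python sketch below is a simplified model, not the Shutdown Monitor task itself: the function names are hypothetical, the wait function is passed in so that the example runs instantly, and the forced termination and NSD collection are represented only by a return value.

```python
def monitor_shutdown(is_running, timeout_seconds, sleep, poll_seconds=1):
    """Wait for a requested shutdown; return 'clean' or 'forced'.

    is_running: callable that returns True while the server is still up.
    sleep: callable used to wait between polls (injected for testing).
    """
    waited = 0
    while is_running():
        if waited >= timeout_seconds:
            # A real monitor would generate diagnostics (an NSD log)
            # and then terminate the server process at this point.
            return "forced"
        sleep(poll_seconds)
        waited += poll_seconds
    return "clean"

def make_server(polls_until_stopped):
    """Simulated server that reports 'running' for a fixed number of polls."""
    state = {"left": polls_until_stopped}
    def is_running():
        state["left"] -= 1
        return state["left"] >= 0
    return is_running

no_wait = lambda seconds: None  # stand-in for time.sleep in this example

# 300 seconds corresponds to the default timeout of 5 minutes.
quick = monitor_shutdown(make_server(3), 300, no_wait)
stuck = monitor_shutdown(make_server(10**6), 300, no_wait)
```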

Process Monitor (Windows platforms only)

The Process Monitor task monitors the processes that should be running as part of the Domino server environment. (This task runs on Microsoft Windows platforms only; on Unix platforms, Domino implements this functionality without a separate server task.) If any of these processes is missing, or if one terminates unexpectedly without completing the usual Domino termination routines, this task causes the server to panic and identifies which process has terminated prematurely. The Process Monitor task works with Nprocmon.exe, which monitors the Nserver.exe process for abnormal terminations.

This feature can significantly reduce the number of abnormal termination problems, which otherwise are difficult to analyze (because it's often difficult to determine which process has terminated and caused the server problem). To disable the Process Monitor task, set the variable process_monitor_disabled=1 in the server's Notes.ini file.
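The core check, comparing the processes that should be running with those that actually are, can be sketched in a few lines of Python. This illustrates the idea only and is not how the Process Monitor task is implemented; the function name and the process lists are hypothetical examples.

```python
def find_missing_processes(expected, running):
    """Return the expected process names that are not running, sorted."""
    return sorted(set(expected) - set(running))

# Usage: one expected server process has disappeared.
expected = ["nserver.exe", "nrouter.exe", "nupdate.exe"]
running = ["nserver.exe", "nupdate.exe", "explorer.exe"]
missing = find_missing_processes(expected, running)
```

Reporting the missing names directly is what makes this kind of check useful: it answers the question of which process terminated, which is otherwise hard to determine after the fact.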

Conclusion

In this article, we have defined the differences between a Domino server hang and a crash. We have discussed some troubleshooting procedures you can follow and tools you can use when analyzing and fixing Notes/Domino problems. We also looked at new troubleshooting features introduced in Notes/Domino 7. You can consult this article whenever you encounter a hang or crash with the Notes client or Domino server -- which hopefully won't be very often!




Resources
Learn

    * The developerWorks Lotus article, "New features in Lotus Domino 7.0," provides an overview of all the new server features introduced in Domino 7.

    * Before using any of the troubleshooting tools mentioned in this article, consult the Domino administration documentation.

    * Also, the Domino self-help support page is a good resource for troubleshooting information.


Discuss

    * Participate in developerWorks blogs and get involved in the developerWorks community.





About the author

               

Kiran Bellari joined IBM in 2004 and initially worked for the Domino Document Manager team before joining Domino server crash performance technical support. Kiran is a dual CLP, and earned his Bachelor's Degree from the National Institute of Technology.
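Solution

Thanks for the effort. I read through it and learned a lot. My Domino server has recently been failing about every 2 hours, which I suppose counts as a hang. The symptoms: first it reports that the TCP/IP protocol stack is exhausted and asks you to add memory or reduce the number of client connections, and then it reports that system resources are insufficient to complete the requested service. The key point is that any other operation on the machine produces the same message, and only a reboot clears it.
On the hardware side, apart from the dial-up connection, which is hard to rule out, everything else should be fine. The problem is still unsolved, and I hope an expert can offer some guidance.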
Solution

Is any third-party software installed, or were any configuration changes made beforehand?


http://coctec.com/docs/service/show-post-40717.html