歡迎您光臨本站 註冊首頁

heartbeat IPaddr2 monitor timeout

←手機掃碼閱讀     火星人 @ 2014-03-04 , reply:0

heartbeat IPaddr2 monitor timeout

各位同學,

現在我使用了heartbeat 2.1.3 版本於生產系統。

發現了兩次IPaddr2 monitor timeout,  不知是linux系統的原因, 還是heartbeat的bug,
希望見過類似問題的同學, 能否分享一下。


heartbeat  利用 IPaddr2 這個resource agent 監控ip,  我用heartbeat有一年半了,
用於生產系統也有一年左右的時間。 所以相信在配置上應該沒有問題。

現在偶發了兩次很奇怪的問題。IPaddr2 monitor 竟然 timeout,
heartbeat發現這種情況后,  會去restart一下這個resource group,
如何restart呢, 先stop  service resrouce agent
(什麼叫resource agent呢, 例如我們寫一個支持ocf規範的shell腳本, 稱為resource agent, 來管理我們的應用進程。),
再 stop IPaddr2 這個resource agent.
結果呢, stop service resrouce agent 時又超時(20秒的timer啊), 而我那個service resrouce agent 的stop操作非常簡單,
沒有block的地方, 它竟然超時20秒。

這會引起heartbeat對service resrouce agent脫管, 變成所謂 un-managed.
我開始以為是時間漂移, 後來發現不是的。

但是在出問題的這段時間內(38秒), 的確我們有個其他進程在做備份操作, 會佔用大量的I/O,
但是我想也不至於讓操作系統癱瘓的地步。 畢竟這個操作每天這個點都做, 為什麼單單就這一天會影響到heartbeat呢 ?

如果這是heartbeat的bug, 我想網上早就有這樣的案例, 但是沒有搜索到類似的問題, 所以我想應該不是heartbeat的問題。

不知道各位同學在使用時有沒有碰到類似問題。企盼分享一下。



+++++++++++++++++++++++++++++++++++++++++  /var/log/ha-debug  +++++++++++++

iptv_user_agent: 2009/12/19_19:02:08 INFO: user_agent running
iptv_user_agent: 2009/12/19_19:02:10 INFO: user_agent running
iptv_user_agent: 2009/12/19_19:02:16 INFO: user_agent running
lrmd: 2009/12/19_19:02:18 WARN: IPaddr2_1:monitor process (PID 1497) timed out (try 1).  Killing with signal SIGTERM (15).
lrmd: 2009/12/19_19:02:18 WARN: operation monitor on ocf::IPaddr2::IPaddr2_1 for client 4459, its parameters: CRM_meta_interval= ip= cidr_netmask= CRM_meta_id= CRM_meta_timeout= crm_feature_set= iflabel= CRM_meta_name= nic= : pid timed out
crmd: 2009/12/19_19:02:18 ERROR: process_lrm_event: LRM operation IPaddr2_1_monitor_5000 (24) Timed Out (timeout=5000ms)
tengine: 2009/12/19_19:02:19 info: process_graph_event: Detected action IPaddr2_1_monitor_5000 from a different transition: 8 vs. 18
tengine: 2009/12/19_19:02:19 info: update_abort_priority: Abort priority upgraded to 1000000
tengine: 2009/12/19_19:02:19 WARN: update_failcount: Updating failcount for IPaddr2_1 on f7a40064-99bc-4647-aea2-a064cbc762be after failed monitor: rc=-2
crmd: 2009/12/19_19:02:19 info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
crmd: 2009/12/19_19:02:19 info: do_state_transition: All 2 cluster nodes are eligible to run resources.
pengine: 2009/12/19_19:02:20 info: determine_online_status: Node aaa-a is online
pengine: 2009/12/19_19:02:20 info: common_apply_stickiness: Setting failure stickiness for IPaddr2_1 on aaa-a: -100
pengine: 2009/12/19_19:02:20 WARN: unpack_rsc_op: Processing failed op IPaddr2_1_monitor_5000 on aaa-a: Timed Out
pengine: 2009/12/19_19:02:20 info: determine_online_status: Node aaa-b is online
pengine: 2009/12/19_19:02:20 notice: group_print: Resource Group: group_1
pengine: 2009/12/19_19:02:20 notice: native_print:     IPaddr2_1 (heartbeat::ocf:IPaddr2):       Started aaa-a FAILED
pengine: 2009/12/19_19:02:20 notice: native_print:     iptv_user_agent   (heartbeat::ocf:iptv_user_agent):       Started aaa-a
pengine: 2009/12/19_19:02:20 notice: NoRoleChange: Recover resource IPaddr2_1    (aaa-a)
pengine: 2009/12/19_19:02:20 notice: StopRsc:   aaa-a    Stop IPaddr2_1
pengine: 2009/12/19_19:02:20 notice: StartRsc:  aaa-a    Start IPaddr2_1
pengine: 2009/12/19_19:02:20 notice: RecurringOp: aaa-a     IPaddr2_1_monitor_5000
pengine: 2009/12/19_19:02:20 notice: NoRoleChange: Leave resource iptv_user_agent        (aaa-a)
crmd: 2009/12/19_19:02:21 info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
tengine: 2009/12/19_19:02:21 info: unpack_graph: Unpacked transition 19: 11 actions in 11 synapses
tengine: 2009/12/19_19:02:21 info: te_pseudo_action: Pseudo action 13 fired and confirmed
tengine: 2009/12/19_19:02:21 info: send_rsc_command: Initiating action 9: iptv_user_agent_stop_0 on aaa-a
crmd: 2009/12/19_19:02:21 info: do_lrm_rsc_op: Performing op=iptv_user_agent_stop_0 key=9:19:cf4b6798-33c3-4fcb-880c-0b87b803ba39)
iptv_user_agent: 2009/12/19_19:02:23 INFO: user_agent running
crmd: 2009/12/19_19:02:23 info: process_lrm_event: LRM operation iptv_user_agent_monitor_2000 (call=26, rc=-2) Cancelled
lrmd: 2009/12/19_19:02:23 info: rsc:iptv_user_agent: stop
pengine: 2009/12/19_19:02:25 info: process_pe_message: Transition 19: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-79.bz2
pengine: 2009/12/19_19:02:25 ERROR: subsystem_msg_dispatch: pengine took 5570ms to complete
iptv_user_agent: 2009/12/19_19:02:25 INFO: user_agent: enter iptv_stop
iptv_user_agent: 2009/12/19_19:02:25 INFO: generate user_agent alarm
tengine: 2009/12/19_19:02:41 WARN: action_timer_callback: Timer popped (abort_level=0, complete=false)
tengine: 2009/12/19_19:02:41 WARN: print_elem: Action missed its timeout: In-flight (id: iptv_user_agent_stop_0, loc: aaa-a, priority: 0)
lrmd: 2009/12/19_19:02:43 WARN: iptv_user_agent:stop process (PID 1576) timed out (try 1).  Killing with signal SIGTERM (15).
lrmd: 2009/12/19_19:02:43 info: RA output: (iptv_user_agent:stop:stderr)
Session terminated, killing shell...
crmd: 2009/12/19_19:02:43 ERROR: process_lrm_event: LRM operation iptv_user_agent_stop_0 (27) Timed Out (timeout=20000ms)
lrmd: 2009/12/19_19:02:43 WARN: operation stop on ocf::iptv_user_agent::iptv_user_agent for client 4459, its parameters: CRM_meta_timeout= crm_feature_set= : pid timed out
tengine: 2009/12/19_19:02:43 info: status_from_rc: Re-mapping op status to LRM_OP_ERROR for rc=-2
《解決方案》

希望heartbeat的愛好者和有使用經驗的多多發言。

使用heartbeat有一兩年了, 真是愛恨交加。
感覺寫heartbeat的人因該是牛人, 但是好像幾乎沒有人後續維護,
所以有不少小問題。

heartbeat是保護service的, 往往service沒事, 敏感的heartbeat到出了問題, 反倒把service給幹掉了。

不知道使用過heartbeat的有何感受。

分享一下。
《解決方案》

回復 #2 leeedward 的帖子

現在基本確定下來了。

這不是heartbeat自身的問題, 是一個其他的進程在做備份操作,  會佔用大量的memory,
操作系統 過了較長時間才會調度到這個IPaddr2 monitor 進程。
(我做了一個簡單的實驗, 一個最簡單的shell腳本, 被block約10秒才被執行。)

其實當時memory還有很多free, 但是這個備份進程會連續將 200M Bytes 大小的memory sync到disk.
不知OS為什麼會短暫的hicup。

我會將IPaddr2 monitor的timer從5秒調大到約60秒, 來規避這個問題。

特此分享一下。
《解決方案》

我也遇到過time out 問題跟你的很像,不過我這邊是cpu很高,不知道為什麼會很高
《解決方案》

樓主幫我也看看,我這個也遇到個monitor timeout ,不知道是不是與你這個有些類似
lrmd: 2012/08/10_23:00:18 WARN: ipservice:monitor process (PID 6516) timed out (try 1).  Killing with signal SIGTERM (15).
lrmd: 2012/08/10_23:00:18 WARN: operation monitor on lsb::ipresources::ipservice for client 22859, its parameters: CRM_meta_interval= CRM_meta_id= CRM_meta_timeout= crm_feature_set= CRM_meta_description= CRM_meta_name= : pid timed out
crmd: 2012/08/10_23:00:18 ERROR: process_lrm_event: LRM operation ipservice_monitor_5000 (17) Timed Out (timeout=10000ms)
tengine: 2012/08/10_23:00:18 info: process_graph_event: Action ipservice_monitor_5000 arrived after a completed transition
tengine: 2012/08/10_23:00:18 info: update_abort_priority: Abort priority upgraded to 1000000
tengine: 2012/08/10_23:00:18 WARN: update_failcount: Updating failcount for ipservice on 98f8988d-dde0-437d-b4d4-2127401130a8 after failed monitor: rc=-2
crmd: 2012/08/10_23:00:18 info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
crmd: 2012/08/10_23:00:18 info: do_state_transition: All 1 cluster nodes are eligible to run resources.
pengine: 2012/08/10_23:00:18 notice: unpack_config: On loss of CCM Quorum: Ignore
pengine: 2012/08/10_23:00:18 info: determine_online_status: Node ha1 is online
pengine: 2012/08/10_23:00:18 info: common_apply_stickiness: Setting failure stickiness for ipservice on ha1: -200
pengine: 2012/08/10_23:00:18 WARN: unpack_rsc_op: Processing failed op ipservice_monitor_5000 on ha1: Timed Out
pengine: 2012/08/10_23:00:20 notice: group_print: Resource Group: group_1
tengine: 2012/08/10_23:00:20 info: extract_event: Aborting on transient_attributes changes for 98f8988d-dde0-437d-b4d4-2127401130a8
pengine: 2012/08/10_23:00:20 notice: native_print:     ipservice        (lsb:ipresources):        Started ha1 FAILED
tengine: 2012/08/10_23:00:20 WARN: notify_crmd: Delaying completion until all CIB updates complete
tengine: 2012/08/10_23:00:20 info: te_update_diff: Aborting on transient_attributes deletions
pengine: 2012/08/10_23:00:20 notice: NoRoleChange: Recover resource ipservice        (ha1)
tengine: 2012/08/10_23:00:20 WARN: notify_crmd: Delaying completion until all CIB updates complete
pengine: 2012/08/10_23:00:20 notice: StopRsc:   ha1        Stop ipservice
pengine: 2012/08/10_23:00:20 notice: StartRsc:  ha1        Start ipservice
pengine: 2012/08/10_23:00:20 notice: RecurringOp: ha1           ipservice_monitor_5000
pengine: 2012/08/10_23:00:20 info: process_pe_message: Transition 212: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-212.bz2
pengine: 2012/08/10_23:00:20 notice: unpack_config: On loss of CCM Quorum: Ignore
pengine: 2012/08/10_23:00:20 info: determine_online_status: Node ha1 is online
pengine: 2012/08/10_23:00:20 info: common_apply_stickiness: Setting failure stickiness for ipservice on ha1: -300
pengine: 2012/08/10_23:00:20 WARN: unpack_rsc_op: Processing failed op ipservice_monitor_5000 on ha1: Timed Out
pengine: 2012/08/10_23:00:20 notice: group_print: Resource Group: group_1
pengine: 2012/08/10_23:00:20 notice: native_print:     ipservice        (lsb:ipresources):        Started ha1 FAILED
pengine: 2012/08/10_23:00:20 WARN: native_color: Resource ipservice cannot run anywhere
pengine: 2012/08/10_23:00:20 notice: StopRsc:   ha1        Stop ipservice
crmd: 2012/08/10_23:00:20 info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
tengine: 2012/08/10_23:00:20 info: unpack_graph: Unpacked transition 213: 4 actions in 4 synapses
tengine: 2012/08/10_23:00:20 info: te_pseudo_action: Pseudo action 8 fired and confirmed
tengine: 2012/08/10_23:00:20 info: send_rsc_command: Initiating action 2: ipservice_stop_0 on ha1
crmd: 2012/08/10_23:00:20 info: do_lrm_rsc_op: Performing op=ipservice_stop_0 key=2:213:b44f9dac-4adf-4fe5-b5fa-97facac3f586)
lrmd: 2012/08/10_23:00:20 info: rsc:ipservice: stop
crmd: 2012/08/10_23:00:20 info: process_lrm_event: LRM operation ipservice_monitor_5000 (call=17, rc=-2) Cancelled
lrmd: 2012/08/10_23:00:20 WARN: For LSB init script, no additional parameters are needed.
pengine: 2012/08/10_23:00:20 WARN: process_pe_message: Transition 213: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-warn-0.bz2
pengine: 2012/08/10_23:00:20 info: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
lrmd: 2012/08/10_23:00:20 info: RA output: (ipservice:stop:stdout) Stopping ldirectord...
lrmd: 2012/08/10_23:00:21 info: RA output: (ipservice:stop:stdout) success
《解決方案》

怎麼會fired 呢,把資源給解僱了??
我也這樣啊。不知道咋回事
heartbeat: 2013/05/16_13:00:39 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
heartbeat: 2013/05/16_13:00:39 info: glib: ucast: bound send socket to device: eth1
"ha-log" 9410L, 1018003C
pengine: 2013/05/17_11:10:44 info: unpack_rsc_op: Remapping httpd_2_monitor_120000 (rc=7) on intel-01 to an ERROR (expected 0)
pengine: 2013/05/17_11:10:44 WARN: unpack_rsc_op: Processing failed op httpd_2_monitor_120000 on intel-01: Error
pengine: 2013/05/17_11:10:44 info: unpack_rsc_op: Remapping httpd_2_start_0 (rc=1) on intel-01 to an ERROR (expected 0)
pengine: 2013/05/17_11:10:44 WARN: unpack_rsc_op: Processing failed op httpd_2_start_0 on intel-01: Error
pengine: 2013/05/17_11:10:44 WARN: unpack_rsc_op: Compatability handling for failed op httpd_2_start_0 on intel-01
pengine: 2013/05/17_11:10:44 notice: group_print: Resource Group: group_1
pengine: 2013/05/17_11:10:44 notice: native_print:     IPaddr_192_168_7_189      (ocf::heartbeat:IPaddr):        Started intel-02
pengine: 2013/05/17_11:10:44 notice: native_print:     httpd_2   (lsb:httpd):    Started intel-02
pengine: 2013/05/17_11:10:44 WARN: native_color: Resource httpd_2 cannot run anywhere
pengine: 2013/05/17_11:10:44 notice: NoRoleChange: Leave resource IPaddr_192_168_7_189   (Started intel-02)
pengine: 2013/05/17_11:10:44 notice: NoRoleChange: Stop resource httpd_2 (Started intel-02)
pengine: 2013/05/17_11:10:44 notice: StopRsc:   intel-02 Stop httpd_2
crmd: 2013/05/17_11:10:44 info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
tengine: 2013/05/17_11:10:44 info: process_te_message: Processing graph derived from /var/lib/heartbeat/pengine/pe-warn-40.bz2
tengine: 2013/05/17_11:10:44 info: unpack_graph: Unpacked transition 10: 4 actions in 4 synapses
tengine: 2013/05/17_11:10:44 info: te_pseudo_action: Pseudo action 11 fired and confirmed
tengine: 2013/05/17_11:10:44 info: send_rsc_command: Initiating action 8: stop httpd_2_stop_0 on intel-02
pengine: 2013/05/17_11:10:44 WARN: process_pe_message: Transition 10: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-warn-40.bz2
crmd: 2013/05/17_11:10:44 info: do_lrm_rsc_op: Performing op=httpd_2_stop_0 key=8:10:0:2a241abe-1246-4478-830a-f0bb697001c0)
pengine: 2013/05/17_11:10:44 info: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
lrmd: 2013/05/17_11:10:44 info: rsc:httpd_2: stop
lrmd: 2013/05/17_11:10:44 WARN: For LSB init script, no additional parameters are needed.
lrmd: 2013/05/17_11:10:44 info: RA output: (httpd_2:stop:stdout) Stopping httpd:
lrmd: 2013/05/17_11:10:44 info: RA output: (httpd_2:stop:stdout) [
lrmd: 2013/05/17_11:10:44 info: RA output: (httpd_2:stop:stdout)   OK  ]^M
lrmd: 2013/05/17_11:10:44 info: RA output: (httpd_2:stop:stdout)

crmd: 2013/05/17_11:10:44 info: process_lrm_event: LRM operation httpd_2_stop_0 (call=12, rc=0) complete
tengine: 2013/05/17_11:10:44 info: match_graph_event: Action httpd_2_stop_0 (8) confirmed on intel-02 (rc=0)
tengine: 2013/05/17_11:10:44 info: te_pseudo_action: Pseudo action 12 fired and confirmed
tengine: 2013/05/17_11:10:44 info: te_pseudo_action: Pseudo action 2 fired and confirmed
tengine: 2013/05/17_11:10:44 info: run_graph: Transition 10: (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0)
tengine: 2013/05/17_11:10:44 info: notify_crmd: Transition 10 status: te_complete - <null>
crmd: 2013/05/17_11:10:44 info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
《解決方案》

回復 5# 402550990


    你這個問題是怎麼解決的?
《解決方案》

timeout=5000ms
能把這個值給改的大一些么?
試一下
《解決方案》

回復 8# wenhq


    你是說, 修改這裡的timeout?      
        <primitive class="lsb" id="httpd_2" provider="heartbeat" type="httpd">
           <operations>
             <op id="httpd_2_mon" interval="120s" name="monitor" timeout="60s" start_delay="0" disabled="false" role="Started"/>
           </operations>
           <meta_attributes id="httpd_2_meta_attrs">
             <attributes>
               <nvpair id="httpd_2_metaattr_resource_stickiness" name="resource_stickiness" value="100"/>
               <nvpair id="httpd_2_metaattr_resource_failure_stickiness" name="resource_failure_stickiness" value="-20"/>
             </attributes>
           </meta_attributes>
         </primitive>
《解決方案》

本帖最後由 wenhq 於 2013-05-17 18:16 編輯

不對啊,看你的抱錯是5000ms怎麼配置里是60s呢?
難到還有別的設置?
anyway 試試把.

[火星人 ] heartbeat IPaddr2 monitor timeout已經有1213次圍觀

http://coctec.com/docs/service/show-post-4063.html