The hangcheck-timer Module for I/O Fencing in Oracle 9i/10g/11gR1 on Linux (Part 1)

2014-11-24 18:47:27

Reference: MOS (My Oracle Support)


On Linux, Oracle 9i/10g/11gR1 RAC requires the hangcheck-timer module to be configured.


Note: The hangcheck-timer is not required starting with Oracle Clusterware 11gR2; the module no longer needs to be configured for 11gR2 RAC.


Starting with release 9.2.0.2, Oracle RAC environments are required to use a new I/O fencing module, the hangcheck-timer module. It was implemented to replace the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.


Hangcheck-timer should be loaded at boot time. It monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs. It does this by setting a timer and then checking, when the timer fires, whether it was delayed by more than the allowed margin of error. If the duration exceeds the allowed time of (hangcheck_tick + hangcheck_margin) seconds, the machine is restarted. Hangcheck-timer will not cause reboots to occur due to CPU starvation.


Hangcheck-timer takes three configuration parameters:


(1) hangcheck_tick - defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is 60 seconds.


(2) hangcheck_margin - defines how much margin is allowed, in seconds, between the expected scheduling time and the real scheduling time. The default value is 180 seconds.


(3) hangcheck_reboot - determines whether the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, the hangcheck-timer module restarts the system. If hangcheck_reboot is set to zero, the hangcheck-timer module will not reboot the node, even if a hang is detected. The default value varies by kernel version: in the 2.4 kernel the default is 1, and in 2.6 kernels the default is 0.


When hangcheck_reboot=1 and the following condition holds, hangcheck-timer reboots the system: system hang time > (hangcheck_tick + hangcheck_margin)
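The reboot condition can be modelled in a few lines of Python. This is only an illustrative sketch of the rule as described here, not the kernel module's code; the `HangcheckParams` class and both function names are hypothetical, and the defaults are the ones quoted in this article (with the 2.6-kernel default for hangcheck_reboot).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HangcheckParams:
    """Hypothetical container for the three module parameters."""
    tick: int = 60     # hangcheck_tick, seconds (default quoted in the text)
    margin: int = 180  # hangcheck_margin, seconds (default quoted in the text)
    reboot: int = 0    # hangcheck_reboot (0 is the 2.6-kernel default quoted in the text)


def hang_detected(hang_seconds: float, params: HangcheckParams) -> bool:
    """True when the hang lasted longer than hangcheck_tick + hangcheck_margin."""
    return hang_seconds > params.tick + params.margin


def would_reboot(hang_seconds: float, params: HangcheckParams) -> bool:
    """A detected hang leads to a restart only when hangcheck_reboot >= 1."""
    return hang_detected(hang_seconds, params) and params.reboot >= 1
```

With the defaults, the threshold is 60 + 180 = 240 seconds: a 241-second hang is detected, but the node is restarted only if hangcheck_reboot is 1 or greater.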


All hangcheck-timer default values should be explicitly overridden when loading the kernel module, based on the Oracle release, as follows:


9i: Assuming the default setting of "oracm misscount" is 220 seconds:


hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
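As a quick check on the 9i numbers, the sketch below builds the parameter string and compares the resulting detection window with the misscount value. The helper names are hypothetical. The idea that the window (tick + margin) is meant to stay below "oracm misscount" is an inference from the values given here, not something stated in the text.

```python
def module_options(tick: int, margin: int, reboot: int) -> str:
    """Format the three parameters as a space-separated key=value string."""
    return (
        f"hangcheck_tick={tick} "
        f"hangcheck_margin={margin} "
        f"hangcheck_reboot={reboot}"
    )


def detection_window(tick: int, margin: int) -> int:
    """Seconds a hang must exceed before hangcheck-timer reacts."""
    return tick + margin


# Values quoted in the text for 9i.
ORACM_MISSCOUNT = 220  # seconds
options_9i = module_options(tick=30, margin=180, reboot=1)
window_9i = detection_window(tick=30, margin=180)

print(options_9i)
# Assumption: the window is expected to be smaller than misscount.
print(window_9i, window_9i < ORACM_MISSCOUNT)
```

Here the window works out to 30 + 180 = 210 seconds, ten seconds short of the 220-second misscount.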


-- 9i: assuming the default "oracm misscount" setting is 220 seconds, then hangcheck_tick=30 hangcheck_margin=1