OGG的extract进程checkpoint时间点回到1988-01-01 00:00:00故障处理(一)

2015-02-13 23:46:40 · 作者: · 浏览: 83

1、故障现象
? ? Extract进程(SEXTR01)状态为running,但是Lag at Chkpt却达到5个多小时,且时间一直在增长,根本就不抽取新日志,状态信息如下:


GGSCI (caXXadgdb) 21> info all


Program? ? Status? ? ? Group? ? ? Lag at Chkpt? Time Since Chkpt


MANAGER? ? RUNNING? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?


EXTRACT? ? RUNNING? ? DPEYWGL? ? 00:00:00? ? ? 00:00:04?


EXTRACT? ? RUNNING? ? SEXTR01? ? 05:14:58? ? ? 00:00:03?
?


2、故障分析
2.1 debug一下进程状态信息
GGSCI (caXXadgdb) 23> info sextr01, showch debug


?


EXTRACT? ? SEXTR01? Last Started 2015-01-25 22:51? Status RUNNING


Checkpoint Lag? ? ? 05:21:26 (updated 00:00:04 ago)


? ? ? ? ? ? ? ? ? ? 2015-01-25 18:05:16? Thread 1, Seqno 5554, RBA 611373536


? ? ? ? ? ? ? ? ? ? SCN 3126.136656462 (13426204423758)


Log Read Checkpoint? Oracle Redo Logs


? ? ? ? ? ? ? ? ? ? 1988-01-01 00:00:00? Thread 2, Seqno 4016, RBA 1024


? ? ? ? ? ? ? ? ? ? SCN 44527.110828288 (191242119617280)


?


?


Current Checkpoint Detail:


?


Read Checkpoint #1


……


?


Read Checkpoint #2


?


? Oracle Threaded Redo Log


?


? Startup Checkpoint (starting position in the data source):


? ? Thread #: 2


? ? Sequence #: 4015


? ? RBA: 640984080


? ? Timestamp: 2015-01-25 18:03:51.000000


? ? SCN: 3126.136602577 (13426204369873)


? ? Redo File: +DGROUP1/caXXdb/onlinelog/group_15.267.799673791


?


? Recovery Checkpoint (position of oldest unprocessed transaction in the data source):


? ? Thread #: 2


? ? Sequence #: 4015


? ? RBA: 640984080


? ? Timestamp: 2015-01-25 18:03:51.000000


? ? SCN: 3126.136602577 (13426204369873)


? ? Redo File: +DGROUP1/caXXdb/onlinelog/group_15.267.799673791


?


? Current Checkpoint (position of last record read in the data source):


? ? Thread #: 2


? ? Sequence #: 4016


? ? RBA: 1024


? ? Timestamp: 1988-01-01 00:00:00.000000


? ? SCN: 44527.110828288 (191242119617280)


? ? Redo File: +DGROUP1/caXXdb/onlinelog/group_16.266.799673793


……
?


?


从这里发现一个不可思议的问题:


? ? ? Thread #: 2的current checkpoint 的timestamp竟然是1988-01-0100:00:00.000000,SCN的十进制数,竟然为191242119617280,远远大于当前数据库的SCN,多出一位数。


? ? ? 时间倒退,SCN号反而大涨,时间还那么有个性。这个问题挺有意思。


2.2 view extract进程的report
? ? 使用view report sextr01查看进程的报告,提示有ERROR,thread#2 4016 archive log 不能打开。(由于笔者没有将信息复制下来,所以当时的状态只能这么描述了),但是4016号archivelog实际物理文件是存在的,而且权限正常。


2.3 将进程停止下来,showch其状态
GGSCI (calladgdb) 24>stop sextr01


GGSCI (calladgdb) 25> info SEXTR01, showch


EXTRACT? ? SEXTR01? Last Started 2015-01-26 01:00? Status STOPPED


Checkpoint Lag? ? ? 06:56:55 (updated 00:00:58 ago)


Log Read Checkpoint? Oracle Redo Logs


? ? ? ? ? ? ? ? ? ? 2015-01-25 18:05:16? Thread 1, Seqno 5554, RBA 611373536


? ? ? ? ? ? ? ? ? ? SCN 3126.136656462 (13426204423758)


Log Read Checkpoint? Oracle Redo Logs


? ? ? ? ? ? ? ? ? ? 1988-01-01 00:00:00? Thread 2, Seqno 4016, RBA 1024


? ? ? ? ? ? ? ? ? ? SCN 44527.110828288 (191242119617280)


?


Current Checkpoint Detail:


?


Read Checkpoint #1


……


?


Read Checkpoint #2


?


? Oracle Threaded Redo Log


?


? Startup Checkpoint (starting position in the data source):


? ? Thread #: 2


? ? Sequence #: 4015


? ? RBA: 640984080


? ? Timestamp: 2015-01-25 18:03:51.000000


? ? SCN: 3126.136602577 (13426204369873)


? ? Redo File: +DGROUP1/caXXdb/onlinelog/group_15.267.799673791


?


? Recovery Checkpoint (position of oldest unprocessed transaction in the data source):


? ? Thread #: 2


? ? Sequence #: 4015


? ? RBA: 640984080


? ? Timestamp: 2015-01-25 18:03:51.000000


? ? SCN: 3126.136602577 (13426204369873)


? ? Redo File: +DGROUP1/caXXdb/onlinelog/group_15.267.799673791


?


? Current Checkpoint (position of last record read in the data source):


? ? Thread #: 2


? ? Sequence #: 4016


? ? RBA: 1024


? ? Timestamp: 1988-01-01 00:00:00.000000


? ? SCN: 44527.110828288 (191242119617280)


? ?