cancel
Showing results for 
Search instead for 
Did you mean: 

Checkpoint not complete

former_member84399
Participant
0 Kudos

Dear all,

After four Oracle 10 upgrades, the only remaining warning I have on db13 in all databases I upgraded is about "checkpoint not complete" . This only happens when there is a lot of activity in the database . It never happened before though, even when there was a lot of activity in the database.

I found note #79341 and #1068186 and I have set DISABLESELFTUNE_CHECKPOINTING

to FALSE, but this didn't particularly help, still get the warnings

My questions are:

The log buffer size is increased by the upgrade to 14M (instead of 1M that is used to be). I have 2 groups of redo log files, with 2 files each. Each file is 20M on size.

I was thinking of increasing them to 30M each. Do you think that this is a good thing to do?

Under /oracle/SID/saptrace/background, I see in the log writer trace the messages below. Do you know why I get these messages and are they relevant to the issue I have with the checkpoint?

      • SERVICE NAME:() 2007-08-09 09:56:38.059

      • SESSION ID:(316.1) 2007-08-09 09:56:38.059

Maximum redo generation record size = 197120 bytes

Maximum redo generation change vector size = 190204 bytes

tkcrrsarc: (WARN) Failed to find ARCH for message (message:0x10)

tkcrrpa: (WARN) Failed initial attempt to send ARCH message (message:0x10)

      • 2007-08-09 12:55:29.062

LGWR: Archivelog for thread 1 sequence 3662 will NOT be compressed

      • 2007-08-09 14:56:22.902

LGWR: Archivelog for thread 1 sequence 3663 will NOT be compressed

      • 2007-08-09 15:09:29.136

LGWR: Archivelog for thread 1 sequence 3664 will NOT be compressed

      • 2007-08-09 18:25:38.287

Many thanks

Andreas

Accepted Solutions (1)

Accepted Solutions (1)

Former Member
0 Kudos

Support team,

There are many different kind of checkpoints. The checkpoint that is issued with a logswitch is named "log switch checkpoint" in the alert log. If you issue manually an "alter system checkpoint" command, then you will see in the alert log that a "global checkpoint" is issued, which is executed immediately and turns the status of all ACTIVE redo logs to INACTIVE. These INACTIVE redo logs are no longer needed for recovery and can be dropped.

Now Oracle 10 does not trigger a full (global) checkpoint on logswitch (I think Oracle 9 used to do that after 2 consecutive log switches). Oracle 10 issues a global checkpoint when it tries to switch to an already ACTIVE redo log. The log switch checkpoint that is triggered by a log switch appears to have low priority and the database writer doesn't pick it up immediately like it does for the global checkpoint. I have seen checkpoints like this taking 3-4 minutes to complete while the data to be saved is only 50M so I can't believe that it is because of slow db writer.

Andreas

Answers (11)

Answers (11)

Former Member
0 Kudos

Hi Andreas,

I getting abit confuse. What I understand is that a checkpoint is issue whenever there is a log switch. My understanding of log switch is from log 11 to log 12 (assuming i had 4 log group 11 - 14 ). But base on your explanation, a checkpoint is issue when a complete switch log of group 11 to 12 to 13 to 14 then back to 11. Do highlight me if my understanding of the log switch theory is all the while wrong.

Regards

Lauran

Former Member
0 Kudos

Hello,

I didn't write that. I only referred to one checkpoint because this was only relevant to the example I was trying to make. I will rephrase:

A checkpoint will happen in each log switch. 11 to 12, 12 to 13, 13 to 14, 14 to 11. These are 4 different checkpoints. If you have more redo log groups, redo log group 11 will be again in user after 12, 13 and 14 have been filled. Therefore, the database writer has more time to complete the checkpoint that was started when the switch from 11 to 12 took place.

When the dbwriter finishes with the 11 to 12 checkpoint, it can work on the other pending checkpoints (12 to 13, 13 to 14 etc etc )

I hope this clarifies

Andreas

Former Member
0 Kudos

Hi Andreas,

Thanks for the infomation. 1 more question which will clear my doubt on this checkpoint issue. I understand that increasing the redo log size will help because it will reduce the dbwR writing frequency. Increasing the number of dbwR also aid in the process. How about adding a new redo log grp ? How does this contribute to the "checkpoint not complete" issue ?

Regards

Lauran

Former Member
0 Kudos

Hello Lauran,

Increasing the number of redo log groups helps as well. It is kind of equivalent to increasing the size of the redo logs. When you have more redo log groups, it takes more time for the log switches to "arrive" to the same redo log. Therefore, the db writer has more time to execute the checkpoint until the same redo log is used again.

For example, say you have 4 redo log groups called 11, 12, 13, 14. 11 is current, then a log switch happens. A checkpoint is issued when the log switch happens and the dbwriter starts working on writing the "dirty" blocks in the datafiles. In the meanwhile, the database is working and changes are recorded to redo log 12, a switch is taking place, now 13 is current, a bit later 14 is current and after this again 11. If you only had 2 redo log groups , 11 and 12 , the checkpoint on 11 will need to be completed while 12 is being written. If you have 4 like the example above, the db writer has more time ( the time that 12, 13, 14 ) take to be full.

Note 79341 shows how to increase the size of redo logs and the number of redolog groups

I hope this helps

Andreas

Former Member
0 Kudos

Hi All,

Just to side track abit. Can i said that when "checkpoint not complete" occured, it acutally mean that oracle is not able to read the data due to checkpoint is logging the process. "Checkpoint not complete" is actually not an error targeting the checkpoint issue. Or i can rephase the error as "read error due to checkpoint is in progress" Kindly correct me.

Regard

Lauran

Former Member
0 Kudos

Lauran,

It means that the redo log file can not be used because it's contents haven't been saved on the datafiles yet. So, it is rather "write error due to checkpoint in progress"

Andreas

former_member84399
Participant
0 Kudos

Thank you for your answers, I awarded some more points

I think I have my problem solved, 4 groups of 100M each has done the trick. I noticed a strange behaviour while resizing the redo logs, which might be the cause of the problem.

I understand that a log switch is always triggering a checkpoint (and the opposite is not true). When I was dropping redo log groups, I got two different kind of errors. One was the the redo log group to be dropped can not be dropped because it is current. This makes sense and I easily issued an "alter system switch logfile" statement.

SQL> ALTER DATABASE DROP LOGFILE GROUP 13;

ALTER DATABASE DROP LOGFILE GROUP 13

*

ERROR at line 1:

ORA-01623: log 13 is current log for instance BWT (thread 1) - cannot drop

ORA-00312: online log 13 thread 1: '/oracle/BWT/origlogA/log_g13m1.dbf'

ORA-00312: online log 13 thread 1: '/oracle/BWT/mirrlogA/log_g13m2.dbf'

SQL> alter system switch logfile;

System altered.

Then, when I tried to drop the same redo log group (that was not current any more, as I had just switched to the next one) I got the message

SQL> ALTER DATABASE DROP LOGFILE GROUP 13;

ALTER DATABASE DROP LOGFILE GROUP 13

*

ERROR at line 1:

ORA-01624: log 13 needed for crash recovery of instance BWT (thread 1)

ORA-00312: online log 13 thread 1: '/oracle/BWT/origlogA/log_g13m1.dbf'

ORA-00312: online log 13 thread 1: '/oracle/BWT/mirrlogA/log_g13m2.dbf'

So despite the switch, a checkpoint has not been issued yet. When I executed,

alter system checkpoint;

I could then drop log froup 13

I had similar issues with other log groups . What I do not understand is why I had to issue a manual checkpoint, I believed that the log switch should have issued a checkpoint itself.

Many thanks

Andreas

former_member204746
Active Contributor
0 Kudos
I believed that the log switch should have issued a checkpoint itself.

well, a checkpoint is issued, but the work can take some time to execute. This work is being done by the CKPT process. so, this is normal behavior and is expected.

Former Member
0 Kudos

<b>ORA-01624: log 13 needed for crash recovery of instance BWT (thread 1)</b>

This means that this group is still active and its logswitch checkpoint is not done yet.

Again a brief overview:

- every change generates redo (written to the current redo group) and changed blocks in the buffer cache (memory)

- everytime the redo log gets full a logswitch checkpoint is triggered, indicating DBWR to write all changed blocks to the data files, a new log group becomes current

- to loggroup stays active until the checkpoint completes (meaning all changed blocks covered in the redo are now written to disk. If the instance crashes before the checkpoint completed, this very log group would be needed for the recovery. Therefore you are not allowed to drop either a current, or an active log group.

So a checkpoint is mainly done by DBWR, the CKPT process only updates the file headers when a checkpoint completes.

Regards Michael

former_member84399
Participant
0 Kudos

Hello all, thank you for your replies so far, I awarded some points

I have had 9 systems upgraded so far. I am playing with different combinations of redo size, redo groups numbers and DISABLESELFTUNE_CHECKPOINTING. Here are my findings

Changing the DISABLESELFTUNE_CHECKPOINTING parameter hasn't made much difference anywhere. Whether set to true, false or reset (not sure what the default value is then as I can not see it in db03 or using show parameter from sqlplus), I have loads of checkpoint not completed messages in my busy systems

I had 4 redo log groups initially consisting of 1 redo log each of 20 M

Initially, I increased each redo log files size to 50M, the warnings were reduced (from 8-10 per day to 2-3). I then added 2 more groups of 1 redo log each (50M again). I now have none or 1 warning per day there

Vinod, if you can copy and paste me the extract from the database upgrade guide that is talking about the size of the redologs I will be grateful. I tried to look for it both in the SAP Oracle 10 upgrade guide and the Oracle guide from Metalink but I couldn't find it

Eric, or anybody else that understands the term "the database is fluctuating" from note 1068186, could you please explain it here?

Many thanks

Andreas

Former Member
0 Kudos

Hi again

One should not guess but i read the german version of the note as well, and i am pretty sure that 'fluctuating' means that the database is still aligning with its typical load patterns. As i already mentioned this most probably has to do with the other sort of checkpoints and not log switch checkpoints. These other (incremental) checkpoints can be tuned with various parameters (log_checkpoint_timeout / fast_start_mttr_target).

But in your case we have log switch checkpoints, which can only be controlled by the size of the redo logs. As your observations say you already reduced the not complete messages significantly. I am quite sure that they will go away completely when you further increase the size. 50mb is still tiny as on a fast system the logwriter can write 20mb or more per second under load. So if you have 4 redo groups of 50mb this would give 200mb divided by 20mb/s = 10s. So DBWR either has to write all changed blocks covered in the first group to the datafiles in 10s <b>or</b> a checkpoint not complete occurs.

In worst cases a single redo entry like <i>update bigtable set counter = counter + 1</i> could change millions of table blocks and generating millions of undo blocks which have to be written by the DBWR.

<b>Redo log switches should occur once per minute under load. If log_checkpoints_to_alert is set to true you can see log switches in the alert log.</b>

If you have logswitches under one minute when the checkpoint not complete occurs, then your log size is most probably still to small.

Of course if nobody complains of the now sporadic not complete messages, and you do not experience system hangs at crucial times, you should not worry to much about them anymore.

Best regards

Michael

Former Member
0 Kudos

BTW forgot, to see if an underscore parameter is set you can either do:

[code]SQL> show parameter disableselftune_checkpointing[/code]

This should work in 10g, otherwise do:

[code]SQL> select a.ksppinm "Parameter", b.ksppstvl "Session Value", c.ksppstvl "Instance Value"

from x$ksppi a, x$ksppcv b, x$ksppsv c

where a.indx = b.indx and a.indx = c.indx

and ksppinm ='_disable_selftune_checkpointing';[/code]

former_member204746
Active Contributor
0 Kudos

if your system had big loads in certain periods, I would advise using 6 or 8 regolog groups of 200MB each. If this is not enough, come back to this topic.

20MB was really not enough.

50 MB is OK for small SAP systems.

200MB is usually enough for most SAP systems.

former_member84399
Participant
0 Kudos

Dear all,

Thank you for your replies so far. In the meanwhile, I increased the size of my redologs from 20M to 50M. The checkpoint warnings still exist but they are significally reduced. I plan to put two more redo log groups (I only have two at the moment) and see what happens

Anyone knows why this behaviour in Oracle 10? The systems I am working on haven't had any increase in the user load and never had checkpoint warnings. I found notes 79341 and 1068186 but it is not clear to me what exactly happens for Oracle not to issue checkpoints as before. As I said, I have this issue directly after the upgrade to Oracle 10 and without any additional load to the system.

Many thanks

Andreas

Former Member
0 Kudos

Team,

Up to my knowledge, in 10 G Yes it is recmdd to go for Big redos in place of small in 9i.Oracle upgrade Doc says that.So I guess we can say in 10g checkponting is fast? NOt sure !

Depending on the number of datafiles in a database, a checkpoint can be a

highly resource intensive operation, And Tuning checkpoints involves four key initialization parameters.'

And this issue I heard many times after fresh upgrade.

Regards

Vinod

former_member204746
Active Contributor
0 Kudos

SAP note 1068186 explains why.

if you want to revert to Oarcfle 9i'w way, you can set the parameter "_disable_selftune_checkpointing" to the value "TRUE"

good luck.

Former Member
0 Kudos

<b>Anyone knows why this behaviour in Oracle 10? The systems I am working on haven't had any increase in the user load and never had checkpoint warnings. I found notes 79341 and 1068186 but it is not clear to me what exactly happens for Oracle not to issue checkpoints as before. As I said, I have this issue directly after the upgrade to Oracle 10 and without any additional load to the system.</b>

We had exactly the same behaviour like you have when going from 9i to 10g. I have made the following observation:

- 9i: dbwr immediately starts to work after the first log switch checkpoint

- 10g: dbwr is kind of lazy and is not working a lot before a SECOND logswitch is pending

You can easily see what is going on, with querying v$log:

[code]SQL> select * from v$log

GROUP# THREAD# SEQUENCE# BYTES MEMBERS ARC STATUS FIRST_CHANGE# FIRST_TIM

-


-


-


-


-


--- -


-


-


15 1 54069 1048576000 1 YES INACTIVE 1.8710E+12 27-AUG-07

16 1 54070 1048576000 1 YES ACTIVE 1.8710E+12 27-AUG-07

17 1 54071 1048576000 1 NO CURRENT 1.8710E+12 27-AUG-07[/code]

In this example there is one checkpoint - of group 16 - pending, therefore this redo is still active. When there is no inactive group, and the current is full, a checkpoint not complete occurs.

I am not very convinced that the disableselftune_checkpointing will help much in your case (but i could be wrong, so be free to try). dbwr must write all changed blocks covered in the group before the log group can be reused. If under load the log gets full, there is no way to prevent a log switch checkpoint (there are of course checkpoint events other than log switches).

Regards Michael

Former Member
0 Kudos

Dear Team,

A Checkpoint is a database event which synchronizes the modified data blocks in memory with the datafiles on disk.

Checkpoint not complete indicates that Oracle wants to reuse a redo log file, but

the current checkpoint position is still in that log. In this case, Oracle must

wait until the checkpoint position passes that log

this situation may be encountered if DBWR writes too slowly, or if a log switch happens before the log is completely full, or if log file sizes are too small.

I would even recommend sizing them a few hundred Mbytes, you can also consider adding an extra group because the checkpoint of the log about to be overwritten is not yet complete.

Vinod

Former Member
0 Kudos

Hi,

you should set DISABLESELFTUNE_CHECKPOINTING

to TRUE not to FALSE as stated in OSS note 1068186 to reduce the number of checkpoints.

You should also consider increasing the number of DB writers (parameter db_writer_processes)

Thanks

Former Member
0 Kudos

SAP Note 830576 has this for buffer size: LOG_BUFFER 1048576

I had the same problem with checkpoints and reduced the log_buffer size to 1048576 and the issue went away.

fidel_vales
Employee
Employee
0 Kudos

Hi,

Keep in mind that in Oracle 10g you cannot control the size of the redo log buffer as before with the mentioned parameter. Oracle will allocate the space based in granules. Setting the log buffer to 1048576, you are indicating Oracle 10g the minimun size to be allocated.

Former Member
0 Kudos

I did not know that! Thanks!

Former Member
0 Kudos

Hi Andreas

You can ignore the LGWR not compressed messages. If needed i can post the corresponding oracle metalink doc here, it is 284618.1. I don't hope you are on Oracle 10.1.0.X

Checkpoint not complete mostly happens, when the DBWR cannot keep up writing dirty blocks from the buffer cache to the database files. Every logswitch (redo log full, switch to next redo group) triggers a log checkpoint. Here is a general recommondation, please apply careful as this is depending on your system:

- if possible go to four redo log groups

- 20mb per redo log is tiny. if possible extend to 50mb.

I have seen systems with more than 20mb redo generated per second, they needed redo log sizes around 1gb. Redo log switches should occur once per minute under load. If log_checkpoints_to_alert is set to true you can see log switches in the alert log.

>> Beginning log switch checkpoint up to RBA [0x13d2.2.10], SCN: 44944154

With 4 groups and 50mb, there should be less pressure on the dbwr. Please do not hesitate to come back if you need further clarification.

Regards

Michael

Former Member
0 Kudos

Hi

OSS Note 79341 suggestions are the best.

Regards

Ganesh