on 08-10-2007 2:21 PM
Dear all,
After four Oracle 10 upgrades, the only remaining warning I have on db13 in all databases I upgraded is about "checkpoint not complete" . This only happens when there is a lot of activity in the database . It never happened before though, even when there was a lot of activity in the database.
I found note #79341 and #1068186 and I have set DISABLESELFTUNE_CHECKPOINTING
to FALSE, but this didn't particularly help, still get the warnings
My questions are:
The log buffer size is increased by the upgrade to 14M (instead of 1M that is used to be). I have 2 groups of redo log files, with 2 files each. Each file is 20M on size.
I was thinking of increasing them to 30M each. Do you think that this is a good thing to do?
Under /oracle/SID/saptrace/background, I see in the log writer trace the messages below. Do you know why I get these messages and are they relevant to the issue I have with the checkpoint?
SERVICE NAME:() 2007-08-09 09:56:38.059
SESSION ID:(316.1) 2007-08-09 09:56:38.059
Maximum redo generation record size = 197120 bytes
Maximum redo generation change vector size = 190204 bytes
tkcrrsarc: (WARN) Failed to find ARCH for message (message:0x10)
tkcrrpa: (WARN) Failed initial attempt to send ARCH message (message:0x10)
2007-08-09 12:55:29.062
LGWR: Archivelog for thread 1 sequence 3662 will NOT be compressed
2007-08-09 14:56:22.902
LGWR: Archivelog for thread 1 sequence 3663 will NOT be compressed
2007-08-09 15:09:29.136
LGWR: Archivelog for thread 1 sequence 3664 will NOT be compressed
2007-08-09 18:25:38.287
Many thanks
Andreas
Support team,
There are many different kind of checkpoints. The checkpoint that is issued with a logswitch is named "log switch checkpoint" in the alert log. If you issue manually an "alter system checkpoint" command, then you will see in the alert log that a "global checkpoint" is issued, which is executed immediately and turns the status of all ACTIVE redo logs to INACTIVE. These INACTIVE redo logs are no longer needed for recovery and can be dropped.
Now Oracle 10 does not trigger a full (global) checkpoint on logswitch (I think Oracle 9 used to do that after 2 consecutive log switches). Oracle 10 issues a global checkpoint when it tries to switch to an already ACTIVE redo log. The log switch checkpoint that is triggered by a log switch appears to have low priority and the database writer doesn't pick it up immediately like it does for the global checkpoint. I have seen checkpoints like this taking 3-4 minutes to complete while the data to be saved is only 50M so I can't believe that it is because of slow db writer.
Andreas
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi Andreas,
I getting abit confuse. What I understand is that a checkpoint is issue whenever there is a log switch. My understanding of log switch is from log 11 to log 12 (assuming i had 4 log group 11 - 14 ). But base on your explanation, a checkpoint is issue when a complete switch log of group 11 to 12 to 13 to 14 then back to 11. Do highlight me if my understanding of the log switch theory is all the while wrong.
Regards
Lauran
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hello,
I didn't write that. I only referred to one checkpoint because this was only relevant to the example I was trying to make. I will rephrase:
A checkpoint will happen in each log switch. 11 to 12, 12 to 13, 13 to 14, 14 to 11. These are 4 different checkpoints. If you have more redo log groups, redo log group 11 will be again in user after 12, 13 and 14 have been filled. Therefore, the database writer has more time to complete the checkpoint that was started when the switch from 11 to 12 took place.
When the dbwriter finishes with the 11 to 12 checkpoint, it can work on the other pending checkpoints (12 to 13, 13 to 14 etc etc )
I hope this clarifies
Andreas
Hi Andreas,
Thanks for the infomation. 1 more question which will clear my doubt on this checkpoint issue. I understand that increasing the redo log size will help because it will reduce the dbwR writing frequency. Increasing the number of dbwR also aid in the process. How about adding a new redo log grp ? How does this contribute to the "checkpoint not complete" issue ?
Regards
Lauran
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hello Lauran,
Increasing the number of redo log groups helps as well. It is kind of equivalent to increasing the size of the redo logs. When you have more redo log groups, it takes more time for the log switches to "arrive" to the same redo log. Therefore, the db writer has more time to execute the checkpoint until the same redo log is used again.
For example, say you have 4 redo log groups called 11, 12, 13, 14. 11 is current, then a log switch happens. A checkpoint is issued when the log switch happens and the dbwriter starts working on writing the "dirty" blocks in the datafiles. In the meanwhile, the database is working and changes are recorded to redo log 12, a switch is taking place, now 13 is current, a bit later 14 is current and after this again 11. If you only had 2 redo log groups , 11 and 12 , the checkpoint on 11 will need to be completed while 12 is being written. If you have 4 like the example above, the db writer has more time ( the time that 12, 13, 14 ) take to be full.
Note 79341 shows how to increase the size of redo logs and the number of redolog groups
I hope this helps
Andreas
Hi All,
Just to side track abit. Can i said that when "checkpoint not complete" occured, it acutally mean that oracle is not able to read the data due to checkpoint is logging the process. "Checkpoint not complete" is actually not an error targeting the checkpoint issue. Or i can rephase the error as "read error due to checkpoint is in progress" Kindly correct me.
Regard
Lauran
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Thank you for your answers, I awarded some more points
I think I have my problem solved, 4 groups of 100M each has done the trick. I noticed a strange behaviour while resizing the redo logs, which might be the cause of the problem.
I understand that a log switch is always triggering a checkpoint (and the opposite is not true). When I was dropping redo log groups, I got two different kind of errors. One was the the redo log group to be dropped can not be dropped because it is current. This makes sense and I easily issued an "alter system switch logfile" statement.
SQL> ALTER DATABASE DROP LOGFILE GROUP 13;
ALTER DATABASE DROP LOGFILE GROUP 13
*
ERROR at line 1:
ORA-01623: log 13 is current log for instance BWT (thread 1) - cannot drop
ORA-00312: online log 13 thread 1: '/oracle/BWT/origlogA/log_g13m1.dbf'
ORA-00312: online log 13 thread 1: '/oracle/BWT/mirrlogA/log_g13m2.dbf'
SQL> alter system switch logfile;
System altered.
Then, when I tried to drop the same redo log group (that was not current any more, as I had just switched to the next one) I got the message
SQL> ALTER DATABASE DROP LOGFILE GROUP 13;
ALTER DATABASE DROP LOGFILE GROUP 13
*
ERROR at line 1:
ORA-01624: log 13 needed for crash recovery of instance BWT (thread 1)
ORA-00312: online log 13 thread 1: '/oracle/BWT/origlogA/log_g13m1.dbf'
ORA-00312: online log 13 thread 1: '/oracle/BWT/mirrlogA/log_g13m2.dbf'
So despite the switch, a checkpoint has not been issued yet. When I executed,
alter system checkpoint;
I could then drop log froup 13
I had similar issues with other log groups . What I do not understand is why I had to issue a manual checkpoint, I believed that the log switch should have issued a checkpoint itself.
Many thanks
Andreas
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
<b>ORA-01624: log 13 needed for crash recovery of instance BWT (thread 1)</b>
This means that this group is still active and its logswitch checkpoint is not done yet.
Again a brief overview:
- every change generates redo (written to the current redo group) and changed blocks in the buffer cache (memory)
- everytime the redo log gets full a logswitch checkpoint is triggered, indicating DBWR to write all changed blocks to the data files, a new log group becomes current
- to loggroup stays active until the checkpoint completes (meaning all changed blocks covered in the redo are now written to disk. If the instance crashes before the checkpoint completed, this very log group would be needed for the recovery. Therefore you are not allowed to drop either a current, or an active log group.
So a checkpoint is mainly done by DBWR, the CKPT process only updates the file headers when a checkpoint completes.
Regards Michael
Hello all, thank you for your replies so far, I awarded some points
I have had 9 systems upgraded so far. I am playing with different combinations of redo size, redo groups numbers and DISABLESELFTUNE_CHECKPOINTING. Here are my findings
Changing the DISABLESELFTUNE_CHECKPOINTING parameter hasn't made much difference anywhere. Whether set to true, false or reset (not sure what the default value is then as I can not see it in db03 or using show parameter from sqlplus), I have loads of checkpoint not completed messages in my busy systems
I had 4 redo log groups initially consisting of 1 redo log each of 20 M
Initially, I increased each redo log files size to 50M, the warnings were reduced (from 8-10 per day to 2-3). I then added 2 more groups of 1 redo log each (50M again). I now have none or 1 warning per day there
Vinod, if you can copy and paste me the extract from the database upgrade guide that is talking about the size of the redologs I will be grateful. I tried to look for it both in the SAP Oracle 10 upgrade guide and the Oracle guide from Metalink but I couldn't find it
Eric, or anybody else that understands the term "the database is fluctuating" from note 1068186, could you please explain it here?
Many thanks
Andreas
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi again
One should not guess but i read the german version of the note as well, and i am pretty sure that 'fluctuating' means that the database is still aligning with its typical load patterns. As i already mentioned this most probably has to do with the other sort of checkpoints and not log switch checkpoints. These other (incremental) checkpoints can be tuned with various parameters (log_checkpoint_timeout / fast_start_mttr_target).
But in your case we have log switch checkpoints, which can only be controlled by the size of the redo logs. As your observations say you already reduced the not complete messages significantly. I am quite sure that they will go away completely when you further increase the size. 50mb is still tiny as on a fast system the logwriter can write 20mb or more per second under load. So if you have 4 redo groups of 50mb this would give 200mb divided by 20mb/s = 10s. So DBWR either has to write all changed blocks covered in the first group to the datafiles in 10s <b>or</b> a checkpoint not complete occurs.
In worst cases a single redo entry like <i>update bigtable set counter = counter + 1</i> could change millions of table blocks and generating millions of undo blocks which have to be written by the DBWR.
<b>Redo log switches should occur once per minute under load. If log_checkpoints_to_alert is set to true you can see log switches in the alert log.</b>
If you have logswitches under one minute when the checkpoint not complete occurs, then your log size is most probably still to small.
Of course if nobody complains of the now sporadic not complete messages, and you do not experience system hangs at crucial times, you should not worry to much about them anymore.
Best regards
Michael
BTW forgot, to see if an underscore parameter is set you can either do:
[code]SQL> show parameter disableselftune_checkpointing[/code]
This should work in 10g, otherwise do:
[code]SQL> select a.ksppinm "Parameter", b.ksppstvl "Session Value", c.ksppstvl "Instance Value"
from x$ksppi a, x$ksppcv b, x$ksppsv c
where a.indx = b.indx and a.indx = c.indx
and ksppinm ='_disable_selftune_checkpointing';[/code]
Dear all,
Thank you for your replies so far. In the meanwhile, I increased the size of my redologs from 20M to 50M. The checkpoint warnings still exist but they are significally reduced. I plan to put two more redo log groups (I only have two at the moment) and see what happens
Anyone knows why this behaviour in Oracle 10? The systems I am working on haven't had any increase in the user load and never had checkpoint warnings. I found notes 79341 and 1068186 but it is not clear to me what exactly happens for Oracle not to issue checkpoints as before. As I said, I have this issue directly after the upgrade to Oracle 10 and without any additional load to the system.
Many thanks
Andreas
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Team,
Up to my knowledge, in 10 G Yes it is recmdd to go for Big redos in place of small in 9i.Oracle upgrade Doc says that.So I guess we can say in 10g checkponting is fast? NOt sure !
Depending on the number of datafiles in a database, a checkpoint can be a
highly resource intensive operation, And Tuning checkpoints involves four key initialization parameters.'
And this issue I heard many times after fresh upgrade.
Regards
Vinod
<b>Anyone knows why this behaviour in Oracle 10? The systems I am working on haven't had any increase in the user load and never had checkpoint warnings. I found notes 79341 and 1068186 but it is not clear to me what exactly happens for Oracle not to issue checkpoints as before. As I said, I have this issue directly after the upgrade to Oracle 10 and without any additional load to the system.</b>
We had exactly the same behaviour like you have when going from 9i to 10g. I have made the following observation:
- 9i: dbwr immediately starts to work after the first log switch checkpoint
- 10g: dbwr is kind of lazy and is not working a lot before a SECOND logswitch is pending
You can easily see what is going on, with querying v$log:
[code]SQL> select * from v$log
GROUP# THREAD# SEQUENCE# BYTES MEMBERS ARC STATUS FIRST_CHANGE# FIRST_TIM
-
-
-
-
-
--- -
-
-
15 1 54069 1048576000 1 YES INACTIVE 1.8710E+12 27-AUG-07
16 1 54070 1048576000 1 YES ACTIVE 1.8710E+12 27-AUG-07
17 1 54071 1048576000 1 NO CURRENT 1.8710E+12 27-AUG-07[/code]
In this example there is one checkpoint - of group 16 - pending, therefore this redo is still active. When there is no inactive group, and the current is full, a checkpoint not complete occurs.
I am not very convinced that the disableselftune_checkpointing will help much in your case (but i could be wrong, so be free to try). dbwr must write all changed blocks covered in the group before the log group can be reused. If under load the log gets full, there is no way to prevent a log switch checkpoint (there are of course checkpoint events other than log switches).
Regards Michael
Dear Team,
A Checkpoint is a database event which synchronizes the modified data blocks in memory with the datafiles on disk.
Checkpoint not complete indicates that Oracle wants to reuse a redo log file, but
the current checkpoint position is still in that log. In this case, Oracle must
wait until the checkpoint position passes that log
this situation may be encountered if DBWR writes too slowly, or if a log switch happens before the log is completely full, or if log file sizes are too small.
I would even recommend sizing them a few hundred Mbytes, you can also consider adding an extra group because the checkpoint of the log about to be overwritten is not yet complete.
Vinod
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi,
you should set DISABLESELFTUNE_CHECKPOINTING
to TRUE not to FALSE as stated in OSS note 1068186 to reduce the number of checkpoints.
You should also consider increasing the number of DB writers (parameter db_writer_processes)
Thanks
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
SAP Note 830576 has this for buffer size: LOG_BUFFER 1048576
I had the same problem with checkpoints and reduced the log_buffer size to 1048576 and the issue went away.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi Andreas
You can ignore the LGWR not compressed messages. If needed i can post the corresponding oracle metalink doc here, it is 284618.1. I don't hope you are on Oracle 10.1.0.X
Checkpoint not complete mostly happens, when the DBWR cannot keep up writing dirty blocks from the buffer cache to the database files. Every logswitch (redo log full, switch to next redo group) triggers a log checkpoint. Here is a general recommondation, please apply careful as this is depending on your system:
- if possible go to four redo log groups
- 20mb per redo log is tiny. if possible extend to 50mb.
I have seen systems with more than 20mb redo generated per second, they needed redo log sizes around 1gb. Redo log switches should occur once per minute under load. If log_checkpoints_to_alert is set to true you can see log switches in the alert log.
>> Beginning log switch checkpoint up to RBA [0x13d2.2.10], SCN: 44944154
With 4 groups and 50mb, there should be less pressure on the dbwr. Please do not hesitate to come back if you need further clarification.
Regards
Michael
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi
OSS Note 79341 suggestions are the best.
Regards
Ganesh
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
82 | |
10 | |
10 | |
9 | |
6 | |
6 | |
5 | |
5 | |
4 | |
3 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.