From: Steve Adams
Date: 18-Dec-2000 10:26
Subject: System freezes due to log switch and/or checkpoints

Unless the log files are tiny, the amount of work needed for an instance checkpoint is a function of db_block_buffers, because that is what determines the number of dirty blocks that you can have in the cache. The speed at which that work can be done is a function of the physical write bandwidth. And the window of time available to complete a log switch checkpoint depends on the rate of redo generation and, under Oracle8, the size of the log files times the number of log file groups, or, under Oracle7, just the size of the log files (assuming that log_checkpoint_interval and log_checkpoint_timeout are set so as to prevent interval checkpoints). So under Oracle7 you had to increase the size of the log files to enlarge the window available for checkpoint completion, whereas under Oracle8 you can increase either the size of the log files or the number of groups.
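
To make the arithmetic concrete, here is a rough sketch in Python. It is not anything Oracle provides. The function names and all the figures are hypothetical, the work estimate assumes the worst case where every buffer in the cache is dirty, and the Oracle8 window conservatively counts all the log file groups bar one.

```python
def checkpoint_time_s(db_block_buffers, db_block_size, write_mb_per_s):
    """Worst-case seconds to checkpoint, assuming every buffer is dirty."""
    dirty_mb = db_block_buffers * db_block_size / (1024 * 1024)
    return dirty_mb / write_mb_per_s


def checkpoint_window_s(log_mb, groups, redo_mb_per_s, oracle8=True):
    """Seconds available before the log switch checkpoint must be complete.

    Assumption: under Oracle8 the window spans all groups bar one;
    under Oracle7 it spans a single log file.
    """
    usable_logs = (groups - 1) if oracle8 else 1
    return log_mb * usable_logs / redo_mb_per_s


# Hypothetical instance: 65536 buffers of 8 KB, 4 MB/s of datafile write
# bandwidth, four groups of 100 MB logs, 2 MB/s of redo generation.
work = checkpoint_time_s(65536, 8192, write_mb_per_s=4.0)
window8 = checkpoint_window_s(100, 4, redo_mb_per_s=2.0, oracle8=True)
window7 = checkpoint_window_s(100, 4, redo_mb_per_s=2.0, oracle8=False)

print(f"work {work:.0f}s, Oracle8 window {window8:.0f}s, Oracle7 window {window7:.0f}s")
print("Oracle8 checkpoint fits:", work <= window8)
print("Oracle7 checkpoint fits:", work <= window7)
```

With these made-up numbers the checkpoint fits in the Oracle8 window but not in the Oracle7 one, which is why adding groups only helps under Oracle8.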

The reason for this difference is that under Oracle7, if a log switch checkpoint could not be completed prior to the next log switch, then it would be aborted and a new checkpoint would be started. This could happen repeatedly until all log files had been used, and the instance would then be stuck until a checkpoint could be completed. Under Oracle8, modified blocks are linked to a checkpoint queue via their buffer headers, and checkpoint processing is merely a matter of writing buffers from the checkpoint queue in order. Therefore, it is no longer necessary to abort an incomplete checkpoint when a new checkpoint is required. That is why under Oracle8 you can increase either the log file size or the number of log groups to enlarge the checkpoint window, and you also don't have to eliminate interval checkpoints, as was the case under Oracle7.

I should also mention that this is just a band-aid solution. Increasing the checkpoint window may mask the symptoms, but the real problem is the inadequate write bandwidth. The most robust solution to your problem would be to acquire more physical disks to increase the write bandwidth (even if you don't need the disk space).

Thanks for the response, Steve. I had increased the number of groups by 50% without any effect on the number of log file switch (checkpoint incomplete) waits for an otherwise identical test, so I was doubting my decision. One question: one of the documents on your site indicates that it is better to enlarge the logs than to increase their number if you are trying to avoid checkpoints not completing. Would increasing the size generally be more effective?

Your problem is checkpointing, not log switching, so the hold_logs_open.sh script will not help you. Slow log switches result in log file switch completion waits, and secondary log buffer space waits. If you get log file switch (checkpoint incomplete) waits, that indicates that the datafile write bandwidth available to DBWn is not sufficient to checkpoint the number of dirty blocks able to be held in cache within the time that it takes the application and LGWR to generate and write enough redo to fill all the log file groups bar one. You can respond by spreading the datafiles over more physical disks (or by disk load balancing if there are hot disks) to improve the write bandwidth, or you can reduce db_block_buffers to reduce the number of dirty buffers in cache, or you can increase the size and/or number of log files to give the checkpoints a longer window to complete.
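
The distinction above can be written down as a small Python sketch. The event names are the ones quoted in this thread; the function, the input dictionary, and the simple "larger total wins" rule are my own assumptions for illustration. The wait times would have to be taken by hand from something like v$system_event.

```python
# Wait events that point at slow log switches (per the discussion above).
SLOW_SWITCH_EVENTS = {"log file switch completion", "log buffer space"}
# Wait event that points at checkpoints not completing in time.
CHECKPOINT_EVENTS = {"log file switch (checkpoint incomplete)"}


def diagnose(wait_times):
    """Classify a freeze from a dict of event name -> total time waited.

    Hypothetical helper: it compares the summed time of each group of
    events and reports whichever group dominates.
    """
    switch = sum(wait_times.get(e, 0) for e in SLOW_SWITCH_EVENTS)
    ckpt = sum(wait_times.get(e, 0) for e in CHECKPOINT_EVENTS)
    if switch == 0 and ckpt == 0:
        return "neither"
    return "checkpointing" if ckpt >= switch else "log switching"


# Made-up figures, in any consistent time unit.
sample = {
    "log file switch (checkpoint incomplete)": 5400,
    "log file switch completion": 300,
    "log buffer space": 120,
}
print(diagnose(sample))
```

If the result is "checkpointing", the remedies are the ones listed above (more write bandwidth, fewer buffers, or a longer window) rather than anything aimed at speeding up the log switch itself.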

In your February 3, 2000 Q & A, you indicated that you'd seen system freezes due to log switches instead of checkpoints. How did you determine which was the cause? On a system (8.0.5 on an HP K class box) I'm seeing numerous checkpoint not complete messages and log file switch (checkpoint incomplete) events, as well as what appears to be a general system hang during high activity. Is there any kind of additional diagnosis I can do to confirm the cause prior to implementing the script?