Ixora News - December 2000

Welcome to the December 2000 issue of "Ixora News" available online at http://www.ixora.com.au/newsletter/2000_12.htm.
(To unsubscribe send a blank email to quit-news@ixora.com.au.)

In this issue ...

The best sort   - Using event 10032 to set sort_area_size
What's in a name?   - Why to avoid long file names and big directories
Large log buffers   - Not so bad since Oracle8
Copyright issues   - A few people have been naughty
Practical Oracle8i   - A new book from Jonathan Lewis

The best sort

Last month we discussed the instance wide setting of sort_area_size to minimize the number of disk sorts. However some disk sorts are so large that it is not feasible to eliminate them. Nevertheless, using the right sort_area_size and sort_multiblock_read_count can make a big difference to their performance.

The performance of a disk sort is largely a factor of the number of merge passes required, because that is what dictates the amount of temporary tablespace I/O. The number of comparisons performed is an almost trivial factor.

In a disk sort the input rows are sorted into a number of "sort runs". These sort runs are then merged. The number of sort runs that can be merged simultaneously is called the "merge width". If the initial number of sort runs is greater than the merge width, then multiple merge passes will be required. The merge width depends both on the sort_area_size and the sort_multiblock_read_count. Up to 90% of the sort_area_size may be used for read buffers during a merge, and the rest is used for write buffers. The size of each read buffer is sort_multiblock_read_count * db_block_size, and 2 read buffers are needed per sort run to enable asynchronous read ahead (but only if disk_asynch_io is TRUE). The same write buffer configuration is used during the initial sort phase as during the merge phase, therefore, the size of each sort run is approximately 90% of the sort_area_size.

This information can be used to calculate the sort_area_size that would be needed to avoid a secondary merge phase for a large sort. For example, if it is known that the largest sort in a certain batch process sorts 12G of data, and the sort_multiblock_read_count is 8 blocks (of 8K each), then the number of initial sort runs will be 12G / (0.9 * sort_area_size), and the maximum merge width will be round(0.9 * sort_area_size / 64K) / 2. Therefore, to ensure that the number of initial sort runs is no greater than the maximum merge width, the sort_area_size must be at least sqrt(12G * 64K * 2 / 0.81) or about 43.6M. There would be virtually no benefit in using a sort_area_size greater than 43.6M but less than 12G.

The only other consideration is the number of comparison operations performed, and thus the CPU time used, during the sort and merge phases. A larger sort_area_size requires more comparisons during the initial sort phase because sorting is an n log(n) complexity operation, but having fewer sort runs saves comparisons during the merge phase (which has linear complexity). These two factors largely cancel each other out.
This technique can be used to set the sort_area_size to the optimum value for batch processes that do large disk sorts. First you need to determine the largest disk sort that is performed by the batch process. This can be done by setting event 10032 at level 1 in the session and then examining the process trace file. The trace file will contain the following information for each disk sort performed.
---- Sort Parameters ------------------------------
sort_area_size                    1048576
sort_area_retained_size           1048576
sort_multiblock_read_count        8
max intermediate merge width      7
---- Sort Statistics ------------------------------
Initial runs                              13553
Intermediate runs                         2258
Number of merges                          2259
...
The size of each disk sort can be determined from the sort_area_size and the number of initial sort runs required. The optimal sort_area_size can then be calculated as explained above. If the batch process is then tuned to use this sort_area_size in its session, the "Number of merges" shown in the trace file will be 1 even for the largest disk sorts. In this case, giving the session its own sort_area_size of 44M resulted in a 71% reduction in elapsed time.

What's in a name?

Operating systems use a name cache to cache path name to inode translations. In many systems the name cache is called the DNLC, which stands either for the Directory Name Lookup Cache or the Dynamic Name Lookup Cache. When a user program attempts to open a file, the path name is looked up in the name cache via a hash algorithm. Depending on your operating system, this may be called a namei() call or a lookuppn() call. If a cache miss occurs, it must be resolved by consulting the directory in which the file is stored. If that directory is not in the name cache, then its parent directory must be consulted, and so on up to the root of the directory tree if necessary.

The process of consulting a directory to get the inode number for a particular file is called an iget(). Some inode gets can be resolved without disk access if the inode for the directory is in the inode cache and if the directory has sufficiently few and small entries for the data to be stored directly in the inode. Such directories are called immediate directories. If the directory is not an immediate directory, then its data blocks must be read from disk sequentially until the required file name and inode number have been found. Directory blocks are small and are not necessarily contiguous so directory access can be expensive. For this reason, it is better to have deep directory structures with a low branching factor, rather than a shallow directory structure with many files in each directory. In particular, beware of allowing large numbers of files to accumulate in the archive, audit and dump destination directories. If this does occur, the directory concerned should be entirely removed and then recreated. Ideally, the sum of the lengths of all the file names in a directory should never exceed 16K.

It is also best to ensure that frequently accessed files, particularly sub-directories are created first in each directory so that they can be found more quickly. Many file systems have an administrative utility to perform this optimization for you. Another saving is available if you use symbolic links. The path to a symbolic link can be stored directly in the inode if it is no more than 59 bytes in length.

On Unix, name cache statistics are available from the sar -a command, which reports the number of inode gets, path name lookups and directory block reads per second. Think twice before representing these figures as a hit rate, because multiple inode gets may be needed to resolve a single path name lookup. If your application uses a large number of files, you may be able to improve the name cache performance by enlarging the cache. The size of the name cache can normally be tuned via a kernel parameter, commonly ncache. In some cases, such as HP-UX, the size of the name cache cannot be tuned directly, but is derived from the size of the inode cache.

One more important thing to know about the name cache is that making it bigger may have no effect on performance if you use long directory and file names. Most systems only cache file and directory names of 14 bytes or less. However, this is changing. For example, Solaris now caches names up to 31 bytes in length, and HP-UX now caches names up to 38 bytes in length. Check the system header files on your system to see what the limit is on your operating system, and remember to allow for a terminating null byte. In particular, beware of using long names for the top level directories in your Oracle file systems. A single long directory name can cause name cache misses and thus directory access for every inode below it in the file system.

Large log buffers

We have recently learned that it is not so bad to have a large log buffer anymore. Under Oracle7, if you had a large log buffer of say 15M and infrequent commits, then it was possible for 5M of redo to be generated before LGWR would be posted to flush that redo to disk (because the default background write threshold is one third of the log buffer size). In such cases, the commit response time would be poor because LGWR would commonly have a large backlog of redo to write. Under Oracle8 a 1M upper bound has been set on the background write threshold. LGWR is now posted to perform a background write whenever its backlog exceeds 1M, regardless of the size of the log buffer. This effectively limits the possible negative impact of a large log buffer on commit response time.

Copyright issues

We have recently found copies of tips and scripts from the Ixora web site on 3 other web sites. In one case our copyright notice had even been removed and replaced with another! While links, quotations, translations and derived works are fine, plagiarism is not. Please notify Ixora if you become aware of any further breach of our copyright.

Practical Oracle8i: Building Efficient Databases

Jonathan Lewis's new book "Practical Oracle8i: Building Efficient Databases" is now shipping. If you already know Jonathan's web site and the quality of his other contributions to the Oracle community, you will be expecting great things from this book. Rest assured, you will not be disappointed. For more information, please see our review on the Ixora web site's page of Advanced Oracle Resources.


Copyright Ixora Pty Ltd Send Email Home