Mirror write cache
|Have you been able to measure the impact of MWC on performance?|
|I've done some tests on this. As you would expect it does not affect reads, and the impact on writes is dependent on both the nature and intensity of the workload. The impact on sequential writes is trivial. The impact on random writes can be quite big. Here's why. The mirror write cache has 32 slots each representing a 256K chunk of the logical volume. Any write to an area not already represented in that cache must be preceded by a sync write of the cache itself to the volume group reserved area at the top of the disk. Not only is that another I/O, but it involves two potentially lengthy seeks as well. However, you would have to be very unlucky to have this happen on every write.|
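The difference between sequential and random writes described above can be illustrated with a small simulation. This is a sketch only: the 32 slots and 256K chunk size come from the answer, but the LRU replacement policy is my assumption, not documented behaviour.

```python
from collections import OrderedDict
import random

def mwc_misses(write_offsets, slots=32, chunk=256 * 1024):
    """Count writes that must be preceded by a sync write of the MWC,
    modelling it as a cache of 32 chunk numbers (LRU replacement is
    an assumption; the real policy may differ)."""
    cache = OrderedDict()
    misses = 0
    for off in write_offsets:
        chunk_no = off // chunk
        if chunk_no in cache:
            cache.move_to_end(chunk_no)   # already covered by a slot
        else:
            misses += 1                   # extra sync write plus seeks
            cache[chunk_no] = True
            if len(cache) > slots:
                cache.popitem(last=False)
    return misses

random.seed(0)
# 32M of sequential 8K writes: only one miss per 256K chunk.
seq = [i * 8192 for i in range(4096)]
# 4096 random 8K writes across a 10G volume: almost every write misses.
rnd = [random.randrange(10 * 2**30 // 8192) * 8192 for _ in range(4096)]
```

With these inputs the sequential pattern misses only 128 times (once per chunk), while the random pattern misses on almost every one of its 4096 writes, which is exactly the "unlucky on every write" case described above.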
Half block I/O operations
|I am in the process of tuning Oracle 8.0.5, HP/UX 11.0 & EMC. The EMC engineers have pointed out that from the EMC side they are seeing 8K I/O requests coming from the server. The only application running on this server is Oracle configured with 16K db_block_size. The question is, can Oracle generate 8K I/O requests even though it is configured on 16K blocks?|
|The HP-UX filesystem cache layer insists on doing all I/O in 8K chunks. You need to use raw datafiles or Advanced JFS mounted for direct I/O to get it to do otherwise. Yes, it is brain-dead. But that is unfortunately an inescapable side effect of doing buffered I/O. It applies to all the other Unix flavours that I have worked on too. The only solutions are to use either "raw" or direct I/O.|
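The splitting behaviour can be modelled very simply: a request larger than the cache buffer size is broken into buffer-sized pieces. This is a sketch, assuming buffer-aligned requests, which Oracle datafile I/O always is.

```python
def buffered_io_chunks(offset, size, buf_size=8192):
    """Split one logical I/O request into the cache-buffer-sized pieces
    that the filesystem cache layer actually issues to the device."""
    return [(off, buf_size) for off in range(offset, offset + size, buf_size)]

# A single 16K database block read becomes two 8K physical requests:
buffered_io_chunks(0, 16384)   # [(0, 8192), (8192, 8192)]
```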
|I know that LGWR should read the log files when doing recovery, but it is reading all the time, at around 20% of the write rate. Why?|
|This is because your log files are filesystem based, but Oracle's log block size does not correspond to the filesystem block size. When LGWR attempts to write a few log blocks, if the last filesystem block addressed is not in the Unix filesystem cache, and is not being fully overwritten, then LGWR must first read the filesystem block before it can overwrite part of it. This is one of many reasons why you should use raw - particularly for log files.|
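The read-before-write behaviour described above can be sketched as a rule over filesystem block numbers. This is a simplified model, not Oracle or kernel code; the 512-byte log block size in the example is a typical value, not taken from the text.

```python
def needs_prewrite_read(write_start, write_len, fs_block=8192, cached=frozenset()):
    """Return the filesystem blocks that must be read before the write:
    blocks that are only partially overwritten and not already cached."""
    end = write_start + write_len
    first, last = write_start // fs_block, (end - 1) // fs_block
    reads = []
    for b in (first, last):
        fully_overwritten = (b * fs_block >= write_start) and ((b + 1) * fs_block <= end)
        if not fully_overwritten and b not in cached and b not in reads:
            reads.append(b)
    return reads

# Three 512-byte log blocks into an uncached 8K filesystem block:
needs_prewrite_read(0, 3 * 512)              # [0] - read first, then overwrite
# A full 8K write needs no read:
needs_prewrite_read(0, 8192)                 # []
# Same partial write, but the block is in the filesystem cache:
needs_prewrite_read(0, 3 * 512, cached={0})  # []
```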
Database block size for HP-UX 11
With a view to choosing a database block size for a large OLTP database ...
1. What is the O/S block size on HP-UX 11, and
2. How many bytes does the O/S read in each read-cycle?
There is no single "O/S block size".
There are several.
If your datafiles are raw, or if you are using Advanced (Online) JFS with direct I/O enabled (convosync=direct), then the relevant block size is the device block size (DEV_BSIZE), which is defined in /usr/include/sys/param.h as 1K. This does not impose any constraint on your choice of a database block size.
If your datafiles are file system based, and you are not using direct I/O, then the relevant block sizes are the file system block size and the file system cache buffer size. The JFS file system block size can be 1K (default), 2K, 4K or 8K. The HFS file system block size can be 4K, 8K (default), 16K, 32K or 64K. The file system cache buffer size for JFS is 8K, and can be either 4K or 8K for HFS depending on the block size. JFS is preferable on performance grounds. This effectively constrains you to matching the file system block size and the database block size at 8K. However, if you are determined to use a smaller database block size, then you should use HFS and match the file system block size and the database block size at 4K.
The minimum physical I/O size (DEV_BSIZE) is 1K. The maximum physical I/O size (MAXPHYS) is 1M by default, but effectively only 256K if using LVM. Therefore, on raw, or under direct I/O, physical reads can be any multiple of 1K up to 1M. However, buffered reads are constrained to the file system cache buffer size. Nevertheless, multiblock reads should be as large as possible, to maximize the benefit of caching in the track buffers.
Database block size on NT
|The information I read on your web site indicates that a block size larger than the operating system block size benefits only raw devices in UNIX. Does this mean that in an NT environment for example where a file system is used, we should not set the block size higher than the file system block size?|
|Oracle on NT always explicitly does direct I/O. The datafiles are created and opened with FILE_FLAG_NO_BUFFERING and thus IRP_NOCACHE is always set. This means that there is no requirement on NT to match the database block size to the underlying operating system block sizes, as is the case on Unix. The down-side is that using direct I/O disables NT's fast I/O path, which means that the I/O request packets have to be hand-passed all the way down through all the kernel layers and back up again, which even the Microsoft people regard as grossly inefficient. Because of this overhead, you should actually use a larger block size on NT than you might otherwise have done on Unix, so that you can get as much data for each I/O operation as possible.|
|In your tip, "Why Raw Datafiles", you refer to read/write locks on datafiles. Are these the same as thing as the INODE lock waits shown by glance under HP-UX? If so, I have been told that this is only an HP-UX issue and does not affect Oracle on other platforms such as Solaris or NT.|
|No, it is a generic issue. They are called inode locks under Unix, because that is where the lock is applied. Under NT, however, they are implemented as ERESOURCE locks on the File Control Block (FCB). The existence of inode locks under Solaris can easily be verified by checking the header files under /usr/include/sys.|
File system buffer size
|You mentioned that it is important to set the database block size to the same as file system buffer size. How do I determine what the file system buffer size is? We use HP-UX and the VxFS file system.|
|JFS uses 8K buffers. AdvJFS can be configured to use direct I/O and thus no buffers. HFS uses either 4K or 8K buffers depending on the file system block size. Unless you have a good reason to the contrary, you should always use an 8K database block size for file system based databases on HP-UX.|
|When you say file system buffer size do you mean the file system block size?|
|No, the default block size for JFS is 1K (although you can make it 8K). When I say the file system buffer size, I mean the size of the individual buffers in the HP-UX (in your case) buffer cache in which the file system blocks are buffered.|
|Does the size of the individual buffers in the buffer cache mean the page size?|
|No, on HP-UX the page size is always 4K and the page cluster size is 8K. The cache buffer size is logically independent, but for efficiency has been chosen by the kernel developers to match the page cluster size. Most memory management operations are done in page clusters rather than pages.|
Database block size on NT
|My apologies for asking a question regarding NT, but I am having difficulty in finding this information. Anyway, should the database block size equal NT's cluster size? Also should the multiblock read size be 64K?|
|There is no need to consider the "operating system block size" when choosing a database block size on NT. This is because Oracle uses direct I/O on NT, which is not affected by the NT block sizes. However, I/O on NT is not particularly efficient, so the bigger the block sizes the better. For reasoning, please see our tips on choosing a database block size and multiblock read size. The advice in those tips is applicable to NT as well as to Unix.|
DBWn Asynchronous I/O
|Can you suggest any further reading that might explain further the concept of Asynchronous I/O, and how it relates to DBWR processes?|
|Our web tip on Use raw log files does it for LGWR. It is basically the same for DBWn. The issue is being able to queue multiple I/O requests to the operating system simultaneously. Think of it as the ability to issue "non-blocking" I/O requests. Synchronous I/O requests make the process "block" or wait until the hardware has finished the operation. Therefore only one I/O operation can be active per process at any point in time. Asynchronous I/O eliminates this artificial bottleneck.|
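A user-level analogy of the difference may help. Here Python threads stand in for kernel asynchronous I/O facilities such as aio_read/lio_listio; this is a sketch of the queueing idea, not how Oracle actually issues its I/O.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def pread_at(fd, offset, size):
    """Positioned read: no shared file pointer, so requests can overlap."""
    return os.pread(fd, size, offset)

# Build a small "datafile" of 8 blocks of 8K each.
blocks = [bytes([i]) * 8192 for i in range(8)]
fd, path = tempfile.mkstemp()
os.write(fd, b"".join(blocks))

# Synchronous: the caller blocks on each request in turn, so only one
# I/O operation is ever outstanding.
sync_result = [pread_at(fd, i * 8192, 8192) for i in range(8)]

# "Asynchronous": all eight requests are queued at once, and the
# process is not blocked behind any single one of them.
with ThreadPoolExecutor(max_workers=8) as pool:
    async_result = list(pool.map(lambda i: pread_at(fd, i * 8192, 8192), range(8)))

os.close(fd)
os.remove(path)
```

Both approaches return the same data; the difference is purely in how many requests the storage subsystem can be working on at once.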
Maximum I/O size
|What is the maximum I/O size (MAXPHYS) nowadays on HP-UX and Solaris?|
|MAXPHYS has long been fixed at 64K on Solaris, and 256K on HP-UX. However, from Solaris 2.6 it is defined in /etc/system and defaults to 128K. On HP-UX 11 it now defaults to 1M. However, I think the LVM layer still constrains I/O operations to a single logical track group, which is 256K, so the higher MAXPHYS only applies if you are not using LVM, which is most unusual. Of course, these large physical I/O sizes are only possible if you are using raw or direct I/O. Perhaps more importantly, there is an internal Oracle kernel constant (SSTIOMAX) that limits I/O operations to 512K.|
Performance of raw devices
|We are setting up an OPS database on HP using EMC disk arrays. We have performed some internal benchmarks that suggested that raw devices provide dramatically worse performance than JFS. The problem is that using raw devices is a requirement for OPS, but how can this be effectively implemented with EMC disk arrays on HP?|
|If your internal benchmark favors JFS, then there was something wrong with the benchmark. It is VERY difficult to get a level playing field for such benchmarks. The attached paper may help. It reports testing by Veritas (the makers of JFS) showing that raw is better. Their conclusion is "With proper database tuning, VxFS can deliver performance close to what raw I/O can offer". For more information, please see our tips on Use raw log files, Use raw tempfiles and Avoid buffered I/O.|
|What benefit is there (if any) in using use_readv on HP-UX 10.20? The datafiles are filesystem based and reside on EMC storage.|
|I presume you are still on 7.x, because the parameter does not exist under 8. If so, try the following test and see for yourself. Set db_file_multiblock_read_count = 32. In SQL*Plus, set timing on and perform a serial full table scan on a very large table - the largest you can find. Do it again and record the second elapsed time. Then bounce the instance, change the use_readv parameter setting and repeat the test. Once again, take the second measurement. The larger the table, the bigger the difference, but I have seen differences bigger than 10%. The one case in which you may see a degradation is when using Online JFS mounted for direct I/O (because each database block will be read separately and serially, as for buffered reads).|
1 February 2000
What is your opinion on whether to go for large database files or have multiple smaller files?
We have many tables with 25+ million rows.
All of these have N 512M extents, each extent residing in a 560M datafile.
The files themselves are carefully laid out in different logical volumes.
The logical volumes are 8G each, striped over 2*4G or 4*2G EMC mirrored LUNs.
Thought has also been given to the underlying physical disks within each logical volume, to ensure that two similar extents do not end up on the same spindles.
I have set the default parallelism according to the number of extents (bearing in mind that it doesn't go over the number of CPUs).
I would have thought that smaller files and parallel query should be more beneficial than one large datafile. Having said this, I am beginning to wonder whether this is true, because the logical volumes are striped anyway.
|In general, I favour more, smaller files. The objective is maintaining the all-important I/O separation for concurrently accessed segments. It also reduces the risk of inode lock contention on file systems. Whether your I/O separation is working as intended is, of course, something I cannot comment on from here. By the way, my recommendation is to make the degree of parallelism 1 less than the number of disks (extents), without regard for the number of CPUs, I/O channels, and so on.|
Raw devices on NT
|Can I setup raw devices using IDE disks?|
|Yes, it has more to do with the operating system than the hardware. For example, you can use a raw partition from an IDE drive as a datafile under NT, but not under Linux. The syntax for NT is just to use the filespec '\\.\x:' where x is the drive letter for the partition.|
|We're doing some initial benchmarking of a new application. What we see is that a checkpoint stalls the database for a couple of seconds, and since this is a real-time system that cannot be tolerated. Where would you suggest we start digging? Redo logs are stored on a separate disk with direct-io enabled.|
|I have hit something like this before, but it was the log switch itself that was freezing things, rather than the checkpoint. The solution was to have a sleeping process hold open file descriptors on all the log files to speed up the open() system call. Have a look at our tip on Tuning Log Switches for more explanation and a script.|
Database block size
|What performance difference can be expected from rebuilding a database with a larger database block size? Some articles say that it is not noticeable, whereas other say that it is dramatic.|
|It depends. The key factors are the I/O cost of index range scans, and the extra rotational latency for partial block writes. If you make extensive use of index range scans, or have a DBWn bottleneck, then the difference is likely to be very noticeable. Otherwise, it may not be. For more information, please see our tip on this matter.|
File system performance problems
In your "Why Raw Datafiles" tip, you refer to three problems:
It differs for each. HP MeasureWare can measure the first. In general, however, the tell-tale sign is that I/O response times as shown by Oracle are much longer than for other datafiles, and yet the disk is well short of being 100% busy.
To measure the second, I have written C code that examines the kernel data structures directly, in particular the queue depths in the device driver and LVM layers. A simpler test is to write a shell script that just does

while : ; do date; sleep 1; done > $ORACLE_BASE/date.txt &

and then copy a few large files, like some archive log files, into the same file system. If you have this problem, then you will see that the date shell script fails to run at all for several seconds. How long the pause lasts depends on how big the backlog is.
The third is relatively easy to spot. You just watch the scan rate using vmstat while doing a parallel full table scan. If paging goes sky high, and takes a while to recover, then file system buffer thrashing is the culprit.
File system reads
|Some stuff that I've seen says that file systems are better than raw for reads. What do you think? Have you got any statistics?|
|Certainly, if you get hits in the file system buffer cache, then the response time for Oracle's so called physical reads will look good. But wouldn't you rather cache the same data using the same physical memory, but in Oracle's own buffer cache, where it can be managed by Oracle's own LRU mechanisms, and accessed more directly? As for statistics, I suggest that you look at the presentation about this on Jonathan Lewis' web site.|
Maximum I/O sizes
|I am still trying to understand about the maximum I/O sizes that apply in various cases. What is the maximum size for an Oracle I/O operation for each of the major Unixes and NT, for raw and cooked file systems?|
|For all buffered I/O, the maximum (and only) size for all physical I/O operations is one buffer. That buffer size is generally 8K. For raw, or file system based direct I/O, the maximum physical I/O size is either MAXPHYS or possibly a lower Oracle limit. Under Oracle7, there was a limit of 64K for LGWR (which is what we were discussing before). Under Oracle8 it is 512K or MAXPHYS, whichever is lower. MAXPHYS has changed from 256K to 1M between HP-UX 10.20 and HP-UX 11. On Solaris it changed from 64K (fixed) to 128K (tunable) between Solaris 2.5 and Solaris 2.6. On AIX it used to be fixed at 64K, but I don't know whether this has changed. On NT it is 16 memory pages, which is 64K on 32-bit hardware, or 128K on 64-bit hardware (i.e. Alpha).|
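The limits in the answer combine as a simple minimum. The numbers below are the ones quoted above; check your own platform's headers and kernel tunables before relying on them.

```python
def max_physical_io(maxphys, oracle_limit=512 * 1024,
                    lvm_track_group=None, buffered_buf=None):
    """Effective maximum size of one physical I/O: the smallest of the
    applicable limits (Oracle8's 512K SSTIOMAX is the default here)."""
    limits = [maxphys, oracle_limit]
    if lvm_track_group:          # HP-UX LVM constrains I/O to one track group
        limits.append(lvm_track_group)
    if buffered_buf:             # buffered I/O: one cache buffer per physical I/O
        limits.append(buffered_buf)
    return min(limits)

# HP-UX 11, raw under LVM: min(1M, 512K, 256K) = 256K
max_physical_io(1024 * 1024, lvm_track_group=256 * 1024)   # 262144
# Solaris 2.6, buffered filesystem I/O: one 8K buffer
max_physical_io(128 * 1024, buffered_buf=8192)             # 8192
```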
|If I have a system with write intensive files on a RAID5 array, which wait event is likely to show the impact on performance? Is it write complete waits, or is there another way to determine the impact?|
|If log files are on RAID5, then it is the average log file sync time that affects you. If datafiles are on RAID5, then it affects the performance of both direct writes, and DBWn writes. However, slow DBWn writes don't hurt, unless DBWn cannot keep up with its workload. If so, you'll see free buffer waits. The direct writes issue is visible in the average write time to the target tablespace (normally TEMP), as opposed to the average read time on the same tablespace.|
|Can you comment on using soft links for oracle data files (filesystem)? Oracle manuals advise using them, but I've heard different opinions. What is your opinion? Are there any drawbacks with links?|
|I commonly use links for raw datafiles. Doing so aids the adoption of meaningful naming conventions. However, there is no reason to use them for file system based datafiles. It is not really a big performance issue, because datafiles are normally only opened once per process.|
Database block size for OPS
|For an OPS database with very high transaction rates and index-based access patterns, could one induce a potential performance problem when going from a database block size of 8K to one of 16K? (The probability of accessing the same block concurrently from multiple instances doubles theoretically). Is perhaps the Oracle version a key factor?|
|The benefits of a larger block size are even greater in OPS than otherwise, because of the reduced messaging for lock acquisition. However, if you do have pinging, then the larger block size will accentuate it. But the mitigation for pinging is proper partitioning of the application, rather than using a small block size. Of course, because of cache fusion, release 8.1 copes much better with a poorly partitioned workload, whatever the block size.|
|I'm on Solaris 2.6, file system based. What must I do to get direct I/O? Does it happen automatically if the forcedirectio mount option is used? Does the db_file_direct_io_count parameter have to be set?|
|You are in danger of confusing Solaris direct I/O with Oracle direct I/O. The forcedirectio mount option avoids buffering in the file system buffer cache in the Solaris page pool. Oracle direct I/O (controlled by the db_file_direct_io_count parameter) avoids buffering in the database block buffers in the SGA. Oracle direct I/O is used automatically for parallel table scans and for most temp tablespace I/O. Solaris direct I/O is used for all file systems that have been mounted for direct I/O. These two forms of direct I/O are not related. They can be used in combination with each other, or separately.|
Read/write datafile locks
In your tip called "Why Raw Datafiles" you say ...
Read/write locks are needed because buffered reads cannot be guaranteed to be satisfied atomically. So operating systems use these locks to ensure that file system buffers against which a read is pending cannot be modified, lest the read return self-inconsistent data.

Can you please explain this? I don't see how concurrent reading and writing could be a problem.
The problem is that the operating system cannot assume that the concurrent read and write operations are unrelated.
If they address the same buffers, and if the timing is "right" you could see something like this ...
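A deterministic sketch of the race may make it concrete. The interleaving is forced by hand here; on a real system it is the timing of an unlocked read against an in-place write that produces it.

```python
def torn_read_demo():
    """Model a buffered read racing an in-place write. The 'disk block'
    is rewritten in two pieces; a read between the pieces sees a mixture
    of old and new data - exactly what the read/write locks prevent."""
    buffer = bytearray(b"A" * 16)   # old contents: all 'A'
    new = b"B" * 16                 # the writer wants it to be all 'B'

    buffer[:8] = new[:8]            # writer copies the first half ...
    snapshot = bytes(buffer)        # ... an unlocked reader runs NOW ...
    buffer[8:] = new[8:]            # ... then the writer finishes

    return snapshot

torn_read_demo()   # b'BBBBBBBBAAAAAAAA' - self-inconsistent
```

The reader's snapshot is neither the old contents nor the new ones, which is why the operating system blocks the write until the pending read completes.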
Copyright © Ixora Pty Ltd