From:    Steve Adams
Date:    06-Jan-2001 10:50
Subject: Parallel server availability

Here is some more information on the Shareplex option from Geof Liddon at CP Rail.

We use Shareplex here (CP Rail) for high availability, and it is a very viable solution.

We have a 2-node AIX 4.3 S80 configuration running Oracle 7.3.4.3 (with upgrades to 8.1.6.2 planned in the next couple of months) with a 150GB OLTP instance.

We have been able to achieve automatic reconnection by using the failover capability of SQL*Net (2 DESCRIPTION entries pointing to the 2 different instances under 1 generic service name) and by only running the listener for the active workload. We also keep the inactive workload in restricted mode so that we can ensure only DBAs have access to it.
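The connect-time behaviour described above can be pictured with a small client-side sketch. This is only an illustration of the idea (try each instance in turn, and let the node with no listener refuse the connection), not SQL*Net's own implementation. The function name, the address strings and the `connect` callable are all hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(addresses: Sequence[str],
                          connect: Callable[[str], T]) -> T:
    """Try each address in order and return the first successful connection.

    `connect` is a caller-supplied function (hypothetical here) that opens a
    session to one address and raises ConnectionError if nothing is listening.
    """
    last_error = None
    for address in addresses:
        try:
            return connect(address)
        except ConnectionError as exc:
            # No listener on this node (the inactive side): try the next one.
            last_error = exc
    raise ConnectionError(
        f"no listener answered on any of {list(addresses)}"
    ) from last_error
```

Because only the active workload has a listener running, the same list of addresses works before and after a switch; clients simply land on whichever node answers.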

One big benefit of Shareplex is that, since it works at a higher level, you can easily move between hardware platforms for upgrades (we did this when moving from HP-UX to AIX).

Be aware that Shareplex does not provide the zero-outage solution that is wanted in this thread. We have an outage window of 15 minutes when we need upgrades, but in reality we achieve a failover in less than 4 minutes. Most of that time is spent disabling and enabling triggers between the active and inactive workloads, so the outage could be under 1 minute if the environment were set up properly.

We process about 30,000,000 updates/inserts/deletes a day, and Shareplex keeps everything in sync very well. The exception is when we have some badly written purges that try to delete 750,000 records in 1 transaction; it gets a bit behind there, but for normal business activities it keeps everything in sync very nicely. It also allows us to perform any index or table rebuilds without any impact at all on the business, as we can do all this maintenance on the "inactive" workload instance.

Here is a follow-up on this thread from Paul Vallee.

I've never used it in production, but there is a shared-nothing solution, Quest's Shareplex, that solves this problem by working at a higher level. You don't get niceties such as automated reconnection (but you can code that into your appserver, API, or app). However, you can do rolling upgrades (even hardware/OS upgrades) that are impossible with shared-everything solutions such as OPS.

The engineering demos are very impressive, and I can point you towards a sales rep who is friendly and low-pressure if you like. It is expensive, but still in the same ballpark as OPS if you haven't licensed it yet.

Rolling upgrades between major releases are not possible, because that is when Oracle changes the file formats if necessary. For example, you cannot do a rolling upgrade from 8.0 to 8.1. The first instance to open a database records the full version number of the software that it is running in the controlfile. Every other instance that attempts to open the database checks that the first two components of the software version number match. If not, an ORA-407 error is signalled and the database is not opened.
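The version check described above amounts to comparing the first two components of two version strings. The following is a minimal sketch of that rule only, not Oracle's code; the function names are made up for illustration.

```python
from typing import Tuple


def release_prefix(version: str) -> Tuple[int, int]:
    """Return the first two components of a dotted version, e.g. (8, 1)."""
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least two components in {version!r}")
    return int(parts[0]), int(parts[1])


def can_open_alongside(first_instance_version: str,
                       this_instance_version: str) -> bool:
    """True if the first two version components match.

    Mirrors the rule in the text: the first instance's version is recorded,
    and later instances must agree on the first two components.
    """
    return release_prefix(first_instance_version) == release_prefix(
        this_instance_version
    )
```

Under this rule `can_open_alongside("8.1.6.2", "8.1.7.0")` is true, while `can_open_alongside("8.0.6.0", "8.1.5.0")` is false, which corresponds to the case where the ORA-407 error is signalled.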

Depending on who you speak to at Oracle, rolling upgrades between minor releases and for patch applications are technically possible but either not supported or strongly discouraged. The reason is that minor differences in locking behaviour between instances could lead to deadlocks and even corruptions. Even without those serious problems, applications may well fail during rolling upgrades as catalog objects are recreated and recompiled. In my opinion, it is just not worth the risk.

If you avoid fixed locking, and use incremental checkpointing and the fast-start features, then you should be able to get your recovery times right down to a handful of seconds, but never to zero. There will always be a small window of total application unavailability during the rebuilding of the instance lock database in memory.

I work on a project where we have very high availability requirements. We use a Sun cluster with OPS (8.1.7). As far as I know, OPS has to be shut down in order to install a new version or even a patch set. Is this correct? Do you know of any workarounds to keep the application running during a software upgrade? Can you explain to me why Oracle is so strict and doesn't allow different versions within a cluster? It is hard to get any information about a missing feature from the Oracle documentation.

In our labs we have done some stress tests (e.g. pulling the power cord of one node, or just panicking one node in a cluster). Depending on the current load, we observed an interruption of application availability (no new connection could be established to any node within the cluster) of between 10 sec and 5 min. Is there a way to bring this "overall application downtime" to 0 sec? Actually, I think that when an instance recovery runs, it will certainly have a performance impact, but it should not block the application at all. Can you think of any configuration error we might have made?
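For anyone repeating this kind of test, one way to put a number on the interruption is to probe for new connections at a fixed interval and compute the longest gap afterwards. The sketch below assumes the probe results have already been collected as (timestamp, success) pairs; the function name and the sample format are hypothetical.

```python
from typing import Iterable, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Longest unavailable window, in seconds, from time-ordered probe results.

    Each sample is (timestamp_seconds, connected_ok). An outage runs from the
    first failed probe to the next successful one. An outage that is still
    open when the samples end is not counted.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp  # first failed probe of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest
```

The result is only as precise as the probe interval, so a short interval is needed to tell a 10 second interruption from a 15 second one.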