Applies to: Exchange Server 2007 SP3, Exchange Server 2007 SP2, Exchange Server 2007 SP1, Exchange Server 2007
Topic Last Modified: 2007-10-26

Single copy clusters (SCCs) offer redundancy for the services that provide access to data. Service redundancy enables rapid recovery, without data loss, when the host node fails. Because an SCC passes the storage containing the databases to the new node as part of failover, service should be restored with the data intact.

However, in an SCC, the storage subsystem is a single point of failure. A complete failure of the storage subsystem typically produces an outage of a day and an average data loss of 12 hours, assuming that full backups are taken daily. In addition, the storage configuration for an SCC solution is typically more complex to install and operate than that required by cluster continuous replication (CCR), which is the other type of Exchange cluster solution. For more information about CCR, see Cluster Continuous Replication.
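The 12-hour figure follows from simple arithmetic. As a minimal sketch in Python, assuming that a storage failure is equally likely at any moment between two full backups and that nothing written since the last backup survives, the average loss is half the backup interval. The function names are illustrative and are not part of any Exchange tooling.

```python
def expected_data_loss_hours(backup_interval_hours: float) -> float:
    """Average hours of data lost when storage fails between full backups.

    Assumes the failure time is uniformly distributed over the backup
    interval and that only the most recent full backup can be restored.
    """
    if backup_interval_hours <= 0:
        raise ValueError("backup interval must be positive")
    return backup_interval_hours / 2


def worst_case_data_loss_hours(backup_interval_hours: float) -> float:
    """Hours lost if storage fails just before the next backup would run."""
    if backup_interval_hours <= 0:
        raise ValueError("backup interval must be positive")
    return backup_interval_hours


print(expected_data_loss_hours(24))    # 12.0
print(worst_case_data_loss_hours(24))  # 24
```

Under the same assumptions, shortening the backup interval reduces both the average and the worst-case loss proportionally.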

SCC recovery behavior can be separated into two types of outages: scheduled outages and unscheduled outages.

The following table describes the expected recovery actions for a variety of failures. Some failures require the administrator to initiate the recovery while other failures are automatically handled by the Windows Cluster service.

Scheduled and unscheduled outages are triggered differently, but both result in a passive node being activated and the databases being mounted, provided that the shared disks transition successfully. If the shared disks fail to transition correctly, perhaps because of a configuration error, the behavior is also the same for both types of outage: the affected databases are not mounted.

Note:
Only one clustered mailbox server can be activated on a passive node at any given time. If a node is already hosting an active clustered mailbox server, it cannot bring another server online.
Note:
Unlike previous versions of Exchange, Microsoft Exchange Server 2007 does not trigger SCC automatic unscheduled outages (failover) as a result of database failures.
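
The activation rules described above can be modeled as a small state check. The following Python sketch is a simplified illustration, not how the Windows Cluster service is implemented: the `Node` class, the `activate` function, and the outcome strings are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    # Name of the clustered mailbox server (CMS) this node hosts, if any.
    active_cms: Optional[str] = None


def activate(cms: str, target: Node, disks_transitioned: bool) -> str:
    """Return the outcome of bringing `cms` online on `target`.

    Models two rules from the text: a node hosts at most one active CMS,
    and databases mount only if the shared disks transition successfully.
    """
    if target.active_cms is not None:
        return "refused: node already hosts an active clustered mailbox server"
    target.active_cms = cms
    if not disks_transitioned:
        return "activated: affected databases not mounted"
    return "activated: databases mounted"


passive = Node("NODE2")
print(activate("CMS1", passive, disks_transitioned=True))
print(activate("CMS2", passive, disks_transitioned=True))
```

The second call is refused because NODE2 is already hosting CMS1, regardless of whether the outage was scheduled or unscheduled.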

Recovery actions for failures

Each entry lists the failure description, then the action, then comments.

Operating system stop error; operating system stops responding; complete power failure of a node; unrecoverable failure of the processor chip, motherboard, or backplane; or complete communication failure for a node

Automatic failover to passive node, if available. Databases are mounted as their storage comes online.

For a passive node to be available, it must be possible to establish a quorum after the failure. This means that the remaining node must be able to access the quorum.

Total storage failure on the active node

Storage failures are reported to and through the monitoring system. The administrator can recover the storage or, failing that, must recover from backups.

Automatic failover to passive node, if available. Databases are mounted as their storage comes online.

For a passive node to be available, it must be possible to establish a quorum after the failure. This means that the remaining node must be able to access the quorum.

Total storage failure

Storage failures are reported to and through the monitoring system. The administrator can recover the storage or, failing that, must recover from backups.

This failure is reported as a failure of the cluster (and all of its resources) because the quorum and databases are not accessible.

Data center failure

Automatic failover not supported without a third-party replication solution.

Replication must be synchronous if replicating from live data.

Operating system drive failure

No automatic recovery action. Exchange does not detect this failure unless the operating system fails, and detection is based on apparent failures rather than root cause.

Operating system drive failure is reported by the operating system monitoring services and may cause the operating system to fail.

Operating system drive out of space

Automatic failover to passive node, if available. Databases are mounted as their storage comes online.

This failure is reported to and through the monitoring services. If automatic failover does not or cannot occur, the recovery action for this scenario is determined by the administrator.

Failure of the cluster's public network on the active node

Same recovery action as for the complete power failure scenario.

There is no detection of public network health beyond the hardware and software used to communicate between the active and passive nodes. Verification of actual client connectivity is not provided by Exchange 2007.

Complete failure of the cluster's public network

No automatic recovery action.

If the public network is lost, the IP Address resources will enter a failed state. After the public network issue is addressed, the resources can be brought back online.

Loss of cluster quorum

Clustered mailbox servers and cluster quorum are offline.

This scenario will result in no service if a quorum cannot be formed.

Information store failure

Automatic restart of the information store resource.

After repeated failures, the administrator can try to manually move the clustered mailbox server to a passive node in an attempt to bring it online.

Application (binary file) drive failure

No automatic recovery action.

Generally, this scenario will result in other failures that are reported to and through monitoring services and are actionable by the administrator. The recovery action for this scenario is determined by the administrator.

Application (binary files) drive out of space

No automatic recovery action.

Monitoring services report this condition. The recovery action for this scenario is determined by the administrator.

Complete loss of database or storage group, or complete database failure

Automatic attempt to remount the affected databases. If this attempt fails, the database will remain in a failed state, but no failover of the clustered mailbox server will occur.

The storage group or database either is dismounted due to software failure or corruption, or has failed because of hardware failures. For example, a storage group does a forced dismount of all databases when its log directory is not available. The administrator determines the corrective action. Recovery could be a scheduled outage to activate the passive node.

Partial failure of storage group or database, some data unavailable, or initial database mount failure

No automatic recovery action.

Partial failure means that some corruption has been reported, but the corruption did not force a dismount of the storage group or database. If a database does not mount at startup, no action is taken and monitoring services report the failure. The Mailbox server generates events when this is detected, which can be reported by the monitoring services. Monitoring will also detect and report dismounted databases.

Corrupted log detected for storage group

No automatic recovery action.

Monitoring services report this condition.

Database or transaction log drive out of space

No automatic recovery action. The databases in the storage group will be dismounted.

The low drive space condition is reported through the monitoring system. The administrator determines the corrective action.
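
For runbooks or monitoring rules, the table above can be encoded as a lookup that indicates whether a failure is handled automatically or needs the administrator. The Python sketch below covers a subset of the rows. The keys are hypothetical identifiers chosen for this example, and the classification of the total storage failure and data center failure rows as having no automatic recovery is an interpretation of the table (the latter assumes no third-party replication solution is in place).

```python
from enum import Enum


class Recovery(Enum):
    FAILOVER = "automatic failover to passive node, if available"
    RESTART = "automatic restart of the information store resource"
    REMOUNT = "automatic attempt to remount the affected databases"
    NONE = "no automatic recovery action"


# Keys are hypothetical identifiers; values follow the table above.
RECOVERY_ACTIONS = {
    "node_failure": Recovery.FAILOVER,
    "total_storage_failure": Recovery.NONE,   # administrator recovers storage or backups
    "data_center_failure": Recovery.NONE,     # assumes no third-party replication
    "os_drive_failure": Recovery.NONE,
    "os_drive_out_of_space": Recovery.FAILOVER,
    "public_network_failure_on_active_node": Recovery.FAILOVER,
    "public_network_complete_failure": Recovery.NONE,
    "information_store_failure": Recovery.RESTART,
    "application_drive_failure": Recovery.NONE,
    "application_drive_out_of_space": Recovery.NONE,
    "database_complete_failure": Recovery.REMOUNT,  # no failover occurs
    "database_partial_failure": Recovery.NONE,
    "corrupted_log": Recovery.NONE,
    "database_or_log_drive_out_of_space": Recovery.NONE,  # databases are dismounted
}


def needs_administrator(failure: str) -> bool:
    """True when the table lists no automatic recovery for this failure."""
    return RECOVERY_ACTIONS[failure] is Recovery.NONE


for failure, action in RECOVERY_ACTIONS.items():
    marker = "ADMIN" if needs_administrator(failure) else "AUTO "
    print(f"{marker} {failure}: {action.value}")
```

Even for rows marked as automatic, the comments in the table still apply. For example, failover requires that a quorum can be established, and repeated information store failures may call for a manual move.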