New High Availability and Site Resilience Functionality in Exchange 2010 SP1

Applies to: Exchange Server 2010 SP1

Topic Last Modified: 2012-04-25

Microsoft Exchange Server 2010 Service Pack 1 (SP1) includes new features, as well as enhancements to features introduced in the release to manufacturing (RTM) version of Exchange 2010. The new and improved features extend the scenarios in which you can achieve data and service availability for your Exchange 2010 environment.

The following new features for high availability and improvements to existing high availability features are available in Exchange 2010 SP1:

Continuous replication - block mode
Active mailbox database redistribution
Enhanced datacenter activation coordination mode support
New and enhanced management and monitoring scripts
Exchange Management Console user interface enhancements
Improvements in failover performance
Extensible Storage Engine recovery on hung I/O

These features are discussed in greater detail below.

Continuous Replication - Block Mode

In the RTM version of Exchange 2010 and in all versions of Exchange Server 2007, continuous replication operates by shipping copies of the log files generated by the active database copy to the passive database copies. Beginning with Exchange 2010 SP1, this form of continuous replication is known as continuous replication - file mode. Exchange 2010 SP1 also introduces a new form of continuous replication known as continuous replication - block mode. In block mode, as each update is written to the active database copy's active log buffer, it's also shipped to a log buffer on each of the passive database copies. When the log buffer is full, each database copy builds, inspects, and creates the next log file in the generation sequence. If a failure affects the active copy, the passive copies will have been updated with most or all of the latest updates. To prevent replication issues from affecting the client experience, the active copy doesn't wait for replication to complete.

Continuous replication - block mode is only active when continuous replication is up-to-date in file mode. The transition into and out of block mode is performed automatically by the log copier. Block mode dramatically reduces the latency between the time a change is made on the active copy and when the change is replicated to passive copies. In addition to replicating individual log file writes, block mode also changes the activation process for a passive copy. If a copy is in block mode when a failure occurs, the system uses whatever partial log content is available during the activation process. As a result, the current log file on the active copy is no longer a single point of failure.
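The difference between file mode and block mode at activation time can be illustrated with a small model. The following Python sketch is not Exchange code: the class, the function names, and the buffer capacity of four updates are all hypothetical and chosen only for the demonstration. It models a passive copy that receives each update in its log buffer, closes a log file when the buffer fills, and can use partial log content on activation only when it is in block mode.

```python
from dataclasses import dataclass, field

# Hypothetical capacity: number of updates that fill one log buffer in this demo.
LOG_BUFFER_CAPACITY = 4


@dataclass
class DatabaseCopy:
    name: str
    buffer: list = field(default_factory=list)     # in-memory log buffer
    log_files: list = field(default_factory=list)  # closed log generations

    def receive(self, update):
        """Append one update and close a log file when the buffer is full."""
        self.buffer.append(update)
        if len(self.buffer) == LOG_BUFFER_CAPACITY:
            self.log_files.append(tuple(self.buffer))
            self.buffer.clear()


def write_update(active, passives, update):
    """Write to the active buffer and ship the same update to each passive buffer."""
    active.receive(update)
    for copy in passives:
        copy.receive(update)


def updates_available_at_activation(copy, block_mode):
    """Count the updates a passive copy can use when it is activated."""
    closed = sum(len(log) for log in copy.log_files)
    # Simplification: in file mode only closed log files count;
    # in block mode the partial content of the buffer is usable as well.
    return closed + (len(copy.buffer) if block_mode else 0)


active = DatabaseCopy("DB1-active")
passive = DatabaseCopy("DB1-passive")
for n in range(6):
    write_update(active, [passive], f"update-{n}")

print(updates_available_at_activation(passive, block_mode=True))   # 6
print(updates_available_at_activation(passive, block_mode=False))  # 4
```

With six updates and a buffer of four, one log file has been closed and two updates remain in the buffer. In this model, the block mode copy can use all six updates, while a file mode copy would have only the four in the closed log file.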

Active Mailbox Database Redistribution

Exchange 2010 SP1 includes a script called RedistributeActiveDatabases.ps1 that can be periodically run by administrators to balance the distribution of active database copies across a database availability group (DAG) based on administrator-configured activation preference. In addition, copy distribution awareness has been added to the Active Manager best copy selection process. Specifically, the first pass of best copy selection for lossless switchovers now sorts the possible targets by preference instead of least loss.
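The change in sort order can be sketched as follows. This Python example is an illustrative model only, not the Active Manager implementation: the `CopyCandidate` type, its fields, and the use of copy queue length as the measure of loss are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyCandidate:
    server: str                 # hypothetical server name
    activation_preference: int  # 1 is the most preferred copy
    copy_queue_length: int      # assumed loss measure: logs not yet copied


def rank_targets(candidates, by_preference=True):
    """Return server names ordered by preference (SP1) or by least loss (RTM)."""
    def preference_first(c):
        return (c.activation_preference, c.copy_queue_length)

    def least_loss_first(c):
        return (c.copy_queue_length, c.activation_preference)

    key = preference_first if by_preference else least_loss_first
    return [c.server for c in sorted(candidates, key=key)]


candidates = [
    CopyCandidate("MBX1", activation_preference=2, copy_queue_length=0),
    CopyCandidate("MBX2", activation_preference=1, copy_queue_length=2),
    CopyCandidate("MBX3", activation_preference=3, copy_queue_length=1),
]

print(rank_targets(candidates, by_preference=True))   # ['MBX2', 'MBX1', 'MBX3']
print(rank_targets(candidates, by_preference=False))  # ['MBX1', 'MBX3', 'MBX2']
```

Sorting by preference first means that, over repeated switchovers, active copies tend to return to the servers the administrator preferred, which is what keeps the distribution balanced.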

Enhanced Datacenter Activation Coordination Mode Support

Exchange 2010 RTM includes a configuration mode for DAG site resilience support called Datacenter Activation Coordination (DAC) mode. In DAC mode, Exchange cmdlets can be used to perform a data center switchover. In the RTM version, DAC mode is limited to DAGs with at least three members, at least two of which are in the primary data center.

In Exchange 2010 SP1, DAC mode has been extended to support two-member DAGs that have each member in a separate data center. DAC mode support for two-member DAGs uses the witness server to provide additional arbitration. In addition, DAC mode has been extended to support DAGs that have all members deployed in a single Active Directory site, including single Active Directory sites that have been extended to multiple locations.

New and Enhanced Management and Monitoring Scripts

Exchange 2010 SP1 includes several new and enhanced scripts that greatly improve the management and monitoring experience:

CheckDatabaseRedundancy.ps1 (new): You can use this script to check the redundancy of replicated databases. It generates events if database resiliency is found to be in a compromised state (for example, there's only one healthy copy of a replicated database). The script is accompanied by a Microsoft System Center Operations Manager 2007 management pack change that can be used to monitor databases without redundancy, which is particularly useful in environments without RAID.
StartDagServerMaintenance.ps1 and StopDagServerMaintenance.ps1 (new): You can use StartDagServerMaintenance.ps1 to take a DAG member out of service for maintenance. It moves active databases off the server and blocks databases from moving to that server. It also makes sure that all critical DAG support functionality (for example, the Primary Active Manager (PAM) role) that might be on the server is moved to another server and blocked from moving back. Another script, StopDagServerMaintenance.ps1, is provided to complete the operation and remove the blocks.
CollectOverMetrics.ps1 (enhanced): You can use this script to collect switchover and failover data. This script has been enhanced in Exchange 2010 SP1 to include metrics for continuous replication - block mode, and more details from the replication and replay pipeline. It also features enhanced reporting.
CollectReplicationMetrics.ps1 (enhanced): This script is an active form of monitoring because it collects metrics related to continuous replication in real time while the script is running. The script supports parameters that enable you to customize the script's behavior and output.
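The kind of check that CheckDatabaseRedundancy.ps1 is described as performing can be modeled in a few lines. The Python sketch below is not the script's logic: the input shape, the set of statuses treated as healthy, and the threshold of two healthy copies are assumptions made for illustration.

```python
from collections import defaultdict

# Assumptions for this sketch: which copy statuses count as healthy,
# and how many healthy copies a database needs to be considered redundant.
HEALTHY_STATES = {"Healthy", "Mounted"}
MIN_HEALTHY_COPIES = 2


def find_at_risk_databases(copy_statuses):
    """Return databases with fewer healthy copies than the minimum.

    copy_statuses is an iterable of (database, server, status) tuples.
    """
    healthy = defaultdict(int)
    seen = set()
    for database, _server, status in copy_statuses:
        seen.add(database)
        if status in HEALTHY_STATES:
            healthy[database] += 1
    return sorted(db for db in seen if healthy[db] < MIN_HEALTHY_COPIES)


statuses = [
    ("DB1", "MBX1", "Mounted"),
    ("DB1", "MBX2", "Healthy"),
    ("DB2", "MBX1", "Mounted"),
    ("DB2", "MBX2", "FailedAndSuspended"),
]

print(find_at_risk_databases(statuses))  # ['DB2']
```

In this example DB2 has only one healthy copy, which is the compromised state that would warrant an alert.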

Enhanced Exchange Management Console User Interface

Exchange 2010 SP1 includes Exchange Management Console (EMC) enhancements for managing DAGs. For example, the EMC now includes support for managing IP addresses and alternate witness server settings for DAGs. It's no longer necessary to use the Exchange Management Shell to configure these settings.

Improved Failover Performance

Exchange 2010 SP1 includes changes to improve failover and switchover performance and behavior. In the RTM version of Exchange 2010, when either a failover or a switchover occurs, the passive copy being activated immediately stops replaying log files that were copied to that passive copy. The active copy is then dismounted (if it's not already), and any remaining log files are copied to the passive copy being activated. Assuming that any missing data is within the automatic database mount dial setting, the passive copy is made the new active copy and the database is mounted in a dirty shutdown state. At this point, all log files that were copied to the previously passive (and now active) copy will be replayed to make the database consistent.

In Exchange 2010 SP1, when either a failover or a switchover occurs, the Microsoft Exchange Replication service on the passive copy being activated continues to replay log files that have been copied to the passive copy until the last log file generated by the active copy is copied to it. This enables a mount operation to be performed against a database that is in a nearly consistent state.

Other changes adjust time-outs and algorithmic details to improve failover performance, as well as I/O performance after failovers.

Extensible Storage Engine Recovery on Hung I/O

Exchange 2010 SP1 includes new recovery logic that makes use of the built-in Windows bugcheck behavior when certain conditions occur. Specifically, Extensible Storage Engine (ESE) has been updated to detect when I/O is hung and to take corrective action to automatically recover the server. ESE maintains an I/O monitoring thread that detects when an I/O has been outstanding for a specific period of time. By default, if an I/O for a database is outstanding for more than one minute, ESE logs an event. If a database has an I/O outstanding for more than four minutes, ESE logs a specific failure event, if it's possible to do so. ESE events 507, 508, 509, or 510 may or may not be logged, depending on the nature of the hung I/O. If the problem is such that the operating system volume is affected or the ability to write to the event log is affected, the events aren't logged. If the events are logged, the Microsoft Exchange Replication service (MSExchangeRepl.exe) intentionally terminates the wininit.exe process to cause a bugcheck of Windows.
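The default thresholds described above can be expressed as a simple classification. The following Python sketch only restates those thresholds; it is not ESE's monitoring code, and the function name and return values are hypothetical.

```python
from datetime import timedelta

# Default thresholds as described in the text above.
EVENT_AFTER = timedelta(minutes=1)
FAILURE_EVENT_AFTER = timedelta(minutes=4)


def classify_outstanding_io(age):
    """Map the age of an outstanding database I/O to the described action."""
    if age > FAILURE_EVENT_AFTER:
        # A specific failure event is logged, if it's possible to do so.
        return "log-failure-event"
    if age > EVENT_AFTER:
        return "log-event"
    return "ok"


print(classify_outstanding_io(timedelta(seconds=30)))  # ok
print(classify_outstanding_io(timedelta(seconds=90)))  # log-event
print(classify_outstanding_io(timedelta(minutes=5)))   # log-failure-event
```

Both comparisons are strict, matching the "more than" wording: an I/O that has been outstanding for exactly one minute is still classified as `ok` in this model.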

In some cases, the entire storage stack may be affected by the hang, making it impossible to write failure events to the crimson channel or any other area of the Windows Event Log. When the operating system I/O is hung, the system can't write any ESE events to the event log. For this reason, ESE also monitors the crimson channel by verifying that the event log can be written to. If writing to the event log fails for a long period of time, MSExchangeRepl intentionally causes a bugcheck of Windows by terminating wininit.exe.

Note:
Applications and Services logs are a new category of event logs in Windows Server 2008. These logs store events from a single application or component rather than events that might have system-wide impact. This new category of event logs is referred to as an application's crimson channel. For more information, see Monitoring High Availability and Site Resilience.

This new bugcheck-based recovery feature in Exchange 2010 SP1 is designed to make recovery from hung I/O or a hung controller fast, rather than retrying or waiting until the storage stack raises an error that causes failover. When the bugcheck occurs, the error code reads as follows:

CRITICAL_OBJECT_TERMINATION (f4)

A process or thread crucial to system operation has unexpectedly exited or been terminated.

Warning:
The presence of this bugcheck error code doesn’t necessarily mean that Exchange was the cause of the error. Any termination of wininit.exe, including one performed by an administrator using Task Manager or some other task management tool, will cause the same bugcheck error code.