Applies to: Exchange Server 2007 SP3, Exchange Server 2007 SP2, Exchange Server 2007 SP1, Exchange Server 2007
Topic Last Modified: 2008-03-21

Cluster continuous replication (CCR) is a high availability feature of Microsoft Exchange Server 2007 that combines the asynchronous log shipping and replay technology built into Exchange 2007 with the failover and management features provided by the Cluster service.

CCR is designed to provide high availability for Exchange 2007 Mailbox servers by providing a solution that:

CCR uses the database failure recovery functionality in Exchange 2007 to enable the continuous and asynchronous updating of a second copy of a database with the changes that have been made to the active copy of the database. During installation of the passive node in a CCR environment, each storage group and its database is copied from the active node to the passive node. This operation is called seeding, and it provides a baseline of the database for replication. After the initial seeding is performed, log copying and replay are performed continuously.

In a CCR environment, the replication capabilities are integrated with the Cluster service to deliver a high availability solution. In addition to providing data and service availability, CCR also provides for scheduled outages. When updates need to be installed or when maintenance needs to be performed, an administrator can move a clustered mailbox server (called an Exchange Virtual Server in previous versions of Exchange Server) manually to a passive node. After the move operation is complete, the administrator can then perform the needed maintenance.

The move operation is performed using the Move-ClusteredMailboxServer cmdlet in the Exchange Management Shell. Microsoft Exchange Server 2007 Service Pack 1 (SP1) also includes a new Manage Clustered Mailbox Server wizard in the Exchange Management Console that you can use to move clustered mailbox servers. The logic used by these tasks performs the necessary enforcement to make sure that no mailbox data is lost during the move. The tasks also perform checks before the move to make sure that replication on the passive node is ready to be quickly activated.

The key facts about CCR are as follows:

CCR Core Architecture

CCR combines the following elements:

  • Failover and virtualization features provided by Microsoft failover clusters

  • A majority-based failover cluster quorum model that uses a file share as a witness for cluster activity

  • Transaction log replication and replay features in Exchange 2007

  • Message queue feature of the Hub Transport server called the transport dumpster

Windows Failover Cluster

As shown in the following figure, in Exchange 2007 SP1, CCR uses two computers (referred to as nodes) joined in a single failover cluster running either Windows Server 2003 Service Pack 2 or Windows Server 2008. The nodes in the failover cluster host a single clustered mailbox server. A node that is currently running a clustered mailbox server is called the active node, and a node that is not running a clustered mailbox server, but is part of the cluster and is the target for continuous replication, is called the passive node. As a result of scheduled maintenance and unscheduled outages, the designation of a node as active or passive will change several times throughout the lifetime of the failover cluster.


Cluster Continuous Replication Architecture

The failover cluster is built using the Cluster service and a specific type of cluster quorum model:

File Share Witness

Both of the preceding quorum models use a file share on a third computer as a witness. In these quorum models, a file share on a third computer is used to avoid an occurrence of network partition within the cluster, also known as split brain syndrome. Split brain syndrome occurs when all networks designated to carry internal cluster communications fail, and nodes cannot receive heartbeat signals from each other. Split brain syndrome is prevented by always requiring a majority of the two nodes and the file share to be available and interacting for the clustered mailbox server to be operational. When a majority of the computers are communicating, the cluster is said to have a quorum. The file share for the file share witness can be hosted on any computer running Windows Server. There is no requirement that the version of the Windows Server operating system hosting the file share match the operating system of the CCR environment. However, we recommend that you use a Hub Transport server in the Active Directory directory service site containing the clustered mailbox server to host the file share, because this allows a messaging administrator to maintain control over the file share.
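The majority rule described above can be illustrated with a minimal sketch. It assumes three equal voters (the two nodes and the file share) and models vote counting only; it does not model the cluster configuration that the nodes exchange with each other. The function name has_quorum is hypothetical and is not part of the Cluster service.

```python
def has_quorum(node1_up: bool, node2_up: bool, witness_up: bool) -> bool:
    """Return True when a majority of the three voters can interact.

    Illustrative model only: each node and the file share witness
    counts as one vote, and two of three votes form a majority.
    """
    votes = sum([node1_up, node2_up, witness_up])
    return votes >= 2


if __name__ == "__main__":
    # Both nodes up, witness offline: still a majority.
    print(has_quorum(True, True, False))
    # One node plus the witness: still a majority.
    print(has_quorum(True, False, True))
    # A single node with no witness: no majority.
    print(has_quorum(True, False, False))
```

In this model, losing any single voter leaves a majority, while losing any two does not.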

Note:
The file share used by the file share witness cannot be hosted in a Distributed File System (DFS) environment.

The file share witness uses a file share on a computer outside the cluster to act as a witness to the activities of the two nodes that make up the cluster. The witness works like a note board that the two nodes use to track which node is in control of the cluster. The note board is required only when the two nodes cannot communicate with each other. Consider a two-node clustered mailbox server made up of Node1 and Node2. In this case, Node1 is the node that can take control of the note board and is therefore able to take control of the cluster and bring up the clustered mailbox server. If Node2 is operational but is unable to communicate with Node1, Node2 will try to take control of the note board and fail. The inability of Node2 to control the note board means that it should not bring up a clustered mailbox server.
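The Node1 and Node2 scenario can be sketched as a simple exclusive claim on a shared resource. This is a toy model under stated assumptions: the class NoteBoard and the function may_bring_up_cms are hypothetical names, and the real witness mechanism is implemented by the Cluster service rather than by anything shown here.

```python
from typing import Optional


class NoteBoard:
    """Toy stand-in for the witness share: at most one node may hold it."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None

    def try_take_control(self, node: str) -> bool:
        # The first node to claim the board wins. A later claim by a
        # different node fails until the owner releases the board.
        if self.owner is None or self.owner == node:
            self.owner = node
            return True
        return False

    def release(self, node: str) -> None:
        # Only the current owner can release the board.
        if self.owner == node:
            self.owner = None


def may_bring_up_cms(board: NoteBoard, node: str) -> bool:
    """A node that cannot reach its peer may bring up the clustered
    mailbox server only if it controls the note board."""
    return board.try_take_control(node)


if __name__ == "__main__":
    board = NoteBoard()
    print(may_bring_up_cms(board, "Node1"))  # Node1 claims the board first
    print(may_bring_up_cms(board, "Node2"))  # Node2 cannot take control
```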

When the two nodes are able to interact with each other, the note board is not necessary and could be offline. However, a subsequent failure of either node will prevent the clustered mailbox server from being online if the file share witness is not available.

The file share does not maintain any more state than previously described. Therefore, all cluster configuration information is exchanged between the two nodes themselves. This is important to understand, because it means that if Node1 is available and Node2 is unavailable, Node2 cannot become available until it communicates with Node1, even if it can communicate with the file share witness.

The file share witness support provides a periodic access check of the file share witness. If the access check fails, an event is generated. The event can be detected, collected, and reported by the monitoring system. This allows the operations staff to correct the issue, prior to the issue causing an outage due to a second concurrent failure.

The file share is accessed under the following conditions:

  • When a cluster node comes up and only one cluster node is available.

  • When a network connection problem prevents a previously reachable node from communicating with the cluster.

  • When a cluster node leaves the cluster.

  • Periodically for validation purposes. The frequency is configurable.

For these reasons, the load on the file share is light. As a result, a single server can provide file shares for multiple CCR environments. However, each CCR environment should have its own dedicated folder and share on this server.

File Share Witness Considerations

CCR is a two-node environment that uses either an MNS quorum with file share witness or a Node and File Share Majority quorum, in place of the third node (or additional nodes) that traditional MNS clusters required. A geographically dispersed CCR environment is a two-datacenter deployment in which the active node is deployed in the primary datacenter, and the passive node is deployed in a secondary datacenter. Thus, in a geographically dispersed CCR environment, there are two options for placement of the file share: placing it in the primary datacenter, or placing it in a third datacenter.

The first option is to configure the file share on a Hub Transport server in the primary datacenter. A Hub Transport server is recommended because it allows a messaging administrator to manage and monitor outages of the file share. Our experience and customer feedback indicate that the most common types of network service interruptions occur in wide area network (WAN) topologies. Placing the file share in the primary datacenter is useful because it prevents service interruptions due to network failures between the two datacenters. Use of this configuration means automatic failover will not occur in the event of an outage of the primary datacenter. It does, however, ensure that majority in the failover cluster is not affected by a network failure between the primary and secondary datacenter.

The second option is to configure the file share on a managed server role within a third physical site. A managed server role is a server that is supported and maintained to a similar degree as other servers that are critical for the delivery of the messaging service. An example of a managed server role is a Hub Transport server in the primary datacenter. This third physical site could be a branch office or a third datacenter. A requirement of this configuration is that the third site must have a network infrastructure to the primary datacenter and secondary datacenter that has low latency and high reliability.

Transaction Log Replication and Replay

Transaction log replication and replay is used to copy the changed data and update the passive copy's database. Replication takes advantage of the change history produced by the Extensible Storage Engine (ESE). This change history is represented as a sequence of fixed-size 1 megabyte (MB) log files. The replication functionality copies the log files to the passive node as each log file is generated. The replication mechanism is asynchronous to the online database. When the logs arrive at the passive node, they are checked for corruption and then replayed into the copy of the database that is stored on the passive node. The replay process makes the changes described in the change log to the passive node's database, which makes the passive node's database match the production database with a slight time lag.
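The copy, verify, and replay sequence can be sketched with a small model. Everything in it is an assumption made for illustration: the names LogFile, PassiveCopy, make_log, and lag_in_logs are hypothetical, a SHA-256 digest stands in for the real corruption check, and a "key=value" payload stands in for ESE log records.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LogFile:
    generation: int   # position in the log sequence, starting at 1
    payload: bytes    # toy change record in the form b"key=value"
    checksum: str     # hex SHA-256 of payload (stand-in integrity check)


def make_log(generation: int, payload: bytes) -> LogFile:
    return LogFile(generation, payload, hashlib.sha256(payload).hexdigest())


@dataclass
class PassiveCopy:
    database: Dict[str, str] = field(default_factory=dict)
    last_replayed: int = 0

    def verify(self, log: LogFile) -> bool:
        return hashlib.sha256(log.payload).hexdigest() == log.checksum

    def replay(self, log: LogFile) -> None:
        # Logs are replayed strictly in sequence; a gap stops replay.
        if log.generation != self.last_replayed + 1:
            raise ValueError(f"missing log before generation {log.generation}")
        # A copied log is checked for corruption before it is replayed.
        if not self.verify(log):
            raise ValueError(f"log {log.generation} failed verification")
        key, _, value = log.payload.decode().partition("=")
        self.database[key] = value
        self.last_replayed = log.generation


def lag_in_logs(active_logs: List[LogFile], passive: PassiveCopy) -> int:
    """Number of logs generated on the active node but not yet replayed."""
    return sum(1 for log in active_logs if log.generation > passive.last_replayed)


if __name__ == "__main__":
    logs = [make_log(1, b"a=1"), make_log(2, b"b=2"), make_log(3, b"a=3")]
    passive = PassiveCopy()
    for log in logs[:2]:
        passive.replay(log)
    print(passive.database, lag_in_logs(logs, passive))
```

Because replication is asynchronous, the passive copy in this model trails the active log stream by however many logs have not yet been replayed, which corresponds to the slight time lag described above.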

Because the data is replicated between the nodes, the clustered mailbox server can operate on either node in the cluster. This capability provides increased availability because scheduled outages and failures of one node do not cause an extended outage of the clustered mailbox server. In addition, service outages of the storage on one node will not impact availability of the other node and the clustered mailbox server. Assuming that the file share is still available and that it can communicate with the passive node, an outage of the active node causes the clustered mailbox server to move to a remaining node, and it continues to operate.

In a CCR environment, the transaction log file folder on the active node is shared using a standard Windows file share. The object globally unique identifier (GUID) for the storage group is used for the share name, and a dollar sign ($) is added to the end of the share. The Microsoft Exchange Replication service on the passive node connects to the share on the active node and copies, or pulls, the log files using the Server Message Block (SMB) protocol. The passive node then verifies the log file and replays it into the copy of the database on the passive node.

Note:
The SMB traffic for transaction log file replication is not encrypted. If needed, you can use Internet Protocol security (IPsec) to encrypt this traffic. Only transaction log file replication occurs using the SMB protocol. Reseeding a passive copy occurs using the ESE backup application programming interface (API), which is an unencrypted communication. If needed, IPsec can be used to encrypt this data.

Continuous Replication over Redundant Cluster Networks

In the release to manufacturing (RTM) version of Microsoft Exchange Server 2007, all transaction log file copying and seeding in a CCR environment occurs over the public network. This configuration has the following limits:

  • When the passive node is unavailable for several hours, a significant number of logs can build up that need to be transferred. The movement of those logs should be as rapid as possible when the passive node is available. By copying the logs over the public network, the movement of the logs contends with client traffic. This affects client traffic and slows the resynchronization.

  • When the public network fails, the failover is lossy, even though the log data is available.

  • Without an isolated network for log communication, messaging data can be secured in transit only by using encryption, which carries a performance penalty.

  • Log storms may occur under some circumstances. When they occur, the system experiences an unusually high replication burden. This could cause client starvation if the log data must be communicated over the same network used to communicate with clients.

Not all of these issues will occur with the same frequency. However, the first issue is effectively guaranteed to happen every few months because passive nodes are taken offline for regular maintenance activity.

Exchange 2007 SP1 minimizes the effects of the preceding problems by allowing the administrator to create one or more mixed networks in the cluster for log shipping. A mixed network is a cluster network that supports both internal cluster heartbeat traffic and client traffic. Exchange 2007 SP1 also enables an administrator to specify a particular mixed network to be used for seeding.

Note:
Cluster networks used for log shipping and seeding must be configured as mixed networks. A mixed network is any cluster network that is configured for both cluster (heartbeat) and client access traffic. In addition, on the network adapter being configured with a continuous replication host name, the administrator must clear the Register this connection’s addresses in DNS check box in the Advanced TCP/IP properties dialog box and use either static DNS entries or Hosts file entries on each node to allow name resolution for the newly created host names by each node. The DNS server used by the network adapter can be located on the public or private network; however, regardless of its location, it must be accessible by both nodes so that host name resolution can occur. In addition, on Windows Server 2008, network adapters used for log shipping or seeding require NetBIOS to be enabled.

Support for log file copying over a mixed network is configured using a new cmdlet called Enable-ContinuousReplicationHostName. Similarly, turning off this feature is accomplished using the Disable-ContinuousReplicationHostName cmdlet.

After a clustered mailbox server exists in a CCR environment, an administrator can run Enable-ContinuousReplicationHostName on both nodes of the cluster and specify additional IP addresses and host names, which will then be created in dedicated cluster groups that are associated with each node. After this task has been performed, the Microsoft Exchange Replication service will begin using the newly created network for log copying shortly after successful configuration and upon confirming that the new network is operational. If multiple new networks are created, the Microsoft Exchange Replication service will randomly select one of them. If the specified network becomes unavailable, the Microsoft Exchange Replication service will automatically begin using other replication networks, or if none are available, it will begin using the public network for log shipping within five minutes. (Microsoft Exchange Replication service network discovery occurs every five minutes.) When the preferred replication network is again available, the Microsoft Exchange Replication service will automatically revert to using it for log shipping.
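The selection and fallback behavior described above can be sketched as follows. This is an illustrative model, not the logic of the Microsoft Exchange Replication service: the function choose_log_shipping_network, its parameters, and the network names are all hypothetical, and the five-minute discovery interval appears only as a constant.

```python
import random
from typing import Dict, Optional

# The text states that network discovery occurs every five minutes.
DISCOVERY_INTERVAL_SECONDS = 5 * 60


def choose_log_shipping_network(
    replication_networks: Dict[str, bool],
    public_network: str = "public",
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a network for log shipping.

    replication_networks maps a replication host name to True when that
    network is operational. An operational replication network is
    preferred (chosen at random when there are several); otherwise the
    public network is used.
    """
    rng = rng or random.Random()
    operational = sorted(name for name, up in replication_networks.items() if up)
    if operational:
        return rng.choice(operational)
    return public_network


if __name__ == "__main__":
    # One replication network is down, so the remaining one is used.
    print(choose_log_shipping_network({"repl1": False, "repl2": True}))
    # No replication network is operational, so the public network is used.
    print(choose_log_shipping_network({"repl1": False, "repl2": False}))
```

Calling the function again at each discovery interval reproduces the revert behavior: once a replication network is reported as operational again, it is chosen in preference to the public network.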

For more information about these cmdlets, see Enable-ContinuousReplicationHostName and Disable-ContinuousReplicationHostName.

Support for seeding over a redundant cluster network is configured using the Update-StorageGroupCopy cmdlet, which has been updated in Exchange 2007 SP1 to include a new parameter called DataHostNames. This parameter is used to specify which cluster network should be used for seeding. For more information about the changes to the Update-StorageGroupCopy cmdlet in Exchange 2007 SP1, see Update-StorageGroupCopy.

After a cluster network has been created for continuous replication, you can use the Get-ClusteredMailboxServerStatus cmdlet to view updated information about cluster networks that have been enabled for continuous replication activity. The new output details include:

  • OperationalReplicationHostNames:{Host1,Host2,Host3}

  • FailedReplicationHostNames:{Host4}

  • InUseReplicationHostNames:{Host1,Host2}

For more information about the changes to the Get-ClusteredMailboxServerStatus cmdlet in Exchange 2007 SP1, see Get-ClusteredMailboxServerStatus.

Transport Dumpster

Most of the data that is lost during an automatic recovery is subsequently recovered automatically by a Hub Transport server feature called the transport dumpster. The transport dumpster for a specific database may be located on all Hub Transport servers in the Active Directory site containing the clustered mailbox server. As a message goes through Hub Transport servers on its way to a clustered mailbox server in a CCR environment, a copy is kept in the transport queue (mail.que) until the replication window has passed. The transport dumpster is a required component for CCR deployments. The transport dumpster takes advantage of the redundancy in the environment to reclaim some of the data affected by the failover. Specifically, Hub Transport servers maintain a queue of recently delivered mail. This queue is bounded by both the length of time mail is kept and the total space used. When a failover is experienced that is not lossless, CCR on the clustered mailbox server automatically requests every Hub Transport server in the Active Directory site to resubmit mail from the transport dumpster queue. The information store automatically deletes the duplicates and again delivers mail that was lost.
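A queue bounded by both retention time and total size, followed by resubmission with duplicate removal, can be modeled with a short sketch. The names Message, TransportDumpster, and redeliver are hypothetical, message IDs stand in for whatever the information store uses to detect duplicates, and the trimming policy (drop oldest first) is an assumption.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Set


@dataclass(frozen=True)
class Message:
    message_id: str
    size_mb: float
    delivered_at: float  # seconds since an arbitrary starting point


class TransportDumpster:
    """Toy queue of recently delivered mail, bounded by age and total size."""

    def __init__(self, max_size_mb: float, max_age_seconds: float) -> None:
        self.max_size_mb = max_size_mb
        self.max_age_seconds = max_age_seconds
        self._queue: Deque[Message] = deque()

    def record_delivery(self, msg: Message) -> None:
        self._queue.append(msg)
        self._trim(now=msg.delivered_at)

    def _trim(self, now: float) -> None:
        # Drop the oldest entries until both bounds are satisfied.
        while self._queue and (
            now - self._queue[0].delivered_at > self.max_age_seconds
            or sum(m.size_mb for m in self._queue) > self.max_size_mb
        ):
            self._queue.popleft()

    def resubmit(self) -> List[Message]:
        """Return every retained message, oldest first."""
        return list(self._queue)


def redeliver(store_ids: Set[str], resubmitted: List[Message]) -> List[str]:
    """Deliver only messages the store does not already hold."""
    delivered = []
    for msg in resubmitted:
        if msg.message_id not in store_ids:
            store_ids.add(msg.message_id)
            delivered.append(msg.message_id)
    return delivered


if __name__ == "__main__":
    dumpster = TransportDumpster(max_size_mb=3, max_age_seconds=100)
    for i, t in enumerate([0, 10, 20, 30], start=1):
        dumpster.record_delivery(Message(f"m{i}", 1, t))
    store = {"m2", "m3"}  # messages that survived the failover
    print(redeliver(store, dumpster.resubmit()))
```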

The transport dumpster is enabled for CCR and, in Exchange 2007 SP1, also for local continuous replication (LCR). The transport dumpster is not enabled for standby continuous replication (SCR) or single copy clusters (SCCs). The necessary condition for an e-mail message to be retained in the transport dumpster is that it has at least one recipient whose mailbox is on a clustered mailbox server in a CCR environment or, in SP1, on a mailbox database enabled for LCR.

The transport dumpster is designed to help protect against data loss by providing an administrator with the option to configure CCR such that the clustered mailbox server will automatically come online on another node, with a limited amount of data loss. When this happens, the system automatically delivers all the recent e-mail messages sent to users on this server, by taking advantage of the transport dumpster where these e-mail messages are still stored. This helps to prevent data loss in most situations. In a CCR environment, request for redelivery from the transport dumpster on all Hub Transport servers in the site is performed automatically. In Exchange 2007 RTM, the retry interval is hard-coded to seven days. In Exchange 2007 SP1, the retry interval is equal to the value set for MaxDumpsterTime. In an LCR environment, the request for redelivery from all Hub Transport servers in the site occurs as part of the Restore-StorageGroupCopy task.

Situations in which data loss is not mitigated by the transport dumpster include:

  • Drafts folder for any Microsoft Outlook clients in online mode.

  • Appointments, contact updates, property updates, tasks, and task updates.

  • Outgoing mail that is in transit from the client to the Hub Transport server. There is a period of time during which the e-mail message only exists on the sender's Mailbox server.

Deploying Cluster Continuous Replication

Deploying CCR is similar to deploying a stand-alone Exchange server, and it is similar to deploying an SCC. For more information about SCCs, see Single Copy Clusters. However, there are some significant differences to be aware of when deploying CCR. We recommend that you review the following topics before designing and deploying CCR in your environment:

After you are ready to deploy CCR, you can begin the process by performing the steps in each phase of installation described in one of the following topics:

Enhancements to CCR in Exchange 2007 SP1

Exchange 2007 SP1 includes several enhancements for CCR, including additional Exchange Management Console user interface elements, improved status and monitoring, and improved performance.

Exchange Management Console Enhancements

Several new user interface elements have been added in Exchange 2007 SP1 that enhance the management experience for high availability features, including CCR. These improvements include:

  • Transport dumpster user interface   A new Global Settings tab has been added to the Hub Transport node under the Organization Configuration work area. This tab includes a Transport Settings Properties page that can be used to configure the transport dumpster settings for the organization:

    • Maximum size per storage group (MB)   Specifies the maximum size of the transport dumpster queue for each storage group.

    • Maximum retention time (days)   Specifies how long an e-mail message should remain in the transport dumpster queue.

  • Continuous replication   Additional user interface controls have been added to the Exchange Management Console that enable an administrator to suspend, resume, update, and restore continuous replication. These controls are the equivalent of using the following Exchange Management Shell cmdlets:

    • Suspend-StorageGroupCopy

    • Resume-StorageGroupCopy

    • Update-StorageGroupCopy

    • Restore-StorageGroupCopy

    You can use these cmdlets and the corresponding Exchange Management Console tasks to manage continuous replication in both an LCR environment and in a CCR environment.

Status and Monitoring Enhancements

Exchange 2007 SP1 also introduces several changes that are designed to enhance the manageability of Exchange 2007. These changes improve upon the cluster reporting features in Exchange 2007 RTM, and include additional functionality designed for proactive monitoring of continuous replication environments. Specifically, the changes and enhancements correct known deficiencies with the Get-StorageGroupCopyStatus cmdlet, introduce a new cmdlet called Test-ReplicationHealth, and provide greater visibility into the loss window covered by the transport dumpster.

Improvements to the Get-StorageGroupCopyStatus Cmdlet

In Exchange 2007 RTM, there are several conditions where the status reported by Get-StorageGroupCopyStatus and the continuous replication performance counters is inaccurate or misleading:

  • A storage group that is not active (for example, not changing) can report its status as healthy when it might not be healthy. This situation occurs because the unhealthy condition is not detected until a log is replayed.

  • During replication initialization, the replication status is being re-evaluated and may not be accurate. When initialization completes, the status is updated.

  • The value of the LastLogGenerated field can be wrong when the database in the storage group is dismounted.

  • When there are one or more missing logs in the middle of a log stream, the passive copy continues to try to recover, causing the replication status to switch between failed and healthy states. When this happens, the replay and copy queues continue to grow.

  • Under rare conditions, a log can be successfully verified but still fail to replay. In this situation, the system will alternate between failed and healthy states during its attempts to recover. When this happens, the replay and copy queues continue to grow.

The Get-StorageGroupCopyStatus cmdlet has also been enhanced with the addition of new status information for CCR environments:

  • The Get-StorageGroupCopyStatus cmdlet reports a SummaryCopyStatus of ServiceDown when the Microsoft Exchange Replication service on the target computer is not network accessible.

  • The Get-StorageGroupCopyStatus cmdlet reports a SummaryCopyStatus of Initializing when the Microsoft Exchange Replication service on the target computer has not completed its initial startup checks. A new performance counter has also been created to represent this status as a Boolean.

  • The Get-StorageGroupCopyStatus cmdlet reports a SummaryCopyStatus of Synchronizing when it has not completed an incremental reseed.

The new states for the SummaryCopyStatus value are visible only when you use the Exchange 2007 SP1 version of the Exchange management tools. When you use the Exchange 2007 RTM version of the Exchange management tools, the status for any of the preceding states will be reported as Failed.
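The difference between what the two versions of the management tools display can be expressed as a small mapping. This is a sketch only: the function status_shown_by_tools and the version labels "RTM" and "SP1" are hypothetical, and the three state names are taken from the list above.

```python
# SummaryCopyStatus values that only the SP1 management tools display.
SP1_ONLY_STATES = {"ServiceDown", "Initializing", "Synchronizing"}


def status_shown_by_tools(summary_copy_status: str, tools_version: str) -> str:
    """Return the status an administrator would see.

    tools_version is "RTM" or "SP1" (labels chosen for this sketch).
    The RTM tools report each of the new SP1 states as Failed.
    """
    if tools_version not in ("RTM", "SP1"):
        raise ValueError(f"unknown tools version: {tools_version}")
    if tools_version == "RTM" and summary_copy_status in SP1_ONLY_STATES:
        return "Failed"
    return summary_copy_status


if __name__ == "__main__":
    for state in ["Healthy", "Initializing", "Synchronizing", "ServiceDown"]:
        print(state, "->", status_shown_by_tools(state, "RTM"))
```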

Test-ReplicationHealth Cmdlet

Exchange 2007 SP1 introduces a new cmdlet called Test-ReplicationHealth. This cmdlet is designed for proactive monitoring of continuous replication and the continuous replication pipeline. The Test-ReplicationHealth cmdlet checks all aspects of replication, cluster services, and storage group replication and replay status to provide a complete overview of the replication system. Specifically, when run on a node in the cluster, the Test-ReplicationHealth cmdlet performs the tests described in the following table.

Tests performed by the Test-ReplicationHealth cmdlet

Test   Description

Cluster network status   Verifies that all cluster-managed networks found on the local node are running. This test is run only in a CCR environment.

Quorum group state   Verifies that the cluster group containing the quorum resource is healthy. This test is run only in a CCR environment.

File share quorum state   Verifies that the value of the FileSharePath used by the Majority Node Set quorum with file share witness is reachable. This test is run only in a CCR environment.

Clustered mailbox server group state   Verifies that the clustered mailbox server is healthy by confirming that all resources in the group are online. This test is run only in a CCR environment.

Node state   Verifies that neither of the nodes in the cluster is in a paused state. This test is run only in a CCR environment.

DNS registration status   Verifies that all cluster-managed network interfaces that have Require DNS registration to succeed set have passed DNS registration. This test is run only in a CCR environment.

Replication service status   Verifies that the Microsoft Exchange Replication service on the local computer is healthy.

Storage group copy suspended   Checks whether continuous replication has been suspended for any storage groups enabled for continuous replication.

Storage group copy failed   Checks whether any storage group copies are in a Failed state.

Storage group replication queue length   Checks whether any storage group has a replication copy queue length greater than best practice thresholds. Currently, these thresholds are:

  • Warning   Queue length is 3–5 logs.

  • Failure   Queue length is 6 or more logs.

Databases dismounted after failover   Checks whether any databases are dismounted or failed after a failover has occurred. This test only checks for databases that have failed as a result of a failover.
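The replication queue length thresholds listed for the storage group replication queue length test can be written as a small classifier. This is a sketch of the stated thresholds only, not the cmdlet's implementation; the function name classify_copy_queue_length and the "OK" label for lengths below the warning threshold are assumptions.

```python
def classify_copy_queue_length(queue_length: int) -> str:
    """Map a copy queue length (in logs) to a test result.

    Thresholds from the text: 3-5 logs is a warning, and 6 or more
    logs is a failure. The "OK" label for shorter queues is an
    assumption made for this sketch.
    """
    if queue_length < 0:
        raise ValueError("queue length cannot be negative")
    if queue_length >= 6:
        return "Failure"
    if queue_length >= 3:
        return "Warning"
    return "OK"


if __name__ == "__main__":
    for length in (0, 2, 3, 5, 6, 40):
        print(length, classify_copy_queue_length(length))
```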

Performance Enhancements

Performance improvements have been made in Exchange 2007 SP1 that benefit high availability deployments. These improvements include:

  • I/O reductions on the disks containing passive copies of storage groups in continuous replication environments   In Exchange 2007 SP1, the design of the continuous replication architecture has been modified so that the database cache is now persisted on the passive node in between batches of log replay activity. The persistence of the database cache between batches of log replay activity enables the Microsoft Exchange Replication service to leverage the database caching features of the ESE, which in turn, reduces the amount of disk I/O that occurs on the passive copy's logical unit numbers (LUNs). By contrast, in Exchange 2007 RTM, a new database cache was created for each batch of log replay activity, which in some cases made the disk I/O activity on the passive node as much as two or three times the disk I/O on the active node.

  • Faster moving of clustered mailbox servers between nodes in a CCR environment   These improvements enable clustered mailbox servers to move between nodes in two minutes or less. This includes moves performed by an administrator (using the Move-ClusteredMailboxServer cmdlet), and failovers that are managed by the Cluster service. To accomplish fast moves in a CCR environment, the databases are taken offline without flushing the database cache. This behavior reduces the amount of downtime that occurs when the clustered mailbox server is moved to another node.
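The effect of persisting the database cache between batches of log replay, described in the first item above, can be illustrated with a toy model that counts page reads. It assumes an unbounded cache and treats each cache miss as one disk read; the function page_reads and the sample page numbers are hypothetical and say nothing about actual ESE behavior or real I/O ratios.

```python
from typing import Iterable, List, Set


def page_reads(batches: Iterable[List[int]], persist_cache: bool) -> int:
    """Count page reads from disk while replaying batches of page updates.

    With persist_cache=False, a new cache is created for each batch
    (the RTM behavior described in the text). With persist_cache=True,
    pages cached by earlier batches remain available (the SP1 behavior).
    """
    cache: Set[int] = set()
    reads = 0
    for batch in batches:
        if not persist_cache:
            cache = set()  # start every batch with an empty cache
        for page in batch:
            if page not in cache:
                reads += 1  # cache miss: the page must be read from disk
                cache.add(page)
    return reads


if __name__ == "__main__":
    batches = [[1, 2, 3], [2, 3, 4], [1, 4, 5]]
    print("new cache per batch:", page_reads(batches, persist_cache=False))
    print("persisted cache:", page_reads(batches, persist_cache=True))
```

When consecutive batches touch overlapping pages, the persisted cache avoids rereading them, which is the source of the I/O reduction in this model.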

Using Standby Continuous Replication with CCR

SCR is a new feature introduced in Exchange 2007 SP1. SCR extends the existing continuous replication features and enables new data availability scenarios for Exchange 2007 Mailbox servers. SCR uses the same log shipping and replay technology used by LCR and CCR to provide added deployment options and configurations.

SCR enables you to use continuous replication to replicate Mailbox server data from a stand-alone Mailbox server (with or without LCR), or from a clustered mailbox server in an SCC or in a CCR environment.

The process for activating copies of Mailbox server data that are created and maintained by SCR is manual and is designed to be used when a significant failure occurs (and not for simple server outages that are recoverable by a restart or some other quick means). You can activate an SCR target using database portability, the server recovery option (Setup /m:RecoverServer), or if the Mailbox server is clustered, the clustered mailbox server recovery option (Setup /RecoverCMS). The option you choose will be based on your configuration and the type of failure that occurs.

For more information about SCR, see Standby Continuous Replication.
