Applies to: Exchange Server 2007 SP3, Exchange Server
2007 SP2, Exchange Server 2007 SP1, Exchange Server 2007
Topic Last Modified: 2009-06-10
In addition to the tasks for day-to-day management and administration of an Exchange organization, there are tasks that are specific to managing a cluster continuous replication (CCR) environment. In general, the administrative tasks for CCR are grouped into two categories:
- Tasks related to a clustered mailbox server
- Tasks related to the storage groups and databases in a
clustered mailbox server
Tasks Related to a Clustered Mailbox Server
Administrative tasks associated with a clustered mailbox server in a CCR environment include the following:
- Managing disk volumes
- Viewing configuration settings
- Monitoring replication activity
- Viewing and collecting performance data
- Managing clustered mailbox servers, which includes bringing
them online, taking them offline, and moving clustered mailbox
servers between nodes
- Managing log file replication and replay
Managing Disk Volumes
While managing a CCR environment, it may be necessary to manage disk volumes that are associated with your CCR cluster. For example, the volume may need to be temporarily detached from the system for maintenance or other reasons. If this needs to happen on the active storage group or database, the databases in the storage group should be dismounted. If such an operation is to be performed on the passive copy of the storage group or database, all replication input/output (I/O) to the volume should be stopped by halting replication.
For more information about managing disk volumes, see How to Prepare for Disk Volume Management Activities for a CCR Copy.
Viewing Configuration Settings
After CCR has been enabled for a server, you can use the Exchange Management Console and the Exchange Management Shell to view the configuration settings for storage groups and databases on the server.
Configuration information includes the locations for the storage group and database files. In addition, you can review the clustered mailbox server-related configuration by using the Exchange Management Shell.
For detailed steps about how to view CCR failover control configuration information, see How to View Failover Control Configuration.
Monitoring Replication Activity
The passive copy of a database is only useful if it is kept up to date. Although CCR does not require any special monitoring, we recommend regularly monitoring each storage group to verify that it is properly replicating log files. The Microsoft Exchange Server 2007 Management Pack for Microsoft Operations Manager 2005 includes alerts for several critical problems related to CCR environments:
- The Microsoft Exchange Replication service is not running
on the passive node. The event that generates this alert does not
repeatedly appear after the service is stopped, so any alert
associated with it would be lost if it were cleared.
- The passive copy is in a Failed state.
- The passive copy is in a Healthy state, but it is significantly
behind in log copying or replay.
We also recommend adding a custom event rule to Microsoft Operations Manager that is triggered when a node that belongs to the cluster containing an active node is not detected to be running. When this condition occurs, the Cluster service logs an event to the System event log. We recommend using the following criteria for the event rule, which match the event logged by the Cluster service:
Event Source: ClusSvc
Event ID: 1135
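As a rough illustration, the criteria above can be expressed as a simple predicate over event records. This is a sketch in Python, not Microsoft Operations Manager rule syntax. The EventRecord type and its field names are hypothetical, and the two non-matching sample events are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical, simplified view of a Windows event log entry."""
    log_name: str
    source: str
    event_id: int


def is_node_down_event(event: EventRecord) -> bool:
    """Return True when the event matches the rule criteria given above."""
    return (
        event.log_name == "System"
        and event.source == "ClusSvc"
        and event.event_id == 1135
    )


# Invented sample data: only the first record meets all three criteria.
events = [
    EventRecord("System", "ClusSvc", 1135),
    EventRecord("System", "ClusSvc", 9999),
    EventRecord("Application", "OtherSource", 1135),
]
matches = [e for e in events if is_node_down_event(e)]
print(len(matches))
```

Matching on the log name as well as the source and event ID is a design choice in this sketch; it avoids alerting on an unrelated event that happens to share the same ID.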
For more information about creating event rules in Microsoft Operations Manager, see Monitoring Security Events with MOM.
You should investigate and resolve as quickly as possible any of the preceding alerts generated by the Exchange 2007 Management Pack or a custom event rule.
An alternative to using the Exchange 2007 Management Pack for Microsoft Operations Manager 2005 is to regularly run a script that executes the Get-StorageGroupCopyStatus cmdlet in the Exchange Management Shell. The Get-StorageGroupCopyStatus cmdlet gives queue lengths that incorporate the number of logs generated by the active node. For performance reasons, the queue length performance counters only report information that is known to the Microsoft Exchange Replication service. Under rare conditions, this can be inconsistent with the state of the active node. For more information about the Get-StorageGroupCopyStatus cmdlet, see "Viewing the Status of Storage Group Copies" later in this topic.
Viewing and Collecting Performance Data
You can determine the progress of replication by using performance counters. For more information about using performance counters for CCR, see How to View Performance Counters for Cluster Continuous Replication.
Managing Clustered Mailbox Servers
Managing a clustered mailbox server involves three primary administrative tasks: bringing the clustered mailbox server online, taking it offline, and moving it between nodes in the cluster. Management can also involve shutting down or restarting one of the nodes in the cluster as part of update management or other maintenance operations.
Starting and Stopping a Clustered Mailbox Server
The Failover Cluster Management tool (Windows Server 2008), Cluster Administrator (Windows Server 2003), and the Cluster.exe command-line tool (in both operating systems) have the ability to bring resources online and take resources offline. Taking a clustered mailbox server offline is called stopping and bringing a clustered mailbox server online is called starting. The recommended way to start a clustered mailbox server is to use the Start-ClusteredMailboxServer cmdlet. The recommended way to stop a clustered mailbox server is to use the Stop-ClusteredMailboxServer cmdlet. In Exchange 2007 Service Pack 1 (SP1), you can also use the Manage Clustered Mailbox Server wizard in the Exchange Management Console to start or stop a clustered mailbox server.
For detailed steps about how to bring a clustered mailbox server online, see How to Start a Clustered Mailbox Server in a CCR Environment. For detailed steps about how to take a clustered mailbox server offline, see How to Stop a Clustered Mailbox Server in a CCR Environment.
Moving a Clustered Mailbox Server Between Nodes
Manually moving a clustered mailbox server between nodes is called a handoff or scheduled outage. To move a clustered mailbox server, use the Move-ClusteredMailboxServer cmdlet. In Exchange 2007 SP1, you can also use the Manage Clustered Mailbox Server wizard in the Exchange Management Console to perform a handoff of a clustered mailbox server. Although the Failover Cluster Management tool (Windows Server 2008), Cluster Administrator (Windows Server 2003), and the Cluster.exe command-line tool (in both operating systems) can be used to move a clustered mailbox server between nodes, we recommend using one of the Exchange management tools to move a clustered mailbox server from the active node to the passive node because they allow you to specify a reason for the handoff. In addition:
- Using the cluster tools skips the check of the health or state
of the passive copy that is performed by the Exchange management
tools. Thus, their use can result in an extended outage while the
node performs the operations necessary to make the database
mountable.
- Using the cluster tools may also leave a database offline
indefinitely because replication is not healthy, and the cluster
tools, unlike the Exchange management tools, are unable to determine
the state of replication before moving the resource group.
Note: Moving a clustered mailbox server between nodes will result in a brief interruption in service. In addition, any backups of any storage groups on the clustered mailbox server will be canceled.
If replication is not healthy, or if the checks performed by the Exchange management tools determine that the passive node is not in an acceptable state for a handoff, the Exchange management tools will not perform the handoff. If this happens and you still need to move the clustered mailbox server to the passive node, you can use the cluster management tools to do that.
When moving a clustered mailbox server in a failover cluster in which there is network latency between the nodes, we recommend performing the move operation from the passive node.
For detailed steps about how to move a clustered mailbox server between nodes, see How to Move a Clustered Mailbox Server in a CCR Environment.
Performing Maintenance on the Cluster
Maintenance should always be performed on the passive node in the cluster. Updates, hotfixes, and other applications generally should not be installed on the active node (a node that currently owns a clustered mailbox server). For detailed steps about how to install Exchange update rollups in a CCR environment, see Applying Exchange 2007 Update Rollups to Clustered Mailbox Servers.
If maintenance needs to be performed on the active node, the clustered mailbox server should first be moved to a passive node using the Move-ClusteredMailboxServer cmdlet. After moving the clustered mailbox server, the previously active node becomes the passive node, and the previously passive node is now the active node. Maintenance can then be performed, and a handoff can be performed that moves the clustered mailbox server in the opposite direction.
CCR environments allow you to schedule a system outage of a specific node without an outage of the clustered mailbox server. In a CCR environment, only one node can be taken offline at a time. Taking more than one node offline will result in an interruption in service.
A scheduled outage is initiated via the Exchange Management Shell Move-ClusteredMailboxServer cmdlet. The topic How to Move a Clustered Mailbox Server in a CCR Environment provides a procedure to perform a scheduled outage.
Before shutting down or restarting any node in a CCR environment, we recommend that you verify which node is currently hosting the clustered mailbox server. This information can be obtained by using the Get-ClusteredMailboxServerStatus cmdlet.
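The maintenance order described in this section can be sketched as a small model. This is an illustrative Python sketch only: the CcrCluster class, the node names, and the maintain callback are hypothetical, and handoff() merely stands in for running the Move-ClusteredMailboxServer cmdlet.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CcrCluster:
    """Hypothetical two-node model: one active node and one passive node."""
    active: str
    passive: str

    def handoff(self) -> None:
        # Stands in for Move-ClusteredMailboxServer: the two roles swap.
        self.active, self.passive = self.passive, self.active


def maintain_both_nodes(cluster: CcrCluster, maintain: Callable[[str], None]) -> None:
    """Service each node only while it is passive, then restore the original roles."""
    maintain(cluster.passive)  # 1. service the current passive node
    cluster.handoff()          # 2. move the clustered mailbox server to it
    maintain(cluster.passive)  # 3. service the previously active node
    cluster.handoff()          # 4. optional handoff in the opposite direction


serviced: List[str] = []
cluster = CcrCluster(active="NODE1", passive="NODE2")
maintain_both_nodes(cluster, serviced.append)
print(serviced, cluster.active)
```

In this model a node is only ever serviced while it holds the passive role, which mirrors the guidance that maintenance should not be performed on the node that currently owns the clustered mailbox server.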
Shutting Down Nodes in the Cluster
If all of the nodes in the cluster need to be shut down, including the active node, you must first stop the clustered mailbox server. The Windows shutdown process is not Exchange-aware. Therefore, we recommend that you only shut down passive nodes. If an active node needs to be shut down or restarted, we recommend that you move the clustered mailbox server to another available node. For detailed steps that explain how to move a clustered mailbox server to another node, see How to Move a Clustered Mailbox Server in a CCR Environment.
If the clustered mailbox server cannot be moved to the passive node (perhaps because the passive node has already been shut down), it must be stopped prior to shutting down the active node.
If you need to restart or shut down the active node and you cannot move the clustered mailbox server to the passive node, we recommend that you use Group Policy to make sure that the clustered mailbox server is stopped before restarting or shutting down an active node. Windows Server provides a set of policy-driven computer shutdown scripts that you can manage by using the Group Policy snap-in. The Group Policy snap-in includes extensions that enable you to specify a script that runs when you shut down the computer.
For example, you can create a shutdown script that runs the Move-ClusteredMailboxServer cmdlet or the Stop-ClusteredMailboxServer cmdlet, with the appropriate parameters. We also recommend using a shutdown script because it minimizes the chance that the system will be shut down or restarted by an administrator who is not aware of the need to move or stop the clustered mailbox server before shutting down the active node.
Important: These scripts run under the Local System account. Before these scripts can run successfully, you must grant the Local System account (the local node's computer account) permission to manage the clustered mailbox server.
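The decision a shutdown script has to make can be sketched as follows. This is a minimal Python sketch of the logic only, assuming the script already knows which node is active and whether the other node is available; the function and parameter names are hypothetical. A real shutdown script would run the named cmdlets in the Exchange Management Shell with the appropriate parameters.

```python
from typing import List


def plan_node_shutdown(node: str, active_node: str, other_node_available: bool) -> List[str]:
    """Return the ordered steps to take before shutting down `node`."""
    if node != active_node:
        # Passive nodes can be shut down directly.
        return ["shut down"]
    if other_node_available:
        # Preferred: hand the clustered mailbox server to the other node first.
        return ["Move-ClusteredMailboxServer", "shut down"]
    # No node can take over, so the clustered mailbox server must be stopped.
    return ["Stop-ClusteredMailboxServer", "shut down"]


print(plan_node_shutdown("NODE1", active_node="NODE1", other_node_available=False))
```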
Managing Log File Replication and Replay
Managing replication in a CCR environment involves the following main activities:
- Handling failovers when replication is halted
- Halting and restarting replication to storage group copies
- Configuring one or more redundant networks for replication
Handling Failovers When Replication Is Halted
Halting replication stops all propagation of the changes from the active storage group to the copy for the period of the suspension. Should a failover happen during that time, the storage group copy will not have the latest changes. Depending on the volume of change that has occurred on the active node, the lack of the latest changes is likely to prevent the system from mounting the copy on the passive computer. Thus, you can either use the available version of the storage group on the passive node or wait until the original server recovers. It is important to minimize the time that the replication is halted to minimize this exposure.
If you do not mount the version of the data on the passive node, then when the original computer becomes available, the replication system will copy the missing logs and automatically mount the copy of the database on the new active node.
A failover that occurs after replication is resumed could occur when the passive copy is still missing logs or after it has all the logs, but before they have been replayed. If the logs are copied, but not replayed, a failover will trigger the replay of the outstanding logs into the database. Thus, this storage group will experience an extended recovery time as part of the failover, although other storage groups will not be affected. However, if enough logs are available to meet the configured automatic mount criteria, the system will eventually mount the database with the latest available data. There is one risk to this process: One of the logs to be replayed could be corrupted and not permit successful replay. In this case, the replay will result in an error and all further replay activity will be blocked. When this happens, the storage group copy will go into an error state referred to as Failed. In this error state, you may be able to recover using the version of the database up to that point. Otherwise, you will need to wait until the original server becomes available and the non-corrupted log is copied again.
Halting and Restarting Replication to Storage Group Copies
It may occasionally be necessary to control the activities of the passive copy, for example, by halting and restarting replication activity. Replication is controlled at the storage group level. Because a storage group in a CCR environment can contain only one database, replication is localized to one database.
Replication occurs when both of the nodes in the cluster are operational, the Microsoft Exchange Replication service is running on the target node, and the storage group copy has copying enabled. If either the source or target location for CCR becomes unavailable, you must stop replication. In addition, some CCR administration tasks, such as seeding, performing integrity checks, or storage reconfiguration, require a storage group copy to have its replication halted. If you need to stop all access to the target's log files and log directory, you must halt replication.
Exchange 2007 requires that all replication activity be halted when the location of the storage group or database is being changed.
For more information about halting database copy updates, see How to Halt Replication for a Passive Copy in a CCR Environment. For more information about restarting database copy updates, see How to Restart Replication for a Passive Copy in a CCR Environment.
For more information about performing an integrity check on CCR transaction logs and database files, see How to Verify a Cluster Continuous Replication Copy.
Configuring One or More Redundant Networks for Replication
Exchange 2007 SP1 enables you to configure redundant cluster networks that can be used for log shipping and seeding in a CCR environment. The redundant network must be configured as a mixed cluster network. A mixed cluster network is any cluster network that has been configured for both cluster (heartbeat) and client access traffic.
When a mixed cluster network has been configured with continuous replication host names and IP addresses, Exchange 2007 will use that network for log shipping. In addition, the configured network is available for administrator-initiated seeding with the Update-StorageGroupCopy cmdlet. Multiple mixed networks can be specified, and if more than one network is available, Exchange 2007 will randomly select one of the networks. If the network currently in use becomes unavailable, Exchange 2007 will automatically select another available network.
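The selection behavior described above (random choice among available networks, with automatic reselection when the network in use becomes unavailable) can be modeled roughly as follows. This Python sketch is not Exchange code: the function name, the network names, and the availability map are hypothetical, and returning None when no configured network is available is an assumption made for the example.

```python
import random
from typing import Dict, Optional


def select_replication_network(
    availability: Dict[str, bool],
    current: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Keep the current network while it is available; otherwise pick another at random."""
    rng = rng or random.Random()
    if current is not None and availability.get(current, False):
        return current
    candidates = [name for name, is_up in availability.items() if is_up]
    if not candidates:
        return None  # assumption: no configured network is usable
    return rng.choice(candidates)


networks = {"repl-net-a": True, "repl-net-b": True}
in_use = select_replication_network(networks)
networks[in_use] = False  # simulate the network in use becoming unavailable
replacement = select_replication_network(networks, current=in_use)
print(in_use, "->", replacement)
```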
Support for log shipping over a mixed network is configured using the Enable-ContinuousReplicationHostName cmdlet. Similarly, turning off this feature is accomplished using the Disable-ContinuousReplicationHostName cmdlet. After a clustered mailbox server exists in a CCR environment, an administrator can run Enable-ContinuousReplicationHostName on both nodes of the cluster and specify two IP addresses and host names. After doing this, the system randomly selects a mixed network for log copying after successful configuration and upon confirming that the mixed network is operational.
Seeding and reseeding in a CCR environment is performed using the Update-StorageGroupCopy cmdlet. In Exchange 2007 SP1, this cmdlet has been extended to include a new parameter called DataHostNames. This parameter is used to specify which network should be used for seeding or reseeding. The value is a multivalued list of two names, each of which is either a fully qualified domain name (FQDN) or a host name. One of these names must identify the passive node.
For more information about configuring redundant networks for log shipping and seeding, see the following topics:
- CCR on Windows Server 2008:
- CCR on Windows Server 2003:
Tasks Related to Storage Groups and Databases in a Clustered Mailbox Server
Administrative tasks associated with the storage groups and databases in a clustered mailbox server in a CCR environment include the following:
- Moving the location of storage group files or a database
- Viewing the status of storage group copies
- Mounting and dismounting databases
- Verifying the integrity of a storage group copy
- Recovering from corruption in a production storage group or a
storage group copy
- Restoring CCR after experiencing a failure or some form of data
corruption
Except for the recovery storage group, which is a special type of storage group, all storage groups and databases in a CCR environment are automatically enabled for continuous replication. Although replication and replay can be suspended, disabling continuous replication for one or more storage groups in a CCR environment is not possible because this would allow an outage to prevent access to particular databases.
When you create a new storage group in a CCR environment, seeding of the copy of the database on the passive node should occur automatically. If for some reason seeding does not automatically occur, you must manually seed the database copy. For detailed steps about how to seed a database copy, see How to Seed a Cluster Continuous Replication Copy.
Moving the Location of Storage Group Files or a Database
It may be necessary to move the location of storage group files or the location of a database in a CCR environment. The time it takes to move the file locations depends on the size of the database being moved, the number of transaction log files being moved, and the performance characteristics of the storage. During any move, the database will be dismounted.
In a CCR environment, relocating a storage group requires that both copies be relocated in a consistent way because the location of files on both the active node and the passive node must be the same. Before a storage group or its database can be moved, you must dismount the database and suspend replication. For the active copy, you can dismount the database by using the Dismount-Database cmdlet in the Exchange Management Shell. To suspend replication to the passive copy, and to resume it after the move, use the Suspend-StorageGroupCopy cmdlet and the Resume-StorageGroupCopy cmdlet.
Note: The Microsoft Exchange Replication service is constantly monitoring both the files in the copy location and the logs on the active node. Thus, if you manipulate active logs in any way, you must suspend activity of that storage group by using the Suspend-StorageGroupCopy cmdlet, which halts replication.
For detailed steps about how to move the location of storage group files in a CCR environment, see How to Move a Storage Group in a CCR Environment. For detailed steps about how to move the location of a database in a CCR environment, see How to Move a Database in a CCR Environment.
Viewing the Status of Storage Group Copies
In the release to manufacturing (RTM) version of Microsoft Exchange 2007, you can only view CCR status information by using the Exchange Management Shell. In Exchange 2007 SP1, some of the status information listed in the following table can be viewed in the Exchange Management Console.
Exchange 2007 publishes a variety of status information for storage group copies. The following table describes the status information that is available. In the following table, the attributes are listed in the order in which they appear in the complete output of the Get-StorageGroupCopyStatus cmdlet. For detailed steps about viewing status information, see How to View the Status of a Storage Group in a CCR Environment.
Status information available for CCR-enabled storage groups
Attribute | Description |
---|---|
Identity | Identity of the queried storage group. This attribute gives the <ServerName>\<StorageGroupName>. |
StorageGroupName | Name of the queried storage group. This attribute gives the storage group name. |
SummaryCopyStatus | Current overall status of the passive copy. Possible values are: Exchange 2007 SP1 adds the following additional status values: |
Failed | Indicates that verification of the database or logs identified an inconsistency that prevents replication, or that there is a configuration or access problem with the active or passive copy. Possible values are True and False. |
FailedMessage | Textual message that identifies the condition that caused replication to fail. It may not be the only replication problem area. |
Seeding | Indicates that seeding is in progress. Possible values are True and False. |
Suspend | Indicates that replication has been halted for the passive copy. This state prevents the database from advancing, and logs from being copied. Possible values are True and False. |
SuspendComment | Optional comment area in which an administrator can provide a reason or note as to why replication activity was halted. |
CopySuspend | Indicates that log copying has been halted for the passive copy. This prevents the log copy directory from changing. Possible values are True and False. |
CopySuspendComment | Optional administrator comment providing a reason or note as to why log copy activity was halted. |
CopyQueueLength | Number of transaction log files waiting to be copied to the passive copy log file folder. A copy is not considered complete until it has been checked for corruption. |
ReplayQueueLength | Number of transaction log files waiting to be replayed into the passive copy. |
LatestAvailableLogTime | Time stamp on the source storage group of the most recently detected new transaction log file. |
LastCopyNotificationedLogTime | Time associated with the last new log generated by the active storage group and known to the copy. |
LastCopiedLogTime | Time stamp on the source storage group of the last successful copy of a transaction log file. |
LastInspectedLogTime | Time stamp on the target storage group of the last successful inspection of a transaction log file. |
LastReplayedLogTime | Time stamp on the target storage group of the last successful replay of a transaction log file. |
LastLogGenerated | Last log generation number that was known to be generated on the active copy of the storage group. |
LastLogCopied | Last log generation number that was successfully copied to the passive copy log folder. |
LastLogNotified | Last log generation number that was generated by the active storage group and known to the copy. |
LastLogInspected | Last log generation number that was inspected for consistency and corruption. |
LastLogReplayed | Last log generation number that was successfully replayed into the passive copy of the database. |
LatestFullBackupTime | Time of the last full backup. |
LatestIncrementalBackupTime | Time of the last incremental backup. |
SnapshotBackup | Indicates whether the last full backup taken was a legacy streaming backup or a Volume Shadow Copy Service (VSS) backup snapshot. |
SnapshotLatestFullBackup | Time of the last snapshot full backup. |
SnapshotLatestIncrementalBackup | Time of the last snapshot incremental backup. |
SnapshotLatestDifferentialBackup | Time of the last snapshot differential backup. |
SnapshotLatestCopyBackup | Time of the last snapshot copy backup. |
OutstandingDumpsterRequests | Outstanding requests and the time range (low-high) for the outstanding requests. |
DumpsterStatistics | Transport dumpster statistics from all accessible Hub Transport servers. This value is displayed only when the DumpsterStatistics parameter is used with the Get-StorageGroupCopyStatus cmdlet. |
DumpsterStatisticsNotAvailable | List of inaccessible Hub Transport servers. |
You can quickly assess the health of a storage group copy by looking at the values for SummaryCopyStatus, CopyQueueLength, ReplayQueueLength, and LastInspectedLogTime. These attributes show whether the storage group copy is functioning correctly and whether the storage group copy is relatively up to date in both copying and replaying logs. If the following conditions occur, you should determine the cause and correct the problem:
- Copy is not in a healthy state.
- Copy queue length is more than 5.
- Replay queue length is more than 20.
- Last inspected log time is not a recent time. Inactivity on the
storage group could cause this situation, but it could also
indicate the Microsoft Exchange Replication service is
stopped.
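The four checks above can be sketched as a small function. This is a minimal Python sketch that assumes the status values have already been collected (for example, from the output of the Get-StorageGroupCopyStatus cmdlet) into a dictionary keyed by the attribute names in the table. The function name is hypothetical, and because this topic does not define what counts as "recent," the max_inspection_age threshold is an assumption supplied by the caller.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List

COPY_QUEUE_LIMIT = 5     # threshold from the list above
REPLAY_QUEUE_LIMIT = 20  # threshold from the list above


def find_copy_problems(
    status: Dict[str, Any], now: datetime, max_inspection_age: timedelta
) -> List[str]:
    """Return a description of each condition that should be investigated."""
    problems = []
    if status["SummaryCopyStatus"] != "Healthy":
        problems.append("copy is not in a healthy state")
    if status["CopyQueueLength"] > COPY_QUEUE_LIMIT:
        problems.append("copy queue length is more than 5")
    if status["ReplayQueueLength"] > REPLAY_QUEUE_LIMIT:
        problems.append("replay queue length is more than 20")
    if now - status["LastInspectedLogTime"] > max_inspection_age:
        problems.append("last inspected log time is not recent")
    return problems


# Invented sample values for illustration.
now = datetime(2009, 6, 10, 12, 0)
sample = {
    "SummaryCopyStatus": "Healthy",
    "CopyQueueLength": 8,
    "ReplayQueueLength": 3,
    "LastInspectedLogTime": now - timedelta(minutes=2),
}
print(find_copy_problems(sample, now, max_inspection_age=timedelta(minutes=30)))
```

An empty result means none of the four conditions was met. As noted above, a stale inspection time on its own may simply reflect an inactive storage group.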
You can calculate the two queue numbers in units of time as follows:
- Copy queue in time = LatestAvailableLogTime – LastCopiedLogTime
- Replay queue in time = LastCopiedLogTime – LastInspectedLogTime
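The two calculations above can be sketched with ordinary date arithmetic. This is a minimal Python sketch; the function name is hypothetical, the time stamps are invented sample values, and the copied-log time is read from the LastCopiedLogTime attribute listed in the status table.

```python
from datetime import datetime, timedelta
from typing import Dict, Tuple


def queue_lags(status: Dict[str, datetime]) -> Tuple[timedelta, timedelta]:
    """Return (copy queue in time, replay queue in time) using the formulas above."""
    copy_lag = status["LatestAvailableLogTime"] - status["LastCopiedLogTime"]
    replay_lag = status["LastCopiedLogTime"] - status["LastInspectedLogTime"]
    return copy_lag, replay_lag


# Invented sample time stamps for illustration.
sample_times = {
    "LatestAvailableLogTime": datetime(2009, 6, 10, 12, 10),
    "LastCopiedLogTime": datetime(2009, 6, 10, 12, 7),
    "LastInspectedLogTime": datetime(2009, 6, 10, 12, 5),
}
copy_lag, replay_lag = queue_lags(sample_times)
print(copy_lag, replay_lag)
```

Expressing the queues in time rather than in log counts can make it easier to compare storage groups with very different log generation rates.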
The replay queue length and copy queue length values are available as performance counters. They are the CopyQueueLength and ReplayQueueLength performance counters under the Microsoft Exchange Replication service performance object.
There are some rare scenarios where the replication status can be misleading. The following is a list of those scenarios:
- A storage group that is not active (that is, not changing) can
report as healthy when it might not be healthy. This situation
could occur because the unhealthy condition could not be detected
until a log is replayed.
- During replication initialization, the replication status is
being evaluated and may not be accurate. When the initialization
completes, the status is updated.
- The value of the LastLogGenerated field can be wrong
when a database is dismounted. However, all logs with end-user
content are replicated if the storage group copy is
replicating.
- When there are one or more missing logs in the middle of a log
stream, the passive copy continues to try to recover. In doing so,
the replication status switches between failed and healthy states.
The replay and copy queues will continue to grow.
- Under some very rare conditions, a log can be successfully
verified but it can still fail to replay. In this situation, the
system will alternate between failed and healthy states as it
attempts to recover. The replay and copy queues will continue to
grow.
Note: In Exchange 2007 SP1, you can also use a new cmdlet called Test-ReplicationHealth to verify the health and status of storage groups enabled for continuous replication. For more information about the Test-ReplicationHealth cmdlet, see Test-ReplicationHealth.
Mounting and Dismounting Databases
It may occasionally be necessary to mount or dismount databases in a CCR environment. This could be required to perform a reconfiguration or to correct issues with the server or database. When the database is dismounted, it is frozen from further changes. Neither the database nor the log files are changed while the database is dismounted.
For more information about mounting databases in a CCR environment, see How to Mount a Database in a CCR Environment. For more information about dismounting databases in a CCR environment, see How to Dismount a Database in a CCR Environment.
Verifying the Integrity of a Storage Group Copy
When you use CCR, we recommend that you verify the integrity of the passive copy periodically by running a physical consistency check against the database and transaction log files. A physical consistency check examines the transaction logs and database files for corruption. You can perform the check by using Exchange Server Database Utilities (Eseutil.exe). For detailed steps about how to use Eseutil to check the transaction logs and database files for physical corruption, see How to Verify a Cluster Continuous Replication Copy.
Note: Before you run a physical consistency check against a database, you must temporarily suspend replication activity against the storage group copy. You can suspend transaction log replay activity by using the Suspend-StorageGroupCopy cmdlet in the Exchange Management Shell. When the consistency check has completed, you can resume transaction log replay activity by using the Resume-StorageGroupCopy cmdlet.
Recovering from Corruption in a CCR Environment
CCR enables you to recover from corruption or failures in a production storage group by initiating a scheduled outage. If the log files are not corrupt, no data loss should occur because of the recovery. However, if the log files are not available, the recovery can only bring the storage group back to a point in time that is consistent with the last set of changes that the copy received that are not corrupted. An additional constraint is that there cannot be any missing or corrupted change data earlier than that point in time.
For detailed steps that explain how to recover from corruption or failures in a CCR environment, see the following topics:
Restoring CCR After a Failure or Corruption Occurs
CCR provides functionality to automatically recover after a failure. However, there are still cases where manual recovery is required. Those cases are:
- Database file is corrupted on the passive copy. For detailed steps that explain how to restore CCR after database corruption occurs, see How to Restore After Database Corruption Occurs.
- Database or a log volume has failed on the passive copy. For detailed steps that explain how to restore CCR after a volume failure occurs, see How to Restore After a Volume Failure.
- Database has failed or is diverged. CCR detects and reports when database divergence has occurred as a result of a failure. In general, this occurs when a database copy is made available and the failed database copy has more changes than the acceptable automatic mount criteria allows for. For detailed steps that explain how to restore CCR after a database failure or divergence occurs, see How to Restore CCR Functionality After a Failure or Divergence.