Planning for High Availability and Site Resilience

Applies to: Exchange Server 2010 SP3, Exchange Server 2010 SP2

Topic Last Modified: 2013-01-09

Microsoft Exchange Server 2010 includes a new unified framework for mailbox resiliency that includes new features such as the database availability group (DAG) and mailbox database copies. Although deploying these new features is a quick and simple process, careful planning must be performed beforehand to ensure that any high availability and site resilient solution using these features meets your expectations and your business requirements.

During the planning phase, the system architects, administrators, and other key stakeholders should identify the requirements for the deployment; in particular the requirements for high availability and site resilience. There are general requirements that must be met for deploying these features, as well as hardware, software, and networking requirements that must also be met. For guidance on the storage requirements for DAGs, see Mailbox Server Storage Design.

Contents

General Requirements

Hardware Requirements

Storage Requirements

Software Requirements

Network Requirements

Witness Server Requirements

Planning for Site Resilience

Planning for Datacenter Switchovers

General Requirements

Before deploying a DAG and creating mailbox database copies, make sure that the following system-wide recommendations are met:

Domain Name System (DNS) must be running. Ideally, the DNS server should accept dynamic updates. If the DNS server doesn't accept dynamic updates, you must create a DNS host (A) record for each Exchange server. Otherwise, Exchange won't function properly.
Each Mailbox server in a DAG must be a member server in the same domain.
It isn't supported to add an Exchange 2010 Mailbox server that's also a directory server to a DAG.
The name that you assign to the DAG must be a valid, available, and unique computer name of 15 characters or less.
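
As a rough pre-check of the naming rule above, a helper such as the following could be run in any Windows PowerShell session. The function name Test-DagNameCandidate is hypothetical. The character set it accepts (letters, digits, and hyphens) is a deliberately conservative assumption rather than the complete computer-name rule, and the check doesn't verify that the name is unused in Active Directory or DNS.

```powershell
# Hypothetical helper: conservative syntax check for a proposed DAG name.
# Assumption: only letters, digits, and hyphens are accepted here, which is
# stricter than the complete Windows computer-name rules.
function Test-DagNameCandidate {
    param([Parameter(Mandatory = $true)][string]$Name)

    # 1 to 15 characters, not starting or ending with a hyphen
    return ($Name -match '^[A-Za-z0-9]([A-Za-z0-9-]{0,13}[A-Za-z0-9])?$')
}

Test-DagNameCandidate -Name 'DAG1'                  # True
Test-DagNameCandidate -Name 'DAG-REDMOND-PORTLAND'  # False (20 characters)
```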

Hardware Requirements

Generally, there are no special hardware requirements that are specific to DAGs or mailbox database copies. The servers used must meet all of the requirements set forth in the topics for Exchange 2010 Prerequisites and Exchange 2010 System Requirements. For hardware planning information, see the following topics:

Storage Requirements

Generally, there are no special storage requirements that are specific to DAGs or mailbox database copies. DAGs don't require or use cluster-managed shared storage. Cluster-managed shared storage is supported for use in a DAG only when the DAG is configured to use a solution that leverages the Third Party Replication API built into Exchange 2010. For storage planning information, see Mailbox Server Storage Design.

Software Requirements

DAGs are available in both Exchange 2010 Standard Edition and Exchange 2010 Enterprise Edition. In addition, a DAG can contain a mix of servers running Exchange 2010 Standard Edition and Exchange 2010 Enterprise Edition.

Each member of the DAG must also be running the same operating system. Exchange 2010 is supported on both the Windows Server 2008 and Windows Server 2008 R2 operating systems. All members of a DAG must run either Windows Server 2008 or Windows Server 2008 R2. They can't contain a combination of both Windows Server 2008 and Windows Server 2008 R2.

In addition to meeting the prerequisites for installing Exchange 2010, there are operating system requirements that must be met. DAGs use Windows Failover Clustering technology, and as a result, they require the Enterprise version of Windows.

Network Requirements

There are specific networking requirements that must be met for each DAG and for each DAG member. DAG networks are similar to the public, mixed, and private networks used in previous versions of Exchange. However, unlike previous versions, using a single network in each DAG member is a supported configuration. In addition, the terminology has changed somewhat. Instead of public, private or mixed networks, each DAG has a single MAPI network, which is used by other servers (e.g., other Exchange 2010 servers, directory servers, etc.) to communicate with the DAG member, and zero or more Replication networks, which are networks that are dedicated to log shipping and seeding.

Although a single network is supported, we recommend that each DAG have at least two networks: a single MAPI network and a single Replication network. This provides redundancy for the network and the network path, and enables the system to distinguish between a server failure and a network failure. Using a single network adapter prevents the system from distinguishing between these two types of failures.

Note:
The product documentation in this content area is written with the assumption that each DAG member contains at least two network adapters, that each DAG is configured with a MAPI network and at least one Replication network, and that the system is able to distinguish between a network failure and a server failure.

Consider the following when designing the network infrastructure for your DAG:

Each member of the DAG must have at least one network adapter that is able to communicate with all other DAG members. If you are using a single network path, we recommend that you use gigabit Ethernet. When using a single network adapter in each DAG member, the DAG network does need to be enabled for replication and should be configured as a MAPI network. Because there are no other networks, the system will use the MAPI network as a Replication network, as well. In addition, when using a single network adapter in each DAG member, we recommend that you design the overall solution with the single network adapter and path in mind.
Using two network adapters in each DAG member provides you with one MAPI network and one Replication network, and the following recovery behaviors:
- In the event of a failure affecting the MAPI network, a server failover will occur (assuming there are healthy mailbox database copies that can be activated).
- In the event of a failure affecting the Replication network, if the MAPI network is unaffected by the failure, log shipping and seeding operations will revert to use the MAPI network, even if the MAPI network has its ReplicationEnabled property set to False. When the failed Replication network is restored to health and ready to resume log shipping and seeding operations, you must manually switch over to the Replication network. To change replication from the MAPI network to a restored Replication network, you can either suspend and resume continuous replication by using the Suspend-MailboxDatabaseCopy and Resume-MailboxDatabaseCopy cmdlets, or restart the Microsoft Exchange Replication service. We recommend using the suspend and resume operations to avoid the brief outage caused by restarting the Microsoft Exchange Replication service.
Each DAG member must have the same number of networks. For example, if you plan on using a single network adapter in one DAG member, then all members of the DAG must also use a single network adapter.
Each DAG must have no more than one MAPI network. The MAPI network must provide connectivity to other Exchange servers and other services, such as Active Directory and DNS.
Additional Replication networks can be added, as needed. You can also prevent an individual network adapter from being a single point of failure by using network adapter teaming or similar technology. However, even when using teaming, this does not prevent the network itself from being a single point of failure.
Each network in each DAG member server must be on its own network subnet. Each server in the DAG can be on a different subnet, but the MAPI and Replication networks must be routable and provide connectivity, such that:
- Each network in each DAG member server is on its own network subnet that's separate from the subnet used by each other network in the server.
- Each DAG member server's MAPI network can communicate with each other DAG member's MAPI network.
- Each DAG member server's Replication network can communicate with each other DAG member's Replication network.
- There is no direct routing that allows heartbeat traffic from the Replication network on one DAG member server to the MAPI network on another DAG member server, or vice versa, or between multiple Replication networks in the DAG.
Regardless of their geographic location relative to other DAG members, each member of the DAG must have round trip network latency no greater than 500 milliseconds (ms) between each other member. As the round trip latency between two mailbox servers hosting copies of a database increases, the potential for replication being not up-to-date also increases. Regardless of the latency of the solution, customers should validate that the network(s) between all DAG members is capable of satisfying the data protection and availability goals of the deployment. Configurations with higher latency values may require special tuning of DAG, replication and network parameters, such as increasing the number of databases or decreasing the number of mailboxes per database, to achieve the desired goals.
Round trip latency requirements may not be the most stringent network bandwidth and latency requirement for a multi-datacenter configuration. You must evaluate the total network load, which includes client access, Active Directory, transport, continuous replication, and other application traffic, to determine the necessary network requirements for your environment.
DAG networks support Internet Protocol Version 4 (IPv4) and IPv6. IPv6 is supported only when IPv4 is also used; a pure IPv6 environment isn't supported. Using IPv6 addresses and IP address ranges is supported only when both IPv6 and IPv4 are enabled on that computer, and the network supports both IP address versions. If Exchange 2010 is deployed in this configuration, all server roles can send data to and receive data from devices, servers, and clients that use IPv6 addresses.
Automatic Private IP Addressing (APIPA) is a feature of Microsoft Windows that automatically assigns IP addresses when no Dynamic Host Configuration Protocol (DHCP) server is available on the network. APIPA addresses (including manually assigned addresses from the APIPA address range) aren't supported for use by DAGs or by Exchange 2010.
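
The recovery behavior described in the list above, where replication stays on the MAPI network until it is manually moved back, might be handled in the Exchange Management Shell along the following lines. The DAG name DAG1, the database name DB1, and the server name MBX2 are placeholder assumptions; substitute the names from your own deployment.

```powershell
# Run in the Exchange Management Shell. DAG1, DB1, and MBX2 are placeholders.

# Review the DAG networks and which of them are enabled for replication
Get-DatabaseAvailabilityGroupNetwork -Identity DAG1 |
    Format-List Name, Subnets, Interfaces, MapiAccessEnabled, ReplicationEnabled

# After the Replication network is healthy again, suspend and resume continuous
# replication for a passive copy so that log shipping can move back to it
Suspend-MailboxDatabaseCopy -Identity 'DB1\MBX2' -SuspendComment 'Move replication back to the Replication network' -Confirm:$false
Resume-MailboxDatabaseCopy -Identity 'DB1\MBX2'

# Confirm that the copy is healthy and that the queues are draining
Get-MailboxDatabaseCopyStatus -Identity 'DB1\MBX2' |
    Format-List Name, Status, CopyQueueLength, ReplayQueueLength
```

The suspend and resume steps are repeated for each passive database copy that is still replicating over the MAPI network.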

DAG Name and IP Address Requirements

During creation, each DAG is given a unique name, and either assigned one or more static IP addresses, or configured to use DHCP. Regardless of whether you use static or dynamically-assigned addresses, any IP address assigned to the DAG must be on the MAPI network.

Each DAG requires a minimum of one IP address on the MAPI network. A DAG requires additional IP addresses when the MAPI network is extended across multiple subnets. The following figure illustrates a DAG where all nodes in the DAG have the MAPI network on the same subnet.

In this example, the MAPI network in each DAG member is on the 172.19.18.x subnet. As a result, the DAG requires a single IP address on that subnet.

The next figure illustrates a DAG that has a MAPI network which extends across two subnets: 172.19.18.x and 172.19.19.x.

In this example, the MAPI network in each DAG member is on a separate subnet. As a result, the DAG requires two IP addresses, one for each subnet on the MAPI network.

Each time the DAG's MAPI network is extended across an additional subnet, an additional IP address for that subnet must be configured for the DAG. Each IP address that's configured for the DAG is assigned to and used by the DAG's underlying failover cluster. The name of the DAG is also used as the name for the underlying failover cluster.
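
As a minimal sketch of the two-subnet example above, the DAG could be created with one static IP address per MAPI subnet. The DAG name DAG1 and the host addresses ending in .10 are illustrative assumptions, as is the third subnet (172.19.20.x) used to show a later extension.

```powershell
# Exchange Management Shell. DAG1 and the host addresses are placeholders.

# The MAPI network spans two subnets, so one static address per subnet is supplied
New-DatabaseAvailabilityGroup -Name DAG1 -DatabaseAvailabilityGroupIpAddresses 172.19.18.10,172.19.19.10

# If the MAPI network is later extended to another subnet, the complete list of
# addresses is set again, including the new one (172.19.20.10 is illustrative)
Set-DatabaseAvailabilityGroup -Identity DAG1 -DatabaseAvailabilityGroupIpAddresses 172.19.18.10,172.19.19.10,172.19.20.10
```

When the DatabaseAvailabilityGroupIpAddresses parameter is omitted at creation time, the DAG is expected to request its address through DHCP instead; verify this against the cmdlet help for your service pack level.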

At any specific time, the cluster for the DAG will use only one of the assigned IP addresses. Windows Failover Clustering registers this IP address in DNS when the cluster IP address and Network Name resources are brought online. In addition to using an IP address and network name, a cluster name object (CNO) is created in Active Directory. The name, IP address and CNO for the cluster are used internally by the system to secure the DAG and for internal communication purposes. Administrators and end-users don't need to interface with or connect to the DAG name or IP address.

Note:
Although the cluster's IP address and network name are used internally by the system, there is no hard dependency in Exchange 2010 that these resources be available. Even if the underlying cluster's IP Address and Network Name resources are offline, internal communication still occurs within the DAG by using the DAG member's server names. However, we recommend that you periodically monitor the availability of these resources to ensure that they aren't offline for more than 30 days. If the underlying cluster is offline for more than 30 days, the cluster CNO account may be invalidated by the garbage collection mechanism in Active Directory.


Network Adapter Configuration for DAGs

Each network adapter must be configured properly based on its intended use. A network adapter that's used for a MAPI network is configured differently from a network adapter that's used for a Replication network. In addition to configuring each network adapter correctly, you must also configure the network connection order in Windows so that the MAPI network is at the top of the connection order. For detailed steps about how to modify the network connection order, see Modify the Protocol Bindings Order.

MAPI Network Adapter Configuration

A network adapter intended for use by a MAPI network should be configured as described in the following table.

Networking Features	Setting
Client for Microsoft Networks	Enabled
QoS Packet Scheduler	Optionally enable
File and Printer Sharing for Microsoft Networks	Enabled
Internet Protocol Version 6 (TCP/IP v6)	Optionally enable
Internet Protocol Version 4 (TCP/IP v4)	Enabled
Link-Layer Topology Discovery Mapper I/O Driver	Enabled
Link-Layer Topology Discovery Responder	Enabled

The TCP/IP v4 properties for a MAPI network adapter are configured as follows:

The IP address for a DAG member’s MAPI network can be manually assigned or configured to use DHCP. If DHCP is used, we recommend using persistent reservations for the server's IP address.
The MAPI network typically uses a default gateway, although one isn't required.
At least one DNS server address must be configured. Using multiple DNS servers is recommended for redundancy.
The Register this connection's addresses in DNS checkbox should be checked.

Replication Network Adapter Configuration

A network adapter intended for use by a Replication network should be configured as described in the following table.

Networking Features	Setting
Client for Microsoft Networks	Disabled
QoS Packet Scheduler	Optionally enable
File and Printer Sharing for Microsoft Networks	Disabled
Internet Protocol Version 6 (TCP/IP v6)	Optionally enable
Internet Protocol Version 4 (TCP/IP v4)	Enabled
Link-Layer Topology Discovery Mapper I/O Driver	Enabled
Link-Layer Topology Discovery Responder	Enabled

The TCP/IP v4 properties for a Replication network adapter are configured as follows:

The IP address for a DAG member’s Replication network can be manually assigned or configured to use DHCP. If DHCP is used, we recommend using persistent reservations for the server's IP address.
Replication networks typically do not have default gateways, and if the MAPI network has a default gateway, then no other networks should have default gateways. Routing of network traffic on a Replication network can be configured by using persistent, static routes to the corresponding network on other DAG members using gateway addresses that have the ability to route between the Replication networks. All other traffic not matching this route will be handled by the default gateway that's configured on the adapter for the MAPI network.
DNS server addresses should not be configured.
The Register this connection's addresses in DNS checkbox should not be checked.
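
The persistent static routes mentioned above can be added with netsh. The following is a sketch only: the remote Replication subnet (10.0.2.0/24), the local connection name ("Replication"), and the gateway address (10.0.1.254) are all assumed values for illustration.

```powershell
# Run from an elevated prompt on a DAG member. All values are illustrative:
#   10.0.2.0/24   = Replication subnet used by the remote DAG members
#   "Replication" = name of the local Replication network connection
#   10.0.1.254    = local gateway that can route between the Replication subnets
netsh interface ipv4 add route 10.0.2.0/24 "Replication" 10.0.1.254

# List the routes to confirm the new entry
netsh interface ipv4 show route
```

Routes added this way are stored persistently by default, so they survive a restart. A matching route in the opposite direction is needed on the DAG members in the other subnet.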


Witness Server Requirements

A witness server is a server outside of a DAG that's used to achieve and maintain quorum when the DAG has an even number of members. DAGs with an odd number of members do not use a witness server. All DAGs with an even number of members will use a witness server. The witness server can be any computer running Windows Server. There is no requirement that the version of the Windows Server operating system of the witness server match the operating system used by the DAG members.

Quorum is maintained at the cluster level, underneath the DAG. A DAG has quorum when the majority of its members are online and can communicate with the other online members of the DAG. This notion of quorum is one aspect of the concept of quorum in Windows failover clustering. A related and necessary aspect to quorum in failover clusters is the quorum resource. The quorum resource is a resource inside a failover cluster that provides a means for arbitration leading to cluster state and membership decisions. The quorum resource also provides persistent storage for storing configuration information. A companion to the quorum resource is the quorum log, which is a configuration database for the cluster. The quorum log contains information such as which servers are members of the cluster, what resources are installed in the cluster, and the state of those resources (for example, online or offline).

It is critical that each DAG member have a consistent view of how the DAG's underlying cluster is configured. The quorum acts as the definitive repository for all configuration information relating to the cluster. The quorum is also used as a tie-breaker to avoid “split-brain” syndrome. Split-brain syndrome is a condition that occurs when DAG members cannot communicate with each other but are up and running. Split-brain syndrome is prevented by always requiring a majority of the DAG members (and in the case of DAGs with an even number of members, the DAG witness server) to be available and interacting for the DAG to be operational.
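
The majority rule described above can be illustrated with a small calculation. The function name Get-DagQuorumRequirement is hypothetical and is not an Exchange cmdlet; it only restates the rule that the witness server counts as an extra voter when the DAG has an even number of members.

```powershell
# Hypothetical helper illustrating the majority rule described above.
function Get-DagQuorumRequirement {
    param([Parameter(Mandatory = $true)][int]$MemberCount)

    # A witness server takes part only when the member count is even
    $usesWitness = ($MemberCount % 2 -eq 0)
    $totalVotes  = if ($usesWitness) { $MemberCount + 1 } else { $MemberCount }

    # A majority is more than half of all voters
    $votesNeeded = [int][math]::Floor($totalVotes / 2) + 1

    New-Object PSObject -Property @{
        MemberCount = $MemberCount
        UsesWitness = $usesWitness
        TotalVotes  = $totalVotes
        VotesNeeded = $votesNeeded
    }
}

Get-DagQuorumRequirement -MemberCount 4   # TotalVotes 5, VotesNeeded 3
Get-DagQuorumRequirement -MemberCount 3   # TotalVotes 3, VotesNeeded 2
```

For example, a four-member DAG remains operational with two members and the witness server available, but not with two members alone.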

Planning for Site Resilience

Every day, more and more businesses recognize that access to a reliable and available messaging system is fundamental to their success. For many organizations, the messaging system is part of the business continuity plans, and their messaging service deployment is designed with site resilience in mind. Fundamentally, many site resilient solutions involve the deployment of hardware in a second datacenter.

Ultimately, the overall design of a DAG, including the number of DAG members and the number of mailbox database copies, will depend on each organization's recovery service level agreements (SLAs) that cover various failure scenarios. During the planning stage, the solution's architects and administrators identify the requirements for the deployment, including in particular the requirements for site resilience. They identify the location(s) to be used and the required recovery SLA targets. The SLA will identify two specific elements that should be the basis for the design of a solution that provides high availability and site resilience: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). Both of these values are measured in minutes. The RTO is how long it takes to restore service. The RPO refers to how current the data is after the recovery operation has completed. An SLA may also be defined for restoring the primary datacenter to full service after its problems are corrected.

The solution's architects and administrators will also identify which set of users require site resilience protection, and determine whether the multi-site solution will be an active/passive or an active/active configuration. In an active/passive configuration, no users are normally hosted in the standby datacenter. In an active/active configuration, users are hosted in both locations, and some percentage of the total number of databases within the solution has a preferred active location in a second datacenter. When service for the users of one datacenter fails, those users are activated in the other datacenter.

Constructing the appropriate SLAs often requires answering the following basic questions:

What level of service is required after the primary datacenter fails?
Do users need their data or just messaging services?
How rapidly is data required?
How many users must be supported?
How will users access their data?
What is the standby datacenter activation service level agreement (SLA)?
How is service moved back to the primary datacenter?
Are the resources dedicated to the site resilience solution?

By answering these questions, you begin to shape a site resilient design for your messaging solution. A core requirement of recovery from site failure is to create a solution that gets the necessary data to the backup datacenter that hosts the backup messaging service.

Namespace Planning

Exchange 2010 changes the way in which you plan your namespace design when deploying a site resilient configuration. Proper namespace planning is essential in order for datacenter switchovers to be successful. From a namespace perspective, each datacenter used in a site resilience configuration is considered to be active. As a result, each datacenter will require its own unique namespace for the various Exchange 2010 services in that site, including namespaces for Outlook Web App, Outlook Anywhere, Exchange ActiveSync, Exchange Web Services, RPC Client Access, Post Office Protocol version 3 (POP3), Internet Message Access Protocol version 4 (IMAP4), and Simple Mail Transfer Protocol (SMTP). In addition, one of the datacenters also hosts the namespace for Autodiscover. This design also enables you to perform a single database switchover from the primary datacenter to a second datacenter to validate the configuration of the second datacenter, as part of validating and practicing a datacenter switchover.

As a best practice, we recommend that you use split DNS for the Exchange hostnames that are used by clients. Split DNS refers to a DNS server configuration in which internal DNS servers return an internal IP address for a hostname and external (Internet-facing) DNS servers return a public IP address for the same hostname. Because using split DNS uses the same hostnames internally and externally, this strategy enables you to minimize the number of hostnames you'll need.

The following figure illustrates namespace planning for a site resilient configuration.

Namespaces for site resilient DAG deployment

As shown above, each datacenter uses a separate and unique namespace and each contains DNS servers in a split DNS configuration for those namespaces. The Redmond datacenter, which is considered the primary datacenter, is configured with a namespace of protocol.contoso.com. The Portland datacenter is configured with a namespace of protocol.standby.contoso.com. Namespaces can include designations of standby, as in the example figure, they can be based on region (e.g., protocol.portland.contoso.com), or they can be based on other naming conventions that suit your organization's needs. The key requirement is that, regardless of the naming convention you use, each datacenter should have its own unique namespace.
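
To make the per-datacenter namespaces concrete, the following sketch sets a different external URL on the Outlook Web App virtual directory in each datacenter. The server names CAS-RED1 and CAS-PDX1 are placeholders, and "mail" is an assumed host label standing in for the protocol-specific name in each namespace.

```powershell
# Exchange Management Shell. Server names and the "mail" host label are placeholders.

# Primary datacenter (Redmond): namespace under contoso.com
Set-OwaVirtualDirectory -Identity 'CAS-RED1\owa (Default Web Site)' -ExternalUrl 'https://mail.contoso.com/owa'

# Second datacenter (Portland): namespace under standby.contoso.com
Set-OwaVirtualDirectory -Identity 'CAS-PDX1\owa (Default Web Site)' -ExternalUrl 'https://mail.standby.contoso.com/owa'

# Review the result across all Client Access servers
Get-OwaVirtualDirectory | Format-Table Server, ExternalUrl -AutoSize
```

The same pattern applies to the other client protocols, each through its own virtual directory or service configuration cmdlet.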

FailbackURL Configuration

Some Web browsers, including Microsoft Internet Explorer, maintain a DNS name cache during each browser session that is separate from the DNS cache provided by the operating system. During failback to the primary datacenter after a datacenter switchover has occurred, the Web browser's use of this separate cache can result in logon loops for Outlook Web App users wherein users are redirected to the same URL in a repeating loop.

During the failback process, the IP address for the Outlook Web App namespace is changed in DNS from an endpoint in the standby datacenter back to its original endpoint in the primary datacenter. After the TTL for the DNS record has expired and even after the operating system's DNS cache is cleared, Web browsers that maintain their own separate name cache may continue to connect to the endpoint in the standby datacenter, even though the namespace is hosted in the primary datacenter.

Typically, closing the Web browser is sufficient to clear its separate name cache and prevent the logon loops. However, to mitigate this issue for all Web browsers and Outlook Web App users, you can configure the FailbackURL property of your Outlook Web App virtual directory. The FailbackUrl parameter specifies the host name that Outlook Web App uses to connect to the Client Access server after failback to a primary site. This namespace requires a separate DNS entry pointing to the original Client Access server's IP address. The value of the FailbackUrl parameter must be different from the value of the ExternalUrl parameter for the Outlook Web App virtual directory. When an Outlook Web App user provides their credentials, the Client Access server will detect if the redirection URL is the same URL the user is visiting. If the URLs are the same, the Client Access server will check to see if the FailbackUrl parameter is configured:

If the FailbackUrl parameter is configured, it will redirect the user to that URL where they should be able to access Outlook Web App.
If the FailbackUrl parameter is not configured, the user will receive an error message that indicates that a server configuration change is temporarily preventing access to their mailbox. The message instructs the user to close all browser windows (thereby clearing the browser's name cache) and try again in a few minutes.
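
A minimal sketch of configuring the FailbackUrl parameter follows. The server name CAS-RED1 and the host name failback.contoso.com are assumptions for illustration; the host name needs its own DNS record that points to the original Client Access endpoint in the primary datacenter.

```powershell
# Exchange Management Shell. CAS-RED1 and the failback host name are placeholders.
# The FailbackUrl value must differ from the ExternalUrl value.
Set-OwaVirtualDirectory -Identity 'CAS-RED1\owa (Default Web Site)' -FailbackUrl 'https://failback.contoso.com/owa'

# Compare the two values to confirm that they are different
Get-OwaVirtualDirectory -Identity 'CAS-RED1\owa (Default Web Site)' |
    Format-List Identity, ExternalUrl, FailbackUrl
```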

Certificate Planning

There are no unique or special design considerations for certificates when deploying a DAG in a single datacenter. However, when extending a DAG across multiple datacenters in a site resilient configuration, there are some specific considerations with respect to certificates. Generally, your certificate design will depend on the clients in use, as well as the certificate requirements by other applications that use certificates. But there are some specific recommendations and best practices you should follow with respect to the type and number of certificates.

As a best practice, you should minimize the number of certificates you use for your Client Access servers, reverse proxy servers, and transport servers (Edge and Hub). We recommend using a single certificate for all of these service endpoints in each datacenter. This approach minimizes the number of certificates that are needed, which reduces both cost and complexity for the solution.

For Outlook Anywhere clients, we recommend that you use a single Subject Alternative Name (SAN) certificate for each datacenter, and include multiple host names in the certificate. To ensure Outlook Anywhere connectivity after a database, server or datacenter switchover, you must use the same Certificate Principal Name on each certificate, and configure the Outlook Provider Configuration object in Active Directory with the same Principal Name in Microsoft-Standard Form (msstd). For example, if you use a Certificate Principal Name of mail.contoso.com, you would configure the attribute as follows:

Set-OutlookProvider EXPR -CertPrincipalName "msstd:mail.contoso.com"

Some applications that integrate with Exchange have specific certificate requirements that may require using additional certificates. Exchange 2010 can co-exist with Office Communications Server (OCS). OCS requires 1024-bit or greater certificates that use the OCS server name for the Certificate Principal Name. Because using an OCS server name for the Certificate Principal Name would prevent Outlook Anywhere from working properly, you would need to use an additional and separate certificate for the OCS environment.

For more information about using SAN certificates for Exchange 2010 client access, see Configure SSL Certificates to Use Multiple Client Access Server Host Names.

Network Planning

In addition to the specific networking requirements that must be met for each DAG, as well as for each server that's a member of a DAG, there are some requirements and recommendations that are specific to site resilience configurations. As with all DAGs, whether the DAG members are deployed in a single site or in multiple sites, the round-trip network latency between DAG members must be no greater than 500 milliseconds (ms). In addition, there are specific configuration settings that are recommended for DAGs that are extended across multiple sites:

MAPI networks should be isolated from Replication networks: Windows network policies, Windows firewall policies or router access control lists (ACLs) should be used to block traffic between the MAPI network and the Replication network(s). This configuration is necessary to prevent network heartbeat cross-talk.
Client-facing DNS records should have a Time to Live (TTL) of 5 minutes: The amount of downtime that clients experience is dependent not just on how quickly a switchover can occur, but also on how quickly DNS replication occurs and how quickly the clients query for updated DNS information. DNS records for all Exchange client services, including Outlook Web App, Exchange ActiveSync, Exchange Web Services, Outlook Anywhere, SMTP, POP3, IMAP4, and RPC Client Access in both the internal and external DNS servers should be set with a TTL of 5 minutes.
Use static routes to configure connectivity across Replication networks: To provide network connectivity between each of the Replication network adapters, use persistent static routes. This is a quick and one-time configuration that is performed on each DAG member when using static IP addresses. If you are using DHCP to obtain IP addresses for your Replication networks, you can also use it to assign static routes for the Replication networks, thereby simplifying the configuration process.
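
The 500 ms round-trip limit mentioned above can be spot-checked from a DAG member with a simple ping test. This is a rough indicator only and assumes that ICMP echo is permitted between the members; MBX2 and MBX3 are placeholder server names.

```powershell
# Windows PowerShell on a DAG member. MBX2 and MBX3 are placeholder server names.
# Assumption: ICMP echo is permitted between the DAG members.
$otherMembers = 'MBX2', 'MBX3'

foreach ($server in $otherMembers) {
    # Test-Connection reports the round-trip time in milliseconds as ResponseTime
    $stats = Test-Connection -ComputerName $server -Count 10 |
        Measure-Object -Property ResponseTime -Average -Maximum

    New-Object PSObject -Property @{
        Server      = $server
        AverageMs   = [math]::Round($stats.Average, 1)
        MaximumMs   = $stats.Maximum
        WithinLimit = ($stats.Maximum -le 500)
    }
}
```

A short ping sample doesn't reflect latency under production load, so the result should be treated as a starting point for validation rather than proof that the requirement is met.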

General Site Resilience Planning

In addition to the requirements listed above for high availability, there are other recommendations for deploying Exchange 2010 in a site resilient configuration (e.g., extending a DAG across multiple datacenters). What you do during the planning phase will directly affect the success of your site resilience solution. For example, poor namespace design can cause difficulties with certificates, and an incorrect certificate configuration can prevent users from accessing services.

In order to minimize the time it takes to activate a second datacenter, and allow the second datacenter to host the service endpoints of a failed datacenter, the appropriate planning must be completed. For example:

The Service Level Agreement (SLA) goals for the site resilience solution must be well understood and documented.
The servers in the second datacenter must have sufficient capacity to host the combined user population of both datacenters.
The second datacenter must have all services enabled that are provided in the primary datacenter (unless the service isn't included as part of the site resilience SLA). This includes Active Directory, networking infrastructure (DNS, TCP/IP, etc.), telephony services (if Unified Messaging is in use), and site infrastructure (power, cooling, etc.).
In order for some services to be able to service users from the failed datacenter, they must have the proper server certificates configured. Some services (for example, POP3 and IMAP4) do not allow instancing and only allow the use of a single certificate. In these cases, either the certificate must be a subject alternative name (SAN) certificate that includes multiple names, or the multiple names must be similar enough so that a wildcard certificate can be used (assuming the security policies of the organization allow the use of wildcard certificates).
The necessary services must be defined in the second datacenter. For example, if the first datacenter has three different SMTP URLs on different transport servers, then the appropriate configuration must be defined in the second datacenter to enable at least one (if not all three) transport server(s) to host the workload.
The necessary network configuration must be in place to support the datacenter switchover. This might mean making sure that load balancing configurations are in place, that global DNS is configured, and that the Internet connection is enabled with the appropriate routing configured.
The strategy for enabling the DNS changes necessary for a datacenter switchover must be understood. The specific DNS changes, including their Time to Live (TTL) settings, must be defined and documented to support the SLA(s) in effect.
A strategy for testing the solution must also be established and factored into the SLA. Periodic validation of the deployment is the only way to guarantee that the quality and viability of the deployment does not degrade over time. After the deployment is validated, we recommend that the part of the configuration that directly affects the success of the solution be explicitly documented. In addition, we recommend that you enhance your change management processes around those segments of the deployment.
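The certificate item in the list above depends on whether the service names are "similar enough" for a wildcard certificate. The following Python sketch shows one way to check that during planning. It assumes the usual rule that a wildcard covers exactly one leftmost label, so every name must share the same parent domain. The function names and host names are hypothetical, and the sketch does not decide whether your security policy permits wildcard certificates.

```python
def wildcard_for(names):
    """Return '*.parent' if one wildcard certificate covers every name, else None."""
    parents = set()
    for name in names:
        label, _, parent = name.lower().rstrip(".").partition(".")
        # Require a leftmost label and a parent with at least two labels,
        # so a name such as 'contoso.com' never yields '*.com'.
        if not label or "." not in parent:
            return None
        parents.add(parent)
    return f"*.{parents.pop()}" if len(parents) == 1 else None


def certificate_plan(names):
    """Suggest a wildcard certificate when possible, otherwise a SAN certificate."""
    wildcard = wildcard_for(names)
    if wildcard:
        return {"type": "wildcard", "names": [wildcard]}
    return {"type": "SAN", "names": sorted({n.lower() for n in names})}


if __name__ == "__main__":
    # Hypothetical names (assumption): both share the parent contoso.com.
    print(certificate_plan(["pop.contoso.com", "pop-dr.contoso.com"]))
    # Different parents (contoso.com and dr.contoso.com): needs a SAN certificate.
    print(certificate_plan(["mail.contoso.com", "mail.dr.contoso.com"]))
```

Running a check like this against the planned namespace for both datacenters can surface a naming problem before certificates are requested.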

Planning for Datacenter Switchovers

Proper planning and preparation involve not only the deployment of the second datacenter resources, such as live Client Access and Hub Transport servers, but also pre-configuration of those resources to minimize the changes required as part of a datacenter switchover operation.

Note:
Client Access and Hub Transport services are required in the second datacenter even when automatic activation of the mailbox databases in the second datacenter is blocked. These services are necessary in order to perform database switchovers, as well as to perform testing and validation of the services and data in the second datacenter.

To better understand how the datacenter switchover process works, it's helpful to know the basic operation of an Exchange 2010 datacenter switchover.

As illustrated in the following figure, a site resilient deployment consists of a DAG that has members in both datacenters.

Database Availability Group Across Two Sites

When a DAG is extended across multiple datacenters, it should be designed so that either the majority of the DAG members are located in the primary datacenter or, when each datacenter has the same number of members, the primary datacenter hosts the witness server. This design guarantees that service will be provided in the primary datacenter even if network connectivity between the two datacenters fails. However, it also means that when the primary datacenter fails, quorum will be lost for the members in the second datacenter.
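The vote arithmetic behind this design can be sketched in Python. The model below assumes that each DAG member has one vote, that the witness server contributes a vote only when the DAG has an even number of members, and that the witness stays reachable from the members in its own datacenter. The function names are hypothetical, and this is a planning aid rather than a description of how the cluster service is implemented.

```python
def quorum_votes(primary_members, secondary_members):
    """Return (witness_vote, votes_needed) for a two-datacenter DAG."""
    members = primary_members + secondary_members
    # Assumption: the witness is used as a tie-breaker only for an even member count.
    witness_vote = 1 if members % 2 == 0 else 0
    total_votes = members + witness_vote
    return witness_vote, total_votes // 2 + 1


def site_with_quorum_after_link_failure(primary_members, secondary_members,
                                        witness_site="primary"):
    """Return 'primary', 'secondary', or None for a split between datacenters.

    witness_site must be 'primary' or 'secondary'.
    """
    witness_vote, needed = quorum_votes(primary_members, secondary_members)
    votes = {"primary": primary_members, "secondary": secondary_members}
    votes[witness_site] += witness_vote
    for site, count in votes.items():
        if count >= needed:
            return site
    return None


if __name__ == "__main__":
    # Two members per datacenter, witness in the primary datacenter.
    print(site_with_quorum_after_link_failure(2, 2, "primary"))
    # If the primary datacenter fails entirely, only the secondary votes remain.
    _, needed = quorum_votes(2, 2)
    print("secondary alone has quorum:", 2 >= needed)
```

With two members in each datacenter and the witness in the primary datacenter, three of five votes are needed. The primary side holds three votes during a link failure, while the second datacenter on its own holds only two, which is why it cannot come online automatically.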

Partial datacenter failures are also possible and will happen. The presumption is that if enough functionality is lost in the primary datacenter to preclude effective service and management, a datacenter switchover should be performed to activate the second datacenter. As part of the activation process, the administrator configures the surviving servers in the partially operational primary datacenter to cease service. Activation can then proceed in the second datacenter. This is done to prevent both sets of services from trying to operate at the same time.

Because quorum has been lost, the DAG members in the second datacenter cannot automatically come online. Thus, activating the Mailbox servers in the second datacenter also requires a step where the DAG member servers are forced to create quorum, at which point the servers in the failed datacenter are internally (but only temporarily) removed from the DAG. This provides a partial-service solution that's stable and can tolerate some additional failures and still continue to function.

Note:
One prerequisite for tolerating additional failures is that the DAG has at least four members and the four members are spread between two Active Directory sites (e.g., at least two members in each datacenter).

This is the basic process used to re-establish Mailbox role functionality in the second datacenter. The activation of the other roles in the second datacenter does not involve explicit actions on the impacted servers in the second datacenter. Instead, servers in the second datacenter become the service endpoints for those services normally hosted by the primary datacenter. For example, a user normally hosted in the primary datacenter might use https://mail.contoso.com/owa to connect to Outlook Web App. After the datacenter failure, these service endpoints are moved to endpoints in the second datacenter as part of the switchover operation. During the switchover operation, the service endpoints for the primary datacenter are re-targeted at alternate IP addresses for the same services in the second datacenter. This minimizes the number of changes that must be made to configuration information stored in Active Directory during the switchover process. Generally, there are two ways to complete this step:

Update DNS records; or
Reconfigure DNS and load balancer(s) to enable and disable alternate IP addresses, thus moving services between datacenters.
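To make the DNS option more concrete, the following Python sketch documents a switchover plan as data: each service name, its IP address in each datacenter, and its TTL. Apart from mail.contoso.com, every name, address, and TTL here is a hypothetical placeholder, and the sketch only describes the intended changes; it does not update any DNS server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str            # DNS name that clients use
    primary_ip: str      # address served from the primary datacenter
    secondary_ip: str    # alternate address in the second datacenter
    ttl_seconds: int     # TTL on the DNS record


def plan_switchover(endpoints):
    """Return (name, old_ip, new_ip) changes that retarget each name."""
    return [(e.name, e.primary_ip, e.secondary_ip) for e in endpoints]


def worst_case_cache_delay(endpoints):
    """Longest time a resolver that honors TTL may keep serving the old record."""
    return max((e.ttl_seconds for e in endpoints), default=0)


# Hypothetical values (assumption), using documentation-only address ranges.
ENDPOINTS = [
    Endpoint("mail.contoso.com", "192.0.2.10", "198.51.100.10", 300),
    Endpoint("smtp.contoso.com", "192.0.2.25", "198.51.100.25", 600),
]

if __name__ == "__main__":
    for name, old_ip, new_ip in plan_switchover(ENDPOINTS):
        print(f"{name}: {old_ip} -> {new_ip}")
    print("worst-case cache delay (seconds):", worst_case_cache_delay(ENDPOINTS))
```

Keeping the plan in this form makes it easy to compare the longest TTL against the recovery time in the SLA. Clients or resolvers that ignore TTL values can take longer than this estimate.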

A strategy for testing the solution must be established. It must be factored into the SLA. Periodic validation of the deployment is the only way to guarantee the deployment does not degrade over time.

Careful completion of these planning steps will directly impact the success of a datacenter switchover. For example, poor namespace design can cause difficulties with certificates, and an incorrect certificate configuration can preclude users from being able to access services.

After the deployment is validated, we recommend that all parts of the configuration that directly affect the success of a datacenter switchover be explicitly documented. In addition, it might be prudent to enhance the change management processes around those segments of the deployment.

For more information about datacenter switchovers, including activating a secondary datacenter, and re-activating a failed (primary) datacenter, see Datacenter Switchovers.
