Topic Last Modified: 2011-03-02
This topic describes the results of Microsoft’s testing of the failover solution proposed in this section.
Central Site Link Latency
We used a network latency simulator to introduce latency on the simulated WAN link between the North and South sites. The recommended topology supports a maximum latency of 20 ms between the geographical sites. Architectural improvements in Lync Server 2010 raise this limit from the 15 ms maximum allowed in the Microsoft Office Communications Server 2007 R2 metropolitan site resiliency topology.
- 15 ms. We started by introducing a 15 ms round-trip latency into both the network path between the two sites and the data path used for data replication between the two sites. Under these conditions, and under load, the topology continued to operate without problems.
- 20 ms. We then increased the latency. At 20 ms round-trip latency for both network and data traffic, the topology continued to operate without problems. 20 ms is the maximum supported round-trip latency for this topology in Lync Server 2010.
Important: Microsoft will not support solutions whose network and data latency exceeds 20 ms.
- 30 ms. At 30 ms round-trip latency, we started to see performance degradation. In particular, message queues for the archiving and monitoring databases began to grow. As a result of the increased latency, the user experience also deteriorated: sign-in time and conference creation time both increased, and the A/V experience degraded significantly. For these reasons, Microsoft does not support a solution where round-trip latency exceeds 20 ms.
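The latency limits above can be captured in a small sketch. The constant and function names here are illustrative only and are not part of any Lync Server tooling:

```python
# Maximum supported round-trip latency (network and data paths) for the
# Lync Server 2010 metropolitan site resiliency topology, per the testing above.
MAX_SUPPORTED_RTT_MS = 20

def rtt_is_supported(rtt_ms):
    """Return True if a measured WAN round-trip latency is within the
    supported limit. Illustrative helper, not a Microsoft tool."""
    return rtt_ms <= MAX_SUPPORTED_RTT_MS
```

In the tests above, 15 ms and 20 ms round trips operated without problems, while at 30 ms message queues grew and the A/V experience degraded.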
Failover
As previously mentioned, all Windows Server 2008 R2 clusters in the topology used a Node and File Share Majority quorum. To simulate a site failover, we therefore had to isolate all servers and clusters at the North site by cutting its connectivity to both the South site and the witness site, and we used a “dirty” shutdown of all servers at the North site.
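The behavior of a Node and File Share Majority quorum under this isolation can be sketched as a simple vote count. This is an illustrative model of the quorum rule, not actual clustering code:

```python
def has_quorum(total_nodes, reachable_nodes, witness_reachable):
    """Node and File Share Majority: each cluster node and the file share
    witness carry one vote each, and the cluster stays online only while
    a strict majority of all votes is reachable. (Illustrative sketch.)"""
    total_votes = total_nodes + 1  # all nodes plus the file share witness
    reachable_votes = reachable_nodes + (1 if witness_reachable else 0)
    return reachable_votes > total_votes // 2

# A two-node stretched cluster has three votes. An isolated node that can
# reach neither its peer nor the witness loses quorum, while the surviving
# node plus the witness retain it.
```

This is why cutting the North site's connectivity to both the South site and the witness is sufficient to force North's cluster resources offline while South's can come online.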
Results and observations following failure of the North site are as follows:
- The passive SQL Server cluster node became active within minutes; the exact time varies with the details of the environment. Internal users connected to the North site were signed out and then automatically signed back in. During the failover, presence was not updated, and new actions, such as new IM sessions or conferences, failed with appropriate errors. No further errors occurred after the failover completed.
- As long as a valid network path remained between the peers, ongoing peer-to-peer calls continued without interruption.
- UC-PSTN calls were disconnected if the gateway supporting the
call became unavailable. In that case, users could manually
re-establish the call.
- Lync 2010 users connected to the North site were disconnected and automatically reconnected to the South site within minutes. Users could then continue as before.
- To reconnect, Group Chat client users had to sign out and sign back in. The Group Chat Channel service and Lookup service in the South site, which are normally stopped or disabled at that site, had to be started manually.
- Conferences hosted in the North site automatically failed over to the South site. After the failover completed, all users were prompted to rejoin the conference and could do so. Meeting recording continued during the failover. Archiving stopped until the hot standby Archiving Server was brought online.
- Manageability continued to work while the North site was down.
For example, users could be moved from the Survivable Branch
Appliance to the Front End pool.
- After the North site went offline, the SQL Server clusters and file share clusters in the South site came online within a few minutes.
- Site failover duration as observed in our testing was only a
few minutes.
Failback
For the purposes of our testing, we defined failback as restoring all functionality to the North site such that users can reconnect to servers at that site. After the North site was restored, all cluster resources were moved back to their nodes at the North site.
We recommend that you perform your failback in a controlled manner, preferably during off hours, as some user disruption can happen during the failback procedures. Results and observations following failback of the North site are as follows:
- Before cluster resources could be moved back to their nodes at the North site, storage had to be fully resynchronized; if storage has not been resynchronized, the clusters fail to come online. In our testing, the resynchronization of the storage happened automatically.
- To minimize user impact, the clusters were set not to fail back automatically. We recommend postponing failback until the next maintenance window, after verifying that storage has fully resynchronized.
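The failback preconditions described above can be expressed as a simple gate. This function is a hypothetical checklist helper, not part of Windows Server or Lync Server:

```python
def ready_to_fail_back(storage_resynchronized, in_maintenance_window):
    """Gate a controlled failback: clusters fail to come online if storage
    has not fully resynchronized, and failback is best deferred to a
    maintenance window to limit user disruption. (Hypothetical helper.)"""
    if not storage_resynchronized:
        return False  # moving resources back now would leave clusters offline
    return in_maintenance_window
```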
- The Front End Servers come back online when they can connect to Active Directory Domain Services. If the Back End Database is not yet available when the Front End Servers come online, users have limited functionality.
After the Front End Servers in the North site are online, new connections will be routed to them. Users who are online, and who usually connect through Front End Servers in the North site, will be signed out and then signed back in on their usual North site server.
If you want to prevent the Front End Servers at the North site from automatically coming back online (for example, to retain control over the whole process, or because latency between the two sites has not yet returned to acceptable levels), we recommend shutting down the Front End Servers.
- Site failback duration as observed in our testing was under one
minute.