Topic Last Modified: 2011-03-02
This topic describes the results of Microsoft’s testing of the failover solution proposed in this section.
Central Site Link Latency
We used a network latency simulator to introduce latency on the simulated WAN link between the North and South sites. The recommended topology supports a maximum latency of 20 ms between the geographical sites. Architectural improvements in Lync Server 2010 raise this limit from the 15 ms maximum allowed in the Microsoft Office Communications Server 2007 R2 metropolitan site resiliency topology.
- 15 ms. We started by introducing a 15 ms round-trip latency into both the network path between the two sites and the data path used for data replication between the two sites. Under these conditions, and under load, the topology continued to operate without problems.
- 20 ms. We then increased the latency. At 20 ms round-trip latency for both network and data traffic, the topology continued to operate without problems. 20 ms is the maximum supported round-trip latency for this topology in Lync Server 2010.
Important: Microsoft will not support solutions whose network and data latency exceeds 20 ms.
- 30 ms. At 30 ms round-trip latency, we started to see performance degradation. In particular, message queues for the archiving and monitoring databases began to grow. As a result of the increased latency, the user experience also deteriorated: sign-in time and conference creation time both increased, and the A/V experience degraded significantly. For these reasons, Microsoft does not support a solution where round-trip latency exceeds 20 ms.
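The latency limits above can be captured in a small sketch. The constant and function names here are illustrative only and are not part of any Lync Server tooling:

```python
# Maximum supported round-trip latency (network and data paths) for the
# Lync Server 2010 metropolitan site resiliency topology, per the testing above.
MAX_SUPPORTED_RTT_MS = 20

def rtt_is_supported(rtt_ms):
    """Return True if a measured WAN round-trip latency is within the
    supported limit. Illustrative helper, not a Microsoft tool."""
    return rtt_ms <= MAX_SUPPORTED_RTT_MS
```

In the tests above, 15 ms and 20 ms round trips operated without problems, while at 30 ms message queues grew and the A/V experience degraded.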
Failover
As previously mentioned, all Windows Server 2008 R2 clusters in the topology used a Node and File Share Majority quorum. To simulate a site failover, we therefore had to isolate all servers and clusters at the North site by cutting its connectivity to both the South site and the witness site, and we used a “dirty” shutdown of all servers at the North site.
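The behavior of a Node and File Share Majority quorum under this isolation can be sketched as a simple vote count. This is an illustrative model of the quorum rule, not actual clustering code:

```python
def has_quorum(total_nodes, reachable_nodes, witness_reachable):
    """Node and File Share Majority: each cluster node and the file share
    witness carry one vote each, and the cluster stays online only while
    a strict majority of all votes is reachable. (Illustrative sketch.)"""
    total_votes = total_nodes + 1  # all nodes plus the file share witness
    reachable_votes = reachable_nodes + (1 if witness_reachable else 0)
    return reachable_votes > total_votes // 2

# A two-node stretched cluster has three votes. An isolated node that can
# reach neither its peer nor the witness loses quorum, while the surviving
# node plus the witness retain it.
```

This is why cutting the North site's connectivity to both the South site and the witness is sufficient to force North's cluster resources offline while South's can come online.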
Results and observations following failure of the North site are as follows:
- The passive SQL Server cluster node became active within minutes; the exact time varies with the details of the environment. Internal users connected to the North site were signed out and then automatically signed back in. During the failover, presence was not updated, and new actions, such as new IM sessions or conferences, failed with appropriate errors. No further errors occurred after the failover completed.
- As long as a valid network path remained between the peers, ongoing peer-to-peer calls continued without interruption.
- UC-PSTN calls were disconnected if the gateway supporting the
call became unavailable. In that case, users could manually
re-establish the call.
- Lync 2010 users connected to the North site were disconnected and automatically reconnected to the South site within minutes. Users could then continue as before.
- To reconnect, Group Chat client users had to sign out and sign back in. The Group Chat Channel service and Lookup service in the South site, which are normally stopped or disabled at that site, had to be started manually.
- Conferences hosted in the North site automatically failed over to the South site. After the failover completed, all users were prompted to rejoin the conference and could do so. Meeting recording continued during the failover. Archiving stopped until the hot standby Archiving Server was brought online.
- Manageability continued to work while the North site was down.
For example, users could be moved from the Survivable Branch
Appliance to the Front End pool.
- After the North site went offline, the SQL Server clusters and file share clusters in the South site came online within a few minutes.
- Site failover duration as observed in our testing was only a
few minutes.
Failback
For the purposes of our testing, we defined failback as restoring all functionality to the North site such that users can reconnect to servers at that site. After the North site was restored, all cluster resources were moved back to their nodes at the North site.
We recommend that you perform your failback in a controlled manner, preferably during off hours, as some user disruption can happen during the failback procedures. Results and observations following failback of the North site are as follows:
- Before cluster resources could be moved back to their nodes at the North site, storage had to be fully resynchronized; if storage has not been resynchronized, the clusters fail to come online. In our testing, the resynchronization of the storage happened automatically.
- To minimize user impact, the clusters were set not to fail back automatically. We recommend postponing failback until the next maintenance window, after verifying that storage has fully resynchronized.
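The failback preconditions described above can be expressed as a simple gate. This function is a hypothetical checklist helper, not part of Windows Server or Lync Server:

```python
def ready_to_fail_back(storage_resynchronized, in_maintenance_window):
    """Gate a controlled failback: clusters fail to come online if storage
    has not fully resynchronized, and failback is best deferred to a
    maintenance window to limit user disruption. (Hypothetical helper.)"""
    if not storage_resynchronized:
        return False  # moving resources back now would leave clusters offline
    return in_maintenance_window
```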
- The Front End Servers come back online when they can connect to Active Directory Domain Services. If the Back End Database is not yet available when the Front End Servers come online, users have limited functionality.
After the Front End Servers in the North site are online, new connections will be routed to them. Users who are online, and who usually connect through Front End Servers in the North site, will be signed out and then signed back in on their usual North site server.
If you want to prevent the Front End Servers at the North site from automatically coming back online (for example, to retain control over the whole process, or because latency between the two sites has not yet returned to acceptable levels), we recommend shutting down the Front End Servers.
- Site failback duration as observed in our testing was under one
minute.