Topic Last Modified: 2014-02-10
This article is a companion to the Key Health Indicators: The Foundation for Maintaining Healthy Lync Servers poster, which you can download from the Download Center.
You can use this poster to learn about Key Health Indicators (KHIs), performance counters with thresholds aimed at revealing user experience issues. Gathering KHI data is usually the first step to implementing the Call Quality Methodology (CQM), which is focused on ensuring a quality audio experience for Lync users.
If you have questions about how to use CQM, send them to cqmfeedback@microsoft.com.
The poster explains the following areas:
What are Key Health Indicators?
Key Health Indicators are performance counters with thresholds aimed at revealing user experience issues. Gathering KHI data is usually the first step to implementing the Call Quality Methodology (CQM), which is focused on ensuring a quality audio experience for Lync users.
KHIs complement standard Lync monitoring solutions (for example, System Center Operations Manager, synthetic transactions, and Monitoring Server); they do not replace them.
Collect the KHI performance counters and populate the KHI spreadsheet accompanying the Networking Guide to produce a scorecard that will help you determine the server health of a Lync deployment. Once populated, the spreadsheet guides you in remediating the environment and gives other stakeholders additional insight. Evaluate KHIs on a monthly basis and incorporate them into any deployment’s ongoing operational processes.
Download the Lync Server Networking Guide to see the full list of KHIs and to get the related spreadsheets.
To Collect KHI Data
- Run the KHI script included with the Lync Server Networking Guide on each Lync Server. This will create a Data Collector inside of Performance Monitor and name it KHI. By default, data will be polled every 15 seconds.
- Before the start of your company's business day, go to each Lync Server and start the KHI Data Collector.
- At the end of that day, stop the KHI Data Collector and copy the data to a central location.
- After using Performance Monitor to fill in the KHI spreadsheet included with the Lync Server Networking Guide download, compare the results to the recommended targets.
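The averaging step behind the spreadsheet can be sketched in a few lines. This is a minimal illustration only, assuming the collected log has already been exported to CSV with the timestamp in the first column; the source only describes using Performance Monitor, and the column names and the `average_counters` helper below are hypothetical.

```python
import csv
import io
from statistics import mean


def average_counters(csv_text):
    """Return the mean of every counter column in a CSV export.

    Assumption: the first column is the sample timestamp and each remaining
    column holds one performance counter. Blank or non-numeric cells
    (missed samples) are skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    values = {name: [] for name in header[1:]}
    for row in reader:
        for name, cell in zip(header[1:], row[1:]):
            try:
                values[name].append(float(cell))
            except ValueError:
                continue  # missed sample, ignore it
    # Only report counters that have at least one usable sample.
    return {name: mean(v) for name, v in values.items() if v}


# Hypothetical export: two counters, 15-second interval, one missed sample.
sample = (
    "Time,% Processor Time,Avg. Disk sec/Write\n"
    "08:00:00,40,0.004\n"
    "08:00:15,60,\n"
    "08:00:30,50,0.006\n"
)
averages = average_counters(sample)
print(averages)
```

At the default 15-second interval, a hypothetical 8-hour business day yields 8 × 3600 / 15 = 1,920 samples per counter, so averaging by script rather than by hand may be worthwhile.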
Remediation Flow for all Server Roles
For each server in your Lync implementation, begin by verifying that the server’s component health and system performance are at or above the desired level. Only after that should you look at the indicators relating to the server’s role in the overall Lync implementation.
Begin by collecting KHI performance data for all servers. For each of the system roles (discussed later in this document), determine whether the basic system components meet the recommended targets. If they do not, remediate system performance, then re-collect KHI data and confirm system health before looking at the metrics specific to the server’s role in the Lync implementation. Component health for all roles is defined as:
- CPU Utilization < 80%
- Avg. Disk Write < 10 ms
- Avg. Disk Read < 10 ms
- Available memory > 20% of System Total MB
- Network Queue Length < 2
- Discarded Packets (in / out) = 0
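The remediation flow above, component health first and role-specific indicators second, could be expressed roughly as follows. This is a sketch under stated assumptions: the dictionary keys, function names, and sample values are hypothetical, and the measurements are assumed to be averages already converted to the units used in the list (percent, milliseconds, MB).

```python
def component_health_failures(m):
    """Return the basic component checks that miss their recommended target.

    `m` maps hypothetical measurement names to averaged values.
    """
    checks = {
        "CPU utilization < 80%": m["cpu_percent"] < 80,
        "Avg. disk write < 10 ms": m["disk_write_ms"] < 10,
        "Avg. disk read < 10 ms": m["disk_read_ms"] < 10,
        "Available memory > 20% of system total": m["available_mb"] > 0.20 * m["total_mb"],
        "Network queue length < 2": m["network_queue_length"] < 2,
        "Discarded packets (in/out) = 0": m["discarded_in"] == 0 and m["discarded_out"] == 0,
    }
    return [name for name, ok in checks.items() if not ok]


def next_step(m):
    """Apply the flow: fix component health before looking at role KHIs."""
    if component_health_failures(m):
        return "remediate system performance, then re-collect KHI data"
    return "evaluate role-specific KHIs"


# Hypothetical measurements for two servers.
healthy = {
    "cpu_percent": 55, "disk_write_ms": 4, "disk_read_ms": 6,
    "available_mb": 8192, "total_mb": 32768,
    "network_queue_length": 0, "discarded_in": 0, "discarded_out": 0,
}
busy = dict(healthy, cpu_percent=91, available_mb=4096)

print(next_step(healthy))
print(component_health_failures(busy))
```

Returning the list of failed checks, rather than a single pass/fail flag, keeps the output usable as a remediation checklist.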
Glossary
The following terms and acronyms are used in this poster:
AS MCU = Application Sharing Multi-point Control Unit
AV MCU = Audio/Video MCU
IM MCU = Instant Messaging MCU
UCWA = Unified Communications Web API
AV Edge = Traversal of audio/video via edge
AV Auth = Audio/Video Authentication
SIP Stack = Contains Lync’s core SIP implementation
Data Proxy = Used for edge conferencing
LySS = Lync Storage Service
Front-end Servers
The following recommended KHI targets are specific to front-end servers in addition to basic component health:
Functional area | Target Metrics |
---|---|
AS/AV/IM MCU | MCU Health State < 2 |
Web Components | Distribution List expansion AD timeouts = 0; ABWQ failures = 0; LIS failures = 0; Authentication Errors < 1/sec; ASP.NET v4 Requests Rejected = 0 |
SIP Stack | Avg. Incoming Message Processing < 1 sec; Incoming Responses Dropped < 1/sec; Incoming Requests Dropped < 1/sec; Queue Latency < 100 ms; Sproc Latency < 100 ms; Throttled Requests = 0; Authentication Errors < 1/sec; Incoming Messages Timed Out < 2; Avg. Incoming Message Hold < 1 sec; Flow Controlled Connections < 2; Avg. Out Queue Delay < 2 sec |
LySS | % of space used by Storage Service DB < 80; # of replica replication failures = 0; # of data loss events = 0 |
SQL | Page life expectancy > 300 sec; Batch requests/sec < 2500 |
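Targets written in the notation used in these tables (metric name, comparison, limit, optional unit) can be turned into machine-checkable rules. The sketch below is one possible approach, not part of the Networking Guide tooling; `parse_target` and `meets_target` are hypothetical helpers, and the measured value is assumed to be in the same unit as the target.

```python
import operator
import re

# Comparison symbols that appear in the target tables.
OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

# name, comparison symbol, numeric limit, optional trailing unit
TARGET = re.compile(
    r"^(?P<name>.+?)\s*(?P<op>[<>=])\s*(?P<limit>\d+(?:\.\d+)?)\s*(?P<unit>.*)$"
)


def parse_target(text):
    """Split a target such as 'Queue Latency < 100 ms' into its parts."""
    m = TARGET.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognized target: {text!r}")
    return m["name"], OPS[m["op"]], float(m["limit"]), m["unit"].strip()


def meets_target(text, measured):
    """Return True if `measured` satisfies the target described by `text`."""
    _, compare, limit, _ = parse_target(text)
    return compare(measured, limit)


# Targets copied from the front-end table; measured values are hypothetical.
print(meets_target("Queue Latency < 100 ms", 42.0))
print(meets_target("Throttled Requests = 0", 3))
print(meets_target("Page life expectancy > 300 sec", 1200))
```

Keeping the targets as plain text means the same strings can be shown in a report and evaluated, which avoids maintaining the thresholds in two places.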
Backend SQL Servers
The following recommended KHI targets are specific to SQL servers in addition to basic component health:
Functional area | Target Metrics |
---|---|
SQL | Page life expectancy > 300 sec; Batch requests/sec < 2500 |
Mediation Servers
The following recommended KHI targets are specific to mediation servers in addition to basic component health:
Functional area | Target Metrics |
---|---|
Mediation Server Service | Load Call Failure Index = 0; Failed Calls due to Proxy < 10; Failed Calls due to Gateway < 10; Calls (in or out) rejected = 0; Media Candidates missing = 0; Media Connectivity Check Failures = 0 |
Edge Servers
The following recommended KHI targets are specific to edge servers in addition to basic component health:
Functional area | Target Metrics |
---|---|
AV Auth | Bad Requests < 20/sec |
AV Edge | Auth. Failures < 20/sec; Allocation Failures < 20/sec; Packets Dropped < 300/sec |
Data Proxy | Throttled Server connections < 3; System is Throttling < 1 |
SIP Stack | Connections over limit dropped < 1; Sends timed out < 10; Flow Controlled Connections < 100; Incoming requests dropped < 1/sec; Avg. Message Processing < 3 sec |