Historically, establishing information technology (IT) performance metrics has been a complex and difficult task, sometimes requiring manual data collection, software customization, detailed analysis, and lengthy reporting. Not only was it difficult to distill meaningful information from the raw data, but the data reported was not always current. As a result, making well-informed decisions based on the current environment was challenging, and making “predictive” decisions based on legitimate trends was more challenging still. With increasing standardization, more sophisticated enterprise applications and tools, and the widespread adoption of IT Service Management best practices, the formulation of IT metrics has evolved significantly. For example, just a few years ago talk of “five nines” (99.999%) availability was standard. Later the conversation focused on the cost of unavailability from the customer’s perspective. Today leading automated tools provide standard metrics and reports “out of the box”.
However, despite the availability of tools and standard metrics, most IT organizations still struggle with metrics. In general, people tend to perform in accordance with what is measured, so metrics are an effective means to drive behaviors. Therefore, it is critical that meaningful metrics are established, published, and well understood by everyone in the organization.
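To make an availability target such as “five nines” concrete, it can help to convert the percentage into a downtime budget. The following is a minimal sketch in Python; the function name and the choice of a 365-day year are illustrative assumptions, not part of any standard tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def allowed_downtime_minutes(availability_pct: float,
                             period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget, in minutes, implied by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_hours * 60


# Compare three common targets over one year.
for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, a five-nines target leaves a budget of roughly five minutes of downtime per year, which illustrates why the conversation later shifted toward what unavailability actually costs the customer.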
The most effective approach to establishing meaningful metrics is to start with the “End” in mind. The “End” is typically an end user. Traditionally, end users are most concerned with the availability, features and functionality, and support of a specific “IT service” rather than with the various technologies that are combined to create the service (e.g. applications, networks, databases). Therefore, the starting point for meaningful metrics is to determine how the end user or customer will evaluate the service provider’s performance on each of these services. Armed with this information, it is much easier to establish a limited number of high-level yet measurable metrics that can be quantified and reported. It is then the responsibility of the IT organization to “reverse engineer” the overall customer-focused metrics to determine the cascading sub-tier metrics that directly contribute to the customer metrics used to evaluate the effectiveness of a specific IT service. These sub-tier metrics include both technology metrics, which indicate the effectiveness of a specific technology, and service-related metrics, which represent the effectiveness of the combined technologies required to deliver a specific service. By combining the customer’s evaluation (perception) of the quality of service across all IT services, the overall effectiveness of the IT organization can be determined and improvement plans established. This approach to implementing metrics ensures that organizational behaviors are aligned with what is important to the end user or customer.
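One way to picture how sub-tier technology metrics roll up into a customer-facing service metric is a simple series model: the service is considered available only when every supporting technology is available, so the service availability is the product of the component availabilities. The sketch below assumes that model; the service, the component names, and the figures are hypothetical.

```python
from math import prod
from typing import Mapping


def service_availability(components: Mapping[str, float]) -> float:
    """Availability of a service that needs every component up (series model).

    Values are fractions between 0 and 1, e.g. 0.999 for 99.9%.
    """
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"availability for {name!r} must be between 0 and 1")
    return prod(components.values())


def weakest_component(components: Mapping[str, float]) -> str:
    """Name of the component with the lowest availability."""
    return min(components, key=components.get)


# Hypothetical measurements for the technologies behind one IT service.
email_service = {"application": 0.999, "network": 0.9995, "database": 0.998}

print(f"Service availability: {service_availability(email_service):.4%}")
print(f"Weakest component: {weakest_component(email_service)}")
```

The point of the sketch is that each technology can look healthy in isolation while the combined service, which is what the end user evaluates, falls short of its target. Real services with redundancy or partial dependencies would need a different model.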
Therefore, there are a number of factors to consider when establishing meaningful metrics that will guide the appropriate organizational behaviors to ensure organizational success. These factors can best be considered by answering a number of key questions.
For example:
- What is the mission?
- In what ways does the success of the mission depend on IT?
- Specifically, what IT services does the mission depend on most?
- How does the end user evaluate the quality of the IT service(s)?
- Can the degree of service quality be measured?
- What key metrics can be established to measure service specific quality?
- What technology metrics, if measured, will indicate the degree to which the various technologies contribute to the quality of the IT service, as opposed to the independent quality of the individual technologies that make up the service?
- What are the metrics necessary to measure the effectiveness of the infrastructure processes required to effectively plan, design, develop, test, transition, implement, operate and continually improve the services?
Provided in Appendix A are common IT Service Management metrics for core ITIL-based processes. Once again, the appropriate metrics, KPIs, and critical success factors are unique to each organization, depending upon its relative maturity and what the end user considers most important to meeting service quality expectations. Technology-based metrics, such as latency, bandwidth, throughput, and packet loss within a network environment, should contribute directly to overall service quality. Which of them are appropriate depends somewhat on the types of technologies that exist within the infrastructure and the availability of supporting tool suites.
In summary, the objective of metrics is to drive individual and organizational behavior by providing insight into an event that has occurred, in order to determine whether a desired outcome has been achieved. Choosing meaningful metrics is therefore critical. Metrics for the sake of reporting are not meaningful. Meaningful metrics provide actionable information that facilitates accurate and timely decision making. As an organization matures, so should its metrics. Please see Appendix B for diagrams depicting the linkage between goals and metrics.
Appendix A – Common IT Service Management Metrics: Critical Success Factors (CSFs) and Key Performance Indicators (KPIs)

| Process | CSF (ends with a colon) followed by its KPIs |
|---|---|
| SLM | Manage quantity and quality of IT service needed: |
| SLM | percentage reduction in SLA targets missed |
| SLM | percentage reduction in SLA targets threatened |
| SLM | percentage increase in Customer perception of SLA achievements via CSS responses |
| SLM | percentage reduction in SLA breaches caused because of third party support contracts (Underpinning Contracts) |
| SLM | percentage reduction in SLA breaches caused by internal Operational Level Agreements (OLAs). |
| SLM | Deliver service as previously agreed at affordable costs: |
| SLM | total number and percentage increase in fully documented SLAs in place |
| SLM | percentage increase of SLAs agreed against operational services being run |
| SLM | percentage increase in completeness of Service Catalogue versus operational services |
| SLM | percentage improvement in the Service Delivery costs |
| SLM | percentage reduction in the cost of monitoring and reporting of SLAs |
| SLM | percentage increase in the speed and accuracy of developing SLAs. |
| SLM | Manage business interface: |
| SLM | increased percentage of Services covered by SLAs |
| SLM | documented and agreed SLM processes and procedures are in place |
| SLM | reduction in the time to respond to and implement SLA requests |
| SLM | increased percentage of SLA reviews completed on time |
| SLM | reduction in the percentage of outstanding SLAs for annual renegotiation |
| SLM | reduction in the percentage of SLAs requiring Changes (for example targets not attainable; Changes in usage levels) |
| SLM | percentage increase in the number of OLAs and third party contracts in place |
| SLM | documentary evidence that issues raised at service and SLA reviews are being followed up and resolved (e.g. via the CSIP) |
| SLM | reduction in the number and severity of SLA breaches |
| SLM | effective review and follow-up of all SLA, OLA and underpinning contract breaches. |
| Capacity | Accurate business forecasts: |
| Capacity | production of workload forecasts on time |
| Capacity | percentage accuracy of forecasts of business trends |
| Capacity | timely incorporation of business plans into Capacity Plan |
| Capacity | reduction in the number of variances from the business plans and Capacity Plans. |
| Capacity | Knowledge of current and future technologies: |
| Capacity | increased ability to monitor performance and throughput of all services and components |
| Capacity | timely justification and implementation of new technology in line with business requirements (time, cost and functionality) |
| Capacity | reduction in the use of old technology causing breached SLAs due to problems with support or performance. |
| Capacity | Ability to demonstrate cost-effectiveness: |
| Capacity | a reduction in panic buying |
| Capacity | reduction in the over-capacity of IT |
| Capacity | accurate forecasts of planned expenditure |
| Capacity | reduction in the business disruption caused by a lack of adequate IT capacity |
| Capacity | relative reduction in the cost of production of the Capacity Plan. |
| Capacity | Ability to plan and implement the appropriate IT capacity to match business needs: |
| Capacity | percentage reduction in the number of Incidents due to poor performance |
| Capacity | percentage reduction in lost business due to inadequate capacity |
| Capacity | all new services implemented match Service Level Requirements (SLRs) |
| Capacity | increased percentage of recommendations made by Capacity Management are acted upon |
| Capacity | reduction in the number of SLA breaches due to either poor service performance or poor component performance. |
| Availability | Manage availability and reliability of IT service: |
| Availability | percentage reduction in the unavailability of services and components |
| Availability | percentage increase in the reliability of services and components |
| Availability | effective review and follow-up of all SLA, OLA and underpinning contract breaches |
| Availability | percentage improvement in overall end-to-end availability of services |
| Availability | percentage reduction in the number and impact of service breaks |
| Availability | improvement in the MTBF (mean time between failures) |
| Availability | improvement in the MTBSI (mean time between system incidents) |
| Availability | reduction in the MTTR (mean time to repair). |
| Availability | Satisfy business needs for access to IT services: |
| Availability | percentage reduction in the unavailability of services |
| Availability | percentage reduction of the cost of business overtime due to unavailable IT |
| Availability | percentage reduction in critical time failures, e.g. specific business peak and priority availability needs are planned for |
| Availability | percentage improvement in business and Users satisfied with service (by CSS results). |
| Availability | Availability of IT infrastructure achieved at optimum costs: |
| Availability | percentage reduction in the cost of unavailability |
| Availability | percentage improvement in the Service Delivery costs |
| Availability | timely completion of regular risk analysis and system review |
| Availability | timely completion of regular cost-benefit analysis established for infrastructure Component Failure Impact Analysis (CFIA) |
| Availability | percentage reduction in failures of third party performance on MTTR/MTBF against contract targets |
| Availability | reduced time taken to complete (or update) a risk analysis |
| Availability | reduced time taken to review system resilience |
| Availability | reduced time taken to complete an Availability Plan |
| Availability | timely production of management reports |
| Availability | percentage reduction in the incidence of operational reviews uncovering security and reliability exposures in application designs. |
| IT Security | Management Controls (NIST 800-26) |
| IT Security | Percentage of systems that had formal risk assessment performed and documented. (every 6 months) |
| IT Security | Percentage of systems that have had risk levels reviewed by management. (6 months) |
| IT Security | Percentage of total systems for which security controls have been tested and evaluated in the past year. (12 months) |
| IT Security | The average time elapsed between vulnerability or weakness discovery and implementation of corrective action. (3 months) |
| IT Security | Percentage of systems that have the costs of their security controls integrated into the life cycle of the system. (12 months) |
| IT Security | Percentage of system changes recertified if security controls are added or modified after the system was developed. (12 months) |
| IT Security | Percentage of total systems that have been authorized for processing following certification and accreditation. (3 months) |
| IT Security | Percentage of systems that are operating under an Interim Authority to Operate (IATO). (3 months) |
| IT Security | Percentage of systems with approved system Security plans. (12 months) |
| IT Security | Percentage of current security plans. (6 months) |
| IT Security | Operational Controls (NIST 800-26) |
| IT Security | Percentage of systems compliant with the separation of duties requirement. (12 months) |
| IT Security | Percentage of users with special access to systems who have undergone background evaluations. (6 months) |
| IT Security | Percentage of information systems libraries that log the deposits and withdrawals of tapes. (6 months) |
| IT Security | Percentage of data transmission facilities in the organization that have restricted access to authorized individuals. (6 months) |
| IT Security | Percentage of laptops with encryption capability for sensitive files. (3 months) |
| IT Security | Percentage of security-related user issues resolved immediately following the initial call. (6 months) |
| IT Security | Percentage of used media sanitized before reuse or disposal. (12 months) |
| IT Security | Percentage of critical data files and operations with an established backup frequency. (12 months) |
| IT Security | Percentage of systems that have a contingency plan. (12 months) |
| IT Security | Percentage of systems for which contingency plans have been tested in the past year. (12 months) |
| IT Security | Percentage of systems that impose restrictions on system maintenance personnel. (12 months) |
| IT Security | Percentage of software changes documented and approved through change request forms. (3 months) |
| IT Security | Percentage of systems with the latest approved patches installed. (1 month) |
| IT Security | Percentage of systems with automatic virus definition updates and automatic virus scanning. (6 months) |
| IT Security | Percentage of systems that perform password policy verification. (6 months) |
| IT Security | Percentage of in-house applications with documentation on file. (12 months) |
| IT Security | Percentage of systems with documented risk assessment reports. (12 months) |
| IT Security | Technical Controls (NIST 800-26) |
| IT Security | Percentage of employees with significant security responsibilities who have received specialized training. (12 months) |
| IT Security | Percentage of agency components with incident handling and response capability. (6 months) |
| IT Security | Number of incidents reported to FedCIRC, NIPC and local law enforcement. (3 months) |
| IT Security | Percentage of systems without active vendor-supplied passwords. (6 months) |
| IT Security | Percentage of unique user IDs. (3 months) |
| IT Security | Percentage of users with access to security software that are not security administrators. (3 months) |
| IT Security | Percentage of systems running restricted protocols. (6 months) |
| IT Security | Percentage of websites with a posted privacy policy. (3 months) |
| IT Security | Percentage of systems on which audit trails provide a trace of user actions. (12 months) |
| Incident | Quickly resolve Incidents: (applies to both Incident and Request Fulfillment) |
| Incident | percentage reduction in average time to respond to a call for assistance from first-line operatives |
| Incident | percentage increase in the Incidents resolved by first line operatives |
| Incident | percentage increase in the Incidents resolved by first line operatives on first response |
| Incident | percentage reduction of Incidents incorrectly assigned |
| Incident | percentage reduction of Incidents incorrectly categorized |
| Incident | reduced mean elapsed time for resolution or circumvention of Incidents, broken down by impact code |
| Incident | increased percentage of Incidents resolved within agreed (in SLAs) response times by impact code. |
| Incident | Maintain IT service quality: (applies to both Incident and Request Fulfillment) |
| Incident | reduction in the service unavailability caused by Incidents |
| Incident | increased percentage of Incidents resolved within target times by priority |
| Incident | increased percentage of Incidents resolved within target times by category |
| Incident | percentage reduction in the average time for second line support to respond |
| Incident | reduction of the Incident backlog |
| Incident | percentage increase in the Incidents fixed before Users notice |
| Incident | percentage reduction in the Incidents reopened |
| Incident | percentage reduction in the overall average time to resolve Incidents |
| Incident | reduction in the number of Incidents requiring more than one second line support team. |
| Incident | Improve business and IT productivity: (applies to both Incident and Request Fulfillment) |
| Incident | percentage reduction in average cost of handling incidents |
| Incident | improved percentage of business incidents dealt with by first line operatives |
| Incident | percentage reduction in the number of times first line operatives are bypassed |
| Incident | percentage improvement in average number of incidents handled by each first line operative |
| Incident | no delays in the production of management reports |
| Incident | improved scores on CSS responses. |
| Incident | User satisfaction: (applies to both Incident and Request Fulfillment) |
| Incident | percentage improvement in CSS responses on the Incident Management service |
| Incident | percentage reduction in length of queue time waiting for Service Desk response |
| Incident | percentage reduction in the number of lost Service Desk calls |
| Incident | percentage reduction of the number of revised business instructions issued. |
| Service Desk | Receiving Calls (Performance): (applies to both Incident and Request Fulfillment) |
| Service Desk | % First Call Resolution |
| Service Desk | % First Call Resolutions without Passwords |
| Service Desk | # Dropped Calls |
| Service Desk | Average Call Hold Time |
| Service Desk | Workload Volumes (Performance) |
| Service Desk | Total # Calls / day / month / year |
| Service Desk | Average # Calls per Day |
| Service Desk | Average # Calls per Month |
| Service Desk | Average # Calls Assigned / day. |
| Service Desk | Average # Calls Assigned / month. |
| Service Desk | Operational Level (Quality) (applies to both Incident and Request Fulfillment) |
| Service Desk | Average duration by Call Type |
| Service Desk | Average duration by Call Group |
| Service Desk | Counts of Call Type / Group / Customer |
| Configuration | Control of IT assets: |
| Configuration | percentage reduction in number of Configuration Item (CI) attribute errors found in Configuration Management Database (CMDB) |
| Configuration | percentage increase in the number of CIs successfully audited |
| Configuration | percentage improvements in the speed and accuracy of audit. |
| Configuration | Support the delivery of quality IT services: |
| Configuration | percentage reduction in service errors attributable to wrong CI information |
| Configuration | improved speed of component repair and recovery |
| Configuration | improved Customer satisfaction with services and terminal equipment. |
| Configuration | Economic service provision: |
| Configuration | reduction in the number of ‘missing or duplicated’ CIs |
| Configuration | greater percentage of maintenance costs and license fees within budget |
| Configuration | percentage reduction in S/W costs due to better control |
| Configuration | percentage reduction in H/W costs due to better control of spares inventory and supplies |
| Configuration | percentage improvement in average cost of maintaining CIs in CMDB. |
| Configuration | Support, integration and interfacing to all other ITSM processes: |
| Configuration | reduced percentage of Change failures as a result of inaccurate configuration data |
| Configuration | improved Incident resolution time due to the availability of complete and accurate configuration data |
| Configuration | more accurate results from Risk Analysis audits due to available and accurate asset information. |
| Problem | Improve service quality: |
| Problem | percentage reduction in repeat Incidents/Problems |
| Problem | percentage reduction in the Incidents and Problems affecting service to Customers |
| Problem | percentage reduction in the known Incidents and Problems encountered |
| Problem | no delays in production of management reports |
| Problem | improved CSS responses on business disruption caused by Incidents and Problems. |
| Problem | Minimize impact of Problems: |
| Problem | percentage reduction in average time to resolve Problems |
| Problem | percentage reduction of the time to implement fixes to Known Errors |
| Problem | percentage reduction of the time to diagnose Problems |
| Problem | percentage reduction of the average number of undiagnosed Problems |
| Problem | percentage reduction of the average backlog of ‘open’ Problems and errors. |
| Problem | Reduce cost of Problems to Users: |
| Problem | percentage reduction of the impact of Problems on User |
| Problem | reduction in the business disruption caused by Incidents and Problems |
| Problem | percentage reduction in the number of Problems escalated (missed target) |
| Problem | percentage reduction in the IT Problem Management budget |
| Problem | increased percentage of proactive Changes raised by Problem Management, particularly from Major Incident and Problem reviews. |
| Change | Repeatable process: |
| Change | percentage fewer rejected RFCs |
| Change | percentage reduction in unauthorized Changes detected |
| Change | percentage of Change requests (business driven need) implemented on time |
| Change | percentage reduction in average time to make Changes |
| Change | percentage reduction in the Change backlog |
| Change | percentage fewer Changes ‘backed out’ because of testing failures |
| Change | percentage reduction in Changes required by previous Change failures |
| Change | increase in the percentage of reports produced on schedule. |
| Change | Quick and accurate Changes: |
| Change | percentage reduction in the number of urgent Changes |
| Change | percentage reduction of urgent Changes causing Incidents |
| Change | reduction in the percentage of Changes implemented without being tested |
| Change | percentage reduction of urgent Changes requiring back-out |
| Change | reduced percentage of urgent or high priority Changes submitted without business case to justify decision. |
| Change | Protect service: |
| Change | reduction in both the scheduled and unscheduled service unavailability caused by Changes |
| Change | percentage reduction in Changes backed out |
| Change | percentage reduction of unsuccessful Changes |
| Change | percentage reduction in Changes causing Incidents |
| Change | percentage reduction in Changes impacting core service time and SLA service hours |
| Change | percentage increase in Changes activated outside core service time and SLA service hours |
| Change | reduction in the percentage of Changes not referred to a Change Advisory Board (CAB) or Change Advisory Board Emergency Committee (CAB/EC) |
| Change | improvement in Customer Satisfaction Survey (CSS) feedback on Change |
| Change | percentage reduction in failed Changes that do not have recorded back-out plan |
| Change | percentage reduction in time to implement a Change freeze. |
| Change | Show efficiency and effectiveness results: |
| Change | percentage efficiency improvement based on number of RFCs processed |
| Change | percentage increase in the accuracy of Change estimates |
| Change | percentage reduction in the average cost of handling a Change |
| Change | percentage reduction in Change overtime due to better planning |
| Change | reduction in the ‘cost’ of failed Changes |
| Change | increased percentage of Changes implemented on time |
| Change | increased percentage of Changes implemented to budget |
| Change | reduction in the percentage of failed Changes |
| Change | reduction in the percentage of backed out Changes. |
| Release | Better quality software and hardware: |
| Release | percentage reduction in the use of software and hardware Releases that have not passed the required quality checks |
| Release | percentage reduction in installed software not taken from DSL |
| Release | percentage reduction in non-standard hardware |
| Release | all bought-in software complies with legal restrictions |
| Release | percentage reduction of unauthorized reversion to previous Releases |
| Release | percentage reduction in the use of unauthorized software and hardware. |
| Release | Repeatable process for rolling out software and hardware Releases: |
| Release | all new Releases planned and controlled by Release Management |
| Release | all installed software taken from the DSL |
| Release | all appropriate hardware stored in the DHS |
| Release | percentage reduction in the number of failed distributions of Releases to remote sites |
| Release | reduction in the percentage of urgent Releases |
| Release | increase in the percentage of ‘normal Release units’ as opposed to ad hoc Releases. |
| Release | Implementation of Releases swiftly (business driven needs) and accurately: |
| Release | percentage reduction in build failures |
| Release | percentage implementation of releases at all sites, including remote ones, on time |
| Release | percentage reduction in the number of urgent Releases |
| Release | percentage reduction in the Releases causing Incidents |
| Release | reduction in the percentage of Releases implemented without being tested |
| Release | reduced percentage of urgent or high priority Releases requested without the appropriate business case/justification. |
| Release | Cost-effective releases |
| Release | increased percentage of Releases built and implemented on schedule |
| Release | percentage Releases built and implemented within budget |
| Release | reduction in the service unavailability caused by Releases |
| Release | percentage reduction in Releases backed out |
| Release | percentage reduction of failed Releases |
| Release | percentage reduction in the average cost of handling a Release |
| Release | percentage reduction in Release overtime due to better planning |
| Release | reduction in the ‘cost’ of failed Releases |
| Release | no evidence of payment of license fees or wasted maintenance effort, for software that is not in use |
| Release | no evidence of wasteful duplication in Release building (e.g. multiple builds of remote sites, when copies of a single build would suffice) |
| Release | percentage improvement of the planned composition of Releases matching the actual composition (which demonstrates good Release planning) |
| Release | percentage improvement in the resources required by Release Management |
| Release | percentage increase in the accuracy of Release estimates. |
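
Many of the KPIs above reduce to two kinds of arithmetic: a percentage change between reporting periods, and availability figures such as MTBF, MTBSI, and MTTR derived from a log of service breaks. The following Python sketch shows one common way to compute them. The function names and sample figures are hypothetical, and the definitions used (MTBF as average uptime per failure, MTTR as average downtime per failure, MTBSI as their sum) are assumptions that an organization should confirm against its own agreed definitions.

```python
from typing import Sequence


def availability_kpis(period_hours: float, outage_hours: Sequence[float]) -> dict:
    """Summarise availability KPIs for one reporting period.

    period_hours -- agreed service time in the period
    outage_hours -- duration of each service break, in hours
    """
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    downtime = sum(outage_hours)
    if downtime > period_hours:
        raise ValueError("total downtime cannot exceed the period")
    failures = len(outage_hours)
    uptime = period_hours - downtime

    kpis = {
        "availability_pct": 100.0 * uptime / period_hours,
        "mtbf_hours": None,   # undefined when there were no failures
        "mttr_hours": None,
        "mtbsi_hours": None,
    }
    if failures:
        kpis["mtbf_hours"] = uptime / failures         # mean time between failures
        kpis["mttr_hours"] = downtime / failures       # mean time to repair
        kpis["mtbsi_hours"] = period_hours / failures  # equals MTBF + MTTR
    return kpis


def percentage_reduction(previous: float, current: float) -> float:
    """Percentage reduction from the previous period (negative means an increase)."""
    if previous <= 0:
        raise ValueError("previous must be positive to compute a percentage")
    return 100.0 * (previous - current) / previous


# Hypothetical 30-day period (720 hours) with three service breaks.
april = availability_kpis(period_hours=720, outage_hours=[2.0, 1.0, 3.0])
print(april)

# Hypothetical KPI: SLA targets missed fell from 40 to 30 between periods.
print(f"Reduction in SLA targets missed: {percentage_reduction(40, 30):.1f}%")
```

Keeping the raw counts alongside the percentages is a deliberate choice here: a percentage reduction reported on its own can hide a very small baseline.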
Appendix B – Linking Goals to Metrics
Management Control Metrics – Examples
DS1 – Define and Manage Service Levels
DS2 – Manage Performance and Capacity
About G2SF
G2SF specializes in IT compliance, governance, and service management consulting; training; network security, engineering, operations, and management; and enterprise application support in accordance with the Information Technology Infrastructure Library (ITIL®), ISO/IEC 20000 standards, and other federally mandated requirements. G2SF is committed to institutionalizing various technical standards and service management best practices to increase operational efficiencies, reduce operational costs, and improve end user satisfaction within the world’s largest IT organizations. In doing so, the company has established a successful track record as an objective change agent by collaborating with clients to facilitate technology, organizational, and cultural transformations within large, complex, global, classified IT environments.


