IT Service Management Metrics

 

Historically, establishing information technology (IT) performance metrics has been a complex and difficult task, sometimes requiring manual data collection, software customization, detailed analysis, and lengthy reporting. Not only was it difficult to distill meaningful information from the raw data, but the data reported was not always current. As a result, making well-informed decisions based on the current environment was challenging. Even more challenging was making “predictive” decisions based on legitimate trends. With increasing standardization, more sophisticated enterprise applications and tools, and the widespread adoption of IT Service Management best practices, the formulation of IT metrics has evolved significantly. For example, just a few years ago talk of “five nines” (99.999%) availability was standard. Later the conversation focused on the cost of unavailability from the customer’s perspective. Today leading automated tools provide standard metrics and reports “out of the box”.

However, despite the availability of tools and standard metrics, most IT organizations still struggle with metrics. In general, people tend to perform in accordance with what is measured, so metrics are an effective means to drive behaviors. It is therefore critical that meaningful metrics are established, published, and well understood by everyone in the organization.
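
To make the “five nines” figure concrete, the short sketch below converts an availability target into a yearly downtime budget. It is an illustration only: the function name and the 365-day year are assumptions made for this example, not part of any tool or standard referenced here.

```python
# Illustrative sketch: translate an availability target into allowed downtime.
# Assumes a 365-day year; the function name is hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

Under these assumptions, a “five nines” target leaves a little over five minutes of total downtime per year, which helps explain why the conversation later shifted toward what unavailability actually costs the customer.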

The most effective approach to establishing meaningful metrics is to start with the “End” in mind. The “End” is typically an end user. Traditionally, end users are most concerned with the availability, features and functionality, and support of a specific “IT service” rather than the various technologies that are combined to create a service (e.g. applications, networks, databases). Therefore, the starting point for meaningful metrics is to determine how the end user or customer will evaluate the service provider’s performance on each of these services. Armed with this information, it is much easier to establish a limited number of high-level yet measurable metrics that can be quantified and reported. It is then the responsibility of the IT organization to “reverse engineer” the overall customer-focused metrics to determine the cascading sub-tier metrics that directly contribute to them. These sub-tier metrics include both technology metrics, which indicate the effectiveness of a specific technology, and service-related metrics, which represent the effectiveness of the combined technologies required to deliver a specific service. By combining the customer’s evaluation (perception) of the quality of service across all IT services, the overall effectiveness of the IT organization can be determined and improvement plans established. This approach to implementing metrics ensures that organizational behaviors are aligned with what is important to the end user or customer.
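
One way to picture how sub-tier technology metrics cascade up to a customer-facing service metric is sketched below. It assumes, purely for illustration, that a service depends on every one of its components (a serial dependency) and that component failures are independent, so the end-to-end availability is the product of the component availabilities. The component names and figures are hypothetical.

```python
import math
from typing import Iterable


def end_to_end_availability(component_availabilities_pct: Iterable[float]) -> float:
    """Combine component availabilities (in percent) into a service-level figure.

    Assumption: the service needs every component to be up, and failures are
    independent, so the fractions multiply.
    """
    return math.prod(a / 100 for a in component_availabilities_pct) * 100


# Hypothetical sub-tier technology metrics for one IT service.
components = {"application": 99.9, "network": 99.95, "database": 99.9}

service_availability = end_to_end_availability(components.values())
print(f"End-to-end service availability: {service_availability:.2f}%")
```

Under these assumptions the service the end user experiences is less available than any single component, which is why technology metrics viewed in isolation can overstate service quality.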

Therefore, there are a number of factors to consider when establishing meaningful metrics that will guide the appropriate organizational behaviors to ensure organizational success. These factors can best be considered by answering a number of key questions.
For example:

  • What is the mission?
  • In what ways does the success of the mission depend on IT?
  • Specifically, what IT services does the mission depend on most?
  • How does the end user evaluate the quality of the IT service(s)?
  • Can the degree of service quality be measured?
  • What key metrics can be established to measure service specific quality?
  • What technology metrics, if measured, will indicate the degree to which the various technologies are contributing to the quality of the IT service, as opposed to independently measuring the quality of the individual technologies that make up the service?
  • What are the metrics necessary to measure the effectiveness of the infrastructure processes required to effectively plan, design, develop, test, transition, implement, operate and continually improve the services?

Provided in Appendix A are common IT Service Management metrics for core ITIL-based processes. Once again, the appropriate metrics, KPIs, and critical success factors are unique to each organization, depending upon its relative maturity and what the end user considers most important to meeting service quality expectations. The appropriate technology-based metrics, such as latency, bandwidth, throughput, and packet loss within a network environment, should contribute directly to overall service quality; they are somewhat dependent upon the types of technologies that exist within the infrastructure and the availability of supporting tool suites.

In summary, the objective of metrics is to drive individual and organizational behavior by providing insight into an event that has occurred, to determine whether a desired outcome has or has not been achieved. Thus, choosing meaningful metrics is critical. Metrics for the sake of reporting are not meaningful. Meaningful metrics provide actionable information that facilitates accurate and timely decision making. As an organization matures, so should its metrics. Please see Appendix B for diagrams depicting the linkage between goals and metrics.

Appendix A – Common IT Service Management Metrics for Service Level Management, Critical Success Factors (CSFs) and Key Performance Indicators (KPIs)

Service Level Management (SLM)

Manage quantity and quality of IT service needed:
  • percentage reduction in SLA targets missed
  • percentage reduction in SLA targets threatened
  • percentage increase in Customer perception of SLA achievements via Customer Satisfaction Survey (CSS) responses
  • percentage reduction in SLA breaches caused by third party support contracts (Underpinning Contracts)
  • percentage reduction in SLA breaches caused by internal Operational Level Agreements (OLAs)

Deliver service as previously agreed at affordable costs:
  • total number and percentage increase in fully documented SLAs in place
  • percentage increase of SLAs agreed against operational services being run
  • percentage increase in completeness of Service Catalogue versus operational services
  • percentage improvement in the Service Delivery costs
  • percentage reduction in the cost of monitoring and reporting of SLAs
  • percentage increase in the speed and accuracy of developing SLAs

Manage business interface:
  • increased percentage of Services covered by SLAs
  • documented and agreed SLM processes and procedures are in place
  • reduction in the time to respond to and implement SLA requests
  • increased percentage of SLA reviews completed on time
  • reduction in the percentage of outstanding SLAs for annual renegotiation
  • reduction in the percentage of SLAs requiring Changes (for example, targets not attainable; Changes in usage levels)
  • percentage increase in the number of OLAs and third party contracts in place
  • documentary evidence that issues raised at service and SLA reviews are being followed up and resolved (e.g. via the CSIP)
  • reduction in the number and severity of SLA breaches
  • effective review and follow-up of all SLA, OLA and underpinning contract breaches

Capacity

Accurate business forecasts:
  • production of workload forecasts on time
  • percentage accuracy of forecasts of business trends
  • timely incorporation of business plans into Capacity Plan
  • reduction in the number of variances from the business plans and Capacity Plans

Knowledge of current and future technologies:
  • increased ability to monitor performance and throughput of all services and components
  • timely justification and implementation of new technology in line with business requirements (time, cost and functionality)
  • reduction in the use of old technology causing breached SLAs due to problems with support or performance

Ability to demonstrate cost-effectiveness:
  • a reduction in panic buying
  • reduction in the over-capacity of IT
  • accurate forecasts of planned expenditure
  • reduction in the business disruption caused by a lack of adequate IT capacity
  • relative reduction in the cost of production of the Capacity Plan

Ability to plan and implement the appropriate IT capacity to match business needs:
  • percentage reduction in the number of Incidents due to poor performance
  • percentage reduction in lost business due to inadequate capacity
  • all new services are implemented which match Service Level Requirements (SLRs)
  • increased percentage of recommendations made by Capacity Management are acted upon
  • reduction in the number of SLA breaches due to either poor service performance or poor component performance

Availability

Manage availability and reliability of IT service:
  • percentage reduction in the unavailability of services and components
  • percentage increase in the reliability of services and components
  • effective review and follow-up of all SLA, OLA and underpinning contract breaches
  • percentage improvement in overall end-to-end availability of services
  • percentage reduction in the number and impact of service breaks
  • improvement in the MTBF (mean time between failures)
  • improvement in the MTBSI (mean time between system incidents)
  • reduction in the MTTR (mean time to repair)

Satisfy business needs for access to IT services:
  • percentage reduction in the unavailability of services
  • percentage reduction of the cost of business overtime due to unavailable IT
  • percentage reduction in critical time failures, e.g. specific business peak and priority availability needs are planned for
  • percentage improvement in business and Users satisfied with service (by CSS results)

Availability of IT infrastructure achieved at optimum costs:
  • percentage reduction in the cost of unavailability
  • percentage improvement in the Service Delivery costs
  • timely completion of regular risk analysis and system review
  • timely completion of regular cost-benefit analysis established for infrastructure Component Failure Impact Analysis (CFIA)
  • percentage reduction in failures of third party performance on MTTR/MTBF against contract targets
  • reduced time taken to complete (or update) a risk analysis
  • reduced time taken to review system resilience
  • reduced time taken to complete an Availability Plan
  • timely production of management reports
  • percentage reduction in the incidence of operational reviews uncovering security and reliability exposures in application designs

IT Security

Management Controls (NIST 800-26):
  • Percentage of systems that had formal risk assessment performed and documented. (every 6 months)
  • Percentage of systems that have had risk levels reviewed by management. (6 months)
  • Percentage of total systems for which security controls have been tested and evaluated in the past year. (12 months)
  • The average time elapsed between vulnerability or weakness discovery and implementation of corrective action. (3 months)
  • Percentage of systems that have the costs of their security controls integrated into the life cycle of the system. (12 months)
  • Percentage of system changes recertified if security controls are added or modified after the system was developed. (12 months)
  • Percentage of total systems that have been authorized for processing following certification and accreditation. (3 months)
  • Percentage of systems that are operating under an Interim Authority to Operate (IATO). (3 months)
  • Percentage of systems with approved system Security plans. (12 months)
  • Percentage of current security plans. (6 months)

Operational Controls (NIST 800-26):
  • Percentage of systems compliant with the separation of duties requirement. (12 months)
  • Percentage of users with special access to systems who have undergone background evaluations. (6 months)
  • Percentage of information systems libraries that log the deposits and withdrawals of tapes. (6 months)
  • Percentage of data transmission facilities in the organization that have restricted access to authorized individuals. (6 months)
  • Percentage of laptops with encryption capability for sensitive files. (3 months)
  • Percentage of security-related user issues resolved immediately following the initial call. (6 months)
  • Percentage of used media sanitized before reuse or disposal. (12 months)
  • Percentage of critical data files and operations with an established backup frequency. (12 months)
  • Percentage of systems that have a contingency plan. (12 months)
  • Percentage of systems for which contingency plans have been tested in the past year. (12 months)
  • Percentage of systems that impose restrictions on system maintenance personnel. (12 months)
  • Percentage of software changes documented and approved through change request forms. (3 months)
  • Percentage of systems with the latest approved patches installed. (1 month)
  • Percentage of systems with automatic virus definition updates and automatic virus scanning. (6 months)
  • Percentage of systems that perform password policy verification. (6 months)
  • Percentage of in-house applications with documentation on file. (12 months)
  • Percentage of systems with documented risk assessment reports. (12 months)

Technical Controls (NIST 800-26):
  • Percentage of employees with significant security responsibilities who have received specialized training. (12 months)
  • Percentage of agency components with incident handling and response capability. (6 months)
  • Number of incidents reported to FedCIRC, NIPC and local law enforcement. (3 months)
  • Percentage of systems without active vendor-supplied passwords. (6 months)
  • Percentage of unique user IDs. (3 months)
  • Percentage of users with access to security software that are not security administrators. (3 months)
  • Percentage of systems running restricted protocols. (6 months)
  • Percentage of websites with a posted privacy policy. (3 months)
  • Percentage of systems on which audit trails provide a trace of user actions. (12 months)

Incident

Quickly resolve Incidents (applies to both Incident and Request Fulfillment):
  • percentage reduction in average time to respond to a call for assistance from first-line operatives
  • percentage increase in the Incidents resolved by first-line operatives
  • percentage increase in the Incidents resolved by first-line operatives on first response
  • percentage reduction of Incidents incorrectly assigned
  • percentage reduction of Incidents incorrectly categorized
  • reduced mean elapsed time for resolution or circumvention of Incidents, broken down by impact code
  • increased percentage of Incidents resolved within agreed (in SLAs) response times by impact code

Maintain IT service quality (applies to both Incident and Request Fulfillment):
  • reduction in the service unavailability caused by Incidents
  • increased percentage of Incidents resolved within target times by priority
  • increased percentage of Incidents resolved within target times by category
  • percentage reduction in the average time for second-line support to respond
  • reduction of the Incident backlog
  • percentage increase in the Incidents fixed before Users notice
  • percentage reduction in the Incidents reopened
  • percentage reduction in the overall average time to resolve Incidents
  • reduction in the number of Incidents requiring more than one second-line support team

Improve business and IT productivity (applies to both Incident and Request Fulfillment):
  • percentage reduction in average cost of handling Incidents
  • improved percentage of business Incidents dealt with by first-line operatives
  • percentage reduction in the number of times first-line operatives are bypassed
  • percentage improvement in average number of Incidents handled by each first-line operative
  • no delays in the production of management reports
  • improved scores on CSS responses

User satisfaction (applies to both Incident and Request Fulfillment):
  • percentage improvement in CSS responses on the Incident Management service
  • percentage reduction in length of queue time waiting for Service Desk response
  • percentage reduction in the number of lost Service Desk calls
  • percentage reduction of the number of revised business instructions issued

Service Desk

Receiving Calls (Performance) (applies to both Incident and Request Fulfillment):
  • % First Call Resolution
  • % First Call Resolutions without Passwords
  • # Dropped Calls
  • Average Call Hold Time

Workload Volumes (Performance):
  • Total # Calls / day / month / year
  • Average # Calls per Day
  • Average # Calls per Month
  • Average # Calls Assigned / day
  • Average # Calls Assigned / month

Operational Level (Quality) (applies to both Incident and Request Fulfillment):
  • Average duration by Call Type
  • Average duration by Call Group
  • Counts of Call Type / Group / Customer

Configuration

Control of IT assets:
  • percentage reduction in number of Configuration Item (CI) attribute errors found in Configuration Management Database (CMDB)
  • percentage increase in the number of CIs successfully audited
  • percentage improvements in the speed and accuracy of audit

Support the delivery of quality IT services:
  • percentage reduction in service errors attributable to wrong CI information
  • improved speed of component repair and recovery
  • improved Customer satisfaction with services and terminal equipment

Economic service provision:
  • reduction in the number of ‘missing or duplicated’ CIs
  • greater percentage of maintenance costs and license fees within budget
  • percentage reduction in software costs due to better control
  • percentage reduction in hardware costs due to better control of spares inventory and supplies
  • percentage improvement in average cost of maintaining CIs in CMDB

Support, integration and interfacing to all other ITSM processes:
  • reduced percentage of Change failures as a result of inaccurate configuration data
  • improved Incident resolution time due to the availability of complete and accurate configuration data
  • more accurate results from Risk Analysis audits due to available and accurate asset information

Problem

Improve service quality:
  • percentage reduction in repeat Incidents/Problems
  • percentage reduction in the Incidents and Problems affecting service to Customers
  • percentage reduction in the known Incidents and Problems encountered
  • no delays in production of management reports
  • improved CSS responses on business disruption caused by Incidents and Problems

Minimize impact of Problems:
  • percentage reduction in average time to resolve Problems
  • percentage reduction of the time to implement fixes to Known Errors
  • percentage reduction of the time to diagnose Problems
  • percentage reduction of the average number of undiagnosed Problems
  • percentage reduction of the average backlog of ‘open’ Problems and errors

Reduce cost of Problems to Users:
  • percentage reduction of the impact of Problems on Users
  • reduction in the business disruption caused by Incidents and Problems
  • percentage reduction in the number of Problems escalated (missed target)
  • percentage reduction in the IT Problem Management budget
  • increased percentage of proactive Changes raised by Problem Management, particularly from Major Incident and Problem reviews

Change

Repeatable process:
  • percentage fewer rejected RFCs
  • percentage reduction in unauthorized Changes detected
  • percentage of Change requests (business driven need) implemented on time
  • percentage reduction in average time to make Changes
  • percentage reduction in the Change backlog
  • percentage fewer Changes ‘backed out’ because of testing failures
  • percentage reduction in Changes required by previous Change failures
  • increase in the percentage of reports produced on schedule

Quick and accurate Changes:
  • percentage reduction in the number of urgent Changes
  • percentage reduction of urgent Changes causing Incidents
  • reduction in the percentage of Changes implemented without being tested
  • percentage reduction of urgent Changes requiring back-out
  • reduced percentage of urgent or high priority Changes submitted without business case to justify decision

Protect service:
  • reduction in both the scheduled and unscheduled service unavailability caused by Changes
  • percentage reduction in Changes backed out
  • percentage reduction of unsuccessful Changes
  • percentage reduction in Changes causing Incidents
  • percentage reduction in Changes impacting core service time and SLA service hours
  • percentage increase in Changes activated outside core service time and SLA service hours
  • reduction in the percentage of Changes not referred to a Change Advisory Board (CAB) or Change Advisory Board Emergency Committee (CAB/EC)
  • improvement in Customer Satisfaction Survey (CSS) feedback on Change
  • percentage reduction in failed Changes that do not have recorded back-out plan
  • percentage reduction in time to implement a Change freeze

Show efficiency and effectiveness results:
  • percentage efficiency improvement based on number of RFCs processed
  • percentage increase in the accuracy of Change estimates
  • percentage reduction in the average cost of handling a Change
  • percentage reduction in Change overtime due to better planning
  • reduction in the ‘cost’ of failed Changes
  • increased percentage of Changes implemented on time
  • increased percentage of Changes implemented to budget
  • reduction in the percentage of failed Changes
  • reduction in the percentage of backed out Changes

Release

Better quality software and hardware:
  • percentage reduction in the use of software and hardware Releases that have not passed the required quality checks
  • percentage reduction in installed software not taken from the Definitive Software Library (DSL)
  • percentage reduction in non-standard hardware
  • all bought-in software complies with legal restrictions
  • percentage reduction of unauthorized reversion to previous Releases
  • percentage reduction in the use of unauthorized software and hardware

Repeatable process for rolling out software and hardware Releases:
  • all new Releases planned and controlled by Release Management
  • all installed software taken from the DSL
  • all appropriate hardware stored in the Definitive Hardware Store (DHS)
  • percentage reduction in the number of failed distributions of Releases to remote sites
  • reduction in the percentage of urgent Releases
  • increase in the percentage of ‘normal Release units’ as opposed to ad hoc Releases

Implementation of Releases swiftly (business driven needs) and accurately:
  • percentage reduction in build failures
  • percentage of Releases implemented on time at all sites, including remote ones
  • percentage reduction in the number of urgent Releases
  • percentage reduction in the Releases causing Incidents
  • reduction in the percentage of Releases implemented without being tested
  • reduced percentage of urgent or high priority Releases requested without the appropriate business case/justification

Cost-effective Releases:
  • increased percentage of Releases built and implemented on schedule
  • percentage of Releases built and implemented within budget
  • reduction in the service unavailability caused by Releases
  • percentage reduction in Releases backed out
  • percentage reduction of failed Releases
  • percentage reduction in the average cost of handling a Release
  • percentage reduction in Release overtime due to better planning
  • reduction in the ‘cost’ of failed Releases
  • no evidence of payment of license fees, or wasted maintenance effort, for software that is not in use
  • no evidence of wasteful duplication in Release building (e.g. multiple builds of remote sites, when copies of a single build would suffice)
  • percentage improvement of the planned composition of Releases matching the actual composition (which demonstrates good Release planning)
  • percentage improvement in the resources required by Release Management
  • percentage increase in the accuracy of Release estimates
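
Many of the KPIs listed in Appendix A reduce to simple arithmetic once the underlying counts are collected. The sketch below shows one possible way to compute a period-over-period percentage reduction, the MTBF, MTTR and MTBSI figures from the Availability section, and a first call resolution rate from the Service Desk section. The class, field and function names, and the sample figures, are hypothetical; MTBF is taken here as uptime divided by the number of failures, and MTBSI as the full period divided by the number of failures, which is one common convention rather than the only one.

```python
from dataclasses import dataclass


def percentage_reduction(previous: float, current: float) -> float:
    """Percentage reduction from the previous period to the current one.

    Positive means the value went down (e.g. fewer SLA targets missed);
    negative means it went up.
    """
    if previous == 0:
        raise ValueError("previous period value must be non-zero")
    return (previous - current) / previous * 100


@dataclass
class OutageSummary:
    """Hypothetical summary of one reporting period for a single service."""
    period_hours: float    # total hours in the reporting period
    downtime_hours: float  # total hours the service was unavailable
    failures: int          # number of failures in the period (must be > 0)

    def mtbf(self) -> float:
        """Mean time between failures: uptime per failure."""
        return (self.period_hours - self.downtime_hours) / self.failures

    def mttr(self) -> float:
        """Mean time to repair: downtime per failure."""
        return self.downtime_hours / self.failures

    def mtbsi(self) -> float:
        """Mean time between system incidents: MTBF plus MTTR."""
        return self.period_hours / self.failures

    def availability_pct(self) -> float:
        """Availability as MTBF / (MTBF + MTTR), in percent."""
        return self.mtbf() / self.mtbsi() * 100


def first_call_resolution_pct(resolved_on_first_call: int, total_calls: int) -> float:
    """Share of calls resolved on the first call, in percent."""
    return resolved_on_first_call / total_calls * 100


# Hypothetical sample data: a 30-day month with 3 failures and 6 hours down.
month = OutageSummary(period_hours=720, downtime_hours=6, failures=3)
print(f"MTBF {month.mtbf():.1f} h, MTTR {month.mttr():.1f} h, "
      f"MTBSI {month.mtbsi():.1f} h, availability {month.availability_pct():.2f}%")
print(f"SLA targets missed: {percentage_reduction(20, 15):.1f}% reduction")
print(f"First call resolution: {first_call_resolution_pct(180, 240):.1f}%")
```

Expressing each KPI as a small function over raw counts keeps the definition explicit, which matters because the same label (for example MTBF) is calculated differently from one organization or tool to the next.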

Appendix B – Linking Goals to Metrics

 

Management Control Metrics – Examples

DS1 – Define and Manage Service Levels

DS2 – Manage Performance and Capacity

 

About G2SF

G2SF specializes in IT compliance, governance, and service management consulting; training; network security, engineering, operations, and management; and enterprise application support in accordance with the Information Technology Infrastructure Library (ITIL®), ISO/IEC 20000 standards, and other federally mandated requirements. G2SF is committed to institutionalizing various technical standards and service management best practices to increase operational efficiencies, reduce operational costs, and improve end user satisfaction within the world’s largest IT organizations. In doing so, the company has established a successful track record as an objective change agent by collaborating with clients to facilitate technology, organizational, and cultural transformations within large, complex, global, classified IT environments.