The Service and the Incident

Previous articles have introduced services. This article introduces the basic concepts of Incident Management, an essential part of running an IT service and of maintaining its quality.

24 Feb 2021 Alexandr Kolovratník Article

IT services surround us so constantly that we do not even notice them until they stop working and their unavailability affects us. Imagine losing your internet connection (an incident) and, as a result, not being able to open this article.

An incident is therefore a state of a service in which its functionality is negatively affected. This covers not only complete unavailability but also an unplanned reduction in quality (e.g., a slow Internet connection or the inability to send e-mail to a specific address). The goal of incident management is to restore the service, as quickly as possible, to the state in which it should be delivered to the user.

"An incident is an unplanned interruption of IT service or a reduction in the quality of IT service."

Incident management practices help set up incident resolution processes. With them in place, we can resolve incidents reliably: we know what steps need to be taken and no longer rely on individual heroes, who can rest knowing that everything is properly covered. The main goal is a uniform setup across all services, which makes it easier to find one's way around the complexity of services and their components.

Common-Sense Incident Management: Light Version

The process can be divided into several parts that naturally follow one another (see the picture: "Incident Management Process Flow").

Incident Detection

This is the first step in the incident resolution process. Leaving aside users, who can also report an incident, we use monitoring tools to detect error conditions. These are often isolated probes with no links to anything else, which makes it hard to evaluate what a probe represents and what has happened or should happen next. Orchestration tools and configuration databases can help here: they make it possible to identify the incident and its cause and to start the process in a timely and accurate manner.
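As a minimal sketch of how a configuration database can give an isolated probe some context, the lookup below maps a probe to a component and then to the services that depend on it. The probe names, component names, and both mapping tables are hypothetical placeholders, not taken from any real tool.

```python
# Hypothetical configuration data: which component a probe watches,
# and which services depend on that component.
PROBE_TO_COMPONENT = {
    "ping-core-switch-1": "core-switch-1",
    "http-mail-01": "mail-frontend",
}
COMPONENT_TO_SERVICES = {
    "core-switch-1": ["internet", "email", "intranet"],
    "mail-frontend": ["email"],
}


def affected_services(probe: str) -> list[str]:
    """Return the services that may be affected when the given probe fails."""
    component = PROBE_TO_COMPONENT.get(probe)
    if component is None:
        # Unknown probe: nothing can be inferred without more configuration data.
        return []
    return COMPONENT_TO_SERVICES.get(component, [])


print(affected_services("ping-core-switch-1"))
```

With such a lookup, a failing probe on a shared component can open one incident against the affected services instead of leaving the interpretation to whoever sees the alert first.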

Incident Registration

The detected incident must be recorded, and the record should contain as much of the input from detection as possible. One example is the automatic creation of an incident in the request system by a monitoring tool.
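Automatic registration could look roughly like the following sketch, which turns a monitoring alert into an incident record and keeps everything the probe reported. The `Alert` and `Incident` types and their fields are assumptions made for illustration; a real monitoring tool and request system define their own formats.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count


@dataclass
class Alert:
    """Hypothetical alert payload from a monitoring tool."""
    probe: str
    service: str
    message: str
    fired_at: datetime


@dataclass
class Incident:
    """Hypothetical incident record in the request system."""
    id: int
    service: str
    summary: str
    detected_at: datetime
    source: str
    details: dict = field(default_factory=dict)


_next_id = count(1)  # simple in-memory ID generator for the sketch


def register_incident(alert: Alert) -> Incident:
    """Create an incident that carries all input available at detection time."""
    return Incident(
        id=next(_next_id),
        service=alert.service,
        summary=f"{alert.service}: {alert.message}",
        detected_at=alert.fired_at,
        source=f"monitoring:{alert.probe}",
        details={"probe": alert.probe, "raw_message": alert.message},
    )


alert = Alert(
    probe="http-check-01",
    service="intranet",
    message="HTTP 503 from /health",
    fired_at=datetime(2021, 2, 24, 9, 30, tzinfo=timezone.utc),
)
incident = register_incident(alert)
print(incident.id, incident.summary, incident.source)
```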

Incident Classification

This step covers summarizing the incident and determining the attributes needed to eliminate it, e.g., priority, owner group, service group, and others. It helps direct the incident to the right solution team quickly and efficiently; moreover, it is a step closer to automation, both towards the user (incident reporting forms that assign a classification automatically) and towards the ServiceDesk or the administrator (automatically triggered events/repairs).
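One simple way to automate classification is a small rule table, sketched below. The rules, priority labels (`P1` to `P3`), and group names are invented for the example; an actual setup would derive them from the service catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    priority: str
    owner_group: str
    service_group: str


# Hypothetical rules: (service, keyword expected in the summary, result).
RULES = [
    ("internet", "unavailable", Classification("P1", "network-team", "connectivity")),
    ("internet", "slow", Classification("P2", "network-team", "connectivity")),
    ("email", "cannot send", Classification("P3", "messaging-team", "communication")),
]
# Anything that matches no rule goes to the service desk for manual triage.
DEFAULT = Classification("P3", "service-desk", "unclassified")


def classify(service: str, summary: str) -> Classification:
    """Return the first matching classification, or the default one."""
    text = summary.lower()
    for rule_service, keyword, result in RULES:
        if service == rule_service and keyword in text:
            return result
    return DEFAULT


print(classify("internet", "Internet is slow in building B"))
```

The same function can back both a user-facing reporting form and an automatically triggered repair, since both only need the resulting priority and owner group.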

Incident Diagnostics

The cause of an incident is not always as obvious as a power outage or an incident that has already been documented and recorded. Thanks to the previous classification step, it is possible to determine easily and automatically who is to be notified and what role they play in the solution, e.g., by opening a Teams channel for easier and faster communication and diagnostics.
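The sketch below shows how a classification result might be turned into a notification plan listing who is informed and in what role. The contact directory, role names, and channel naming scheme are hypothetical, and the code only builds the plan; it does not call any chat tool.

```python
# Hypothetical directory: owner group -> list of (person, role in the solution).
CONTACTS = {
    "network-team": [("alice", "resolver"), ("bob", "communicator")],
    "service-desk": [("carol", "coordinator")],
}
FALLBACK_GROUP = "service-desk"


def notification_plan(owner_group: str, priority: str) -> dict:
    """Decide which channel to open and who should be in it."""
    group = owner_group if owner_group in CONTACTS else FALLBACK_GROUP
    people = list(CONTACTS[group])  # copy so the directory is not modified
    if priority == "P1":
        # Assumption for the sketch: top-priority incidents also inform the service owner.
        people.append(("service-owner", "informed"))
    return {"channel": f"incident-{group}", "notify": people}


plan = notification_plan("network-team", "P1")
print(plan["channel"], plan["notify"])
```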

Incident Resolution

Resolution is not just about eliminating the cause; it is essential to take care of the user, make sure that the repair is correct, and close the incident together with them. In some cases, the incident is easily removed; in other cases, it requires changes, e.g., in the configuration. The output should always be an extension of the documentation, e.g., a description of how the repair was performed, or of the workaround if the repair is more time-consuming and cannot be deployed immediately.

Incident Conclusion

The incident does not end with eliminating the cause and verifying the fix with the user. Part of the closure is the creation of a report on the incident. The report gives the service owner and management an overview of how the service is delivered to the user over time, and it supports continuous improvement of the service itself, because potential weaknesses or gaps in the service documentation may stand out.
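To illustrate what such an overview over time might contain, the following sketch aggregates closed incidents per service into a count, total downtime, and mean time to resolve. The sample incidents are made up, and these three figures are only one possible choice of metrics.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Made-up closed incidents: (service, detected at, resolved at).
closed_incidents = [
    ("internet", datetime(2021, 2, 1, 8, 0), datetime(2021, 2, 1, 9, 30)),
    ("internet", datetime(2021, 2, 10, 14, 0), datetime(2021, 2, 10, 14, 30)),
    ("email", datetime(2021, 2, 12, 10, 0), datetime(2021, 2, 12, 10, 45)),
]


def summarize(incidents):
    """Aggregate incident count, total downtime, and mean time to resolve per service."""
    totals = defaultdict(lambda: {"count": 0, "downtime": timedelta()})
    for service, detected, resolved in incidents:
        totals[service]["count"] += 1
        totals[service]["downtime"] += resolved - detected
    return {
        service: {
            "count": t["count"],
            "downtime": t["downtime"],
            "mean_time_to_resolve": t["downtime"] / t["count"],
        }
        for service, t in totals.items()
    }


report = summarize(closed_incidents)
for service, stats in report.items():
    print(service, stats["count"], stats["downtime"], stats["mean_time_to_resolve"])
```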

Incident Record

The record should contain all the steps that were taken during the incident. This matters not only because it makes incident management more efficient and more transparent, but also because it makes it possible to capture and review high-priority incidents.

A more detailed presentation of Incident Management would make for much longer reading, mainly because Incident Management according to ITIL describes roles, responsibilities, and the process itself in detail. We will present this and discuss it with stakeholders in a different way. In this text, we deliberately avoided describing the roles responsible for individual areas and activities. Anyone who works in IT will undoubtedly recognize themselves in some part, or even in the entire life cycle, of the incident described here. Incident management will help us name these activities, set boundaries, and remove some duplication among them, as we have dedicated teams performing some of these roles.

Incident management is a set of practices with an intuitive and logical process whose clear goal is to deal with incidents in a timely and appropriate manner and to learn from them. At the same time, it makes all services more transparent, reduces the burden on individuals, and allows us to act as one team towards our users.
