How to leverage logs to find similar incidents for a given event.

Artificial Intelligence for IT Operations (AIOps) refers to software systems that combine big data and artificial intelligence (AI) or machine learning (ML) to mine the voluminous information coming from disparate data sources (e.g., logs, metrics, alerts, incidents, anomalies) in order to identify events. AIOps then correlates and groups these events by inferring patterns for fault localization and uses this information to find similar historical incidents for action recommendation.

According to Catchpoint’s SRE Report 2020, 80% of SREs work on post-mortem analysis of incidents due to a lack of provided information, and 16% of toil comes from investigating false positives/negatives. A core task in incident management is finding similar historical incidents for a given event [1].

This is a challenging problem because the vocabularies of alerts and incidents can differ: alert descriptions are machine-generated, whereas incident descriptions are human-generated. Moreover, two or more events may have the same description even though their underlying root causes differ. This article addresses these challenges by leveraging logs to find similar incidents.

Terminology

This section defines the terms related to incident management that we will be using throughout this article:

  • An event indicates that something of note has happened and is associated with one or more applications, services or other managed resources. For instance, a container is moved to a new node, a column is added to a DB table, a new version of an application is deployed, or memory or CPU is exhausted.
  • An alert is a record of an event type indicating a fault condition in the managed environment. It requires (or will require) human or automated attention and action toward remediation. For instance, a disk drive failure or a network link going down could be alerts.
  • An incident represents a reduction in the quality of a business application or service and is driven by one or more alerts. Incidents require prompt attention; for instance, an unresponsive application or an inaccessible storage array could be serious outages.
  • Logs are a fundamental source of data generated at every level of an application’s components. Each log line includes details about an event, such as the resource that was accessed, who accessed it and the time.

Finding similar events

Two events may or may not have similar descriptions, but if their underlying logs are similar, then they are most likely related to each other. This is the key hypothesis behind using logs to find similar events.

Each application consists of several microservices, and some of these services depend on other services, forming a graph. If one service fails, any other service upstream or downstream of the failed service could emit error log lines. It is therefore important to identify the error log lines corresponding to each failed microservice and collate them to form a log signature for the event. We obtain the log lines for each event from a window of ±5 minutes around the outage start time (i.e., 10 minutes of log data). Each log line from this set is input to a pretrained error classifier, whose output is 0 (erroneous) or 1 (non-erroneous). The classifier lets us separate the error log lines from the non-erroneous log lines, which reflect a healthy state of the system and the corresponding microservice.
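
To make this step concrete, here is a minimal sketch of the windowing and error-filtering logic. The log record layout and the classifier interface (a predict method returning 0 for erroneous lines and 1 for non-erroneous ones, as above) are assumptions for illustration, not the article's actual implementation:

from datetime import timedelta

WINDOW = timedelta(minutes=5)

def error_lines_for_event(log_lines, outage_start, classifier):
    """Collect error log lines within +/- 5 minutes of the outage start.

    Assumes `log_lines` is an iterable of (timestamp, application_id, text)
    tuples and `classifier.predict(text)` returns 0 (erroneous) or
    1 (non-erroneous), as described above.
    """
    start, end = outage_start - WINDOW, outage_start + WINDOW
    return [(app_id, text)
            for ts, app_id, text in log_lines
            if start <= ts <= end and classifier.predict(text) == 0]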

To use error log lines for event similarity, each log line is processed and templatized, and the results are collated to form a log signature for each event. The objective of templatization is to normalize log lines to a common id, called a template-id. As a result, for a given event, there is a set of template-ids and corresponding application-ids. We propose a log-signature representation for each event built from its template-ids and corresponding application-ids, and use that representation for event similarity. A sketch of the templatization step follows.
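
The sketch below masks the variable parts of a log line and hashes the result into a stable template-id. Production systems typically use a dedicated log parser such as Drain; the regex rules here are only an assumption made to convey the idea:

import hashlib
import re

# Illustrative masking rules; real systems use a trained log parser.
PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),  # IPv4 addresses
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),              # hex values
    (re.compile(r"\b\d+\b"), "<NUM>"),                     # plain numbers
]

def templatize(log_line: str) -> str:
    """Normalize a log line to a template-id by masking variable parts."""
    for pattern, token in PATTERNS:
        log_line = pattern.sub(token, log_line)
    return "template_" + hashlib.md5(log_line.encode()).hexdigest()[:8]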

The example below shows a log signature for an event. There are three log template ids: template_id_a, template_id_b and template_id_c. Two of them (template_id_a and template_id_b) belong to application_id_a, and one (template_id_c) belongs to application_id_b. This representation is called the log signature of the event:

{
  "templates": [{
     "application_id": "application_id_a",
     "template": "template_id_a"
  }, {
     "application_id": "application_id_a",
     "template": "template_id_b"
  }, {
     "application_id": "application_id_b",
     "template": "template_id_c"
  }]
}
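
A signature like the one above can be assembled from the filtered error lines and the templatize helper sketched earlier. This is an illustrative composition of those hypothetical helpers, not the article's exact code:

def build_log_signature(error_lines):
    """Build a log signature from (application_id, raw_log_line) pairs.

    Reuses the `templatize` helper sketched above; duplicates are dropped
    so each (application_id, template_id) pair appears once.
    """
    seen, templates = set(), []
    for app_id, text in error_lines:
        entry = (app_id, templatize(text))
        if entry not in seen:
            seen.add(entry)
            templates.append({"application_id": entry[0], "template": entry[1]})
    return {"templates": templates}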

Once we have a log signature for each event, the similarity between two events is calculated by computing the overlap between their application ids. For each application id that overlaps, the overlap between the respective template ids is computed, yielding a score called the log-template similarity score.
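
The article does not spell out the exact formula, so the sketch below is one plausible instantiation: the Jaccard overlap of template-id sets per shared application id, averaged over the union of application ids.

from collections import defaultdict

def log_template_similarity(sig_a: dict, sig_b: dict) -> float:
    """One plausible log-template similarity score (assumed, not canonical).

    Templates are grouped by application id; for every application id the
    two events share, the Jaccard overlap of their template-id sets is
    computed, and the per-application overlaps are averaged over the union
    of application ids so that non-shared applications lower the score.
    """
    def by_app(sig):
        groups = defaultdict(set)
        for entry in sig["templates"]:
            groups[entry["application_id"]].add(entry["template"])
        return groups

    apps_a, apps_b = by_app(sig_a), by_app(sig_b)
    all_apps = set(apps_a) | set(apps_b)
    if not all_apps:
        return 0.0
    score = sum(len(apps_a[app] & apps_b[app]) / len(apps_a[app] | apps_b[app])
                for app in set(apps_a) & set(apps_b))
    return score / len(all_apps)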

Hypothesis testing

In this section, we verify the hypothesis that two similar events may or may not have lexically matching incident descriptions, but their logs should have a high, discriminative overlap. Figure 1 shows four events that SREs told us were similar to each other:

Figure 1: A set of four events that SREs described as similar to each other. Values in blue and green are text based and log-template based similarity scores, respectively.

We computed the similarity between them using two methods: text-based similarity and log-template-based similarity. To compute the event-description-based (text) similarity between two events, we obtain a distributed representation of each event description using the Universal Sentence Encoder and then compute the cosine similarity between the two vectors, as in the sketch below.
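
A minimal sketch of this text-based baseline, using the publicly available Universal Sentence Encoder from TensorFlow Hub (the model URL below is the standard public one; whether the article used this exact version is an assumption):

import numpy as np
import tensorflow_hub as hub

# Load the public Universal Sentence Encoder model from TF Hub.
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def text_similarity(desc_a: str, desc_b: str) -> float:
    """Cosine similarity between USE embeddings of two event descriptions."""
    vec_a, vec_b = encoder([desc_a, desc_b]).numpy()
    return float(np.dot(vec_a, vec_b) /
                 (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))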

The previous section outlined our method for calculating log-template-based similarity between two events. The results show that whenever text descriptions share many terms, the text-based method produces high scores. When there are few overlapping terms, however, it produces low scores. For example, the incident descriptions “database processing delayed for some users” and “Customers unable to view DB dashboard” have a low similarity score of 0.055, yet, per the ground truth communicated by the SREs, these two events are actually related to each other.

When we instead use log-template-based similarity, we observe that it captures the relatedness between events very well, because the similarity is computed from the symptoms reflected in the logs and captured in the log signatures. For the pair mentioned above, the log-template-based similarity score is 0.783, indicating that their log signatures have a high overlap and thus that the events are highly related.

Summary

Using the text descriptions of events to compute similarity between them is unreliable and may produce inaccurate results. This article presented an approach that leverages logs to compute similarity between events and showed that it outperforms the traditional text-based similarity method.

Reference

[1] Chen, Y., Yang, X., Dong, H., He, X., Zhang, H., Lin, Q., Chen, J., Zhao, P., Kang, Y., Gao, F., et al.: Identifying linked incidents in large-scale online service systems. In: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 304–314 (2020)
