Root cause analysis

Instana manages incidents and accelerates the identification of the probable root cause. Instana automatically detects incidents, issues, and changes to help you detect, understand, and investigate quality-of-service issues in your applications.

DevOps practitioners face significant problems in today’s world of dynamic applications that are composed of hundreds or possibly thousands of components. When things break, they need to detect and understand the problem as soon as possible, ideally before users start to feel the service impact. After they restore the service, they need to fix the exact root cause and make sure that the problem does not occur again. Identifying the root cause of an issue can take hours or days, and often the cause remains unidentified.

Incidents

Incidents have the highest severity level. They are created when edge services that are accessed by users are impacted or when an imminent risk of impact exists. By using the Dynamic Graph, Instana correlates all relevant events for each incident to provide context and root cause analysis hypotheses.

For example, when a service suddenly responds slower than usual, the incident is called a sudden increase in average latency. The incident is automatically marked in yellow as a warning. The color remains as long as the incident is active. After the incident is resolved, the color changes to gray, and the incident remains available in the drill-down menu. See the following example of an incident.

The incident detail view is organized into three parts:

The header contains the key facts of the incident:
- Start time;
- End time (the current time if the incident is still ongoing);
- The number of events that are still active;
- The number of changes that belong to the incident;
- The number of affected entities.

Figure 2. Incident KPIs

The second section provides a visual representation of how the incident develops over time. The chart shows the complete time frame, from start to end, and all events, which are sorted by start time. When collapsed, the view is limited to seven events. If the incident contains more than seven events, press the expand button to see the full view. Clicking any of the bars opens the detail view for that event:

Figure 3. Incident population

The third section contains the details for the graph view in the second section. It lists all events, sorted by start time. Click an event to expand it and see all the available information for that event:

Figure 4. Expanded incident event

The details help you understand the event and are followed by multiple charts that plot the corresponding metrics. If an event is still active, the charts continue to render incoming metric values. Two flags are available: one indicates that an event affects a service, and the other indicates that an event triggered the incident. Where applicable, the flags are placed over each event in the list.

When you focus on an event, the detail section provides the same information that is described for the event list in the third section of the incident detail view.

Automatic probable root cause (public preview)

To reduce the mean time to remediation (MTTR) for DevOps practitioners, Instana automates probable root cause identification with an algorithm that dynamically analyzes trace statistics and topology by using Causal AI. This algorithm identifies the probable root cause entity of a failure, which enables DevOps practitioners to quickly determine the probable source of an application's failure.

You can access the Probable Root Cause section on any incident that is created from a Smart Alert on the following entity types:

- Application perspectives
- Services
- Endpoints
- Service Level Objectives on application perspectives

The Probable Root Cause section has the following primary sections:

- The probable root cause entity and related infrastructure or application information.
- The evidence that shows DevOps practitioners why a specific entity is identified as a probable root cause.
- The events that are associated with the probable root cause entity and the probability level that indicates the likelihood of failure.
- The trace error messages and logs that are relevant to the probable root cause, which reveal additional details of the problem at first glance.

The Causal AI algorithm identifies an entity (or multiple entities) that is likely to be the source of the problem. The entity can be any physical or logical entity that is monitored by Instana and is displayed as the Probable root cause entity. The displayed entity links to the entity page, which describes the state of the entity at the time of the incident. The associated events are all recent events that occurred on the Probable Root Cause entity. With detailed associated events, the DevOps practitioners can quickly identify issues, incidents, or change events that caused the problem. The link to the Unbounded Analytics page is prefiltered to the identified Probable Root Cause entity, helping the DevOps practitioners quickly identify traces and errors that are related to that entity.

In the Probable Root Cause section, Instana displays up to three entities that most likely failed. To see other entities that are possible candidates for Probable Root Cause, click the Most likely cause and Potential cause tabs. These entities are sorted by the likelihood of failure, so the most likely root cause is shown first. The probability level is labeled as low, medium, or high, where high indicates the highest probability of failure. When you hover over the probability level, a tooltip explains the meaning of each label.
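
The ordering described above can be pictured with a minimal Python sketch. It is not Instana's Causal AI algorithm; it only illustrates sorting candidates by probability label and keeping at most three. The Candidate type, the LEVEL_RANK mapping, and the entity names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical ordering of the probability labels; a higher rank means more likely.
LEVEL_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class Candidate:
    entity: str       # hypothetical entity name
    probability: str  # "low", "medium", or "high"


def top_candidates(candidates, limit=3):
    """Return at most `limit` candidates, most likely first."""
    ranked = sorted(candidates, key=lambda c: LEVEL_RANK[c.probability], reverse=True)
    return ranked[:limit]


candidates = [
    Candidate("checkout-db", "medium"),
    Candidate("payment-service", "high"),
    Candidate("cart-cache", "low"),
    Candidate("gateway", "low"),
]
shown = top_candidates(candidates)
print([c.entity for c in shown])
```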

Identified trace errors and logs

- Trace error messages: Error messages from erroneous calls. They capture specific errors that are identified while your software is running and are ordered by the number of erroneous calls. The error messages include stack traces and error codes.
- Trace logs: A more comprehensive record of the events in the system's call flow. The trace logs are ordered by count and include log levels such as ERROR and WARN.
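
As a rough illustration of ordering error messages by the number of erroneous calls, the following Python sketch counts messages and lists the most frequent first. The sample messages are invented, and the sketch makes no claim about how Instana computes the ordering internally.

```python
from collections import Counter

# Hypothetical error messages taken from erroneous calls in a set of traces.
erroneous_call_messages = [
    "Connection refused",
    "HTTP 500 Internal Server Error",
    "Connection refused",
    "Timeout after 30000 ms",
    "Connection refused",
    "HTTP 500 Internal Server Error",
]


def rank_messages(messages):
    """Return (message, count) pairs, most frequent first."""
    return Counter(messages).most_common()


ranked = rank_messages(erroneous_call_messages)
for message, count in ranked:
    print(f"{count:>3}  {message}")
```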

For advanced LLM-based investigation, see Intelligent incident investigation.

Issues

An Issue is an event that is created when an application, a service, or any part of it is degraded. Instana comes with several hundred curated health signatures that detect problems that range from degradation of service quality to complex infrastructure issues and disk saturation. Issues are automatically resolved when the metrics, events, or metadata return to the expected values.

In addition to built-in issues, you can define custom events to detect problems that are specific to your system.

To see all issues that are detected by Instana (both built-in and custom), go to the Events view, and select the Issues tab. You can use Dynamic Focus to filter issues.

Each Instana issue contains the following information:

- Severity: CRITICAL or WARNING. CRITICAL means that a direct or indirect risk of data loss or of a service becoming unavailable exists. WARNING means any other performance issue that might impact the user experience or lead to a problem in the long term.
- Start time, end time, and duration of the issue.
- Affected entities: One or more entities that are affected by the problem.
- Details: Extra description that provides more context and measures to resolve the problem.
- Metrics: Metric charts that show the metric values that are relevant to the problem around the time that the problem happened.
- Where applicable, a link to Unbounded Analytics to investigate traces, calls, or page loads that are affected by the issue.

In this example, the CPU steal time on one Linux machine is suspicious and is therefore marked as an issue. An issue by itself does not trigger an alert, but Instana notes that it happened. If the service that this system is connected to behaves badly, the issue becomes part of the incident. This methodology is one of the major benefits of Instana because you do not need to manually correlate events and performance problems. Just because something uses too much CPU for a while does not mean that a problem exists. This information becomes relevant only when it impacts a service.

For more information about managing built-in and custom issues, see Manage Built-in Events.

Because Instana knows all dependencies between monitored services, it triggers Incidents for all quality-of-service issues that impact users. It also triggers Incidents for critical infrastructure issues, such as disk saturation and Elasticsearch cluster split-brain situations, because these issues are likely to cause data loss.
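
One way to picture how an infrastructure issue becomes part of an incident is a walk over a dependency graph. The following Python sketch is only an illustration under stated assumptions: the graph, the entity names, and the correlate function are hypothetical, and the sketch does not reproduce how the Dynamic Graph works internally.

```python
from collections import deque

# Hypothetical dependency graph: each entity maps to the entities it depends on.
DEPENDS_ON = {
    "shop-frontend": ["checkout-service"],
    "checkout-service": ["checkout-db", "linux-host-1"],
    "checkout-db": ["linux-host-2"],
    "search-service": ["linux-host-3"],
}

# Hypothetical open issues, keyed by the entity on which they were detected.
OPEN_ISSUES = {
    "shop-frontend": "Sudden increase in average latency",
    "linux-host-1": "High CPU steal time",
    "linux-host-3": "High CPU steal time",
}


def correlate(impacted_service, depends_on, open_issues):
    """Collect open issues on the impacted service and on everything it depends on."""
    related = {}
    seen = {impacted_service}
    queue = deque([impacted_service])
    while queue:
        entity = queue.popleft()
        if entity in open_issues:
            related[entity] = open_issues[entity]
        for dependency in depends_on.get(entity, []):
            if dependency not in seen:
                seen.add(dependency)
                queue.append(dependency)
    return related


incident_events = correlate("shop-frontend", DEPENDS_ON, OPEN_ISSUES)
print(incident_events)
```

In this sketch, the CPU issue on linux-host-1 is attached to the incident because the impacted service depends on that host, while the same kind of issue on linux-host-3 stays a standalone issue.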

Note: Applications, services, or endpoints that receive infrequent traffic (for example, one call every 15 minutes) do not provide a sufficient basis for issue detection. The severity of an issue can change during its lifetime. The displayed severity represents the highest severity that the issue ever reached.
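
The "highest severity ever reached" rule can be sketched in a few lines of Python. The Severity values and the IssueSeverityTracker class are hypothetical and only illustrate the rule; only the relative order of WARNING and CRITICAL matters here.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Numeric values are illustrative; only their order matters in this sketch.
    WARNING = 1
    CRITICAL = 2


class IssueSeverityTracker:
    """Remember the highest severity that an issue reached during its lifetime."""

    def __init__(self):
        self.highest = None

    def observe(self, severity):
        if self.highest is None or severity > self.highest:
            self.highest = severity
        return self.highest


tracker = IssueSeverityTracker()
for observed in [Severity.WARNING, Severity.CRITICAL, Severity.WARNING]:
    tracker.observe(observed)
print(tracker.highest.name)
```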

Impacted Users for application issues (private preview)

This feature is in private preview. To be included in the program, contact Instana technical support.

By using this feature, you can see the users who are impacted by a specific event and gain insight into how events affect your users, which helps you quickly identify and address issues that impact user experience.

Availability

To use this feature, ensure that the following conditions are met:

- Both your front-end (website or mobile app) and back-end servers are monitored by Instana.
- The correlation between front-end and back-end monitoring functions as expected. For more information, see [Backend correlation](../website_monitoring/backend_correlation.md).
- The Impacted Users feature is currently supported only for application issues.

What is an impacted user?

An impacted user is a user whose experience is negatively affected by an application issue that triggers an event. For example, an impacted user might be someone whose journey or visit to your website or mobile app is disrupted by a back-end server error in one of the following ways:

- The user encounters a critical error page and cannot continue using the site or app.
- The user experiences significant delays or timeouts, leading to a disrupted experience.
- The user's actions (such as form submissions or transactions) fail to complete due to server-side issues.

Event data correlation and impact analysis

When an event is triggered, the system correlates data from your front-end and back-end monitoring to identify which end users are impacted. Then, you can view detailed information about the affected users and understand the scope and impact of the issue.

Changes

A Change is an event that represents a change on a system, such as a server start or stop, a deployment, or a configuration change. Changes are further separated into the following types:

- Changes: Changed configuration of components, for example, versions or environment variable values.
- Offline/Online: Tracking of the presence of components under management.

Change events are important information that is used together with the Dynamic Graph to automatically detect how configuration changes relate to incidents.

Events view

To see all events that are detected by Instana, go to the Events dashboard and select the Incidents, Issues, Changes, or All tab to see the corresponding event types.

Filtering Capabilities for all Events

Dynamic Focus Query

Searching through events that are discovered by Instana relies on the Dynamic Focus feature. When you select one or more bars in the Events bar chart, the Events table lists only the events that are included in the selected bars, so you can inspect events in detail without changing the current time interval. You can also use the search box to find specific items by the data in the “Title” or “On” columns of the Overview table (the “On” column shows the service where the incident occurred). In this example, the search query is event.text:"Error rate". The result is a list of all events that contain the phrase "Error rate" in the title.
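
The effect of the example query can be approximated with a small Python sketch that runs on local sample data. The parse_text_query and filter_events helpers, the event dictionaries, and the case-insensitive matching are all assumptions made for illustration; the sketch does not call Instana and does not implement the full Dynamic Focus syntax.

```python
import re


def parse_text_query(query):
    """Parse a query of the form event.text:"<phrase>" and return the phrase."""
    match = re.fullmatch(r'event\.text:"([^"]*)"', query.strip())
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    return match.group(1)


def filter_events(events, query):
    """Keep events whose title contains the phrase (case-insensitive by assumption)."""
    phrase = parse_text_query(query).lower()
    return [event for event in events if phrase in event["title"].lower()]


# Hypothetical rows of the events table: "title" and "on" mirror the two columns.
events = [
    {"title": "Error rate is higher than normal", "on": "checkout"},
    {"title": "Sudden increase in average latency", "on": "catalog"},
    {"title": "Sudden increase in error rate", "on": "payment"},
]

matches = filter_events(events, 'event.text:"Error rate"')
print([event["on"] for event in matches])
```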

Filtering Table

The Events view provides filtering capabilities through dedicated UI filters. You can filter the event list by using three key filter options:

- Transient Events: Whether events are transient, non-transient, or both.
- Event Type: Whether events are Built-in or Custom.
- Smart Alerts: Whether events are triggered by Smart Alerts from Application, Website, Synthetics, Infrastructure, Mobile, Log, or SLO.

These filters can be used individually or in combination to quickly find relevant events and focus your troubleshooting efforts on what matters most.
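
Combining the filters behaves like a logical AND over the options that are set. The following Python sketch shows that idea on local sample data; the Event type, its field names, and the apply_filters function are hypothetical and are not part of any Instana API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    title: str
    transient: bool
    event_type: str             # "built-in" or "custom"
    smart_alert: Optional[str]  # for example "Application"; None if not from a Smart Alert


def apply_filters(events, transient=None, event_type=None, smart_alert=None):
    """Keep events that match every filter that is set; None means no filter."""
    selected = []
    for event in events:
        if transient is not None and event.transient != transient:
            continue
        if event_type is not None and event.event_type != event_type:
            continue
        if smart_alert is not None and event.smart_alert != smart_alert:
            continue
        selected.append(event)
    return selected


events = [
    Event("Error rate is high", False, "built-in", "Application"),
    Event("Nightly job slow", True, "custom", None),
    Event("Page load time degraded", False, "custom", "Website"),
]

custom_only = apply_filters(events, event_type="custom")
combined = apply_filters(events, transient=False, event_type="custom")
print([event.title for event in custom_only])
print([event.title for event in combined])
```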