Root cause analysis

Instana manages incidents and accelerates the identification of root causes. Instana automatically detects changes, issues, and incidents to help you detect, understand, and investigate quality-of-service issues in your applications.

DevOps practitioners face significant problems in today’s world of dynamic applications that are composed of hundreds or possibly thousands of components. When things break, they need to detect and understand the problem as soon as possible, ideally before users start to feel the service impact. After they restore the service as quickly as possible, they need to fix the exact root cause and make sure that the problem does not occur again. Identifying the root cause of an issue can take hours or days, and often the cause is never identified.

Changes

A Change is an event that represents a change on a system, such as a server start or stop, a deployment, or a configuration change. Changes are further separated into:

  • Changes - Changed configuration of components, for example, versions and environment variable values
  • Offline/Online - Tracking the presence of components under management

Change events are important information that is used together with the Dynamic Graph to automatically relate configuration changes to incidents.

Issues

An Issue is an event that is created when an application, a service, or any part of it is degraded. Instana comes with several hundred out-of-the-box curated health signatures that detect various problems, ranging from degraded service quality, to complex infrastructure issues, to disk saturation. Issues are automatically resolved as soon as the metrics, events, or metadata return to the expected values.

In addition to built-in issues, you can define custom events to detect problems that are specific to your system.

To see all issues (both built-in and custom issues) that are detected by Instana, go to the Events view, and click the Issues tab. You can use Dynamic Focus to filter issues.

Each Instana issue contains the following information:

  • Severity: CRITICAL or WARNING. CRITICAL means that there is a direct or indirect risk of data loss or of services being unavailable. WARNING means any other performance issue that might impact the user experience or lead to a problem in the long term
  • Start, end time, and duration of the issue
  • Affected entities: one or more entities that are affected by the problem
  • Details: additional description that provides more context and measures to resolve the problem
  • Metrics: metric charts that show metric values that are relevant to the problem around the time the problem happened
  • Where applicable, you can go to Unbounded Analytics to investigate traces, calls, or page loads that are affected by the issue
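The fields listed above can be pictured as a simple record. The following Python sketch is only an illustration of that structure: the class names, field names, and the numeric ordering of the severity values are assumptions made for this example and are not Instana's actual data model or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    # The numeric values are arbitrary and only express that
    # CRITICAL ranks above WARNING (assumption for this sketch).
    WARNING = 1
    CRITICAL = 2


@dataclass
class Issue:
    """Hypothetical record that mirrors the fields described in the list."""
    severity: Severity
    start: datetime
    end: Optional[datetime] = None  # None while the issue is still open
    affected_entities: List[str] = field(default_factory=list)
    details: str = ""

    def duration(self, now: datetime) -> timedelta:
        # An open issue is measured up to the supplied "now" timestamp.
        return (self.end or now) - self.start
```

An open issue has no end time yet, so this sketch measures its duration against the current time that the caller passes in.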

In this example, the CPU steal time on one Linux machine is suspicious and is therefore marked as an issue. An issue by itself does not trigger an alert; Instana only notes that it happened. If the service that this system is connected to behaves badly, this issue becomes part of the incident. This methodology is one of the major benefits of Instana because you don't need to manually correlate events and performance problems. Just because something is using too much CPU for a while doesn’t mean that a problem exists. Only when a service is impacted does this become relevant information.

See Manage Built-in Events for more information on managing built-in and custom issues.

Because Instana knows all dependencies between monitored services, it triggers Incidents for all quality-of-service issues that impact users. In addition, some critical infrastructure issues, such as disk saturation and Elasticsearch cluster split-brain situations, trigger incidents because the result is most likely data loss.

Applications, services, or endpoints that receive infrequent traffic (for example, one call every 15 minutes) are not considered to provide a sufficient basis for issue detection. The severity of an issue can change during its lifetime. The severity that is shown represents the highest severity that the issue ever reached.
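The rule that an issue reports the highest severity it ever reached can be illustrated with a small Python sketch. The function name and the numeric ranking are assumptions made for this example; they do not describe how Instana stores or computes severities internally.

```python
# Assumed ranking for this sketch: CRITICAL outranks WARNING.
SEVERITY_RANK = {"WARNING": 1, "CRITICAL": 2}


def reported_severity(observed):
    """Return the highest severity seen over the lifetime of an issue.

    `observed` is the sequence of severities the issue went through,
    in any order, for example ["WARNING", "CRITICAL", "WARNING"].
    """
    return max(observed, key=SEVERITY_RANK.__getitem__)
```

With this rule, an issue that escalates to CRITICAL and later drops back to WARNING is still reported as CRITICAL.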

Incidents

Incidents have the highest severity level. They are created when edge services that are accessed by users are impacted or when an imminent risk of impact exists. By using the Dynamic Graph, Instana correlates all relevant events for each incident to provide context and root cause analysis hypotheses.

For example, when a service suddenly responds slower than usual, this is called a sudden increase in average latency. The incident is automatically marked in yellow as a warning. The color is shown for as long as the incident is active. After the incident is resolved, the color changes to gray, and the incident is still available in the drill-down menu. See the following example of an incident.

The incident detail view is organized into three parts:

  1. The header contains basic information about the key facts of the incident.

    • Start time;
    • End time (current if it is still ongoing);
    • The number of events that are still active;
    • The number of changes involved;
    • The number of affected entities.

    You can see the incident start date, the end date (if available), how many events are still active, how many changes belong to this incident, and the number of affected entities:

  2. The second section provides a visual representation of how the incident developed over time. The chart shows the complete timeframe from start to end and all events, sorted by start time. The view is limited to seven events when collapsed. If the incident contains more than seven events, press the expand button to see the full view. Clicking any of the bars opens the detail view for that issue:

  3. The third section contains the details for the graph view in section 2. It lists all events, sorted by start time. To see all the available information for an event, click the event to expand it:

The details help in understanding the event and are followed by multiple charts in which the corresponding metrics are plotted. If an event is still active, the chart continues to render new incoming metric values. Two flags are available. One flag indicates that an event affects a service, and the other flag indicates that an event triggered the incident. If available, the flags are placed on top of each event in the list.

When you focus on an event, the detail section provides the same information that is described for the event list in point 3 of the incident detail view.
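The header counts and the collapsed timeline of the incident detail view can be approximated in a few lines of Python. This is a sketch under stated assumptions, not Instana's implementation: the Event record and its field names are hypothetical, the incident is assumed to start with its earliest event, and an event with no end time is treated as still active.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

COLLAPSED_LIMIT = 7  # the collapsed chart shows at most seven events


@dataclass
class Event:
    """Hypothetical event record used only for this sketch."""
    title: str
    kind: str    # "issue" or "change"
    entity: str  # name of the affected entity
    start: datetime
    end: Optional[datetime] = None  # None while the event is still active


def header_summary(events: List[Event]) -> dict:
    """Compute the key facts that the header of the detail view shows."""
    ends = [e.end for e in events]
    return {
        "start": min(e.start for e in events),
        # The incident is ongoing as long as any event has no end time.
        "end": None if any(x is None for x in ends) else max(ends),
        "active_events": sum(1 for e in events if e.end is None),
        "changes": sum(1 for e in events if e.kind == "change"),
        "affected_entities": len({e.entity for e in events}),
    }


def timeline(events: List[Event], expanded: bool = False) -> List[Event]:
    """Return events sorted by start time, cut to seven when collapsed."""
    ordered = sorted(events, key=lambda e: e.start)
    return ordered if expanded else ordered[:COLLAPSED_LIMIT]
```

Calling timeline with expanded=True corresponds to pressing the expand button: the same ordering is kept, but the seven-event limit is lifted.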

Automatic probable root cause (public preview)

To reduce the mean time to remediation (MTTR) for DevOps practitioners, Instana automates the identification of the probable root cause with an algorithm that dynamically analyzes trace statistics and topology by using Causal AI. This algorithm identifies the probable root cause entity of a failure, which enables DevOps practitioners to quickly determine the probable source of an application's failure.

You can access the Probable Root Cause section on any incident that is created from a Smart Alert on the following entity types:

  • Application perspectives
  • Services
  • Endpoints
  • Service Level Objectives on application perspectives

The Probable Root Cause section has the following primary sections:

  1. The Probable Root Cause entity and related infrastructure or application information
  2. The evidence that helps DevOps practitioners understand why a specific entity is identified as a Probable Root Cause
  3. The events that are associated with the Probable Root Cause entity, and the probability level that indicates the likelihood of failure

The Causal AI algorithm identifies one or more entities that are likely to be the source of the problem. The entity can be any physical or logical entity that is monitored by Instana and is displayed as the Probable Root Cause entity. The displayed entity links to the entity page, which describes the state of the entity at the time of the incident. The associated events are all recent events that occurred on the Probable Root Cause entity. With detailed associated events, DevOps practitioners can quickly identify issues, incidents, or change events that caused the problem. A link to the Unbounded Analytics page is prefiltered to the identified Probable Root Cause entity, which helps DevOps practitioners quickly and easily identify traces and errors that are related to the Probable Root Cause entity.

In the Probable Root Cause section panel, Instana displays up to three entities that most likely failed. To see other entities that are possible candidates for Probable Root Cause, click the Most likely cause and Potential cause tabs. These entities are sorted by the likelihood of failure, so the most likely root cause is the first one shown. The probability levels are shown in the upper right, under the title Probability level, and can be labeled as low, medium, or high, where high indicates a high probability of failure. A tooltip that further explains the meaning of each label is displayed when you hover over the Probability level.
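The ordering that is described above (candidates sorted by likelihood of failure, with up to three shown) can be sketched in Python as follows. The function name, the tuple layout, and the numeric ranking of the labels are assumptions for illustration only; the sketch does not reproduce the Causal AI algorithm, which produces the probability levels in the first place.

```python
# Assumed ranking of the three probability labels for this sketch.
LEVEL_RANK = {"high": 3, "medium": 2, "low": 1}


def top_candidates(candidates, limit=3):
    """Sort (entity, probability_level) pairs, most likely first.

    Python's sort is stable, so candidates with the same level keep
    their original relative order. Only the first `limit` are returned.
    """
    ranked = sorted(candidates, key=lambda c: LEVEL_RANK[c[1]], reverse=True)
    return ranked[:limit]
```

For example, given four candidates of which two are labeled high, the two high-probability entities come first, followed by the next most likely one.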

Events view

To see all events that are detected by Instana, go to the "Events" view and choose the "Incidents", "Issues", "Changes", or "All" tab to see the corresponding event types. Searching through events that are discovered by Instana relies on the Dynamic Focus feature. When you select one or more bars in the Events bar chart at the top, the events table lists only the events that are included in the selected bars. By selecting bars in the Events bar chart, you can inspect events in detail without changing the current time interval.

In addition, you can use the search box to find specific items by the data shown in the columns “Title” or “On” (the name of the service on which the incident occurred) in the Overview table. In this example, the search query is event.text:"Error rate". The result is a list of all events that contain the phrase "Error rate" in the title:
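The effect of a query such as event.text:"Error rate" can be illustrated with a small Python filter over a list of events. This sketch only mimics the outcome that is described above. It is not the Dynamic Focus query engine, the dictionary layout is hypothetical, and the case-insensitive matching is an assumption.

```python
def filter_by_title(events, phrase):
    """Keep only the events whose title contains the given phrase.

    `events` is a list of dictionaries with a "title" key (hypothetical
    layout). Matching ignores case, which is an assumption of this sketch.
    """
    needle = phrase.lower()
    return [e for e in events if needle in e["title"].lower()]
```

Applied to a list that contains events titled "Error rate is too high" and "Sudden increase in average latency", the filter for "Error rate" keeps only the first event.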