Service level objectives (public preview)

Welcome to Service Level Objectives (public preview). The navigation sidebar contains a new item that gives SLOs a dedicated space to provide both a better experience and more meaningful data to help you monitor objectives for your applications and websites.

You can use an overview dashboard to view all of your SLOs and their status, and a detailed page to view the latency or availability, error budget, and traffic across time windows.

  • Do you want to monitor the availability of your applications?
  • Do you want to monitor the latency of your websites?
  • Do you want to customize which calls are good or bad and track them?

Currently, Service Level Objectives are not supported in Docker.

Permissions

During public preview, no permission is required to access the new SLO dashboard to view and create SLOs. However, you are limited by the permissions in place for any entities (applications or websites) associated with your SLO. If you have permission to view an application or website, you can create an SLO for it.

Dashboard Overview

In the Instana UI, hover over the navigation sidebar and select Service Levels. The Service Levels page shows an overview of all of your SLOs and their status, listed in alphanumeric order. You can sort the table by name, filter it by Tag or Entity Type, or search by SLO name.

The table displays the following information:

Name: The name of the SLO. The color of the border on the left signifies the current status of the Service Level Indicator.

Entity: The name of the application or website that is associated with this SLO.

If the associated entity (application or website) is deleted or is no longer visible to you, this column displays “Unknown Entity”. The SLO definition remains, but the SLO data for deleted entities is no longer captured.

Blueprint: The SLI and type that is being measured.

  • Latency: measures whether the service responds faster than a set threshold. For example, whether your checkout-service responds in less than 100 ms.
  • Availability: measures how often a service responds successfully, for example, by tracking how often your HTTP calls are erroneous.

  • Custom: measures the count of good or bad events by using user-defined filters.

Error Budget Remaining: indicates how much of your error budget is still available, represented in minutes, calls, or beacons, depending on the selected entity type and SLI type. Details on the time window configuration of the SLO are also displayed.

Status/Target: displays the current status of the SLO as a percentage, along with the desired target. If the status meets the target, it is colored green until your error budget is burned; otherwise, it is colored red.

Tags: displays the tags that are associated with the SLO, which can be used to filter the view for related SLOs.

Actions: contains edit, copy, and delete buttons.

Pagination is available at the bottom of the table to view more SLOs.

SLO Details

Select an SLO from the data table on the dashboard to see its details. This page shows a summary of the SLO status, error budget, and traffic. In addition, detailed charts show the SLI, error budget, and traffic over time. You can toggle between viewing the data by the configured SLO time window or by the time range that is selected in the upper-right corner of the page.

This page displays the following information:

SLO Name

Analyze Calls and Analyze Bad Calls: these buttons take you to the Analyze page to better understand the calls that are made to the services selected within the SLO.

Summary tab: shows how this SLO is performing over time.

Entity: The name of the application or website that is associated with this SLO.

If the associated entity (application or website) has been deleted or is no longer visible to you, this column displays “Unknown Entity”. The SLO definition remains, but the SLO data for deleted entities is no longer captured.

Tags: displays the tags that are associated with the SLO, which can be used to filter the view for related SLOs.

Time content switcher: this toggles between the time window from the SLO configuration and the time window that is selected.

Configured SLO time window: the time window that is being measured for the SLO and determines the error budget. This was set when the SLO was created.

Matching SLO time window: shows the configured time windows that the selected time range spans.

Status/Target: displays the current status of the SLO as a percentage, along with the desired target.

Error Budget Remaining: indicates how much of your error budget is still available, represented in minutes, calls, or beacons, depending on the selected entity type and SLI type. Details on the time window configuration of the SLO are also displayed.

Traffic: the number of calls that your application or website services are receiving.

The following charts show the trend over the time frames selected.

Configuration tab: shows how this SLO was configured during its creation. There are options to edit, copy, or delete the SLO.

SLO summary and configuration view

Creating an SLO

To display the Service Levels main page, select Service Levels in the navigation sidebar. The Service Levels page opens with a list of the created Service Level Objectives in alphanumeric order. You can sort the table by using the up arrow in the Name column; sorting the Name column orders the table in reverse order. In addition to sorting, you can filter the table by using the Tag and Entity Type filters, or search for an SLO by name by using the search box.

To create an SLO, click + Add at the bottom right of the page. This opens the Create Service Level Objective (SLO) modal window. You can create an SLO for either an application or a website by using the following steps:

Creating an SLO for applications

Select Application

Select the application whose performance you want to monitor and measure.

  1. Use the content picker to choose an application.
  2. Select an application from the list, or search for your application.

Set Scope

Select the scope within the application to measure.

  1. Select the boundary, either Inbound Calls or All Calls.
     a. Inbound calls: calls that are initiated from outside the application and whose destination service is part of the selected application perspective.
     b. All calls: both inbound calls from outside the application and calls that occur within the application perspective.

  2. Optionally, select Hidden Calls to include internal or synthetic calls. By default, both are excluded.
     a. Include Internal Calls: a particular type of call that represents work that is done inside a service. It can be created from intermediate spans that are sent through custom tracing.
     b. Include Synthetic Calls: calls with a synthetic endpoint as the destination, such as calls to health-check endpoints.

  3. Select a specific Service in your application or leave the default of All Services selected to include the entire Application Perspective.

  4. Select an Endpoint or, as with the service selection, leave the default of All Endpoints to apply to the entire service. The combined selections are summarized in the sketch after this list.
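A minimal sketch of what the scope selections above amount to is shown here as a plain Python dictionary. The field names and values are hypothetical and chosen only for illustration; they are not the Instana API or configuration format.

    # Hypothetical summary of the scope choices above; the keys and values are
    # illustrative only and do not correspond to the Instana API.
    slo_scope = {
        "boundary": "INBOUND_CALLS",       # or "ALL_CALLS"
        "include_internal_calls": False,   # hidden calls are excluded by default
        "include_synthetic_calls": False,  # hidden calls are excluded by default
        "service": "ALL_SERVICES",         # or a specific service name
        "endpoint": "ALL_ENDPOINTS",       # or a specific endpoint of that service
    }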

Set Indicator

Select the metric or SLI to measure for the application and service. A Service Level Indicator (SLI) can be measured based on either Latency or Availability. Both can be measured based on values aggregated over time intervals (time based) or on the total number of good calls versus bad calls (event based). When the indicator is violated, the minutes or calls in your error budget decrease for as long as the violation remains unresolved.

You can select a Blueprint to define the type of Indicator. There are three types of blueprints in Instana: Latency, Availability, and Custom. A short sketch of event-based counting follows the list.

  • Latency: measures whether the service responds faster than a set threshold. For example, whether your checkout service responds in less than 100 ms.
    • Time based: calls are aggregated per minute; if the threshold is set to 100 ms and the aggregated response time for a minute exceeds 100 ms, that minute counts as bad. You can choose how you want to aggregate the results.
    • Event based: if the threshold is set to 100 ms, every call whose response takes more than 100 ms is a bad call. These calls are counted to show good versus bad calls.

  • Availability: measures how often a service responds successfully. For example, it tracks how often your HTTP calls are erroneous.
    • Time based: you can choose how good versus bad calls are aggregated and set the acceptable percentage of bad calls. For example, if you set the error rate to 50% and more than 50% of the calls in a minute fail, that minute counts as bad.
    • Event based: if the service is available, the call is a good call; otherwise, it is a bad call. These calls are counted to show good versus bad calls.

  • Custom: this allows you to define a good or bad call by using custom filters.
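As a rough illustration of how an event-based indicator counts good versus bad calls, the following sketch classifies a handful of calls against a latency threshold and an error flag. The call structure, sample values, and function names are assumptions for the example, not Instana's data model.

    # Minimal sketch of event-based SLI counting; the Call structure and the
    # sample values are assumptions for illustration, not Instana's data model.
    from dataclasses import dataclass

    @dataclass
    class Call:
        duration_ms: float
        has_error: bool

    def latency_sli(calls, threshold_ms=100.0):
        # Event-based latency: a call is good if it responds within the threshold.
        good = sum(1 for c in calls if c.duration_ms <= threshold_ms)
        return good / len(calls)

    def availability_sli(calls):
        # Event-based availability: a call is good if it did not return an error.
        good = sum(1 for c in calls if not c.has_error)
        return good / len(calls)

    calls = [Call(80, False), Call(120, False), Call(95, True), Call(60, False)]
    print(f"Latency SLI: {latency_sli(calls):.0%}")            # 75% of calls under 100 ms
    print(f"Availability SLI: {availability_sli(calls):.0%}")  # 75% of calls without errors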

Set the objective

Select the target for the SLO. This is the target percentage that your application should be meeting when it is performing correctly. Note: the maximum precision that is currently supported is four nines (99.99%).

Select Time Window Type:

  • Fixed: a time window with a defined start and duration. For example, you can configure a fixed one-month window that starts on 2020-01-01. The time window is automatically reset to the next month (2020-02-01) when the month is completed.

  • Rolling: a time window with a fixed size, where the end is defined by the end date and time that is selected in the global time picker. For example, a rolling one-week window always shows the last week. Both window types are sketched below.
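The difference between the two window types can be sketched with plain date arithmetic. This is only an illustration; the helper names and dates are examples, not part of Instana.

    # Illustrative sketch of the two time window types; helper names and dates
    # are examples only.
    from datetime import datetime, timedelta

    def rolling_window(end, size=timedelta(days=7)):
        # Rolling: fixed window size, the end is defined by the selected end time.
        return end - size, end

    def fixed_monthly_window(now):
        # Fixed: defined start and duration; resets when the month completes.
        start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        next_start = (start + timedelta(days=32)).replace(day=1)
        return start, next_start

    now = datetime(2020, 1, 15, 12, 0)
    print(rolling_window(now))        # the last 7 days, ending at the selected time
    print(fixed_monthly_window(now))  # (2020-01-01 00:00, 2020-02-01 00:00)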

Set Details

  1. Give the SLO a name to identify it.
  2. Optionally, set a tag to categorize or sort SLOs, or to apply a label for a custom use case.

Preview

Verify your widget in the preview. If no preview is displayed, click Highlight missing configuration to see what is missing. The preview data is based on the previous seven days.

To create the SLO configuration, click Save. The created SLO then appears on the Service Levels page.

Creating an SLO for websites

Select Entity

Select the entity whose performance you want to monitor and measure.

  1. Use the content picker to choose a website.
  2. Select a website from the list or search for your website.

Select Scope

Select the scope within the website to measure.

  1. Select the beacon to monitor: HTTP requests, Website page load, or Website custom events.
  2. Use the custom events filter if needed.

Select Indicator

Select the metric or SLI to measure for the website. A Service Level Indicator (SLI) can be measured based on either Latency or Availability. Both can be measured based on values aggregated over time intervals (time based) or on the total number of good calls versus bad calls (event based). When the indicator is violated, the minutes or calls in your error budget decrease for as long as the violation remains unresolved.

You can select a Blueprint to define the type of Indicator. There are three types of blueprints in Instana: Latency, Availability, and Custom.

  • Latency: measures whether the service responds faster than a set threshold. For example, whether your checkout service responds in less than 100 ms.
    • Time based: calls are aggregated per minute; if the threshold is set to 100 ms and the aggregated response time for a minute exceeds 100 ms, that minute counts as bad. You can choose how you want to aggregate the results.
    • Event based: if the threshold is set to 100 ms, every call whose response takes more than 100 ms is a bad call. These calls are counted to show good versus bad calls.

  • Availability: measures how often a service responds successfully. For example, it tracks how often your HTTP calls are erroneous.
    • Time based: you can choose how good versus bad calls are aggregated and set the acceptable percentage of bad calls. For example, if you set the error rate to 50% and more than 50% of the calls in a minute fail, that minute counts as bad.
    • Event based: if the service is available, the call is a good call; otherwise, it is a bad call. These calls are counted to show good versus bad calls.

  • Custom: allows you to define a good or bad call by using custom filters.

Set the objective

Select the target for the SLO. This is the target percentage that your website should be meeting when it is performing correctly.

The maximum precision that is currently supported is four nines (99.99%).

Select Time Window Type:

  • Fixed time interval: a time window with a defined start and duration. For example, you can configure a fixed one-month window that starts on 2020-01-01. The time window is automatically reset to the next month (2020-02-01) when the month is completed.

  • Rolling time window: a time window with a fixed size, where the end is defined by the end date and time that is selected in the global time picker. For example, a rolling one-week window always shows the last week.

Name and tags

  1. Give the SLO a name to identify it.
  2. Optionally, set a tag to group, categorize, or sort SLOs, or to apply a label for a custom use case.

Preview

Verify your widget in the preview. If no preview is displayed, click Highlight missing configuration to immediately see what is missing. The preview data is based on the prior seven days.

To create the SLO configuration, click Save. The created SLO then appears on the Service Levels page.

Example

Consider the following example, where the SLO requirement is to ensure that 90% of the calls to the “Robot-shop” application have a latency better than 100 ms over a fixed period of 1 week. The SLO configuration might look like this:

  • Entity: Robot-shop application (all services, endpoints, calls)
  • Indicator:
    • Blueprint: Latency
    • Type: time based (all calls over 1 minute are aggregated as 1 data point)
    • Aggregation: mean
    • Threshold: 100 ms

  • Objective:
    • Target: 90%
    • Time window type: Fixed
    • Time window length: 1 week

The error budget for this SLO would be calculated as:

  • Error budget = minutes in time period × (1 − SLO target percentage)
  • Minutes in time period: 24 × 60 × 7 = 10080 minutes in 1 week
  • SLO target percentage = 90% (0.90)
  • Error budget = 10080 × (1 − 0.9) = 1008 minutes
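As a quick check, the same arithmetic can be written as a short calculation. This is only a sketch; the variable names are illustrative and not part of any Instana API.

    # Error budget for the Robot-shop example: one-week window, 90% target.
    minutes_in_window = 24 * 60 * 7                 # 10080 minutes in 1 week
    slo_target = 0.90                               # 90% target
    error_budget = minutes_in_window * (1 - slo_target)
    print(f"{error_budget:.0f} minutes")            # 1008 minutes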