About log anomaly detection - golden signals

Detect log anomalies with golden signals.

Log anomaly detection - golden signals, introduced in Version 4.2.0, is an algorithm that improves the signal-to-noise ratio of anomaly detection compared with previous log anomaly detection algorithms. It uses automatic classification of log templates to pre-filter logs of interest and provides administrative controls to further tune the anomalies that are raised as alerts. It also improves the explainability of anomaly alerts by charting anomaly periods, baselines, and forecasts, and by providing access to the raw logs.

How do golden signals help with log anomaly detection?

Log anomaly detection - golden signals works with metric anomaly detection to discover important log message patterns and detect when they change significantly. The algorithm takes in raw log messages and converts the textual information into log patterns that separate variable content (both known and unknown variables) from static content. These log patterns are called templates. Domain knowledge is used to assign a golden signal type to each template.

After this templatization, log data is filtered based on the golden signal type and converted to metric data. This converted metric data is passed through the metric anomaly detection algorithm to train models and infer anomalies.
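
To make the flow concrete, the following Python sketch shows the general idea of templatization and metric conversion. It is illustrative only: the regular expressions, placeholders, and counting scheme are assumptions for this example, not the product's internal implementation.

    import re
    from collections import Counter

    # Illustrative templatization: variable tokens such as IP addresses,
    # hex IDs, and numbers are masked so that structurally identical
    # messages collapse into one template (static content only).
    VARIABLE_PATTERNS = [
        (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
        (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
        (re.compile(r"\b\d+\b"), "<NUM>"),
    ]

    def templatize(message: str) -> str:
        """Reduce a raw log message to its static-content template."""
        for pattern, placeholder in VARIABLE_PATTERNS:
            message = pattern.sub(placeholder, message)
        return message

    logs = [
        "Request 4711 from 10.0.0.12 failed after 500 ms",
        "Request 4712 from 10.0.0.77 failed after 620 ms",
        "Connection pool exhausted: 0 of 50 connections free",
    ]

    # Counting matches per template over a time interval yields the metric
    # series that metric anomaly detection trains on and infers against.
    series = Counter(templatize(m) for m in logs)
    for template, count in series.items():
        print(count, template)

In this sketch, the first two messages collapse into one template, so their count becomes a single metric series that can be tracked over time.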

IT operations teams can use the alerts that are generated to identify the log patterns that deviated from normal behavior and the time period in which the abnormalities were observed. Teams can use this information for root cause analysis to help resolve incidents.

Note: Log anomaly detection - golden signals cannot be active at the same time as Log anomaly detection - natural language or Log anomaly detection - statistical baseline. Enable only one of these log anomaly detection approaches at a time.

What is a golden signal?

Golden signals help IT operations teams to classify and organize events when they diagnose incidents.

  • Seven types of golden signals exist within IBM Cloud Pak® for AIOps.
  • Golden signal types can fall into the following classes:
    • Causes include latency, error, and availability.
    • Effects include exception, traffic, and saturation.
    • The information type is not assigned to any class and is displayed with none.
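
In code terms, the type-to-class assignment is a simple lookup. The following Python mapping is a hypothetical restatement of the grouping described above, not an API from the product:

    # Hypothetical mapping of the seven golden signal types to their
    # classes, following the cause/effect grouping described above.
    GOLDEN_SIGNAL_CLASS = {
        "latency": "cause",
        "error": "cause",
        "availability": "cause",
        "exception": "effect",
        "traffic": "effect",
        "saturation": "effect",
        "information": None,  # not assigned to any class; shown as none
    }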

Golden signal types

  • Latency: These events include a duration and are not related to resources.

    • Examples include transaction load times, transfer times, or any other metric where the unit of measurement is time.
    • Example events include disk read/write times and I/O wait times.
  • Error: These events are associated with a timeout, drop, failure, reset, retry, retransmission of requests or packets, or an error notification.

    • An error might also be understood as a limiting condition of latency. Examples include HTTP 5xx codes, failed requests, crash-affected user rates or counts, error counts or rates for user actions or requests, and event failures.
  • Availability: These events are indicative of the availability or readiness of the resource.

    • Examples include HTTP code 4xx, host unhealthy, resource availability, readiness, and health status.
  • Exception: These events consist of stack traces or other indications of errors in the log data.

    • Examples include DSRA0010E : SQL State = 08S01, Error Code = 18.456: (FailedLoginException) Login error: com.ibm.security.krb5.KrbException, and status code: 6 message: Client not found in Kerberos database.
  • Traffic: These events represent the volume or rate of transactions that flow through the system.

    • Examples include HTTP code 2xx, traffic in/out, bytes received/sent/outstanding, disk read/write operations, throughput, service idle time, and HTTP 2xx/4xx/5xx response counts.
    • Events that represent idle time might also be classified as traffic because service idle time can indicate low traffic.
  • Saturation: These events can be measured as counts or percentages and relate to resources that can be exhausted or throttled, such as CPU, GPU, memory, swap, kernel threads, cache, containers, links, connections, processes, queues, disks, and sessions.

    • At the infrastructure level, such resources might be as granular as CPU, GPU, memory, and disk space or as broad as a Kubernetes cluster, pod, and VM.
    • At the middleware level, such resources might include connection pool and thread pool.
    • At the application level, such resources might be user sessions.
  • Information: Events that do not fall into any of the previous golden signal types are assigned as information logs, for example, <<UNKNOWN TYPE>> on Success type = 0 data=true.
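
The following Python sketch shows how domain knowledge might assign one of the preceding golden signal types to a template. The keyword lists are illustrative assumptions drawn from the examples above; the product's actual domain-knowledge rules are more sophisticated:

    # Illustrative keyword rules only; not the product's actual
    # domain-knowledge rules for golden signal assignment.
    KEYWORDS = {
        "latency": ["wait time", "duration", "load time", " ms"],
        "error": ["timeout", "failed", "retry", "5xx", "reset"],
        "availability": ["unhealthy", "4xx", "readiness", "unavailable"],
        "exception": ["exception", "stack trace", "sqlstate"],
        "traffic": ["bytes received", "throughput", "2xx", "idle time"],
        "saturation": ["exhausted", "throttled", "queue full", "cpu"],
    }

    def classify(template: str) -> str:
        """Assign a golden signal type; default to information."""
        text = template.lower()
        for signal_type, words in KEYWORDS.items():
            if any(word in text for word in words):
                return signal_type
        return "information"

    print(classify("Connection pool exhausted: <NUM> of <NUM> connections free"))
    # -> saturation

Templates that match none of the rules fall through to information, which mirrors how the information type acts as the catch-all in the list above.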

Enabling or disabling the log anomaly golden signals pipeline

To enable the pipeline, complete the following steps:

  1. Before you enable the log anomaly golden signals pipeline, disable the Log anomaly detection - natural language and the Log anomaly detection - statistical baseline algorithms from the AI Model Management UI.
    Figure. AI Model Management UI page
  2. Enable the metric anomaly detection training schedule from the AI Model Management UI.
  3. Click Set up training and complete the steps in Starting the training setup to enable the pipeline.
  4. Verify that the Log anomaly detection - golden signals policy is enabled so the log anomaly golden signals pipeline can run as expected. The policy is enabled by default. After the pipeline runs, it produces an alert.

To disable the pipeline, complete the following steps:

  1. Click the Log anomaly detection - golden signals tile within the AI Model Management page and delete the training definition.
    Figure. Log anomaly detection - golden signals UI page
  2. Enable the Log anomaly detection - natural language or the Log anomaly detection - statistical baseline algorithm from the AI Model Management UI. The pipeline is then configured not to use golden signals, and all incoming log lines are instead sent to the log anomaly detection pipeline.

Setting up a training definition for log anomaly detection - golden signals

Prerequisites

  1. On the IBM Cloud Pak for AIOps home page, click the navigation icon to go to the main navigation menu.
  2. In the main navigation menu, click Operate > AI Model Management to open AI Model Management.
  3. Disable the Log anomaly detection - natural language and Log anomaly detection - statistical baseline algorithms.
  4. Create a training definition for Metric anomaly detection with a training schedule.
  5. Make sure that all the log data connections that are created in the Data and tool connections section are in Live data for initial AI training mode. Whether the log data connection is set up before or after you enable log anomaly golden signals, it must be in the Live data for initial AI training mode.

Starting the training setup

  1. Within the Log anomaly detection - golden signals tile, click Set up training.

    Note: If this AI algorithm is already set up for training, Set up training is not available. For more information about changing the algorithm training setup, see Editing the training setup for an algorithm.

  2. Click Next to open the Getting started panel. The panel explains this AI algorithm and how it helps in your production environment. The panel also provides a list of data and tool connections that are available to generate a model. At least one connection must exist in this list for the model to collect data and start training.

  3. If no connections are listed or if you expected to see more connections, click Data and tool connections to modify your connections.

  4. Add at least one connection by using the Data and tool connections area of IBM Cloud Pak for AIOps. Then, alerts are generated when the following conditions are true:

    • At least 100,000 log messages are processed for template quality.
    • The logs span at least 3.5 days, and metric anomaly detection training is run on that data. (These readiness conditions are illustrated in the sketch after these steps.)

    Note: The log connection must be enabled only in Live data for initial AI training mode as shown in the following LogDNA example:

    Figure. Live data for initial AI training

  5. Click Next.

  6. In the Training setup panel, click Done. The training begins.
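
The alert-generation conditions from step 4 can be summarized as a simple readiness check. The following Python sketch restates them; the function and its inputs are hypothetical, for illustration only:

    from datetime import datetime, timedelta

    # The 100,000-message and 3.5-day thresholds come from this
    # documentation; the readiness check itself is illustrative.
    MIN_MESSAGES = 100_000
    MIN_SPAN = timedelta(days=3.5)

    def ready_for_alerts(message_count: int,
                         first_seen: datetime,
                         last_seen: datetime) -> bool:
        """True when enough logs spanning enough time were processed."""
        return (message_count >= MIN_MESSAGES
                and last_seen - first_seen >= MIN_SPAN)

    # 4 days of data and 120,000 messages satisfy both conditions.
    print(ready_for_alerts(120_000, datetime(2024, 1, 1), datetime(2024, 1, 5)))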

Handling large log message loads for template training

Starting with IBM Cloud Pak® for AIOps version 4.2.1, log message sampling is enabled by default. The training initially runs every 10 minutes until at least 100,000 log messages are processed. Then, the training runs once every hour. Regardless of the log message load, up to 450,000 logs are sampled by default for every training run.

If the log message load contains more than 10 million logs each hour, tune some parameters to override the defaults with one of the following options.
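
The following Python sketch shows how these defaults and overrides relate to each other. The constants reflect the values documented here; how the product's scheduler actually combines JOB_FREQUENCY_TIME and the sample-size variables internally is an assumption in this sketch:

    import os

    INITIAL_FREQUENCY_MIN = 10     # runs every 10 minutes at first
    STEADY_FREQUENCY_MIN = 60      # then once every hour
    BOOTSTRAP_THRESHOLD = 100_000  # messages before slowing down
    DEFAULT_SAMPLE_CAP = 450_000   # logs sampled per training run

    def next_run_minutes(processed_messages: int) -> int:
        """Minutes until the next run; JOB_FREQUENCY_TIME overrides."""
        override = os.environ.get("JOB_FREQUENCY_TIME")
        if override is not None:
            return int(override)
        if processed_messages < BOOTSTRAP_THRESHOLD:
            return INITIAL_FREQUENCY_MIN
        return STEADY_FREQUENCY_MIN

    def sample_size(available_logs: int) -> int:
        """Logs sampled in one run, bounded by the configured maximum."""
        cap = int(os.environ.get("MAXIMUM_SAMPLE_SIZE_PER_TRAINING_RUN",
                                 DEFAULT_SAMPLE_CAP))
        return min(available_logs, cap)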

Option 1

Increase the pod resources and the MINIMUM_SAMPLE_SIZE_PER_TRAINING_RUN and MAXIMUM_SAMPLE_SIZE_PER_TRAINING_RUN values.

  1. After you log in to the terminal with the oc login command, edit the deployment file by entering the following command:

    oc edit deployment aimanager-aio-log-anomaly-golden-signals
    
  2. Modify the values in the container resources section of the deployment to the following values:

    containers:
      - resources:
          limits:
            cpu: '3'
            memory: 6Gi
          requests:
            cpu: '2'
            memory: 4Gi
    
  3. In the env section, add or modify the following values:

    - name: MINIMUM_SAMPLE_SIZE_PER_TRAINING_RUN
      value: '800000'
    - name: MAXIMUM_SAMPLE_SIZE_PER_TRAINING_RUN
      value: '900000'
    

Option 2

Alternatively, run the training more frequently even after the initial 100,000 log messages are processed by modifying the JOB_FREQUENCY_TIME value.

  1. After you log in to the terminal with the oc login command, edit the deployment file by entering the following command:

    oc edit deployment aimanager-aio-log-anomaly-golden-signals
    
  2. In the env section, add or modify the following value to set the training frequency in minutes:

    - name: JOB_FREQUENCY_TIME
      value: '10'
    

Viewing templates

The training setup is now complete. Template training runs continuously in the background.

After the data connection collects the threshold number of log messages, the algorithm triggers template training on the data that is collected. The algorithm starts to process the log messages to categorize them into the required templates.

When template training completes, you can see a summary count of all the templates that were generated, as shown in the following example:

Figure. Summary count of all templates

Any template with a golden signal type of Information is categorized as disabled, and all templates with any other golden signal type are categorized as enabled.

You can view more details about these templates by clicking the Templates tab.

Figure. View templates

The templates are displayed in a tabular format with the following details:

  • The Template name section shows the log pattern that matches the input log messages. Click a Template name to view template Details. From Details, edit the Template name or select a different Golden signal type. Click Save to save your changes, or click Cancel to close the template without saving the changes.
  • The ID numbers indicate the order in which the templates were created.
  • The Alerts last 24 hours and Alerts last 2 weeks sections display the alert count if an alert was generated for that particular template in the last 24 hours or 2 weeks. If any log message patterns or frequencies change unexpectedly, an alert is generated for the template.
  • The Log messages last 24 hours and Log messages last 2 weeks sections show the count of any log messages that were collected in the last 24 hours or 2 weeks that match a specific template pattern.
  • The Golden signal type section shows the golden signal type that is attached to a template. Open the menu for the template that you want to modify. Click Set golden signal to select a golden signal type, or click Set state to set the state.
  • The State section shows enabled or disabled based on the golden signal type for the specific template. Enabled templates are used to track log message patterns over extended periods of time. This data is then sent to metric anomaly detection. Alerts are generated when log message patterns or their frequencies change unexpectedly.

If the counts against template patterns from the log anomaly detection - golden signals algorithm are not updated in the training UI table, enable historic alert storing in Elasticsearch to access alert counts. For more information, see Counts against template patterns are not updated in the training UI.

Viewing unmatched templates

Log messages that do not share a common pattern are grouped based on their golden signal type and matched to a template pattern with an unmatched_<golden_signal_type> format, as shown in the following example:

Figure. Unmatched templates

After the number of unmatched log messages reaches the threshold count, an incremental training run is triggered. The incremental training might generate new log template patterns that match log lines similar to the current unmatched logs, and add those logs to a template.
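
A minimal Python sketch of this bucketing behavior follows. The unmatched_<golden_signal_type> naming comes from this documentation; the threshold value and the reporting mechanism are illustrative assumptions:

    from collections import defaultdict

    UNMATCHED_THRESHOLD = 1_000  # illustrative value only

    # Unmatched logs are bucketed under unmatched_<golden_signal_type>.
    buckets = defaultdict(list)

    def record_unmatched(message: str, golden_signal_type: str) -> None:
        """Bucket an unmatched log; report when incremental training is due."""
        name = f"unmatched_{golden_signal_type}"
        buckets[name].append(message)
        if len(buckets[name]) == UNMATCHED_THRESHOLD:
            print(f"{name}: threshold reached, incremental training due")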

Click a Template name of an unmatched template to view template Details.

Viewing log messages

View up to 10,000 log messages that match a template pattern.

  1. Click a Template name. Then, click the Log messages tab.
    • Received shows timestamps for when the messages were received.
    • Log message shows the log message content that matches the template pattern.
  2. Click Export to CSV file to generate a CSV file if you want to analyze the data further.
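
After you export the messages, you can analyze them offline with standard tooling. The following Python sketch assumes the exported file is named template_log_messages.csv and that the column headers match the Received and Log message columns shown in the tab; both are assumptions about the export format:

    import csv

    # Read the exported file; the file name and headers are assumptions.
    with open("template_log_messages.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    print(f"{len(rows)} log messages exported")
    for row in rows[:5]:  # preview the first few entries
        print(row["Received"], "-", row["Log message"])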

Viewing metric anomalies in the Alert Viewer

For more information about viewing Log anomaly detection - golden signals metric anomalies in the Alert Viewer, see Viewing metric anomaly details.