Workflow of natural language log anomaly detection

Find out how natural language log anomaly detection processes your log data to generate log alerts.

The workflow of natural language log anomaly detection is made up of two parts: model generation (training) and application of the model to the live log data stream (inference).

Model generation

The following figure shows a simplified workflow illustrating how the natural language log anomaly detection algorithm generates an AI model from log data. This part of the workflow is known as training.

Workflow of the natural language log anomaly detection algorithm: model generation
Figure. Workflow: model generation

  1. Algorithm is configured and trained

    The algorithm configuration includes specification of log data sources, the period of data to train on, whether to run training on a schedule or on demand, and whether to deploy the trained model automatically or manually. Training is started, and the model training process begins, as described in steps 2 to 6.
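
    The exact configuration schema is product-specific. The following Python sketch, with hypothetical field names, only illustrates the kinds of settings that the configuration covers.

      # Hypothetical training configuration; all field names are illustrative only.
      training_config = {
          "data_sources": ["mezmo"],      # log data sources to import from
          "training_window_days": 14,     # period of data to train on
          "schedule": "0 0 * * 0",        # cron-style schedule, or None for on-demand training
          "auto_deploy": True,            # deploy the trained model automatically
      }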

  2. Training jobs are identified

    Assume that you have a microservices application that is made up of multiple components, such as orders, catalog, and front-end, each with its own set of logs. As it is likely that each of the components has different logging behavior, the log anomaly detection algorithm creates a separate model for each component. Each of these models is trained by a separate training job.

    The logging data for the different components is differentiated by using the instance-id parameter that is specified during log integration mapping. For an example of log integration mapping for a Mezmo log system, see Specifying field mapping.

    The following steps 3 to 6 explain the process for training a unique model for each component. The steps are repeated for each component identified.

    Note: For more information about limits and throughput rates, see Supported resource number and throughput rates.
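
    As a rough sketch of how per-component training jobs might be derived, the following Python snippet groups log records by an instance-id field. The record layout is an assumption for illustration; each resulting group can then feed a separate training job.

      from collections import defaultdict

      # Hypothetical log records; in practice these come from the configured log data sources.
      records = [
          {"instance-id": "orders", "message": "order 42 created"},
          {"instance-id": "catalog", "message": "catalog refreshed"},
          {"instance-id": "orders", "message": "order 43 created"},
      ]

      # Group the records by component so that each component gets its own training job.
      jobs = defaultdict(list)
      for record in records:
          jobs[record["instance-id"]].append(record)

      for component, logs in jobs.items():
          print(f"Training job for component '{component}': {len(logs)} log lines")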

  3. Data is imported

    Based on the algorithm configuration, data is imported from the specified log data sources for a given component.

  4. Data is divided into 10-second slots

    The imported data is divided into 10-second time slots. For example, if two weeks' worth of data was imported, then that data is divided into 120,960 time slots based on the following calculation:

    2 weeks = 14 days = 336 hours = 20,160 minutes = 1,209,600 seconds = 120,960 time slots
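
    To make the slot arithmetic concrete, the following Python sketch computes the number of slots in a two-week window and maps a timestamp to its slot index. The window start date is an assumption chosen for illustration.

      from datetime import datetime, timedelta, timezone

      SLOT_SECONDS = 10

      # Assumed start of the training window, chosen for illustration.
      start = datetime(2021, 3, 1, tzinfo=timezone.utc)
      window = timedelta(weeks=2)

      # 1,209,600 seconds / 10 seconds = 120,960 time slots.
      total_slots = int(window.total_seconds()) // SLOT_SECONDS
      print(total_slots)  # 120960

      def slot_index(ts: datetime) -> int:
          """Map a timestamp to its 10-second slot within the training window."""
          return int((ts - start).total_seconds()) // SLOT_SECONDS

      print(slot_index(datetime(2021, 3, 1, 0, 0, 25, tzinfo=timezone.utc)))  # slot 2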

  5. Log patterns are identified and counted

    Within each time slot, the system identifies and counts log patterns. Log patterns are identified in the following way:

    1. Each log message is separated into variant and invariant parts. For example, consider the following log messages:

      2021-03-12T03:05:31.608355778+00:00 stdout F
      {
        "name":"@instana collector",
        "__in":5,
        "hostname":"cart-846b9595c9qcqfk",
        "pid":35261,
        "module":"announceCycle/agentHostLookup",
        "level":2,
        "msg":"Agent cannot be contacted through system ABC nor through default gateway 123. Scheduling reattempt 0"
      }

      2021-03-12T03:05:33.608355778+00:00 stdout F
      {
        "name":"@instana collector",
        "__in":5,
        "hostname":"cart-846b9595c9qcqfk",
        "pid":89561,
        "module":"announceCycle/agentHostLookup",
        "level":2,
        "msg":"Agent cannot be contacted through system DEF nor through default gateway 456. Scheduling reattempt 1"
      }

      2021-03-12T03:05:35.321654987+00:00 stdout F
      {
        "name":"@instana collector",
        "__in":5,
        "hostname":"cart-846b9595c9qcqfk",
        "pid":75421,
        "module":"announceCycle/agentHostLookup",
        "level":3,
        "msg":"Agent cannot be contacted through system GHK nor through default gateway 789. Scheduling reattempt 2"
      }

      These log messages can be expressed by using the following log pattern, where the variant parts of the log messages are replaced with wildcards, such as <*> and <NUM>.

      <*> stdout F
      {
        "name":"@instana collector",
        "__in":<NUM>,
        "hostname":"cart-846b9595c9qcqfk",
        "pid":<NUM>,
        "module":"announceCycle/agentHostLookup",
        "level":<NUM>,
        "msg":"Agent cannot be contacted through <*> nor through default gateway <*>. Scheduling reattempt <NUM>"
      }
      
    2. Log patterns are counted in each 10-second slot, and the mean and standard deviation of the counts are stored for each log pattern, as shown in the sketch that follows.
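
    The following Python sketch illustrates both steps. The masking rules are simplified stand-ins for the algorithm's actual pattern extraction: numbers are replaced with <NUM> and one free-text token with a wildcard, after which per-slot counts and their mean and standard deviation are computed.

      import re
      import statistics
      from collections import Counter, defaultdict

      def to_pattern(message: str) -> str:
          """Mask variant tokens so that messages with the same shape share one pattern."""
          message = re.sub(r"\d+", "<NUM>", message)              # numbers become <NUM>
          message = re.sub(r"system \w+", "system <*>", message)  # illustrative wildcard rule
          return message

      # (slot index, raw message) pairs; in practice these come from the slots of step 4.
      events = [
          (0, "Agent cannot be contacted through system ABC. Scheduling reattempt 0"),
          (0, "Agent cannot be contacted through system DEF. Scheduling reattempt 1"),
          (1, "Agent cannot be contacted through system GHK. Scheduling reattempt 2"),
      ]

      # Count each pattern in each 10-second slot.
      counts = defaultdict(Counter)  # slot -> (pattern -> count)
      for slot, msg in events:
          counts[slot][to_pattern(msg)] += 1

      # Mean and standard deviation of the per-slot counts for each pattern.
      slots = range(max(counts) + 1)
      patterns = {p for c in counts.values() for p in c}
      for pattern in patterns:
          series = [counts[s][pattern] for s in slots]
          print(f"{pattern!r}: mean={statistics.mean(series):.1f} "
                f"stdev={statistics.pstdev(series):.1f}")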

  6. Model is created

    On completion of training, a model is created that includes log pattern statistics.

    Note: Each time that the algorithm is run, steps 2 to 5 are repeated from scratch and a new version of the model is generated.

Application of the model to the live log data stream

The following figure shows a simplified workflow illustrating how the natural language log anomaly detection algorithm applies the AI model to the live log data stream. This part of the workflow is known as inference, and it involves applying the findings from the model.

Workflow of the natural language log anomaly detection algorithm: model application
Figure. Workflow: applying the model to the live log data stream

  1. Model is deployed

    Depending on how the algorithm was configured, a new version of the model is deployed either automatically on completion of training or manually.

  2. Live stream is reviewed for anomalies

    Now that you have a deployed model, the system starts monitoring the live log stream for anomalies, which are statistically significant deviations from the baseline. It evaluates each component’s logs every 10 seconds for anomalies.
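
    A minimal sketch of this kind of check, assuming the model stores the mean and standard deviation of per-slot counts for each pattern, is shown below. The z-score threshold is illustrative and is not the product's actual statistical test.

      # Hypothetical model statistics from training: pattern -> (mean, stdev) of per-slot counts.
      model = {
          "Agent cannot be contacted through system <*>. Scheduling reattempt <NUM>": (1.5, 0.5),
      }

      def is_anomalous(pattern: str, observed_count: int, z_threshold: float = 3.0) -> bool:
          """Flag a pattern whose count deviates significantly from the trained baseline."""
          if pattern not in model:
              return True  # pattern was never seen during training
          mean, stdev = model[pattern]
          if stdev == 0:
              return observed_count != mean
          return abs(observed_count - mean) / stdev > z_threshold

      # Evaluate a count observed in the latest 10-second slot.
      print(is_anomalous(
          "Agent cannot be contacted through system <*>. Scheduling reattempt <NUM>", 10))  # True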

  3. Log alerts are generated

    Log anomalies are generated at the end of each 10-second interval when there is a statistically significant difference between the log pattern counts in the live log stream and the reference values in the model, when log patterns that do not exist in the model are identified in the live log stream, or both.

    Based on this analysis, the log anomaly detection algorithm identifies one or more of the following log anomaly types:

    1. Pattern is expected but is not detected
    2. Pattern is detected but is not expected
    3. Pattern is detected but is not expected, and includes an error
    4. Pattern is detected with lower frequency than expected
    5. Pattern is detected with higher frequency than expected

    The algorithm creates a log alert for each log anomaly. The severity of the log alert is calculated based on the following considerations, as shown in the sketch after this list:

    • All alerts that are generated by this algorithm are given an initial severity of 4 (Minor).
    • If the pattern is detected but not expected (log anomaly type 2), or if it is an error pattern, then the severity is increased to 5 (Major).
    • If the pattern is detected with a lower frequency than expected (log anomaly type 4), then the severity is increased to 6 (Critical).
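
    The following Python sketch applies these three rules in order; the anomaly type names are hypothetical.

      from enum import Enum

      class AnomalyType(Enum):
          EXPECTED_NOT_DETECTED = 1
          DETECTED_NOT_EXPECTED = 2
          DETECTED_NOT_EXPECTED_ERROR = 3
          LOWER_FREQUENCY = 4
          HIGHER_FREQUENCY = 5

      def alert_severity(anomaly: AnomalyType, is_error_pattern: bool = False) -> int:
          """Severity 4 (Minor) by default, 5 (Major) for unexpected or error
          patterns, 6 (Critical) for lower-than-expected frequency."""
          severity = 4  # all alerts start as Minor
          if anomaly in (AnomalyType.DETECTED_NOT_EXPECTED,
                         AnomalyType.DETECTED_NOT_EXPECTED_ERROR) or is_error_pattern:
              severity = 5
          if anomaly is AnomalyType.LOWER_FREQUENCY:
              severity = 6
          return severity

      print(alert_severity(AnomalyType.DETECTED_NOT_EXPECTED))  # 5
      print(alert_severity(AnomalyType.LOWER_FREQUENCY))        # 6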