Troubleshooting AI model management

Learn how to isolate and resolve problems with AI model training.

Larger data sets are taking a long time to train

If your data sets are large, it can take a long time for AI training to complete. For example, when you run log anomaly training, a large log data set might take a full day to train.

By default, two training pods are assigned to handle training jobs. This number of pods is sufficient to complete training within acceptable time periods if your data sets are not too large. However, larger data sets can take a considerable amount of time to train when only two training pods are running in parallel.

Solution: To speed up training, you must increase the number of training pods that are dedicated to training. The default value is two pods, regardless of the type of training definition that you are running.

A limit applies to the number of training pods that can be dedicated to training. Before you increase the number of training pods, make sure that your cluster has the extra resources that each additional pod requires.

Each additional training pod requires approximately the following extra resources:

  • 2.5 CPU cores
  • 2 - 16 GB of memory, depending on the AI type that is trained.

Complete the following steps:

  1. Log in to your cluster.

  2. Go to the target IBM Cloud Pak for AIOps namespace.

  3. Find the training resource quota by running the following command:

    oc get resourcequota | grep training
    
  4. Edit the resource quota by running oc edit resourcequota <name of resource quota found in step 3>. Then, edit the spec.hard.pods field with the number of pods that you want to dedicate to training simultaneously. Ensure that you have the extra resources that are described earlier in this section. An example of the edited field follows these steps.

  5. Save the edit by typing :wq or :x. After a few minutes, the changes are automatically applied to your cluster, and you can start training.
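
For reference, after the edit in step 4, the spec.hard.pods field in the resource quota might resemble the following snippet. The value of 4 is illustrative only; choose a value that your cluster resources can support:

spec:
  hard:
    pods: "4"   # Illustrative value; size according to available cluster resources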

Log anomaly detection training is taking longer than expected

Log anomaly detection training is taking longer than expected. Currently, log training runs at a rate on the order of 860 log records per second, which is significantly slower than the 25,000 log records per second that is expected in enterprise environments. Log training time is proportional to the volume of log data that is passed through the training system.

This issue can occur because the training rate is constrained by the current system architecture.

Solution: This issue has no simple resolution. Because log training time is proportional to the volume of log data that is passed through the training system, reducing that volume can reduce training times.

Network issues with Elasticsearch

The IBM Cloud Pak for AIOps console is generating a large number of network errors.

This issue occurs when Elasticsearch is in an unhealthy state, most likely because of issues with Elasticsearch connectivity.

Solution: Try to get more information on potential Elasticsearch issues by examining the AI Platform pod logs. Check whether any error logs that are associated with Elasticsearch connectivity exist.

  1. Log in to your cluster by running the oc login command. You must be an administrator user within the project (namespace) where IBM Cloud Pak for AIOps is installed.

    oc login -u <username> -p <password>
    

    For more information, see Logging in to the OpenShift CLI.

  2. Display the logs and look for error logs associated with Elasticsearch connectivity.

    oc logs $(oc get pods | grep ai-platform-api-server| awk '{print $1}')
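
    To narrow the output to errors that mention Elasticsearch, you can filter the same logs. The following command is a minimal sketch; the search terms are suggestions only:

    # Filter the API server logs for likely Elasticsearch connectivity errors
    oc logs $(oc get pods | grep ai-platform-api-server | awk '{print $1}') | grep -iE "elasticsearch|connection|timeout"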
    

Network issues with Spark

The AI platform has problems either launching training jobs or reporting ongoing training status for the temporal grouping AI algorithm. As a result, you encounter errors on the Manage AI algorithms page.

This issue can occur because training jobs for the temporal grouping AI algorithm are orchestrated by Spark, an internal component. If training jobs fail to launch or training status is not displayed, Spark might be in an unhealthy state.

Solution: Try to get more information on potential Spark issues by examining the AI Platform pod logs.

Proceed as follows:

  1. Log in to your cluster by running the oc login command. You must be an administrator user within the project (namespace) where IBM Cloud Pak for AIOps is installed.

    oc login -u <username> -p <password>
    

    For more information, see Logging in to the OpenShift CLI.

  2. Display the logs and look for error logs associated with Spark.

    oc logs $(oc get pods | grep ai-platform-api-server| awk '{print $1}')
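
    You can also check whether the Spark pods themselves are running. The following command is a minimal sketch; it assumes that the Spark pod names contain the string spark, as with the aiops-ir-analytics-spark-worker pods that are referenced later in this topic:

    # List Spark-related pods and check their status (name pattern is an assumption)
    oc get pods | grep -i spark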
    

Training encounters insufficient resources

Training an AI algorithm returns an insufficient resources exception. The reason for this exception is that internal components within the training process were unable to locate a node with sufficient CPU or memory to satisfy the requirements of the training run.

Solution: Reduce the CPU and memory requests. Memory is usually the bottleneck, so if you are unsure what to change, a good first option is to reduce the memory request by half.

  1. Find the custom resource of the algorithm for which you want to alter the default resource requirements by running the following command:

    oc get algorithms
    
  2. The command displays the custom resources that contain all the algorithm specifications, as in the following example output:

    oc get algorithms
    
    NAME                AGE
    change-risk         46h
    log-anomaly         46h
    similar-incidents   46h
    
  3. Edit the specific algorithm by running the following command:

    oc edit algorithm <name>
    
  4. Decode the spec.manifestBase64 field. Once decoded, you see a resources field that is defined as follows:

    resources:
        requests:
            cpu: 2
            memory: 4Gi
        limits:
            cpu: 2
            memory: 4Gi
    

    Note: You can edit the requests and limits values (make sure that limits are equal to requests for training algorithms), re-encode the manifest to base64, and then replace the manifestBase64 field in the custom resource with the newly encoded manifest. An example of this flow follows these steps.

  5. Save the changes, and proceed with your training.
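
The following commands are a minimal sketch of the decode and re-encode flow that is described in the note in step 4. The algorithm name log-anomaly is only an example from step 2, and the commands assume that GNU base64 (which supports -w0) is available on your workstation:

# Extract and decode the manifest (log-anomaly is an example algorithm name)
oc get algorithm log-anomaly -o jsonpath='{.spec.manifestBase64}' | base64 -d > manifest.yaml

# Edit the resources.requests and resources.limits values in manifest.yaml,
# keeping limits equal to requests, then re-encode and patch the custom resource.
oc patch algorithm log-anomaly --type=merge \
  -p "{\"spec\":{\"manifestBase64\":\"$(base64 -w0 manifest.yaml)\"}}"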

Training reports failure but process completes

An AI algorithm training run has completed, but the IBM Cloud Pak for AIOps console shows that the training has failed. No model was generated during the training run.

The reason for this failure might be insufficient data or a misconfiguration of the data set range. The failure status does not mean that the training process itself failed; it means that no usable result was obtained from the training run, and consequently no model was produced.

Solution: Examine the data sources for the training, examine the training configuration, make any necessary changes, and launch the training again.

Log anomaly detection model deployment fails with error undefined

When you are training the log anomaly detection algorithm with historical data and you attempt to deploy the AI model, you might encounter an issue where the model deployment fails with an Error: undefined error in the UI.

Solution: As a workaround when this issue occurs, restart all cp4waiops-eventprocessor pods, which can allow the model deployment to complete successfully. To restart the pods, run the following command, which deletes the current pod instances so that new instances are created and started.

oc delete po cp4waiops-eventprocessor-eve-29ee-ep-jobmanager-0 cp4waiops-eventprocessor-eve-29ee-ep-jobmanager-1 cp4waiops-eventprocessor-eve-29ee-ep-taskmanager-0 cp4waiops-eventprocessor-eve-29ee-ep-taskmanager-1

Note: If you installed IBM Cloud Pak for AIOps with a larger production deployment sizing, you need to specify any additional cp4waiops-eventprocessor pods that exist when you run the preceding command.
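
If you are not sure which cp4waiops-eventprocessor pods exist in your deployment, you can list them first and then delete them by name pattern. The following commands are a minimal sketch that reuses the pod naming shown in the preceding command; review the list before you delete:

# List the eventprocessor pods, then delete all of them so that new instances start
oc get pods | grep cp4waiops-eventprocessor
oc delete pod $(oc get pods | grep cp4waiops-eventprocessor | awk '{print $1}')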

Ingested raw logs are not available immediately

Ingested raw logs are not available immediately for "Log anomaly detection - natural language" model training after the import is complete. Raw logs are not available for review in the "Preview log" section of the ChatOps Alert Viewer.

This issue can occur because additional time is required to store normalized logs in Elasticsearch after they are ingested. The amount of time can be reduced by allocating more Flink tasks to the Flink job that stores the normalized logs.

Solution: To resolve this issue, complete the following steps:

  1. The environment variables that control the number of Flink tasks that are used to move logs to Elasticsearch are located in the Kubernetes deployment for aimanager-aio-controller. Edit this deployment: oc edit deployment aimanager-aio-controller
  2. Find the environment variables by searching for "LOGS_JOB2_BASE_PARALLELISM". Change the value of LOGS_JOB2_BASE_PARALLELISM and LOGS_JOB2_NUM_PARTITIONS from their default values of 2 to something higher. Consider doubling the values to see whether that has the wanted effect. (A single command that sets both variables is shown after these steps.)
  3. Save the yaml and wait for the pods to restart. You can verify that the pods have restarted by monitoring the output of: oc get pods | grep aimanager-aio-controller
  4. Stop the data flow for all logs integrations. It is important that there are no active logs integrations, because this is how the controller knows that it can safely stop the Flink job that copies the logs to Elasticsearch.
  5. Restart the data flow for all logs integrations.
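
As an alternative to editing the deployment YAML interactively in steps 1 and 2, you can set both environment variables with a single command. The following command is a minimal sketch; the value of 4 is only an example of doubling the defaults:

# Example values; adjust to your environment
oc set env deployment/aimanager-aio-controller LOGS_JOB2_BASE_PARALLELISM=4 LOGS_JOB2_NUM_PARTITIONS=4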

The Flink job restarts with a higher number of tasks and copies the logs to Elasticsearch more quickly.

Be careful not to allocate a higher parallelism than the number of Flink task slots that are available. You can calculate the number of task slots that you need by adding the following values (a worked example follows this list):

  • sum of base_parallelism from all defined integrations, as defined in the Integration UI
  • value of LOGS_JOB3_BASE_PARALLELISM multiplied by the number of logs type defined integrations (for example, Mezmo, Falcon Logscale, Splunk, and so on)
  • value of LOGS_JOB2_BASE_PARALLELISM
  • value of LOGS_JOB4_BASE_PARALLELISM
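
For example, with two logs integrations that each use a base_parallelism of 1, LOGS_JOB3_BASE_PARALLELISM set to 1, LOGS_JOB2_BASE_PARALLELISM raised to 4, and LOGS_JOB4_BASE_PARALLELISM set to 1, you need (1 + 1) + (1 x 2) + 4 + 1 = 9 task slots. These values are illustrative only.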

If you need additional task manager slots, you can add a task manager by following the steps in Increasing data streaming capacity.

Statistical baseline reference can require manual reset

There is a chance that infrequently occurring anomalies are learned as normal behavior, and as a result they might no longer be detected. For instance, if the interval for the infrequently occurring anomalies does not exceed the predefined threshold of 0.15, the interval is not tagged as biased, the reference values are updated, and the anomaly is learned as normal behavior. If this occurs, you need to reset the learning by resetting the reference of the statistical baseline with the following steps:

  1. Open a terminal and log in to your cluster:

    oc login <server> -u <cluster username> -p <cluster pass>
    
  2. Open another terminal and run port forwarding for Elasticsearch:

    while true; do oc port-forward $(oc get po | grep iaf-system-elasticsearch-es-aiops-0 | awk '{print $1}') 9200; done
    
  3. Get the Elasticsearch credentials:

    export username=$(oc get secret iaf-system-elasticsearch-es-default-user -o jsonpath="{.data.username}"|base64 -d);
    export password=$(oc get secret iaf-system-elasticsearch-es-default-user -o jsonpath="{.data.password}"|base64 -d);
    
    
  4. Reset the statistical baseline reference:

    curl -u $username:$password -XDELETE https://localhost:9200/1000-1000-reference_oob --insecure | sort
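
    Optionally, you can confirm that the reference index was removed by using the same port forward and credentials. The following command is a minimal sketch; if the index was deleted, it is no longer listed in the output:

    # The wildcard pattern is an assumption based on the index name in the previous step
    curl -u $username:$password -XGET "https://localhost:9200/_cat/indices/1000-1000-reference*?v" --insecure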
    

How to stop a precheck job

The workaround is to stop the precheck job manually by using OpenShift CLI commands, delete the training definition, re-create the training definition, and relaunch the training configuration with the correct details.

There are two methods. The first method is recommended because it limits mistakes and reduces the risk of training getting out of sync with the data in the Elasticsearch database; however, it deletes the existing models. The second method preserves the models, so use it if you want to keep the models in question.

Method 1: Stop the precheck job and delete the training definition to start afresh

  1. Retrieve the precheck job in the namespace where IBM Cloud Pak for AIOps is installed:

    oc get jobs | grep precheck
    
  2. Delete the precheck job in order to stop the precheck run:

    oc delete job <job-name>
    
  3. Delete the training definition using the delete button on the console.

  4. Re-create the training definition with the correct configurations required for the training algorithm, for example date range or scheduling.

  5. Run the training as usual.

Method 2: Stop precheck job and clean up Elasticsearch data

  1. Retrieve the precheck job in the namespace where IBM Cloud Pak for AIOps is installed:

    oc get jobs | grep precheck
    
  2. Delete the precheck job in order to stop the precheck run:

    oc delete job <job-name>
    
  3. Retrieve the AI Platform pod name in the namespace where IBM Cloud Pak for AIOps is installed:

    oc get pod | grep api-server
    
  4. Open a remote shell inside the pod:

    oc rsh <pod-name> # Example: oc rsh aimanager-aio-ai-platform-api-server-55d5b59bd4-jk7nz
    
  5. Verify the latest precheck run and retrieve the '_id' required for cleanup:

    curl -k -u <username>:<password> -X POST "$ES_URL/precheckrun/_search?pretty=true" -H 'Content-Type: application/json' -d '{"query": {"match_all": {}},"size": 1,"sort": [{"timestamp": {"order": "desc"}}]}'
    

    Example:

    "_id":
    "log_anomaly_detection_configurationv3"
    
  6. Use the '_id', and the version that is specified at the end of the ID, to delete the respective precheck run entries in Elasticsearch:

    curl -k -u <username>:<password> -X POST "$ES_URL/precheckrun/_delete_by_query" -H 'Content-Type: application/json' -d '{"query": {"term": {"_id": "<ID retrieved from previous step>"}}}'
    

    Example:

    curl -k -u <username>:<password> -X POST "$ES_URL/precheckrun/_delete_by_query" -H 'Content-Type: application/json' -d '{"query": {"term": {"_id": "log_anomaly_detection_configurationv3"}}}'
    
    curl -k -u <username>:<password> -X POST "$ES_URL/precheckstatus/_delete_by_query" -H 'Content-Type: application/json' -d '{"query": {"term": {"precheckRunVersion": "<version retrieved from previous step>"}}}'
    

    Example:

    curl -k -u <username>:<password> -X POST "$ES_URL/precheckstatus/_delete_by_query" -H 'Content-Type: application/json' -d '{"query": {"term": {"precheckRunVersion": "v3"}}}'
    
  7. Update the training definition document to decrement the precheck run version:

    curl -k -u <username>:<password> -X POST "$ES_URL/trainingdefinition/_update/log_anomaly_detection_configuration" -H 'Content-Type: application/json' -d '{"doc": {"precheckRunVersion": "<decremented version>"}}'
    

    Example:

    curl -k -u <username>:<password> -X POST "$ES_URL/trainingdefinition/_update/log_anomaly_detection_configuration" -H 'Content-Type: application/json' -d '{"doc": {"precheckRunVersion": "v2"}}'
    
  8. Make your changes to the training configuration, for example to the date range or scheduling.

  9. Run the training as usual.

Evaluation of historical metric data gives inconsistent results

When you send historical metric data for training, such as in a product demonstration, you see inconsistent outcomes in the IBM Cloud Pak for AIOps console metric anomaly results.

At inference (metric evaluation) time, it takes a short amount of time to load analytics models into memory. This delay might lead to some metric data being left out of an evaluation.

Solution: Send one data point for each individual metric to allow the model to load. Then, send the remaining data after a short interval of 30 - 60 seconds.

Model training considers less historical data in a data quality check

Less data is included in a data quality check if you select Training first instead of running a precheck and training separately.

If you select Training to run both a data quality check and training sequentially, you might notice this problem. This situation occurs only when some integrations are moved into live data after the training completes, and the training dates are changed to check historical data from another integration source.

Solution: Run the precheck and training separately. Start the data quality precheck and, after the precheck is complete, run the training.

Deployed model status in a metric anomaly or temporal grouping training algorithm is changed to pending

A temporal grouping or metric anomaly detection algorithm is in a pending state, although the model was previously trained and deployed successfully.

Solution: To resolve this issue, complete the following steps:

  1. Run the oc -n <namespace> rsh <aimanager-aio-ai-platform-api-server-*> command.

  2. Delete all versions of the metric_anomaly_detection_configuration or temporal_grouping_configuration training algorithms in the trainingrun index. Search the index by running the following command:

    ./elastic.sh -X GET -E "trainingrun/_search?pretty=true&size=1000"
    
  3. Identify all the documents that are related to your metric_anomaly_detection_configuration or temporal_grouping_configuration training algorithms. The _id field of the documents is metric_anomaly_detection_configurationv<version> or temporal_grouping_configurationv<version>, where <version> is the version of the training run.

  4. Delete the training run documents:

    ./elastic.sh -X DELETE -E trainingrun/_doc/metric_anomaly_detection_configurationv1
    ./elastic.sh -X DELETE -E trainingrun/_doc/metric_anomaly_detection_configurationv2
    [..]
    

    or

    ./elastic.sh -X DELETE -E trainingrun/_doc/temporal_grouping_configurationv1
    ./elastic.sh -X DELETE -E trainingrun/_doc/temporal_grouping_configurationv2
    [..]
    
  5. Delete the training definition to clean the UI:

    ./elastic.sh -X DELETE -E trainingdefinition/_doc/metric_anomaly_detection_configuration
    

    or

    ./elastic.sh -X DELETE -E trainingdefinition/_doc/temporal_grouping_configuration
    
  6. Now, you can train a new model.

Model version deployment reports failure despite being successfully deployed

When you deploy a log anomaly detection (LAD) model version manually, you might encounter an error message in the console that reports a deployment failure. The error message states that the failure occurs because the version is already deployed.

This error occurs because the model version is successfully deployed in the background even though the deployment request times out.

Solution: Restart the data flow for all the integrations. To do so, first disable and then re-enable the data flow for all integrations.

The Alert suppression XinY policies tile does not appear in AI Model Management

In the IBM Cloud Pak for AIOps console, click AI model management and find the Alert suppression XinY policies tile. If this tile does not appear, an administrator or developer user needs to restart the aimanager-aio-ai-platform-api-server pod.

To restart the target pod, delete the pod so that a new pod automatically starts up:

  1. Connect to your cluster with the oc login command.

  2. Restart the aimanager-aio-ai-platform-api-server pod.

    oc delete pod/"$(oc get pod|grep aimanager-aio-ai-platform-api-server|awk '{print$1; exit}')"
    

    Successfully running this command produces output similar to the following example:

    pod "aimanager-aio-ai-platform-api-server-55c5d975c4-6qpwk" deleted
    

Related events training can fail when you are processing a high alert count

When you are running related events training on a data set that includes a high number of alerts (approximately 11.5M or more alerts), the training can fail. When this failure occurs, the Spark log includes thrown exceptions.

The Spark log exceptions can resemble the following output:

24/02/23 10:16:01 ERROR TaskSetManager: Task 0 in stage 3215.0 failed 4 times; aborting job
24/02/23 10:16:01 INFO TaskSchedulerImpl: Removed TaskSet 3215.0, whose tasks have all completed, from pool 
24/02/23 10:16:01 INFO DAGScheduler: Job 326 failed: $anonfun$withThreadLocalCaptured$1 at <unknown>:0, took 0.996638 s
24/02/23 10:16:01 INFO TaskSchedulerImpl: Cancelling stage 3215
24/02/23 10:16:01 INFO TaskSchedulerImpl: Killing all running tasks in stage 3215: Stage cancelled: Job aborted due to stage failure: Task 0 in stage 3215.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3215.0 (TID 2975) (10.254.75.83 executor 0): org.apache.spark.SparkException: Checkpoint block rdd_680_0 not found! Either the executor
that originally checkpointed this partition is no longer alive, or the original RDD is
unpersisted. If this problem persists, you may consider using `rdd.checkpoint()`
instead, which is slower than local checkpointing but more fault-tolerant.
	at org.apache.spark.errors.SparkCoreErrors$.checkpointRDDBlockIdNotFoundError(SparkCoreErrors.scala:102)
	at org.apache.spark.rdd.LocalCheckpointRDD.compute(LocalCheckpointRDD.scala:61)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:362)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
	at org.apache.spark.storage.BlockManager.$anonfun$getOrElseUpdate$1(BlockManager.scala:1372)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1601)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1528)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1592)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1389)
	at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:326)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)

Solution: As a workaround if this failure occurs, increase the Spark worker memory limit and run the training again.

The Spark worker memory limit can be modified by using the following command:

oc edit sts aiops-ir-analytics-spark-worker

The field that needs modification is the memory field as shown in the following example output:

resources:
  limits:
    memory: 8000Mi
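
For example, to double the limit that is shown, you might change the value to the following setting. The exact value depends on the resources that are available on your cluster:

resources:
  limits:
    memory: 16000Mi   # Example value; adjust to your available node memory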

Ticketing integration missing from list of integrations

If a ticketing integration is missing from the list of integrations in a training module such as Similar tickets or Change risk, update that integration. Edit the integration, update the existing description or add a new one, and then save your changes. This action triggers a resynchronization of the connector schema for that integration. After a few minutes, the updated integration appears in the trainer Integrations table.

Metric anomaly detection training fails after an extended time period

Previously deployed metric anomaly detection training models stop running successfully.

An error similar to the following example appears in the Cassandra logs:

ERROR [ReadStage-2] 2024-10-22 11:53:22,943 StorageProxy.java:2175 - Scanned over 100001 tombstones during query 'SELECT value FROM tararam.dt_metric_value WHERE token(t_uid, mr_id) >= token('cfd95b7e-3bc7-4006-a4a8-a73a79c71255', 28f522d0-5c6a-3d8a-b867-8a449f3e5fa9) AND token(t_uid, mr_id) <= 5156857202419367684 LIMIT 1000 ALLOW FILTERING' (last scanned row token was 4404220741919580179 and partion key was ((cfd95b7e-3bc7-4006-a4a8-a73a79c71255, bab5050a-5780-3e92-b8b4-8baa00cbfe42), 2024-09-27T18:25:00.000Z)); query aborted   

Solution: Clear the already expired metric data and then restore the expiration interval to its original value.

  1. Connect to the Cassandra database by logging in to the Cassandra CQL shell (cqlsh):

    export CASSANDRA_PASSWORD=$(oc get secret aiops-topology-cassandra-auth-secret --template={{.data.password}} | base64 --decode);echo $CASSANDRA_PASSWORD
    oc exec -ti aiops-topology-cassandra-0 -- /opt/ibm/cassandra/bin/cqlsh -u admin -p <PASSWORD FROM PREVIOUS COMMAND> --ssl
    
  2. Set the gc_grace_seconds period on the metric data table to 10 minutes:

    cqlsh> ALTER TABLE tararam.dt_metric_value WITH gc_grace_seconds=600;
    
  3. Wait at least 10 minutes for the data to expire.

  4. Set the table back to its original value:

    cqlsh> ALTER TABLE tararam.dt_metric_value WITH gc_grace_seconds=864000;
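
Optionally, you can confirm the current gc_grace_seconds value before and after the change from the same cqlsh session. The following command is a minimal sketch; the table options in the output include gc_grace_seconds:

cqlsh> DESCRIBE TABLE tararam.dt_metric_value;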