IBM Support

Training produces non-deterministic Policy results

Troubleshooting


Problem

The related events training algorithm is not deterministic: two training runs over exactly the same alerts can produce different numbers of groups, and therefore a different number of policies.

Symptom

The number of Temporal Grouping and Temporal Patterns policies is not consistent across training runs, even when the event data in Cassandra is unchanged.

Diagnosing The Problem

Run training repeatedly against the same event data, for example from the training pod:
curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' --header 'X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255' -d '{  "properties": {  "patterns.enabled": "true" , "ea.policies.deploy": false ,  "patterns.deploy": "false" , "patterns.outputRaw": true, "runner.spark.writeOutToDir": "/opt/spark/work/MissingPatterns1" }}' 'http://172.30.90.33:8080/1.0/training/train/related-events'
{"_executionTime":209,"response":"driver-20230508112211-0158"}

where 172.30.90.33 is the cluster IP of the training service (oc get service | grep train).
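Rather than hard-coding the IP, it can be looked up with a command such as the following. This is a minimal sketch: the 'train' filter and the CLUSTER-IP column position are assumptions based on the default 'oc get service' output, so adjust them to match your environment.

# Look up the training service cluster IP (CLUSTER-IP is column 3 by default)
TRAIN_IP=$(oc get service | grep train | awk '{print $3}')
echo ${TRAIN_IP}

Substitute ${TRAIN_IP} for 172.30.90.33 in the curl command above.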

After each training run, compare the number of policies created. Replace cluster_release_name below with the release name of your deployment:
export ADMIN_PASSWORD=$(oc get secret cluster_release_name-systemauth-secret -o jsonpath --template '{.data.password}' | base64 --decode)
export POLICY_URL=$(echo "https://$(oc get route | grep policies | awk '{print $2}')$(oc get route | grep policies | awk '{print $3}')"/system)/v1/cfd95b7e-3bc7-4006-a4a8-a73a79c71255/policies/system

env | grep ADMIN_PASSWORD
env | grep POLICY_URL

curl -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --user system:${ADMIN_PASSWORD} -vk ${POLICY_URL} > ALL_POLICIES.run2
curl -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --user system:${ADMIN_PASSWORD} -vk ${POLICY_URL} | grep related-events > ALL_RELATED_POLICIES.run2
curl -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --user system:${ADMIN_PASSWORD} -vk ${POLICY_URL} | grep pattern > ALL_PATTERN_POLICIES.run2
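Repeat the three commands after each training run, changing the .run2 suffix (for example .run1, .run2, .run3), and then compare the policy counts between runs. The following count is a rough sketch: it assumes each policy definition in the returned JSON contains the matched string exactly once.

# Count related-events and pattern policies captured from each run
grep -o 'related-events' ALL_POLICIES.run1 | wc -l
grep -o 'related-events' ALL_POLICIES.run2 | wc -l
grep -o 'pattern' ALL_POLICIES.run1 | wc -l
grep -o 'pattern' ALL_POLICIES.run2 | wc -l

If training were deterministic, the counts from every run would match.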
 

Resolving The Problem

The problem is resolved in NOI 1.6.9.
For large data sets, set the number of Spark Worker pods to 6 by scaling the spark-slave deployment to 6 replicas, as shown in the sketch below.
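A minimal sketch, assuming the Spark worker deployment follows the usual cluster_release_name-spark-slave naming; substitute the deployment name used in your installation:

# Scale the Spark worker deployment to six replicas
oc scale deployment cluster_release_name-spark-slave --replicas=6
# Verify that six worker pods come up
oc get pods | grep spark-slave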

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB45","label":"Automation"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSTPTP","label":"Netcool Operations Insight"},"ARM Category":[{"code":"a8m0z0000001jZTAAY","label":"NOI Netcool Operations Insights-\u003ECNEA Cloud Native Event Analytics"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]

Document Information

Modified date:
04 September 2023

UID

ibm17028501