Troubleshooting
Problem
The related events training algorithm is not deterministic; that is, two training runs over exactly the same alerts can produce different numbers of groups (and different policies).
Symptom
The number of Temporal Grouping and Temporal Patterns policies is not consistent between training runs, even when the event data in Cassandra is unchanged.
Diagnosing The Problem
Run training repeatedly, for example by calling the training service directly:
curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' --header 'X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255' -d '{ "properties": { "patterns.enabled": "true" , "ea.policies.deploy": false , "patterns.deploy": "false" , "patterns.outputRaw": true, "runner.spark.writeOutToDir": "/opt/spark/work/MissingPatterns1" }}' 'http://172.30.90.33:8080/1.0/training/train/related-events'
{"_executionTime":209,"response":"driver-20230508112211-0158"}
where 172.30.90.33 is the cluster IP of the training service (oc get service | grep train). The response contains the ID of the Spark driver submission for the training job.
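To make the POST easy to replay for each run, the service IP can be captured in a variable first. This is a convenience sketch only; the variable name TRAIN_IP is introduced here, and it assumes that oc get service | grep train matches exactly one service with the CLUSTER-IP in the third column:

# Sketch: capture the training service cluster IP so the POST can be replayed per run
TRAIN_IP=$(oc get service | grep train | awk '{print $3}')
echo "Training service IP: ${TRAIN_IP}"

curl -X POST \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/json' \
  --header 'X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255' \
  -d '{ "properties": { "patterns.enabled": "true", "ea.policies.deploy": false, "patterns.deploy": "false", "patterns.outputRaw": true, "runner.spark.writeOutToDir": "/opt/spark/work/MissingPatterns1" }}' \
  "http://${TRAIN_IP}:8080/1.0/training/train/related-events"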
Compare the number of policies between runs. Replace cluster_release_name below with the release name of your cluster:
export ADMIN_PASSWORD=$(oc get secret cluster_release_name-systemauth-secret -o jsonpath --template '{.data.password}' | base64 --decode)
export POLICY_URL="https://$(oc get route | grep policies | awk '{print $2}')$(oc get route | grep policies | awk '{print $3}')/system/v1/cfd95b7e-3bc7-4006-a4a8-a73a79c71255/policies/system"
env | grep ADMIN_PASSWORD
env | grep POLICY_URL
curl -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --user system:${ADMIN_PASSWORD} -vk ${POLICY_URL} > ALL_POLICIES.run2
curl -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --user system:${ADMIN_PASSWORD} -vk ${POLICY_URL} | grep related-events > ALL_RELATED_POLICIES.run2
curl -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --user system:${ADMIN_PASSWORD} -vk ${POLICY_URL} | grep pattern > ALL_PATTERN_POLICIES.run2
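After at least two training runs, saving the output of the first run with a .run1 suffix and the second with .run2 as above, a quick comparison shows whether the policy sets differ. This is a sketch only; it assumes the dump files follow the naming used in the commands above:

# Sketch: compare policy dumps from two training runs
for f in ALL_POLICIES ALL_RELATED_POLICIES ALL_PATTERN_POLICIES; do
  if diff -q "${f}.run1" "${f}.run2" > /dev/null; then
    echo "${f}: identical between runs"
  else
    echo "${f}: DIFFERS between runs (symptom of non-deterministic training)"
  fi
done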
Resolving The Problem
The problem is resolved in NOI 1.6.9.
For large data sets, set the number of Spark worker pods to 6 by scaling the spark-slave deployment to replicas=6.
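As a sketch, the scaling can be done with oc. The exact deployment name carries the release prefix, so the grep lookup below is an assumption to verify against your cluster:

# Sketch: scale the Spark worker (spark-slave) deployment to 6 replicas.
# The deployment name includes the release prefix, so look it up first.
SPARK_SLAVE=$(oc get deployment | grep spark-slave | awk '{print $1}')
oc scale deployment "${SPARK_SLAVE}" --replicas=6

# Verify the new replica count
oc get deployment "${SPARK_SLAVE}"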
Document Location
Worldwide
[{"Type":"MASTER","Line of Business":{"code":"LOB45","label":"Automation"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSTPTP","label":"Netcool Operations Insight"},"ARM Category":[{"code":"a8m0z0000001jZTAAY","label":"NOI Netcool Operations Insights-\u003ECNEA Cloud Native Event Analytics"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]
Document Information
Modified date:
04 September 2023
UID
ibm17028501