Troubleshooting
Problem
When you run extraction training with Automation Document Processing, the training never completes. The deep learning log shows:
WE ARE FACING HIGH TRAFFIC IN GPU CLUSTER. RETRYING IN 15 MINS
Cause
The newest Nvidia drivers use more GPU memory than previous versions. Because less GPU memory is available, Automation Document Processing incorrectly determines that the GPU is not available and loops, waiting for the GPU to become available again.
Environment
OpenShift Container Platform clusters running with Nvidia GPU
Diagnosing The Problem
From within the deep-learning pod, run this command:
nvidia-smi
If the reported driver version is 515.65.01 or higher, apply the workaround in this technote.
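The version check above can be sketched in shell. This is an illustration, not part of the product: the `nvidia-smi --query-gpu=driver_version` query is a standard way to print only the driver version, and the hardcoded `driver_version` value below stands in for that output; `sort -V` is used for the version comparison.

```shell
# Substitute with the real output from inside the deep-learning pod:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
driver_version="515.65.01"

# First driver release known to exhibit the problem (from this technote).
affected="515.65.01"

# If $affected sorts first (or equal), the installed driver is >= 515.65.01.
if [ "$(printf '%s\n%s\n' "$affected" "$driver_version" | sort -V | head -n1)" = "$affected" ]; then
  echo "affected"       # prints: affected (for the sample value above)
else
  echo "not affected"
fi
```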
Resolving The Problem
- In the Red Hat OpenShift console, navigate to Operators, Installed Operators, and select Nvidia GPU Operator.
- In the ClusterPolicy tab, select the gpu cluster policy.
- In the YAML, add or edit the driver image to pin the driver version, then save the YAML.
spec:
  driver:
    image: nvcr.io/nvidia/driver@sha256:dd8155a49340aaea95fd85639576e76196d0170fc023dc7782b35f7c313275b5
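As an alternative to editing the YAML in the console, the same change can be applied with the `oc` CLI. This is a sketch only: the ClusterPolicy name `gpu-cluster-policy` is an assumption, so substitute the name shown in your ClusterPolicy tab.

```shell
# List the ClusterPolicy resources to find the actual name (assumption:
# it is commonly "gpu-cluster-policy").
oc get clusterpolicy

# Merge-patch the driver image onto the policy. The digest pins the
# driver to a version that does not trigger the GPU-availability loop.
oc patch clusterpolicy gpu-cluster-policy --type merge \
  -p '{"spec":{"driver":{"image":"nvcr.io/nvidia/driver@sha256:dd8155a49340aaea95fd85639576e76196d0170fc023dc7782b35f7c313275b5"}}}'
```

This command requires cluster-admin access and a logged-in `oc` session; the console steps above achieve the same result.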
Document Location
Worldwide
Document Information
Modified date:
21 November 2022
UID
ibm16840769