
Extraction training fails with High Traffic in GPU Cluster error when running Automation Document Processing

Troubleshooting


Problem

When you run extraction training with IBM Automation Document Processing, the training never completes. The deep-learning pod log shows the following message:
WE ARE FACING HIGH TRAFFIC IN GPU CLUSTER. RETRYING IN 15 MINS

Cause

Newer NVIDIA drivers use more GPU memory than previous versions. The reduction in available GPU memory causes Automation Document Processing to incorrectly determine that the GPU is not available, so the training job loops while it waits for the GPU to become available again.

Environment

OpenShift Container Platform clusters that run NVIDIA GPUs

Diagnosing The Problem

From within the deep-learning pod, run the following command:
nvidia-smi
If the reported NVIDIA driver version is 515.65.01 or higher, apply the workaround in this technote.
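If you prefer to run the check from outside the pod, the following is a minimal sketch that uses the oc CLI. The namespace and pod name are placeholders and will differ in your cluster; substitute the project that hosts Automation Document Processing and the actual name of the deep-learning pod.

    # Locate the deep-learning pod (the exact name varies by deployment)
    oc get pods -n <adp-namespace> | grep -i deep

    # Query only the installed NVIDIA driver version from inside that pod
    oc exec -n <adp-namespace> <deep-learning-pod-name> -- nvidia-smi --query-gpu=driver_version --format=csv,noheader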

Resolving The Problem

  1. In the Red Hat OpenShift console, navigate to Operators > Installed Operators and select the NVIDIA GPU Operator.
  2. On the ClusterPolicy tab, select the gpu cluster policy.
  3. Add or edit the YAML to pin the driver image, as shown in the following snippet, and save the YAML. An equivalent command-line sketch follows the snippet.
    spec:
      driver:
        image: nvcr.io/nvidia/driver@sha256:dd8155a49340aaea95fd85639576e76196d0170fc023dc7782b35f7c313275b5
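The same change can also be applied from the command line with oc patch. This is a minimal sketch that assumes the ClusterPolicy resource is named gpu-cluster-policy and that the operator runs in the nvidia-gpu-operator namespace (common defaults); adjust both to match the policy and project shown in your console.

    # Pin the driver image on the ClusterPolicy with a merge patch
    oc patch clusterpolicy gpu-cluster-policy --type merge \
      -p '{"spec":{"driver":{"image":"nvcr.io/nvidia/driver@sha256:dd8155a49340aaea95fd85639576e76196d0170fc023dc7782b35f7c313275b5"}}}'

    # Watch the driver pods restart with the pinned image (namespace is an assumption)
    oc get pods -n nvidia-gpu-operator -w

After the driver pods restart, rerun nvidia-smi in the deep-learning pod to confirm the pinned driver version is in use, then restart extraction training.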

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB45","label":"Automation"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSBYVB","label":"IBM Cloud Pak for Business Automation"},"ARM Category":[{"code":"a8m3p000000hAKPAA2","label":"Operate-\u003EADP Install\\Upgrade\\Setup"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]

Document Information

Modified date:
21 November 2022

UID

ibm16840769