
Extraction training fails with High Traffic in GPU Cluster error when running Automation Document Processing

Troubleshooting


Problem

When you run extraction training with IBM Automation Document Processing, the training never completes. The deep-learning pod log shows the following message:
WE ARE FACING HIGH TRAFFIC IN GPU CLUSTER. RETRYING IN 15 MINS

Cause

Newer NVIDIA drivers use more GPU memory than previous versions. The reduction in available GPU memory causes Automation Document Processing to incorrectly determine that the GPU is not available, so the training job loops while it waits for the GPU to become available again.

Environment

OpenShift Container Platform clusters that run NVIDIA GPUs

Diagnosing The Problem

From within the deep-learning pod, run the following command:
nvidia-smi
If the reported NVIDIA driver version is 515.65.01 or higher, apply the workaround in this technote.
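If you prefer to run the check from outside the pod, the following is a minimal sketch that uses the oc CLI. The namespace and pod name are placeholders and will differ in your cluster; substitute the project that hosts Automation Document Processing and the actual name of the deep-learning pod.

    # Locate the deep-learning pod (the exact name varies by deployment)
    oc get pods -n <adp-namespace> | grep -i deep

    # Query only the installed NVIDIA driver version from inside that pod
    oc exec -n <adp-namespace> <deep-learning-pod-name> -- nvidia-smi --query-gpu=driver_version --format=csv,noheader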

Resolving The Problem

  1. In the Red Hat OpenShift console, navigate to Operators > Installed Operators and select the NVIDIA GPU Operator.
  2. On the ClusterPolicy tab, select the gpu cluster policy.
  3. Add or edit the YAML to pin the driver image, as shown in the following snippet, and save the YAML. An equivalent command-line sketch follows the snippet.
    spec:
      driver:
        image: nvcr.io/nvidia/driver@sha256:dd8155a49340aaea95fd85639576e76196d0170fc023dc7782b35f7c313275b5
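The same change can also be applied from the command line with oc patch. This is a minimal sketch that assumes the ClusterPolicy resource is named gpu-cluster-policy and that the operator runs in the nvidia-gpu-operator namespace (common defaults); adjust both to match the policy and project shown in your console.

    # Pin the driver image on the ClusterPolicy with a merge patch
    oc patch clusterpolicy gpu-cluster-policy --type merge \
      -p '{"spec":{"driver":{"image":"nvcr.io/nvidia/driver@sha256:dd8155a49340aaea95fd85639576e76196d0170fc023dc7782b35f7c313275b5"}}}'

    # Watch the driver pods restart with the pinned image (namespace is an assumption)
    oc get pods -n nvidia-gpu-operator -w

After the driver pods restart, rerun nvidia-smi in the deep-learning pod to confirm the pinned driver version is in use, then restart extraction training.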

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB45","label":"Automation"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSBYVB","label":"IBM Cloud Pak for Business Automation"},"ARM Category":[{"code":"a8m3p000000hAKPAA2","label":"Operate-\u003EADP Install\\Upgrade\\Setup"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]

Document Information

Modified date:
21 November 2022

UID

ibm16840769