Troubleshooting common issues
Following are some problems you might encounter when using PowerAI Vision, along with steps to fix them.
- The PowerAI Vision GUI does not work
- Unexpected / old pages displayed when accessing the user interface
- Uploading a large file fails
- Uploading a large number of files fails
- PowerAI Vision does not play video
- Out of space error from load_images.sh
- I forgot my user name or password
- PowerAI Vision cannot train a model
- Training or deployment hangs
- Model training and inference fails
- Object detection model training fails using images with non-standard aspect ratios
- Auto labeling of a data set returns "Auto Label Error"
- PowerAI Vision does not start
- PowerAI Vision fails to start - Kubernetes connection issue
- PowerAI Vision startup hangs - helm issue
- Helm status errors when starting PowerAI Vision
- Some PowerAI Vision functions don't work
- Command line tool fails - missing options
The PowerAI Vision GUI does not work
- Problem
- You cannot label objects, view training charts, or create categories.
- Solution
- Verify that you are using a supported web browser. The following web browsers are supported:
- Google Chrome Version 60, or later
- Firefox Quantum 59.0, or later
Unexpected / old pages displayed when accessing the user interface
- Problem
- After updating, reinstalling, or restarting PowerAI Vision, the browser presents pages that are from the previous version or are stale.
- Solution
- This problem is typically caused by the browser using a cached version of the page. To solve
the problem, try one of these methods:
- Use a Firefox Private Window to access the user interface.
- Use a Chrome Incognito Window to access the user interface.
- Bypass the browser cache:
- In most Windows and Linux browsers: Hold down Ctrl and press F5.
- In Chrome and Firefox for Mac: Hold down ⌘ Cmd and ⇧ Shift and press R.
Uploading a large file fails
When uploading files into a data set, there is a 2 GB size limit per upload session. This limit applies to a single .zip file or a set of files. When you upload a large file that is under 2 GB, you might see the upload start (showing a progress bar) and then get an error message in the user interface. This error is caused by an Nginx timeout: the file upload takes longer than the defined 5-minute Nginx timeout.
Despite the notification error, the large file has been uploaded. Refreshing the page will show the uploaded files in the data set.
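Before uploading a set of files, it can help to compare their combined size with the per-session limit. The following is a minimal sketch and is not part of PowerAI Vision. The function names and the UPLOAD_LIMIT_BYTES variable are hypothetical, and the limit is assumed to mean 2 x 1024^3 bytes, which this topic does not specify.

```shell
# Hypothetical helpers, not part of PowerAI Vision.
# Assumption: "2 GB" means 2 * 1024^3 bytes; the documentation does not say.
UPLOAD_LIMIT_BYTES=2147483648

# Print the combined size, in bytes, of the files given as arguments.
total_upload_bytes() {
    total=0
    for f in "$@"; do
        size=$(wc -c < "$f") || return 1
        total=$((total + size))
    done
    echo "$total"
}

# Succeed (exit status 0) if the files fit within one upload session.
fits_upload_limit() {
    [ "$(total_upload_bytes "$@")" -le "$UPLOAD_LIMIT_BYTES" ]
}
```

For example, `fits_upload_limit *.jpg || echo "Split the upload into smaller sessions."` reports when a set of images is too large for one session. A set that fits can still trigger the timeout message described above.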
Uploading a large number of files fails
- Problem
- This problem only occurs on a Microsoft Windows system using the Chrome browser. When using the Import Files button in the PowerAI Vision user interface to add images or videos to a data set, you select a large number of files to upload, but nothing happens after you submit the file picker.
- Solution
- This is a known bug with the Chrome browser on Windows where the file names selected are too long, causing the file picker to fail silently. Try the following solutions:
- Create a zip of the files and upload that instead.
- Use Firefox Quantum 59.0 or later to upload the files.
- Drag-and-drop the files onto the area in the user interface labeled "Drop files here".
PowerAI Vision does not play video
- Problem
- You cannot upload a video, or after the video is uploaded the video does not play.
- Solution
- Verify that your video is a supported type:
- Ogg Vorbis (.ogg)
- VP8 or VP9 (.webm)
- H.264 encoded videos with MP4 format (.mp4)
Out of space error from load_images.sh
- Problem
- When installing the product, the load_images.sh script is used to load the
PowerAI Vision Docker images. Even though the script
terminates with "INFO: All images loaded successfully.", the output should be
checked to ensure there were not any problems.
For example, the /var/lib/docker file system can run out of space, resulting in a message indicating that an image was not fully loaded. The following output shows that the Docker image powerai-vision-dnn was not able to be fully loaded because of insufficient file system space:
Loaded image: powerai-vision-dnn:1.1.1.0
5f38fd05125c: Loading layer [==================================================>] 826.8 MB/826.8 MB
a95ac7216ffb: Loading layer [==================================================>] 20.3 MB/20.3 MB
Error processing tar file(exit status 1): write /usr/lib/libavcodec.so.57.107.100: no space left on device
INFO: All images loaded successfully.
This situation can also be seen in the output from /opt/powerai-vision/bin/kubectl get pods, which shows images that could not be loaded with a status of ErrImagePull or ImagePullBackOff. This command is described in Checking the application and environment.
- Solution
- The file system space for /var/lib/docker needs to be increased, even if the file system is not completely full. There might still be space in the file system where /var/lib/docker is located, but insufficient space for the PowerAI Vision Docker images. There are operating system mechanisms to do this, including moving or mounting /var/lib/docker to a file system partition with more space.
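Because load_images.sh can report success even when an image was not fully loaded, saved script output can be scanned for the failure messages shown in the example. The following is a minimal sketch and is not part of PowerAI Vision. The function name is hypothetical, and the patterns are taken only from the sample output in this topic, so other failure messages are not detected.

```shell
# Hypothetical helper, not part of PowerAI Vision.
# Reads saved load_images.sh output on standard input and prints every line
# that matches one of the failure messages shown in the sample output.
# Exit status is 0 if a matching line is found and 1 if none is found.
find_image_load_errors() {
    grep -E 'Error processing tar file|no space left on device'
}

# Example usage, assuming the output was saved to load_images.log
# (the file name is an assumption):
#   if find_image_load_errors < load_images.log; then
#       echo "At least one image was not fully loaded." >&2
#   fi
```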
I forgot my user name or password
- Problem
- You forgot your user name or password and cannot log in to the PowerAI Vision GUI.
- Solution
- PowerAI Vision uses an internally managed user account database. To change your user name or password, see Logging in to PowerAI Vision.
PowerAI Vision cannot train a model
- Problem
- The model training process might fail if your system does not have enough GPU resources.
- Solution
- If you are training a data set for image classification, verify that at least two image categories are defined, and that each category has a minimum of five images.
- If you are training a data set for object detection, verify that at least one object label is used. You must also verify that each object is labeled in a minimum of five images.
- Ensure that enough GPUs are available. PowerAI Vision assigns one GPU to each active training job or deployed deep learning API. For example, if a system has four GPUs and you have two deployed web APIs, there are two GPUs available for active training jobs. If a training job appears to be hanging, it might be waiting for another training job to complete, or there might not be a GPU available to run it.
To determine how many GPUs are available on the system, run the sudo /opt/powerai-vision/bin/kubectl.sh describe nodes command and review the NvidiaGPU Limits column.
The following is an example of the output from sudo /opt/powerai-vision/bin/kubectl.sh describe nodes that shows two GPUs currently in use:
Name:         127.0.0.1
Roles:        <none>
Labels:       beta.kubernetes.io/arch=ppc64le
              beta.kubernetes.io/os=linux
              gpu/nvidia=TeslaV100-SXM2-16GB
              kubernetes.io/hostname=127.0.0.1
Annotations:  node.alpha.kubernetes.io/ttl=0
              volumes.kubernetes.io/controller-managed-attach-detach=true
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits  NvidiaGPU Limits
  --------------------------------------------------------------------------
  0 (0%)        0 (0%)      0 (0%)           0 (0%)         2 (50%)
Events:         <none>
If all of the system's GPUs are in use, you can either delete a deployed web API (making the API unavailable for inference) or stop a training model that is running.
- To delete a deployed model, click Deployed Models. Next, select the model that you want to delete and click Delete. The trained model is not deleted from PowerAI Vision. You can redeploy the model later when more GPUs are available.
- To stop a training model that is running, click Models. Next, select the model that has a status of Training in Progress and click Stop Training.
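The number of GPUs in use can also be pulled out of the describe nodes output with a short filter. The following is a minimal sketch and is not part of PowerAI Vision. The function name gpus_in_use is hypothetical, and the parsing assumes the column layout of the example output in this topic, with NvidiaGPU Limits as the last column. It reports only the first node in the output.

```shell
# Hypothetical helper, not part of PowerAI Vision.
# Reads "kubectl.sh describe nodes" output on standard input and prints the
# count from the NvidiaGPU Limits column of the first node.
# Assumption: NvidiaGPU Limits is the last column, as in the sample output.
gpus_in_use() {
    awk '
        /NvidiaGPU Limits/ { header = 1; next }
        # Skip blank lines and the dashed separator under the header.
        header && (NF == 0 || $1 ~ /^-+$/) { next }
        # The last two fields of the data row look like: 2 (50%)
        header { print $(NF - 1); exit }
    '
}

# Example usage:
#   sudo /opt/powerai-vision/bin/kubectl.sh describe nodes | gpus_in_use
```

With the example output shown in this topic, the filter prints 2, which matches the two GPUs that are described as currently in use.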
Training or deployment hangs
- Problem
- You submit a job for training or deployment, but it never completes. During training or deployment, pods that are running previous jobs sometimes get out of sync with the vision-service MongoDB and hang indefinitely instead of being terminated within a minute or so. These pods continue to hold GPUs, so no new training or deployment jobs can complete. The new jobs remain in the Scheduled state indefinitely.
To verify that this is the problem, run kubectl get pods and review the output. The last column shows the age of the pod. If it is older than a few minutes, use the information in "Solution" to solve the problem.
Example:
kubectl get pods
powerai-vision-infer-ic-06767722-47df-4ec1-bd58-91299255f6hxxzk   1/1   Running   0   22m
powerai-vision-infer-ic-35884119-87b6-4d1e-a263-8fb645f0addqd2z   1/1   Running   0   22m
powerai-vision-infer-ic-7e03c8f3-908a-4b52-b5d1-6d2befec69ggqw5   1/1   Running   0   5h
powerai-vision-infer-od-c1c16515-5955-4ec2-8f23-bd21d394128b6k4   1/1   Running   0   3h
- Solution
- Follow these steps to manually delete the deployments that are hanging.
- Determine the running deployments and look for those that have been running longer than a few
minutes:
kubectl get deployments
- Delete the deployments that were identified as hanging in the previous step.
kubectl delete deployment deployment_id
- You can now try the training or deployment again, assuming there are available GPUs.
Note: When a deployment is manually deleted, vision-service might try to recreate it when it is restarted. The only way to force Kubernetes to permanently delete it is to remove the failing model from PowerAI Vision.
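Reviewing the age column by eye becomes tedious when many pods are listed. The following is a minimal sketch and is not part of PowerAI Vision. The function name pods_older_than is hypothetical, and it assumes that the age is the last column and uses the s, m, h, and d suffixes seen in the example output. The result is only a list of candidates, because pods for models that are intentionally deployed also run for a long time.

```shell
# Hypothetical helper, not part of PowerAI Vision.
# Reads "kubectl get pods" output on standard input and prints the name of
# every pod whose AGE (last column) is greater than the given number of seconds.
pods_older_than() {
    awk -v limit="$1" '
        # Convert an AGE value such as 54s, 22m, 5h, or 4d to seconds.
        function age_seconds(age,    total, n, unit) {
            total = 0
            while (match(age, /^[0-9]+[smhd]/)) {
                n = substr(age, 1, RLENGTH - 1) + 0
                unit = substr(age, RLENGTH, 1)
                if (unit == "m") n *= 60
                else if (unit == "h") n *= 3600
                else if (unit == "d") n *= 86400
                total += n
                age = substr(age, RLENGTH + 1)
            }
            return total
        }
        $1 == "NAME" { next }    # skip the header row
        NF >= 5 && age_seconds($NF) > limit { print $1 }
    '
}

# Example usage, treating "a few minutes" as 300 seconds (an assumption):
#   kubectl get pods | pods_older_than 300
```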
Model training and inference fails
- Problem
- The NVIDIA GPU device is not accessible by the PowerAI Vision Docker containers. To confirm this, run kubectl logs -f _powerai-vision-portal-ID_ and then check pod_powerai-vision-portal-ID_powerai-vision-portal.log for an error indicating error == cudaSuccess (30 vs. 0):
F0731 20:34:05.334903 35 common.cpp:159] Check failed: error == cudaSuccess (30 vs. 0) unknown error
*** Check failure stack trace: ***
/opt/py-faster-rcnn/FRCNN/bin/train_frcnn.sh: line 24: 35 Aborted (core dumped) _train_frcnn.sh
- Solution
- Use sudo to alter the SELinux permissions for all of the NVIDIA devices so that they are accessible from the PowerAI Vision Docker containers.
sudo chcon -t container_file_t /dev/nvidia*
Object detection model training fails using images with non-standard aspect ratios
- Problem
- The training of an object detection model fails with the error message "An error occurred training model-name. You must retrain model-name.", where model-name is the name of the model being trained. However, repeated attempts to train the model fail with the same error.
- Solution
- Examine the data set for images that were cropped to a non-standard aspect ratio and are much longer on one edge than the other, for example, 10 times longer on the horizontal edge than on the vertical edge. These images cause object detection training to fail, and they should be cropped or otherwise adjusted so that the model can train. The images must follow these guidelines:
- They must be at least 130 pixels on the shortest edge.
- If the longest edge is greater than 1000 pixels, the image is scaled down when the model is trained. After the longest edge is scaled down to 1000 pixels, the shortest edge must still be at least 130 pixels.
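The two guidelines above reduce to simple arithmetic on the image width and height. The following is a minimal sketch and is not part of PowerAI Vision. The function name image_size_ok is hypothetical, and it assumes that scaling preserves the aspect ratio, so the scaled shortest edge is the shortest edge multiplied by 1000 and divided by the longest edge.

```shell
# Hypothetical helper, not part of PowerAI Vision.
# Succeeds (exit status 0) if an image of WIDTH x HEIGHT pixels meets the
# two guidelines. Usage: image_size_ok WIDTH HEIGHT
image_size_ok() {
    width=$1
    height=$2
    if [ "$width" -ge "$height" ]; then
        long=$width; short=$height
    else
        long=$height; short=$width
    fi
    # Guideline 1: the shortest edge must be at least 130 pixels.
    [ "$short" -ge 130 ] || return 1
    # Guideline 2: if the longest edge exceeds 1000 pixels, the image is
    # scaled so that the longest edge becomes 1000 (assumed proportional).
    # Checking short * 1000 >= 130 * long avoids rounding, because the
    # rounding used by the product is not documented.
    if [ "$long" -gt 1000 ]; then
        [ $((short * 1000)) -ge $((long * 130)) ] || return 1
    fi
    return 0
}
```

For example, a 4000 x 520 image passes because it scales to 1000 x 130, while a 1300 x 130 image fails because its shortest edge scales to 100 pixels.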
Auto labeling of a data set returns "Auto Label Error"
- Problem
- Auto labeling cannot be performed on a data set that does not have unlabeled images, unless some of the images were previously labeled by the auto label function.
- Solution
- Ensure that the Objects section of the data set side bar shows there are objects that are "Unlabeled". If there are none, that is, if "Unlabeled (0)" is displayed in the side bar, add new images that are unlabeled or remove labels from some images, then run auto label again.
PowerAI Vision does not start
- Problem
- When you enter the URL for PowerAI Vision from a supported web browser, nothing is displayed. You see a 404 error or Connection Refused message.
- Solution
- Complete the following steps to solve this problem:
- Verify that IP version 4 (IPv4) port forwarding is enabled by running the /sbin/sysctl
net.ipv4.conf.all.forwarding command and verifying that the value for
net.ipv4.conf.all.forwarding is set to 1.
If IPv4 port forwarding is not enabled, run the /sbin/sysctl -w net.ipv4.conf.all.forwarding=1 command. For more information about port forwarding with Docker, see UCP requires IPv4 IP Forwarding in the Docker success center.
- If IPv4 port forwarding is enabled and the docker0 interface is a member of the trusted
zone, check the Helm chart status by running this
script:
sudo /opt/powerai-vision/bin/helm.sh status vision
In the script output, verify that the PowerAI Vision components are available by locating the Deployment section and identifying that the AVAILABLE column has a value of 1 for each component. The following is an example of the output from the helm.sh status vision script that shows all components are available:
RESOURCES:
==> v1beta1/Deployment
NAME                              DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
powerai-vision-mongodb            1        1        1           1          4d
powerai-vision-portal             1        1        1           1          4d
powerai-vision-postgres           1        1        1           1          4d
powerai-vision-taskanaly          1        1        1           1          4d
powerai-vision-ui                 1        1        1           1          4d
powerai-vision-video-nginx        1        1        1           1          4d
powerai-vision-video-portal       1        1        1           1          4d
powerai-vision-video-rabmq        1        1        1           1          4d
powerai-vision-video-redis        1        1        1           1          4d
powerai-vision-video-test-nginx   1        1        1           1          4d
powerai-vision-video-test-portal  1        1        1           1          4d
powerai-vision-video-test-rabmq   1        1        1           1          4d
powerai-vision-video-test-redis   1        1        1           1          4d
If you recently started PowerAI Vision and some components are not available, wait a few minutes for these components to become available. If any components remain unavailable, gather the logs and contact IBM® Support, as described in this topic: Gather PowerAI Vision logs and contact support.
- If the docker0 interface is a member of a trusted zone and all PowerAI Vision components are available, make sure that the firewall allows communication through port 443 (used to connect to PowerAI Vision) by running this command:
sudo firewall-cmd --permanent --zone=public --add-port=443/tcp
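The check of the AVAILABLE column in the Helm status output can be expressed as a small filter. The following is a minimal sketch and is not part of PowerAI Vision. The function name unavailable_components is hypothetical, and the parsing assumes the Deployment table layout of the example output in this topic, with AVAILABLE as the fifth column.

```shell
# Hypothetical helper, not part of PowerAI Vision.
# Reads "helm.sh status vision" output on standard input and prints the name
# of every component in the Deployment section whose AVAILABLE value is not 1.
# Assumption: columns are NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE.
unavailable_components() {
    awk '
        # A line such as "==> v1beta1/Deployment" starts a resource section.
        /^==> / { in_deploy = ($2 ~ /\/Deployment$/); next }
        # Ignore other sections, the header row, and short or blank lines.
        !in_deploy || $1 == "NAME" || NF < 6 { next }
        $5 != 1 { print $1 }
    '
}

# Example usage:
#   sudo /opt/powerai-vision/bin/helm.sh status vision | unavailable_components
```

No output means that every listed component reports an AVAILABLE value of 1.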
PowerAI Vision fails to start - Kubernetes connection issue
- Problem
- If the host system does not have a default route defined in the networking configuration, the
Kubernetes cluster will fail to start with connection issues. For example:
$ sudo /opt/powerai-vision/bin/powerai_vision_start.sh
INFO: Setting up GPU...
[...]
Checking kubernetes cluster status...
The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #1: The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #2: The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #3: The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #4: The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #5: The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #6: The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #7: The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #8: The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #9: The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #10: The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #11: ERROR: Retry timeout.
Error in starting kubernetes cluster, please check /opt/powerai-vision/log/kubernetes for logs.
- Solution
- Define a default route in the networking configuration. For instructions to do this on Red Hat Enterprise Linux (RHEL), refer to 2.2.4 Static Routes and the Default Gateway in the Red Hat Customer Portal.
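Whether a default route is defined can be checked from the routing table before starting PowerAI Vision. The following is a minimal sketch and is not part of PowerAI Vision. The function name has_default_route is hypothetical, and it assumes the output format of the ip route show command from the iproute2 tools, in which a default route is listed on a line that begins with the word default.

```shell
# Hypothetical helper, not part of PowerAI Vision.
# Succeeds (exit status 0) if the routing table text on standard input
# contains a line that begins with "default ".
has_default_route() {
    grep -q '^default '
}

# Example usage on a Linux host with the iproute2 tools installed:
#   ip route show | has_default_route || echo "No default route is defined." >&2
```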
PowerAI Vision startup hangs - helm issue
- Problem
- PowerAI Vision startup hangs with the message "Unable to
start helm within 30 seconds - trying again." For
example:
root> sudo /opt/powerai-vision/bin/powerai_vision_start.sh
Checking ports usage...
Checking ports completed, no confict port usage detected.
[ INFO ] Setting up the GPU...
Init cuda devices...
Devices init completed!
Persistence mode is already Enabled for GPU 00000004:04:00.0.
Persistence mode is already Enabled for GPU 00000004:05:00.0.
Persistence mode is already Enabled for GPU 00000035:03:00.0.
Persistence mode is already Enabled for GPU 00000035:04:00.0.
All done.
[ INFO ] Starting kubernetes...
Checking kubernetes cluster status...
Probing cluster status #1: NotReady
Probing cluster status #2: NotReady
Probing cluster status #3: NotReady
Probing cluster status #4: Ready
Booting up ingress controller...
Initializing helm...
[ WARN ] Unable to start helm within 30 seconds - trying again. If this continues, contact support.
[ WARN ] Unable to start helm within 30 seconds - trying again. If this continues, contact support.
[ WARN ] Unable to start helm within 30 seconds - trying again. If this continues, contact support.
[ WARN ] Unable to start helm within 30 seconds - trying again. If this continues, contact support.
- Solution
- To solve this problem, you must follow these steps exactly as written:
- Cancel PowerAI Vision startup by pressing Ctrl+C.
- Stop PowerAI Vision by running this command:
sudo /opt/powerai-vision/bin/powerai_vision_stop.sh
- Modify the Red Hat Enterprise Linux (RHEL) settings as follows:
sudo nmcli connection modify docker0 connection.zone trusted
sudo systemctl stop NetworkManager.service
sudo firewall-cmd --permanent --zone=trusted --change-interface=docker0
sudo systemctl start NetworkManager.service
sudo nmcli connection modify docker0 connection.zone trusted
sudo systemctl restart docker.service
- Start PowerAI Vision again:
sudo /opt/powerai-vision/bin/powerai_vision_start.sh
If the above commands do not fix the startup issue, check for a cgroup leak that can impact Docker. A Kubernetes/Docker issue can cause this situation, and after fixing the firewall issue the startup can still fail if there was cgroup leakage.
One symptom of this situation is that the df command is slow to respond. To check for excessive cgroup mounts, run the mount command:
$ mount | grep cgroup | wc -l
If the cgroup count is in the thousands, reboot the system to clear the cgroups.
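The cgroup count check can be wrapped in a small test that compares the count with a threshold. The following is a minimal sketch and is not part of PowerAI Vision. The function names and the CGROUP_MOUNT_LIMIT variable are hypothetical, and the threshold of 1000 is an assumption about what a count in the thousands means.

```shell
# Hypothetical helpers, not part of PowerAI Vision.
# Assumption: a count of 1000 or more is treated as a suspected leak.
CGROUP_MOUNT_LIMIT=1000

# Count the lines that mention cgroup in mount output read from standard input.
count_cgroup_mounts() {
    grep -c cgroup
}

# Succeed (exit status 0) if the count reaches the threshold.
cgroup_leak_suspected() {
    [ "$(count_cgroup_mounts)" -ge "$CGROUP_MOUNT_LIMIT" ]
}

# Example usage:
#   mount | cgroup_leak_suspected && echo "Possible cgroup leak; consider a reboot." >&2
```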
Helm status errors when starting PowerAI Vision
- Problem
- There is an issue in some RHEL releases that causes the startup of PowerAI Vision to fail after restarting the host system. When this is
the problem, the system tries to initialize Helm at 30 second intervals but never succeeds.
Therefore, the startup never succeeds. You can verify this status by running the helm status vision
command:
# /opt/powerai-vision/bin/helm status vision
Result:
Error: getting deployed release "vision": Get https://10.10.0.1:443/api/v1/namespaces/kube-system/configmaps[...]: dial tcp 10.10.0.1:443: getsockopt: no route to host
- Solution
- To solve this problem, you must follow these steps exactly as written:
- Cancel PowerAI Vision startup by pressing Ctrl+C.
- Stop PowerAI Vision by running this command:
sudo /opt/powerai-vision/bin/powerai_vision_stop.sh
- Modify the Red Hat Enterprise Linux (RHEL) settings as follows:
sudo nmcli connection modify docker0 connection.zone trusted
sudo systemctl stop NetworkManager.service
sudo firewall-cmd --permanent --zone=trusted --change-interface=docker0
sudo systemctl start NetworkManager.service
sudo nmcli connection modify docker0 connection.zone trusted
sudo systemctl restart docker.service
- Start PowerAI Vision again:
sudo /opt/powerai-vision/bin/powerai_vision_start.sh
If the above commands do not fix the startup issue, check for a cgroup leak that can impact Docker. A Kubernetes/Docker issue can cause this situation, and after fixing the firewall issue the startup can still fail if there was cgroup leakage.
One symptom of this situation is that the df command is slow to respond. To check for excessive cgroup mounts, run the mount command:
$ mount | grep cgroup | wc -l
If the cgroup count is in the thousands, reboot the system to clear the cgroups.
Some PowerAI Vision functions don't work
- Problem
- PowerAI Vision seems to start correctly, but some functions, like automatic labeling or automatic frame capture, do not work. To verify that this is the problem, run /opt/powerai-vision/bin/kubectl.sh get pods and verify that one or more pods are in the CrashLoopBackOff state. For example:
kubectl get pods
NAME                                          READY   STATUS             RESTARTS   AGE
...
powerai-vision-video-rabmq-5d5d786f9f-7jfk9   0/1     CrashLoopBackOff   2          54s
- Solution
- PowerAI Vision requires IPv6. Enable IPv6 on the system.
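Pods in the CrashLoopBackOff state can be picked out of a long pod listing with a one-line filter. The following is a minimal sketch and is not part of PowerAI Vision. The function name pods_in_crash_loop is hypothetical, and it assumes that STATUS is the third column, as in the example output in this topic.

```shell
# Hypothetical helper, not part of PowerAI Vision.
# Reads "kubectl.sh get pods" output on standard input and prints the name
# of every pod whose STATUS (third column) is CrashLoopBackOff.
pods_in_crash_loop() {
    awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Example usage:
#   /opt/powerai-vision/bin/kubectl.sh get pods | pods_in_crash_loop
#
# On Linux, whether IPv6 is disabled can usually be checked with:
#   sysctl net.ipv6.conf.all.disable_ipv6
# A value of 1 means that IPv6 is disabled on all interfaces.
```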
Command line tool fails - missing options
- Problem
- You receive errors using one of the PowerAI Vision command line tools, indicating that parameters are missing or options are not recognized.
- Solution
- Validate all of the hyphen (-) characters used to specify command line options. When using international keyboards, a similar but different character might have been used for command line options in a shell. For example, Unicode contains several similar characters: the hyphen-minus (the ASCII hyphen), the hyphen, the em dash, and others. Only the hyphen-minus character is valid to indicate a command line option.
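Look-alike dash characters are hard to spot by eye, but they are all outside the ASCII range, so they can be found mechanically. The following is a minimal sketch and is not part of PowerAI Vision. The function name find_non_ascii_args is hypothetical. It flags any argument that contains a byte outside printable ASCII, which also flags legitimate non-ASCII values such as file names.

```shell
# Hypothetical helper, not part of PowerAI Vision.
# Prints each argument that contains a byte outside the printable ASCII
# range (space through tilde). The hyphen-minus is ASCII, so a correctly
# typed option such as --help is never flagged.
find_non_ascii_args() {
    for arg in "$@"; do
        # LC_ALL=C makes grep compare single bytes instead of characters.
        if printf '%s' "$arg" | LC_ALL=C grep -q '[^ -~]'; then
            printf '%s\n' "$arg"
        fi
    done
}

# Example usage: paste the options of the failing command as arguments.
#   find_non_ascii_args --option1 value1 --option2 value2
```

Any argument that is printed should be retyped with the hyphen-minus key.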