Known issues for watsonx Orchestrate
The following known issues and limitations apply to watsonx Orchestrate.
- After adding users to watsonx Orchestrate, Personal Skills are not displayed for the users
- watsonx Orchestrate pod component limits are not defined for resource scaling
- watsonx Assistant Dialog pod experiences continuous increased memory utilization and OOMKill
- CR remains in the In Progress status after rebooting the cluster
- A mongodb-wo-mongo-ops-manager-db pod error can halt the watsonx Orchestrate installation
- Adding skills before synchronization results in an error
- An error is displayed when connecting to a skill
- Cannot upload files over 6 MB by using Box skills
- Unable to add input to the connection fields of Salesforce
- OAuth 2.0 web authentication is not supported when adding new skills
- RabbitMQ cluster is set to In progress or Fail and wo-rabbitmq-orchestrate-backup-label has not completed after 1 minute
After adding users to watsonx Orchestrate, Personal Skills are not displayed for the users
Applies to: 4.8.5
- Problem
- Unable to create more than 30 instances of watsonx Assistant to support Personal Skills in watsonx Orchestrate.
- Cause
- watsonx Assistant supports a maximum of 30 instances per watsonx Orchestrate instance.
- Solution
- To solve the problem, apply the following patch:
INSTANCE_NAME=$(oc -n <cpd-namespace> get wa --output jsonpath='{.items[0].metadata.name}')
oc patch wa ${INSTANCE_NAME} --type='merge' -p='{"configOverrides":{"store":{"extra_vars": {"store": {"MAX_NEW_IA_ASSISTANTS":"1000","MAX_NEW_IA_SKILLS":"300000","ASSISTANT_MAX_PAGE_LIMIT":"1000"}}}}}'
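To confirm that the override was applied, you can read it back from the custom resource. This is a minimal check, assuming the configOverrides path that the preceding patch merges:
# Print the store overrides that the patch merged (path assumed from the patch above)
oc -n <cpd-namespace> get wa ${INSTANCE_NAME} --output jsonpath='{.configOverrides.store.extra_vars.store}'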
watsonx Orchestrate pod component limits are not defined for resource scaling
Applies to: 4.8.4 and 4.8.5
- Problem
- Resource requests and limits are not defined for the watsonx Orchestrate pod components, so pod resources cannot be optimized.
- Cause
- Support for resource scaling in watsonx Orchestrate is limited.
- Solution
- To resolve this issue, disable the reconcile loop for the pod components by using one of the following methods (a combined sketch follows this list):
- Apply the following label to the component that you want to scale:
wo.watsonx.ibm.com/hands-off=true
This disables the reconcile loop for the component, and you can modify the CR per your requirements. A limitation of this method is that not all components expose their resources in the CR, so you cannot change the default resources for those components.
- After the watsonx Orchestrate deployment is complete, scale down the watsonx Orchestrate component operator. To scale down, navigate to the component operator deployment and manually scale it to 0, or use the following command:
oc scale -n ${OPERATOR_NS} deploy/ibm-wxo-component-operator-controller-manager --replicas 0
This action disables the reconcile loop for all components, and you can modify the resources per your requirements. This method also lets you scale deployments directly and modify components that do not expose resources in their CRs.
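The following sketch combines both methods. The component kind and name are placeholders; the exact resource that takes the label depends on which component you want to remove from reconciliation:
# Method 1 (hypothetical target): apply the hands-off label to the component that you want to scale
oc -n <cpd-namespace> label <component-kind>/<component-name> wo.watsonx.ibm.com/hands-off=true
# Method 2, reversed: restore reconciliation after you finish tuning resources
oc scale -n ${OPERATOR_NS} deploy/ibm-wxo-component-operator-controller-manager --replicas 1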
watsonx Assistant Dialog pod experiences continuous increased memory utilization and OOMKill
Applies to: 4.8.5
- Problem
- During testing, the wa-dialog pod experienced increased memory utilization, and one of the pods was OOMKilled.
- Cause
- This problem is caused by a low memory limit for watsonx Assistant.
- Solution
- To solve the problem, apply the following patch. Note: Apply this patch on a watsonx Assistant deployment with the medium t-shirt size and HPA enabled.
INSTANCE_NAME=$(oc -n <cpd-namespace> get wa --output jsonpath='{.items[0].metadata.name}')
oc -n <cpd-namespace> patch wa ${INSTANCE_NAME} --type=merge -p="{\"configOverrides\":{\"dialog\":{\"resources\":{\"limits\":{\"memory\":\"1.5Gi\"}}}}}"
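To verify that the new limit is in place, you can read it back from the custom resource. This is a minimal check, assuming the configOverrides path that the preceding patch merges:
# Print the dialog memory limit that the patch merged (path assumed from the patch above)
oc -n <cpd-namespace> get wa ${INSTANCE_NAME} --output jsonpath='{.configOverrides.dialog.resources.limits.memory}'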
CR remains in the In Progress status after rebooting the cluster
Applies to: 4.8.4
Fixed in: 4.8.5
- Problem
- After a cluster reboot, the CR gets stuck in the In Progress status:
watsonx_orchestrate WatsonxOrchestrate wo zen InProgress 1.0.0 2024-03-21T09:30:22Z 1.0.0 N/A
- Cause
- This problem occurs while Kafka waits to reconcile its topics.
- Solution
- To resolve this problem, delete the pending Kafka pod, for example, the wo-watson-orchestrate-kafkaibm-kafka-2 pod that is named in the following conditions:
conditions:
- lastTransitionTime: "2024-03-22T14:05:09.741114596Z"
  message: Pod wo-watson-orchestrate-kafkaibm-kafka-2 cannot be updated right now.
  reason: UnforceableProblem
  status: "True"
  type: NotReady
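Deleting the pod lets it be re-created and clears the condition. A minimal example, assuming the pod name from the condition above and your Cloud Pak for Data namespace:
# Delete the pending Kafka pod so that it is re-created
oc -n <cpd-namespace> delete pod wo-watson-orchestrate-kafkaibm-kafka-2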
A mongodb-wo-mongo-ops-manager-db pod error can halt the watsonx Orchestrate installation
Applies to: 4.8.4
Fixed in: 4.8.5
- Error
- This issue does not show a clear error in the pod or in the opsmanager CR. To confirm that the mongodb-wo-mongo-ops-manager-db pod error has occurred in the cluster, run the following command:
oc exec -it -c mongodb-agent mongodb-wo-mongo-ops-manager-db-0 -- /opt/scripts/readinessprobe
If the error has occurred in the cluster, the following is displayed:
panic: Get "https://172.30.0.1:443/api/v1/namespaces/cpd-instance-1/pods/mongodb-wo-mongo-ops-manager-db-0": dial tcp 172.30.0.1:443: i/o timeout
- Cause
- This error occurs in certain clusters when the network does not allow the MongoDB readinessprobe calls to the Kubernetes API.
- Solution
- To resolve this error, run the following commands:
# Update opsmanager labels to allow access
oc patch opsmanager mongodb-wo-mongo-ops-manager --type merge --patch '{"spec":{"applicationDatabase":{"podSpec":{"podTemplate":{"metadata":{"labels":{"wo.watsonx.ibm.com/external-access":"true"}}}}}}}'
# Delete existing sts to force re-rollout
oc delete sts mongodb-wo-mongo-ops-manager-db
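After the StatefulSet is re-created, you can rerun the readiness check from the Error section to confirm that the timeout is resolved:
# Confirm that the StatefulSet is back, then rerun the readiness probe
oc get sts mongodb-wo-mongo-ops-manager-db
oc exec -it -c mongodb-agent mongodb-wo-mongo-ops-manager-db-0 -- /opt/scripts/readinessprobe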
Adding skills before synchronization results in an error
Applies to: 4.8.4
- Error
- When you add skills before synchronization, you might see the following error:
Failed to complete this operation due to the following errors. Failed to upskill skill because {'status': 'failed', 'status_code':500, 'message':'"next_action"', 'detailed_message': None, 'other_details': None, 'is_available': True, 'skill_set_orch':[]}. Skill has not bootstrapped.
- Cause
- Synchronization of applications and skills takes up to five minutes to complete after the service instance is created.
- Solution
- Wait up to five minutes for the synchronization of applications and skills to complete.
An error is displayed when connecting to a skill
Applies to: 4.8.4
- Error
- When you try to connect to a skill, you might see the following error:
You couldn't connect. An error occurred while processing the request. Error: unable to get local issuer certificate.
- Cause
- IBM® watsonx Orchestrate does not support insecure communication with servers that use self-signed certificates or any certificates that are not signed by a trusted Certificate Authority (CA).
- Solution
- The server administrator must use a certificate that is signed by a trusted CA.
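One way to check whether a server presents a certificate that is signed by a trusted CA is with openssl. This is a minimal sketch, assuming openssl is installed and the server is reachable on port 443; the host name is a placeholder:
# Inspect the certificate chain; "Verify return code: 0 (ok)" indicates a trusted CA
openssl s_client -connect <server-host>:443 -showcerts </dev/null | grep 'Verify return code'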
Cannot upload files over 6 MB by using Box skills
Applies to: 4.8.4
- Problem
- You cannot upload files over 6 MB by using Box skills.
- Cause
- This issue is caused by the 6 MB file size limit of Box skills.
- Solution
- Upload the file manually, or compress the file to a size smaller than 6 MB.
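A minimal example of compressing a file before upload, assuming the zip utility is available; the file names are placeholders:
# Compress the file and confirm that the result is under 6 MB
zip report.zip report.docx
ls -lh report.zip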
Unable to add input to the connection fields of Salesforce
Applies to: 4.8.4
- Problem
- You are unable to add input to the connection fields of Salesforce because the focus does not stay on the input fields.
- Cause
- This issue is caused by a known user interface defect.
- Solution
- Click the border of an input field to keep the focus on the field.
OAuth 2.0 web authentication is not supported when adding new skills
Applies to: 4.8.4
- Problem
- OAuth 2.0 web authentication is not supported when you add new skills.
- Cause
- This issue is caused by browser redirection.
- Solution
- Use a different authentication method.
RabbitMQ cluster is set to In progress or Fail and wo-rabbitmq-orchestrate-backup-label has not completed after 1 minute
Applies to: 4.8.4
- Problem
- The watsonx Orchestrate CR is stuck at 32 pods deployed:
# oc get WatsonxOrchestrate wo -n bvt
NAME   DEPLOYED   VERIFIED   TOTAL
wo     32         31         47
Check whether the following log appears in the job:
# oc logs job/wo-rabbitmq-orchestrate-backup-label
Unable to connect to the server: dial tcp 172.30.0.1:443: i/o timeout
Retries=18 waiting for all PVCs to be ready
E0328 14:58:32.163323     208 memcache.go:265] couldn't get current server API group list: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: i/o timeout
- Cause
- The installation in a production environment fails because of RabbitMQ instability and displays the following error: "WO-Mongo is setting cpdmongodbservice-cr to fail".
- Solution
- To solve the problem, apply the following patch:
# Add rabbit label
oc patch rabbitmqcluster wo-rabbitmq --type 'merge' --patch '{"metadata":{"annotations":{"wo.watsonx.ibm.com/hands-off":"true"}},"spec":{"global":{"podLabels":{"wo.watsonx.ibm.com/external-access":"true"}}}}'
# Restart rabbitmq job
oc delete job wo-rabbitmq-orchestrate-backup-label
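Deleting the job causes it to be re-created. To confirm that it completes and that the CR progresses past 32 deployed pods, you can watch both resources; the bvt namespace is taken from the example above:
# Confirm that the re-created job completes
oc get job wo-rabbitmq-orchestrate-backup-label -n bvt
# Watch the CR deploy the remaining pods
oc get WatsonxOrchestrate wo -n bvt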