Known issues and limitations for Watson Discovery
The following known issues and limitations apply to the Watson Discovery service.
- RabbitMQ pod continues to run when Watson Discovery is shut down
- Upgrade to 4.8.7 does not complete
- Failed to restore Elasticsearch data in Watson Discovery
- Custom resources are not accessible from the Teach domain concepts section after upgrading
- The Elasticsearch statefulsets do not scale up
- Elasticsearch pods are not ready
- Secrets are no longer automatically generated when the integrated OpenShift image registry is disabled
- Watson Discovery installation or upgrade does not complete because certain pods fail
- Unable to add documents during upgrade of Watson Discovery
- Watson Gateway pods in a crash loop after upgrading Watson Discovery
- The etcd operator script fails while upgrading Watson Discovery
- Watson Discovery orchestrator pods not starting because ResourceQuota is applied to the namespace
- Dictionary and Part of Speech facets are not shown in Content Mining projects
- Upgrade fails due to existing Elasticsearch 6.x indices
- During shutdown the DATASTOREQUIESCE field does not update
- UpgradeError is shown after resizing PVC
- Disruption of service after upgrading, restarting, or scaling by updating scaleConfig
- MinIO gets stuck in a loop after several installation attempts
RabbitMQ pod continues to run when Watson Discovery is shut down
Applies to: 4.8.0 to 4.8.5
Fixed in: 4.8.6
- Error
-
If you run the following command to shut down the Watson Discovery service, the RabbitMQ pod continues to run:
cpd-cli manage shutdown \
--components=watson_discovery \
--cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
--include_dependency=true
The following is an example of this error:
$ oc get pod -l 'icpdsupport/addOnId in (discovery)' | grep -v Completed
NAME                      READY   STATUS    RESTARTS   AGE
wd-rabbitmq-discovery-0   1/1     Running   0          35h
- Cause
-
This error occurs due to an issue with scaling down RabbitMQ.
- Solution
- Run the following command to manually change the replicas of the RabbitMQ CR to 0:
oc -n ${PROJECT_CPD_INST_OPERANDS} patch rabbitmqclusters.rabbitmq.opencontent.ibm.com wd-rabbitmq --type='json' -p='[{"op": "replace", "path": "/spec/replicas", "value":0}]'
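To confirm that the RabbitMQ pod stopped, you can re-run the pod listing from the error description; a quick check with the same label selector:
oc get pod -l 'icpdsupport/addOnId in (discovery)' | grep wd-rabbitmq
No wd-rabbitmq-discovery pod is listed once the scale-down takes effect.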
Upgrade to 4.8.7 does not complete
Applies to: 4.8.7
- Error
-
Upgrade to 4.8.7 does not complete. The oc get pods -l run=elastic command shows the highest-ordinal pods in a crash loop back-off state:
% oc get pods -l run=elastic
NAME                                                  READY   STATUS             RESTARTS        AGE
wd-ibm-elasticsearch-create-snapshot-repo-job-k9pvt   0/1     Completed          0               52m
wd-ibm-elasticsearch-create-snapshot-repo-job-khr66   0/1     Completed          4               140m
wd-ibm-elasticsearch-es-server-client-0               2/2     Running            0               87m
wd-ibm-elasticsearch-es-server-client-1               1/2     CrashLoopBackOff   13 (78s ago)    52m
wd-ibm-elasticsearch-es-server-data-0                 2/2     Running            0               87m
wd-ibm-elasticsearch-es-server-data-1                 1/2     CrashLoopBackOff   13 (98s ago)    52m
wd-ibm-elasticsearch-es-server-master-0               1/2     CrashLoopBackOff   13 (109s ago)   52m
The OpenSearch cluster allocation explain API indicates the following error:
cannot allocate replica shard to a node with version [2.14.0] since this is older than the primary version [2.16.0]
% oc rsh -n zen -c elasticsearch wd-ibm-elasticsearch-es-server-data-0 bash -c 'curl -ksS -u ${ELASTIC_USER}:${ELASTIC_PASSWORD} "${ELASTIC_ENDPOINT}/_cluster/allocation/explain?pretty=true"'
{
  "index": ".ltrstore",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "REPLICA_ADDED",
    "at": "2024-09-12T02:38:17.732Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "_h0I9tQWQ-qk45yRd81Hmg",
      "node_name": "wd-ibm-elasticsearch-es-server-data-0",
      "transport_address": "127.0.0.1:9800",
      "node_attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "node_version",
          "decision": "NO",
          "explanation": "cannot allocate replica shard to a node with version [2.14.0] since this is older than the primary version [2.16.0]"
        }
      ]
    },
    {
      "node_id": "jrmd6qIKRV62_WQmh2aaHg",
      "node_name": "wd-ibm-elasticsearch-es-server-data-1",
      "transport_address": "127.0.0.1:9801",
      "node_attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.ltrstore][0], node[jrmd6qIKRV62_WQmh2aaHg], [P], s[STARTED], a[id=MT-6m6T2THaAPJD3qU8S1g]]"
        }
      ]
    }
  ]
}
- Cause
-
During the rolling upgrade from 2.14 to 2.16, the OpenSearch cluster sometimes routes primary index shards to the 2.16 nodes prematurely. When the replica shards of these indices have no other 2.16 node to be routed to, the cluster gets stuck in a yellow health state, which prevents the lower-ordinal pods from updating and stalls the upgrade.
- Solution
-
- Stop the Watson Discovery operator:
oc scale deploy wd-discovery-operator --replicas=0 --namespace=${PROJECT_CPD_INST_OPERATORS}
- Relax the cluster health check so that the lower-ordinal pods update even though the cluster is not nominally healthy:
oc patch elasticsearchcluster/wd -n ${PROJECT_CPD_INST_OPERANDS} --type=merge --patch='{"spec":{"clusterHealthCheckParams":"wait_for_status=yellow&timeout=1s"}}'
- Watch the pod status and wait for the 0th-ordinal pods to restart successfully:
oc get pods -l run=elastic -n ${PROJECT_CPD_INST_OPERANDS}
- Start the Watson Discovery operator, which returns the cluster health check to its normal behavior:
oc scale deploy wd-discovery-operator --replicas=1 --namespace=${PROJECT_CPD_INST_OPERATORS}
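You can confirm cluster recovery at any point with the standard OpenSearch health API; a sketch that reuses the in-pod credentials from the diagnostics above:
oc rsh -n ${PROJECT_CPD_INST_OPERANDS} -c elasticsearch wd-ibm-elasticsearch-es-server-data-0 bash -c 'curl -ksS -u ${ELASTIC_USER}:${ELASTIC_PASSWORD} "${ELASTIC_ENDPOINT}/_cluster/health?pretty=true"'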
Failed to restore Elasticsearch data in Watson Discovery
Applies to: 4.8.x and later
- Error
-
After running OADP restore, Elasticsearch data is sometimes not restored in Watson Discovery. If this problem occurs, you can see an error message in the CPD-CLI*.log file under the cpd-cli-workspace/logs directory, for example:
"[cloudpak:cloudpak_snapshot_2024-09-01-15-07-58/COvZbNZfTgGYBZ7OfSfOfA] cannot restore index [.ltrstore] because an open index with same name already exists in the cluster. Either close or delete the existing index or restore the index under a different name by providing a rename pattern and replacement name"
- Cause
-
This error occurs when the .ltrstore index is created by deployment/wd-discovery-training-crud before the backup data is restored.
- Solution
-
- Go to the PROJECT_CPD_INST_OPERANDS namespace:
oc project ${PROJECT_CPD_INST_OPERANDS}
- Get an Elasticsearch pod name:
pod=$(oc get pod -l icpdsupport/addOnId=discovery,app=elastic,role=master,tenant=wd --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
- Note the number of replicas of deployment/wd-discovery-training-crud:
oc get deployment wd-discovery-training-crud -o jsonpath='{.spec.replicas}'
- Scale down deployment/wd-discovery-training-crud:
oc patch wd wd --type merge --patch '{"spec": {"wire": {"trainingCrud": {"trainingCrudReplicas": 0}}}}'
- Delete the .ltrstore index:
oc exec $pod -c elasticsearch -- bash -c 'curl -XDELETE -s -k -u ${ELASTIC_USER}:${ELASTIC_PASSWORD} "${ELASTIC_ENDPOINT}/.ltrstore"'
- Get the snapshot name that includes the data of Watson Discovery:
oc exec $pod -c elasticsearch -- bash -c 'curl -XGET -s -k -u ${ELASTIC_USER}:${ELASTIC_PASSWORD} "${ELASTIC_ENDPOINT}/_cat/snapshots/cloudpak?h=id&s=end_epoch"'
The command output indicates the latest snapshot name, for example:
cloudpak_snapshot_2024-09-01-15-07-58
- Restore using the snapshot (replace <snapshot-name> with your snapshot name):
oc exec $pod -c elasticsearch -- bash -c 'curl -XPOST -s -k -u ${ELASTIC_USER}:${ELASTIC_PASSWORD} "${ELASTIC_ENDPOINT}/_snapshot/cloudpak/<snapshot-name>/_restore"'
- Scale deployment/wd-discovery-training-crud up to its original state:
oc patch wd wd --type merge --patch '{"spec": {"wire": {"trainingCrud": {"trainingCrudReplicas": <number-of-original-replicas>}}}}'
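After the restore, you can confirm that the .ltrstore index exists again; a minimal check using the same in-pod credentials:
oc exec $pod -c elasticsearch -- bash -c 'curl -XGET -s -k -u ${ELASTIC_USER}:${ELASTIC_PASSWORD} "${ELASTIC_ENDPOINT}/_cat/indices/.ltrstore?v"'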
Custom resources are not accessible from the Teach domain concepts section after upgrading
Applies to: Upgrading from 4.7.1 and 4.7.2 to any later version
- Error
-
In rare cases, a resource clean-up job might invalidate resources in certain projects when upgrading Watson Discovery. Invalidated resources lead to issues such as dictionaries and entity extractors not being accessible from the Teach domain concepts section of the Improvement tools panel on the Improve and customize page.
- Cause
-
An issue with the resource clean-up job in 4.7.1 and 4.7.2 invalidates the project resources, resulting in this issue.
- Solution
- Scale down the wd-cnm-api pod before upgrading Watson Discovery from 4.7.1 or 4.7.2:
oc -n ${namespace} patch wd wd --type=merge --patch '{"spec": {"cnm": {"apiServer": {"replicas": 0}}}}'
After completing the upgrade process, either scale up the pod to its default value or scale the pod to a specific number of replicas.
To scale up the pod to its default value, run the following commands:
oc -n ${namespace} patch wd wd --type=merge --patch '{"spec": {"cnm": {"apiServer": {"replicas": 1}}}}'
oc -n ${namespace} patch wd wd --type=json --patch '[{"op":"remove","path":"/spec/cnm"}]'
To scale the pod to a specific number of replicas, run the following commands:
oc -n ${namespace} patch wd wd --type=merge --patch '{"spec": {"cnm": {"apiServer": {"replicas": 1}}}}'
oc -n ${namespace} patch wd wd --type=merge --patch "{\"spec\": {\"cnm\": {\"apiServer\": {\"replicas\": ${num_of_replicas}}}}}"
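After scaling, you can confirm the number of wd-cnm-api replicas that are running; a quick check that assumes the pods keep the wd-cnm-api name prefix:
oc -n ${namespace} get pods | grep wd-cnm-api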
The Elasticsearch statefulsets do not scale up
Applies to: 4.8.6 and later
- Error
-
The Elasticsearch statefulsets do not scale up when you change the scale config or scale up the Elasticsearch replicas. The following is an example of this error:
oc get pod | grep wd-ibm-elasticsearch
wd-ibm-elasticsearch-es-server-client-0   2/2   Running   0   5h21m
wd-ibm-elasticsearch-es-server-client-1   1/2   Running   0   8m13s
wd-ibm-elasticsearch-es-server-data-0     2/2   Running   0   5h31m
wd-ibm-elasticsearch-es-server-data-1     1/2   Running   0   8m15s
wd-ibm-elasticsearch-es-server-master-0   1/2   Running   0   7m43s
- Cause
-
This error occurs when the auto scaling configuration of the index attempts to create replicas for new pods, which results in a bad cluster state. This state prevents the configuration update of the existing pods.
- Solution
- To resolve the issue, follow the steps in Scaling the Elasticsearch cluster.
Elasticsearch pods are not ready
Applies to: 4.8.6 and later
- Error
- When upgrading Watson Discovery to 4.8.6, Elasticsearch pods are not ready and show the following status:
# oc -n ${PROJECT_CPD_INST_OPERANDS} get pods
...
wd-ibm-elasticsearch-es-server-client-1   1/2   Running   0   68m
wd-ibm-elasticsearch-es-server-data-1     1/2   Running   0   68m
...
- Cause
-
This issue is caused by instability of the Elasticsearch pods in Watson Discovery.
- Solution
-
Delete the current Elasticsearch custom resource. The Watson Discovery operator recreates the custom resource.
# oc delete elasticsearchclusters.elasticsearch.opencontent.ibm.com wd
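After deleting the custom resource, you can watch the operator re-create it and wait for the Elasticsearch pods to become ready; a quick check using the names and labels shown elsewhere on this page:
oc get elasticsearchclusters.elasticsearch.opencontent.ibm.com wd -n ${PROJECT_CPD_INST_OPERANDS}
oc get pods -l run=elastic -n ${PROJECT_CPD_INST_OPERANDS}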
Secrets are no longer automatically generated when the integrated OpenShift image registry is disabled
Applies to: 4.8.5 or earlier
Fixed in: 4.8.6
- Error
- An error occurs when pulling images while installing Watson Discovery.
Could not find imagePullSecret attached to ServiceAccount/wd-discovery-admin. Required value, spec.template.spec.containers[0].volumeMounts[3].name: Not found: "image-pull-secret"]","reason":"Invalid","details":{"name":"wd-discovery-ranker-master","group":"apps","kind":"Deployment","causes":[{"reason":"FieldValueRequired","message":"Required value","field":"spec.template.spec.volumes[2].secret.secretName"},{"reason":"FieldValueNotFound","message":"Not found: "image-pull-secret"","field":"spec.template.spec.containers[0].volumeMounts[3].name"}]}
- Cause
-
If you disable the ImageRegistry cluster capability or if you disable the integrated OpenShift® image registry in the cluster image registry operator’s configuration, a service account token secret and an image pull secret are no longer generated for each service account.
- Solution
-
You can either update the config.image/cluster resource as described in the OCP documentation or contact IBM® Support for assistance.
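For example, if the integrated registry was disabled in the operator configuration, re-enabling it sets the management state back to Managed. This is a sketch of the standard OpenShift command; the resource name configs.imageregistry.operator.openshift.io/cluster is the OCP default, so verify it against the documentation for your version:
oc patch configs.imageregistry.operator.openshift.io cluster --type merge -p '{"spec":{"managementState":"Managed"}}'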
Watson Discovery installation or upgrade does not complete because certain pods fail
Applies to: 4.8.2 to 4.8.5
Fixed in: 4.8.6
- Error
-
The Watson Discovery installation or upgrade process does not complete because certain pods fail:
NAME                                              READY   STATUS             RESTARTS         AGE
wd-discovery-entity-suggestion-74dbf8764f-f4xbw   0/1     Running            33 (5m34s ago)   153m
wd-discovery-wd-indexer-59c7d968d9-rrt4b          0/1     Running            7 (2m40s ago)    150m
wd-discovery-hdp-worker-1                         1/2     CrashLoopBackOff   32 (97s ago)     150m
wd-discovery-hdp-worker-0                         1/2     CrashLoopBackOff   32 (77s ago)     150m
wd-discovery-converter-94788d69c-76qlk            0/1     Running            24 (5m39s ago)   149m
wd-discovery-orchestrator-576bfbd4b7-r5xt4        0/1     CrashLoopBackOff   25 (3m4s ago)
- Cause
-
Certain pods can get stuck during start up.
- Solution
-
-
To determine whether pods do not start because of this issue, check the logs of one of the failing pods by using the following command:
oc logs <name_of_pod>
-
Verify whether the logs end with the following message:
The IBMJCEPlusFIPS provider is configured for FIPS 140-2. Please note that the 140-2 configuration may be removed in the future.
If you find this message at the end of the logs, contact IBM Support for assistance.
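To scan several failing pods at once, a small loop such as the following can help. This is a sketch; it assumes the failing pods are in the current namespace, and --all-containers avoids having to name a container on multi-container pods:
oc get pods --no-headers | grep -E 'CrashLoopBackOff|0/1' | awk '{print $1}' | while read p; do
  echo "== $p =="
  oc logs "$p" --all-containers --tail=5 | grep -i 'IBMJCEPlusFIPS'
done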
Unable to add documents during upgrade of Watson Discovery
Applies to: 4.8.3 or earlier
Fixed in: 4.8.4
- Error
-
While upgrading Watson Discovery from version 4.8.3 or earlier, Watson Discovery is unable to ingest documents because certain APIs return a 500 error. In addition, the wd-discovery-crawler pods fall into a CrashLoopBackOff state until the upgrade is completed.
- Cause
-
This error occurs because certain APIs related to document ingestion are unable to communicate with Postgres during an upgrade.
- Solution
- Ingest documents after the upgrade is complete, as you can confirm with the check below.
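A minimal way to confirm that the upgrade finished before resuming ingestion (the READY and READYREASON columns report True and Stable on completion, as in the examples elsewhere on this page):
oc get wd wd -n ${PROJECT_CPD_INST_OPERANDS}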
Watson Gateway pods in a crash loop after upgrading Watson Discovery
Applies to: 4.8.4
Fixed in: 4.8.5
- Error
-
After upgrading to Watson Discovery 4.8.4, you might observe that the Gateway pod is in a crash loop. Watson Discovery might also not report the updated version as expected.
- Cause
-
This error occurs as a result of an Out of Memory (OOM) issue.
- Solution
- Attempt to increase the memory resources:
oc get csv | grep gateway
oc edit csv
oc patch csv/ibm-watson-gateway-operator.v1.0.26 --type json -p '[{ "op": "replace", "path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory","value":"2Gi" }]'
oc patch csv/ibm-watson-gateway-operator.v1.0.26 --type json -p '[{ "op": "replace", "path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/memory","value":"2Gi" }]'
Adjust the CSV name in the patch commands according to your environment.
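After the operator rolls out with the new limits, a quick way to confirm that the Gateway pods leave the crash loop (pod names vary by environment):
oc get pods | grep gateway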
The etcd operator script fails while upgrading Watson Discovery
Applies to: 4.8.4 and 4.8.5
- Error
-
During Watson Discovery upgrade to version 4.8.4 or 4.8.5, the Ready status shows False and ReadyReason shows InProgress for a long time:
# oc get wd -n zen
NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE        DATASTOREQUIESCE   AGE
wd     4.8.4     False   InProgress    True       VerifyWait       11/23      10/23      NOT_QUIESCED   NOT_QUIESCED       2d17h
You can verify etcd in unverifiedComponents of the Watson Discovery CR:
oc get wd -n <ns> -o yaml
unverifiedComponents: etcd
Also, an error message similar to one of the following is displayed in the ibm-etcd-operator pod logs:
"msg": "An unhandled exception occurred while templating '{{ q('etcd_member', cluster_host= etcd_cluster_name + '-client.' + etcd_namespace + '.svc', cluster_port=etcd_client_port, ca_cert=tls_directory + '/etcd-ca.crt', cert_cert=tls_directory + '/etcd-client.crt', cert_key=tls_directory + '/etcd-client.key') }}'. Error was a <class 'ansible.errors.AnsibleError'>, original message: An unhandled exception occurred while running the lookup plugin 'etcd_member'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Unable to fetch members. Error: 'Client' object has no attribute 'server_version_sem'. Unable to fetch members. Error: 'Client' object has no attribute 'server_version_sem'"
Symptom:
TASK [etcdcluster : Enable authentication when secure client] ******************
task path: /opt/ansible/roles/etcdcluster/tasks/reconcile_pods.yaml:246
/usr/local/lib/python3.8/site-packages/etcd3/baseclient.py:97: Etcd3Warning: cannot detect etcd server version
1. maybe is a network problem, please check your network connection
2. maybe your etcd server version is too low, required: 3.2.2+
warnings.warn(Etcd3Warning("cannot detect etcd server version\n"
fatal: [localhost]: FAILED! => {
  "msg": "An unhandled exception occurred while running the lookup plugin 'etcd_auth'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Enabling authentication failed. Error: 'Client' object has no attribute 'server_version_sem'"
}
- Cause
-
A script in the etcd operator that sets authentication might fail. When it fails, the etcd operator does not deploy with authentication:enabled in the etcdcluster CR. This failure stops other components in the service from being upgraded and verified.
- Solution
- Attempt to re-execute the etcd operator by restarting the etcdcluster CR.
- Get the name of the service etcdcluster:
oc get etcdcluster | grep etcd <or name of the etcd cluster in the deployment>
- Delete the CR to allow the etcd operator to re-execute tasks:
oc delete etcdcluster <cluster>
- Wait until the etcdcluster and etcd pods are re-created.
- Check the status of Ready, Deployed, and Verified to make sure that the upgrade is successful:
# oc get wd
NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE        DATASTOREQUIESCE   AGE
wd     4.8.4     True    Stable        False      Stable           23/23      23/23      NOT_QUIESCED   NOT_QUIESCED       3d6h
Watson Discovery orchestrator pods not starting because ResourceQuota is applied to the namespace
Applies to: 4.8.2 and 4.8.3
Fixed in: 4.8.4
- Error
-
The wd-discovery-orchestrator-setup job fails to run because of an error similar to the following:
Error creating: pods "wd-discovery-orchestrator-setup-m5r5s" is forbidden: failed quota: cpd-quota: must specify limits.cpu for: verify-resources; limits.memory for: verify-resources; requests.cpu for: verify-resources; requests.memory for: verify-resources'
- Cause
-
The wd-discovery-orchestrator-setup job does not run when a ResourceQuota is applied to the namespace where Watson Discovery is installed without setting the LimitRange in the verify-resources container for the following: limits.cpu, limits.memory, requests.cpu, or requests.memory.
- Solution
- Fix the error by setting a LimitRange for limits and requests.
To set the LimitRange, complete the following steps:
- Create a new YAML file by copying the following text. Save the YAML file in a location from which you can access it in the next step.
apiVersion: oppy.ibm.com/v1
kind: TemporaryPatch
metadata:
  name: wd-orchestrator-setup-resource-patch
spec:
  apiVersion: discovery.watson.ibm.com/v1
  kind: WatsonDiscoveryOrchestrator
  name: wd
  patchType: patchStrategicMerge
  patch:
    orchestrator:
      job:
        spec:
          template:
            spec:
              containers:
                - name: verify-resources
                  resources:
                    limits:
                      cpu: "1"
                      ephemeral-storage: 1Gi
                      memory: 512Mi
                    requests:
                      cpu: "0.2"
                      ephemeral-storage: 1Mi
                      memory: 256Mi
- Run the following command in the namespace where Watson Discovery is installed:
oc apply -f <yaml-file> -n "${PROJECT_CPD_INST_OPERANDS}"
- Wait until the following message appears in the Watson Discovery pod logs:
"msg": "Starting reconciliation of TemporaryPatch/wd-orchestrator-setup-resource-patch"
- Delete the wd-discovery-orchestrator-setup job:
oc delete job/wd-discovery-orchestrator-setup
The operator restarts the job with the LimitRange for the limits and requests.
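To confirm that the re-created job completed with the patched limits, you can check its status; a quick check using the job name shown above:
oc get job wd-discovery-orchestrator-setup -n "${PROJECT_CPD_INST_OPERANDS}"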
Dictionary and Part of Speech facets are not shown in Content Mining projects
Applies to: 4.8.0 and 4.8.2
Fixed in: 4.8.3
- Error
-
In Content Mining projects, when you apply a dictionary annotator and one or more of the following enrichments to a collection, the dictionary and Part of Speech facets are not shown or appear empty.
- Entities v2
- Keywords
- Sentiment of Document
- Entity extractor
- Document classifier
- Cause
-
Dictionary and Part of Speech facets were unexpectedly removed from collections in Content Mining projects, resulting in this error.
- Solution
- Fix the error by applying a temporary patch.
To apply the patch, complete the following steps:
- Run the following command:
cat << EOF | oc apply -f -
apiVersion: oppy.ibm.com/v1
kind: TemporaryPatch
metadata:
  name: drop-annotations-patch
spec:
  apiVersion: discovery.watson.ibm.com/v1
  kind: WatsonDiscoveryEnrichment
  name: wd
  patchType: patchStrategicMerge
  patch:
    enrichment-service:
      deployment:
        spec:
          template:
            spec:
              containers:
                - name: annotator-manager
                  env:
                    - name: DROP_POS_ANNOTATIONS
                      value: "false"
EOF
- Wait for a few minutes until the wd-discovery-enrichment-service pods restart.
- Run Rebuild index for the collection.
If you want to remove the temporary patch, run the following command:
oc delete temporarypatch drop-annotations-patch
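To confirm that the patch reached the deployment, you can check for the environment variable; a quick check that assumes the deployment shares the wd-discovery-enrichment-service name with its pods:
oc get deployment wd-discovery-enrichment-service -o yaml | grep -A 1 DROP_POS_ANNOTATIONS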
Upgrade fails due to existing Elasticsearch 6.x indices
Applies to: 4.8.0 and later
- Error
- If the existing Elasticsearch cluster has indices created with Elasticsearch 6.x, then upgrading Watson Discovery to version 4.8.0 or later fails:
> oc get wd wd
NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE        DATASTOREQUIESCE   AGE
wd     4.8.0     False   InProgress    True       VerifyWait       2/24       1/24       NOT_QUIESCED   NOT_QUIESCED       63m
- Cause
- Watson Discovery checks for the existence of indices that were created with a deprecated Elasticsearch version when upgrading to version 4.8.0 or later.
- Solution
- To determine whether existing Elasticsearch 6.x indices are the cause of the upgrade failure, verify the log of the wd-discovery-es-detect-index pod, for example with a command like the one sketched below.
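A minimal sketch of reading that log, assuming the pod name starts with wd-discovery-es-detect-index and the service runs in ${PROJECT_CPD_INST_OPERANDS}; the exact command can differ in your environment:
oc logs -n ${PROJECT_CPD_INST_OPERANDS} $(oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -o name | grep wd-discovery-es-detect-index)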
During shutdown the DATASTOREQUIESCE field does not update
Applies to: 4.7.0 and later
- Error
-
After successfully executing the cpd-cli manage shutdown command, the DATASTOREQUIESCE state in the Watson Discovery resource is stuck in QUIESCING:
# oc get WatsonDiscovery wd -n "${PROJECT_CPD_INST_OPERANDS}"
NAME VERSION READY READYREASON UPDATING UPDATINGREASON DEPLOYED VERIFIED QUIESCE DATASTOREQUIESCE AGE
wd 4.7.3 True Stable False Stable 24/24 24/24 QUIESCED QUIESCING 16h
- Cause
-
Due to the way quiescing Postgres works, the Postgres pods are still running in the background. This results in the metadata not updating in the Watson Discovery resource.
- Solution
- There is no fix for this. However, the state being stuck in QUIESCING does not affect the Watson Discovery operator.
UpgradeError is shown after resizing PVC
- Error
- After you edit the custom resource to change the size of a persistent volume claim for a data store, an error is shown.
- Cause
- You cannot change the persistent volume claim size of a component by updating the custom resource. Instead, you must change the size of the PVC on the persistent volume claim node after it is created.
- Solution
- To prevent the error, undo the changes that were made to the YAML file. For more information about the steps to follow to change the persistent volume claim size successfully, see Scaling an existing persistent volume claim size.
Disruption of service after upgrading, restarting, or scaling by updating scaleConfig
- Error
- After upgrading, restarting, or scaling Watson Discovery by updating the scaleConfig parameter, the Elasticsearch component might become non-functional, resulting in disruption of service and data loss.
- Cause
- The Elasticsearch component uses a quorum of pods to ensure availability when it completes search operations. However, each pod in the quorum must recognize the same pod as the leader of the quorum. The system can run into issues when more than one leader pod is identified.
- Solution
- To determine whether confusion about the quorum leader pod is the cause of the issue, complete the following steps:
- Log in to the cluster, and then set the namespace to the project where the Discovery resources are installed.
- Check each of the Elasticsearch pods with the role of master to see which pod it identifies as the quorum leader. Each pod must list the same pod as the leader:
oc get pod -l icpdsupport/addOnId=discovery,app=elastic,role=master,tenant=wd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | while read i; do echo $i; oc exec $i \
  -c elasticsearch -- bash -c 'curl -ksS "localhost:19200/_cat/master?v"'; echo; done
For example, in the following result, two different leaders are identified. Pods 1 and 2 identify pod 2 as the leader. However, pod 0 identifies itself as the leader.
wd-ibm-elasticsearch-es-server-master-0
id                     host      ip        node
7q0kyXJkSJirUMTDPIuOHA 127.0.0.1 127.0.0.1 wd-ibm-elasticsearch-es-server-master-0
wd-ibm-elasticsearch-es-server-master-1
id                     host      ip        node
L0mqDts7Rh6HiB0aQ4LLtg 127.0.0.1 127.0.0.1 wd-ibm-elasticsearch-es-server-master-2
wd-ibm-elasticsearch-es-server-master-2
id                     host      ip        node
L0mqDts7Rh6HiB0aQ4LLtg 127.0.0.1 127.0.0.1 wd-ibm-elasticsearch-es-server-master-2
If you find that more than one pod is identified as the leader, contact IBM Support.
MinIO gets stuck in a loop after several installation attempts
- Error
- The message, Cannot find volume "export" to mount into container "ibm-minio", is displayed during an upgrade of Watson Discovery from Version 4.6 or previous versions. Check the status of the MinIO pods by using the following command:
oc get pods -l release=wd-minio -o wide
Then, check the MinIO operator logs by using the following commands:
oc get pods -A | grep ibm-minio-operator
oc logs -n <namespace> ibm-minio-operator-XXXXX
You see an error that is similar to either of the following messages in the logs:
ibm-minio/templates/minio-create-bucket-job.yaml failed: jobs.batch "wd-minio-discovery-create-bucket" already exists) and failed rollback: failed to replace object"
ibm-minio/templates/minio-create-bucket-job.yaml failed: jobs.batch "wd-minio-discovery-create-pvc" already exists) and failed rollback: failed to replace object"
- Cause
- A job that creates a storage bucket or PVC for MinIO, and that is normally deleted after it completes, is not deleted properly.
- Solution
- Complete the following steps to check whether an incomplete create-bucket job or create-pvc job for MinIO exists. If so, delete the incomplete jobs so that the jobs can be recreated and can then run successfully.
- Check for the MinIO jobs by using the following commands:
oc get jobs | grep 'wd-minio-discovery-create-bucket'
oc get jobs | grep 'wd-minio-discovery-create-pvc'
- If an existing create-bucket job is listed in the response, delete the job by using the following command:
oc delete job $(oc get jobs -oname | grep 'wd-minio-discovery-create-bucket')
- If an existing create-pvc job is listed in the response, delete the job by using the following command:
oc delete job $(oc get jobs -oname | grep 'wd-minio-discovery-create-pvc')
- Verify that all of the MinIO pods start successfully by using the following command:
oc get pods -l release=wd-minio -o wide
Limitations
- The service supports single-zone deployments; it does not support multi-zone deployments.
- You cannot upgrade the Watson Discovery service by using the service-instance upgrade command from the Cloud Pak for Data command-line interface.
- You cannot use the Cloud Pak for Data OpenShift APIs for Data Protection (OADP) backup and restore utility to do an offline backup and restore of the Watson Discovery service. Online backup and restore with OADP is available.