IBM Support

New Install of OVA not coming up healthy due to ETCD performance.

Troubleshooting


Problem

After installing the OVA from the ISO and powering up the VM, the installer reports that it is done, but some pods are missing or are in CrashLoopBackOff or Pending status.
One of the following job pods is likely to show as 0/1 CrashLoopBackOff:
  • <managementcluster_CR_name>-up-apim-schema-0-to-<schema_number>*
  • <managementcluster_CR_name>-up-lur-schema-0-to-<schema_number>*
The failure can occur at any of these points:
  • during apim/lur DB creation
  • during the schema jobs for apim or lur
  • during the data-populate jobs for apim or lur
  • after all of the above succeed, when postgres does not stay up and keeps restarting
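
The affected pods can be picked out of a long pod listing with a small filter. The following is a minimal sketch and not part of the original procedure: the function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the pods that are not healthy.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    $3 == "Completed" { next }             # finished job pods are fine
    {
      split($2, ready, "/")                # READY column looks like "0/1"
      if ($3 != "Running" || ready[1] != ready[2])
        print $1, $2, $3
    }
  '
}

# Usage (requires cluster access):
#   kubectl -n <namespace> get pods --no-headers | unhealthy_pods
```

Any schema or data-populate job pod that appears in the output with a status such as CrashLoopBackOff or Pending matches the problem described here.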

Symptom

The install does not complete successfully, and apic health-check fails.
Pod logs might show:
-sql [261262] getConnection
-bhendi:error [261262] lib/db::execute failed, Code='55006', Message='database "apim" is being accessed by other users': database "apim" is being accessed by other users, stack: error: database "apim" is being accessed by other users
    at Parser.parseErrorMessage (/app/node_modules/pg-protocol/src/parser.ts:369:69)
    at Parser.handlePacket (/app/node_modules/pg-protocol/src/parser.ts:188:21)
    at Parser.parse (/app/node_modules/pg-protocol/src/parser.ts:103:30)
  • The etcd logs can contain many "apply request took too long" messages, where requests take seconds to complete although the expected duration is 100ms:
-caller":"etcdserver/server.go:1159","msg":"failed to revoke lease","lease-id":"4bd38ef0015a6704","error":"etcdserver: request timed out"}
-"caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"4.998589206s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/leases/kube-system/kube-scheduler\" ","response":"","error":"context deadline exceeded"}
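
To judge how widespread the slow requests are, the durations can be pulled out of the etcd log lines. This is a rough sketch under stated assumptions: the function name slow_apply_durations is hypothetical, the log lines are assumed to carry a "took" field formatted as in the sample above, and how the etcd logs are collected depends on the environment.

```shell
# Hypothetical helper: reads etcd log lines on stdin and prints the "took"
# value of every "apply request took too long" message, one per line.
slow_apply_durations() {
  grep 'apply request took too long' \
    | grep -o '"took":"[^"]*"' \
    | cut -d'"' -f4                      # 4th quote-delimited field is the value
}

# Usage, where <etcd-logs> is a file holding the collected etcd logs:
#   slow_apply_durations < <etcd-logs>            # list the durations
#   slow_apply_durations < <etcd-logs> | wc -l    # count the slow requests
```

A large number of entries in the range of seconds, against the expected 100ms, points to the etcd performance problem described under Cause.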

Cause

A slow etcd server can cause this issue. It is not caused by a defect in APIC itself.

Resolving The Problem

To start, take a backup of the schema job (in the commands below, k is assumed to be an alias for kubectl):
k get job | grep schema
k get job <name of schema job> -o yaml > nameofschemajob.yaml
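
When more than one schema job exists (for example, one for apim and one for lur), the two backup commands can be wrapped in a loop. This is only a sketch: the function name backup_schema_jobs is hypothetical, and it assumes kubectl is already pointed at the namespace that holds the jobs.

```shell
# Hypothetical helper: saves every job whose name contains "schema"
# to <job name>.yaml in the current directory.
# Assumes kubectl is configured for the namespace that holds the jobs.
backup_schema_jobs() {
  kubectl get job -o name | grep schema | while read -r job; do
    name=${job##*/}                      # strip the "job.batch/" prefix
    kubectl get job "$name" -o yaml > "$name.yaml"
    echo "saved $name.yaml"
  done
}

# Usage (requires cluster access):
#   backup_schema_jobs
```

Keeping one file per job makes it possible to compare or restore a job definition later if it is changed or deleted during recovery.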
Then clear the stale database connections so that the schema job can run again:
  • Scale down the apiconnect operator, as well as the lur/ldap/apim deployments, to 0
k scale deploy ibm-apiconnect --replicas=0
  • Exec into the postgres leader pod and remove the locks by terminating the other sessions that are connected to the apim database
kubectl -n <namespace> exec -it `kubectl -n <namespace> get po -l role=master -o name` -- bash
psql -U postgres


SELECT * FROM pg_stat_activity WHERE datname='apim';

SELECT pg_terminate_backend(pg_stat_activity.pid)
FROM pg_stat_activity
WHERE pg_stat_activity.datname = 'apim'
AND pid <> pg_backend_pid();


\q
exit
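
The same termination statement can also be generated for a given database name and passed to psql non-interactively. This is a sketch under assumptions: the function name terminate_sessions_sql is hypothetical, and whether a database other than apim (such as lur) needs the same treatment depends on which job is failing.

```shell
# Hypothetical helper: prints the SQL that terminates every other session
# connected to the given database (the statement above, parameterized).
terminate_sessions_sql() {
  db=$1
  printf "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = '%s' AND pid <> pg_backend_pid();\n" "$db"
}

# Usage (requires cluster access; <leader pod> is the pod with role=master):
#   kubectl -n <namespace> exec -i <leader pod> -- \
#     psql -U postgres -c "$(terminate_sessions_sql apim)"
```

Printing the statement first, before passing it to psql, gives a chance to confirm the database name before any sessions are terminated.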

Scale the ibm-apiconnect operator deployment back up. This results in the schema job running again.

k scale deploy ibm-apiconnect --replicas=1

Document Location

Worldwide

Product: IBM API Connect
Category: API Connect->Management and Monitoring (MM)->Installation
Platform: Platform Independent
Version: All Versions

Document Information

Modified date:
02 July 2024

UID

ibm17148626