Troubleshooting Apache Flink jobs
If you do not see events flowing to HDFS or to Elasticsearch and Flink job logs report errors, explore possible diagnoses and solutions.
All task slots seem to be busy, yet a job does not have enough task slots assigned.
- Problem
- The job manager log reports errors such as the following one.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 8, slots allocated: 0
The Flink web interface is accessible and in the Overview page, you see 0 (zero) available task slots.
- Cause
- If this issue happens after initial configuration, it means that you did not configure enough task slots for running all the jobs. Verify whether the number of task slots displayed in the Flink web interface is equal to, or greater than, the number of running jobs. If it is not, update your IBM® Business Automation Insights configuration with the correct number of task manager replicas and task slots.
- If the issue happens after you have updated your IBM Business Automation Insights configuration, the problem might indicate that Apache Flink did not correctly update the metadata about task slot assignment after a failing task manager recovered.
- Solution
- Restart each task manager one by one, in any order, by running these delete commands.
kubectl delete pod <my-release>-bai-flink-taskmanager-0
kubectl delete pod <my-release>-bai-flink-taskmanager-1
...
These commands redistribute the jobs to the different task slots. The task manager StatefulSet immediately redeploys new instances.
- After the restart, verify from the Flink web interface that all jobs are running and have task slots assigned.
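The one-by-one restart can also be scripted. A minimal sketch, assuming the release is named my-release and the task manager StatefulSet has two replicas; adjust RELEASE and REPLICAS to match your deployment:

```shell
#!/bin/sh
# Sketch of a rolling task manager restart.
# Assumptions (adjust as needed): release "my-release", 2 replicas.
RELEASE=my-release
REPLICAS=2

for i in $(seq 0 $((REPLICAS - 1))); do
  pod="${RELEASE}-bai-flink-taskmanager-${i}"
  echo "Restarting ${pod}..."
  kubectl delete pod "${pod}"
  # Wait for the StatefulSet to redeploy the pod before touching the
  # next one, so that only one task manager is down at a time.
  kubectl wait --for=condition=Ready "pod/${pod}" --timeout=300s
done
```

After the loop completes, verify from the Flink web interface that all jobs are running with task slots assigned.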
After an update of the job submitter, the processing job is in Cancelled state and issues an error message.
- Problem
- The following error message is displayed.
Get latest completed checkpoint for <job-id> job
REST endpoint did not return latest completed checkpoint. Getting it from Persistent Volume...
Error: There is no checkpoint to recover from.
- Diagnosis
- This error can happen when the version of a job is updated, for example to try to fix a failure, and this failure is preventing the creation of new checkpoints and savepoints.
- Solution
- Restart the job from the latest successful checkpoint or savepoint.
New in 18.0.2 You can find the latest successful checkpoint in the <bai-pv>/checkpoints/<job-name>/<job-id> directory.
- <bai-pv> is the directory where the IBM Business Automation Insights persistent volume (PV) was created. Set this variable to /mnt/pv, which is the folder where the PV is mapped within the job submitters.
- <job-name> is the name of the failing job, for example bai/bpmn.
- <job-id> is the job identifier that is indicated by the error message. Pick the most recent checkpoint, that is, the highest <checkpoint-id> value, and verify that the folder is not empty.
- If all <checkpoint-id> folders are empty, and only in this case, use the latest savepoint of the corresponding processing job, which you can find in the <bai-pv>/savepoints/<job-name> directory.
- Update the Recovery Path property of the failing job by following the procedure in Updating your Business Automation Insights deployment.
For 18.0.0 and 18.0.1 You can find the latest successful checkpoint in /checkpoints/<job-id>, where <job-id> is the job identifier that is indicated by the error message. Otherwise, use the latest savepoint of the corresponding processing job.
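Locating the most recent non-empty checkpoint can be scripted. A minimal sketch, assuming Flink's usual chk-<checkpoint-id> directory naming and a shell that has the persistent volume mounted; substitute the real <job-name> and <job-id> from the error message:

```shell
#!/bin/sh
# Find the highest-numbered, non-empty checkpoint directory.
# Assumptions: PV mounted at /mnt/pv; Flink names checkpoint
# directories chk-<checkpoint-id>. Substitute <job-name>/<job-id>.
BAI_PV=/mnt/pv
JOB_DIR="$BAI_PV/checkpoints/<job-name>/<job-id>"

latest=""
# Sort checkpoint ids numerically, highest first.
ids=$(ls "$JOB_DIR" | grep '^chk-' | sed 's/chk-//' | sort -n -r)
for id in $ids; do
  d="$JOB_DIR/chk-$id"
  # Skip empty folders, as the documentation requires.
  if [ -n "$(ls -A "$d" 2>/dev/null)" ]; then
    latest="$d"
    break
  fi
done

if [ -n "$latest" ]; then
  echo "Latest non-empty checkpoint: $latest"
else
  echo "No usable checkpoint; fall back to the latest savepoint"
fi
```

If the script finds no usable checkpoint, use the latest savepoint from <bai-pv>/savepoints/<job-name>, as described above.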
Jobs are not running and the Flink web interface is not accessible after a system restart.
- Problem
- When you try to access the Flink web interface, you see the following message.
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
The job manager logs the following messages.
INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{######}]
...
INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Pending slot request [SlotRequestId{######}] timed out.
The job manager also reports errors such as the following one.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 6, slots allocated: 0
- Cause
- These messages indicate that the job manager was not correctly updated after the system restart. As a consequence, the job manager does not have access to the task managers to assign job execution.
- Solution
- Restart the job manager to update it with the correct data, by running the delete command.
kubectl delete pod <my-release>-bai-flink-jobmanager-<id>
A new job manager instance is deployed. After the redeployment, all jobs should be running again and the Flink web interface should be accessible.
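Because the job manager pod name ends in a generated suffix, you can look it up by prefix instead of typing it by hand. A sketch, assuming the release is named my-release:

```shell
#!/bin/sh
# Look up the job manager pod by its name prefix, then delete it so
# that a fresh instance is deployed.
# Assumption: the release is named "my-release".
RELEASE=my-release
pod=$(kubectl get pods --no-headers -o custom-columns=NAME:.metadata.name \
  | grep "^${RELEASE}-bai-flink-jobmanager-")
kubectl delete pod "${pod}"
```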
New in 18.0.1 Pods of <release-name>-bai-bpmn or <release-name>-bai-icm jobs are stuck in Init:0/1 status
- Problem
- The pods of the <release-name>-bai-bpmn and <release-name>-bai-icm jobs first require that the <release-name>-bai-setup job completes successfully. The <release-name>-bai-setup job attempts up to 3 retries on failure. Past these 3 retries, it does not trigger any new pod creation. As a side effect, this can cause pods of the <release-name>-bai-bpmn and <release-name>-bai-icm jobs to remain stuck in Init:0/1 status.
When you run the get pods command, you might observe the following results.
kubectl get pods -n <my-namespace>
Table 1. Pod retrieval results
NAME                           READY   STATUS     RESTARTS   AGE
...
<release-name>-bpmn-aaaaa      0/1     Init:0/1   0          2h
...
<release-name>-bai-icm-bbbbb   0/1     Init:0/1   0          2h
...
The logs of the <release-name>-bai-bpmn and <release-name>-bai-icm pods show the following information.
kubectl logs <release-name>-bpmn-aaaaa -n <my-namespace> -c wait-bai-flink-es
...
Checking if mappings of process-summaries* indexes are up-to-date by querying https://<release-name>-ibm-dba-ek-client:9201/elasticsearch-ibm-dba-ek-<release-name>/_template/process-summaries ... (iteration 67)
...
Checking if mappings of process-summaries* indexes are up-to-date by querying https://<release-name>-ibm-dba-ek-client:9201/elasticsearch-ibm-dba-ek-<release-name>/_template/process-summaries ... (iteration 68)
...
- Cause
- This situation can happen if Elasticsearch was set up incorrectly when you installed the release. First check that Elasticsearch is properly up and running. After you have ensured that it is, apply the following solution.
- Solution
- Delete all the pods that were previously created by the <release-name>-bai-setup job.
kubectl delete pod <release-name>-bai-setup-aaaaa
kubectl delete pod <release-name>-bai-setup-bbbbb
kubectl delete pod <release-name>-bai-setup-ccccc
kubectl delete pod <release-name>-bai-setup-ddddd
- Run the following command to recreate the <release-name>-bai-setup job.
Note: This command requires the jq command-line JSON processor.
kubectl get job <release-name>-bai-setup -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels)' | kubectl replace --force -f -
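Instead of listing each generated pod suffix by hand, you can delete every setup pod by name prefix. A sketch, assuming the pods share the <release-name>-bai-setup- prefix shown above:

```shell
#!/bin/sh
# Delete all pods created by the bai-setup job in one pass.
# Assumption: the release is named "my-release"; substitute yours.
RELEASE=my-release
kubectl get pods --no-headers -o custom-columns=NAME:.metadata.name \
  | grep "^${RELEASE}-bai-setup-" \
  | xargs -r kubectl delete pod
```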
New in 18.0.2 You are trying to remove an operator without first creating savepoints
- Problem
- The job submitter pods are in Error state and you find errors in the logs, such as the following one.
Caused by: java.lang.IllegalStateException: Failed to rollback to checkpoint/savepoint file:/mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<savepoint-id>
Cannot map checkpoint/savepoint state for operator xxxxxxxxxxxxxx to the new program, because the operator is not available in the new program.
- Cause
- This error occurs if you are trying to update your release and remove an operator, for example HDFS, but you did not first create the necessary savepoints and no values were passed to the recoveryPath parameter of the jobs.
- Solution
- The error message contains the path to a savepoint that is created on the fly when the job update is attempted. You can restart the jobs from that savepoint by updating the IBM Business Automation Insights release and passing the correct value for each job in its recoveryPath parameter. For more information about the parameters to update in the release, see Advanced updates.
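If you manage the release with Helm, the recovery path can be passed at upgrade time. A hypothetical sketch, assuming the failing job is bpmn and that bpmn.recoveryPath is the matching chart parameter; check Advanced updates for the exact parameter names, and take the savepoint path from the error message:

```shell
# Hypothetical example: release "my-release", chart reference and the
# bpmn.recoveryPath parameter name are assumptions to be checked against
# the Advanced updates documentation.
helm upgrade my-release <bai-chart> \
  --reuse-values \
  --set bpmn.recoveryPath=/mnt/pv/savepoints/dba/bai-bpmn/savepoint-<savepoint-id>
```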