Troubleshooting Apache Flink jobs
If you do not see events flowing to HDFS or to Elasticsearch and Flink job logs report errors, explore possible diagnoses and solutions.
All task slots seem to be busy, yet a job does not have enough task slots assigned.
- Problem
- The job manager log reports errors such as the following one.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 8, slots allocated: 0
The Flink web interface is accessible and in the Overview page, you see 0 (zero) available task slots.
- Cause
- If this issue happens after initial configuration, it means that you did not configure enough task slots for running all the jobs. Verify whether the number of task slots displayed in the Flink web interface is equal to, or greater than, the number of running jobs. If it is not, update your IBM® Business Automation Insights configuration with the correct number of task manager replicas and task slots.
- If the issue happens after you have updated your IBM Business Automation Insights configuration, the problem might indicate that Apache Flink did not correctly update the metadata about task slot assignment after a failing task manager recovered.
- Solution
- Restart each task manager one by one, in any order, by running these delete commands.
kubectl delete pod <my-release>-bai-flink-taskmanager-0
kubectl delete pod <my-release>-bai-flink-taskmanager-1
...
These commands redistribute the jobs to the different task slots. The task manager StatefulSet immediately redeploys new instances.
- After the restart, verify from the Flink web interface that all jobs are running and have task slots assigned.
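The one-by-one restart can also be scripted. A minimal sketch, assuming the release is named my-release and the task manager StatefulSet has two replicas; adjust RELEASE and REPLICAS to match your deployment:

```shell
#!/bin/sh
# Sketch of a rolling task manager restart.
# Assumptions (adjust as needed): release "my-release", 2 replicas.
RELEASE=my-release
REPLICAS=2

for i in $(seq 0 $((REPLICAS - 1))); do
  pod="${RELEASE}-bai-flink-taskmanager-${i}"
  echo "Restarting ${pod}..."
  kubectl delete pod "${pod}"
  # Wait for the StatefulSet to redeploy the pod before touching the
  # next one, so that only one task manager is down at a time.
  kubectl wait --for=condition=Ready "pod/${pod}" --timeout=300s
done
```

After the loop completes, verify from the Flink web interface that all jobs are running with task slots assigned.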
After an update of the job submitter, the processing job is in Cancelled state and issues an error message.
- Problem
- The following error message is displayed.
Get latest completed checkpoint for <job-id> job
REST endpoint did not return latest completed checkpoint. Getting it from Persistent Volume...
Error: There is no checkpoint to recover from.
- Diagnosis
- This error can happen when the version of a job is updated, for example to try to fix a failure, and this failure is preventing the creation of new checkpoints and savepoints.
- Solution
- Restart the job from the latest successful checkpoint or savepoint.
New in 18.0.2 You can find the latest successful checkpoint in the <bai-pv>/checkpoints/<job-name>/<job-id> directory.
- <bai-pv> is the directory where the IBM Business Automation Insights persistent volume (PV) was created. Set this variable to /mnt/pv, which is the folder where the PV is mapped within the job submitters.
- <job-name> is the name of the failing job, for example bai/bpmn.
- <job-id> is the job identifier that is indicated by the error message. Pick the most recent checkpoint, that is, the highest <checkpoint-id> value, and verify that the folder is not empty.
- If all <checkpoint-id> folders are empty, and only in this case, use the latest savepoint of the corresponding processing job, which you can find in the <bai-pv>/savepoints/<job-name> directory.
- Update the Recovery Path property of the failing job by following the procedure in Updating your Business Automation Insights deployment.
For 18.0.0 and 18.0.1 You can find the latest successful checkpoint in /checkpoints/<job-id>, where <job-id> is the job identifier that is indicated by the error message. Otherwise, use the latest savepoint of the corresponding processing job.
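Locating the most recent non-empty checkpoint can be scripted. A minimal sketch, assuming Flink's usual chk-<checkpoint-id> directory naming and a shell that has the persistent volume mounted; substitute the real <job-name> and <job-id> from the error message:

```shell
#!/bin/sh
# Find the highest-numbered, non-empty checkpoint directory.
# Assumptions: PV mounted at /mnt/pv; Flink names checkpoint
# directories chk-<checkpoint-id>. Substitute <job-name>/<job-id>.
BAI_PV=/mnt/pv
JOB_DIR="$BAI_PV/checkpoints/<job-name>/<job-id>"

latest=""
# Sort checkpoint ids numerically, highest first.
ids=$(ls "$JOB_DIR" | grep '^chk-' | sed 's/chk-//' | sort -n -r)
for id in $ids; do
  d="$JOB_DIR/chk-$id"
  # Skip empty folders, as the documentation requires.
  if [ -n "$(ls -A "$d" 2>/dev/null)" ]; then
    latest="$d"
    break
  fi
done

if [ -n "$latest" ]; then
  echo "Latest non-empty checkpoint: $latest"
else
  echo "No usable checkpoint; fall back to the latest savepoint"
fi
```

If the script finds no usable checkpoint, use the latest savepoint from <bai-pv>/savepoints/<job-name>, as described above.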
Jobs are not running and the Flink web interface is not accessible after a system restart.
- Problem
- When you try to access the Flink web interface, you see the following message.
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
The job manager logs the following messages.
INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{######}]
...
INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Pending slot request [SlotRequestId{######}] timed out.
The job manager also reports errors such as the following one.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 6, slots allocated: 0
- Cause
- These messages indicate that the job manager was not correctly updated after the system restart. As a consequence, the job manager does not have access to the task managers to assign job execution.
- Solution
- Restart the job manager to update it with the correct data, by running the delete command.
kubectl delete pod <my-release>-bai-flink-jobmanager-<id>
A new job manager instance is deployed. After the redeployment, all jobs should be running again and the Flink web interface should be accessible.
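Because the job manager pod name ends in a generated suffix, you can look it up by prefix instead of typing it by hand. A sketch, assuming the release is named my-release:

```shell
#!/bin/sh
# Look up the job manager pod by its name prefix, then delete it so
# that a fresh instance is deployed.
# Assumption: the release is named "my-release".
RELEASE=my-release
pod=$(kubectl get pods --no-headers -o custom-columns=NAME:.metadata.name \
  | grep "^${RELEASE}-bai-flink-jobmanager-")
kubectl delete pod "${pod}"
```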
New in 18.0.1 Pods of <release-name>-bai-bpmn or <release-name>-bai-icm jobs are stuck in Init:0/1 status
- Problem
- The pods of the <release-name>-bai-bpmn and <release-name>-bai-icm jobs first require that the <release-name>-bai-setup job completes successfully. The <release-name>-bai-setup job attempts up to 3 retries on failure. Past these 3 retries, it does not trigger any new pod creation. As a side effect, this can cause pods of the <release-name>-bai-bpmn and <release-name>-bai-icm jobs to remain stuck in Init:0/1 status.
When you run the get pods command, you might observe the following results.
kubectl get pods -n <my-namespace>
Table 1. Pod retrieval results
NAME                           READY   STATUS     RESTARTS   AGE
...
<release-name>-bpmn-aaaaa      0/1     Init:0/1   0          2h
...
<release-name>-bai-icm-bbbbb   0/1     Init:0/1   0          2h
...
The logs of the <release-name>-bai-bpmn and <release-name>-bai-icm pods show the following information.
kubectl logs <release-name>-bpmn-aaaaa -n <my-namespace> -c wait-bai-flink-es
...
Checking if mappings of process-summaries* indexes are up-to-date by querying https://<release-name>-ibm-dba-ek-client:9201/elasticsearch-ibm-dba-ek-<release-name>/_template/process-summaries ... (iteration 67)
...
Checking if mappings of process-summaries* indexes are up-to-date by querying https://<release-name>-ibm-dba-ek-client:9201/elasticsearch-ibm-dba-ek-<release-name>/_template/process-summaries ... (iteration 68)
...
- Cause
- This situation can happen if Elasticsearch was set up incorrectly when you installed the release. First check that Elasticsearch is properly up and running. After you have ensured that it is, apply the following solution.
- Solution
- Delete all the pods that were previously created by the <release-name>-bai-setup job.
kubectl delete pod <release-name>-bai-setup-aaaaa
kubectl delete pod <release-name>-bai-setup-bbbbb
kubectl delete pod <release-name>-bai-setup-ccccc
kubectl delete pod <release-name>-bai-setup-ddddd
- Run the following command to recreate the <release-name>-bai-setup job.
Note: This command requires the jq command-line JSON processor.
kubectl get job <release-name>-bai-setup -o json | jq 'del(.spec.selector)' | jq 'del(.spec.template.metadata.labels)' | kubectl replace --force -f -
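Instead of listing each generated pod suffix by hand, you can delete every setup pod by name prefix. A sketch, assuming the pods share the <release-name>-bai-setup- prefix shown above:

```shell
#!/bin/sh
# Delete all pods created by the bai-setup job in one pass.
# Assumption: the release is named "my-release"; substitute yours.
RELEASE=my-release
kubectl get pods --no-headers -o custom-columns=NAME:.metadata.name \
  | grep "^${RELEASE}-bai-setup-" \
  | xargs -r kubectl delete pod
```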
New in 18.0.2 You are trying to remove an operator without first creating savepoints
- Problem
- The job submitter pods are in Error state and you find errors in the logs, such as the following one.
Caused by: java.lang.IllegalStateException: Failed to rollback to checkpoint/savepoint file:/mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<savepoint-id>
Cannot map checkpoint/savepoint state for operator xxxxxxxxxxxxxx to the new program, because the operator is not available in the new program.
- Cause
- This error occurs if you are trying to update your release and remove an operator, for example HDFS, but you did not first create the necessary savepoints and no values were passed to the recoveryPath parameter of the jobs.
- Solution
- The error message contains the path to a savepoint that is created on the fly when the job update is attempted. You can restart the jobs from that savepoint by updating the IBM Business Automation Insights release and passing the correct value for each job in its recoveryPath parameter. For more information about the parameters to update in the release, see Advanced updates.
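If you manage the release with Helm, the recovery path can be passed at upgrade time. A hypothetical sketch, assuming the failing job is bpmn and that bpmn.recoveryPath is the matching chart parameter; check Advanced updates for the exact parameter names, and take the savepoint path from the error message:

```shell
# Hypothetical example: release "my-release", chart reference and the
# bpmn.recoveryPath parameter name are assumptions to be checked against
# the Advanced updates documentation.
helm upgrade my-release <bai-chart> \
  --reuse-values \
  --set bpmn.recoveryPath=/mnt/pv/savepoints/dba/bai-bpmn/savepoint-<savepoint-id>
```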