Restarting from a checkpoint or savepoint from Business Automation Insights 18.0.2

 New in 18.0.2: Starting from savepoints is mandatory when you upgrade Business Automation Insights to a version that is based on a new Apache Flink version.

About this task

For upgrades that include a new version of Flink, a savepoint is required for each processing job. You cannot use checkpoints in this case. You create savepoints as part of the upgrade procedure in Upgrading Business Automation Insights releases.

You can create savepoints at any time by running the exec command shown in the procedure below. If a job failure prevents you from creating a savepoint, that is, if the create-savepoints.sh script returns an error, use the latest successful checkpoint instead.
Tip: Prefer savepoints over checkpoints because savepoints are retained until you explicitly delete them. Use checkpoints only when savepoint creation fails. Because only the three latest successful checkpoints are retained, cancel the job first so that they are not deleted while a new checkpoint is created. On cancellation, the three latest checkpoints are kept.
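
For reference, a savepoint can also be triggered directly with the Flink command-line client from inside the job manager pod. The following is a minimal sketch, not part of the product scripts: it assumes that $JOBMANAGER holds the job manager pod name (see step 1 of the procedure) and that the target directory file:/mnt/pv/savepoints matches the paths reported by create-savepoints.sh; adjust both to your environment.
    # Hedged sketch: list the running jobs to obtain their IDs
    kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- flink list -r
    # Trigger a savepoint for one job; the target directory is an assumption based on the paths shown in the procedure
    kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- flink savepoint <job-id> file:/mnt/pv/savepoints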

You can then restart the processing by setting the Recovery Path property of each job submitter in the Helm release properties to either a savepoint or a checkpoint path.
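
A minimal sketch of such an update through Helm follows. It is not the documented procedure (see Updating your Business Automation Insights deployment), and <recovery-path-key> is a placeholder only: the exact values key depends on your chart and job submitter.
    # Hedged sketch: set the Recovery Path of one job submitter to a savepoint or checkpoint path.
    # <recovery-path-key> is a placeholder, not the real values key name.
    helm upgrade <my-release> <bai-chart> --namespace <my-namespace> --reuse-values \
      --set <recovery-path-key>="file:/mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<id>"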

Procedure

  1. Retrieve the name of the job manager pod.
    JOBMANAGER=`kubectl get pods --selector=release=<my-release> --namespace <my-namespace> | grep bai-flink-jobmanager | awk '{print $1}'`
  2. Create savepoints for all the running processing jobs by using the script provided in the job manager pod.
    kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- scripts/create-savepoints.sh
    The script returns the paths of the created savepoints.
    Savepoint completed. Path: file:/mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<id>
    Savepoints are created for all the jobs while they continue running.
  3. Optional: If you need to stop the processing right after creating the savepoints (for example, an upgrade requires this), run the script with the -s flag.
    kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- scripts/create-savepoints.sh -s
    This command stops the jobs and returns the paths of the created savepoints.
    Savepoint stored in file:/mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<id>.
  4. Optional: If, and only if, the create-savepoints.sh script returns an error while it creates the savepoints, use the latest successful checkpoint.
    The create-savepoints.sh script returns the names and identifiers of the jobs that failed to create savepoints.
    Not able to create savepoint for job 'dba/bai-<job-name>' with ID: <job-id>
    1. Cancel the jobs to prevent the creation of new checkpoints.
      kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- flink cancel <job-id>
    2. Retrieve the latest successful checkpoint.
      kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- curl -sk https://localhost:8081/jobs/<job-id>/checkpoints | jq '.latest.completed.external_path'
    Note: The error that prevents the creation of savepoints might recur when you restart from the latest successful checkpoint. Before you restart the job from the checkpoint, examine the job logs to identify the problem and take the necessary steps to fix it. For information about monitoring jobs, see Troubleshooting.
  5. To update the Recovery Path property, follow the procedure in Updating your Business Automation Insights deployment.

    By default, you can restart a job from the same checkpoint or savepoint only once. This safety mechanism protects you if you forget to remove the value of the Recovery Path property. If you try to restart more than once, the job submitter goes into an error state and returns a message such as Error: The savepoint <path/to/savepoint> was already used. The Job won't be run from there.

    The job resumes processing from where it was when the specified checkpoint or savepoint was created.

  6. Optional: If you really need to restart a job from the same checkpoint or savepoint more than once, first delete the /recovery/<job-name>/<savepoint-id> savepoint on the persistent volume (PV).
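
A minimal sketch for the previous step follows. It assumes that the /recovery directory is kept on the same persistent volume that is mounted in the job manager pod at /mnt/pv; this mount point is an assumption, so adjust the path to your PV layout.
    # Assumption: the recovery markers live under /mnt/pv/recovery in the job manager pod
    kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- ls /mnt/pv/recovery/<job-name>
    # Delete the entry for the savepoint or checkpoint that you want to reuse
    kubectl exec -it $JOBMANAGER --namespace <my-namespace> -- rm -r /mnt/pv/recovery/<job-name>/<savepoint-id>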

Results

Jobs are restored from the savepoint or checkpoint with the allowNonRestoredState flag, which is described in the Restore a savepoint section of the Flink documentation. This flag is set so that you can remove operators, such as HDFS storage or Kafka egress: state that belongs to a removed operator is skipped instead of causing the restore to fail. Therefore, be careful when you restore from a savepoint or checkpoint and check the logs. When state is skipped, the job manager log contains a message such as the following one. Similar messages appear only when you disable operators, and only once for each operator.
2019-01-08 18:11:04,737 INFO org.apache.flink.runtime.checkpoint.Checkpoints - Skipping savepoint state for operator <operator-id>.
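
In Business Automation Insights, the job submitter passes this flag for you. For reference only, the following sketch shows the equivalent manual resubmission with the plain Flink command-line client; the JAR name and its arguments are placeholders, not actual Business Automation Insights artifacts.
# Hedged sketch: resume from a savepoint and skip state that belongs to removed operators
flink run -s file:/mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<id> --allowNonRestoredState <processing-job>.jar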

For more information, see Advanced updates.