Restarting from a checkpoint or savepoint

You can restart event processing from a specific checkpoint or savepoint, typically after an upgrade or an unrecoverable error.

Before you begin

Make sure that the jq command-line JSON processor is installed. The jq tool is available from this page: https://stedolan.github.io/jq/.
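Before you start, a quick check along the following lines can confirm that jq is on your PATH, together with the oc and curl tools that the procedure also calls. This is only a sketch: `require_tool` is a hypothetical helper name, not part of Business Automation Insights.

```shell
# Hypothetical helper: report whether a required command-line tool is on the PATH.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found; install it before you continue" >&2
    return 1
  fi
}

# The procedure in this topic calls oc, curl, and jq.
for tool in oc curl jq; do
  require_tool "$tool" || missing=1
done
```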

About this task

Restarting from a checkpoint or savepoint lets you fix the cause of an unrecoverable error and then resume processing from a valid savepoint or checkpoint.

Savepoints for upgrading
You create savepoints as part of the upgrade procedure. See Upgrading Business Automation Insights.
Important:
  • Starting from savepoints is mandatory when you upgrade Business Automation Insights and the new version is based on a new Apache Flink version.
  • If you restart from a checkpoint, do not change the parallelism of the Flink job. You can change the parallelism of the Flink job only in the context of restarting from a savepoint.
Savepoints for recovery at any time
You can also create savepoints at any time by sending the HTTP POST request that is described in the procedure. If a job failure prevents you from creating a savepoint, that is, if the savepoints endpoint returns an error, use the latest successful checkpoint.

Procedure

  1. Retrieve the management service URL.
    MANAGEMENT_SERVICE_ROUTE=$(oc get routes -l app=ibm-business-automation-insights --no-headers=true --namespace <my-namespace> | grep bai-management-service | awk '{print $2}')
  2. Retrieve the management service username and password.
    export MANAGEMENT_SERVICE_USERNAME=$(oc extract secret/<bai_custom_resource_name>-bai-secret-internal --keys=management-username --to=- 2>/dev/null)
    export MANAGEMENT_SERVICE_PASSWORD=$(oc extract secret/<bai_custom_resource_name>-bai-secret-internal --keys=management-password --to=- 2>/dev/null)
  3. Create savepoints for all the running processing jobs by sending an HTTP POST request to trigger the endpoint for savepoints.
    curl -X POST -k -u "${MANAGEMENT_SERVICE_USERNAME}:${MANAGEMENT_SERVICE_PASSWORD}" "https://${MANAGEMENT_SERVICE_ROUTE}/api/v1/processing/jobs/savepoints" | jq
  4. If you need to terminate job processing right after the creation of the savepoints, send the HTTP POST request by using the curl tool and set the cancelJob query parameter to true.
    Typically, this action is required for an upgrade.
    curl -X POST -k -u "${MANAGEMENT_SERVICE_USERNAME}:${MANAGEMENT_SERVICE_PASSWORD}" "https://${MANAGEMENT_SERVICE_ROUTE}/api/v1/processing/jobs/savepoints?cancelJob=true"

    The savepoints are stored in /mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<id>.

  5. Optional: Only if the savepoints HTTP POST request returns an error while the savepoints are being created, use the latest successful checkpoint instead.

    The response to the HTTP POST request for savepoints lists the names and identifiers of the jobs for which savepoint creation failed.

    1. Cancel the jobs to prevent the creation of new checkpoints.
      curl -X POST -k -u "${MANAGEMENT_SERVICE_USERNAME}:${MANAGEMENT_SERVICE_PASSWORD}" "https://${MANAGEMENT_SERVICE_ROUTE}/api/v1/processing/jobs/<job-id>/cancel"
    2. Retrieve the latest successful checkpoint.
      curl -X GET -k -u "${MANAGEMENT_SERVICE_USERNAME}:${MANAGEMENT_SERVICE_PASSWORD}" "https://${MANAGEMENT_SERVICE_ROUTE}/api/v1/processing/jobs/<job-id>/checkpoint" | jq ".completed_location"
  6. Set the <job_name>.recovery_path parameter to the location of the savepoint or checkpoint by following the procedure in Updating your Business Automation Insights custom resource.

    The job resumes processing from where it was when the specified checkpoint or savepoint was created.

    By default, you can restart a job from the same checkpoint or savepoint only once. This safety mechanism protects you if you forget to remove the value of the <job_name>.recovery_path parameter. If you try to restart more than once, the job submitter enters an error state and returns a message such as: Error: The savepoint /mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<id> was already used. The Job won't be run from there.

  7. Optional: If you must restart a job from the same checkpoint or savepoint more than once, first delete the /mnt/pv/<job-name>/<savepoint-id> directory on the persistent volume (PV).
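The curl commands in steps 3 to 5 differ only in the HTTP method and the path under /api/v1/processing/jobs, so you might wrap them in two small shell functions. The following is a minimal sketch under that assumption. `bai_jobs_url` and `bai_request` are hypothetical helper names, and the functions expect the variables from steps 1 and 2 to be set.

```shell
# Hypothetical helper: build the URL for a path under /api/v1/processing/jobs.
bai_jobs_url() {
  printf 'https://%s/api/v1/processing/jobs/%s' "${MANAGEMENT_SERVICE_ROUTE}" "$1"
}

# Hypothetical helper: send a request with the credentials retrieved in step 2.
# $1 is the HTTP method, $2 is the path that follows /api/v1/processing/jobs/.
# -sS hides the progress meter but still prints errors;
# -k skips TLS certificate verification, as in the procedure.
bai_request() {
  curl -sS -k -X "$1" \
    -u "${MANAGEMENT_SERVICE_USERNAME}:${MANAGEMENT_SERVICE_PASSWORD}" \
    "$(bai_jobs_url "$2")"
}

# Steps 3 to 5 expressed with the helpers (run them manually, one at a time):
#   bai_request POST "savepoints" | jq
#   bai_request POST "savepoints?cancelJob=true" | jq
#   bai_request POST "<job-id>/cancel"
#   bai_request GET  "<job-id>/checkpoint" | jq ".completed_location"
```

Quoting the value passed to -u keeps a password that contains spaces or shell metacharacters in a single argument.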

Example

Triggering savepoints endpoints
This command
curl -X POST -k -u admin:passw0rd "https://<your_bai_management_url_route>/api/v1/processing/jobs/savepoints"
returns
[{"jid":"2228d8c5e592ee93b8e3b1e33d762a6b","name":"dba/bai-bpmn","request-id":"b169f74f50a085c7b0fc600efc136b28","state":"IN_PROGRESS"}]
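When several jobs are running, a compact listing of the response can be easier to read than the raw JSON. The following sketch uses jq to print one line per job. It assumes only the response shape shown in the sample above (an array of objects with jid, name, and state fields), and `list_savepoint_jobs` is a hypothetical helper name.

```shell
# Sample response copied from the example above.
SAVEPOINT_RESPONSE='[{"jid":"2228d8c5e592ee93b8e3b1e33d762a6b","name":"dba/bai-bpmn","request-id":"b169f74f50a085c7b0fc600efc136b28","state":"IN_PROGRESS"}]'

# Hypothetical helper: read a savepoints response from standard input and
# print one "jid name state" line per job. -r prints raw strings without quotes.
list_savepoint_jobs() {
  jq -r '.[] | "\(.jid) \(.name) \(.state)"'
}

printf '%s\n' "$SAVEPOINT_RESPONSE" | list_savepoint_jobs
# prints: 2228d8c5e592ee93b8e3b1e33d762a6b dba/bai-bpmn IN_PROGRESS
```

The jid value in the first column is the <job-id> that the cancel and checkpoint endpoints expect.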
Retrieving the latest checkpoint
This command
curl -X GET -k -u admin:passw0rd "https://<your_bai_management_url_route>/api/v1/processing/jobs/2228d8c5e592ee93b8e3b1e33d762a6b/checkpoint"
returns
{"jid":"2228d8c5e592ee93b8e3b1e33d762a6b","completed_trigger_timestamp":1612517197656,"completed_location":"/mnt/pv/checkpoints/dba/bai-bpmn/2228d8c5e592ee93b8e3b1e33d762a6b/chk-76","savepoint_trigger_timestamp":1612517096739,"savepoint_location":"/mnt/pv/savepoints/savepoint-2228d8-cb0d3350a473","restored_location":null}
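To reuse the checkpoint location as a plain value, for example when you set the <job_name>.recovery_path parameter, you might extract it with the -r option of jq so that the surrounding JSON quotation marks are dropped. This is a sketch against the sample response above, and `latest_checkpoint_path` is a hypothetical helper name.

```shell
# Sample response copied from the example above.
CHECKPOINT_RESPONSE='{"jid":"2228d8c5e592ee93b8e3b1e33d762a6b","completed_trigger_timestamp":1612517197656,"completed_location":"/mnt/pv/checkpoints/dba/bai-bpmn/2228d8c5e592ee93b8e3b1e33d762a6b/chk-76","savepoint_trigger_timestamp":1612517096739,"savepoint_location":"/mnt/pv/savepoints/savepoint-2228d8-cb0d3350a473","restored_location":null}'

# Hypothetical helper: read a checkpoint response from standard input and print
# the location of the latest completed checkpoint as a raw string.
# "// empty" prints nothing when the field is null or missing.
latest_checkpoint_path() {
  jq -r '.completed_location // empty'
}

printf '%s\n' "$CHECKPOINT_RESPONSE" | latest_checkpoint_path
# prints: /mnt/pv/checkpoints/dba/bai-bpmn/2228d8c5e592ee93b8e3b1e33d762a6b/chk-76
```

An empty result means that the response reported no completed checkpoint, so there is no path to restart from.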