Restarting from a checkpoint or savepoint

You can restart event processing from a specific checkpoint or savepoint, typically after an upgrade or an unrecoverable error.

Before you begin

Make sure that the jq command-line JSON processor is installed. The jq tool is available from this page: https://stedolan.github.io/jq/.

About this task

Restarting from a checkpoint or savepoint lets you fix the cause of an unrecoverable error and then resume processing from a valid savepoint or checkpoint.

Savepoints for upgrading
You create savepoints as part of the upgrade procedure. See Upgrading Business Automation Insights.
Important:
  • Starting from savepoints is mandatory when you upgrade Business Automation Insights and the new version is based on a new Apache Flink version.
  • If you restart from a checkpoint, do not change the parallelism of the Flink job. You can change the parallelism of the Flink job only in the context of restarting from a savepoint.
Savepoints for recovery at any time
You can also create savepoints at any time by running the HTTP POST request command as described next. If a job failure is preventing you from creating a savepoint, that is, if the savepoints endpoint returns an error, use the latest successful checkpoint.

Procedure

  1. Retrieve the management service URL.
    export INSIGHTS_ENGINE_CR=icp4adeploy
    export NAMESPACE=<namespace>
    export MANAGEMENT_URL=$(oc get insightsengine ${INSIGHTS_ENGINE_CR} -n ${NAMESPACE} -o jsonpath='{.status.components.management.endpoints[?(@.scope=="External")].uri}')
  2. Retrieve the management service username and password.
    export MANAGEMENT_AUTH_SECRET=$(oc get insightsengine ${INSIGHTS_ENGINE_CR} -n ${NAMESPACE} -o jsonpath='{.status.components.management.endpoints[?(@.scope=="External")].authentication.secret.secretName}')
    export MANAGEMENT_USERNAME=$(oc get secret ${MANAGEMENT_AUTH_SECRET} -n ${NAMESPACE} -o jsonpath='{.data.username}' | base64 -d)
    export MANAGEMENT_PASSWORD=$(oc get secret ${MANAGEMENT_AUTH_SECRET} -n ${NAMESPACE} -o jsonpath='{.data.password}' | base64 -d)
  3. Create savepoints for all the running processing jobs by sending an HTTP POST request to trigger the endpoint for savepoints.
    curl -X POST -k -u ${MANAGEMENT_USERNAME}:${MANAGEMENT_PASSWORD} "${MANAGEMENT_URL}/api/v1/processing/jobs/savepoints" | jq .
  4. If you need to stop job processing right after the creation of the savepoints, send the HTTP POST request by using the curl tool and set the cancelJob query parameter to true.
    Typically, this action is required for an upgrade.
    curl -X POST -k -u ${MANAGEMENT_USERNAME}:${MANAGEMENT_PASSWORD} "${MANAGEMENT_URL}/api/v1/processing/jobs/savepoints?cancelJob=true"

    The savepoints are stored in the /mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<id> directory.

  5. Optional: Only if the savepoints HTTP POST request returns an error, use the latest successful checkpoint instead.

    The HTTP POST request for savepoints returns the names and identifiers of the jobs that failed to create savepoints.

    1. Cancel the jobs to prevent the creation of new checkpoints.
      curl -X POST -k -u ${MANAGEMENT_USERNAME}:${MANAGEMENT_PASSWORD} "${MANAGEMENT_URL}/api/v1/processing/jobs/<job-id>/cancel"
    2. Retrieve the latest successful checkpoint.
      curl -X GET -k -u ${MANAGEMENT_USERNAME}:${MANAGEMENT_PASSWORD} "${MANAGEMENT_URL}/api/v1/processing/jobs/<job-id>/checkpoint" | jq ".completed_location"
  6. To restart a job from a savepoint, follow the procedure in Updating your Business Automation Insights custom resource, and set the bai_configuration.<job_name>.recovery_path parameter to the full path where the savepoint is stored.
    By default, you can restart a job from the same checkpoint or savepoint only once. This is a safety mechanism in case you forget to remove the value of the <job_name>.recovery_path parameter. If you try to restart more than once, the job submitter returns an error message such as:
    Error: The savepoint /mnt/pv/savepoints/dba/bai-<job-name>/savepoint-<id> was already used. The Job won't be run from there.
    The job resumes processing from the moment the specified checkpoint or savepoint was created.
    Important:
    Do not prefix the savepoint path with file://.
    Do not specify the savepoint path for the ODM Flink job.
  7. Optional: If you really need to restart a job from the same checkpoint or savepoint more than once, first delete the /mnt/pv/savepoints/dba/<job-name>/<savepoint-id> directory on the persistent volume (PV).
    oc exec -it <xxx>-ep-taskmanager-<n> -- bash
    rm -rf /mnt/pv/savepoints/dba/<job-name>/<savepoint-id>
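
Every curl command in the procedure depends on the environment variables that are set in steps 1 and 2. If one of the oc queries matches nothing, the corresponding variable can end up empty and the later requests fail in ways that are hard to read. The following is a minimal sketch of a guard that could be run before step 3. The require_vars function name is hypothetical and is not part of the product.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
# Returns 0 when all of the named variables have a value, 1 otherwise.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "Missing or empty: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Possible usage after steps 1 and 2:
#   require_vars NAMESPACE MANAGEMENT_URL MANAGEMENT_USERNAME MANAGEMENT_PASSWORD || exit 1
```

The sketch uses eval so that it also runs in a plain POSIX shell. Pass only literal variable names to it.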

Examples

Triggering savepoints endpoints
This command
$ curl -X POST -k -u ${MANAGEMENT_USERNAME}:${MANAGEMENT_PASSWORD} ${MANAGEMENT_URL}/api/v1/processing/jobs/savepoints
returns
[{"name":"dba/bai-bpmn","request-id":"49bfd59e16a1b521c4ff63bf152d4fd0","jid":"554ad87ad1b8eff85f910e791ff6952d","state":"COMPLETED","location":"file:/mnt/pv/savepoints/dba/bai-bpmn/savepoint-554ad8-2f214ebb11e2"}]
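
The response is a JSON array with one entry per job, so jq can be used to decide whether the checkpoint fallback in step 5 is needed and to obtain a path that does not carry a file: prefix. The following is a minimal sketch. It assumes the response shape shown in this example and that any state other than COMPLETED means that the savepoint was not created. The function names are hypothetical.

```shell
# Sample response copied from the example above; in practice this is the curl output.
response='[{"name":"dba/bai-bpmn","request-id":"49bfd59e16a1b521c4ff63bf152d4fd0","jid":"554ad87ad1b8eff85f910e791ff6952d","state":"COMPLETED","location":"file:/mnt/pv/savepoints/dba/bai-bpmn/savepoint-554ad8-2f214ebb11e2"}]'

# Hypothetical helper: print "<name> <jid>" for every job whose state is not COMPLETED.
# Assumption: any other state means that no usable savepoint exists for that job.
failed_savepoints() {
  jq -r '.[] | select(.state != "COMPLETED") | "\(.name) \(.jid)"'
}

# Hypothetical helper: print the location of every completed savepoint
# with a leading "file:" or "file://" scheme removed.
savepoint_path() {
  jq -r '.[] | select(.state == "COMPLETED") | .location | sub("^file:(//)?"; "")'
}

printf '%s\n' "${response}" | failed_savepoints   # prints nothing for this sample
printf '%s\n' "${response}" | savepoint_path
# prints /mnt/pv/savepoints/dba/bai-bpmn/savepoint-554ad8-2f214ebb11e2
```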
Triggering checkpoint endpoints
This command
$ curl -X GET -k -u ${MANAGEMENT_USERNAME}:${MANAGEMENT_PASSWORD} ${MANAGEMENT_URL}/api/v1/processing/jobs/554ad87ad1b8eff85f910e791ff6952d/checkpoint
returns
{"jid":"554ad87ad1b8eff85f910e791ff6952d","completed_trigger_timestamp":1638718929999,"completed_location":"/mnt/pv/checkpoints/dba/bai-bpmn/554ad87ad1b8eff85f910e791ff6952d/chk-7571","savepoint_trigger_timestamp":1638718895315,"savepoint_location":"/mnt/pv/savepoints/dba/bai-bpmn/savepoint-554ad8-5072f25f0d3b","restored_location":"/mnt/pv/savepoints/dba/bai-bpmn/savepoint-6156e2-34ef68a03c2d"}
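
In the procedure, the checkpoint response is piped through jq ".completed_location", which prints the value with its JSON quotation marks. When the path is to be copied into the recovery_path parameter, the raw string is usually more convenient. The following is a minimal sketch that uses the jq -r and -e options. It assumes the response shape shown in this example, with the sample shortened to the relevant fields. The latest_checkpoint function name is hypothetical.

```shell
# Sample response shortened from the example above; in practice this is the curl output.
checkpoint_response='{"jid":"554ad87ad1b8eff85f910e791ff6952d","completed_trigger_timestamp":1638718929999,"completed_location":"/mnt/pv/checkpoints/dba/bai-bpmn/554ad87ad1b8eff85f910e791ff6952d/chk-7571"}'

# Hypothetical helper: print the raw checkpoint path.
# -r prints the string without JSON quotation marks.
# -e together with "// empty" makes jq exit with a nonzero status
# when completed_location is missing or null.
latest_checkpoint() {
  jq -er '.completed_location // empty'
}

if path=$(printf '%s\n' "${checkpoint_response}" | latest_checkpoint); then
  echo "${path}"
else
  echo "No completed checkpoint reported" >&2
fi
# prints /mnt/pv/checkpoints/dba/bai-bpmn/554ad87ad1b8eff85f910e791ff6952d/chk-7571
```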