Checkpoints and savepoints

To ensure recovery from possible job failures, IBM® Business Automation Insights uses Apache Flink checkpoints and savepoints, and high availability mode.

Checkpoints

Business Automation Insights processing jobs are stateful. To make this state fault tolerant, checkpoints are taken at regular intervals. Checkpoints are stored in the persistent volume (PV), in the /checkpoints folder.

New in 18.0.2: The /checkpoints folder contains one subfolder for each processing job.

You can modify the checkpointing frequency by setting the Apache Flink Job Checkpointing Interval option in IBM Business Automation Configuration Container when you install Business Automation Insights, or by setting the flink.jobCheckpointingInterval parameter on the Helm command line when you update your deployed Business Automation Insights configuration. This property sets the number of milliseconds between two consecutive checkpoints. The default value is 5000 ms.
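
For example, to change the interval to 10000 ms on an existing deployment, you might run a Helm upgrade similar to the following sketch. The release name and chart reference are placeholders; substitute the values that are used in your environment.

   helm upgrade my-bai-release <bai-chart> --reuse-values \
     --set flink.jobCheckpointingInterval=10000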

Savepoints

Each time a new version of a processing job is deployed, the job submitters create a savepoint so that the running job stops safely and the new job version is redeployed from that newly created savepoint. Savepoints are stored in the persistent volume, in the /savepoints folder.

New in 18.0.2: The /savepoints folder contains one subfolder for each processing job.
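
To inspect these folders, you can list them from any pod that mounts the Business Automation Insights persistent volume. The pod name and mount path in this sketch are placeholders, shown for illustration only; they assume that the volume is mounted in the Flink task manager pods.

   kubectl exec <flink-task-manager-pod> -- ls <pv-mount-path>/checkpoints
   kubectl exec <flink-task-manager-pod> -- ls <pv-mount-path>/savepoints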

RocksDB state backend

Business Automation Insights processing jobs use the RocksDB state backend to store the checkpoints and savepoints in the persistent volume. To configure RocksDB, use the Flink predefined configuration options for this state backend. You can update these options by modifying the Apache Flink RocksDB Options Set (flink.rocksdbOptionsSet) property in the Helm release. By default, the DEFAULT option set is used.

For more information about these options, see the page for predefined options in the Flink documentation.
Note: Modifying RocksDB options might increase or decrease memory consumption by task managers.
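
For example, to switch from the DEFAULT option set to another Flink predefined option set, such as SPINNING_DISK_OPTIMIZED_HIGH_MEM, a Helm upgrade along these lines might be used. As before, the release name and chart reference are placeholders.

   helm upgrade my-bai-release <bai-chart> --reuse-values \
     --set flink.rocksdbOptionsSet=SPINNING_DISK_OPTIMIZED_HIGH_MEM

As the note above indicates, option sets such as this one can increase the memory consumption of the task managers.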

Failure and recovery

If any of the cluster components crashes, Flink recovers the jobs automatically after Kubernetes brings a new instance back up. The recovered job restarts from the latest successful checkpoint, and processing resumes from where it was before the failure.

If the failure occurs while a processing job is in progress, for example because a wrong job parameter value results in unauthorized access to the Elasticsearch or HDFS servers, the job restarts from the latest checkpoint. As long as the issue is not corrected, the job fails again and restarts from the latest checkpoint. If the issue can be resolved on the external service, the job eventually runs correctly again and event processing resumes from where it left off before the failure. If the issue must be solved by updating the job configuration, set the new values as described in Apache Flink parameters. In that case, a savepoint is created before the job is updated; the job is then canceled and the new version is submitted to the cluster from that savepoint, so that processing resumes from where it was before the failure.
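
The job submitters perform this savepoint-and-redeploy sequence automatically. Conceptually, it corresponds to the following Apache Flink command-line steps; the job ID, savepoint paths, and JAR name are hypothetical and are shown only to illustrate the flow.

   # Take a savepoint and cancel the running job in one step.
   flink cancel -s /savepoints/<job-name> <job-id>
   # Submit the new job version, resuming from the savepoint that was just created.
   flink run -s /savepoints/<job-name>/<savepoint-id> <new-job-version>.jar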