Preparing your cluster for fault tolerance
To recover from job failures, IBM® Business Automation Insights relies on Apache Flink checkpoints and savepoints, and on high availability mode. You can restart a job from a previous checkpoint or savepoint. For some failure cases, you can also take preventive steps.
High availability configuration
For high availability and fault tolerance to be effective, you must set the appropriate number of replicas for Apache ZooKeeper, for embedded Elasticsearch nodes, and for the administrative service. For details, see Replica configuration for high availability and fault tolerance.
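As a rough illustration of why the replica count matters, the following Python sketch computes how many member failures a majority-quorum ensemble, such as a ZooKeeper ensemble, can tolerate. The function name and the sample replica counts are assumptions made for this example, not values taken from the product. Use the replica configuration topic for the actual settings.

```python
def tolerated_failures(replicas: int) -> int:
    """Return how many members of a majority-quorum ensemble can fail
    while a strict majority of members is still available."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # A quorum needs more than half of the members, so the ensemble
    # survives the loss of at most (replicas - 1) // 2 members.
    return (replicas - 1) // 2


if __name__ == "__main__":
    # Sample replica counts chosen for illustration only.
    for n in (1, 2, 3, 5):
        print(f"{n} replica(s): tolerates {tolerated_failures(n)} failure(s)")
```

Under this rule, going from one replica to two adds no tolerance for failures, which is why odd replica counts are commonly used for quorum-based services.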
Failure of Apache Flink job manager and task managers
When the job manager and task managers are deployed, their pods are configured to restart on failure, and jobs are configured for checkpointing. If a job manager or task manager pod fails, it restarts automatically and processing resumes from the latest completed checkpoint.
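The recovery behavior described above can be pictured with a small simulation. The following Python sketch is a conceptual model only and does not use the Flink API: the `store` dictionary stands in for durable checkpoint storage, `SimulatedPodFailure` stands in for a pod crash, and all names are hypothetical.

```python
class SimulatedPodFailure(Exception):
    """Stand-in for a job manager or task manager pod failure."""


def process(events, store, interval=3, fail_at=None):
    """Sum the events, resuming from the checkpoint saved in `store`.

    A checkpoint (next offset, running total) is saved after every
    `interval` events. If `fail_at` is set, a failure is raised just
    before the event at that index is processed.
    """
    offset, total = store.get("checkpoint", (0, 0))
    for i in range(offset, len(events)):
        if i == fail_at:
            raise SimulatedPodFailure(f"failed before event {i}")
        total += events[i]
        if (i + 1) % interval == 0:
            store["checkpoint"] = (i + 1, total)
    return total


def run_with_recovery(events, interval=3, fail_at=None):
    """Run the job once and, after a simulated failure, restart it."""
    store = {}  # survives the failure, like external checkpoint storage
    try:
        return process(events, store, interval, fail_at)
    except SimulatedPodFailure:
        # The restarted run rolls back to the last saved checkpoint and
        # replays the events that were processed after it.
        return process(events, store, interval)


if __name__ == "__main__":
    events = [1, 2, 3, 4, 5, 6, 7, 8]
    print(run_with_recovery(events, interval=3, fail_at=5))
```

In this model, events processed after the last checkpoint are replayed on restart, but the state is rolled back with them, so the final result matches a run without failure.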
Job failure
If a recoverable error occurs, such as a temporary network outage that prevents connection to Kafka or Elasticsearch, Flink jobs restart automatically. If a job cannot restart, see Restarting from a checkpoint or savepoint.
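The difference between a recoverable error and a job that cannot restart can be sketched as a bounded retry loop. This Python example is an illustration, not the product's implementation: the text above does not state which restart strategy the jobs use, so the fixed number of attempts, the fixed delay, and all names here are assumptions.

```python
import time


class RecoverableError(Exception):
    """Stand-in for a transient failure, such as a lost connection."""


def run_with_restarts(job, max_attempts=3, delay_seconds=0.0):
    """Call `job()` and retry after a RecoverableError.

    Waits `delay_seconds` between attempts. When all attempts are used
    up, the last error is re-raised, which corresponds to the case
    where the job must be restarted manually.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except RecoverableError:
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_job():
        # Fails twice, as if the network were briefly unavailable.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RecoverableError("connection refused")
        return "finished"

    print(run_with_restarts(flaky_job, max_attempts=3))
```

A short outage is absorbed by the retries. An outage that outlasts every attempt surfaces as an error, at which point a restart from a checkpoint or savepoint is needed.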