IBM Cloud Pak for AIOps backup and restore

Learn how to back up and restore IBM Cloud Pak for AIOps, such as for recovering from a disaster like a complete data center outage.

Overview

IBM Cloud Pak for AIOps backup and restore is based on the storage and database backup of critical datastores, cluster resources, and component data and settings. With the backup and restore feature, you can restore your services to a previous point-in-time backup to recover from data corruption, system failures, or user errors.

You can also use the backup and restore feature as a means of copying data from one cluster to another, or from one environment into another, such as for disaster recovery purposes.

If you are using backup data to restore environments on a new cluster in another data center, ensure that your recovery time objective (RTO) and recovery point objective (RPO) are large enough to accommodate this usage.

IBM Cloud Pak for AIOps uses the OpenShift APIs for Data Protection (OADP) to back up data to object storage. In the case of a failure, OADP is used to restore data from the backups.

For more information about these tools, see:

Planning for backup and restore

When you are planning your IBM Cloud Pak for AIOps backup and restore strategy, consider what hardware requirements are needed for backing up your environment:

  • You need to install the backup and restore tools in the Red Hat OpenShift Container Platform cluster. These tools include the OpenShift APIs for Data Protection (OADP), which creates the required CRs, such as DataProtectionApplication and BackupStorageLocation. You need to configure OADP with appropriate object storage for storing your backups. This object storage must support the RWX (ReadWriteMany) access mode.
  • You can back up both smaller starter and larger production-sized cluster deployments.
  • You can use backed-up data to restore data to the existing cluster where the data was backed up, or to a new cluster.
  • A restored cluster should have the same size and high availability level as the original cluster.
  • Data is copied to and stored in object storage as part of the backup process. During a restoration, the same data is fetched and restored.
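For illustration, a minimal OADP configuration covering the CRs mentioned above might resemble the following sketch. The resource names, bucket, region, and endpoint URL are assumptions for this example, not values that the product requires:

```yaml
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-aiops                   # illustrative name
  namespace: openshift-adp
spec:
  configuration:
    velero:
      defaultPlugins:
        - openshift
        - aws
  backupLocations:
    - velero:
        provider: aws
        default: true
        objectStorage:
          bucket: aiops-backups     # illustrative bucket name
          prefix: backups
        config:
          region: us-east-1                # illustrative region
          s3Url: https://s3.example.com    # your S3-compatible endpoint
        credential:
          name: cloud-credentials
          key: cloud
```

Creating this resource also creates a BackupStorageLocation that points Velero at the configured bucket.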

Notes:

  • The backup and restore processes can take time. Restoring data or configurations from a backup is most useful when your recovery time objective (RTO) and recovery point objective (RPO) can be met.
  • As your data grows, the size of your backup storage might need to grow.
  • A backup and restore of Red Hat OpenShift Container Platform or etcd is not included in the IBM Cloud Pak for AIOps backup and restore feature.
  • Configure the duration between your backups to be less than your RPO. By default, the RPO is 12 hours, which requires backups to run once every 12 hours. The default RTO is 4 hours.
  • The Flink state for log anomaly detection is not backed up, and any training is lost during a failure or disaster. You need to run the training again after you recover from the disaster.
  • If you upgrade your deployment, you must set up backup and restore again. For more information, see Upgrading IBM Cloud Pak for AIOps backup and restore artifacts.
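As a sketch of the scheduling guidance above, a generic Velero Schedule resource can express a backup interval that stays within the default 12-hour RPO. IBM Cloud Pak for AIOps runs its own backup jobs, so treat this only as an example of mapping a cron interval to an RPO; the names and namespace are assumptions:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: aiops-backup-schedule      # illustrative name
  namespace: openshift-adp
spec:
  # Run every 12 hours so that the gap between backups
  # never exceeds the default 12-hour RPO.
  schedule: "0 */12 * * *"
  template:
    includedNamespaces:
      - cp4aiops                   # illustrative namespace
    ttl: 720h0m0s                  # keep backups for 30 days
```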

Backup process

The IBM Cloud Pak® for AIOps backup process is capable of backing up data for both IBM Cloud Pak for AIOps and Infrastructure Automation. The backup is completed by backing up the datastores that contain IBM Cloud Pak for AIOps data and, if available, Infrastructure Automation data. Some datastores are backed up and restored in their entirety. Other datastores are backed up and restored only as needed by particular components.

The following table shows the IBM Cloud Pak® for AIOps datastores and persistent volume claims (PVCs) that are backed up:

Table. Backed up datastores

| Datastore | Backup type | Backup PVC name | Access mode |
| --- | --- | --- | --- |
| CouchDB | Volume backup | data-c-example-couchdbcluster-m-0 | ReadWriteOnce |
| Cassandra | Data export | back-aiops-topology-cassandra-0 | ReadWriteOnce |
| EDB Postgres | Data export | postgres-backup-data | ReadWriteOnce |
| Minio | Volume backup | export-aimanager-ibm-minio-0 | ReadWriteOnce |
| Elasticsearch | Partial (Elasticsearch native volume backup used) | iaf-system-elasticsearch-es-snap-es-backup-pvc | ReadWriteMany |
| Kafka | None | | |
| Flink | None | | |
| MongoDB | Data export | my-mongodump | ReadWriteOnce |
| Metastore | Data export | metastore-backup-data | ReadWriteOnce |
| Other resources (such as the Tunnel custom resource) | Volume export | other-resources-backup-data | ReadWriteOnce |
| CAM | Volume export | cam-mongo-pv | ReadWriteOnce |
| Infrastructure Management | Data export | postgresql | ReadWriteOnce |

The following table shows the backed up Kubernetes resources:

Table. Backed up resources

| Backup resource name | Resource type |
| --- | --- |
| dummy-db | pod |
| es-backup | pod |
| backup-back-aiops-topology-cassandra | pod |
| backup-data-c-example-couchdbcluster-m | pod |
| backup-export-aimanager-ibm-minio | pod |
| backup-metastore | pod |
| backup-postgres | pod |
| backup-other-resources | pod |
| backup-cam | pod |
| aiops-ir-core-model-secret | secret |
| aimanager-ibm-minio-access-secret | secret |
| zen-secrets-aes-key | secret |
| BundleManifest | custom resource |
| ConnectorConfiguration | custom resource |
| ConnectorSchemas | custom resource |

The backup process is run by custom backup jobs. When a backup job runs, pods can be scaled down to ensure that consistent backups are created. The backup job calls OADP to back up Kubernetes resources and volumes, and native backup scripts are run to back up datastores. The backup job then scales up pods to match their state before the backup process began. Datastores that have a selective backup type are backed up only if the datastore was updated since the previous backup.
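The OADP portion of a backup job can be pictured as a Velero Backup resource. This is a hedged sketch, not the product's actual job definition; the namespace and label selector are assumptions:

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: aiops-backup-example       # illustrative name
  namespace: openshift-adp
spec:
  includedNamespaces:
    - cp4aiops                     # illustrative namespace
  labelSelector:
    matchLabels:
      aiops-backup: "true"         # illustrative label on resources to back up
  # Also copy the persistent volumes that back the selected pods
  # using file-system backup.
  defaultVolumesToFsBackup: true
```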

For more information about the backup process, see:

Note: For Elasticsearch, the backup and restore process creates incremental backup snapshots. When creating a snapshot, Elasticsearch analyzes the index files that are already stored in the repository and copies only the files that were created or changed since the last snapshot. This process allows multiple snapshots to be stored in a compact form.

Component data backups

When the backup process runs for IBM Cloud Pak for AIOps, data for the following components is backed up. This data includes cluster resources, such as Secrets, Pods, CRDs, CRs, PVCs, and PVs.

  • AI model management

    Log anomaly detection and metric anomaly detection templates and models are backed up, along with model and training metadata, secrets, and configurations. If this data is not backed up and is lost during a disaster, you cannot run training until you collect training data again. In addition, your models and templates can be lost, which can prevent principal component analysis (PCA) from running until you run training to rebuild the models. You might not be able to re-create historical models if data loss occurs.

    For similar ticket analysis, the trained models are backed up.

    Notes

    • Component data, such as for log anomaly detections or similar ticket analysis, is not encrypted when it is backed up.
    • Raw data and data-in-transit, such as Kafka topics, are not backed up.
    • logtrain indices and other training data are not backed up, other than training metadata.
    • Before you run the backup process, stop all training that is in progress. Do not start or restart training until the backup process is completed.
  • Change risk

    For change risk, risk models, risk assessments, incidents, and ticket data assets are backed up.

    Notes

    • Data-in-transit is not backed up.
    • If change or incident ticket data loss occurs, you might need to run data collection again from your ticketing system.
    • If change risk model loss occurs, you need to re-create the lost models by using the IBM Cloud Pak for AIOps console. Historical models can be difficult to re-create.
    • If risk assessment data is lost, there is no way to reconstruct the assessments.
  • Automations (Runbooks)

    For Automations, runbooks, automations, execution records, and configurations are backed up.

  • Integrations

    Integration CRs, Secrets, and configurations are backed up.

  • Event & Topology analytics

    Probable cause data, topology data, configurations, and indexed data are backed up.

  • Policy management

    For policy management, operational policies and policy run time statistics (last run, count of executions) are backed up.

  • Secure Tunnel

    Tunnel CRs, which represent connections, application mappings, and templates, are backed up.

Restore process

To run the restore process, you first need to create a new cluster and install the required prerequisites on the cluster. The prerequisites include installing the required CLI tools, creating the storage classes, and updating the bcdr/common/aiops-config.json and bcdr/restore/restore-data.json files with the required configuration values. Then you need to configure OADP to point to the same S3 object storage where the backup is located.

You then need to create the required namespaces for IBM Cloud Pak® for AIOps by running the restore-namespace.sh job. This script creates only the namespaces and namespace metadata, not the contents of the namespace. This namespace restore is needed because the metadata contains SELinux settings that must match the settings from the corresponding namespaces on the cluster where you took the backup.
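As an illustration of why this namespace metadata matters, OpenShift records SELinux and UID-range settings as namespace annotations; pods restored into a namespace with different values would run with mismatched SELinux labels. A sketch of such a namespace follows, where the name and annotation values are example assumptions:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: cp4aiops                   # illustrative namespace name
  annotations:
    # OpenShift assigns these per namespace; pod SELinux labels and UIDs
    # are derived from them, so they must match the backed-up cluster.
    openshift.io/sa.scc.mcs: s0:c26,c15
    openshift.io/sa.scc.uid-range: 1000650000/10000
    openshift.io/sa.scc.supplemental-groups: 1000650000/10000
```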

Then, you need to complete the installation of IBM Cloud Pak for AIOps and create an IBM Cloud Pak for AIOps instance. With the instance created and OADP configured, you can run the OADP restore jobs to populate the instance. When the jobs complete, run any remaining post-restore tasks.

Note: OADP does not support overriding resources during a restore. The restoration of resources that already exist in the cluster is skipped. If this occurs, sync the existing resources and create only the resources that were not restored, instead of creating new resources.
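A generic Velero Restore resource for this step might look like the following sketch. The backup name and namespace are assumptions; newer Velero releases also provide an existingResourcePolicy field that controls how resources already present in the cluster are handled:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: aiops-restore-example      # illustrative name
  namespace: openshift-adp
spec:
  backupName: aiops-backup-example # name of an existing backup to restore from
  includedNamespaces:
    - cp4aiops                     # illustrative namespace
  # "none" (the default) skips resources that already exist in the cluster.
  existingResourcePolicy: none
```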

For more information about the restore process, see Restoring IBM Cloud Pak for AIOps.

Procedures for backing up and restoring IBM Cloud Pak for AIOps