IBM Well-Architected Framework
Resiliency
Overview

Reliability is the ability of your service or application to be available for your users as intended, while resiliency is the capacity for your service or application to recover from different types of failures.


Why is resiliency important?

A business-critical application that is not well-designed for resiliency can have a significant negative impact on your business operations and your user experience. Users expect services to perform well and always be available. When a system does not meet its reliability goals, it can lose the trust of stakeholders and potentially cause financial and reputational loss for your organization. Examples of negative user impact include "my package did not arrive today," "we cannot fuel the aircraft because the fueling system is not available," or even worse.

The reliability needs of a system grow and become more complex as the overall system grows and its interactions with external services increase. Architects must use deep critical analysis and drive reliability requirements until they are understood, documented, and published. These reliability requirements must be described with accompanying design traceability and supporting tests to verify that the needs have been met.

Sometimes stakeholders may not appreciate why resiliency and reliability are important, and that reliability requirements must be explicitly stated to determine whether a system is reliable. In these cases, stakeholders need to be educated on the consequences to the business if reliability requirements are not met.


Principles

The following sections describe the principles that must be considered when designing your system for reliability.

Resiliency design choices and trade-offs must be tightly coupled to business requirements. Every architectural decision around resiliency can have an impact on business operations, user experience, overall cost, and operability.

When you define resiliency and reliability targets, start from a business perspective. Not every system or application needs to have 99.999% uptime. A highly resilient architecture for a system can significantly impact the project costs, such as initial development costs, hosting costs, operating costs, and on-going support and maintenance costs. You need to work closely with your business and IT stakeholders to define and understand:

  • The proposed budget for the system (ensure that the system architecture and architectural decisions align with the budget).
  • The potential impact of downtime on revenue, customer satisfaction, and brand reputation.

These insights are vital as they help you and your organization balance the cost of designing a highly resilient system with the business value and make strategic decisions that optimize resiliency investments. You can use an iterative process to eventually land on a common ground between business needs and technical implementation.

Designing for failure involves a thorough analysis of all potential types of failures (for example, application, middleware, data, connectivity, data center, zone, region, and so on) and quantifying the impact of each failure scenario in relation to hard business requirements. Your design must enable the solution to maintain desired service levels despite the occurrence of one or more failures.

Reliability can't be added to a pre-existing system. The system and its components must be designed and implemented with reliability in mind. It's a shared responsibility of everyone who contributes to the software development lifecycle, including the architect, the product owner, the DevOps engineer, and the site reliability engineer.

To implement reliability into a software component, you can use the key architectural patterns described in the practices pages. You don't need to use every pattern or technique, but you need to know which ones are available and which one is right for a task.

When you're skilled in these techniques, you can have a meaningful conversation with the product owner on the reliability targets for a service. These techniques can help you gauge the viability and effort that are required to meet a certain reliability target.

Cyber resiliency, a component of operational resiliency, focuses on providing data protection and business continuity in case of cyber attacks, including scenarios where data is encrypted by ransomware. It includes the recovery of data as well as ensuring that an operating environment can be used while the production environment is not available.

Cyber attacks can happen to large enterprises, even those with robust security technologies in place. A strong cyber resiliency strategy that provides a unified approach combining cybersecurity with data protection and disaster recovery methods can help businesses protect against and rapidly recover from disruptive cyber incidents.

A well-architected and resilient cloud application is distributed and built from smaller components with various lifecycles that may scale up or down based on demand. It becomes harder to rely on traditional monitoring and logging to diagnose problems. It is important to monitor applications to quickly detect deviations from the norm, as defined through service level objectives and service level indicators. There must be sufficient insight into the system to isolate problems and mitigate issues as quickly as possible.

Observability uses external outputs (metrics, logs, and traces) to infer the internal state of system components. These outputs can be used to provide different views of the system, including events, dashboards, and links. By integrating these distinct views into one solution, a site reliability engineer can gain better insights and identify and resolve incidents faster.

Embedding resiliency testing for cloud workloads into planning and day 2 operations involves integrating resiliency considerations into the entire lifecycle of cloud-based services.

One of the key activities is resiliency test planning, which is about developing a comprehensive resiliency testing strategy that includes test scenarios, methodologies, and success criteria (in alignment with resiliency requirements and objectives).

Another key activity is testing in production-like environments. This means performing resiliency testing in these environments to observe how the system behaves in conditions that closely resemble the actual usage. In addition, reliability can greatly benefit from negative testing or chaos testing in these environments.

Finally, automated testing integration ensures that resiliency tests are well integrated into the automated testing framework and CI/CD pipelines. This ensures that resiliency testing becomes a routine part of the development and deployment process.

Operating a service reliably with velocity and scale requires automation. Using automation techniques like infrastructure-as-code (IaC) contributes to reliable services and better user experiences by reducing manual intervention, automating recovery processes, and enabling rapid and consistent responses to failures.

Practices


The following best practices are intended to help you design resilient systems:

Application resiliency

Reliability can't be added after a system is built. The system and its components must be designed and implemented with reliability in mind. It's a shared responsibility of everyone who contributes to the software development lifecycle, including the architect, the product owner, the DevOps engineer, and the site reliability engineer.

To implement reliability into a software component, you can use the key architectural patterns. Several important patterns and techniques that you need to understand to build reliable applications are described next. You don't need to use every tool, but you need to know which tools are available and which tool is right for a task.

When you're skilled in these techniques, you can have a meaningful conversation with the product owner on the reliability targets for a service. These techniques can help you gauge the viability and effort that are required to meet a certain reliability target.

Foundational patterns

First, there are three foundational patterns, each with their own trade-offs, to consider:

  • Redundant resources (trade cost): Have redundant resources to avoid single points of failure. Every component can fail, but the system is robust enough that an individual outage can be tolerated.
  • Degraded results (trade quality): Instead of expecting every transaction to succeed, sometimes it can be tolerable for a business to see some requests fail.
  • Retry transient failures (trade latency): Automatically retry requests that fail for transient reasons, trading extra latency for reliability (see the sketch after this list).
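
To make the second and third patterns concrete, here is a minimal Python sketch. The function and parameter names are illustrative, not from any specific library: it retries a transient failure a bounded number of times, then falls back to a degraded result instead of failing the request outright.

```python
import time

class TransientError(Exception):
    """A failure that is considered safe to retry, such as a timeout."""

def get_recommendations(user_id, fetch_live, cached_default, max_attempts=3):
    """Trade latency (bounded retries) and quality (fallback) for reliability."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_live(user_id)        # preferred, full-quality result
        except TransientError:
            if attempt < max_attempts:
                time.sleep(0.2 * attempt)     # brief pause before retrying
    # Degraded result: serve a cached or generic answer rather than an error.
    return cached_default
```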

 

Mitigating techniques

Building on the three foundational patterns, there are additional techniques that mitigate some consequences caused by the trade-offs or help compensate for failures.

  • Circuit breaker: Detect failures and encapsulate the logic of preventing a failure from constantly recurring during maintenance, temporary external system failure, or unexpected system difficulties (a combined circuit breaker and backoff sketch follows this list).
  • Sidecar: Enable a service mesh control plane to secure, control, and observe.
  • Exponential backoff: Space out repeated requests or retransmissions of the same block of data, often to avoid resource congestion.
  • Waterfall: Run multiple instances of a transaction and use one result while you discard the other responses.
  • Partitioning or sharding: Partition a workload into distinct independent parts to improve availability and performance.
  • Fail static: When a dependency or control plane fails, keep serving the last known good state or configuration so that the service can continue to function and meet SLAs.
  • Caching: Store data so that future requests for it can be served faster.
  • Queuing: Queue requests and process them asynchronously to improve the stability of the system.
  • Throttling: Limit the number of resources that a service uses.
  • Load shedding: Deliberately drop or reject some requests or consumers to protect the overall system from collapse under overload.
  • Bulkhead: Isolate application components so that if one fails, it doesn't impact the others.
  • Waiting room (visitor prioritization): Provide a waiting room experience when your back-end application becomes overloaded.
  • Compensating transaction: Record all the steps of a workflow and start to undo the operations if a failure occurs.
  • Event-driven architecture: Integrate services through a publish/subscribe architecture.
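
The following minimal Python sketch illustrates two of these techniques together: a simple circuit breaker that fails fast while a dependency is unhealthy, and retries with exponential backoff and jitter. The class and function names are illustrative, not from a specific library; production code would typically use a resilience library or a service mesh instead.

```python
import random
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are short-circuited."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        # While open, fail fast until the cool-down period has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency unhealthy; skipping call")
            self.opened_at = None            # half-open: allow a trial call
            self.failures = 0
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0
        return result

def retry_with_backoff(operation, max_attempts=4, base_delay=0.5):
    """Space out retries exponentially, with jitter to avoid retry storms."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except CircuitOpenError:
            raise                            # don't retry into an open breaker
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```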

 

Non-architectural techniques

In addition to the architecture patterns, you can take other approaches to improve the reliability of a system. These techniques don't change the architecture directly; instead, they help you manage the system, understand the behavior of complex systems, and infer strategies to improve reliability.

  • Systems theory: Observe the behavior of a system as a whole, not its individual parts.
  • Observability: Monitor your service as consumers experience it to quickly detect deviations from the norm, as defined through service level objectives and service level indicators.
  • Chaos engineering: Inject failure into a system to improve reliability and resilience.
  • Recoverability: Recover quickly from a disaster in a non-routine way.


Data resiliency

Data resiliency ensures that, when a failure affects hosted data, the data remains available both to traditional applications and to applications that incorporate APIs and other services for analytics.

Choosing the correct set of data resiliency techniques and technologies in the context of an overall business continuity plan is vital, but it can be complex and difficult. How do you reconstruct a steady state? Might the recovery noticeably affect a user or system? You need to be able to establish a sync point, restart, understand anything that was in flight and not recoverable, and back out.

Here are some examples of techniques you might want to consider in your workload resiliency design:

  • Backups: Traditionally, organizations recover data by using data backups. Backups are usually anchored on single applications. However, when applications are interconnected, backups can face resynchronization challenges. The latency between backups can also create a gap of lost data. Depending on the underlying technologies of the data stores, a logging system might not be available to mitigate the gap of loss. Because backups are taken incrementally, it can be cumbersome to recompose an image to restore. Some modern databases are so large that backups aren't even taken.

  • Snapshots: Snapshots of logical disk units can address a backup that transcends individual applications. However, they might not always work well with various techniques of mirrored data and striped data.

  • Mirroring: Various techniques of mirroring exist. The mirrored copy can be on the same disk drive or pushed to a remote system. Typically, the operating system handles the mirroring, or replication. Mirroring techniques can vary between synchronous and asynchronous replication. With synchronous mirroring, the two copies of the data are identical, but latency constrains the physical distance that is permitted between the primary and secondary locations. Asynchronous mirroring doesn't usually have the distance limitation, but if an unexpected failure occurs, the latency might result in data loss. At the hardware level, peer-to-peer remote copy can provide a form of mirroring that enables resource services to provide for a controlled switchover or failover.

  • Flash copies: A flash copy can provide a fast point-in-time copy of the data. You can use the copy to bring an application online in a separate partition or system. This type of copying can also supplement the ability to complete an offline backup or populate data for non-production systems.

  • Logical replication: If you use logical replication to build a multi-system with high availability, be sure to use a transport mechanism that uses synchronous remote journaling. The journaling provides a way to replay.

  • Hardware replication: Hardware replication is done at the operating system or disk level instead of at the object level. One advantage that hardware replication has over logical replication is that hardware replication is done at a lower level. When replication is done synchronously, you're more likely to have identical copies of the data. The disadvantage is that the data is accessible from only one copy, and you can't use the second copy during active replication.
  • Software replication: Software or database replication is useful when you need to move to auxiliary systems, such as a data lake or a data warehouse. If you use change data capture (CDC) technology, the data replication software depends on the database that provides a logging mechanism.
     

Constraint: Network latency & data volume

The two primary enemies of a data resiliency initiative are data volume and network latency. To design for high availability, disaster recovery, or workload reallocation, you must factor in real-world physics.

The time to move data from one location to another depends on distance. The longer the distance, the longer the latency. Sending petabytes of data all at once is likely to clog the available bandwidth. All these constraints are independent of other considerations, such as time to rebuild an index or to create a sync point at the target location.
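
As a rough, back-of-the-envelope illustration (the numbers below are assumptions, not measurements), you can estimate bulk transfer time from the data volume and the sustained bandwidth you can realistically achieve:

```python
def transfer_hours(data_terabytes, effective_gbps):
    """Rough bulk-copy time: data volume divided by sustained throughput."""
    bits = data_terabytes * 1e12 * 8            # terabytes -> bits
    seconds = bits / (effective_gbps * 1e9)     # bits / (bits per second)
    return seconds / 3600

# Example: 500 TB over a sustained 2 Gbps link takes roughly 23 days,
# before any time for rebuilding indexes or establishing a sync point.
print(round(transfer_hours(500, 2) / 24, 1))    # ~23.1 days
```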

Whether you're moving data in bulk, trickling data via CDC or a message queue, or using a mirrored technology, you need data strategy, data topology, and data governance.

Build for reliability

It is important to define measurable resiliency and reliability targets to ensure that you are meeting your users' needs. The targets are typically expressed with service level objectives (SLOs), service level indicators (SLIs), and service level agreements (SLAs).

Service level objectives

The SLO is your user's service reliability target level. To be effective, the SLO must be achievable and reflect the technical reality of your organization. It needs to be high enough to keep your users satisfied, but low enough that the business can achieve it. Your SLO can be made stricter over time as your system improves.

The SLO is an agreement by the internal service manager, the development team, and the operations team. All parties must support it.

You should express the SLO as a percentage of achievement over a period. The closer the number is to 100%, the better the SLO is; however, an SLO is rarely targeted to be 100%.
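
For example, assuming a 30-day window (an illustrative choice), an availability SLO translates directly into an error budget of allowed downtime:

```python
def downtime_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime for an availability SLO over a rolling window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)

# 99.9% over 30 days allows about 43.2 minutes of downtime;
# 99.99% allows only about 4.3 minutes.
print(downtime_budget_minutes(99.9))     # 43.2
print(downtime_budget_minutes(99.99))    # ~4.3
```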

Service level indicators

To calculate your SLO, you need to measure the metrics, or your service level indicators (SLIs). SLIs are metrics that represent your service level from your users' point of view. The process to define your SLO and SLI is iterative. SLOs are driven by business requirements, whereas SLIs are driven by the available measurements.

The initial SLO might postulate objectives that either aren't technically achievable or are difficult and expensive to achieve. SLOs should be negotiated between affected parties, such as the product owner, the DevOps engineers, and the site reliability engineers, by comparing the expected business result with the current baseline. You can adjust SLOs and SLIs as business requirements arise or as important performance metrics become available or unavailable.
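
Here is a minimal sketch of the measurement side, assuming a request-based availability SLI (the event counts are placeholders):

```python
def availability_sli(good_events, total_events):
    """SLI: the percentage of measured events that met expectations."""
    return 100.0 * good_events / total_events

def remaining_error_budget(good_events, total_events, slo_percent):
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    allowed_bad = total_events * (1 - slo_percent / 100)
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad

# Example: 999,100 good requests out of 1,000,000 against a 99.9% SLO.
print(availability_sli(999_100, 1_000_000))              # 99.91
print(remaining_error_budget(999_100, 1_000_000, 99.9))  # ~0.1 of the budget left
```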

Service level agreements

A service level agreement (SLA) is an agreement with your users that defines the SLOs that must be met in a specific period of time. In addition, the SLA includes the commercial implications for not achieving the SLOs. An easy way to determine whether an SLO represents an SLA is to ask, "Are there any commercial implications if the SLO is not met?" If the answer is "yes", it is an SLA.

For a given service, the contractually agreed measure of a given attribute constitutes the SLA. For example, a common availability SLA is 99.95%, and a common latency SLA could be 300 milliseconds. SLAs are usually contractual obligations; thus, there may be implications for not achieving them. For example, cloud providers typically provide SLA credits or service credits if the promised SLAs for services are missed.

An SLA can also include other commitments, such as response time and repair time. Because an SLA is a business contract, it is useful to know the key indicators for a specific user's industry. In the area of business process, key performance indicators (KPIs) are used. KPIs are like service level indicators (SLIs), but the focus of a KPI is more on measurement that helps you understand how your organization or department is performing.

Related links:

  • Measure your service level objectives
  • IBM Cloud service level objectives

High availability (HA) is mainly about keeping the service available to your customers while ordinary activities are performed on the system, such as deploying updates, rebooting virtual machines, and applying security patches to host operating systems. High availability within a single cloud multizone region (MZR) can be achieved by eliminating single points of failure.

HA usually doesn’t deal with major unplanned or planned issues, such as complete site loss because of major power outages, earthquakes, severe hardware failures, full site connectivity loss, and so on. In such cases, if the service must meet strict service level objectives (SLOs), you should make the whole application stack (infrastructure, services, and application components) redundant by deploying it in at least two different cloud regions. This is typically defined as a disaster recovery (DR) architecture.

Disaster recovery

The approach to defining your disaster recovery strategy needs to be systematic and start with the business application you're providing on the cloud and the set of resources that are used to build it (for example, Kubernetes apps, virtual machines, cloud services, and so on).

While a holistic approach might be desirable, the reality is that each business application is independent, with its own recovery time objectives (RTOs) and recovery point objectives (RPOs), which for many customers are expressed as different classes of service.

Because each business application has a unique set of composite resources, each one needs to be reviewed to make sure that the strategy and requirements for DR for that resource are understood, documented, and implemented before going to production. To build the framework to drive that analysis and work, a set of profiles with common sets of resources for continuous availability and advanced recovery applications is described in Designing an architecture for your application resiliency objectives.

There are many options to implement DR solutions. For the sake of simplicity, they can be grouped into four major categories:

  • Active/passive: Active/passive options are based on keeping the full application stack active in one location, while another application stack is deployed in a different location but kept idle or shut down. In the case of prolonged unavailability of the primary site, the application stack is activated in the backup site. Often that requires restoration of backups that are taken in the primary site. This approach is not recommended when losing data can be a problem, for example, when the RPO is less than a few hours or when the availability of the service is critical and the RTO is less than a few hours.

  • Active/standby: In the active/standby case, the full application stack is active in both the primary and backup locations. However, users' transactions are served by the primary site only. The backup site keeps a replica of the state of the main location through data replication, such as database replication or disk replication. In cases of prolonged unavailability of the primary site, all client transactions are routed to the backup site (a minimal health-probe sketch follows this list). This approach provides good RPO and RTO, generally measured in minutes; however, it is significantly more expensive than the active/passive options because everything is deployed twice and the standby assets aren't used to improve scalability and throughput.

  • Active/active: In the active/active case, both locations are active, and client transactions are distributed to both regions according to predefined policies, such as round-robin or geographical load balancing. If one site fails, the other site must be able to serve all clients. It's possible to achieve both an RPO and RTO close to zero with this configuration. The drawback is that both regions must be sized to handle the full load, even though each is used at half of its capacity when both locations are available. In such cases, auto scaling can help keep resources allocated according to need. In addition, the data across both locations must be continuously synced with some kind of replication mechanism.

  • Backup & restore: The backup and restore approach centers around the consistent and periodic creation of copies of essential data and system configurations. These backups serve as point-in-time snapshots of the data, capturing a known good state of the information that can be used for recovery purposes. The backups are typically stored in separate locations from the primary system, providing an additional layer of protection against physical or logical damage to the primary storage.

    In the event of a disaster, such as hardware failure, data corruption, accidental deletion, or even a cybersecurity breach (for example, a ransomware attack), the backup and restore approach enables the organization to recover lost or compromised data by restoring it from the backup copies. Depending on the backup frequency and retention policy, the restoration process can return the system to its state just before the incident occurred.

    Note that this approach may not address the overall RTO and RPO of an organization in all scenarios. The time required to restore the data and the amount of data that could potentially be lost between backups may impact the speed and completeness of the recovery process.
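
As a minimal illustration of the routing decision in an active/passive or active/standby setup, the following Python sketch probes hypothetical health endpoints (the URLs are placeholders; in practice a global load balancer or DNS-based health check performs this role):

```python
import urllib.request

# Hypothetical health endpoints for the primary and backup deployments.
PRIMARY = "https://app.region-a.example.com/healthz"
BACKUP = "https://app.region-b.example.com/healthz"

def is_healthy(url, timeout=2.0):
    """A site counts as healthy only if its health check answers 200 in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_active_site():
    """Serve traffic from the primary while it is healthy; otherwise fail over."""
    return PRIMARY if is_healthy(PRIMARY) else BACKUP
```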

 

Site strategy

Site strategy is the most important aspect of the overall resiliency solution because it determines what classes of physical events the solution is able to address, sets the requirements in terms of distances, and sets constraints on the technology side.

Cloud providers typically provide a redundant infrastructure with several layers of resiliency that can be summarized as:

  • Local within an availability zone: An availability zone (AZ) is a physically and logically separated zone within a cloud location, such as a data center. Its power, cooling, and networking are independent from other AZs.

  • Across multiple availability zones in one region: Using multiple AZs within the same region offers a better level of protection in case of a limited disaster which affects only one of the AZs. AZs in the same region are usually in close proximity (tens of miles).

  • Across two regions in the same geography: Using two regions in the same geography provides a higher level of protection against natural disasters because sites are generally over 400 kilometers (250 miles) apart.

  • Across geographies: Selecting sites in different geographies extends the level of protection against natural disasters. Geographies are generally over 1500 kilometers (1000 miles) apart, thus representing the optimal option in terms of wider protection requirements. Analyze the latency between the cloud sites you choose for DR of your cloud-enabled workloads to avoid problems. Multiple sites in the same region help to implement DR with near-zero RPO, while remote secondary sites far from the primary site might require asynchronous replication techniques to be evaluated in your design.

Accordingly, when building cloud applications, you should implement one of the following depending on the type of resiliency you want to support for your application:

  • Zonal application:

    • Provision multiple instances of your application in one or more zones within a region to survive instance failure. Zonal applications are typically active/passive between two zones.
    • Implement data backup/restore strategy for application DR in a different region.
  • Regional application:

    • Provision application instances across three zones.
    • Implement application architecture and data replication strategy for application HA between the instances. Regional applications are typically active/active across zones.
    • Implement data backup and replication strategy for application DR to another region according to RTO and RPO required by your application. Regional applications are typically active/passive across regions.
  • Global application:

    • Implement data backup/replication strategy for application availability across 2 or more regions according to required RTO and RPO. Global applications are typically active/active across regions.

Related links:

  • Resiliency for cloud-native applications on Red Hat OpenShift
  • How IBM Cloud ensures high availability and disaster recovery
  • Creating a three-tier highly available architecture on an IBM Cloud VPC

Cyber resilience is an organization's ability to prevent, withstand, and recover from cybersecurity incidents. It brings together business continuity, information systems security, and organizational resilience. With incidents becoming more malicious and techniques more advanced, the strategies and plans to mitigate the impacts of such attacks must also change. Businesses need new technologies and practices to survive and adapt to today’s cyber outage scenarios.

Cyber resiliency focuses on resilience of data and the environments data resides in. It should be considered in addition to common resiliency techniques used for high availability and disaster recovery. While these techniques are important and must be part of the overall resilience program, they often focus on keeping data replicated with the smallest recovery point objective (RPO). This can result in the ransomware being replicated quickly across multiple environments.

Cyber resiliency best practices include:

  • Air-gapped protection as a fail-safe copy against propagated malware: Air-gapping is separating critical assets from other systems or networks, physically or virtually. In recent ransomware attacks, threat actors have used automated malware that can quickly traverse the network, creating havoc. As a best practice, you can create an air-gapped copy of critical assets (data and systems) to mitigate the risks of exposure and attacks.
  • Prevent backup corruption and deletion by using immutable storage technology: Ransomware attacks like WannaCry and NotPetya have established the need for stronger protection against the corruption or deletion of data. You can use WORM/immutable storage technologies to maintain the integrity of data, which in turn maintains business resiliency against attacks.
  • Clean rooms, data scanning, and cleansing tools for test and validation: Enable the verification and the recovery of data from an immutable backup copy.
  • Using automation and orchestration technologies as a part of response and recovery capabilities: Orchestration can augment analysis, giving your team quick access to information and the ability to respond faster. Automation can be utilized in the recovery of interconnected systems as manual recovery might introduce human error. This can facilitate a faster recovery.
  • Separation of duties: Use separate cyber resilience operators for the air-gapped recovery environment.

Related links:

  • What is cyber resilience?
  • Immutable object storage
  • IBM Storage Defender

Crucial to any resiliency program is a plan with requirements and procedures to test the resiliency of the systems involved.
 

Cloud resiliency testing
These recommendations for resilience testing of cloud applications are based on experience working with clients that have successfully deployed mission-critical workloads in the cloud and have considerable experience with production delivery.
 

Cloud provider resiliency testing
Cloud providers must support the following types of resilience and regularly test the corresponding failure scenarios:

  • Temporary failure of a single zone impacting one service.
  • Major data loss in a single zone impacting one service.
  • Temporary failure of a single zone impacting all services.
  • Disaster affecting a whole region and impacting all services.
     

Application resiliency testing
Clients must consider the resilience of all the components and dependencies of the application and test the failure scenarios for each of them, individually and in combination.


Zonal application:

  • Clients must test failover and failback (in the active/passive case) of the zonal application instances.
  • Clients must test backup and recovery from a different region. This should include a data loss scenario.
  • When clients use a zonal service or a component that is not highly available (for example, compute, storage, or network infrastructure such as VMs/VSS, VSIs, or bare metal), the clients are responsible for the resilience testing of that component or service as part of their application resilience testing.

Regional application:

  • Clients need to test that the application is actually deployed across three zones and still works (potentially with degraded service) when one or two zones fail.
  • DR to another region must also be tested.
     

Global application:

  • Clients should test the active/active behavior of the global application by simulating a region failure and making sure that requests are served by the other region(s) of the application deployment.
  • Clients should also test the data backup and replication for application availability across two or more regions according to the required RTO and RPO.

Chaos testing:

Chaos engineering is the practice of testing a system's response to turbulent behavior, such as infrastructure failures, unresponsive services, or missing components. The goal is to break the system in order to correct its architecture, understand its weak points, and anticipate how the system and the people operating it might behave during failures. Rigor, not randomness, is at the heart of chaos engineering. The practices of chaos engineering follow strict engineering procedures: attacks on a system are precisely defined as experiments, pre-planned in detail, completed in a contained manner, observed throughout, and learned from.
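
The following minimal Python sketch (all names are placeholders; dedicated chaos tooling is used in practice) shows the shape of one contained experiment: verify the steady state, inject a precisely scoped fault, and observe whether the hypothesis held.

```python
import random

def chaos_experiment(instances, steady_state_check, kill_instance):
    """One contained chaos experiment with a defined hypothesis and blast radius.

    `steady_state_check` and `kill_instance` are placeholders supplied by your
    platform tooling; they are not part of any specific product API.
    """
    # Hypothesis: losing any single instance does not violate the SLO.
    assert steady_state_check(), "abort: system unhealthy before the experiment"
    victim = random.choice(instances)        # precisely scoped blast radius
    kill_instance(victim)                    # inject the failure
    hypothesis_held = steady_state_check()   # did redundancy absorb the loss?
    return {"victim": victim, "hypothesis_held": hypothesis_held}
```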

By using these resilience testing best practices as a starting point, which include cloud provider testing and client application resilience testing recommendations, clients can tailor and adopt this model when planning and reporting cloud-related resilience risks to executives, boards, regulators, and auditors.

For more information on chaos engineering, see the following resources:

  1. chaos-engineering-principles
  2. spotlight/documents/193587

Resiliency testing requirements:
For cloud provider services, the cloud provider should support the following requirements:

Compliance with ISO, SOC 2 Type 2, and IBM Cloud Framework for Financial Services BCDR planning and testing requirements:

  • BCDR plan with yearly revalidation of BC/DR plan
  • Yearly test of BCDR plan in one region with rotation across worldwide locations
  • Test execution and results reports
  • Action plan for remediation of test findings

Perform the test at a regular cadence (at a minimum annually) to meet standard regulation requirements:

  • SOC 2 type 2, IBM Cloud Framework for Financial Services, future requirements, and so on
  • In case of significant changes: new services implementation, major architecture changes
  • Each time in one IBM Cloud MZR location with rotation and in every new region

Produce test evidence and records to meet auditing requirements from standard regulations:

  • Data backup and replication strategy for application availability across two or more regions according to the required RTO and RPO
  • BCDR plan that includes all services in scope, the related platform, infrastructure, internal, external 3rd party service and hardware (network, compute, storage) and data center assets (power, cooling, and so on)
  • Test execution and result reports
  • Action plan to address findings

IBM policies are fully aligned to industry best practices such as the ISO22301 framework, BCI GPG 2018, ITIL and ISO27001, and they address the need to document and review annually the business impact analysis (BIA) and the response to specific disruptive scenarios, such as workplace unavailability (for example, loss of IBM locations), workforce unavailability (for example, due to a pandemic), IT unavailability, and region-wide disasters. For client workloads, it is the client’s responsibility to meet HA and DR policies and local regulations that apply to the client’s business applications and services.

Related link: IBM's principles of Chaos Engineering

It is critical to automate operations to be able to run your service with velocity and scale. Using automation techniques contributes to reliable services and better user experiences by reducing manual intervention, automating recovery processes, and enabling rapid and consistent responses to failures.

The following are examples of areas where automation can be employed to improve resiliency:

  • Roll-back: Any time there is a system change, there is a possibility that failures can occur. When they do, the ability to automatically roll back to a previous working state lessens the impact (a minimal roll-back sketch follows this list).
  • Self-healing mechanisms: Automation can enable self-healing mechanisms that detect failures or performance degradation and automatically trigger recovery actions.
  • Scaling: Allowing the workload to automatically scale up and down based on utilization demand can help avoid service interruption due to insufficient or degraded resources.
  • Testing: Automated testing, including resiliency testing and chaos engineering, can be integrated into the continuous integration and deployment pipelines. This enables routine resiliency testing to be carried out during development and deployment phases, catching potential issues early.
  • Redundancy and failover: Automated failover mechanisms can swiftly redirect traffic to redundant resources when a primary component fails, minimizing downtime and service interruptions.
  • Monitoring and alerting: Automated monitoring tools can continuously track the health and performance of cloud workloads. When anomalies or failures are detected, automated alerts are generated, allowing teams to respond quickly to potential issues.
  • Rapid and consistent deployment: Automated deployment processes ensure that cloud workloads are provisioned consistently and according to predefined configurations. This minimizes the chances of misconfiguration that could lead to vulnerabilities or disruptions. One approach to automated deployment is to use infrastructure as code (IaC) tools like Ansible, Terraform, and IBM Cloud Schematics.
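
A minimal sketch of the roll-back idea, assuming a hypothetical deploy script and health endpoint (real pipelines would use their own deployment tooling, such as Terraform or IBM Cloud Schematics):

```python
import subprocess
import urllib.request

def healthy(url="https://service.example.com/healthz", timeout=2.0):
    """Post-deployment smoke check; the URL is a placeholder."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def deploy_with_rollback(new_version, previous_version):
    """Deploy, verify, and automatically roll back on a failed check.

    `./deploy.sh` stands in for whatever your pipeline actually runs.
    """
    subprocess.run(["./deploy.sh", new_version], check=True)
    if not healthy():
        subprocess.run(["./deploy.sh", previous_version], check=True)
        raise RuntimeError(f"{new_version} failed its smoke test; rolled back")
```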

Related links:

  • Terraform code for building resilient infrastructure on IBM Cloud
  • IBM Cloud Schematics

Observability is a set of practices, platforms, and tools that provide visibility into your systems and applications and help you avoid extended outages. Observability allows you to learn what is happening in your system and, in the event of an outage, ensures that you can find its root cause quickly. This is particularly important in a dynamic, distributed cloud environment, because it can be very difficult to correlate dependencies between components.
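
As one way to make these external outputs concrete, the following sketch uses the open-source prometheus_client library to expose two golden-signal metrics (request count and latency) for scraping. This illustrates the metrics output only; platforms such as IBM Instana collect equivalent telemetry automatically.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Requests served", ["status"])
LATENCY = Histogram("http_request_latency_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():                       # record how long the work takes
        time.sleep(random.uniform(0.01, 0.1))  # placeholder for real work
    REQUESTS.labels(status="200").inc()        # count completions by status code

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for the monitoring system
    while True:
        handle_request()
```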

It is very important to select an appropriate observability platform for your operations that meets your requirements. You'll want to consider the following questions:

  • Does it run early in your DevOps lifecycle?
  • Does it run on its own infrastructure?
  • Is it a SaaS offering or on premises?
  • Can it still report issues if its cloud provider is experiencing (regional) issues?

Example observability platforms include:

  • IBM Instana Observability can be used as a SaaS offering or run on different cloud providers or on premises.
  • IBM Cloud Pak for AIOps combined with Instana provides a leading observability platform for automated remediation, powered by a continuous stream of contextualized telemetry data.
     
Related links:

  • IBM Instana Observability
  • IBM Cloud Pak for AIOps
  • What is observability?
  • Monitor your applications by using the Golden Signals

Requirements

Defining and addressing the non-functional requirements (NFRs) for a system is among the most important of a software architect's responsibilities. NFRs are the quality attributes of a system, as distinct from the functional requirements, which detail a system's business features and capabilities. Examples of NFRs include key concepts such as reliability, performance, scalability, and security.

NFRs are cross-cutting in nature and affect multiple aspects of a system's architecture. It's important to articulate and address the NFRs early in the project lifecycle and to keep them under review as the system is produced. To help with that task, you can use the following list of key NFRs. The list is designed to be consulted when you define and evolve a system's architecture.

  • Availability and resilience
  • Performance and scalability
  • Operability
  • Metrics
  • Transactionality and consistency
  • Concurrency
  • Data access and data labeling
  • Regulatory
  • Accessibility
  • Globalization
  • Testing and testability
  • Costs
  • Evolution
  • Complexity

There are many standards and regulations around the world that have requirements for the resilience of systems and solutions in regulated industries. There are standards and regulations that require organizations to define the business recovery process, service levels, and recovery times that are acceptable for their business. Similarly, there are requirements for business recovery processes to be tested periodically, and to provide documented test results showing that SLAs have been met. For example:

  • The National Institute of Standards and Technology (NIST) has provided a number of special publications that have content related to resiliency.
  • The European Commission initiated the Digital Operational Resilience Act (DORA) to harmonize information and communications technology (ICT) regulation in the financial services sector in the European Union (EU), imposing common requirements in all EU member states in areas of risk management, including operational resilience.

IBM has also created frameworks and guidance to help application providers in regulated industries meet regulatory requirements when deploying applications on IBM Cloud. The following are examples within the financial services industry:

  • The IBM Cloud Framework for Financial Services is designed to help address the needs of financial services institutions with regulatory compliance, security, and resiliency during the initial deployment phase and with ongoing operations on IBM Cloud. The framework also helps to simplify the ability of financial institutions to transact with ecosystem partners who deliver software or SaaS applications on IBM Cloud, and who meet the requirements of the framework. In terms of resiliency, the framework provides best practices for handling business continuity and disaster recovery (BCDR) and high availability.
  • The European Financial Regulatory Guide highlights key outsourcing, Information and Communications Technology (ICT), and security risk management guidance promulgated by the European Banking Authority (EBA), the European Insurance and Occupational Pensions Authority (EIOPA), and the European Securities and Markets Authority (ESMA).




Resources

IBM Cloud Pak for AIOps

Innovate faster, reduce operational cost and transform IT operations (ITOps) across a changing landscape with an AIOps platform that delivers visibility into performance data and dependencies across environments. Embrace artificial intelligence, machine learning and automation to help ITOps managers and Site Reliability Engineers (SREs) address incident management and remediation.

Instana

IBM® Instana® Observability is the gold standard of incident prevention with automated full-stack visibility, 1-second granularity and 3 seconds to notify.

IBM Storage Defender

Detect threats such as ransomware, exfiltration and insider attacks, leveraging intelligent software from IBM and its ecosystem partners. These tools can help you identify the safest recovery points. And because IBM Storage Defender integrates with your existing security operations tools and processes, you can recover a minimally viable company sooner.

Next steps

Solution tutorials provide step-by-step instructions on how to use IBM Cloud to implement common patterns based on best practices and proven technologies.

IBM Solution Tutorials