Reliability is the ability of your service or application to be available for your users as intended, while resiliency is the capacity for your service or application to recover from different types of failures.
Why is resiliency important?
A business-critical application that is not well-designed for resiliency can have a significant negative impact on your business operations and your user experience. Users expect services to perform well and always be available. When a system does not meet its reliability goals, it can lose the trust of stakeholders and potentially cause financial and reputational loss for your organization. Examples of negative user impact include "my package did not arrive today," "we cannot fuel the aircraft because the fueling system is not available," or even worse.
The reliability needs of a system grow and become more complex as the overall system grows and its interactions with external services increase. Architects must use deep critical analysis and drive reliability requirements until they are understood, documented, and published. These reliability requirements must be described with accompanying design traceability and supporting tests to verify that the needs have been met.
Sometimes stakeholders may not appreciate why resiliency and reliability are important, and that reliability requirements must be explicitly stated to determine whether a system is reliable. In these cases, stakeholders need to be educated on the consequences to the business if reliability requirements are not met.
The following sections describe the principles that must be considered when designing your system for reliability.
Resiliency design choices and trade-offs must be tightly coupled to business requirements. Every architectural decision around resiliency can have an impact on business operations, user experience, overall cost, and operability.
When you define resiliency and reliability targets, start from a business perspective. Not every system or application needs to have 99.999% uptime. A highly resilient architecture for a system can significantly increase project costs, such as initial development costs, hosting costs, operating costs, and ongoing support and maintenance costs. You need to work closely with your business and IT stakeholders to define and understand:
These insights are vital as they help you and your organization balance the cost of designing a highly resilient system with the business value and make strategic decisions that optimize resiliency investments. You can use an iterative process to eventually land on a common ground between business needs and technical implementation.
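To make the cost conversation concrete, it can help to translate an availability target into the downtime it permits. The following is a minimal sketch, not a prescribed method: the function name and the list of targets are illustrative assumptions, and a 365-day year is assumed.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime permitted within `period` at the given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


if __name__ == "__main__":
    year = timedelta(days=365)  # assumption: non-leap year
    for target in (99.0, 99.9, 99.95, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime(target, year)} of downtime per year")
```

At 99.999%, the yearly allowance is a little over five minutes, which is why such a target usually implies redundant, automated recovery rather than manual intervention.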
Designing for failure involves a thorough analysis of all potential types of failures (for example, application, middleware, data, connectivity, data center, zone, region, and so on) and quantifying the impact of each failure scenario in relation to hard business requirements. Your design must enable the solution to maintain desired service levels despite the occurrence of one or more failures.
Reliability can't be added to a pre-existing system. The system and its components must be designed and implemented with reliability in mind. It's a shared responsibility by everyone who contributes to the software development lifecycle, including the architect, the product owner, the DevOps engineer, and the site reliability engineer.
To implement reliability into a software component, you can use the key architectural patterns described in the practices pages. You don't need to use every pattern or technique, but you need to know which are available and which is right for a task.
When you're skilled in these techniques, you can have a meaningful conversation with the product owner on the reliability targets for a service. These techniques can help you gauge the viability and effort that are required to meet a certain reliability target.
Cyber resiliency, a component of operational resiliency, focuses on providing data protection and business continuity in case of cyber attacks, including scenarios where data is encrypted by ransomware. It includes the recovery of data as well as ensuring that an operating environment can be used while the production environment is not available.
Cyber attacks can happen to large enterprises, even those with robust security technologies in place. A strong cyber resiliency strategy that provides a unified approach combining cybersecurity with data protection and disaster recovery methods can help businesses protect against and rapidly recover from disruptive cyber incidents.
A well-architected and resilient cloud application is distributed and built from smaller components with various lifecycles that may scale up or down based on demand. It becomes harder to rely on traditional monitoring and logging to diagnose problems. It is important to monitor applications to quickly detect deviations from the norm, as defined through service level objectives and service level indicators. There must be sufficient insight into the system to isolate problems and mitigate issues as quickly as possible.
Observability uses external outputs (metrics, logs, and traces) to infer the internal state of system components. These outputs can be used to provide different views of the system, including events, dashboards, and links. By integrating these distinct views into one solution, a site reliability engineer can gain better insights and identify and resolve incidents faster.
Embedding resiliency testing for cloud workloads into planning and day 2 operations involves integrating resiliency considerations into the entire lifecycle of cloud-based services.
One of the key activities is resiliency test planning, which is about developing a comprehensive resiliency testing strategy that includes test scenarios, methodologies, and success criteria (in alignment with resiliency requirements and objectives).
Another key activity is testing in production-like environments. This means performing resiliency testing in these environments to observe how the system behaves in conditions that closely resemble the actual usage. In addition, reliability can greatly benefit from negative testing or chaos testing in these environments.
Finally, automated testing integration ensures that resiliency tests are well integrated into the automated testing framework and CI/CD pipelines. This ensures that resiliency testing becomes a routine part of the development and deployment process.
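As an illustration of what a routine resiliency test in a pipeline might look like, the following sketch injects a fault into a dependency and asserts that the caller still succeeds. `FailoverClient`, `healthy`, and `broken` are hypothetical stand-ins, not part of any real library; a real test would target your own components.

```python
import unittest


class FailoverClient:
    """Hypothetical client that tries endpoints in order until one answers."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)

    def get(self, key):
        last_error = None
        for endpoint in self.endpoints:
            try:
                return endpoint(key)
            except ConnectionError as err:
                last_error = err  # remember the failure and try the next endpoint
        raise ConnectionError("all endpoints failed") from last_error


def healthy(key):
    """Stand-in for a working endpoint."""
    return f"value-for-{key}"


def broken(key):
    """Stand-in for an endpoint with an injected fault."""
    raise ConnectionError("injected fault")


class FailoverResiliencyTest(unittest.TestCase):
    def test_survives_primary_outage(self):
        client = FailoverClient([broken, healthy])
        self.assertEqual(client.get("order-1"), "value-for-order-1")

    def test_reports_total_outage(self):
        client = FailoverClient([broken, broken])
        with self.assertRaises(ConnectionError):
            client.get("order-1")


if __name__ == "__main__":
    unittest.main()
```

Because the test uses only the standard `unittest` module, it can run in the same pipeline stage as ordinary unit tests.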
Operating a service reliably with velocity and scale requires automation. Using automation techniques like infrastructure-as-code (IaC) contributes to reliable services and better user experiences by reducing manual intervention, automating recovery processes, and enabling rapid and consistent responses to failures.
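One way to picture automated recovery is as a comparison between a declared desired state and the observed state, which is the idea behind many IaC tools. The sketch below is an assumption-laden illustration: `plan_recovery` and the zone names are hypothetical, and it only computes a plan rather than calling any real provisioning API.

```python
def plan_recovery(desired: dict[str, int], observed: dict[str, int]) -> list[str]:
    """Compare desired and observed instance counts per zone and list corrective actions."""
    actions = []
    for zone, want in sorted(desired.items()):
        have = observed.get(zone, 0)  # a zone missing from the observation has no instances
        if have < want:
            actions.append(f"start {want - have} instance(s) in {zone}")
        elif have > want:
            actions.append(f"stop {have - want} instance(s) in {zone}")
    return actions


if __name__ == "__main__":
    desired = {"zone-1": 2, "zone-2": 2, "zone-3": 2}   # hypothetical declared state
    observed = {"zone-1": 2, "zone-2": 0, "zone-3": 3}  # hypothetical state after a failure
    for action in plan_recovery(desired, observed):
        print(action)
```

Running such a comparison on a schedule, and applying the resulting actions automatically, gives the same response to every failure without waiting for a person to intervene.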
The following best practices are intended to help you design resilient systems:
Application resiliency
Reliability can't be added to a built system. The system and its components must be designed and implemented with reliability in mind. It's a shared responsibility by everyone who contributes to the software development lifecycle, including the architect, the product owner, the DevOps engineer, and the site reliability engineer.
To build reliability into a software component, you can use the key architectural patterns. There are several important patterns and techniques, described next, that you need to understand to build reliable applications. You don't need to use every tool, but you need to know which tools are available and which tool is right for a task.
Foundational patterns
First, there are three foundational patterns, each with their own trade-offs, to consider:
Mitigating techniques
From the three foundational patterns, there are additional techniques that mitigate some consequences caused by the trade-offs or help compensate for failures.
Non-architectural techniques
In addition to the architecture patterns, you can take other approaches to improve the reliability of a system. Typically, these techniques help to manage a system or to understand the behavior of complex systems. While they don't directly improve reliability, these techniques can help you understand the system and infer strategies to improve reliability.
Data resiliency
When a failure occurs with hosted data, data resiliency allows data to remain available to traditional applications and applications that incorporate APIs and other services for analytics.
Choosing the correct set of data resiliency techniques and technologies in the context of an overall business continuity plan is vital, but it can be complex and difficult. How do you reconstruct a steady state? Might the recovery noticeably affect a user or system? You need to be able to establish a sync point, restart, understand anything that was in flight and not recoverable, and back out.
Here are some examples of techniques you might want to consider in your workload resiliency design:
Backups: Traditionally, organizations recover data by using data backups. Backups are usually anchored on single applications. However, when applications are interconnected, backups can face resynchronization challenges. The latency between backups can also create a gap of lost data. Depending on the underlying technologies of the data stores, a logging system might not be available to mitigate the gap of loss. Because backups are taken incrementally, it can be cumbersome to recompose an image to restore. Some modern databases are so large that backups aren't even taken.
Snapshots: Snapshots of logical disk units can address a backup that transcends individual applications. However, they might not always work well with various techniques of mirrored data and striped data.
Mirroring: Various techniques of mirroring exist. The mirrored copy can be on the same disk drive or pushed to a remote system. Typically, the operating system handles the mirroring, or replication. Mirroring techniques can vary between synchronous and asynchronous replication. With synchronous mirroring, the two copies of the data are identical and a latency issue might exist regarding the physical distance that is permitted between the primary and the secondary locations. Asynchronous mirroring doesn't usually have the distance limitation, but if an unexpected failure occurs, the latency might result in data loss. At the hardware level, a peer-to-peer remote copying can provide a form of mirroring that enables resource services to provide for a controlled switchover or failover.
Flash copies: A flash copy can provide a fast point-in-time copy of the data. You can use the copy to bring an application online in a separate partition or system. This type of copying can also supplement the ability to complete an offline backup or populate data for non-production systems.
Logical replication: If you use logical replication to build a multi-system environment with high availability, be sure to use a transport mechanism that uses synchronous remote journaling. The journaling provides a way to replay logged changes.
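The difference between synchronous and asynchronous mirroring described above can be sketched with a toy model. `ReplicatedStore` is a hypothetical in-memory class used only for illustration; it ignores network latency and models asynchronous replication as a queue of writes that have not yet reached the replica.

```python
class ReplicatedStore:
    """Toy model of a primary with one mirrored copy."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self.pending = []  # async only: acknowledged writes not yet shipped to the replica

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: the write is acknowledged only once the replica has it.
            self.replica.append(record)
        else:
            # Asynchronous: acknowledge immediately and ship later.
            self.pending.append(record)

    def ship(self, max_records: int):
        """Deliver up to max_records queued writes to the replica."""
        batch = self.pending[:max_records]
        self.pending = self.pending[max_records:]
        self.replica.extend(batch)

    def lost_on_primary_failure(self) -> int:
        """Writes that exist only on the primary and would be lost if it failed now."""
        return len(self.primary) - len(self.replica)


if __name__ == "__main__":
    sync_store = ReplicatedStore(synchronous=True)
    async_store = ReplicatedStore(synchronous=False)
    for i in range(5):
        sync_store.write(i)
        async_store.write(i)
    async_store.ship(3)  # replication lags behind by two writes
    print("sync writes at risk:", sync_store.lost_on_primary_failure())
    print("async writes at risk:", async_store.lost_on_primary_failure())
```

In this model the synchronous store never has writes at risk, at the price of waiting for the replica on every write, while the asynchronous store exposes whatever is still queued when the failure occurs.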
Constraint: Network latency & data volume
The two primary enemies of a data resiliency initiative are data volume and network latency. To design for high availability, disaster recovery, or workload reallocation, you must factor in real-world physics.
The time to move data from one location to another depends on distance. The longer the distance, the longer the latency. Sending petabytes of data all at once is likely to clog the available bandwidth. All these constraints are independent of other considerations, such as time to rebuild an index or to create a sync point at the target location.
Whether you're moving data in bulk, trickling data via CDC or a message queue, or using a mirrored technology, you need data strategy, data topology, and data governance.
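A back-of-the-envelope calculation shows why data volume matters. The sketch below estimates bulk transfer time from data size and link speed; the function name, the 10 Gbit/s link, and the 80% efficiency factor are illustrative assumptions, and the result excludes index rebuilds and sync-point creation at the target.

```python
def transfer_seconds(data_bytes: float, link_bits_per_second: float,
                     efficiency: float = 0.8) -> float:
    """Estimate the time to push data_bytes over a link.

    efficiency is an assumed fraction of nominal bandwidth that is usable
    after protocol overhead and contention.
    """
    if link_bits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    return data_bytes * 8 / (link_bits_per_second * efficiency)


if __name__ == "__main__":
    one_petabyte = 1e15   # bytes (decimal petabyte)
    link = 10e9           # assumed 10 Gbit/s dedicated link
    seconds = transfer_seconds(one_petabyte, link)
    print(f"about {seconds / 86400:.1f} days to move 1 PB")
```

Under these assumptions a single petabyte takes well over a week to move, which is often longer than the recovery time a business can tolerate.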
It is important to define measurable resiliency and reliability targets to ensure you are meeting your users' needs. The targets are typically expressed with service level objectives (SLOs), service level indicators (SLIs), and service level agreements (SLAs).
Service level objectives
The SLO is the target level of service reliability for your users. To be effective, the SLO must be achievable and reflect the technical reality of your organization. It needs to be high enough to keep your users satisfied, but low enough that the business can achieve it. Your SLO can be made stricter over time as your system improves.
The SLO is an agreement by the internal service manager, the development team, and the operations team. All parties must support it.
You should express the SLO as a percentage of achievement over a period. The closer the number is to 100%, the stricter the SLO; however, an SLO is rarely set at 100%.
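As a worked example of expressing an SLO as a percentage of achievement over a period, the following sketch computes attainment from event counts and the number of events that may miss the objective. The function names, the request counts, and the 99.9% target are illustrative assumptions.

```python
def slo_attainment(good_events: int, total_events: int) -> float:
    """Percentage of events in the period that met the objective."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return 100.0 * good_events / total_events


def allowed_bad_events(slo_percent: float, total_events: int) -> float:
    """How many events may miss the objective before the SLO is breached."""
    return total_events * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    total, good = 1_000_000, 999_400  # hypothetical counts for one measurement window
    target = 99.9                     # illustrative SLO
    achieved = slo_attainment(good, total)
    print(f"achieved {achieved:.3f}% against a target of {target}%")
    print("SLO met" if achieved >= target else "SLO missed")
    print(f"allowed misses: about {allowed_bad_events(target, total):.0f}, actual: {total - good}")
```

Counting allowed misses this way also shows why a 100% target is rarely practical: it leaves no room for any failed event at all.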
Service level indicators
To calculate your SLO, you need to measure the metrics, or your service level indicators (SLIs). SLIs are metrics that represent your service level from your users' point of view. The process to define your SLO and SLI is iterative. SLOs are driven by business requirements, whereas SLIs are driven by the available measurements.
The initial SLO might postulate objectives that either aren't technically achievable or are difficult and expensive to achieve. SLOs should be negotiated between affected parties, such as the product owner, the DevOps engineers, and the site reliability engineers, by comparing the expected business result with the current baseline. You can adjust SLOs and SLIs as business requirements arise or as important performance metrics become available or unavailable.
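To illustrate how SLIs come from available measurements, the sketch below derives an availability SLI and a latency SLI from raw request records. The `Request` type, the rule that 5xx responses count as failures, and the 300 ms threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time to serve the request


def availability_sli(requests: list[Request]) -> float:
    """Percentage of requests that did not fail with a server error (assumed: 5xx)."""
    if not requests:
        raise ValueError("no requests measured")
    ok = sum(1 for r in requests if r.status < 500)
    return 100.0 * ok / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Percentage of requests served within threshold_ms."""
    if not requests:
        raise ValueError("no requests measured")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return 100.0 * fast / len(requests)


if __name__ == "__main__":
    sample = [Request(200, 120), Request(200, 450), Request(503, 500), Request(200, 80)]
    print(f"availability SLI: {availability_sli(sample)}%")
    print(f"latency SLI (<= 300 ms): {latency_sli(sample, 300)}%")
```

Comparing these measured values with the proposed SLO gives the baseline that the negotiation between the affected parties can start from.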
Service level agreements
A service level agreement (SLA) is an agreement with your users that defines the SLOs that must be met in a specific period of time. In addition, the SLA includes the commercial implications for not achieving the SLOs. An easy way to determine whether an SLO represents an SLA is to ask, "Are there any commercial implications if the SLO is not met?" If the answer is "yes", it is an SLA.
For a given service, the contractually agreed measure of a given attribute constitutes the SLA. For example, a common availability SLA is 99.95%, and a common latency SLA could be 300 milliseconds. SLAs are usually contractual obligations, so there may be consequences for not achieving them. For example, cloud providers typically provide SLA credits or service credits if the promised SLAs for services are missed.
An SLA can also include other commitments, such as response time and repair time. Because an SLA is a business contract, it is useful to know the key indicators for a specific user's industry. In the area of business process, key performance indicators (KPIs) are used. KPIs are like service level indicators (SLIs), but the focus of a KPI is more on measurement that helps you understand how your organization or department is performing.
High availability (HA) is mainly about keeping the service available to your customers while ordinary activities are performed on the system, such as deploying updates, rebooting virtual machines, and applying security patches to host operating systems. High availability within a single cloud multizone region (MZR) can be achieved by eliminating single points of failure.
HA usually doesn’t deal with major unplanned or planned issues, such as complete site loss because of major power outages, earthquakes, severe hardware failures, full site connectivity loss, and so on. In such cases, if the service must meet strict service level objectives (SLOs), you should make the whole application stack (infrastructure, services, and application components) redundant by deploying it in at least two different cloud regions. This is typically defined as a disaster recovery (DR) architecture.
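The effect of redundancy on availability can be estimated with simple probability. The sketch below assumes component failures are independent, which real zones and regions only approximate, so the results are optimistic upper bounds. The function names and the sample figures are illustrative.

```python
import math


def serial_availability(components: list[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return math.prod(components)


def parallel_availability(replicas: list[float]) -> float:
    """Availability of redundant replicas in which at least one must be up."""
    return 1 - math.prod(1 - a for a in replicas)


if __name__ == "__main__":
    single_zone = 0.99  # assumed availability of one deployment
    print(f"one zone:    {single_zone:.6f}")
    print(f"two zones:   {parallel_availability([single_zone] * 2):.6f}")
    print(f"three zones: {parallel_availability([single_zone] * 3):.6f}")
    # A dependency chain lowers availability: every link must work.
    print(f"app + database in series: {serial_availability([0.999, 0.9999]):.6f}")
```

The serial case is the arithmetic behind single points of failure: a chain is never more available than its weakest link, while each redundant replica multiplies down the probability of a total outage.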
Disaster recovery
The approach to defining your disaster recovery (DR) strategy needs to be systematic and start with the business application you're providing on the cloud, and the set of resources that are used to build it (for example, Kubernetes apps, virtual machines, cloud services, and so on).
While a holistic approach might be wanted, the reality is that each business application is independent, with its own recovery time objectives (RTOs) and recovery point objectives (RPOs), which many customers express as different classes of service.
Because each business application has a unique set of composite resources, each one needs to be reviewed to make sure that the strategy and requirements for DR for that resource are understood, documented, and implemented before going to production. To build the framework to drive that analysis and work, a set of profiles with common sets of resources for continuous availability and advanced recovery applications is described in Designing an architecture for your application resiliency objectives.
There are many options to implement DR solutions. For the sake of simplicity, they can be grouped into four major categories:
Active/passive: Active/passive options are based on keeping the full application stack active in one location, while another application stack is deployed in a different location but kept idle or shut down. In the case of prolonged unavailability of the primary site, the application stack is activated in the backup site. Often that requires restoration of backups that are taken in the primary site. This approach is not recommended when losing data can be a problem, for example, when the RPO is less than a few hours or when the availability of the service is critical and the RTO is less than a few hours.
Active/standby: In the active/standby case, the full application stack is active in both the primary and backup location. However, user transactions are served by the primary site only. The backup site takes care of keeping a replica of the status of the main location through data replication, such as database replication or disk replication. In cases of prolonged unavailability of the primary site, all client transactions are routed to the backup site. This approach provides good RPO and RTO, generally measured in minutes; however, it is significantly more expensive than the active/passive options. This is because there is a double deployment and resources are wasted because the standby assets aren't used to improve scalability and throughput.
Active/active: In the active/active case both locations are active, and client transactions are distributed to both regions according to predefined policies, such as round-robin, geographical load balancing, and so on. In the case of the failure of one site, the other site must be able to serve all clients. It's possible to achieve both an RPO and RTO close to zero with this configuration. The drawback is that both regions must be sized to handle the full load, even if they are used at half of their capabilities when both locations are available. In such cases, auto scaling can help in keeping resources allocated according to the needs. In addition, the data across both locations must be continuously synced with some kind of replication mechanism.
Backup & restore: The backup and restore approach centers around the consistent and periodic creation of copies of essential data and system configurations. These backups serve as point-in-time snapshots of the data, capturing a known good state of the information that can be used for recovery purposes. The backups are typically stored in separate locations from the primary system, providing an additional layer of protection against physical or logical damage to the primary storage.
In the event of a disaster, such as hardware failure, data corruption, accidental deletion, or even a cybersecurity breach (for example, a ransomware attack), the backup and restore approach enables the organization to recover lost or compromised data by restoring it from the backup copies. Depending on the backup frequency and retention policy, the restoration process can return the system to its state just before the incident occurred.
Note that this approach may not address the overall RTO and RPO of an organization in all scenarios. The time required to restore the data and the amount of data that could potentially be lost between backups may impact the speed and completeness of the recovery process.
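A simple check can show whether a backup and restore plan can meet given objectives. The sketch below assumes the worst-case data loss equals the backup interval and takes the restore time as an input estimate. The function name and the sample values are hypothetical.

```python
from datetime import timedelta


def backup_meets_objectives(backup_interval: timedelta,
                            estimated_restore_time: timedelta,
                            rpo: timedelta,
                            rto: timedelta) -> dict[str, bool]:
    """Compare a backup plan with recovery objectives.

    Assumption: a failure just before the next backup loses one full
    interval of data, so the interval is the worst-case data loss.
    """
    worst_case_data_loss = backup_interval
    return {
        "rpo_met": worst_case_data_loss <= rpo,
        "rto_met": estimated_restore_time <= rto,
    }


if __name__ == "__main__":
    result = backup_meets_objectives(
        backup_interval=timedelta(hours=24),        # nightly backups
        estimated_restore_time=timedelta(hours=6),  # hypothetical measured restore time
        rpo=timedelta(hours=4),
        rto=timedelta(hours=8),
    )
    print(result)
```

In this sample, nightly backups cannot satisfy a four-hour RPO even though the restore time fits the RTO, which suggests either more frequent backups or one of the replication-based options.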
Site strategy
Site strategy is the most important aspect of the overall resiliency solution because it determines what classes of physical events the solution is able to address, sets the requirements in terms of distances, and sets constraints on the technology side.
Cloud providers typically provide a redundant infrastructure with several layers of resiliency that can be summarized as:
Local within an availability zone: An availability zone (AZ) is a physical and logical separation zone, within a cloud location, such as a data center. Its power, cooling, and networking are independent from other AZs.
Across multiple availability zones in one region: Using multiple AZs within the same region offers a better level of protection in case of a limited disaster which affects only one of the AZs. AZs in the same region are usually in close proximity (tens of miles).
Across two regions in the same geography: Using two regions in the same geography provides a higher level of protection against natural disasters, as sites are generally over 400 kilometers (or 250 miles) apart.
Across geographies: Selecting sites in different geographies extends the level of protection against natural disasters. Geographies are generally over 1500 kilometers (or 1000 miles) apart, thus representing the optimal option in terms of wider protection requirements.
Analyze the candidate cloud sites for the DR of your cloud-enabled workloads to avoid problems caused by latency. Multiple sites in the same region help to implement DR with near-zero RPO, whereas remote secondary sites far from the primary site might require asynchronous replication techniques to be evaluated in your design.
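The distances above translate into a lower bound on network round-trip time, which is one reason distant sites tend to call for asynchronous techniques. The sketch below assumes light travels through optical fiber at roughly 200,000 km/s (about two-thirds of its speed in a vacuum) and ignores routing detours and equipment delays, so real latencies are higher. The function name and the 50 km sample are illustrative.

```python
FIBER_KM_PER_SECOND = 200_000  # assumed propagation speed in optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time between two sites a given distance apart."""
    if distance_km < 0:
        raise ValueError("distance must not be negative")
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


if __name__ == "__main__":
    for label, km in (("zones in one region", 50),
                      ("regions in one geography", 400),
                      ("different geographies", 1500)):
        print(f"{label}: at least {min_round_trip_ms(km):.1f} ms per round trip")
```

Every synchronously replicated write pays at least one such round trip, so the cost grows with distance even before any other overhead is counted.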
Accordingly, when building cloud applications, you should implement one of the following depending on the type of resiliency you want to support for your application:
Zonal application:
Regional application:
Global application:
Cyber resilience is an organization's ability to prevent, withstand, and recover from cybersecurity incidents. It brings together business continuity, information systems security, and organizational resilience. With incidents becoming more malicious and techniques more advanced, the strategies and plans to mitigate the impacts of such attacks must also change. Businesses need new technologies and practices to survive and adapt to today’s cyber outage scenarios.
Cyber resiliency focuses on resilience of data and the environments data resides in. It should be considered in addition to common resiliency techniques used for high availability and disaster recovery. While these techniques are important and must be part of the overall resilience program, they often focus on keeping data replicated with the smallest recovery point objective (RPO). This can result in the ransomware being replicated quickly across multiple environments.
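Because replication can carry ransomware to every copy, recovery often depends on point-in-time copies taken before the compromise. The sketch below picks the most recent copy that predates the estimated compromise time and passed a scan. The `Snapshot` type, its `scan_clean` flag, and the timestamps are hypothetical and not tied to any particular product.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    taken_at: datetime
    scan_clean: bool  # hypothetical result of a malware or anomaly scan


def safest_recovery_point(snapshots: list[Snapshot],
                          compromised_at: datetime) -> Optional[Snapshot]:
    """Return the newest clean snapshot taken before the compromise, if any."""
    candidates = [s for s in snapshots
                  if s.scan_clean and s.taken_at < compromised_at]
    return max(candidates, key=lambda s: s.taken_at, default=None)


if __name__ == "__main__":
    snapshots = [
        Snapshot(datetime(2024, 5, 1, 0, 0), True),
        Snapshot(datetime(2024, 5, 2, 0, 0), True),
        Snapshot(datetime(2024, 5, 3, 0, 0), False),  # taken after infection
    ]
    compromise = datetime(2024, 5, 2, 12, 0)
    print(safest_recovery_point(snapshots, compromise))
```

The trade-off is visible in the result: the safest copy is older than the newest one, so a small RPO for availability purposes and a safe recovery point for cyber incidents are separate design goals.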
Cyber resiliency best practices include:
Crucial to any resiliency program is a plan with requirements and procedures to test the resiliency of the involved systems.
Cloud Resiliency Testing
This recommendation for resilience testing for cloud applications is based on the experience of working with clients that have successfully deployed mission critical workloads in the cloud and have considerable experience in productive delivery.
Cloud Provider Resiliency Testing
Types of resilience that cloud providers must support and the different failure scenarios that must be tested regularly:
Application Resiliency Testing
Clients must consider the resilience of all the components and dependencies of the application, and test the failure scenarios for each one individually and for all of them together.
Zonal Application:
Regional Application:
Global Application:
Chaos Testing:
Chaos engineering is the practice of testing a system's response to turbulent behavior, such as infrastructure failures, unresponsive services, or missing components. The goal is to break the system to correct its architecture, understand its weak points, and anticipate failures and how the system and the people might behave. Rigor is at the heart of chaos engineering, not randomness. The practices of chaos engineering follow strict engineering procedures. Attacks on a system are precisely defined as experiments, pre-planned in detail, completed in a contained manner, observed throughout the experiment, and learned from.
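The structure of a chaos experiment (verify the steady state, inject a precisely defined fault, observe, and roll back) can be sketched as follows. `Service` is a hypothetical in-memory stand-in for a real system, and the fault is a simulated replica outage. A real experiment would use your own tooling and monitoring.

```python
from contextlib import contextmanager


class Service:
    """Hypothetical service with two replicas; stands in for a real system."""

    def __init__(self):
        self.replicas = {"replica-a": True, "replica-b": True}

    def handle(self) -> bool:
        # A request succeeds if at least one replica is up.
        return any(self.replicas.values())


@contextmanager
def replica_outage(service: Service, name: str):
    """Inject a fault and always roll it back, keeping the experiment contained."""
    service.replicas[name] = False
    try:
        yield
    finally:
        service.replicas[name] = True


def run_experiment(service: Service, target: str, requests: int = 100) -> dict:
    # 1. Verify the steady state before touching anything.
    if not service.handle():
        raise RuntimeError("steady state not met; aborting experiment")
    # 2. Inject the fault and observe behavior while it is active.
    with replica_outage(service, target):
        successes = sum(service.handle() for _ in range(requests))
    # 3. Record the outcome so that it can be reviewed and learned from.
    return {
        "target": target,
        "success_rate": successes / requests,
        "rolled_back": service.replicas[target],
    }


if __name__ == "__main__":
    print(run_experiment(Service(), "replica-a"))
```

The rollback sits in a `finally` clause so that the fault is removed even if the observation step raises an error, which is one way to keep an experiment contained.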
By using these resilience testing best practices as a starting point, which include recommendations for cloud provider testing and client application resilience testing, clients can then tailor and adopt this model in their planning and reporting of cloud-related resilience risks to executives, boards, regulators, and auditors.
For more information on Chaos Engineering, see this link:
Resiliency testing requirements:
For cloud provider services, the cloud provider should support the following requirements:
Compliance with ISO, SOC2 type 2, IBM Cloud Framework for Financial Services BCDR planning and testing requirements:
Perform the test with regular cadence (at a minimum annually) to meet standard regulation requirements:
Produce test evidence and records to meet auditing requirements from standard regulations:
IBM policies are fully aligned to industry best practices such as the ISO22301 framework, BCI GPG 2018, ITIL and ISO27001, and they address the need to document and review annually the business impact analysis (BIA) and the response to specific disruptive scenarios, such as workplace unavailability (for example, loss of IBM locations), workforce unavailability (for example, due to a pandemic), IT unavailability, and region-wide disasters. For client workloads, it is the client’s responsibility to meet HA and DR policies and local regulations that apply to the client’s business applications and services.
It is critical to automate operations to be able to run your service with velocity and scale. Using automation techniques contributes to reliable services and better user experiences by reducing manual intervention, automating recovery processes, and enabling rapid and consistent responses to failures.
The following are examples of areas where automation can be employed to improve resiliency:
Observability is a set of practices, platforms, and tools to provide visibility to your systems and applications and avoid extended outages. Observability allows you to learn what is happening in your system and, in the event of an outage, ensure you can find its root cause quickly. This is particularly important in a dynamic, distributed cloud environment, because it can be very difficult to correlate dependencies between components.
It is very important to select an appropriate observability platform for your operations that meets your requirements. You'll want to consider the following questions:
Example observability platforms include:
Defining and addressing the non-functional requirements (NFRs) for a system are among the most important of a software architect’s responsibilities. NFRs are the quality attributes of a system, as distinct from the functional requirements, which detail a system’s business features and capabilities. Examples of NFRs include key concepts such as reliability, performance, scalability, and security.
NFRs are cross-cutting in nature and affect multiple aspects of a system’s architecture. It's important to articulate and address the NFRs early in the project lifecycle and to keep them under review as the system is produced. To help with that task, you can use the following list of key NFRs. The list is designed to be consulted when you define and evolve a system’s architecture.
There are many standards and regulations around the world which have requirements for resilience of systems and solutions in regulated industries. There are standards and regulations that require organizations to define the business recovery process, service levels, and recovery times that are acceptable for their business. Similarly, there are requirements for business recovery processes to be tested periodically, and to provide documented test results showing that SLAs have been met. For example:
IBM has also created frameworks and guidance to help application providers in regulated industries meet regulatory requirements when deploying applications on IBM Cloud. The following are examples within the financial services industry:
The IBM Policy, Standards and Guidelines for Business Continuity are fully aligned to industry best practices such as the ISO22301 framework, BCI GPG 2018, ITIL and ISO27001, and they address the need to document and review annually the business impact analysis (BIA) and the response to specific disruptive scenarios, such as workplace unavailability (for example, loss of IBM locations), workforce unavailability (for example, due to a pandemic), IT unavailability, and region-wide disasters. For client workloads, it is the client’s responsibility to meet HA and DR policies and local regulations that apply to the client’s business applications and services.