July 5, 2022 | By Swanand Barve and Rakesh Shinde | 4 min read

How do you know if a solution is “resilient enough,” and how do you know if your testing covers the necessary scenarios?

The cloud-native architecture paradigm has been around for quite some time now. At the core of cloud-native architecture are cohesive, independent functional components that bring in business agility, scalability and resiliency — contributing to accelerated time to market, competitive advantages and optimized costs. This paradigm has been actively supported through a polyglot technology landscape.

The solutions realized using the above combination of architecture and tech landscape can turn out to be quite complex to maintain and manage, primarily due to the sheer number of components and multiple tech frameworks required for their realization. Sub-optimal implementation of design and engineering practices exponentially increases the complexity and maintenance risks of such solutions.

What is resiliency?

“Resiliency” is one such engineering practice, and it is critical to the success or failure of any digital transformation initiative. Resiliency directly contributes to the overall availability of a solution through metrics like Mean Time to Recover (MTTR) and Mean Time Between Failures (MTBF), and it can make or break a transformative user experience.

Resiliency is essentially the ability of a system to withstand failures. While failures may ultimately manifest as errors or unavailability of a component or system, the list of factors that can cause failures in a distributed, cloud-native system is long.

There is already a lot of material focusing on how to “implement” resiliency in cloud-native applications. IBM’s Build for Reliability Garage practice provides a great introduction and framework for resiliency implementation. There are also tools like Chaos Monkey and Gremlin that help in “testing” the resiliency of applications.

The challenge, though, remains: how do we verify that a solution is “resilient enough”? Specifically, how do we know whether our testing covers the necessary and sufficient scenarios? How do we know which failures to induce?

We would like to propose the following four-step approach to address the above challenge.

1. Identify scenarios and architectural components that need to be tested for resiliency

This can be done by identifying “unique traversal paths” — essentially, the sequences and combinations in which the components of your solution are exercised to support functional scenarios. These scenarios and their supporting components form the base set that needs to be tested.

For example, your application may support one or more of the following (a sketch of how these paths might be modeled follows the list):

  • Search/browse product catalogue through a channel application that invokes backend microservices, which fetch data from a persistent data store.
  • Batch processes/schedulers executing at pre-set time/frequency.
  • Events published on pre-configured topics and processed by subscribing microservices.
  • APIs exposed and invoked by multiple consumer systems.
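As a rough sketch, each traversal path can be captured as data, so that the union of components across all paths gives the base set to test. All component and scenario names below are hypothetical examples, not taken from any real solution:

```python
# A minimal sketch of modeling "unique traversal paths" as data.
# All component, path and scenario names are hypothetical examples.
from dataclasses import dataclass

@dataclass(frozen=True)
class TraversalPath:
    name: str                    # functional scenario the path supports
    components: tuple[str, ...]  # ordered components the request flows through

PATHS = [
    TraversalPath("search-catalogue",
                  ("channel-app", "api-gateway", "catalogue-svc", "product-db")),
    TraversalPath("nightly-price-sync",
                  ("scheduler", "price-sync-job", "pricing-api", "product-db")),
    TraversalPath("order-events",
                  ("order-topic", "order-consumer-svc", "order-db")),
]

# The union of components across all paths is the base set to test.
components_to_test = {c for path in PATHS for c in path.components}
print(sorted(components_to_test))
```

Treating paths as data also makes it easy to prioritize them later by business impact, as discussed at the end of this post.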

2. Determine points of failure

Once we’ve identified the scenarios and components, the next step is determining what could “fail” with these components. Let’s take an example of a single microservice with the following characteristics:

  • It exposes an API through a gateway.
  • It is deployed on a Kubernetes-enabled container framework.
  • It accesses a database.
  • It integrates with a downstream system.

This view can be put together by identifying the “failure surfaces”: in this example, the core microservice itself, the pod and node it runs on, the API gateway, the backend system it calls, and the underlying compute, storage and network.

3. Identify causes of failure across failure surfaces

Each failure surface identified in the previous step could fail for multiple reasons, and identifying those reasons is the next step. Continuing with the same example, mapping failure surfaces to their possible causes gives you the following list (a sketch of how to record the mapping follows the list):

  • Core: The core microservice itself, as a code unit, could fail due to out-of-memory issues, an application server crash, etc.
  • Microservices pod and node: The node/pod may fail a health check. The VM hosting the Kubernetes container platform may crash.
  • API Gateway: The API Gateway engine may become unresponsive due to insufficient threads/memory for servicing requests.
  • Backend system: The backend system may take a long time to respond, or it may crash.
  • Compute/storage/network: The network between the microservice and the backend system (which could be hosted in a separate location) may go down.
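A minimal sketch of recording this mapping as data, with surface and cause labels paraphrased from the list above (the labels are illustrative shorthand, not a standard taxonomy):

```python
# Failure surfaces mapped to candidate causes, paraphrasing the list above.
# Labels are illustrative shorthand, not a standard taxonomy.
FAILURE_CAUSES = {
    "core-service":   ["out-of-memory", "app-server-crash"],
    "pod-or-node":    ["failed-health-check", "node-vm-crash"],
    "api-gateway":    ["thread-exhaustion", "memory-exhaustion"],
    "backend-system": ["slow-response", "backend-crash"],
    "network":        ["link-down-to-backend"],
}
```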

4. Prepare for “assault”

The causes and failure surfaces can be combined into a matrix, with surfaces along one axis and causes along the other. The matrix lets us understand and plan the combinations of “assaults” we need to run against the solution, which can then be implemented through the chaos testing tools mentioned earlier.
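As a sketch, and reusing the hypothetical FAILURE_CAUSES mapping from step 3, the matrix can be flattened into an assault plan, with each (surface, cause) pair handed to a chaos tool. The pod-kill helper below uses the official Kubernetes Python client (pip install kubernetes); the pod name and namespace are assumptions for illustration:

```python
# Flatten the surface-by-cause matrix into concrete assaults:
# one (surface, cause) pair per chaos experiment.
# FAILURE_CAUSES is the hypothetical mapping sketched in step 3.
assault_plan = [(surface, cause)
                for surface, causes in FAILURE_CAUSES.items()
                for cause in causes]

# One way to execute a pod-level assault with the official Kubernetes
# Python client. A real run needs cluster credentials in ~/.kube/config.
from kubernetes import client, config

def kill_pod(name: str, namespace: str = "default") -> None:
    """Delete a pod to simulate the 'pod-or-node / failed-health-check' assault."""
    config.load_kube_config()   # load credentials from the local kubeconfig
    v1 = client.CoreV1Api()
    v1.delete_namespaced_pod(name=name, namespace=namespace)
```

Dedicated tools like Gremlin cover far more assault types (latency injection, resource exhaustion, network loss), so a helper like this is best seen as a minimal starting point.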

Additional considerations

Last, but not least, failure testing alone will not be sufficient. Consider the following scenarios:

  • When you introduce a failure in one component instance, you need to either make sure auto-scaling/multiple instances are not running on the cloud platform, or ensure that all replicas fail as required.
  • In order to test a degraded result (e.g., one served from a cache), you would need a “before” and “after” testing capability.

This requires additional capabilities that complement your chaos testing frameworks, such as Infrastructure as Code (IaC) or dynamic reconfiguration of cloud resources.
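As a sketch of the “before” and “after” capability, assuming a hypothetical health endpoint and using the widely available requests library, a degraded-result check might look like this:

```python
# A minimal before/after sketch for degraded-result testing.
# The endpoint URL and the induce_failure/restore hooks are hypothetical.
import requests

def assert_degrades_gracefully(url: str, induce_failure, restore) -> None:
    baseline = requests.get(url, timeout=5)
    assert baseline.ok, "system must be healthy before the assault"
    induce_failure()            # e.g., the kill_pod() helper sketched earlier
    try:
        degraded = requests.get(url, timeout=5)
        # Degraded mode may serve stale or cached data, but must not error out.
        assert degraded.status_code < 500, "no graceful degradation observed"
    finally:
        restore()               # bring the failed component back
```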

Additionally, since actual testing with live components is expensive, you may also want to consider capabilities for “static” verification, such as the following (a sketch of one such check follows the list):

  • Deployment descriptor validation for ReplicaSet
  • Validating auto-scaling config for VMs
  • Static code checks for retries, circuit breaker implementation, etc.
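These checks lend themselves to simple scripting. For example, a deployment descriptor could be validated for a minimum replica count with PyYAML (pip install pyyaml); the threshold of two replicas below is an assumed policy, not a general recommendation:

```python
# Static check: ensure a Kubernetes Deployment/ReplicaSet descriptor
# requests enough replicas. The minimum of 2 is an assumed policy.
import sys
import yaml  # pip install pyyaml

def check_min_replicas(path: str, minimum: int = 2) -> bool:
    with open(path) as f:
        for doc in yaml.safe_load_all(f):
            if doc and doc.get("kind") in ("Deployment", "ReplicaSet"):
                replicas = doc.get("spec", {}).get("replicas", 1)
                name = doc.get("metadata", {}).get("name", "<unnamed>")
                if replicas < minimum:
                    print(f"{name}: replicas={replicas}, expected >= {minimum}")
                    return False
    return True

if __name__ == "__main__":
    sys.exit(0 if check_min_replicas(sys.argv[1]) else 1)
```

Similar scripts can validate auto-scaling configuration, and retry or circuit-breaker usage can be enforced through code-review lint rules.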

Learn more

Overall, we think that resiliency requires focus not just after development but throughout the lifecycle: identifying scenarios early on, prioritizing them based on business impact, and then using a combination of static and dynamic “assaults” to verify and validate component-level resiliency. The approach we have laid out in this blog post will help address the key challenges encountered along that journey.

IBM’s cloud-native application development and modernization services help infuse these engineering practices with the required consistency and rigor.
