Chaos engineering is the intentional, controlled injection of failures into a production or pre-production environment to understand their impact and to plan a better defense posture and incident management strategy.
Every day brings a new opportunity for an organization’s critical application or infrastructure to fail, potentially threatening its ability to deliver services to customers. Failures can stem from several causes, including security breaches, misconfigurations or service disruptions. The likelihood of errors or disruptions can rise as more applications and data are hosted in the cloud, which can also increase security issues.
One way to address disruptions is chaos engineering. It is not a random process in which engineers terminate instances or services, or otherwise cause systems to fail, without any purpose. Instead, it identifies potential future issues so that engineering teams can solve problems proactively and avoid them in the live environment later.
Chaos engineering is important because an error or disruption can slow an organization’s momentum, forcing teams to spend precious time working out a solution on the fly as downtime accumulates. Netflix learned this firsthand when it switched from on-premises to the cloud [1]: it experienced an outage that led to a three-day interruption to service delivery in 2008.
The outage predated Netflix’s transformation into a video streaming operation; had it occurred later, it would have been far more costly. As a result, Netflix decided to do everything possible to minimize disruptions and began to introduce chaos engineering into its workflows. This process allows the company to identify issues before they happen and to minimize the damage if and when an unavoidable failure occurs.
Netflix created Chaos Monkey [2], an open source tool that causes random incidents in IT services and infrastructure in order to expose weaknesses that can then be fixed or addressed through automatic recovery procedures. Netflix implemented Chaos Monkey when it moved from a private data center to Amazon Web Services (AWS), in response to unreliability in the cloud. Many organizations now use Chaos Monkey to run their chaos engineering experiments.
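The idea of random instance termination can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: it does not use Chaos Monkey's actual code or API, and the `Service` class, the `terminate_random_instance` and `run_experiment` functions, and the service names are all hypothetical stand-ins for a real fleet. It shows how repeatedly terminating random instances can surface services that have no redundancy.

```python
import random


class Service:
    """Hypothetical in-memory stand-in for a replicated service."""

    def __init__(self, name, replicas):
        self.name = name
        # Map each instance ID to True (running) or False (terminated).
        self.instances = {f"{name}-{i}": True for i in range(replicas)}

    def healthy_instances(self):
        return [iid for iid, up in self.instances.items() if up]

    def terminate(self, instance_id):
        self.instances[instance_id] = False

    def is_available(self):
        # The service stays available while at least one instance is running.
        return len(self.healthy_instances()) > 0


def terminate_random_instance(services, rng):
    """Terminate one randomly chosen healthy instance; return (service, instance)."""
    candidates = [s for s in services if s.healthy_instances()]
    if not candidates:
        return None
    service = rng.choice(candidates)
    victim = rng.choice(service.healthy_instances())
    service.terminate(victim)
    return service.name, victim


def run_experiment(services, rounds, seed=0):
    """Run several termination rounds and report which services became unavailable."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    terminated = []
    for _ in range(rounds):
        result = terminate_random_instance(services, rng)
        if result is not None:
            terminated.append(result)
    unavailable = [s.name for s in services if not s.is_available()]
    return terminated, unavailable


if __name__ == "__main__":
    fleet = [Service("checkout", 3), Service("search", 1)]
    terminated, unavailable = run_experiment(fleet, rounds=2)
    print("terminated:", terminated)
    print("unavailable services:", unavailable)
```

In a real environment the termination step would act on actual infrastructure and the availability check would come from monitoring, so the simulation only conveys the shape of the experiment.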
Chaos engineering is an important defense against infrastructure failures, outages or missing components in an organization’s production environment. It helps site reliability engineers (SREs) and other members of the DevOps team to provide continuous delivery of services by avoiding significant disruptions to their service. Chaos engineering helps them understand their vulnerabilities better and informs how to minimize the impact if a disruption occurs.
Even a small issue in code can have a catastrophic effect on the overall production environment because of dependencies between programs. For instance, an error in the transaction software system of a financial services firm can result in the loss of millions of dollars [3].
Organizations might be unable to avoid all IT incidents, but they can minimize the damage by using chaos engineering to understand likely scenarios and their best possible solutions.
Organizations with high resilience, digital maturity and high observability through dashboards and other tools should embrace chaos engineering, because they can take immediate action on issues that surface during experiments. Organizations that lack this observability [4] can take too long to resolve the failures they create through chaos engineering.
Chaos engineering is also a must-have for organizations that use the cloud, particularly public cloud and cloud-native apps. The public cloud introduces potential outage issues that require coordination with the cloud provider, which calls for a different approach than dealing with on-premises issues.
Enterprises that use the cloud still often approach IT incidents without considering how the cloud and software-as-a-service (SaaS) affect those incidents differently, according to Constellation Research [5].
In addition, the rise of microservices, which increases the number of hosts or containers running in a system, creates unique challenges that can be unearthed and solved through chaos experiments. Microservices shift complexity from code design into system operations, which does not eliminate the complexity but allows for greater automation.
Chaos engineering can also help organizations enhance the velocity of their continuous integration and continuous delivery (CI/CD) pipelines. Incorporating chaos engineering into CI/CD, as Netflix did [6], enables organizations to automate continuous experiments while controlling their potential impact.
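One way to picture an automated experiment with a controlled impact is as a pipeline gate: expose only a small share of requests to an injected fault, check an error-rate hypothesis, and stop early if things go badly. The following is a minimal sketch under assumptions, not Netflix's implementation. The function names, the handlers and all thresholds (`blast_radius_percent`, `max_error_rate`, `abort_error_rate`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    passed: bool        # True if the error-rate hypothesis held
    aborted: bool       # True if the experiment was stopped early
    requests_sent: int  # number of requests exposed to the fault
    error_rate: float   # observed fraction of failed requests


def run_gated_experiment(handle_request, total_requests=1000,
                         blast_radius_percent=5, max_error_rate=0.01,
                         abort_error_rate=0.5, min_sample=10):
    """Expose a small share of requests to a fault and evaluate the outcome."""
    # Limit the blast radius: only this many requests see the fault.
    exposed = total_requests * blast_radius_percent // 100
    errors = 0
    sent = 0
    for request_id in range(exposed):
        sent += 1
        if not handle_request(request_id, fault=True):
            errors += 1
        # Abort condition: stop as soon as failures clearly exceed the limit.
        if sent >= min_sample and errors / sent > abort_error_rate:
            return ExperimentResult(False, True, sent, errors / sent)
    rate = errors / sent if sent else 0.0
    return ExperimentResult(rate <= max_error_rate, False, sent, rate)


def resilient_handler(request_id, fault):
    """Hypothetical handler that falls back to a cache when the dependency fails."""
    return True


def fragile_handler(request_id, fault):
    """Hypothetical handler that fails whenever the dependency fails."""
    return not fault


if __name__ == "__main__":
    print(run_gated_experiment(resilient_handler))
    print(run_gated_experiment(fragile_handler))
```

A pipeline step could turn `passed` into its exit status so that a build that does not tolerate the fault fails before release. The early abort keeps a badly behaving experiment from running its full course.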
Finally, the fact that organizations increasingly connect with partners through APIs means that an issue in their systems can have a knock-on impact on other organizations. Deploying chaos engineering helps organizations understand the weak points in their architecture and correct them, ultimately creating the ability to anticipate future failures.
Successful chaos engineering helps organizations to minimize technical failures with any significant customer impact and it also supports the construction of stronger and more resilient complex system architectures. Once an organization decides to pursue chaos engineering, the next step is determining whether to execute it in the pre-production or production environment.
DevOps teams have several options for running chaos engineering experiments to test various system processes.
Creating the ideal chaos engineering process requires several principles to ensure that an organization can have a distributed system at scale.
Organizations that use chaos engineering must decide whether to use chaos testing in their production or pre-production environments. There are several reasons why chaos engineering is most beneficial in production environments.
Live environments provide the most accurate setting for understanding how an incident affects the customer experience. Another reason is that the pre-production environment might not have exactly the same settings as the live environment, which introduces some variability into the experiments.
For instance, an incident in a pre-production environment might not create a realistic response because it lacks the same traffic levels as the live environment. It also might not have the same security configurations as that environment.
Some organizations fear intentionally causing issues on their live site, so they run their experiments in a pre-production or development environment. This ensures that any issues that occur do not affect the live customer experience. Others begin in pre-production to get a handle on the process before moving to the live production environment.
Organizations choose which environment to use based on their risk tolerance. Ultimately, chaos engineering aims to test actual large-scale issues, which is why production environments give the most accurate picture of what's happening and what requires fixing.
Chaos engineering provides organizations with several key benefits.
Customers have high expectations about the availability of the services they purchase from companies. Any downtime or inability to access what they've paid for can have a serious effect on customer satisfaction, leading to lost revenue and reputational damage. Testing systems and identifying solutions means that there is less risk that a system will be down for a significant period of time.
Disruptions can come from bad code, server issues or external threats. The latter can strike even with excellent security practices. Chaos engineering helps identify issues that can be exploited, so organizations can introduce patches and bug fixes to keep their services secure.
Chaos engineering enables organizations to create a more informed blueprint for how they tackle issues that will occur in the future. Organizations that embrace chaos engineering will have specific game plans for many incidents, enabling quicker repair and less downtime. Chaos engineering can decrease downtime by as much as 20% [7].
Chaos engineering experiments identify how a system allocates resources. Introducing experiments will demonstrate how the system handles loads, showing where bottlenecks are or are likely to occur.
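A load-ramp experiment of this kind can be sketched as follows. The example is purely illustrative: the stage names and capacity figures are made up, and in a real experiment saturation would be observed through monitoring rather than read from a table of known capacities. The hypothetical `find_bottleneck` function raises the load step by step and reports the first stage whose capacity is exceeded.

```python
def find_bottleneck(stage_capacity, load_levels):
    """Return (load, stage) for the first load level that saturates a stage.

    stage_capacity maps a stage name to the requests per second it can handle.
    load_levels is an iterable of requests-per-second values to try.
    Returns None if no stage saturates at any tested load.
    """
    for load in sorted(load_levels):
        saturated = [name for name, cap in stage_capacity.items() if load > cap]
        if saturated:
            # The stage with the lowest capacity is the one that saturates first.
            return load, min(saturated, key=stage_capacity.get)
    return None


if __name__ == "__main__":
    # Hypothetical capacities in requests per second.
    capacities = {"api_gateway": 500, "auth": 300, "database": 120}
    print(find_bottleneck(capacities, [50, 100, 150, 200]))
```

With these made-up numbers the database is the first stage to saturate, so it is the first candidate for added capacity or caching.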
Chaos engineering helps teams build greater system resiliency and flexibility into their software. Therefore, organizations can approach coding new software and solutions more intelligently because they know how the current system handles issues.
[1] Chaos Engineering: System Resiliency in Practice, Casey Rosenthal, Nora Jones, 2020.
[2] What is Chaos Monkey? Chaos engineering explained, InfoWorld, 13 May 2020.
[3] Knight Capital Says Trading Glitch Cost It USD 440 Million, New York Times, 2012.
[4] There Is No Resilience without Chaos, The New Stack, 13 Apr 2023.
[5] Incident Management in the Cloud Era, Constellation Research, 2023.
[6] ChAP: Chaos Automation Platform, Netflix Blog, 26 July 2017.
[7] The I&O Leader’s Guide to Chaos Engineering, Gartner, 28 October 2021.