It is an established fact in the IT industry that change is one of the biggest contributors to service outages.

As more enterprises migrate their applications to cloud-native deployments and adopt automated build and deployment pipelines, the volume and rate of change have increased significantly. This makes it difficult for Site Reliability Engineers (SREs) to assess the risk of each change manually using traditional methods.

The Change Risk Prediction capability in IBM Cloud Pak® for Watson AIOps complements SREs’ skills and knowledge by alerting them to potentially problematic changes and presenting historical evidence from their own prior experience or that of other teams. This helps SREs work more efficiently and maintain high service quality in a fast-moving environment.

Building an automated system for change risk assessment is challenging. While many generic techniques for risk evaluation have been proposed, they cannot be applied directly to change risk assessment. In this article, we describe the data used for our experiments, highlight the challenges and present a methodology for addressing them.

Data

We use historical change and incident records that have the following information:

  • A single change record typically captures attributes like change number, change title, change description, change purpose, change environment, change team, closure code, backout plan, close notes and configuration items.
  • An incident record typically captures information like incident number, incident title, incident description, opened date, incident severity, impacted configuration item(s), outage start time, outage end time, incident state, resolution description and caused by change (change ID and details of change if incident is induced by change).

Challenge 1

Change record datasets are highly imbalanced and extremely noisy for most machine learning methods. Although many major incidents are caused by changes, most changes do not cause incidents. Changes happen frequently but incidents are rare, so the percentage of changes that cause incidents is very low. We consider a change problematic if the SRE was unable to deploy it (i.e., the change failed), if it induced an incident during deployment, or if it was deployed successfully but subsequently caused an incident. This study is based on 227.7K change records gathered over a period of seven months; only 2.1% of these changes are marked as problematic.
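To make the imbalance concrete, a quick back-of-the-envelope calculation from the figures above (227.7K changes, 2.1% problematic):

```python
# Illustrative arithmetic using the figures reported in this article.
total_changes = 227_700       # change records over seven months
problematic_rate = 0.021      # 2.1% marked as problematic

problematic = round(total_changes * problematic_rate)
successful = total_changes - problematic
ratio = successful / problematic

print(problematic)     # roughly 4,782 problematic changes
print(round(ratio))    # about 47 successful changes per problematic one
```

Roughly one change in 47 causes trouble, which is why a naive classifier trained on this data would simply learn to predict "not problematic."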

Furthermore, only a small number of incidents caused by changes include explicit references to the inducing change, making the set of problematic changes incomplete and unusable “as is” as ground truth unless it is extended with implicit linkages.

To address this challenge and build the set of problematic changes used to train the change risk model, we need to identify implicit linkages between change and incident records. Because change and incident records capture different kinds of information, standard similarity measures like cosine similarity alone would not discover these implicit linkages. Instead, we implement a four-step, semi-supervised, learning-based approach that leverages the explicit linkages to discover additional implicit ones.

These are the four steps:

  1. Identifying explicit linkages between change and incident records.
  2. Generating all possible candidate change-incident pairs (implicit linkages).
  3. Computing linkage strength.
  4. Determining optimal linkage strength cutoff.
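A minimal sketch of how these four steps could fit together, using toy records and an illustrative token-overlap score for linkage strength. The field names, time window and cutoff are invented for illustration and are not the production implementation:

```python
# Hypothetical sketch of the four-step linkage-discovery pipeline.
from itertools import product

changes = [
    {"id": "CHG1", "text": "db schema migration", "ci": "db01", "closed": 10},
    {"id": "CHG2", "text": "patch web server", "ci": "web01", "closed": 12},
]
incidents = [
    {"id": "INC1", "text": "db outage after migration", "ci": "db01",
     "opened": 11, "caused_by": "CHG1"},   # explicit "caused by change" field
    {"id": "INC2", "text": "web latency spike", "ci": "web01",
     "opened": 13, "caused_by": None},
]

# Step 1: explicit linkages from the "caused by change" field.
explicit = {(i["caused_by"], i["id"]) for i in incidents if i["caused_by"]}

# Step 2: candidate pairs (here: incident opened shortly after the change
# closed, on the same configuration item).
candidates = [(c, i) for c, i in product(changes, incidents)
              if (c["id"], i["id"]) not in explicit
              and i["ci"] == c["ci"]
              and 0 <= i["opened"] - c["closed"] <= 24]

# Step 3: linkage strength (toy score: Jaccard overlap of text tokens).
def strength(c, i):
    a, b = set(c["text"].split()), set(i["text"].split())
    return len(a & b) / len(a | b)

scored = [((c["id"], i["id"]), strength(c, i)) for c, i in candidates]

# Step 4: keep implicit linkages above a cutoff (in practice, the cutoff is
# tuned against the explicit linkages).
CUTOFF = 0.1
implicit = {pair for pair, s in scored if s >= CUTOFF}
print(sorted(explicit | implicit))
```

In this toy run, the explicit linkage (CHG1, INC1) is joined by the discovered implicit linkage (CHG2, INC2), mirroring how the completed ground truth is assembled.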

The summary of results is shown in the figure below. The implicit linkages with strength higher than the value of the cutoff (the dotted line) are added to the explicit linkages to complete the set of problematic changes to be used for training the change risk model:

Challenge 2

If a change is identified as potentially problematic, that information alone has limited value; to be actionable, it needs a reason. Because the risk level is determined by a model trained on past changes, one would reasonably expect some subset of the problematic changes in the training set to be similar enough to the newly flagged “risky” change. These similar changes, together with their Root Cause Analysis (RCA) reports, are useful in constructing an explanation.
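As a sketch, retrieving the most similar past problematic change could be done with a simple bag-of-words cosine similarity. The change texts below are invented examples, and the production system's similarity measure may differ:

```python
import math
from collections import Counter

# Toy corpus of past problematic change descriptions (invented examples).
past_problematic = [
    "firewall rule update blocked application traffic",
    "database index rebuild caused lock contention",
    "load balancer config change dropped user sessions",
]
new_change = "update firewall rules for application subnet"

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# The most similar past problematic change (and its RCA report)
# would anchor the explanation shown to the SRE.
best = max(past_problematic, key=lambda c: cosine(c, new_change))
print(best)
```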

Using the ground truth completed with the implicit change-incident linkages, and the model performance metrics chosen above, we train multiple classification models that separate problematic changes from successful ones. We extract features for the change risk classifier from the pre-processed change text and, in addition, use structured fields like change environment and change team. The dependent variable for the binary classification is problematic, which takes values {0, 1}.

We use three methods for feature extraction from the change text to train different classifiers, as listed below:

  • Bag of words representation
  • Sequential representation using pre-trained word embeddings for training
  • Concatenated representation
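To illustrate the third option, here is a minimal sketch of a concatenated representation: token counts from the change text joined with one-hot encodings of the structured fields. The vocabulary and field values are invented examples:

```python
# Hypothetical sketch: bag-of-words text features concatenated with
# one-hot encodings of structured fields (environment, team).
records = [
    {"text": "patch kernel on web servers", "env": "prod", "team": "infra"},
    {"text": "update dns entry", "env": "staging", "team": "network"},
]

vocab = sorted({t for r in records for t in r["text"].split()})
envs = sorted({r["env"] for r in records})
teams = sorted({r["team"] for r in records})

def featurize(r):
    bow = [r["text"].split().count(t) for t in vocab]      # text features
    env = [1 if r["env"] == e else 0 for e in envs]        # one-hot environment
    team = [1 if r["team"] == t else 0 for t in teams]     # one-hot team
    return bow + env + team                                # concatenation

X = [featurize(r) for r in records]
print(len(X[0]))  # vocabulary size + number of environments + number of teams
```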

Evaluation

Table 1 compares the performance of classifiers without handling the class imbalance. The results show high precision, but recall is low, with the best value around 0.69. This approach suits an automated deployment setting, where alerts are sent to an SRE and deployment of the change is deferred to them. The SVM-based classifier outperforms the others, with the highest F0.5 score of 0.88 and a near-perfect precision of 0.98.

Table 2 compares the performance of classifiers that use algorithm-level methods to handle the class imbalance. The results show a significant increase in recall. This approach suits a manual deployment setting: although the precision values drop significantly, the change-risk alerts add information to the due diligence SREs perform during the manual change deployment process. The LSTM3 classifier outperforms all others in this scenario, with the highest F2.0 score of 0.72 and a recall of 0.82.
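The choice between F0.5 and F2.0 reflects how each deployment setting weights precision versus recall, and the standard F-beta formula makes the trade-off explicit. The precision/recall values below are illustrative, not taken from the tables:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: beta < 1 favors precision, beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative values: high precision with modest recall.
p, r = 0.98, 0.69
print(round(f_beta(p, r, 0.5), 2))  # precision-weighted score stays high
print(round(f_beta(p, r, 2.0), 2))  # recall-weighted score is pulled down
```

This is why the precision-oriented F0.5 governs the automated-deployment setting, while the recall-oriented F2.0 governs the manual one.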

Table 1: Change Risk model performance without class weights balancing.

Table 2: Change Risk model performance using class weights balancing.

A combination of a semi-supervised learning technique for discovering implicit linkages between change and incident records and a set of supervised learning techniques for change risk assessment yields good prediction performance, as the results above show.

Learn more

In this article, we described the novel methodology behind the Change Risk Prediction capability in IBM Cloud Pak® for Watson AIOps. Automated change risk assessment allows SREs to focus on the changes that truly require their attention, improving reliability, performance and utilization, while reducing the time they spend on toil.

Be sure to check out this capability in tech preview coming on March 31, 2021.

Want to know what other capabilities are included in the IBM Cloud Pak for Watson AIOps? Join this webinar to learn more.
