It is an established fact in the IT industry that change is one of the biggest contributors to service outages.

With more enterprises migrating their applications to cloud-native deployments and using automated build and deployment pipelines, the volume and rate of change have increased significantly. This makes it impractical for Site Reliability Engineers (SREs) to assess the risk of each change manually using traditional methods.

The Change Risk Prediction capability in IBM Cloud Pak® for Watson AIOps complements SREs’ skills and knowledge by alerting them to a possible problematic change and presenting historical evidence from their own or someone else’s prior experience. This capability helps SREs increase their efficiency and enables them to maintain high service quality in this fast-moving environment.

Building an automated system for change risk assessment is challenging. While many generic techniques for risk evaluation have been proposed, they cannot be applied directly to change risk assessment. In this article, we describe the data used for our experiments, highlight the challenges and present a methodology for addressing them.

Data

We use historical change and incident records that have the following information:

  • A single change record typically captures attributes like change number, change title, change description, change purpose, change environment, change team, closure code, backout plan, close notes and configuration items.
  • An incident record typically captures information like incident number, incident title, incident description, opened date, incident severity, impacted configuration item(s), outage start time, outage end time, incident state, resolution description and caused by change (change ID and details of change if incident is induced by change).
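The two record types above can be modeled as simple data structures. The sketch below is illustrative only: the field names are hypothetical stand-ins for whatever a particular ITSM tool exports, and the `caused_by_change` field holds the explicit linkage when one exists.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ChangeRecord:
    # Hypothetical schema mirroring the attributes listed above
    number: str
    title: str
    description: str
    purpose: str
    environment: str
    team: str
    closure_code: str
    backout_plan: str
    close_notes: str
    configuration_items: List[str] = field(default_factory=list)

@dataclass
class IncidentRecord:
    # Hypothetical schema mirroring the attributes listed above
    number: str
    title: str
    description: str
    opened_date: str
    severity: int
    impacted_configuration_items: List[str] = field(default_factory=list)
    outage_start: Optional[str] = None
    outage_end: Optional[str] = None
    state: str = "open"
    resolution_description: str = ""
    caused_by_change: Optional[str] = None  # explicit linkage, when recorded
```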

Challenge 1

Change record datasets are highly imbalanced and extremely noisy for most machine learning methods. Although many major incidents are caused by changes, most changes do not cause incidents. In general, changes happen frequently but incidents are rare, so the percentage of changes that cause incidents is very low. We consider a change problematic if the SRE was not able to deploy it (e.g., the change failed), if it induced an incident during deployment, or if it was deployed successfully but subsequently caused an incident. This study is based on 227.7K change records gathered over a period of seven months; only 2.1% of the changes are marked as problematic.

Furthermore, only a small number of change-induced incidents include explicit references to the changes that caused them, making the set of problematic changes incomplete and unusable "as is" as ground truth unless it is extended with implicit linkages.

To address this challenge and create the set of problematic changes to be used for training the change risk model, we need to identify the implicit linkages between change and incident records. Because change and incident records capture different kinds of information, standard similarity measures like cosine similarity would not discover implicit linkages between the two sets of records. We implement a four-step, semi-supervised learning-based approach that leverages the explicit linkages to discover additional implicit linkages.

These are the four steps:

  1. Identifying explicit linkages between change and incident records.
  2. Generating candidate change-incident pairs (potential implicit linkages).
  3. Computing linkage strength.
  4. Determining optimal linkage strength cutoff.
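The four steps above can be sketched in miniature. This is a simplified, hypothetical illustration, not the production method: it generates candidates by shared configuration items and uses token-overlap (Jaccard) similarity as a stand-in for the learned linkage-strength score.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for learned linkage strength."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def discover_linkages(changes, incidents, cutoff=0.3):
    """Return explicit plus discovered implicit change-incident linkage pairs."""
    # Step 1: explicit linkages, taken from the incident's caused-by-change field
    explicit = {(i["caused_by_change"], i["number"])
                for i in incidents if i.get("caused_by_change")}

    # Step 2: candidate pairs that share a configuration item (illustrative filter)
    candidates = [(c, i) for c in changes for i in incidents
                  if set(c["cis"]) & set(i["cis"])
                  and (c["number"], i["number"]) not in explicit]

    # Step 3: compute a linkage strength for each candidate pair
    scored = [((c["number"], i["number"]),
               jaccard(c["description"], i["description"]))
              for c, i in candidates]

    # Step 4: keep pairs above the chosen cutoff as implicit linkages
    implicit = {pair for pair, s in scored if s >= cutoff}
    return explicit | implicit
```

In the real system the linkage strength is learned semi-supervised from the explicit linkages rather than fixed as a lexical overlap, and the cutoff is optimized rather than hand-picked.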

The summary of results is shown in the figure below. The implicit linkages with strength higher than the value of the cutoff (the dotted line) are added to the explicit linkages to complete the set of problematic changes to be used for training the change risk model:

Challenge 2

If a change is identified as potentially problematic, that information alone has limited value. To make it actionable, we need to provide a reason. We are determining risk level based on a model that is trained with past changes, so one would reasonably expect that some subset of the problematic changes in that set must be similar enough to the newly identified “risky” change. These similar changes together with their Root Cause Analysis (RCA) reports would be useful in creating an explanation.

Using the complete ground truth obtained by adding the implicit change-incident linkages, we train multiple classification models to separate problematic changes from successful ones. We extract features for the change risk classifier from the pre-processed change text and, in addition, from structured fields like change environment and change team. The dependent variable for the binary classification is problematic, which takes values {0, 1}.

We use three methods for feature extraction from the change text to train different classifiers, as listed below:

  • Bag of words representation
  • Sequential representation using pre-trained word embeddings for training
  • Concatenated representation
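The concatenated representation can be sketched in pure Python: a bag-of-words vector over the change text joined with one-hot encodings of the structured fields. The function names and the `text`/`environment`/`team` keys are hypothetical; a real pipeline would use a library vectorizer and learned embeddings for the sequential variant.

```python
def build_vocab(texts):
    """Collect a sorted vocabulary of all tokens seen in the change texts."""
    return sorted({tok for t in texts for tok in t.lower().split()})

def bow_vector(text, vocab):
    """Bag-of-words: count of each vocabulary token in the text."""
    toks = text.lower().split()
    return [toks.count(w) for w in vocab]

def one_hot(value, categories):
    """One-hot encode a structured field against its known categories."""
    return [1 if value == c else 0 for c in categories]

def featurize(change, vocab, environments, teams):
    """Concatenated representation: text features + structured fields."""
    return (bow_vector(change["text"], vocab)
            + one_hot(change["environment"], environments)
            + one_hot(change["team"], teams))
```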

Evaluation

Table 1 compares the performance of the classifiers without handling the class imbalance. The results show high precision, but the recall values are low, with the best at around 0.69. This configuration suits automated deployment, where an alert is sent to an SRE and deployment of the flagged change is deferred until the SRE reviews it. The SVM-based classifier outperforms the others, giving the highest F0.5 score of 0.88 with near-perfect precision of 0.98.

Table 2 compares the performance of the classifiers using algorithm-level methods for handling the class imbalance. The results show a significant increase in recall. This configuration suits a manual deployment setting: although the precision values drop significantly, the change-risk alerts add information for the due diligence SREs perform during the manual change deployment process. The LSTM3 classifier outperforms all other classifiers in this scenario, giving the highest F2.0 score of 0.72 and a recall of 0.82.
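The F0.5 and F2.0 scores reported in the two tables are instances of the general Fβ measure, which weights precision more heavily for β < 1 (the automated-deployment setting) and recall more heavily for β > 1 (the manual-deployment setting). A minimal sketch of the formula:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: harmonic mean of precision and recall, weighted by beta.

    beta < 1 emphasizes precision; beta > 1 emphasizes recall.
    """
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```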

Table 1: Change Risk model performance without class weights balancing.

Table 2: Change Risk model performance using class weights balancing.
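The class-weight balancing used for Table 2 can be illustrated with the common "balanced" heuristic (as implemented, for example, in scikit-learn), where each class receives a weight inversely proportional to its frequency so that the rare problematic class contributes as much to the loss as the abundant successful class. A minimal sketch, not the exact scheme used in the study:

```python
def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return {c: n / (len(counts) * k) for c, k in counts.items()}
```

With a 2% problematic rate like the one in this study, the problematic class ends up weighted roughly 50 times heavier than the successful class, which is what drives the recall gains (and precision losses) seen in Table 2.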

The combination of a semi-supervised learning technique for discovering implicit linkages between change and incident records and a set of supervised learning techniques for change risk assessment yields good prediction performance, as shown in the results above.

Learn more

In this article, we described a novel methodology that underpins the Change Risk Prediction capability in IBM Cloud Pak® for Watson AIOps. Automated change risk assessment allows SREs to focus on the changes that truly require their attention, improving reliability, performance or utilization, while reducing the time they spend on toil.

Be sure to check out this capability in tech preview coming on March 31, 2021.

Want to know what other capabilities are included in the IBM Cloud Pak for Watson AIOps? Join this webinar to learn more.
