December 16, 2020 By Rama Akkiraju 5 min read

Information Technology (IT) Operations management is a vexing problem for most companies that rely on IT systems for mission-critical business applications.

Despite the best intentions of engineers, good designs, and solid development practices, the software and hardware systems that companies deploy for critical business applications are susceptible to outages, which cost millions of dollars each year in labor, lost revenue, and customer dissatisfaction.

Artificial intelligence (AI) can help solve these problems. AI can help IT operations managers, also known as Site Reliability Engineers (SREs), detect issues early, predict them before they occur, reduce event and alert noise, locate the specific application or infrastructure component that is the source of the issue, determine the scope of incident impact, and recommend relevant and timely actions.

All these analytics help reduce the mean times to detect (MTTD), identify/isolate (MTTI) and resolve (MTTR) an incident. This, in turn, saves millions of dollars by preventing direct costs (lost revenue, penalties, opportunity costs, etc.) and indirect costs (customer dissatisfaction, lost customers, lost references, etc.).

Fortunately, Watson AIOps can do all of this for you! Figure 1 shows the various types of data that Watson AIOps can process. In this article, we describe the AI in our Watson AIOps solution:

Log anomaly prediction

IBM Watson AIOps’ state-of-the-art and multi patent-pending log anomaly detection technology automatically parses IT application and infrastructure logs from log aggregation tools (e.g., Splunk, Humio, LogDNA, and Logstash) and detects anomalies in real time. It detects them much sooner than traditional threshold-based or error-string-matching alerting techniques can, significantly reducing the mean time to diagnose an incident. We use deep learning algorithms both to prepare features from logs during log parsing and to make anomaly predictions. Users don’t have to set static thresholds or write manual rules to detect anomalies.
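To make the idea of detecting anomalies without static thresholds or string-matching rules more concrete, here is a minimal sketch. It is not the deep learning approach described above: it swaps in simple regex masking to turn raw log lines into templates and then reports templates that never appeared in a baseline period. The masking rules and the function names `to_template` and `unseen_templates` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical masking rules (an assumption for this sketch); a production
# parser would learn templates rather than hard-code them.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace variable fields (IPs, hex values, numbers) with placeholders."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def unseen_templates(baseline, window):
    """Count templates in `window` that never occurred in `baseline`."""
    known = {to_template(line) for line in baseline}
    counts = Counter(to_template(line) for line in window)
    return {tpl: n for tpl, n in counts.items() if tpl not in known}
```

Called with a baseline of normal request logs and a window that contains a new kind of message, `unseen_templates` returns only the new message template and how often it occurred, without anyone having written a rule for that message in advance.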

Metric anomaly prediction

Watson AIOps’ metric-based anomaly detection analyzes metrics data from various systems (e.g., New Relic, AppDynamics, and SolarWinds) to automatically learn the normal behavior of metrics in your company and detect anomalies from those metrics. It employs a set of time-tested time-series algorithms (e.g., Granger Causality, Robust Bounds, Variant/Invariant, Finite Domain, and Predominant Range) to capture seasonality and significant trends and to perform forecasting. 
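As a rough illustration of learning normal behavior from history instead of setting a fixed threshold, the sketch below derives a band from the median and the median absolute deviation (MAD) of past values and flags recent points outside it. This is a generic robust-statistics technique and an assumption on our part; it is not necessarily how the Robust Bounds algorithm named above is implemented, and it does not model seasonality or trends. The function names and the multiplier `k` are hypothetical.

```python
from statistics import median


def robust_bounds(history, k=3.0):
    """Return (low, high) bounds from the median and scaled MAD of `history`."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    # 1.4826 scales the MAD to be comparable to a standard deviation
    # for normally distributed data.
    spread = 1.4826 * mad
    return med - k * spread, med + k * spread


def anomalies(history, recent, k=3.0):
    """Return (index, value) pairs from `recent` that fall outside the bounds."""
    low, high = robust_bounds(history, k)
    return [(i, x) for i, x in enumerate(recent) if x < low or x > high]
```

Because the median and MAD are barely affected by a few extreme values, a past outage in the history does not widen the band the way a mean and standard deviation would.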

Event grouping

An event indicates that something noteworthy has happened in an IT operations environment: for example, an application has become unavailable, or a disk is full or reaching capacity. The goal of event grouping and classification is to reduce the noise for IT operations management personnel and help them focus on the few important events that need their immediate attention. In Watson AIOps, anomalies detected from metrics, logs, and tickets are grouped using multiple algorithms (e.g., Temporal, Spatial, and Association Rule mining).
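The simplest of these ideas, temporal grouping, can be sketched as follows: events that occur close together in time are placed in the same group. This is only an illustrative assumption about what temporal grouping might look like, not the product's algorithm; the `Event` class, the `group_by_time` function, and the 60-second gap are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some reference point
    resource: str
    message: str


def group_by_time(events, max_gap=60.0):
    """Group events so that consecutive events in a group are at most
    `max_gap` seconds apart."""
    groups = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][-1].timestamp <= max_gap:
            groups[-1].append(event)  # close enough to the previous event
        else:
            groups.append([event])  # start a new group
    return groups
```

An SRE would then review one entry per group instead of one entry per event. Spatial grouping would additionally take into account where the affected resources sit in the topology.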

Static and dynamic topology management

Application and network topology refers to a map or diagram that lays out the connections between different mission-critical applications in an enterprise. Static topology refers to a map that is constructed from the build and deployment information on applications and infrastructure components. Dynamic topology, on the other hand, refers to a map that captures the resources and their relationships as the environment changes at run time and provides near-real-time visibility into them.

With Topology Manager in Watson AIOps, you can compare the current topology with a historical one to answer questions such as “What happened?” and “What’s happening now?” It helps you investigate the details that led up to an incident and see the topology (and status) changes over time. In addition, faults are localized on the topology.
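Comparing a current topology with a historical one can be pictured as a set difference between two snapshots. The sketch below assumes, purely for illustration, that each snapshot is a collection of (source, target) edges; the `diff_topology` function is hypothetical and is not the Topology Manager API.

```python
def diff_topology(before, after):
    """Compare two topology snapshots given as iterables of (source, target) edges."""
    old_edges, new_edges = set(before), set(after)
    old_nodes = {node for edge in old_edges for node in edge}
    new_nodes = {node for edge in new_edges for node in edge}
    return {
        "added_edges": new_edges - old_edges,
        "removed_edges": old_edges - new_edges,
        "added_nodes": new_nodes - old_nodes,
        "removed_nodes": old_nodes - new_nodes,
    }
```

A difference such as a removed database node and a newly added replica shortly before an incident is the kind of change this comparison is meant to surface.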

Fault localization and blast radius

Entity mentions are the names of the resources (e.g., service or application component names, server names, server IP addresses, pod IDs, node IDs, etc.) that are referenced in anomalous logs, alerts, tickets, and events. Once events are grouped, entity mentions in anomalous logs, metrics, alerts, and events are extracted. These entities are resolved with topological resources to isolate the problem and to place the identified entities on the corresponding dynamic topology instances that match the time at which the mentions were noted. Traversing the topological graph in the application, infrastructure, and network layers enables us to map out the impacted components, known as the blast radius.
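The traversal step can be sketched as a breadth-first search from the localized fault. The sketch assumes a single-layer graph stored as a dictionary that maps each resource to the resources that depend on it; the `blast_radius` function and this representation are hypothetical simplifications of a multi-layer topology.

```python
from collections import deque


def blast_radius(dependents, source):
    """Return every resource reachable from `source` by following
    'is depended on by' edges, excluding `source` itself."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in dependents.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(source)
    return seen
```

If a database is the localized fault and the orders and inventory services depend on it, with checkout depending on orders, the function returns all three services as potentially impacted.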

Incident resolution

Watson AIOps ingests and mines prior incident ticket data by connecting to tools such as ServiceNow to provide timely and relevant next-best-action recommendations for the currently diagnosed problem. The current incident symptoms are framed as a query against the indexed ticket data. The query retrieves the top k relevant prior incident records, and important entity-action (aka noun-verb) phrases are extracted from each relevant record so that SREs can get a quick glimpse of the suggested action. We apply various natural language processing techniques to extract entity and action phrases, including rule-based systems.
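The retrieval half of this step can be illustrated with a small top-k search. The sketch below ranks prior tickets by Jaccard similarity between word sets, a deliberately simple stand-in for a real search index; it does not extract entity-action phrases. The ticket IDs, the `tokens` and `top_k` functions, and the scoring choice are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k(query, tickets, k=3):
    """Return up to `k` (ticket_id, score) pairs ranked by Jaccard similarity
    between the query and each ticket's text; zero-score tickets are dropped."""
    q = tokens(query)
    scored = []
    for ticket_id, text in tickets.items():
        t = tokens(text)
        union = q | t
        score = len(q & t) / len(union) if union else 0.0
        scored.append((score, ticket_id))
    # Highest score first; ties broken by ticket ID for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(tid, round(score, 3)) for score, tid in scored[:k] if score > 0]
```

Given the symptoms of a current incident as the query, the function returns the most similar prior tickets, whose resolution notes could then be scanned for suggested actions.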

Insight delivery and action implementation

In Watson AIOps, all of the insights described above are delivered via both ChatOps and dashboards. Real-time, in-the-moment insights are delivered via ChatOps to SREs directly in the place where they work. Within ChatOps, SREs can explore the evidence behind the insights and share selected incident resolution suggestions with other collaborators. From ChatOps, SREs can launch log, metric, and ticket monitoring tools to explore further details. Similarly, SREs can launch interactive dashboards for detailed exploration of events, event groups, metric anomalies, and topology. Applicable actions/runbooks can then be automatically run via Runbook execution.

A note on AI model lifecycle management

Watson AIOps takes a representative set of metrics, logs, and ticket data for training and building its unsupervised models. These models are set up to learn continuously by using up-to-date data from your environment and to improve based on user feedback. All AI models are transparent and explainable to earn your trust:

Figure 2: AI pipelines in Watson AIOps.

What’s next for Watson AIOps?

In the next generations of our Watson AIOps solution, we envision fully instrumented, observable, self-aware, automated, and autonomic IT operations environments. AIOps solutions will not only help resolve issues in a reactive mode, but also help prevent issues from happening in the first place by designing the Development-Security-Operations (DevSecOps) lifecycle activities for efficient operations right from the get-go. For example, smart checks and gates can prevent risky deployments from getting pushed to production, stop under-tested code modules from getting into deployment phases, and block code with risky security vulnerabilities from getting to the deployment phase. We can’t wait to shape the future and take you all with us on this journey.

Figure 3: Shifting-left in DevSecOps lifecycle while closing the loop’s virtuous feedback and feedforward cycles for efficient operations management.
