
Published: 12 December 2023
Contributors: Joel Barnard, Cole Stryker

What is anomaly detection?

Anomaly detection, or outlier detection, is the identification of observations, events or data points that deviate from what is usual, standard or expected, making them inconsistent with the rest of a data set.

Anomaly detection has a long history in the field of statistics, where analysts and scientists would study charts looking for any elements that appeared abnormal. Today, anomaly detection leverages artificial intelligence (AI) and machine learning (ML) to automatically identify unexpected changes in a data set’s normal behavior.

Anomalous data can signal critical incidents happening under the hood, such as an infrastructure failure, a breaking change from an upstream source or security threats. Anomalies can also highlight opportunities for architectural optimization or improving marketing strategies.

Anomaly detection has a range of use cases across various industries. For example, it is used in finance for fraud detection, in manufacturing to identify defects or equipment malfunctions, in cybersecurity to detect unusual network activity and in healthcare to identify abnormal patient conditions.

Outlier detection can be challenging because anomalies are often rare, and the characteristics of normal behavior can be complex and dynamic. From a business perspective, identifying actual anomalies rather than false positives or data noise is essential.


Why is anomaly detection important?

Data anomalies can have a significant impact in the field of data science, leading to incorrect or misleading conclusions. For example, a single outlier can significantly skew the mean of a data set, making it an inaccurate representation of the data. Additionally, data anomalies can impact the performance of machine learning algorithms, as they can cause the model to fit the noise rather than the underlying pattern in the data.
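The effect of a single outlier on the mean can be seen in a small sketch. The order counts below are hypothetical, and the last value stands in for a data-entry error; the median is shown only as a point of comparison.

```python
from statistics import mean, median

# Hypothetical daily order counts; the last value is an assumed data-entry error.
orders = [102, 98, 101, 97, 103, 99, 100, 1000]

clean = orders[:-1]
print(mean(clean), median(clean))    # both are 100 without the outlier
print(mean(orders), median(orders))  # the mean jumps to 212.5, the median only to 100.5
```

One extreme value more than doubles the mean while the median barely moves, which is why an unhandled outlier can make the mean a poor summary of the data.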

Identifying and handling data anomalies is crucial for several reasons: 

Improved data quality: Identifying and handling data anomalies can significantly improve data quality, which is essential for accurate and reliable data analysis. By addressing data anomalies, analysts can reduce noise and errors in the data set, ensuring that the data is more representative of the true underlying patterns.

Enhanced decision making: Data-driven decision making relies on accurate and reliable data analysis to inform decisions. By identifying and handling data anomalies, analysts can ensure that their findings are more trustworthy, leading to better-informed decisions and improved outcomes.

Optimized machine learning performance: Data anomalies can significantly impact the performance of machine learning algorithms, as they can cause the model to fit the noise rather than the underlying pattern in the data. By identifying and handling data anomalies, analysts can optimize the performance of their machine learning models, ensuring that they provide accurate and reliable predictions.

Types of anomalies

An anomaly detection system can uncover two general types of anomalies: unintentional and intentional.

Unintentional anomalies are data points that deviate from the norm due to errors or noise in the data collection process. These errors can be either systematic or random, originating from issues like faulty sensors or human error during data entry. Unintentional anomalies can distort the data set, making it challenging to derive accurate insights.

Intentional anomalies are data points that deviate from the norm due to specific actions or events. These anomalies can provide valuable insights into the data set, as they may highlight unique occurrences or trends. For example, a sudden spike in sales during a holiday season could be considered an intentional anomaly, as it deviates from the typical sales pattern but is expected due to a real-world event.

Time-series data anomalies

In business data, three main time-series data anomalies exist: point anomalies, contextual anomalies and collective anomalies.

Point anomalies, also known as global outliers, are individual data points that exist far outside the rest of the data set. They can be either intentional or unintentional and may result from errors, noise or unique occurrences. An example of a point anomaly is a bank account withdrawal that is significantly larger than any of the user’s previous withdrawals.   
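One possible way to flag a point anomaly such as the oversized withdrawal is a modified z-score based on the median and the median absolute deviation (MAD). This is a minimal sketch, not a prescribed method: the function name, the sample withdrawals and the 3.5 cutoff (a commonly suggested default) are all assumptions.

```python
from statistics import median

def flag_point_anomalies(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values are identical; this sketch gives up here.
        return []
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores.
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

# Hypothetical withdrawal amounts for one account.
withdrawals = [40, 60, 50, 45, 55, 50, 65, 35, 50, 2000]
print(flag_point_anomalies(withdrawals))  # [2000]
```

The median and MAD are used instead of the mean and standard deviation because the anomaly itself would otherwise inflate the baseline it is compared against.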

Contextual anomalies are data points that deviate from the norm within a specific context. These anomalies are not necessarily outliers when considered in isolation but become anomalous when viewed within their specific context.

For example, consider home energy usage. If there is a sudden increase in energy consumption at midday when no family members are typically home, the anomaly would be contextual. This data point might not be an outlier when compared to energy usage in the morning or evening (when people are usually home), but it is anomalous in the context of the time of day it occurs.
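The home energy example can be sketched by comparing each reading with a baseline for its own hour of the day rather than with the whole data set. The readings, function names and the three-standard-deviation cutoff below are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def hourly_baseline(history):
    """Map each hour of the day to the (mean, standard deviation) of its past readings."""
    by_hour = defaultdict(list)
    for hour, kwh in history:
        by_hour[hour].append(kwh)
    return {hour: (mean(vals), stdev(vals)) for hour, vals in by_hour.items()}

def is_contextual_anomaly(baseline, hour, kwh, k=3.0):
    """Flag a reading that is more than k standard deviations from its hour's mean."""
    mu, sigma = baseline[hour]
    return abs(kwh - mu) > k * sigma

# Hypothetical (hour, kWh) history: low usage at midday, high usage in the evening.
history = [(12, 0.4), (12, 0.5), (12, 0.6), (12, 0.5),
           (19, 3.0), (19, 3.4), (19, 2.8), (19, 3.2)]
baseline = hourly_baseline(history)

print(is_contextual_anomaly(baseline, 12, 3.0))  # True: unusual for midday
print(is_contextual_anomaly(baseline, 19, 3.0))  # False: normal for the evening
```

The same 3.0 kWh reading is flagged at midday and accepted in the evening, which is what makes the anomaly contextual rather than global.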

Collective anomalies involve a set of data instances that together deviate from the norm, even though individual instances may appear normal. An example of this type of anomaly would be a network traffic data set that shows a sudden surge in traffic from multiple IP addresses at the same time.
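A collective anomaly can be illustrated with a sliding window: no single value crosses a per-point limit, but a run of values together is elevated. This is a simplified sketch over one traffic series; the function name, the request counts and both limits are assumptions chosen for illustration.

```python
def collective_anomalies(series, window, per_point_limit, window_mean_limit):
    """Return start indices of windows whose mean is elevated even though
    no individual value in the window exceeds the per-point limit."""
    flagged = []
    for start in range(len(series) - window + 1):
        chunk = series[start:start + window]
        if max(chunk) <= per_point_limit and sum(chunk) / window > window_mean_limit:
            flagged.append(start)
    return flagged

# Hypothetical requests per minute; minutes 4 to 7 are elevated together.
requests_per_minute = [50, 55, 48, 52, 80, 82, 79, 81, 51, 49]
print(collective_anomalies(requests_per_minute, window=4,
                           per_point_limit=90, window_mean_limit=75))  # [4]
```

A per-point check with a limit of 90 would pass every value here; only the window-level view exposes the sustained surge.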

Anomaly detection methods

Detecting data anomalies is a critical part of data analysis because it helps ensure that findings are accurate and reliable. Various methods can be used to build an anomaly detection system.

Visualization

Visualization is a powerful tool for detecting data anomalies, as it allows data scientists to quickly identify potential outliers and patterns in the data. By plotting the data using charts and graphs, analysts can visually inspect the data set for any unusual data points or trends.

Statistical tests

Statistical tests can be used by data scientists to detect data anomalies by comparing the observed data with the expected distribution or pattern.

For example, the Grubbs test can be used to identify outliers in a data set by comparing each data point to the mean and standard deviation of the data. Similarly, the Kolmogorov-Smirnov test can be used to determine whether a data set follows a specific distribution, such as a normal distribution.
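As a rough illustration of the Grubbs test, the sketch below computes the test statistic: the largest absolute deviation from the mean, divided by the sample standard deviation. The sensor readings are hypothetical, and the critical value is an assumed table value for 10 observations at a two-sided 5% significance level; it should be checked against a published table or computed from the t-distribution for other sample sizes.

```python
from statistics import mean, stdev

def grubbs_statistic(data):
    """Return (G, suspect): the Grubbs statistic and the value farthest from the mean."""
    mu = mean(data)
    s = stdev(data)  # sample standard deviation
    suspect = max(data, key=lambda x: abs(x - mu))
    return abs(suspect - mu) / s, suspect

# Hypothetical sensor readings with one suspicious value.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.1, 9.9, 14.0]
g, suspect = grubbs_statistic(readings)

# Assumed critical value for n = 10, two-sided alpha = 0.05 (verify against a table).
G_CRITICAL = 2.29
print(suspect, round(g, 2), g > G_CRITICAL)
```

The test assumes the data are approximately normally distributed and checks one suspect value at a time, so it is better suited to small, well-behaved samples than to large or skewed data sets.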

Machine learning algorithms

Machine learning algorithms can be used to detect data anomalies by learning the underlying pattern in the data and then identifying any deviations from that pattern. Some of the most common ML anomaly detection algorithms include:

  • Decision trees: Isolation Forest is an ensemble learning method built from decision trees. It isolates data points by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature. Anomalies tend to be isolated in fewer splits than normal points.

  • One-Class Support Vector Machine (SVM): One-Class SVM is a classification algorithm trained only on the "normal" instances, aiming to create a boundary that encompasses the normal data. Instances falling outside this boundary are considered anomalies.

  • k-Nearest Neighbors (k-NN): k-NN is a simple algorithm that classifies a data point based on the majority class of its k nearest neighbors. Instances that have significantly fewer neighbors of the same class, or that lie unusually far from their nearest neighbors, can be considered anomalies.

  • Naive Bayesian: These methods work by defining the probability that an event will occur based on the presence of contributing factors and detecting relationships with the same root cause.

  • Autoencoders: A type of neural network that learns to compress input data and then reconstruct it. Inputs that the trained network reconstructs poorly do not align with the patterns in the historical data and are flagged as abnormalities.

  • Local Outlier Factor (LOF): LOF is a density-based algorithm that measures the local density deviation of a data point with respect to its neighbors. Points with significantly lower density compared to their neighbors are considered outliers.

  • k-means clustering: k-means is a clustering technique that groups unlabeled data points by assigning each one to the nearest cluster mean (centroid). Points that lie far from every centroid can be treated as anomalies.
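To make the neighbor-based idea concrete, here is a minimal sketch of a distance-based k-NN anomaly score. Note that this is an unsupervised variant that scores each point by its distance to its k-th nearest neighbor, not the class-majority vote used in k-NN classification. The function name and the sample points are hypothetical.

```python
import math

def knn_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor.
    A larger score means the point sits in a sparser region."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D points: four form a tight cluster, one lies far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = knn_scores(points, k=2)
most_anomalous = points[scores.index(max(scores))]
print(most_anomalous)  # (8.0, 8.0)
```

This brute-force version compares every pair of points, which is fine for a sketch but scales quadratically; production libraries typically use spatial indexes for the neighbor search.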

Anomaly detection techniques

An anomaly detection algorithm can learn to identify patterns and detect anomalous data using various machine learning training techniques. The amount of labeled data, if any, in a data team’s training data set determines which of the main anomaly detection techniques they will use—unsupervised, supervised or semi-supervised.

Unsupervised anomaly detection

With unsupervised anomaly detection techniques, data engineers train a model by providing it with unlabeled data sets that it uses to discover patterns or abnormalities on its own. These techniques are by far the most commonly used because they apply to a wider range of problems, but they require massive data sets and computing power. Unsupervised machine learning is most often found in deep learning scenarios, which rely on artificial neural networks.

Supervised anomaly detection

Supervised anomaly detection techniques use an algorithm that is trained on a labeled data set that includes both normal and anomalous instances. Because labeled training data is generally unavailable and the classes are inherently unbalanced, these anomaly detection techniques are rarely used.

Semi-supervised anomaly detection

Semi-supervised techniques maximize the positive attributes of both unsupervised anomaly detection and supervised anomaly detection. By providing an algorithm with some portion of labeled data, it can be partially trained. Data engineers then use the partially trained algorithm to label a larger data set autonomously, a process referred to as “pseudo-labeling.” Assuming they prove reliable, these newly labeled data points are combined with the original data set to fine-tune the algorithm.
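The pseudo-labeling loop can be sketched with a deliberately simple model. The example below assumes a one-dimensional nearest-centroid classifier and treats the gap between the two closest centroids as a stand-in for label reliability; the function names, data and margin are hypothetical, and a real system would use a stronger model and a proper validation step.

```python
from statistics import mean

def fit_centroids(labeled):
    """Train the toy model: one centroid (mean value) per label."""
    groups = {}
    for value, label in labeled:
        groups.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in groups.items()}

def pseudo_label(centroids, unlabeled, min_margin):
    """Label each point with its nearest centroid, keeping only confident labels.
    Assumes at least two labels are present in centroids."""
    accepted = []
    for value in unlabeled:
        ranked = sorted(centroids, key=lambda label: abs(value - centroids[label]))
        best, runner_up = ranked[0], ranked[1]
        margin = abs(value - centroids[runner_up]) - abs(value - centroids[best])
        if margin >= min_margin:
            accepted.append((value, best))
    return accepted

# Step 1: partially train on a small labeled set (hypothetical values).
labeled = [(10.0, "normal"), (11.0, "normal"), (9.0, "normal"),
           (50.0, "anomaly"), (54.0, "anomaly")]
centroids = fit_centroids(labeled)

# Step 2: pseudo-label a larger unlabeled set, dropping ambiguous points.
unlabeled = [10.5, 9.5, 52.0, 31.0]
pseudo = pseudo_label(centroids, unlabeled, min_margin=10.0)

# Step 3: combine the reliable pseudo-labels with the original data and refit.
centroids = fit_centroids(labeled + pseudo)
print(pseudo)
```

The point at 31.0 is equally far from both centroids, so it is left unlabeled rather than risk reinforcing a wrong guess in the next round of training.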

Finding the right combination of supervised and unsupervised machine learning is vital to machine learning automation. Ideally, the vast majority of data classifications would be done without human interaction in an unsupervised manner. That said, data engineers should still be able to feed algorithms with training data that will help create business-as-usual baselines. A semi-supervised approach allows for scaling anomaly detection with the flexibility to make manual rules regarding specific anomalies.

Anomaly detection use cases

Fraud detection

Anomaly detection models are used extensively in the banking, insurance and stock trading industries to identify fraudulent activities in real time, such as unauthorized transactions, money laundering, credit card fraud, bogus tax return claims and abnormal trading patterns.

Cybersecurity

Intrusion detection systems (IDSs) and other cybersecurity technologies use anomaly detection to help identify unusual or suspicious user activities or network traffic patterns that indicate potential security threats or attacks, such as malware infections or unauthorized access.

Manufacturing and quality control

Anomaly detection algorithms are often employed together with computer vision to identify defects in products or packaging by analyzing high-res camera footage, sensor data and production metrics.

IT systems management

Anomaly detection can be used to monitor the performance of IT systems and keep operations running smoothly. It identifies unusual patterns in server logs and uses patterns from past faults to predict potential issues or failures.

Energy, transportation and critical infrastructure

By identifying irregularities in data from Internet of Things (IoT) sensors and operational technology (OT) devices, anomaly detection can help predict equipment failures or maintenance needs in industries like aviation, energy and transportation. When used to monitor energy consumption patterns and identify anomalies in usage, anomaly detection can lead to more efficient energy management and early detection of equipment failures.

Retail and e-commerce

Merchants use anomaly detection models to identify unusual patterns in customer behavior, which can help with fraud detection, predicting customer churn and improving marketing strategies. In e-commerce, anomaly detection is applied to identify fake reviews, account takeovers, abnormal purchasing behavior and other indicators of fraud or cybercrime.

 
