Downsampling decreases the number of data samples in a dataset. In doing so, it aims to correct imbalanced data and thereby improve model performance.
Downsampling is a common data processing technique that addresses class imbalance in a dataset by removing data from the majority class until it matches the size of the minority class. It is the counterpart of upsampling, which resamples minority class points. Both Python's scikit-learn and MATLAB contain built-in functions for implementing downsampling techniques.
Downsampling for data science is often mistaken for downsampling in digital signal processing (DSP). The two are similar in spirit. In DSP, downsampling (also known as decimation) reduces the sampling rate of a discrete-time signal, discarding some of the original samples. The rate is typically reduced by an integer factor, keeping only every nth sample. To prevent aliasing, the signal is first passed through a lowpass filter, also called an anti-aliasing filter, which removes the high-frequency and noise components that the lower sampling rate can no longer represent.
Downsampling for data balancing can also be confused with downsampling for image processing. When data contains many features, as in high-resolution MRI images, computation can become expensive. Downsampling in image processing reduces the dimensionality of each data point, for example through pooling or strided convolution. This is not the same as balancing the dataset: it is an optimization technique, and interpolation is later needed to approximate the original resolution.
Downsampling is an effective way to address imbalances within a dataset. An imbalanced dataset is one in which a class is greatly underrepresented relative to the true population, creating unintended bias. For instance, imagine a model is trained to classify images as showing a cat or a dog, using a dataset composed of 90% cats and 10% dogs. Cats are overrepresented, and a classifier that predicts "cat" every time achieves 90% overall accuracy while never correctly identifying a dog. The imbalanced dataset causes classifiers to favor accuracy on the majority class at the expense of the minority class. The same issue can arise with multi-class datasets.1
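To see this accuracy paradox concretely, here is a minimal sketch using scikit-learn's DummyClassifier on a synthetic 90/10 split; the random features and the 0 = cat, 1 = dog labeling are illustrative assumptions, not part of any real dataset.

```python
# Hypothetical 90/10 cat-vs-dog split; features are random placeholders.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))           # placeholder features
y = np.array([0] * 900 + [1] * 100)      # 0 = cat (majority), 1 = dog (minority)

# A baseline that always predicts the majority class
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print("Overall accuracy:", accuracy_score(y, pred))                   # 0.90
print("Minority (dog) recall:", recall_score(y, pred, pos_label=1))   # 0.0
```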
The process of downsampling counteracts this problem. It identifies majority class points to remove based on specified criteria, which vary with the chosen downsampling technique. This balances the dataset by decreasing the number of samples in the overrepresented majority class until all classes are represented in roughly equal proportion.
While imbalances can be seen by simply plotting the counts of data points in each class, those counts alone don't tell us how strongly the imbalance will affect the model. Fortunately, performance metrics can gauge how well a downsampling technique corrects for class imbalance. Most of these metrics apply to binary classification, where there are only two classes: a positive and a negative. Usually the positive class is the minority class and the negative class is the majority class. Two popular metrics are receiver operating characteristic (ROC) curves and precision-recall curves.1
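As a rough illustration, the sketch below computes the inputs for both curves with scikit-learn on a synthetic imbalanced dataset; make_classification, the 90/10 class weights and logistic regression are stand-ins chosen for the example, not a recommendation.

```python
# Synthetic imbalanced data: ~90% negative (majority), ~10% positive (minority).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_curve, roc_auc_score,
                             precision_recall_curve, average_precision_score)

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]     # probability of the positive (minority) class

fpr, tpr, _ = roc_curve(y_test, scores)                        # ROC curve points
precision, recall, _ = precision_recall_curve(y_test, scores)  # PR curve points

print("ROC AUC:", roc_auc_score(y_test, scores))
print("Average precision:", average_precision_score(y_test, scores))
```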
Random downsampling is a deletion technique where random points in the majority class are chosen without replacement and deleted from the dataset until the majority class size is equal to the minority class size. This is an easy way to randomly delete a subset of data for balancing purposes. However, this technique can cause important patterns or distributions in the majority class to disappear, negatively affecting classifier performance.2
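A minimal sketch of random downsampling, assuming X and y are NumPy arrays and that class 0 is the majority class; it uses scikit-learn's resample utility to draw majority points without replacement.

```python
import numpy as np
from sklearn.utils import resample

def random_downsample(X, y, majority_label=0, random_state=0):
    """Delete random majority class points until both classes are the same size."""
    X_maj, y_maj = X[y == majority_label], y[y == majority_label]
    X_min, y_min = X[y != majority_label], y[y != majority_label]

    # Draw majority class points without replacement, down to the minority class size
    X_maj_down, y_maj_down = resample(X_maj, y_maj,
                                      replace=False,
                                      n_samples=len(y_min),
                                      random_state=random_state)
    return np.vstack([X_maj_down, X_min]), np.concatenate([y_maj_down, y_min])
```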
Near Miss downsampling is a technique that aims to balance class distribution by selectively eliminating majority class examples based on their distance to minority class examples.
Conceptually, Near Miss operates on the principle that data should be kept where the majority and minority classes are very close together, as these regions provide key information for distinguishing the two classes.3 Such points are generally known as ‘hard’ to learn data points. Near Miss downsampling generally operates in two steps: first, the distances between all majority class and minority class instances are computed; then, majority class instances are removed or retained according to a distance criterion that depends on the variant used.
There are three variations of the Near Miss algorithm that provide a more definitive way of selecting majority class instances to remove. NearMiss-1 retains the majority class points whose average distance to their closest minority class points is smallest, NearMiss-2 retains those whose average distance to their farthest minority class points is smallest, and NearMiss-3 keeps, for each minority class point, a fixed number of its closest majority class neighbors.
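The sketch below applies all three variants with the imbalanced-learn library's NearMiss class; imbalanced-learn is not mentioned above, so treat the library choice and the synthetic data as assumptions made for the example.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# version picks NearMiss-1, -2 or -3; n_neighbors is the number of minority
# neighbors used when averaging distances.
for version in (1, 2, 3):
    X_res, y_res = NearMiss(version=version, n_neighbors=3).fit_resample(X, y)
    print(f"NearMiss-{version}:", Counter(y_res))
```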
Condensed Nearest Neighbors (CNN for short, not to be confused with Convolutional Neural Networks) seeks to find a subset of a dataset that can be used for training without loss in model performance. This is achieved by identifying a subset of the data that can be used to train a model that correctly predicts the entire dataset.
CNN downsampling can be broken down into the following steps:5
1. Initialize a subset S with every minority class point and one randomly selected majority class point.
2. Classify each remaining majority class point with a 1-nearest-neighbor classifier trained on S.
3. If a point is misclassified, add it to S.
4. Repeat the pass over the data until no new points are added; S becomes the downsampled dataset.
Like Near Miss, this process essentially removes all majority class instances that lie far from the decision boundary, which, again, are points that are easy to classify. It also ensures that every data point in the original dataset can be correctly predicted using only the data within S. This way, the dataset can be shrunk significantly while preserving the decision boundary reasonably well.
The image shows an example of applying condensed nearest neighbors with 1 nearest neighbor and with 21 nearest neighbors to two datasets. The top two plots show the datasets before applying condensed nearest neighbors, while the bottom two show them after. As one can see, the decision boundary is reasonably well preserved.
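A minimal sketch of CNN downsampling, again assuming the imbalanced-learn library; n_neighbors=1 mirrors the 1-nearest-neighbor rule described in the steps above, and the synthetic data is only a stand-in.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# n_neighbors=1 reproduces the 1-NN rule used to build the subset S
cnn = CondensedNearestNeighbour(n_neighbors=1, random_state=42)
X_res, y_res = cnn.fit_resample(X, y)

print("Before:", Counter(y))
print("After: ", Counter(y_res))   # majority class shrinks to roughly its boundary points
```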
The premise of Tomek Link downsampling is to reduce noise in the data by removing points near the decision boundary, thereby increasing class separation. It works by identifying Tomek links: pairs of points from different classes that are each other's nearest neighbors, meaning no third point is closer to either of them than they are to each other.2
For every Tomek link, the point belonging to the majority class is deleted. By removing majority class points that sit close to minority class points, class separation increases. One drawback of this method is the computational cost of calculating all pairwise distances between majority and minority class points.2 Tomek Link downsampling is most effective when combined with other techniques.
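A minimal sketch of Tomek Link removal, assuming the imbalanced-learn library; sampling_strategy="majority" restricts deletion to the majority class member of each link, matching the description above.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# sampling_strategy="majority" deletes only the majority class member of each Tomek link
tl = TomekLinks(sampling_strategy="majority")
X_res, y_res = tl.fit_resample(X, y)

print("Before:", Counter(y))
print("After: ", Counter(y_res))   # only points that form Tomek links are removed
```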
Edited Nearest Neighbors (ENN) downsampling is similar to Tomek Link downsampling in that it removes examples near the decision boundary to increase class separation. In general, this method removes data points whose class differs from that of most of their neighbors.2 In practice, it removes majority class data points when most of their nearest neighbors belong to the minority class, and vice versa. "Most" in this context can be freely defined: it could mean that at least one neighbor belongs to a different class, or that the proportion of neighbors in a different class exceeds a certain threshold.
ENN downsampling is usually done with 3 nearest neighbors, as illustrated below.
This is a coarser-grained strategy than Tomek Link removal because it looks at the general neighborhood of a point rather than at a single nearest neighbor, but it is an efficient way to remove noise from the data. Like Tomek Link downsampling, ENN is most effective when combined with other techniques.
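A minimal sketch of ENN downsampling, assuming the imbalanced-learn library; n_neighbors=3 follows the usual choice mentioned above, and kind_sel controls how strictly "most of the neighbors" is interpreted.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# kind_sel="mode" keeps a point only if most of its 3 neighbors share its class;
# kind_sel="all" is stricter and requires every neighbor to agree.
enn = EditedNearestNeighbours(n_neighbors=3, kind_sel="mode")
X_res, y_res = enn.fit_resample(X, y)

print("Before:", Counter(y))
print("After: ", Counter(y_res))
```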
Current developments in downsampling revolve around deep learning integrations. Neural networks have been used to downsample data in fields such as image processing and medical data analysis.6 An example is SOM-US, which uses a two-layer neural network.7 In recent years, active learning has also been applied to downsampling to mitigate the effects of imbalanced data.8 Experiments have shown that these models perform significantly better than traditional techniques.
Current research in downsampling also revolves around combining it with other methods to create hybrid techniques. One combination is to both downsample and upsample the data to get the benefits of each: SMOTE+Tomek Link, Agglomerative Hierarchical Clustering (AHC) and SPIDER are a few examples.9 Algorithm-level techniques can also incorporate ideas from traditional downsampling, as in hard example mining, where training focuses only on the ‘harder’ data points.2 All of these approaches show better performance than using each technique individually.
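As an illustration of the hybrid idea, the sketch below chains SMOTE upsampling with Tomek Link cleaning using imbalanced-learn's SMOTETomek; the library choice and the synthetic data are assumptions made for the example.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# SMOTE oversamples the minority class, then Tomek Link removal cleans the boundary
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)

print("Before:", Counter(y))
print("After: ", Counter(y_res))
```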
1 Haibo He and Edwardo Garcia, Learning from Imbalanced Data, IEEE, September 2009, https://ieeexplore.ieee.org/document/5128907 (link resides outside ibm.com).
2 Kumar Abhishek and Mounir Abdelaziz, Machine Learning for Imbalanced Data, Packt, November 2023
3 Ajinkya More, Survey of resampling techniques for improving classification performance in unbalanced datasets, 22 August 2016, https://arxiv.org/pdf/1608.06048 (link resides outside ibm.com).
4 Jianping Zhang and Inderjeet Mani, kNN Approach to Unbalanced Data Distributions: A Case Study involving Information Extraction, 2003, https://www.site.uottawa.ca/~nat/Workshop2003/jzhang.pdf (link resides outside ibm.com).
5 More, Survey of resampling techniques for improving classification performance in unbalanced datasets, 22 August 2016, https://arxiv.org/pdf/1608.06048 (link resides outside ibm.com); Alberto Fernandez, et al., Learning from Imbalanced Data Sets, Springer, 2018.
6 Md Adnan Arefeen, Sumaiya Tabassum Nimi, and M. Sohel Rahman, Neural Network-Based Undersampling Techniques, IEEE, 02 September 2020, https://ieeexplore.ieee.org/abstract/document/9184909?casa_token=RnLRvnqyiF8AAAAA:iyxPWT06HX6a9g8X1nhShrllo_ht9ZM1cqHMWjET5wOopeR5dqizBF29cSSmFMRPo9V1D7XBIwg (link resides outside ibm.com).
7 Ajay Kumar, SOM-US: A Novel Under-Sampling Technique for Handling Class Imbalance Problem, hrcak, 30 January 2024, https://hrcak.srce.hr/clanak/454006 (link resides outside ibm.com).
8 Wonjae Lee and Kangwon Seo, Downsampling for Binary Classification with a Highly Imbalanced Dataset Using Active Learning, ScienceDirect, 26 April 2022, https://www.sciencedirect.com/science/article/pii/S2214579622000089 (link resides outside ibm.com).
9 Alberto Fernandez, et al., Learning from Imbalanced Data Sets, Springer, 2018.