What is upsampling?

Published: 29 April 2024
Contributors: Jacob Murel Ph.D.

Upsampling increases the number of data samples in a dataset. In doing so, it aims to correct imbalanced data and thereby improve model performance.

Upsampling, otherwise known as oversampling, is a data processing and optimization technique that addresses class imbalance in a dataset by adding data. It adds data derived from the original minority class samples until all classes are equal in size. Both Python’s scikit-learn and MATLAB contain built-in functions for implementing upsampling techniques.

Upsampling for data science is often mistaken for upsampling in digital signal processing (DSP). The two are similar in spirit yet distinct. Like upsampling in data science, upsampling in DSP artificially creates more samples from an input signal (specifically, a discrete-time signal), in this case in the time domain, to raise its sampling rate. These new samples are generated by inserting zeros into the original signal and then applying a low-pass filter for interpolation. This differs from how data is upsampled in data balancing.
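To make the contrast concrete, here is a minimal sketch of DSP-style upsampling: zero insertion followed by a low-pass interpolation filter. The function name `upsample_signal` and the choice of a triangular kernel (which amounts to linear interpolation) are illustrative assumptions, not something the sources prescribe; real DSP code typically uses better-designed filters.

```python
def upsample_signal(signal, factor):
    """Upsample a discrete-time signal by an integer factor (illustrative).

    Step 1 inserts factor - 1 zeros after every sample (zero-stuffing).
    Step 2 applies a triangular FIR kernel, a simple low-pass filter
    that amounts to linear interpolation between the original samples.
    """
    # Zero-stuffing: each original sample is followed by factor - 1 zeros.
    stuffed = []
    for value in signal:
        stuffed.append(float(value))
        stuffed.extend([0.0] * (factor - 1))

    # Triangular kernel, for example [0.5, 1.0, 0.5] when factor is 2.
    kernel = [1.0 - abs(k) / factor for k in range(-(factor - 1), factor)]
    half = factor - 1

    # Centered filtering; positions beyond the signal edges count as zero.
    output = []
    for n in range(len(stuffed)):
        total = 0.0
        for j, weight in enumerate(kernel):
            idx = n + j - half
            if 0 <= idx < len(stuffed):
                total += weight * stuffed[idx]
        output.append(total)
    return output
```

Note that no class labels are involved here: the new samples fill in a single signal at a higher sampling rate, whereas upsampling for data balancing adds whole new examples to a minority class.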

Upsampling for data balancing is also distinct from upsampling in image processing. In the latter, high resolution images are first reduced in resolution (removing pixels) for faster computations, after which convolution returns the image to its original dimensions (adding back pixels).

Advantages

No Information Loss: Unlike downsampling, which removes data points from the majority class, upsampling generates new data points, avoiding any information loss.
Increase Data at Low Costs: Upsampling is an effective, and often the only, way to increase dataset size on demand in cases where data can only be acquired through observation. For instance, certain medical conditions are simply too rare for more data to be collected.

Disadvantages

Overfitting: Because upsampling creates new data based on the existing minority class data, the classifier can be overfitted to the data. Upsampling assumes that the existing data adequately captures reality; if that is not the case, the classifier may not be able to generalize very well.
Data Noise: Upsampling can increase the amount of noise in the data, reducing the classifier’s reliability and performance.²
Computational Complexity: Because upsampling increases the amount of data, training the classifier becomes more computationally expensive, which can be an issue when using cloud computing.²

Upsampling techniques

Random oversampling

Random oversampling is the process of duplicating random data points in the minority class until the size of the minority class is equal to the majority class.
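A minimal sketch of this idea, using only the Python standard library, might look like the following. The function name `random_oversample`, the list-based data layout and the `seed` parameter are illustrative assumptions rather than any library's API.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of each smaller class until
    every class is as large as the largest one (illustrative sketch)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())

    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Original samples belonging to this class.
        members = [x for x, lab in zip(X, y) if lab == label]
        # Draw duplicates with replacement until the class reaches the target.
        for _ in range(target - count):
            X_out.append(rng.choice(members))
            y_out.append(label)
    return X_out, y_out


X = [[0.1], [0.2], [0.3], [0.4], [9.0]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have four samples
```

Every added row is an exact copy of an existing minority sample, which is why this approach carries no new information. In practice, a library implementation such as `RandomOverSampler` from imbalanced-learn would normally be used instead of hand-written code.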

Though similar in nature, random oversampling is distinct from bootstrapping. Bootstrapping is a resampling technique, widely used in ensemble learning, that draws samples with replacement from all classes. By contrast, random oversampling resamples only from the minority class. Random oversampling can thus be understood as a more specialized form of bootstrapping.

Despite its simplicity, random oversampling has limitations. Because it adds only duplicate data points, it can lead to overfitting.³ It nonetheless has advantages over other methods: ease of implementation, no strong assumptions about the data, and low time complexity thanks to its simple algorithm.²

SMOTE

The Synthetic Minority Oversampling Technique, or SMOTE, is an upsampling technique first proposed in 2002 that synthesizes new data points from the existing points in the minority class.⁴ It consists of the following process:²

1. Find the K nearest neighbors for all minority class data points. K is usually 5.
2. Repeat steps 3-5 for each minority class data point:
3. Pick one of the data point’s K nearest neighbors.
4. Pick a random point on the line segment connecting these two points in the feature space to generate a new output sample. This process is known as interpolation.
5. Depending on how much upsampling is desired, repeat steps 3 and 4 using a different nearest neighbor.
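The steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the reference implementation: the function name `smote`, the `n_new` parameter (how many synthetic points to create) and the round-robin pass over the minority points are choices made for the example, and neighbors are searched among minority points only.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    points and their nearest minority neighbors (illustrative sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    # Step 1: the k nearest minority neighbors of every minority point.
    neighbors = []
    for i, point in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(point, minority[j]))
        neighbors.append(others[:k])

    synthetic = []
    for n in range(n_new):
        i = n % len(minority)           # step 2: cycle through minority points
        j = rng.choice(neighbors[i])    # step 3: pick one of its neighbors
        gap = rng.random()              # step 4: random position on the segment
        p, q = minority[i], minority[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote(minority, n_new=3, k=2))
```

Each synthetic point lies on a straight line between two real minority points, so it is new to the dataset but stays within the region the minority class already occupies. The imbalanced-learn library offers a maintained implementation as `SMOTE`.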

SMOTE counters the problem of overfitting in random oversampling by adding previously unseen new data to the dataset rather than simply duplicating pre-existing data. For this reason, some researchers consider SMOTE a better upsampling technique than random oversampling.

On the other hand, SMOTE’s artificial data point generation adds extra noise to the dataset, potentially making the classifier more unstable.¹ The synthetic points and noise from SMOTE can also inadvertently lead to overlaps between the minority and majority classes that don’t reflect reality, leading to what is called over-generalization.⁵

Borderline SMOTE

One popular extension, Borderline SMOTE, is used to combat the issue of artificial dataset noise and to create ‘harder’ data points. ‘Harder’ data points are data points close to the decision boundary, and therefore harder to classify. These harder points are more useful for the model to learn.²

Borderline SMOTE identifies the minority class points that are close to many majority class points and puts them into a DANGER set. DANGER points are the ‘hard’ data points to learn because they are harder to classify than points surrounded by other minority class points. This selection process excludes points whose nearest neighbors are all majority class points, which are treated as noise. From there, the SMOTE algorithm continues as normal using this DANGER set.³
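The DANGER-set selection can be sketched as below. The text above does not state how many majority neighbors count as "many"; the threshold used here (at least half, but not all, of the `m` nearest neighbors belong to another class) follows the rule as it is commonly described and should be read as an assumption. The function name `danger_set` and its parameters are likewise illustrative.

```python
import math


def danger_set(X, y, minority_label, m=5):
    """Return indices of minority points that fall in the DANGER set
    (illustrative sketch; the threshold rule is an assumption)."""
    danger = []
    for i, (point, label) in enumerate(zip(X, y)):
        if label != minority_label:
            continue
        # The m nearest neighbors, searched across the whole dataset.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(point, X[j]),
        )
        nearest = others[:m]
        n_majority = sum(1 for j in nearest if y[j] != minority_label)
        # All neighbors in another class: treated as noise and skipped.
        # At least half (but not all) in another class: a DANGER point.
        if len(nearest) / 2 <= n_majority < len(nearest):
            danger.append(i)
    return danger
```

The SMOTE interpolation step would then be run using only the points whose indices this function returns as starting points.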

ADASYN

Adaptive Synthetic Sampling Approach (ADASYN) is similar to Borderline SMOTE in that it generates more difficult data for the model to learn. But it also aims to preserve the distribution of the minority class data.⁶ It does this by first creating a weighted distribution of all the minority points based on the number of majority class examples in each point’s neighborhood. From there, it uses minority class points closer to the majority class more often when generating new data.

The process goes as follows:²

1. Fit a k-nearest neighbors (KNN) model on the entire dataset.
2. Give each minority class point a “hardness factor”, denoted as r, which is the ratio of the number of majority class points among its neighbors to the total number of neighbors in KNN.
3. As in SMOTE, generate each synthetic point by linear interpolation between a minority point and one of its neighbors, but scale the number of points generated with the point’s hardness factor. This generates more points in areas with less minority data and fewer points in areas with more.
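As a rough sketch of the weighting step, the code below computes the hardness factor r for each minority point and turns it into a number of synthetic points to generate. The function name `adasyn_counts`, the `n_new` budget and the use of simple rounding are assumptions made for illustration; because of rounding, the counts may not sum exactly to `n_new`. The sketch also assumes the dataset holds at least two points.

```python
import math


def adasyn_counts(X, y, minority_label, n_new, k=5):
    """Map each minority point's index to the number of synthetic points
    to generate from it, in proportion to its hardness factor r."""
    hardness = {}
    for i, (point, label) in enumerate(zip(X, y)):
        if label != minority_label:
            continue
        # The k nearest neighbors, searched across the whole dataset.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(point, X[j]),
        )
        nearest = others[:k]
        n_majority = sum(1 for j in nearest if y[j] != minority_label)
        # r: share of the neighbors that belong to another class.
        hardness[i] = n_majority / len(nearest)

    total = sum(hardness.values())
    if total == 0:
        # No minority point has neighbors from another class.
        return {i: 0 for i in hardness}
    # Normalize r into a distribution and split the budget accordingly.
    return {i: round(n_new * r / total) for i, r in hardness.items()}
```

Each count would then drive a SMOTE-style interpolation from that point. Unlike the DANGER-set selection in Borderline SMOTE, a minority point surrounded entirely by other classes is not discarded here; it receives the largest share.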

Data transformation/augmentations

Data augmentation creates new data by creating variations of the data. Data augmentation applies across a variety of machine learning fields.

The most basic form of data augmentation deals with transforming the raw inputs of the dataset. For example, in computer vision, image augmentations (cropping, blurring, mirroring and so on) can be used to create more images for the model to classify. Similarly, data augmentation can also be used in natural language processing tasks, like replacing words with their synonyms or creating semantically equivalent sentences.

Researchers have found that data augmentation effectively increases model accuracy for computer vision and NLP tasks because it adds similar data at a low cost. However, some caution is needed before applying these techniques. For traditional geometric augmentations, the “safety” of a transformation should be checked before it is applied. For example, rotating an image of a “9” by 180 degrees would make it look like a “6,” changing its semantic meaning.⁷
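A small sketch of geometric augmentation, with images represented as nested lists of pixel values, is shown below. The helper names `mirror`, `rotate_180` and `augment` are hypothetical and chosen for this example only.

```python
def mirror(image):
    """Flip an image left to right (each row is a list of pixel values)."""
    return [row[::-1] for row in image]


def rotate_180(image):
    """Rotate an image by 180 degrees."""
    return [row[::-1] for row in image[::-1]]


def augment(dataset, transforms):
    """Return the original (image, label) pairs plus one transformed
    copy of each image per transform, keeping the original label."""
    augmented = list(dataset)
    for image, label in dataset:
        for transform in transforms:
            augmented.append((transform(image), label))
    return augmented


image = [[1, 2],
         [3, 4]]
print(mirror(image))      # [[2, 1], [4, 3]]
print(rotate_180(image))  # [[4, 3], [2, 1]]
```

Because `augment` copies the original label onto every transformed image, only label-preserving transforms should be passed in. For handwritten digits, `rotate_180` would be unsafe for the reason given above, so the list of transforms has to be chosen per task.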

Recent research

SMOTE extensions and deep learning have been the focus of upsampling techniques in recent years. These methods aim to improve model performance and address some of the shortcomings of upsampling, like introduced bias in the distribution of the minority class.

Some developments in SMOTE include minority-predictive-probability SMOTE (MPP-SMOTE), which upsamples based on the estimated probability of seeing each minority class sample.⁸ The Multi-Label Borderline Oversampling Technique (MLBOTE) has been proposed to extend SMOTE to multi-label classification.⁹ According to their authors, both outperformed the SMOTE variants they were compared against while retaining the patterns in the original data.

Neural networks have also been used to develop oversampling techniques. Generative Adversarial Networks have stirred some interest, producing promising results, although training time makes this technique slower than other traditional upsampling methods.¹⁰

Footnotes

¹ Haibo He and Edwardo Garcia, Learning from Imbalanced Data, IEEE, September 2009, https://ieeexplore.ieee.org/document/5128907 (link resides outside ibm.com).

² Kumar Abhishek and Mounir Abdelaziz, Machine Learning for Imbalanced Data, Packt, November 2023, https://www.packtpub.com/product/machine-learning-for-imbalanced-data/9781801070836 (link resides outside ibm.com).

³ Kumar Abhishek and Mounir Abdelaziz, Machine Learning for Imbalanced Data, Packt, November 2023, https://www.packtpub.com/product/machine-learning-for-imbalanced-data/9781801070836 (link resides outside ibm.com). Alberto Fernandez, et al., Learning from Imbalanced Data Sets, Springer, 2018.

⁴ Nitesh Chawla, et al., SMOTE: Synthetic Minority Over-sampling Technique, JAIR, 01 June 2002, https://www.jair.org/index.php/jair/article/view/10302 (link resides outside ibm.com).

⁵ Kumar Abhishek and Mounir Abdelaziz, Machine Learning for Imbalanced Data, Packt, November 2023. Haibo He and Edwardo Garcia, Learning from Imbalanced Data, IEEE, September 2009, https://ieeexplore.ieee.org/document/5128907 (link resides outside ibm.com).

⁶ Alberto Fernandez, et al., Learning from Imbalanced Data Sets, Springer, 2018.

⁷ Connor Shorten and Taghi Khoshgoftaar, A survey on Image Data Augmentation for Deep Learning, Springer, 06 July 2019, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0 (link resides outside ibm.com).

⁸ Zhen Wei, Li Zhang, and Lei Zhao, Minority prediction probability based oversampling technique for imbalanced learning, ScienceDirect, 06 December 2022, https://www.sciencedirect.com/science/article/abs/pii/S0020025522014578?casa_token=TVVIEM3xTDEAAAAA:LbzQSgIvuYDWbDTBKWb4ON-CUiTUg0EUeoQf9q12IjLgXFk0NQagfh0bU3DMUSyHL_mjd_V890o (link resides outside ibm.com).

⁹ Zeyu Teng, et al., Multi-label borderline oversampling technique, ScienceDirect, 14 September 2023, https://www.sciencedirect.com/science/article/abs/pii/S0031320323006519?casa_token=NO8dLh60_vAAAAAA:AWPCvCP8PQG43DvkQFChZF2-3uzB1GJBBtgPURevWe_-aR0-WTbLqOSAsiwxulNAuh_4mIDZx-Y (link resides outside ibm.com).

¹⁰ Justin Engelmann and Stefan Lessmann, Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning, ScienceDirect, 15 July 2021, https://www.sciencedirect.com/science/article/abs/pii/S0957417421000233?casa_token=O0d1BtspA8YAAAAA:n2Uv3v2yHvjl9APVU9V_13rQ9K_KwT0P__nzd6hIngNcZJE-fmQufDgR6XT1uMmDBHx8bLXPVho (link resides outside ibm.com). Shuai Yang, et al., Fault diagnosis of wind turbines with generative adversarial network-based oversampling method, IOP Science, 12 January 2023, https://iopscience.iop.org/article/10.1088/1361-6501/acad20/meta (link resides outside ibm.com).