What is data augmentation?

Published: 07 May 2024
Contributors: Jacob Murel Ph.D., Eda Kavlakoglu

Data augmentation uses pre-existing data to create new data samples that can improve model optimization and generalizability.

In its most general sense, data augmentation denotes methods for supplementing so-called incomplete datasets by providing missing data points in order to increase the dataset’s analyzability.1 In machine learning, this takes the form of generating modified copies of pre-existing data to increase the size and diversity of a dataset. With respect to machine learning, then, augmented data can be understood as artificially supplying potentially absent real-world data.

Data augmentation improves machine learning model optimization and generalization. In other words, data augmentation can reduce overfitting and improve model robustness.2 It is an axiom of machine learning that large, diverse datasets yield better model performance. Nevertheless, for a number of reasons—from ethics and privacy concerns to simply the time-consuming effort of manually compiling necessary data—acquiring sufficient data can be difficult. Data augmentation provides one effective means of increasing dataset size and variability. In fact, researchers widely use data augmentation to correct imbalanced datasets.3

Many deep learning frameworks, such as PyTorch, Keras, and TensorFlow, provide functions for augmenting data, principally image datasets. The Python package Albumentations (available on GitHub) is also adopted in many open source projects and supports augmenting image data, along with associated targets such as masks, bounding boxes, and keypoints.
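For example, a minimal Albumentations pipeline might look like the following sketch; the specific transforms and probabilities are illustrative choices, not recommendations:

```python
import albumentations as A
import cv2

# Define a pipeline; each transform is applied with the given probability.
transform = A.Compose([
    A.HorizontalFlip(p=0.5),            # geometric: mirror the image
    A.Rotate(limit=15, p=0.5),          # geometric: small random rotation
    A.RandomBrightnessContrast(p=0.3),  # photometric: perturb brightness/contrast
])

image = cv2.imread("example.jpg")        # any HxWxC uint8 array works
augmented = transform(image=image)["image"]
```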

Augmented data versus synthetic data

Note that data augmentation is distinct from synthetic data. Admittedly, both are generative approaches that add new samples to a dataset in order to improve the performance of machine learning models. Synthetic data, however, refers to the automatic generation of entirely artificial data. An example is using computer-generated images—as opposed to real-world data—to train an object detection model. By contrast, data augmentation copies existing data and transforms those copies to increase the diversity and amount of data in a given set.


Data augmentation techniques

There are a variety of data augmentation methods. The specific techniques used for augmenting data depend on the nature of the data with which a user is working. Note that data augmentation is typically implemented during preprocessing on the training dataset. Some studies investigate the effect of augmentation on the validation or test set, but augmentation applications outside of training sets are rarer.4
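In practice, this usually means defining one transform pipeline for training data and a separate, augmentation-free pipeline for validation and test data. A minimal torchvision sketch (the dataset paths and transform choices are assumptions for illustration):

```python
from torchvision import datasets, transforms

# Augmentation is applied only to the training pipeline.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
])

# Validation data is only converted to tensors, not augmented.
eval_transform = transforms.Compose([
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("data/train", transform=train_transform)
val_set = datasets.ImageFolder("data/val", transform=eval_transform)
```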

Image augmentation

Data augmentation has been widely implemented in research for a range of computer vision tasks, from image classification to object detection. As such, there is a wealth of research on how augmented images improve the performance of state-of-the-art convolutional neural networks (CNNs) in image processing.

Many tutorials and non-academic resources classify image data augmentation into two categories: geometric transformations and photometric (or color space) transformations. Both consist of relatively simple image file manipulations. The first category denotes techniques that alter the space and layout of the original image, such as resizing, zooming, or changes in orientation (for example, horizontal flip). Photometric transformations alter an image’s RGB (red-green-blue) channels. Examples of photometric transformation include saturation adjustment and grayscaling an image.5
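The distinction shows up directly in the transform classes that torchvision exposes; the following sketch groups a few of them by category (parameter values are arbitrary):

```python
from torchvision import transforms

# Geometric transformations: alter the spatial layout of the image.
geometric = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(224),   # random zoom and crop
])

# Photometric (color space) transformations: alter the RGB values.
photometric = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, saturation=0.3),
    transforms.RandomGrayscale(p=0.1),
])
```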

Some sources categorize noise injection with geometric transformations,6 while others classify it with photometric transformations.7 Noise injection perturbs an image with random values, most commonly by adding pixel-level noise drawn from a Gaussian distribution, or by scattering random black, white, or color pixels across the image.
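As a sketch, Gaussian noise injection can be implemented by adding values sampled from a normal distribution to every pixel (the standard deviation here is an arbitrary choice):

```python
import numpy as np

def add_gaussian_noise(image: np.ndarray, std: float = 10.0) -> np.ndarray:
    """Add zero-mean Gaussian noise to a uint8 image."""
    noise = np.random.normal(loc=0.0, scale=std, size=image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)
```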

As noise injection illustrates, the binary classification of image augmentation techniques into geometric and photometric fails to cover the whole range of possible augmentation strategies. Techniques that fall outside this binary include kernel filtering (sharpening or blurring an image) and image mixing. An example of the latter is random cropping and patching, which randomly samples sections from several images and combines them into a new composite image. A related technique is random erasing, which deletes a random portion of an image.8 Such techniques are useful in image recognition tasks, as real-world use cases may require machines to identify partially obscured objects.
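A minimal NumPy sketch of random erasing, which overwrites a randomly placed rectangle with random pixel values (torchvision offers a comparable transforms.RandomErasing for tensor images):

```python
import numpy as np

def random_erase(image: np.ndarray, max_frac: float = 0.3) -> np.ndarray:
    """Replace a randomly placed rectangle of the image with random pixels."""
    h, w = image.shape[:2]
    eh = np.random.randint(1, int(h * max_frac) + 1)   # erased height
    ew = np.random.randint(1, int(w * max_frac) + 1)   # erased width
    top = np.random.randint(0, h - eh + 1)
    left = np.random.randint(0, w - ew + 1)
    out = image.copy()
    region = out[top:top + eh, left:left + ew]
    out[top:top + eh, left:left + ew] = np.random.randint(0, 256, size=region.shape)
    return out
```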

Instance-level augmentation is another technique. It essentially copies labeled regions (for example, bounding boxes) from one image and inserts them onto another image. Such an approach trains the model to identify objects against different backgrounds as well as objects obscured by other objects. Instance-level augmentation is a particularly salient approach for region-specific recognition tasks, such as object detection and image segmentation tasks.9
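A simplified sketch of the copy-paste idea follows: copy a labeled bounding-box region out of a source image and paste it at a random location in a target image. This is only the core operation; a real implementation, such as those in the cited papers, would also update the target's annotations and handle masks and overlaps. The region is assumed to fit inside the target image.

```python
import numpy as np

def paste_instance(target: np.ndarray, source: np.ndarray, box: tuple) -> np.ndarray:
    """Copy the region box = (top, left, height, width) from `source`
    and paste it at a random location in a copy of `target`."""
    top, left, h, w = box
    patch = source[top:top + h, left:left + w]
    th, tw = target.shape[:2]
    new_top = np.random.randint(0, th - h + 1)
    new_left = np.random.randint(0, tw - w + 1)
    out = target.copy()
    out[new_top:new_top + h, new_left:new_left + w] = patch
    return out
```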

Text augmentation


Like image augmentation, text data augmentation consists of many techniques and methods that are used across a range of natural language processing (NLP) tasks. A few resources divide text augmentation into rule-based (or “easy”) and neural methods. Of course, as with the binary division of image augmentation techniques, this categorization is not all-encompassing.

Rule-based approaches include relatively simple find-and-replace techniques, such as random deletion or insertion. Rule-based approaches also encompass synonym replacement. In this strategy, one or more words in a string are replaced with their respective synonyms as recorded in a predefined thesaurus, such as WordNet or the Paraphrase Database. Sentence inversion and passivation, in which the object and subject are swapped, are also examples of rule-based approaches.10
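A minimal sketch of synonym replacement using NLTK's WordNet interface (this assumes the WordNet corpus has already been downloaded with nltk.download("wordnet")):

```python
import random
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def synonym_replace(sentence: str, n: int = 1) -> str:
    """Replace up to n words in the sentence with a WordNet synonym."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        synonyms = {
            lemma.name().replace("_", " ")
            for syn in wordnet.synsets(words[i])
            for lemma in syn.lemmas()
            if lemma.name().lower() != words[i].lower()
        }
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
    return " ".join(words)
```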

Under this classification, neural methods use neural networks to generate new text samples from the input data. One notable neural method is back-translation. This uses machine translation to translate input data into a target language and then back into the original input language. In this way, back-translation leverages the linguistic variation that results from automated translation to generate semantic variants within a single-language dataset for the purpose of augmentation. Research suggests this is effective for improving machine translation model performance.11
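One way to sketch back-translation is with two pretrained translation models from the Hugging Face transformers library; the MarianMT checkpoints named below are an assumed choice, and any English-to-pivot-language pair would work:

```python
from transformers import pipeline

# English -> French, then French -> English.
en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text: str) -> str:
    """Return a paraphrase of text produced by round-trip translation."""
    pivot = en_to_fr(text)[0]["translation_text"]
    return fr_to_en(pivot)[0]["translation_text"]

print(back_translate("Data augmentation increases dataset diversity."))
```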

Mix-up text augmentation is another strategy. This approach deploys rule-based deletion and insertion methods using neural network embeddings. Specifically, pre-trained transformers (for example, BERT) generate word-level or sentence-level embeddings of text, transforming text into vector points, as in a bag of words model. The transformation of text into vector points generally aims to capture linguistic similarity; that is, words or sentences nearer one another in vector space are believed to share similar meanings or usage. Mix-up augmentation interpolates text strings within a specified distance of one another to produce new data that is an aggregate of the input data.12
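The core interpolation step can be sketched in a few lines: given embedding vectors (and one-hot labels) for two samples, mix-up draws a mixing coefficient from a Beta distribution and forms a convex combination of both. The alpha value shown is an arbitrary assumption:

```python
import numpy as np

def mixup(emb_a: np.ndarray, emb_b: np.ndarray,
          label_a: np.ndarray, label_b: np.ndarray, alpha: float = 0.2):
    """Interpolate two embeddings and their one-hot labels."""
    lam = np.random.beta(alpha, alpha)           # mixing coefficient in [0, 1]
    mixed_emb = lam * emb_a + (1 - lam) * emb_b
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_emb, mixed_label
```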

Recent research

Many users struggle with identifying which data augmentation strategies to implement. Do data augmentation techniques vary in efficacy between datasets and tasks? Comparative research on data augmentation techniques suggests that multiple forms of augmentation have a greater positive impact than one, but determining the optimal combination of techniques is dataset- and task-dependent.13 How, then, does one go about selecting the optimal techniques?

Automated augmentation

To address this issue, research has turned to automated data augmentation. One automated augmentation approach uses reinforcement learning to identify augmentation techniques that return the highest validation accuracy on a given dataset.14 This approach has been shown to produce strategies that improve performance on both in-sample and out-of-sample data.15 Another promising approach to automated augmentation identifies and augments false positives from classifier outputs. In this way, automatic augmentation identifies the best strategies to correct for frequently misclassified items.16
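Learned policies of this kind are available off the shelf. For instance, recent versions of torchvision ship the policies reported in the AutoAugment paper, which can be dropped into a training pipeline; this is a sketch, and the CIFAR-10 policy is an illustrative choice:

```python
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

train_transform = transforms.Compose([
    AutoAugment(policy=AutoAugmentPolicy.CIFAR10),  # learned augmentation policy
    transforms.ToTensor(),
])
```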

Generative networks

More recently, research has turned to generative networks and models to identify task-dependent17 and class-dependent18 optimal augmentation strategies. This includes work with generative adversarial networks (GANs). GANs are deep learning networks typically used to generate synthetic data, and recent research investigates their use for data augmentation. A few experiments, for instance, suggest that synthetic data augmentations of medical image sets improve classification19 and segmentation20 model performance more than classic augmentations. Relatedly, research in text augmentation leverages large language models (LLMs) and chatbots to generate augmented data. These experiments use LLMs to generate augmented samples of input data with mix-up and synonymizing techniques, showing a greater positive impact for text classification models than classic augmentation.21

Researchers and developers widely adopt data augmentation techniques when training models for various machine learning tasks. By contrast, synthetic data is a comparatively newer area of research. Comparative experiments on synthetic versus real data show mixed results, with models trained entirely on synthetic data sometimes outperforming, sometimes underperforming models trained on real-world data. Perhaps unsurprisingly, this research suggests synthetic data is most useful when it reflects characteristics of real-world data.22

Footnotes

1 Martin Tanner and Wing Hung Wong, “The Calculation of Posterior Distributions by Data Augmentation,” Journal of the American Statistical Association, Vol. 82, No. 398 (1987), pp. 528-540.

2 Sylvestre-Alvise Rebuffi, Sven Gowal, Dan Andrei Calian, Florian Stimberg, Olivia Wiles, and Timothy A Mann, “Data Augmentation Can Improve Robustness,” Advances in Neural Information Processing Systems, Vol. 34, 2021, https://proceedings.neurips.cc/paper_files/paper/2021/hash/fb4c48608ce8825b558ccf07169a3421-Abstract.html (link resides outside ibm.com).

3 Manisha Saini and Seba Susan, “Tackling class imbalance in computer vision: A contemporary review,” Artificial Intelligence Review, Vol. 54, 2023, https://link.springer.com/article/10.1007/s10462-023-10557-6 (link resides outside ibm.com).

4 Fabio Perez, Cristina Vasconcelos, Sandra Avila, and Eduardo Valle, “Data Augmentation for Skin Lesion Analysis,” OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, 2018, https://link.springer.com/chapter/10.1007/978-3-030-01201-4_33 (link resides outside ibm.com).

5 Connor Shorten and Taghi M. Khoshgoftaar, “A survey on Image Data Augmentation for Deep Learning,” Journal of Big Data, 2019, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0 (link resides outside ibm.com).

6 Duc Haba, Data Augmentation with Python, Packt Publishing, 2023.

7 Mingle Xu, Sook Yoon, Alvaro Fuentes, and Dong Sun Park, “A Comprehensive Survey of Image Augmentation Techniques for Deep Learning,” Pattern Recognition, Vol. 137, https://www.sciencedirect.com/science/article/pii/S0031320323000481 (link resides outside ibm.com).

8 Connor Shorten and Taghi M. Khoshgoftaar, “A survey on Image Data Augmentation for Deep Learning,” Journal of Big Data, 2019, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0 (link resides outside ibm.com). Terrance DeVries and Graham W. Taylor, “Improved Regularization of Convolutional Neural Networks with Cutout,” 2017, https://arxiv.org/abs/1708.04552 (link resides outside ibm.com).

9 Zhiqiang Shen, Mingyang Huang, Jianping Shi, Xiangyang Xue, and Thomas S. Huang, “Towards Instance-Level Image-To-Image Translation,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3683-3692, https://openaccess.thecvf.com/content_CVPR_2019/html/Shen_Towards_Instance-Level_Image-To-Image_Translation_CVPR_2019_paper.html (link resides outside ibm.com). Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, and Barret Zoph, “Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2918-2928, https://openaccess.thecvf.com/content/CVPR2021/html/Ghiasi_Simple_Copy-Paste_Is_a_Strong_Data_Augmentation_Method_for_Instance_CVPR_2021_paper.html (link resides outside ibm.com).

10 Connor Shorten, Taghi M. Khoshgoftaar and Borko Furht, “Text Data Augmentation for Deep Learning,” Journal of Big Data, 2021, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00492-0 (link resides outside ibm.com). Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen, “Syntactic Data Augmentation Increases Robustness to Inference Heuristics,” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2339-2352, https://aclanthology.org/2020.acl-main.212/ (link resides outside ibm.com).

11 Connor Shorten, Taghi M. Khoshgoftaar, and Borko Furht, “Text Data Augmentation for Deep Learning,” Journal of Big Data, 2021, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00492-0 (link resides outside ibm.com). Rico Sennrich, Barry Haddow, and Alexandra Birch, “Improving Neural Machine Translation Models with Monolingual Data,” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, pp. 86-96, https://aclanthology.org/P16-1009/ (link resides outside ibm.com).

12 Connor Shorten, Taghi M. Khoshgoftaar, and Borko Furht, “Text Data Augmentation for Deep Learning,” Journal of Big Data, 2021, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00492-0 (link resides outside ibm.com). Lichao Sun, Congying Xia, Wenpeng Yin, Tingting Liang, Philip Yu, and Lifang He, “Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks,” Proceedings of the 28th International Conference on Computational Linguistics, 2020, https://aclanthology.org/2020.coling-main.305/ (link resides outside ibm.com). Hongyu Guo, Yongyi Mao, and Richong Zhang, “Augmenting Data with Mixup for Sentence Classification: An Empirical Study,” 2019. https://arxiv.org/abs/1905.08941 (link resides outside ibm.com).

13 Suorong Yang, Weikang Xiao, Mengchen Zhang, Suhan Guo, Jian Zhao, and Furao Shen, “Image Data Augmentation for Deep Learning: A Survey,” 2023, https://arxiv.org/pdf/2204.08610.pdf (link resides outside ibm.com). Alhassan Mumuni and Fuseini Mumuni, “Data augmentation: A comprehensive survey of modern approaches,” Array, Vol. 16, 2022, https://www.sciencedirect.com/science/article/pii/S2590005622000911 (link resides outside ibm.com). Evgin Goceri, “Medical image data augmentation: techniques, comparisons and interpretations,” Artificial Intelligence Review, Vol. 56, 2023, pp. 12561-12605, https://link.springer.com/article/10.1007/s10462-023-10453-z (link resides outside ibm.com).

14 Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le, “AutoAugment: Learning Augmentation Strategies From Data,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 113-123, https://openaccess.thecvf.com/content_CVPR_2019/papers/Cubuk_AutoAugment_Learning_Augmentation_Strategies_From_Data_CVPR_2019_paper.pdf (link resides outside ibm.com).  

15 Barret Zoph, Ekin D. Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, and Quoc V. Le, “Learning Data Augmentation Strategies for Object Detection,” Proceedings of the 16th European Conference on Computer Vision, 2020, https://link.springer.com/chapter/10.1007/978-3-030-58583-9_34 (link resides outside ibm.com).

16 Sandareka Wickramanayake, Wynne Hsu, and Mong Li Lee, “Explanation-based Data Augmentation for Image Classification,” Advances in Neural Information Processing Systems, Vol. 34, 2021, https://proceedings.neurips.cc/paper_files/paper/2021/hash/af3b6a54e9e9338abc54258e3406e485-Abstract.html (link resides outside ibm.com).

17 Krishna Chaitanya, Neerav Karani, Christian F. Baumgartner, Anton Becker, Olivio Donati, and Ender Konukoglu, “Semi-supervised and Task-Driven Data Augmentation,” Proceedings of the 26th International Conference on Information Processing in Medical Imaging, 2019, https://link.springer.com/chapter/10.1007/978-3-030-20351-1_3 (link resides outside ibm.com).

18 Cédric Rommel, Thomas Moreau, Joseph Paillard, and Alexandre Gramfort, “ADDA: Class-wise Automatic Differentiable Data Augmentation for EEG Signals,” International Conference on Learning Representations, 2022, https://iclr.cc/virtual/2022/poster/7154 (link resides outside ibm.com).

19 Maayan Frid-Adar, Idit Diamant, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan, “GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification,” Neurocomputing, 2018, pp. 321-331, https://www.sciencedirect.com/science/article/abs/pii/S0925231218310749 (link resides outside ibm.com).

20 Veit Sandfort, Ke Yan, Perry Pickhardt, and Ronald Summers, “Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks,” Scientific Reports, 2019, https://www.nature.com/articles/s41598-019-52737-x (link resides outside ibm.com).

21 Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woomyoung Park, “GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation,” Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2225-2239, https://aclanthology.org/2021.findings-emnlp.192/ (link resides outside ibm.com). Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen Xu, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Lichao Sun, Quanzheng Li, Dinggang Shen, Tianming Liu, and Xiang Li, “AugGPT: Leveraging ChatGPT for Text Data Augmentation,” 2023, https://arxiv.org/abs/2302.13007 (link resides outside ibm.com).

22 Bram Vanherle, Steven Moonen, Frank Van Reeth, and Nick Michiels, “Analysis of Training Object Detection Models with Synthetic Data,” 33rd British Machine Vision Conference, 2022, https://bmvc2022.mpi-inf.mpg.de/0833.pdf (link resides outside ibm.com). Martin Georg Ljungqvist, Otto Nordander, Markus Skans, Arvid Mildner, Tony Liu, and Pierre Nugues, “Object Detector Differences When Using Synthetic and Real Training Data,” SN Computer Science, Vol. 4, 2023, https://link.springer.com/article/10.1007/s42979-023-01704-5 (link resides outside ibm.com). Lei Kang, Marcal Rusinol, Alicia Fornes, Pau Riba, and Mauricio Villegas, “Unsupervised Writer Adaptation for Synthetic-to-Real Handwritten Word Recognition,” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 3502-3511, https://openaccess.thecvf.com/content_WACV_2020/html/Kang_Unsupervised_Writer_Adaptation_for_Synthetic-to-Real_Handwritten_Word_Recognition_WACV_2020_paper.html (link resides outside ibm.com).