Introductory Chapter: Current State and Achievements of Data Augmentation

*Robertas Damaševičius*

#### **1. Introduction**

Artificial intelligence (AI) models play a growing role in biomedical imaging and health services. However, developing AI systems as clinical decision support systems for real-life settings presents several challenges [1]. One of these challenges is the scarcity of data, particularly in domains such as healthcare, where data is inherently limited and imbalanced. Datasets can also be inaccessible because of privacy concerns or a lack of data-sharing incentives [2]. Data augmentation, particularly through generative models, has emerged as a significant approach to address these challenges. It allows synthetic data to be generated, thereby expanding the dataset available for training AI models. This not only enhances the performance of these models but also enables their application in data-scarce scenarios. Data augmentation thus increases the diversity of the training data without the need to acquire additional data. Padding, cropping, and horizontal flipping are standard data augmentation approaches used to train large neural networks [3].

In the domain of AI and particularly in image processing, data augmentation plays a crucial role. It not only helps in preventing overfitting but also provides a means to enhance the performance of deep learning models [4]. In image processing, data augmentation can generate visually diverse images that can improve the robustness of models to new, unseen data [5].

This chapter aims to present an overview of the current state and achievements of data augmentation, discussing its impact, challenges, and limitations, and exploring future emerging trends in the field.

#### **2. Current state of data augmentation**

#### **2.1 Formal definition**

The objective of data augmentation is to create a diverse set of transformed data points that can help enhance the performance of machine learning models. Data augmentation can be formally defined as a process that generates a set of transformed data points from an original dataset. Let us denote the original dataset as $D = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ is a data point. The data augmentation process can be represented as a function $f : X \to X'$, where $X$ is the space of original data points and $X'$ is the space of augmented data points. For each data point $x_i \in D$, the data augmentation function $f$ generates a set of transformed data points $D_i' = \{x'_{i1}, x'_{i2}, \ldots, x'_{im}\}$, where $x'_{ij} = f(x_i)$ and $m$ is the number of augmented data points generated from $x_i$. The augmented dataset $D'$ is the union of all $D_i'$, i.e., $D' = \bigcup_{i=1}^{n} D_i'$. This process can be represented as follows:

$$D' = \bigcup_{i=1}^{n} \{ f(\mathbf{x}_i) \mid \mathbf{x}_i \in D \} \tag{1}$$
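As a minimal sketch of Eq. (1), the augmented dataset can be built by applying a transformation several times to each data point. The sketch assumes that the transformation is stochastic, so that the *m* outputs for one data point differ. The names `augment_dataset` and `jitter` are hypothetical and chosen only for illustration.

```python
import random

def augment_dataset(dataset, f, m, seed=0):
    """Collect m transformed copies f(x_i) of every data point x_i in D."""
    rng = random.Random(seed)
    augmented = []
    for x in dataset:            # each x_i in D
        for _ in range(m):       # m augmented points per x_i
            augmented.append(f(x, rng))
    return augmented

def jitter(x, rng, scale=0.1):
    """Hypothetical transformation f: add small uniform noise to a scalar."""
    return x + rng.uniform(-scale, scale)

D = [1.0, 2.0, 3.0]
D_prime = augment_dataset(D, jitter, m=4)
print(len(D_prime))  # n * m augmented points
```

With *n* = 3 data points and *m* = 4 draws per point, the augmented dataset holds 12 points, each within the chosen noise scale of its source point.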

#### **2.2 Standard techniques in image processing**

Rotation is a common data augmentation technique in image processing. It involves rotating the image by a certain angle. This can help to make the model invariant to the orientation of the object in the image. The rotation operation can be described as:

$$R(\mathbf{x}, \theta) = \begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{bmatrix} \mathbf{x} \tag{2}$$

where *x* is the coordinate vector of a pixel in the initial image and *θ* is the rotation angle.
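The rotation in Eq. (2) can be illustrated on a single coordinate pair. This is only a sketch: it reads Eq. (2) as acting on pixel coordinates, which is an assumption, and the helper name `rotate_point` is hypothetical. Rotating a full image additionally requires resampling pixel values at the rotated coordinates, which is omitted here.

```python
import math

def rotate_point(x, y, theta):
    """Apply the 2x2 rotation matrix of Eq. (2) to one coordinate pair."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

# Rotate the point (1, 0) by 90 degrees counter-clockwise.
rx, ry = rotate_point(1.0, 0.0, math.pi / 2)
print(round(rx, 6), round(ry, 6))
```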

Scaling involves resizing the image, either by making it larger (zooming in) or smaller (zooming out). This can help to make the model invariant to the size of the object in the image. The scaling operation can be described as:

$$S(\mathbf{x}, s) = s \cdot \mathbf{x} \tag{3}$$

where *x* is the coordinate vector of a pixel in the initial image and *s* is the scaling factor.

Cropping involves cutting out a portion of the image. This can help to make the model focus on the important parts of the image. The cropping operation can be described as:

$$C(\mathbf{x}, r) = \mathbf{x}[r] \tag{4}$$

where *x* is the initial image and *r* is the region to be cropped.

Flipping involves reversing the image either horizontally or vertically. This can help to make the model invariant to the orientation of the object in the image. The flipping operation can be described as:

$$F(\mathbf{x}) = \mathbf{x}^{\dagger} \tag{5}$$

where *x* is the initial image and *x*† is the flipped image.

Adjusting the brightness and contrast of the image can help to make the model invariant to different lighting conditions. The brightness and contrast adjustment operation can be represented as:

$$B(\mathbf{x}, b) = \mathbf{x} + b \tag{6}$$

$$C(\mathbf{x}, c) = c \cdot \mathbf{x} \tag{7}$$

where *x* is the initial image, *b* is the brightness adjustment, and *c* is the contrast adjustment.
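The cropping, flipping, brightness, and contrast operations of Eqs. (4)–(7) can be sketched with NumPy array operations. The sketch assumes a grayscale image stored as a 2-D array with intensities in [0, 1]. Clipping to that range is an added assumption that does not appear in the equations, and all function names are hypothetical.

```python
import numpy as np

def crop(x, top, left, height, width):
    """Eq. (4): keep only the region r = (top, left, height, width)."""
    return x[top:top + height, left:left + width]

def flip(x, horizontal=True):
    """Eq. (5): mirror the image left-right (or top-bottom)."""
    return x[:, ::-1] if horizontal else x[::-1, :]

def adjust_brightness(x, b):
    """Eq. (6): add offset b, then clip to the assumed range [0, 1]."""
    return np.clip(x + b, 0.0, 1.0)

def adjust_contrast(x, c):
    """Eq. (7): multiply by factor c, then clip to the assumed range [0, 1]."""
    return np.clip(c * x, 0.0, 1.0)

image = np.linspace(0.0, 1.0, 16).reshape(4, 4)  # toy 4x4 grayscale image
patch = crop(image, 1, 1, 2, 2)
mirrored = flip(image)
brighter = adjust_brightness(image, 0.2)
flatter = adjust_contrast(image, 0.5)
```

In a training pipeline, the parameters (region, offset, factor) would typically be drawn at random for every sample so that each epoch sees different variants of the same image.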

#### **2.3 Advanced techniques in image processing**

Elastic deformations are a type of data augmentation technique that involves applying random, smooth transformations to an image. This can help to make the model invariant to small local deformations in the object's shape. The elastic deformation operation can be described as:

$$E(\mathbf{x}, \alpha, \sigma) = \mathbf{x} + \alpha \cdot \nabla \cdot G_{\sigma} \ast r \tag{8}$$

where *x* is the initial image, *r* is a random field for each pixel, *G*<sub>*σ*</sub> is a Gaussian filter with standard deviation *σ*, and *α* is a scaling factor.
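One common way to realize an elastic deformation is to smooth a random field with a Gaussian filter, scale it by *α*, and use the result as a per-pixel displacement when resampling the image. The sketch below follows that displacement-field reading, which is an assumption about how Eq. (8) is applied in practice. It uses nearest-neighbour lookup instead of interpolation for brevity, assumes the image is larger than the truncated Gaussian kernel, and all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel G_sigma, truncated at 3 sigma and normalized."""
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth(field, sigma):
    """Separable Gaussian smoothing: filter along rows, then along columns."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, field)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, rows)

def elastic_deform(x, alpha, sigma, seed=0):
    """Resample x at coordinates displaced by a smoothed random field."""
    rng = np.random.default_rng(seed)
    h, w = x.shape
    # Random field r in [-1, 1] per pixel, smoothed by G_sigma, scaled by alpha.
    dy = alpha * smooth(rng.uniform(-1, 1, (h, w)), sigma)
    dx = alpha * smooth(rng.uniform(-1, 1, (h, w)), sigma)
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Nearest-neighbour lookup at displaced coordinates, clamped to the image.
    src_r = np.clip(np.rint(rows + dy), 0, h - 1).astype(int)
    src_c = np.clip(np.rint(cols + dx), 0, w - 1).astype(int)
    return x[src_r, src_c]

image = np.arange(32 * 32, dtype=float).reshape(32, 32)
warped = elastic_deform(image, alpha=4.0, sigma=2.0)
```

Setting `alpha` to zero leaves the image unchanged, while larger values of `sigma` produce smoother, more global deformations.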

Random erasing is a data augmentation technique that involves randomly selecting a rectangle in the image and replacing its pixels with random values. This can help to increase the robustness of the model to occlusion. The random erasing operation can be described as:

$$RE(\mathbf{x}, r) = \mathbf{x} \cdot (\mathbf{1} - m) + m \cdot v \tag{9}$$

where *x* is the initial image, *r* is the region to be erased, *m* is a mask that is 1 in the region *r* and 0 elsewhere, and *v* is a random value.
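Eq. (9) maps directly onto a masked array operation. In the following sketch the rectangle is passed in explicitly so that the result is easy to inspect, whereas in practice its position and size would be drawn at random. The function name `random_erase` and the intensity range [0, 1] are assumptions.

```python
import numpy as np

def random_erase(x, top, left, height, width, seed=0):
    """Eq. (9): x * (1 - m) + m * v, with m = 1 inside the rectangle r."""
    rng = np.random.default_rng(seed)
    m = np.zeros_like(x)
    m[top:top + height, left:left + width] = 1.0
    v = rng.uniform(0.0, 1.0, size=x.shape)  # random fill values
    return x * (1.0 - m) + m * v, m

image = np.full((6, 6), 0.5)
erased, mask = random_erase(image, top=1, left=2, height=3, width=2)
```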

Mixup and CutMix are data augmentation techniques that involve creating new training examples by taking a convex combination of two training examples. For Mixup, this is done pixel-wise, while for CutMix, a region from one image is cut and pasted onto another image. The Mixup and CutMix operations can be described as:

$$\text{Mixup}(\mathbf{x}_1, \mathbf{x}_2, \lambda) = \lambda \cdot \mathbf{x}_1 + (1 - \lambda) \cdot \mathbf{x}_2 \tag{10}$$

$$\text{CutMix}(\mathbf{x}_1, \mathbf{x}_2, r) = \mathbf{x}_1 \cdot (\mathbf{1} - m) + m \cdot \mathbf{x}_2 \tag{11}$$

where *x*<sub>1</sub> and *x*<sub>2</sub> are original images, *λ* is a random value between 0 and 1, *r* is the region to be cut and pasted, and *m* is a mask that is 1 in region *r* and 0 elsewhere.
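Eqs. (10) and (11) can be sketched as follows. The equations cover only the images. Mixing the one-hot labels with the same weights (by *λ* for Mixup and by the pasted area share for CutMix) is how these methods are commonly applied and is included here as an assumption. Function names are hypothetical.

```python
import numpy as np

def mixup(x1, x2, y1, y2, lam):
    """Eq. (10) on images; labels are mixed with the same weight lam."""
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

def cutmix(x1, x2, y1, y2, top, left, height, width):
    """Eq. (11): paste region r of x2 into x1; mix labels by area share."""
    m = np.zeros(x1.shape[:2])
    m[top:top + height, left:left + width] = 1.0
    x = x1 * (1.0 - m) + m * x2
    share = m.mean()  # fraction of pixels taken from x2
    return x, (1.0 - share) * y1 + share * y2

x1, x2 = np.zeros((4, 4)), np.ones((4, 4))
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, x2, y1, y2, lam=0.75)
x_cut, y_cut = cutmix(x1, x2, y1, y2, top=0, left=0, height=2, width=2)
```

In this toy case both methods yield the label `[0.75, 0.25]`: Mixup because *λ* = 0.75, and CutMix because the pasted 2 × 2 patch covers a quarter of the 4 × 4 image.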

#### **2.4 Generative models for data augmentation**

#### *2.4.1 Generative adversarial networks (GANs)*

Generative adversarial networks (GANs) are a class of generative models proposed in Ref. [6]. A GAN consists of a generator and a discriminator, two neural networks that are trained simultaneously. The generator creates new data instances, whereas the discriminator determines whether each data sample comes from the real training dataset. The generator is trained to create data that the discriminator cannot distinguish from real data, while the discriminator is trained to get better at separating real data from generated data. This is formally represented by the following minimax game between generator *G* and discriminator *D*:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})} [\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})} [\log(1 - D(G(\mathbf{z})))] \tag{12}$$

where $x$ is a real data instance, $z$ is a noise vector sampled from a prior noise distribution $p_z(z)$, $G(z)$ is the data instance generated by the generator, $D(x)$ is the probability, according to the discriminator, that $x$ is a real data sample, and $D(G(z))$ is the probability, according to the discriminator, that a generated instance is real.
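To make the value function concrete, the two expectations can be estimated by averaging over batches of discriminator outputs. The sketch below does not train any network. It only evaluates the value function for two hypothetical discriminators, represented by arrays of assumed output probabilities, and the function name `gan_value` is hypothetical.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1)."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
d_real = np.full(1000, 0.5)  # D(x) on real samples
d_fake = np.full(1000, 0.5)  # D(G(z)) on generated samples
v_confused = gan_value(d_real, d_fake)

# A confident and correct discriminator pushes V towards its upper bound of 0.
v_sharp = gan_value(np.full(1000, 0.99), np.full(1000, 0.01))
```

The first case evaluates to 2 log 0.5 = −log 4, the value reached when the discriminator can no longer separate real from generated data. The second case is close to zero, which is what the discriminator's maximization step aims for.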

In the context of data augmentation, GANs can be employed to synthesize additional training data that is similar to the original training data. This can be particularly useful when the original dataset is small or imbalanced [7, 8].

#### *2.4.2 Variational autoencoders (VAEs)*

Variational autoencoders (VAEs) are generative models that have been used for data augmentation. They consist of an encoder and a decoder: the encoder maps input data to a latent space, and the decoder maps from the latent space back to the original data space. The key difference between VAEs and traditional autoencoders is that the latent space of VAEs is continuous, which is achieved by having the encoder output two vectors, one of means and one of standard deviations, instead of a single encoding vector [9].

The formal mathematical definition of VAEs involves several components. Let $X$ be the training data, where each $x_i$ represents a data point. The encoder learns a mapping $Q_\theta(z \mid X)$ from an input $x_i$ to the mean vector $\mu(x_i)$ and variance vector $\sigma^2(x_i)$ of the latent variables, where the latent variable $z$ has the standard normal prior $\mathcal{N}(0, 1)$. The decoder learns a mapping $P_\phi(X \mid z)$ from the latent representation $z$ to the distribution parameters of $X$. The objective function for training a VAE is given by:

$$\mathcal{L}(\theta, \phi; X) = \mathbb{E}_{z \sim Q_{\theta}(z|X)}\left[\log P_{\phi}(X|z)\right] - D_{\text{KL}}(Q_{\theta}(z|X) \,\|\, P(z)) \tag{13}$$

where *θ* and *ϕ* are the encoder and decoder parameters, respectively, and *D*<sub>KL</sub> is the Kullback-Leibler divergence between two probability distributions [10].
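Two ingredients of the objective can be sketched without a neural network. The first is sampling the latent variable from the encoder's mean and standard deviation via the reparameterization commonly used for VAEs. The second is the closed-form Kullback-Leibler term for a diagonal Gaussian against a standard normal prior. Both are assumptions about the usual Gaussian setting rather than details stated above, and the function names are hypothetical.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps drawn from N(0, I)."""
    return mu + sigma * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL divergence from N(mu, diag(sigma^2)) to N(0, I)."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma))

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])    # assumed encoder means for one input
sigma = np.array([1.0, 0.5])  # assumed encoder standard deviations
z = reparameterize(mu, sigma, rng)
kl = kl_to_standard_normal(mu, sigma)
```

The divergence is zero exactly when the encoder output matches the prior (zero mean, unit standard deviation) and grows as the encoded distribution moves away from it. For augmentation, new samples are obtained by decoding latent vectors `z` drawn in this way.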

Variational autoencoders (VAEs) have been used in various applications for data augmentation. For example, in the field of audio processing, VAEs have been used to augment data by learning to synthesize new audio data instances from the latent space [11]. In the field of medical imaging, VAEs have been used to generate synthetic medical images for training diagnostic models [12].

#### **2.5 Data augmentation methods for natural language processing**

Data augmentation (DA) is also applied in natural language processing (NLP), where it has improved performance on several tasks [13]. The primary goal of DA approaches in NLP is to increase the variety of training data, allowing the model to generalize to previously unseen test data. Based on the diversity of the augmented data, DA approaches in NLP may be divided into three types: paraphrasing, noising, and sampling.


*Introductory Chapter: Current State and Achievements of Data Augmentation DOI: http://dx.doi.org/10.5772/intechopen.112284*

• **Sampling** involves generating new sentences by sampling from a language model trained on the original data. This helps to enhance the diversity of the training data.

In addition to these methods, there are also task-specific DA methods for NLP tasks such as named entity recognition (NER) and sentence classification. For example, the unified medical language system-easy data augmentation (UMLS-EDA) method extends the easy data augmentation (EDA) approach for biomedical NER by incorporating Unified Medical Language System (UMLS) knowledge, which can boost model performance for both NER and sentence classification [14].
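Two of the token-level operations used by EDA, random swap and random deletion, give a feel for noising-based augmentation of text. The sketch below is a simplified illustration on whitespace tokens and not the UMLS-EDA implementation. The example sentence and function names are hypothetical.

```python
import random

def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    tokens = list(tokens)
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

rng = random.Random(0)
sentence = "the patient reported mild chest pain".split()
swapped = random_swap(sentence, 1, rng)
shortened = random_deletion(sentence, 0.2, rng)
```

For sequence labeling tasks such as NER, the token labels would have to be swapped or deleted together with their tokens so that annotations stay aligned.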

#### **2.6 Data augmentation methods for audio data**

Data augmentation for audio data is a crucial technique to enhance the results of machine learning models, especially when the available dataset is limited. Several methods have been proposed for this purpose:


These methods can be used individually or in combination to augment audio data. However, the effectiveness of these methods can vary depending on the specific characteristics of the audio data and the task at hand [15].
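As an illustration of waveform-level augmentation, additive noise at a chosen signal-to-noise ratio and time shifting can be sketched as follows. These two operations are widely used examples selected for this sketch and are an assumption, not a summary of the methods reviewed in Ref. [15]. The sampling rate, tone, and function names are also assumptions.

```python
import numpy as np

def add_noise(wave, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return wave + rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)

def time_shift(wave, shift):
    """Circularly shift the waveform by `shift` samples."""
    return np.roll(wave, shift)

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0          # one second at an assumed 16 kHz
wave = np.sin(2.0 * np.pi * 440.0 * t)  # toy 440 Hz tone
noisy = add_noise(wave, snr_db=20.0, rng=rng)
shifted = time_shift(wave, 1600)        # shift by 0.1 s
```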

#### **2.7 Data augmentation methods for tabular data**

Data augmentation for tabular data is a demanding task due to the structured nature of the data and the potential relationships between different columns. Several methods have been proposed to tackle this issue:

• **Resampling** is most commonly used for imbalanced tabular data and creates new samples by resampling the existing data. The two main types of resampling are oversampling, where new instances are synthesized from the minority class, and undersampling, where instances from the majority class are deleted.


These methods can be used individually or in combination to augment tabular data. The effectiveness of these methods can vary depending on the specific features of the data and the task at hand.
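Oversampling by synthesis can be sketched as interpolation between a minority-class row and one of its nearest minority-class neighbours, in the spirit of SMOTE. The sketch assumes purely numeric features and a Euclidean distance. It is a simplified illustration rather than a reference implementation, and the function name `smote_like` is hypothetical.

```python
import numpy as np

def smote_like(minority, n_new, k, rng):
    """Create n_new rows on segments between minority rows and neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        dist = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]  # skip the row itself
        j = rng.choice(neighbours)
        lam = rng.random()                      # position along the segment
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_rows = smote_like(minority, n_new=6, k=2, rng=rng)
```

Because every synthetic row lies on a segment between two existing rows, it stays inside the range spanned by the minority class. Categorical columns and inter-column constraints would need separate handling.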

#### **3. Discussion**

#### **3.1 Achievements and impact of data augmentation**

Data augmentation has been instrumental in improving the effectiveness of machine learning models, particularly in image recognition tasks. In deep learning applications for medical image analysis, for instance, it has improved diagnostic accuracy [17–19].

One of the key benefits of data augmentation is its ability to mitigate overfitting. Overfitting occurs when a model fits the training data so closely that it performs poorly on unseen data. By creating a more diverse training dataset, data augmentation can help to prevent overfitting, thereby improving the model's ability to generalize to new data.

Data augmentation is particularly useful in scenarios where data is scarce. Here, data augmentation can be used to artificially increase the size of the dataset. This has been demonstrated in various fields, including medical imaging and plant stress phenotyping, where data augmentation has enabled the development of robust machine learning models despite the limited availability of data [20].

#### **3.2 Challenges and limitations of data augmentation**

One challenge of data augmentation is preserving the original data distribution. When augmenting data, it is crucial to ensure that the transformed data points do not significantly alter the overall distribution of the data. If the data augmentation process introduces bias or changes the data distribution, it can lead to misleading results and poor model performance [21].

Data augmentation can be particularly challenging for complex data types. For instance, in the context of network data, standard data augmentation techniques may not be applicable due to the complex interdependencies between data points. New methods and techniques are needed to effectively augment such complex data types [22].

Data augmentation can be computationally expensive, especially for large datasets and complex augmentation operations. This can increase the time and computational resources required for training machine learning models. However, the benefits of data augmentation in terms of improved model performance often outweigh these additional costs [23].

#### **3.3 Future directions and emerging trends**

Automated data augmentation, where the augmentation process is guided by machine learning algorithms, is a promising future direction. This approach can potentially generate more effective and diverse augmented data by learning the optimal transformations for each data point. Recent advancements in GANs have shown potential in this area, for example in digital pathology [24].

As machine learning applications become more specialized, the need for domain-specific augmentation techniques is becoming increasingly apparent. For instance, in materials science, AI and machine learning are being used to discover new materials and understand their properties. In this context, domain-specific augmentation techniques can help generate more realistic and diverse data for training machine learning models [25].

The integration of data augmentation with active learning is another emerging trend. Active learning is a strategy where the model actively selects the most informative data points for training. By combining this with data augmentation, it may be possible to create a more efficient and effective learning process. This approach has shown promise in the field of cardiovascular imaging [26] and plant root segmentation [27].
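One way to combine the two ideas is to select the samples on which the current model is least certain and to augment only those. The sketch below uses predictive entropy as the measure of informativeness and a horizontal flip as the augmentation. Both choices, the toy probabilities, and the function name are assumptions made for illustration.

```python
import numpy as np

def select_most_uncertain(probs, k):
    """Return indices of the k samples with the highest predictive entropy."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[::-1][:k]

# Assumed class probabilities predicted by a model for four unlabeled images.
probs = np.array([[0.98, 0.02], [0.55, 0.45], [0.80, 0.20], [0.50, 0.50]])
images = np.random.default_rng(0).random((4, 8, 8))

picked = select_most_uncertain(probs, k=2)
# Augment only the selected samples, here with a horizontal flip.
augmented = np.stack([images[i][:, ::-1] for i in picked])
```

Here the samples with predictions closest to 50/50 are selected, so the labeling and augmentation effort is spent where the model is least certain.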

#### **4. Conclusion**

In this chapter, we have explored the current state, achievements, challenges, and future directions of data augmentation in the context of artificial intelligence and image processing. We have seen how data augmentation techniques have evolved over time and how they have contributed to significant improvements in model performance, particularly in image recognition tasks. We have also discussed the challenges associated with data augmentation, including preserving the original data distribution, augmenting complex data types, and managing computational costs. Looking forward, we have highlighted emerging trends such as automated data augmentation, domain-specific augmentation techniques, and the integration of data augmentation with active learning.

As we move forward, it is clear that data augmentation will continue to play a crucial role in the field of artificial intelligence and image processing. The development of more sophisticated and automated data augmentation techniques, as well as the integration of data augmentation with other machine learning strategies, opens up exciting new possibilities for future research. However, it is also important to be mindful of the challenges and limitations of data augmentation, and to continue to develop strategies to address these issues. As we continue to push the boundaries of what is possible with artificial intelligence and image processing, data augmentation will undoubtedly remain a key tool in our arsenal.

*Deep Learning – Recent Findings and Research*

#### **Author details**

Robertas Damaševičius

Department of Applied Informatics, Vytautas Magnus University, Kaunas, Lithuania

\*Address all correspondence to: robertas.damasevicius@ktu.lt

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] Castiglioni I, Rundo L, Codari M, Di Leo G, Salvatore C, Interlenghi M, et al. AI applications to medical images: From machine learning to deep learning. European Journal of Medical Physics. 2021;**83**:9-24

[2] Williams B, Borroni D, Liu R, Zhao Y, Zhang J, Lim JWC, et al. An artificial intelligence-based deep learning algorithm for the diagnosis of diabetic neuropathy using corneal confocal microscopy: A development and validation study. Diabetologia. 2019;**63**(2):419-430

[3] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. Curran Associates, Incorporated; 2012. pp. 1097-1105

[4] Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. Journal of Big Data. 2019;**6**(1):60

[5] Perez L, Wang J. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621. 2017

[6] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. In: Advances in Neural Information Processing Systems. Curran Associates, Incorporated; 2014. pp. 2672-2680

[7] Antoniou A, Storkey A, Edwards H. Augmenting image classifiers using data augmentation generative adversarial networks. In: Artificial Neural Networks and Machine Learning–ICANN, 2018. Springer; 2018. pp. 570-582

[8] Weng Y, Zhou H. Data augmentation computing model based on generative adversarial network. IEEE Access. 2019;**7**:75819-75828

[9] Kingma DP, Welling M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. 2013

[10] Doersch C. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908. 2016

[11] Pascual S, Bonafonte A, Serrà J. Melnet: A generative model for audio in the frequency domain. arXiv preprint arXiv:1906.01083. 2019

[12] Goodfellow I, Bengio Y, Courville A. Deep Learning (Adaptive Computation and Machine Learning Series). MIT Press; 2016

[13] Li B, Hou Y, Che W. Data augmentation approaches in natural language processing: A survey. Artificial Intelligence Open. 2022;**3**:71-90

[14] Kang T, Perotte AJ, Tang Y, Ta CN, Weng C. UMLS-based data augmentation for natural language processing of clinical research literature. Journal of the American Medical Informatics Association. 2020;**28**(4):812-823

[15] Abayomi-Alli OO, Damaševičius R, Qazi A, Adedoyin-Olowe M, Misra S. Data augmentation and deep learning methods in sound classification: A systematic review. Electronics. 2022;**11**(22)

[16] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;**16**:321-357

[17] Ker J, Lin W, Rao JP, Lim TC. Deep learning applications in medical image analysis. IEEE Access. 2018;**6**:9375-9389

[18] Abayomi-Alli OO, Damaševičius R, Misra S, Maskeliūnas R, Abayomi-Alli A. Malignant skin melanoma detection using image augmentation by oversampling in nonlinear lower-dimensional embedding manifold. Turkish Journal of Electrical Engineering and Computer Sciences. 2021;**29**:2600-2614

[19] Oyewola DO, Dada EG, Misra S, Damaševičius R. A novel data augmentation convolutional neural network for detecting malaria parasite in blood smear images. Applied Artificial Intelligence. 2022;**36**(1)

[20] Singh AK, Ganapathysubramanian B, Sarkar S. Deep learning for plant stress phenotyping: Trends and future perspectives. Trends in Plant Science. 2018;**23**(10):883-898

[21] Talebi H, Milanfar P. NIMA: Neural image assessment. IEEE Transactions on Image Processing. 2018;**27**(8):3998-4011

[22] Cranmer SJ, Leifeld P, McClurg SD, Rolfe M. Navigating the range of statistical tools for inferential network analysis. American Journal of Political Science. 2017;**61**(1):237-251

[23] Lin Y, Li H, Xiao X, Zhang L, Wang K, Gregersen H, et al. DAISM-DNNXMBD: Highly accurate cell type proportion estimation with in silico data augmentation and deep neural networks. Patterns. 2022;**3**(3):100440

[24] Tschuchnig ME, Oostingh GJ, Gadermayr M. Generative adversarial networks in digital pathology: A survey on trends and future potential. Patterns. 2020;**1**(5):100089

[25] Li J, Lim K, Yang H, Ren Z, Raghavan S, Chen P-Y, et al. AI applications through the whole life cycle of material discovery. Matter. 2020;**3**(2):371-407

[26] O'Regan D. Putting machine learning into motion: Applications in cardiovascular imaging. Clinical Radiology. 2020;**75**(1):5-13

[27] Smith A, Petersen JK, Selvan R, Rasmussen CR. Segmentation of roots in soil with U-Net. Plant Methods. 2020;**16**(1):1-14

#### **Chapter 2**
