High-Fidelity Synthetic Data Applications for Data Augmentation

*Zhenchen Wang, Barbara Draghi, Ylenia Rotalinti, Darren Lunn and Puja Myles*

#### **Abstract**

The use of high-fidelity synthetic data for data augmentation is an area of growing interest in data science. In this chapter, the concept of synthetic data is introduced, and different types of synthetic data are discussed in terms of their utility or fidelity. Approaches to synthetic data generation are presented and compared with computer modelling and simulation approaches, highlighting the unique benefits of high-fidelity synthetic data. One of the main applications of high-fidelity synthetic data is supporting the training and validation of machine learning algorithms, where it can provide a virtually unlimited amount of diverse and high-quality data to improve the accuracy and robustness of models. Furthermore, high-fidelity synthetic data can address missing data and biases due to under-sampling using techniques such as BayesBoost, as well as boost sample sizes in scenarios where the real data is based on a small sample. Another important application is generating virtual patient cohorts, such as digital twins, to estimate counterfactuals in in silico trials, allowing for better prediction of treatment outcomes and personalised medicine. The chapter concludes by identifying areas for further research in the field, including developing more efficient and accurate synthetic data generation methods and exploring the ethical implications of using synthetic data.

**Keywords:** synthetic data, external validation, sample size boosting, in silico trials, clinical trials, virtual populations, BayesBoost, bias correction

#### **1. Introduction**

In the rapidly evolving field of data science, one of the most critical challenges faced by researchers is the availability of training data. Data forms the bedrock upon which accurate and robust machine learning models are built. However, acquiring a sufficient amount of high-quality data can often be a challenging and resource-intensive task. This is where the concept of data augmentation comes into play, offering a transformative solution to address this persistent hurdle.

Data augmentation, in its simplest form, is a technique used to enhance the existing dataset by generating variations of the available samples using external data sources or synthetic data. By doing so, it aims to increase the diversity, volume, and quality of the training data. This augmentation process is important in improving the model's ability to generalise and make accurate predictions on unseen instances. This expanded dataset helps the model learn to handle different scenarios and improve its performance [1].

In the realm of data science, data augmentation allows researchers to overcome the limitations imposed by limited training data. By introducing variability through augmentation, it helps mitigate overfitting, a common challenge where the model becomes too specialized to the training data and performs poorly on new data. Additionally, data augmentation assists in addressing class imbalance issues, especially when certain classes are underrepresented in the dataset. It achieves this by generating additional samples for minority classes, thereby ensuring a more balanced representation of different classes in the augmented dataset [2–4].

Recent advancements in machine learning have paved the way for innovative data augmentation methods in high-fidelity synthetic data generation, ushering in augmentation strategies that provide richer and more insightful enhancements. For example, tabular data augmentation techniques, as described in [5], showcase the potential for improving disease prediction. Methods such as generative adversarial networks (GANs) [6], variational autoencoders (VAEs) [7], rule-based models, and physics-based simulations are used to craft highly realistic synthetic data, while graphical models such as Bayesian networks and Markov random fields capture the dependencies and relationships among variables within the dataset. This capability facilitates the generation of synthetic data that closely mirrors the characteristics of the original dataset. By combining these techniques, the creation of 'high-fidelity' synthetic data becomes achievable, showcasing complex patterns, domain knowledge integration, and simulation of real-world interactions.

The use of high-fidelity synthetic data for data augmentation presents a myriad of advantages. Firstly, it addresses the limitations of traditional augmentation techniques by providing more diverse and representative samples, especially in cases where the original dataset is small or lacks variability. Secondly, synthetic data enables researchers to explore hypothetical scenarios, enabling them to understand the behaviour of their models under different conditions. Additionally, it can be an invaluable asset in situations where the collection of real-world data is prohibitively expensive, time-consuming, or ethically challenging. By simulating data with similar statistical and relational properties, synthetic data augments the training set, expanding the model's ability to handle a wide range of scenarios.

In this chapter, we will explore the definitions, generation approaches, and considerations surrounding high-fidelity synthetic data. Furthermore, we will examine the fidelity and utility of various types of synthetic data, drawing comparisons between different generation approaches and techniques. Additionally, we will explore how high-fidelity synthetic data can be applied to support the training and validation of machine learning algorithms. Specifically, we will investigate how it can address challenges such as missing data in situations where real data is either randomly or non-randomly missing. Moreover, we will explore how high-fidelity synthetic data can help address biases resulting from under-sampling, employing innovative techniques like BayesBoost [8]. As we conclude this chapter, we will also identify areas that hold promise for further research in the field. This includes the development of more efficient and precise methods for generating and evaluating synthetic data, and the ethical implications associated with the use of synthetic data.

#### **2. Synthetic data: definitions, approaches, and considerations**

Synthetic data are artificial data that mimic the statistical properties, patterns, and relationships observed in real-world data. They are generated or simulated rather than directly collected from authentic sources. The quality of synthetic data is primarily determined by the approach employed for its generation and can be described in terms of fidelity, i.e., how effectively the synthetic data captures the relevant features and characteristics of the original data. The fidelity in turn determines its utility, i.e., its practical usefulness for various applications.

#### **2.1 High-fidelity synthetic data**

High-fidelity synthetic data refers to synthetic data capable of capturing the intricate interrelationships that exist between various data fields, replicating the complex patterns observed in real data.

In the field of financial transactions, a high-fidelity synthetic dataset would accurately emulate the intricate relationships and patterns observed in real financial data. It would possess the same statistical characteristics, transactional structures, and market trends, making it virtually indistinguishable from genuine financial data [9]. In the realm of transportation planning, a high-fidelity synthetic dataset would replicate the complex interactions and dynamics within transportation systems, including traffic flows, travel patterns, and infrastructure utilisation [10]. This synthetic dataset would mirror the statistical properties and intricate relationships found in real-world transportation data, making it practically indistinguishable from genuine data. In the context of patient health care data, a high-fidelity synthetic dataset would be able to capture complex clinical relationships and be clinically indistinguishable from real patient data [11–13]. Within the realm of social network analysis, a high-fidelity synthetic dataset would accurately capture the intricate connections, community structures, and communication patterns present in real social networks [14]. It would possess the same statistical properties, network topologies, and user behaviours, rendering it virtually indistinguishable from genuine social network data.

Generating a high-fidelity synthetic dataset can be demanding in terms of resources. Generating synthetic datasets with lower or moderate utility, that are less demanding of resources, might be deemed sufficient depending on the application. It is also important to note that there is a trade-off between utility and privacy in synthetic data generation. Higher fidelity synthetic data, which closely resembles real data, may come with increased privacy risks [15].

#### **2.2 Approaches to generate synthetic data**

We categorize the generation of synthetic data (see **Figure 1**) into two distinct methods that tackle the various challenges associated with data generation: the model-based approach and the simulation-based approach.

#### **Figure 1.**

 *Synthetic data generation approach categories.* 

#### **Figure 2.**

 *Statistical-based synthetic data generation. E.g., Gaussian Mixture Model components are designed to capture the statistical properties of the real data, the arrows illustrate how the Gaussian components influence the generation of synthetic data.* 

#### *2.2.1 Model-based approach*

The model-based approach to generating synthetic data includes statistical-based, noise-based, and machine learning-based generation methods.

Statistical-based generation (see **Figure 2**) involves creating synthetic data by leveraging the statistical properties found in real-world data. This approach aims to capture the statistical characteristics, distributions, and dependencies observed in the original data.

Various techniques and models are used in this category to generate synthetic data that closely resembles the statistical properties of the real data [16]. For example, Gaussian mixture models are commonly employed in statistical-based generation to capture the underlying distributions and generate synthetic data points [17].


These models allow researchers to replicate the patterns and variations observed in the original data. Similarly, Markov chains [18] are used to model the dependencies and transitions between different states or variables, enabling the generation of synthetic sequences that mimic the temporal behaviour of the real data.
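As a concrete illustration of statistical-based generation, the sketch below fits a Gaussian mixture model to a toy two-column dataset and then samples synthetic rows from the fitted model. It is a minimal sketch using scikit-learn's `GaussianMixture`; the toy data, the interpretation of the two fields, and the component count are illustrative assumptions, not taken from the chapter.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for a real dataset: two numeric fields (say, age and BMI)
# drawn from two overlapping sub-populations.
real = np.vstack([
    rng.normal([35.0, 24.0], [8.0, 3.0], size=(500, 2)),
    rng.normal([62.0, 29.0], [6.0, 4.0], size=(500, 2)),
])

# Fit a mixture of Gaussians to the real data; the number of
# components is a modelling choice, often selected with BIC.
gmm = GaussianMixture(n_components=2, random_state=0).fit(real)

# Draw a synthetic dataset from the fitted model.
synthetic, _ = gmm.sample(n_samples=1000)

print("real means:     ", real.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))
```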

However, the statistical-based approach has limitations in capturing complex relationships and dynamics within the data [19]. While it reproduces statistical properties well, it struggles with non-linear relationships that are not explicitly reflected in the statistical properties. In the context of body mass index (BMI) and health outcomes, a linear relationship might suggest that as BMI increases, the risk of certain health conditions, such as diabetes, also increases linearly. However, the relationship between BMI and health outcomes may not be linear. It could be that initially, as BMI increases, the risk of health conditions rises sharply, but beyond a certain BMI threshold, the effect on health outcomes levels off or even declines. This non-linear relationship is challenging to capture solely through statistical properties.

 Furthermore, the statistical-based approach may struggle to capture complex interactions between variables. For example, the relationship between BMI and health conditions may vary depending on other factors such as age, gender, or genetic predispositions. These complex interactions, where the effect of BMI on health outcomes is influenced by other variables, are not easily captured through statistical properties alone.

 Thus, the fidelity of the synthetic data generated using statistical-based approaches will largely depend on the assumptions and limitations of the data generation model, as well as the specific characteristics of the dataset. Researchers need to consider the specific nature of the data and the level of detail required for their analysis, as the statistical-based approach may not be able to fully replicate the more nuanced aspects of the real data.

The noise-based approach (see **Figure 3**) involves adding noise to a small sample of data. This approach, while useful when regenerating a portion of the real-world data, is also part of a broader category known as rule-based synthetic data [20]. This involves defining rules or mathematical functions that generate synthetic data adhering to specific patterns or distributions. One example of a noise-based approach is the use of jitter, where small random perturbations are added to the data points. Jitter introduces slight variations to the data values, simulating the inherent randomness or measurement errors observed in real-world data [21]. By applying the jitter approach within the noise-based category, researchers can generate synthetic datasets that capture the desired characteristics and patterns while incorporating random fluctuations. For instance, in a study investigating the relationship between body mass index (BMI) and blood pressure, researchers can use jitter to introduce random variations to the BMI values within a certain range. This ensures that the synthetic data closely mirrors the statistical properties and patterns observed in real data, accounting for the inherent variability in BMI measurements. Additionally, techniques like data imputation, commonly used to address missing values, also fall under this category.

#### **Figure 3.**

*Noise-based synthetic data generation, noise source, e.g., a jitter function, represents the introduction of random noise or perturbations to the real data.*
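The jitter idea just described can be illustrated in a few lines: the sketch below perturbs a handful of hypothetical BMI values with small Gaussian noise. The values and the noise scale are assumptions made for illustration; in practice the scale should reflect the plausible measurement error of the variable being jittered.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative BMI measurements (hypothetical values, not study data).
bmi = np.array([22.5, 27.1, 31.4, 24.8, 29.0])

def jitter(values, scale=0.5, rng=rng):
    """Add small Gaussian perturbations to mimic measurement noise.

    The `scale` default is an assumption; choose it to match the
    realistic variability of the measured quantity.
    """
    return values + rng.normal(0.0, scale, size=values.shape)

# Augment the sample with two jittered copies of the original values.
augmented = np.concatenate([bmi, jitter(bmi), jitter(bmi)])
print(augmented.round(2))
```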

Often, an important consideration in synthetic data generation is the protection of privacy. This becomes particularly relevant when dealing with sensitive or personally identifiable information. The noise-based approach, driven by privacy-preserving techniques such as differential privacy [22], is used to safeguard individual privacy while still allowing for meaningful analysis and data utilisation. By incorporating such techniques into the synthetic data generation process, researchers can ensure that the generated datasets maintain a high level of privacy protection. This can be achieved by adding carefully calibrated noise to the data, minimising the risk of re-identification while preserving the statistical properties and utility of the synthetic data. By prioritising privacy alongside utility and fidelity, synthetic data can serve as a valuable resource for various applications while respecting individual privacy rights.
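As one concrete instance of calibrated noise addition, the sketch below implements the standard Laplace mechanism from differential privacy [22], which adds noise with scale sensitivity/epsilon to a query result. The count query and the parameter values are illustrative assumptions, not from the chapter.

```python
import numpy as np

rng = np.random.default_rng(7)

def laplace_mechanism(true_value, sensitivity, epsilon, rng=rng):
    """Release a differentially private value by adding Laplace noise
    with scale = sensitivity / epsilon (the standard Laplace mechanism)."""
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

# Example: privately release a count query. Adding or removing one
# individual changes a count by at most 1, so the sensitivity is 1.
true_count = 1284
for epsilon in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_count, sensitivity=1.0, epsilon=epsilon)
    print(f"epsilon={epsilon:>4}: released count = {noisy:.1f}")
```

Smaller epsilon means stronger privacy but noisier releases, which is the privacy-utility trade-off discussed above.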

However, similar to the statistical-based approach, the noise-based approach has limitations in capturing intricate relationships and dependencies present in the real data, such as in healthcare datasets [23]. While it can reproduce specific patterns and distributions, it may struggle to capture the complexity of non-linear relationships or intricate dependencies that may exist within the original dataset. This can affect the fidelity of the synthetic data, as it may not fully capture the nuances and interconnections present in the real data.

Machine learning-based generation methods (see **Figure 4**) involve the use of machine learning techniques for prediction and inference, facilitating the generation of generative synthetic data. This approach encompasses various models, including GANs, VAEs, Bayesian network-based generative models, and other state-of-the-art generative models. These models can learn the underlying structure of real data and generate synthetic data that closely resembles it.

#### **Figure 4.**

*Machine learning-based synthetic data generation, e.g., Bayesian networks model relationships between variables via latent variables.*

GANs consist of two functions, a generator (G) and a discriminator (D), trained together to produce realistic synthetic data through "adversarial learning." The VAE likewise consists of two functions: an encoder that maps high-dimensional data to a lower-dimensional latent space, and a decoder that maps samples from this latent space back to the data space. The VAE's advantage lies in generating synthetic data that matches the distribution of the original data using latent variables. Likewise, Bayesian networks model relationships between variables. These networks represent dependencies and generate consistent synthetic data using techniques like Markov Chain Monte Carlo (MCMC) sampling. For instance, in medical data, a Bayesian network can model dependencies between conditions, symptoms, and treatments to generate synthetic records matching observed data.
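To make the adversarial learning loop concrete, the following is a minimal GAN sketch for tabular data in PyTorch: the generator maps latent noise to synthetic rows, while the discriminator learns to separate real rows from generated ones. The toy dataset, network sizes, and hyperparameters are illustrative assumptions, not a recipe from the chapter.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy "real" tabular data: two correlated numeric fields.
n, dim, latent = 2000, 2, 8
base = torch.randn(n, 1)
real = torch.cat([2 * base + 1, 0.5 * base - 3 + 0.1 * torch.randn(n, 1)], dim=1)

G = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, dim))
D = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    batch = real[torch.randint(0, n, (128,))]
    fake = G(torch.randn(128, latent))

    # Discriminator step: label real samples 1 and generated samples 0.
    d_loss = bce(D(batch), torch.ones(128, 1)) + \
             bce(D(fake.detach()), torch.zeros(128, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make D classify generated samples as real.
    g_loss = bce(D(G(torch.randn(128, latent))), torch.ones(128, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# Sample a synthetic dataset and compare column means with the real data.
synthetic = G(torch.randn(1000, latent)).detach()
print("real means:     ", real.mean(0))
print("synthetic means:", synthetic.mean(0))
```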

However, the machine learning-based approach for generating synthetic data has certain limitations that need to be considered [24]. One challenge is the requirement of substantial computational resources and a large amount of training data. Training generative models like VAEs or GANs can be computationally intensive, particularly when working with large and intricate datasets. Another limitation is the possibility of mode collapse, where the generative model fails to capture the full complexity of the data and produces limited variations or repetitive samples. This can result in synthetic data that lacks diversity and fails to capture the full range of patterns present in the real data. Additionally, the quality and fidelity of the generated synthetic data heavily rely on the quantity and representativeness of the training data. Inadequate or biased training data can lead to synthetic data that deviates from the true underlying distribution of the real data, reducing its fidelity and usefulness.

#### *2.2.2 Simulation-based approach*

Simulation-based approaches (see **Figure 5**), on the other hand, include agent-based, compartmental model-based, and discrete event-based methods for synthetic data generation.

In transportation planning, agent-based modelling techniques can capture traffic flows and travel patterns. By simulating realistic scenarios considering road networks, traffic signals, and driver behaviours, these approaches generate synthetic data for analysing traffic congestion, designing efficient transportation systems, and evaluating infrastructure projects [25].

#### **Figure 5.**

*Simulation-based synthetic data generation, e.g., with domain knowledge, agent-based model can simulate interactions between agents to generate data.*

In epidemiology, epidemic models such as compartmental models (e.g., Susceptible-Infectious-Recovered (SIR) or Susceptible-Exposed-Infectious-Recovered (SEIR) models), spatial models, and network models simulate disease transmission dynamics to generate synthetic data [26]. These models incorporate population demographics, contact networks, and intervention strategies to study control measures, predict outbreak impacts, and plan healthcare resource allocation.
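A minimal compartmental example may help: the sketch below integrates the deterministic SIR equations with daily (Euler) steps to produce a synthetic epidemic curve. The transmission rate, recovery rate, and population size are illustrative assumptions.

```python
import numpy as np

def simulate_sir(beta=0.3, gamma=0.1, population=100_000,
                 initial_infected=10, days=160):
    """Simulate a deterministic SIR model with daily Euler steps.

    dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I.
    Parameter values here are illustrative, not calibrated to any outbreak.
    """
    s, i, r = population - initial_infected, initial_infected, 0.0
    history = []
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return np.array(history)

curve = simulate_sir()
print("peak prevalence (infectious compartment):", int(curve[:, 1].max()))
```

The resulting S/I/R trajectories can then serve as a synthetic dataset for studying control measures or resource planning, as described above.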

Simulation-based approaches also find applications in other domains. Ecological research employs ecological models to simulate species interactions, population dynamics, and environmental factors [27]. In social sciences, agent-based models simulate human behaviour, social networks, and economic systems [28]. In manufacturing and supply chain management, discrete event simulation models replicate production processes, logistics networks, and inventory systems [29]. These techniques generate synthetic data to analyse, make informed decisions, and evaluate policies in respective fields.

However, the obvious limitation of this approach is that it may require domain-specific knowledge and expertise. Designing accurate simulations that replicate the intricacies of the system often necessitates a deep understanding of the underlying mechanisms and processes [30]. Developing reliable simulation models requires careful calibration, validation, and refinement to ensure that the generated synthetic data accurately reflects the real-world phenomenon [31].

The methods discussed earlier have their own strengths and limitations (see **Table 1**). The choice of method usually depends on the specific requirements of the application, the available resources, and the desired level of similarity between the synthetic and real data.


#### **Table 1.**

*Comparison of synthetic data generation approaches.*

#### **3. Applications of high-fidelity synthetic data**

The use of high-fidelity synthetic data has found applications in a wide range of fields, notably computer vision and natural language processing. However, the healthcare domain stands out as a particularly compelling area of focus. This emphasis on healthcare is justified by the significant impact that high-fidelity synthetic data can have in this field. By generating realistic and privacy-preserving synthetic healthcare data, researchers can overcome challenges related to data availability and data bias concerns. This representative use of high-fidelity synthetic data in healthcare enables the development and validation of machine learning models, facilitating advancements in disease prediction and personalised healthcare. Moreover, high-fidelity synthetic data plays a crucial role in generating virtual patient cohorts for analysis when the sample size is small, such as digital twins [32], to estimate counterfactuals in in silico trials, allowing for better prediction of treatment outcomes and personalised medicine. In this section, we delve into the applications of high-fidelity synthetic data, emphasising its role in supporting the training and validation processes of machine learning algorithms, while highlighting specific examples that showcase the transformative potential of this approach in healthcare.

#### **3.1 Addressing missing data**

One of the most noteworthy applications of synthetic data lies in the opportunity to effectively tackle missing data due to the absence or unavailability of certain observations or values in a dataset. Failing to appropriately address missing data can jeopardize the reliability of the results by introducing bias into statistical analysis and modelling, leading to results that are not generalisable to real-world populations or scenarios [33, 34]. In addition, missing data decreases the sample size available for analysis, inevitably reducing statistical power as it becomes more challenging to detect true relationships and make reliable predictions.

To mitigate these risks and select the most suitable solution, we usually carry out an accurate a-priori analysis considering factors such as the specific characteristics of the dataset (i.e., data types), the proportion of missing data and the underlying mechanisms causing the missingness [35]. Indeed, missing data can be classified into three different categories depending on the missing data mechanism.

Missing Completely at Random (MCAR) data embodies the scenario where the missingness is a random process that occurs independently of any measured or unmeasured feature. MCAR is an ideal scenario as it implies that the missing information does not introduce bias into the analysis. Missing at Random (MAR) occurs when the probability of missingness depends entirely on the observed data; MAR assumes that the missingness can be explained by observable characteristics. Both MCAR and MAR can be handled through a set of approaches as surveyed in [36]. Traditional statistical and machine learning imputation techniques, including mean, regression, k-nearest neighbour, and ensemble-based methods, have been proposed in the literature to handle these scenarios [37].
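For instance, a minimal sketch of two such imputation techniques using scikit-learn, mean imputation and k-nearest-neighbour imputation, might look as follows; the toy dataset and its values are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy dataset with missing values (hypothetical numbers): two numeric
# columns, e.g., BMI and systolic blood pressure.
X = np.array([
    [25.0, 120.0],
    [31.0, np.nan],
    [np.nan, 135.0],
    [28.0, 128.0],
])

# Mean imputation: replace each missing value with its column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# k-nearest-neighbour imputation: infer each missing value from the
# most similar complete rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print("mean-imputed:\n", X_mean)
print("kNN-imputed:\n", X_knn)
```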

More sophisticated techniques are required to handle more complex data patterns such as Missing Not at Random (MNAR) data. This category describes a situation where the missingness is related to the values that are missing. MNAR is considered the most challenging type of missing data because the missing data is systematically related to unobserved variables. Handling MNAR requires techniques [38] considering the missing data mechanism and the relationship between the missing values and the unobserved variables to make imputations and draw valid inferences.

Recent research on modelling MNAR demonstrates the use of Bayesian networks to generate high-fidelity synthetic data from large-scale UK primary care datasets. These datasets usually contain noise, structurally missing data, and numerous nonlinear relationships [39]. In this context, three approaches can be employed to model MNAR. The first approach focuses on discrete nodes: a "missing state" can be included among the possible states of each node. Alternatively, for continuous nodes, a new binary parent known as the "missing node" can be added to each node, indicating whether the data point is missing or not. Additionally, the Fast Causal Inference (FCI) algorithm [40] offers a third approach that can be applied to both discrete and continuous nodes. FCI aids in inferring the position and inclusion of latent variables in the network, effectively capturing MNAR and other unmeasured effects. By incorporating robust latent variables, this approach aims to enhance the accuracy of the underlying distributions and account for any MNAR effects.
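A minimal sketch of the first two ideas, an explicit "missing" state for a discrete node and a binary "missing node" indicator for a continuous node, is shown below using pandas. The column names and values are hypothetical, and this shows only the data-preparation step that would precede fitting a Bayesian network.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "smoking_status": ["never", None, "current", None],  # discrete node
    "bmi": [24.3, np.nan, 31.0, 27.5],                   # continuous node
})

# Discrete node: add an explicit "missing" state so the network can
# treat missingness as an ordinary category with its own dependencies.
df["smoking_status"] = df["smoking_status"].fillna("missing")

# Continuous node: add a binary "missing node" indicating whether the
# value was observed, so missingness itself can be modelled as a parent.
df["bmi_missing"] = df["bmi"].isna().astype(int)

print(df)
```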

Another example [41] is using GANs to model missing data in retailing and healthcare datasets and to generate synthetic data. It starts by creating unique identifiers for missing patterns in the original dataset. Then, it uses these identifiers to fill in the missing values and learns from the data to generate synthetic samples with similar missing patterns. The final synthetic dataset resembles the original data but includes missing values in the same way.

Even though synthetic data represents a solution to handle missing data, we must recognize the complexity of this task and the potential risks it entails [42]. Meticulous attention and thoughtful planning are required. For instance, incorrectly handling missing data can disrupt correlations and relationships among features in the dataset. This can result in misleading associations and invalid inferences, affecting the accuracy of predictive models and the ability to draw meaningful conclusions. Also, an improper solution can increase the variability of the results, making them less precise and less generalizable to the population. In certain situations, such as with temporal correlations of missing data distribution, additional modelling approaches or sensitivity analyses may be required to handle missing data more appropriately.

#### **3.2 Addressing data bias**

Real-world data often suffer from underrepresentation or inadequate representation of certain groups. Specific groups may be under-represented due to cultural sensitivities amongst some communities, institutionalised data collection procedures, or research involving small patient cohorts for rare diseases and outcomes, leading to bias in analyses and decision-making processes [21]. Advanced model-based synthetic data generation techniques such as BayesBoost and Importance Sampling can be employed to mitigate these biases.

BayesBoost is an algorithmic technique that leverages high-fidelity synthetic data to correct biases due to under-sampling, generating synthetic data points that boost the representation of underrepresented populations. This technique has recently been used to correct bias in COVID-19 and cardiovascular disease datasets [8]. In this study, the approach's effectiveness was confirmed by validating it on a deliberately biased subset of a dataset and comparing the bias-corrected synthetic data with the original "full" dataset. The findings demonstrated that synthetic data can effectively correct biases and enhance the generalizability of machine learning algorithms to population subgroups that are underrepresented in the real data.
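To convey the general boosting idea (this is not the published BayesBoost algorithm, which uses Bayesian networks and is described in [8]), the sketch below fits a simple generative model to an under-represented subgroup and samples synthetic rows to raise its share of the dataset. The toy data, the choice of a Gaussian mixture as the generator, and the target share are all illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Hypothetical dataset: a majority group and a heavily under-sampled
# minority subgroup, each with two numeric fields.
majority = rng.normal([50.0, 26.0], [10.0, 4.0], size=(950, 2))
minority = rng.normal([35.0, 31.0], [6.0, 3.0], size=(50, 2))

# Fit a generative model to the under-represented subgroup only...
gen = GaussianMixture(n_components=1, random_state=1).fit(minority)

# ...and sample synthetic points to push its share roughly towards a
# target proportion (20% here, an arbitrary choice).
n_total = len(majority) + len(minority)
n_boost = int(0.2 * n_total) - len(minority)
boost, _ = gen.sample(n_samples=n_boost)

augmented = np.vstack([majority, minority, boost])
print(f"minority rows: {len(minority)} -> {len(minority) + n_boost}")
```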


Importance Sampling [43] is a statistical technique utilised in the process of creating high-fidelity synthetic data. Its primary purpose is to alleviate biases caused by under-sampling in the generated datasets. By assigning appropriate weights to each sample drawn from the importance distribution, Importance Sampling can adjust the synthetic data points, giving more significance to those that align well with the target population's distribution. In recent work [44], Importance Sampling helped handle uncertainty in small-sample scenarios during classification tasks by selecting and weighting data points from an alternative distribution. This approach enabled more accurate estimation of classification errors, providing a robust and reliable assessment of classifier performance, even in situations with limited training data.
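A minimal worked example of the underlying mechanics: the sketch below draws samples from a biased proposal distribution q and re-weights them by w(x) = p(x)/q(x) to estimate a probability under the target population p. The specific distributions are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Target population p: standard normal. The available samples come
# from a biased proposal q with a shifted mean and wider spread.
x = rng.normal(1.5, 1.2, size=10_000)

# Importance weights w(x) = p(x) / q(x) re-weight the biased draws so
# that weighted averages estimate expectations under the target p.
w = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 1.5, 1.2)

# Example: estimate P(X > 1) under p using only the biased samples
# (np.average with weights gives the self-normalised estimator).
estimate = np.average(x > 1.0, weights=w)
print(f"importance-sampling estimate: {estimate:.4f}")
print(f"exact value under p:          {1 - norm.cdf(1.0):.4f}")
```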

Despite the importance of addressing bias, both BayesBoost and Importance Sampling encounter common limitations and challenges. Both approaches can be computationally intensive, especially when dealing with large datasets or complex models. The iterative nature of BayesBoost and the need for multiple samples in Importance Sampling contribute to significant computational overhead. Moreover, overfitting is a potential challenge for both approaches. BayesBoost may overfit if the model becomes too complex or if the training data contains noise. Similarly, improper choice of the importance distribution in Importance Sampling can increase variance and result in overfitting. Furthermore, parameter tuning is essential for both methods. Adjusting parameters such as the number of boosting iterations and learning rates in BayesBoost, or selecting the appropriate importance distribution in Importance Sampling, requires careful consideration to achieve optimal results.

#### **3.3 Supporting in silico clinical trials**

Clinical trials are systematic investigations conducted on human subjects to evaluate the safety, efficacy, and potential benefits of medical interventions, treatments, therapies, and diagnostic procedures. The traditional approach to conducting clinical trials can be time-consuming, resource-intensive, and costly, often hindering the progress of medical innovation. To address these challenges, an emerging and promising strategy involves leveraging synthetic data generation methodologies to create virtual patient cohorts, fostering the concept of digital twins [32] in the context of healthcare.

The core idea behind generating virtual patient cohorts is to simulate the effects and responses of interventions or treatments on these digital twins. In essence, this process transforms a traditional clinical trial into an in silico trial [45, 46], which is conducted through computer simulation and modelling techniques.

In [47], the researchers demonstrated the use of electronic health records to create synthetic patient populations and personalised, predictive models of response to therapy, incorporating in silico clinical trials to accelerate the development of new drugs. Another study [48] successfully generated synthetic radiological images for novel medical device evaluations in in silico trials, though some anatomical distinctions persist between synthetic and real images.

Despite early success in using synthetic data for in silico trials, several potential limitations and challenges need to be addressed. Ensuring the accuracy and representativeness of synthetic data is of utmost importance. The validity of results heavily relies on the quality and appropriateness of the data used to construct virtual patient cohorts. Additionally, thorough validation of the simulation models against real-world clinical data, considering their complexity and incorporation of various physiological, pharmacological, and disease-specific parameters, is essential for ensuring their validity. In [49], the authors explored existing challenges and research opportunities to enhance both synthetic data generation methods and in silico trial techniques.

#### **4. Areas for further research**

There are several emerging areas of research associated with the generation of synthetic data. These include potential advancements in synthetic data generation and metrics that can be utilised for evaluating both the usefulness and privacy of the data that has been generated. Towards the end of this section, we will also discuss the general concerns related to synthetic data that we believe will persist in further research.

#### **4.1 Synthetic data generation**

Techniques such as GANs and VAEs enable the creation of synthetic data that closely resembles real-world data, thereby enhancing the effectiveness of downstream applications and research. However, there are new approaches that are emerging related to data augmentation and privacy preservation that may further enhance the generation of synthetic data.

#### *4.1.1 Enhanced data augmentation techniques*

Using model-based synthetic data generation approaches can facilitate the generation of augmented synthetic data that introduces occlusions, transformations, missing parts, or combinations of different samples. For example, mixup [50] is a technique that blends pairs of samples, incorporating their features and labels. This process fosters smooth transitions between instances, generating synthetic samples with interpolated characteristics. CutMix [51] augments synthetic data by patching fragments of one sample onto another, creating combined instances that exhibit properties from both samples. This technique introduces spatial relationships and fine-grained details into the synthetic data, making it more representative of real-world scenarios.
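A minimal sketch of mixup as defined in [50], blending two samples and their labels with a Beta-distributed coefficient, is shown below; the toy feature vectors and the alpha value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def mixup(x1, y1, x2, y2, alpha=0.2, rng=rng):
    """Blend two samples: x = lam*x1 + (1-lam)*x2, and likewise for
    the labels, with lam ~ Beta(alpha, alpha) as in mixup [50]."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Toy feature vectors with one-hot labels (illustrative values).
x_a, y_a = np.array([0.2, 0.9]), np.array([1.0, 0.0])
x_b, y_b = np.array([0.7, 0.1]), np.array([0.0, 1.0])

x_mix, y_mix = mixup(x_a, y_a, x_b, y_b)
print(x_mix.round(3), y_mix.round(3))
```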

Self-supervised learning techniques [52] have shown promise in augmenting synthetic data by leveraging pretext tasks or auxiliary tasks. By defining surrogate tasks that encourage the model to learn meaningful representations from unlabelled data, self-supervised learning enhances the richness and diversity of the synthetic data. These techniques allow the synthetic data to capture intricate patterns and structures present in real-world data, enabling more effective model training and enhancing the quality of downstream applications.

While techniques like mixup and CutMix enhance diversity, they may introduce artifacts and unrealistic combinations, impacting the model's performance in practical applications. Moreover, self-supervised learning techniques heavily depend on the quality of pretext tasks, posing a challenge in obtaining meaningful representations for effective data augmentation. Hence, addressing these new challenges is essential to create reliable and representative synthetic data for robust machine learning models in real-world scenarios.

#### *4.1.2 Advancements in privacy-preserving synthetic data generation*

With increasing public concerns surrounding privacy, privacy-preserving synthetic data generation has become increasingly important. Techniques such as personalised privacy, privacy accounting, and privacy amplification play a role in ensuring robust privacy protection.

Personalised privacy [53] focuses on tailoring privacy protection measures to individuals or specific data subjects, considering their unique privacy requirements. This approach allows for a more fine-grained privacy control, ensuring that every individual's privacy needs are adequately addressed during the generation of synthetic data.

Advanced privacy accounting [54] involves systematically measuring and evaluating the privacy guarantees provided by synthetic data generation methods. This technique assesses the level of privacy protection offered by synthetic datasets and quantifies the potential risks of re-identification or privacy breaches.

Privacy amplification techniques [55] aim to strengthen privacy guarantees by incorporating additional privacy-enhancing mechanisms. These mechanisms introduce extra noise or perturbations to the synthetic data, making it even more challenging for attackers to re-identify individuals or extract sensitive information.

That said, finding the right balance between privacy and utility is crucial in privacy-preserving synthetic data generation. Tailoring privacy measures and using advanced privacy accounting can enhance protection but may complicate data generation and affect utility. Privacy amplification techniques strengthen security but can introduce data distortion. Striking an optimal balance remains a challenging yet essential goal for researchers and practitioners in this field.

#### **4.2 Synthetic data validation**

Metrics to validate the utility and privacy of synthetic datasets have evolved over time. Historically, utility assessment focused on statistical measures such as mean squared error and correlation coefficients to measure similarity between the synthetic and original datasets. As privacy concerns heightened, metrics were developed to evaluate the risk of identity disclosure in synthetic datasets, including measures like re-identification risk and information disclosure. More recently, with the advent of deep learning and generative models, evaluation metrics have incorporated domain-specific measures such as fidelity, diversity, and semantic consistency to assess the validity and usefulness of synthetic data. Furthermore, privacy evaluation has expanded to include differential privacy mechanisms and privacy-preserving techniques that quantitatively measure the risk of identity disclosure and ensure data protection.

**Table 2** summarises these evaluation metrics and provides a comprehensive overview of the measures used to evaluate the similarity and privacy protection of synthetic datasets.

It is worth noting that traditional data augmentation methods primarily focus on enhancing the existing dataset by generating variations of the available samples. They do not involve a separate validation process to ensure similarity and privacy protection, as is required for synthetic datasets [31].

#### **Table 2.**

*Overview of evaluation metrics.*

In addition, when choosing the validation metrics of synthetic data, it is essential to consider its planned applications. For example, if the intended use of high-fidelity synthetic data is for sample size boosting in clinical trials, validity could be assessed by comparing a boosted data extract with the full data. On the other hand, when synthetic data is utilised as a privacy-enhancing technology and a proxy for real (ground truth) data, validity can be evaluated by comparing statistical distributions of variables and ML model results in the synthetic and ground truth data.
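For example, a minimal utility check along these lines might compare per-variable marginal distributions with a two-sample Kolmogorov-Smirnov test and cross-variable structure via correlation matrices, as sketched below; the two Gaussian datasets are stand-ins for real data and its synthetic counterpart.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(9)

# Stand-ins for a real dataset and its synthetic counterpart.
real = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=5000)
synth = rng.multivariate_normal([0.05, 0], [[1.0, 0.5], [0.5, 1.0]], size=5000)

# Per-variable marginal similarity: two-sample Kolmogorov-Smirnov test
# (smaller statistic = closer marginals).
for j in range(real.shape[1]):
    stat, p = ks_2samp(real[:, j], synth[:, j])
    print(f"variable {j}: KS statistic = {stat:.3f}")

# Cross-variable structure: largest absolute difference between the
# correlation matrices of the two datasets.
gap = np.abs(np.corrcoef(real.T) - np.corrcoef(synth.T)).max()
print(f"max absolute correlation difference: {gap:.3f}")
```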

#### **4.3 General concerns**

When generating synthetic data, bias-related challenges should be considered. Overfitting to biased patterns can occur when the synthetic data generation process closely mirrors biased real data, leading to skewed results. Moreover, synthetic data methods may not fully capture the diverse characteristics of the real data, particularly for underrepresented groups, thus introducing bias. Selection bias may also arise if synthetic data is generated from a biased subset of real data. An incorrect selection of bias correction techniques during synthetic data generation can result in residual bias in the synthetic data, failing to address complex biases in the real data adequately. If the data generation process does not account for confounding variables, i.e., variables that influence bias in the real data but are not considered or controlled for during data analysis, bias may also be introduced into the synthetic data. Finally, making unrealistic assumptions about data distribution in synthetic data generation models can introduce bias and impact the validity of the generated data.


Apart from the bias, another consideration is the potential for unintended reidentification of individuals. Despite efforts made to anonymise the data during the synthetic data generation process, there is still a possibility of re-identification, especially when combined with other external data sources. Synthetic data containing unique or rare attributes may increase the risk of re-identification, compromising individuals' privacy and confidentiality. In such cases, ethical guidelines and robust privacy-preserving techniques must be implemented to minimize the re-identification risk and protect individuals' privacy.

Addressing the transparency and accountability of synthetic data generation methods is crucial. Synthetic data generation often involves complex algorithms and models, making it challenging to understand and interpret the underlying processes. This lack of transparency can raise concerns about the fairness, interpretability, and accountability of the generated synthetic data. Researchers must be transparent about the methods used, document the assumptions and limitations, and provide clear explanations of how the synthetic data aligns with the original dataset.

The potential implications of using synthetic data in high-stakes decision-making contexts should not be overlooked. If synthetic data is used to train or test algorithms that have a direct impact on individuals' lives, such as in healthcare or finance, the ethical implications are amplified. Rigorous assessment of the performance and generalisability of models trained on synthetic data is crucial to avoid biases, unfair outcomes, or adverse effects on marginalised populations. Regular monitoring, validation, and auditing of the synthetic data generation process can help identify and mitigate potential biases and ethical concerns.

These concerns highlight the need for careful data curation, ensuring sufficient and diverse training data, and addressing the risk of mode collapse to achieve high-fidelity synthetic data generation.

#### **5. Conclusion**

This chapter has explored the concept of synthetic data and its significance in data augmentation. Synthetic data, which effectively replicates real-world data while safeguarding privacy, presents valuable opportunities for research and practical applications. The selection of an appropriate synthetic data generation method relies on specific requirements, available resources, and the desired resemblance to real data.

Notably, high-fidelity synthetic data has emerged as a potent tool with transformative potential, particularly in the realm of healthcare. It effectively addresses challenges related to missing data and biases stemming from under-sampling, thereby propelling advancements in disease prediction, drug discovery, and personalised healthcare. Through the imputation of missing values and the generation of additional synthetic samples, high-fidelity synthetic data empowers researchers to surmount data scarcity and enhance inference accuracy. Furthermore, it plays a vital role in the creation of virtual patient cohorts for in silico trials, enabling superior predictions of treatment effectiveness, personalised medicine, and the estimation of counterfactual scenarios. Moreover, high-fidelity synthetic data finds practical utility in rare event analysis, facilitating the study of uncommon diseases or adverse drug reactions.

The progress made in synthetic data generation techniques, such as GANs and VAEs, has considerably bolstered the capacity to create synthetic data that closely approximates real-world data. Furthermore, augmentation techniques have the potential to expand datasets and furnish a more diverse set of samples for training machine learning models.

The evaluation metrics for synthetic datasets have evolved to encompass measures of utility, privacy, and domain-specific characteristics. These metrics now include fidelity, diversity, semantic consistency, and privacy assessment techniques. Anticipated future trends involve the development of more sophisticated metrics that account for context-specific utility, robust privacy guarantees, and considerations of fairness.

Synthetic data raises general concerns related to biases, representativeness, unintended re-identification, transparency, and accountability. Overcoming these concerns requires careful evaluation of the fidelity and quality of synthetic data, implementation of privacy-preserving techniques, transparent documentation of the generation methods, and rigorous assessment of model performance and generalisability.

## **Author details**

Zhenchen Wang\*, Barbara Draghi, Ylenia Rotalinti, Darren Lunn and Puja Myles Medicines and Healthcare Products Regulatory Agency, London, United Kingdom

\*Address all correspondence to: zhenchen.wang@mhra.gov.uk

© 2024 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **References**

[1] Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. Journal of Big Data. 2019;**6**:60. DOI: 10.1186/s40537-019-0197-0

[2] Antoniou A et al. Data augmentation for time series classification using convolutional neural networks. Data Mining and Knowledge Discovery. 2018;**32**:914-945. DOI: 10.1007/s10618-018-0595-8

[3] Miotto R et al. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports. 2017;**6**:26094. DOI: 10.1038/srep26094

[4] Yan L et al. Data augmentation in ECG-based deep cardiac arrhythmia classification. Computers in Biology and Medicine. 2018;**102**:411-420. DOI: 10.1016/j.compbiomed.2018.10.006

[5] Abayomi-Alli R, Damaševičius RM, Abayomi-Alli A. BiLSTM with data augmentation using interpolation methods to improve early detection of Parkinson disease. In: 2020 15th Conference on Computer Science and Information Systems (FedCSIS), Sofia, Bulgaria. IEEE. 2020. pp. 371-380. DOI: 10.15439/2020F188

[6] Goodfellow IJ et al. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, 8-13 December 2014; Montreal, Canada. Cambridge, MA, USA: MIT Press. pp. 2672-2680

[7] Kingma DP, Welling M. Auto-encoding variational Bayes. In: Proceedings of the International Conference on Learning Representations (ICLR). 2014

[8] Draghi B, Wang Z, Myles P, Tucker A. BayesBoost: Identifying and handling bias using synthetic data generators. In: Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications, in Proceedings of Machine Learning Research. Vol. 154. Bilbao, Spain: ECML-PKDD 2021; 2021. pp. 49-62. Available from: https://proceedings.mlr.press/v154/draghi21a.html

[9] Assefa SA et al. Generating synthetic data in finance: Opportunities, challenges and pitfalls. In: Proceedings of the First ACM International Conference on AI in Finance (ICAIF '20). New York, NY, USA: Association for Computing Machinery; 2021. pp. 1-8 Article 44. DOI: 10.1145/3383455.3422554

[10] Li G, Chen Y, Wang Y, et al. City-scale synthetic individual-level vehicle trip data. Scientific Data. 2023;**10**:96. DOI: 10.1038/s41597-023-01997-4

[11] Wang Z, Myles P, Tucker A. Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy. Computational Intelligence. 2021;**37**:1-33. DOI: 10.1111/coin.12427

[12] Wang Z et al. Evaluating a longitudinal synthetic data generator using real world data. In: Proceedings of the IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), 7-9 June 2021. Aveiro, Portugal; pp. 259-264

[13] El Emam K, Mosquera L, Jonker E, Sood H. Evaluating the utility of synthetic COVID-19 case data. JAMIA Open. 2021;**4**(1):ooab012. DOI: 10.1093/jamiaopen/ooab012

[14] Shirzadian P, Antony B, Gattani AG, et al. A time evolving online social network generation algorithm. Scientific Reports. 2023;**13**:2395. DOI: 10.1038/s41598-023-29443-w

[15] Appenzeller A et al. Privacy and utility of private synthetic data for medical data analyses. Applied Sciences. 2022;**12**:12320. DOI: 10.3390/app122312320

[16] Buczak AL, Babin S, Moniz L. Data-driven approach for creating synthetic electronic medical records. BMC Medical Informatics and Decision Making. 2010;**10**:59. DOI: 10.1186/1472-6947-10-59

[17] Figueira A, Vaz B. Survey on synthetic data generation, evaluation methods and GANs. Mathematics. 2022;**10**(15):2733. DOI: 10.3390/math10152733

[18] Sonnenberg FA, Beck JR. Markov models in medical decision making: A practical guide. Medical Decision Making. 1993;**13**(4):322-338. DOI: 10.1177/0272989X9301300409

[19] Levy JJ, O'Malley AJ. Don't dismiss logistic regression: the case for sensible extraction of interactions in the era of machine learning. BMC Medical Research Methodology. 2020;**20**:171. DOI: 10.1186/s12874-020-01046-3

[20] Momeny M et al. Learning-to-augment strategy using noisy and denoised data: Improving generalizability of deep CNN for the detection of COVID-19 in X-ray images. Computers in Biology and Medicine. 2021;**136**:104704. DOI: 10.1016/j.compbiomed.2021.104704

[21] Chambers JM. Graphical Methods for Data Analysis. Boca Raton, FL: Chapman and Hall/CRC; 1983. DOI: 10.1201/9781351072304

[22] Dwork C. Differential privacy. In: Bugliesi M, Preneel B, Sassone V, Wegener I, editors. Automata, Languages and Programming. ICALP 2006. Lecture Notes in Computer Science. Vol. 4052. Berlin, Heidelberg: Springer; 2006. DOI: 10.1007/11787006_1

[23] Shuryak I. Advantages of synthetic noise and machine learning for analyzing radioecological data sets. PLoS One. 2017;**12**(1):e0170007. DOI: 10.1371/journal.pone.0170007. PMID: 28068401; PMCID: PMC5222373

[24] Sarker IH. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Computer Science. 2021;**2**:420. DOI: 10.1007/s42979-021-00815-1

[25] Huang J et al. An overview of agent-based models for transport simulation and analysis. Journal of Advanced Transportation. 2022;**2022**:1252534. DOI: 10.1155/2022/1252534

[26] Ferguson NM et al. Strategies for mitigating an influenza pandemic. Nature. 2006;**442**(7101):448-452. DOI: 10.1038/nature04795

[27] Ovaskainen O, Roy DB, Fox R. Uncovering hidden spatial structure in species communities with spatially explicit joint species distribution models. Methods in Ecology and Evolution. 2016;**7**(4):428-436

[28] Steinbacher M, Raddant M, Karimi F, et al. Advances in the agent-based modeling of economic and social behavior. SN Business & Economics. 2021;**1**:99. DOI: 10.1007/s43546-021-00103-3

[29] Chan KC, Rabaev M, Pratama H. Generation of synthetic manufacturing datasets for machine learning using discrete-event simulation. Production & Manufacturing Research. 2022;**10**(1):337-353. DOI: 10.1080/21693277.2022.2086642

[30] Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Statistics in Medicine. 2019;**38**(11):2074-2102. DOI: 10.1002/sim.8086

[31] Mumuni A, Mumuni F. Data augmentation: A comprehensive survey of modern approaches. Array. 2022;**16**:100258. DOI: 10.1016/j.array.2022.100258

[32] Jones DE et al. Characterising the digital twin: A systematic literature review. CIRP Journal of Manufacturing Science and Technology. 2020;**29**:36-52

[33] McKnight PE et al. Missing Data: A Gentle Introduction. New York: Guilford Press; 2007

[34] Nakagawa S, Freckleton RP. Missing inaction: The dangers of ignoring missing data. Trends in Ecology & Evolution. 2008;**23**(11):592-596

[35] Kleinberg G, Diaz MJ, Batchu S, Lucke-Wold B. Racial underrepresentation in dermatological datasets leads to biased machine learning models and inequitable healthcare. Journal of Biomedical Research. 2022;**3**(1):42-47

[36] Emmanuel T, Maupong T, Mpoeleng D, et al. A survey on missing data in machine learning. Journal of Big Data. 2021;**8**:140. DOI: 10.1186/s40537-021-00516-9

[37] Baraldi AN, Enders CK. An introduction to modern missing data analyses. Journal of School Psychology. 2010;**48**(1):5-37

[38] Iddrisu AK, Gumedze F. An application of a pattern-mixture model with multiple imputation for the analysis of longitudinal trials with protocol deviations. BMC Medical Research Methodology. 2019;**19**:10. DOI: 10.1186/s12874-018-0639-y

[39] Tucker A et al. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. NPJ Digital Medicine. 2020;**3**(1):1-13

[40] Colombo D et al. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics. 2012;**40**:294-321

[41] Wang X, Asif H, Vaidya J. Preserving missing data distribution in synthetic data. In: Proceedings of the ACM Web Conference 2023 (WWW '23), April 30–May 04, 2023; Austin, TX, USA. New York, NY, USA: ACM; 2023. p. 12

[42] Stavseth MR, Clausen T, Røislien J. How handling missing data may impact conclusions: A comparison of six different imputation methods for categorical questionnaire data. SAGE Open Medicine. 2019;**7**:2050312118822912. DOI: 10.1177/2050312118822912

[43] Tokdar ST, Kass RE. Importance sampling: A review. WIREs Computational Statistics. 2010;**2**:54-60. DOI: 10.1002/wics.56

[44] Maddouri O, Qian X, Alexander FJ, Dougherty ER, Yoon BJ. Robust importance sampling for error estimation in the context of optimal Bayesian transfer learning. Patterns (N Y). 2022;**3**(3):100428. DOI: 10.1016/j.patter.2021.100428

[45] Wang Z, Gao C, Glass L, Sun J. Artificial intelligence for in silico clinical trials: A review. ArXiv, abs/2209.09023. 2022

[46] Badano A. In silico imaging clinical trials: cheaper, faster, better, safer, and more scalable. Trials. 2021;**22**:64. DOI: 10.1186/s13063-020-05002-w

[47] Zand R, Abedi V, Hontecillas R, Lu P, Noorbakhsh-Sabet N, Verma M, et al. Development of synthetic patient populations and in silico clinical trials. In: Bassaganya-Riera, editor. Accelerated Path to Cures. Cham: Springer; 2018. pp. 57-77

[48] Galbusera F et al. Exploring the potential of generative adversarial networks for synthesizing radiological images of the spine to be used in in silico trials. Frontiers in Bioengineering and Biotechnology. 2018;**6**:53. DOI: 10.3389/fbioe.2018.00053

[49] Myles P et al. Synthetic data and the innovation, assessment, and regulation of AI medical devices. Progress in Biomedical Engineering. 2023;**5**:013001

[50] Zhang H, Cisse M, Dauphin YN, et al. Mixup: Beyond empirical risk minimization. In: Proceedings of the International Conference on Learning Representations, April 2018; Vancouver, BC, Canada; 2018. pp. 1-13

[51] Yun S, Han D, Chun S, Oh SJ, Yoo Y, Choe J. CutMix: Regularization strategy to train strong classifiers with localizable features. In: IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, South Korea; 2019. pp. 6022-6031

[52] Chen T et al. A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning (ICML'20). Vol. 119. JMLR.org; 2020. pp. 1597-1607, Article 149

[53] Canhoto AI, Keegan BJ, Ryzhikh M. Snakes and ladders: Unpacking the personalisation-privacy paradox in the context of AI-enabled personalisation in the physical retail environment. Information Systems Frontiers. 2023;**25**. DOI: 10.1007/s10796-023-10369-7

[54] Doroshenko V, Ghazi B, Kamath P, Kumar R, Manurangsi P. Connect the dots: Tighter discrete approximations of privacy loss distributions. Proceedings on Privacy Enhancing Technologies. 2022;**2022**:552-570

[55] Bennett CH, Brassard G, Crepeau C, Maurer UM. Generalized privacy amplification. IEEE Transactions on Information Theory. 1995;**41**(6):1915-1923. DOI: 10.1109/18.476316

[56] Raghunathan TE et al. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics. 2003;**19**:1

[57] Loukides G, Denny JC, Malin B. The disclosure of diagnosis codes can breach research participants' privacy. Journal of the American Medical Informatics Association. 2010;**17**(3):322-327. DOI: 10.1136/jamia.2009.002725

[58] Vaidya J, Clifton C. Privacypreserving k-means clustering over vertically partitioned data. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03). New York, NY, USA: ACM; 2003. pp. 206-215

[59] Machanavajjhala A et al. L-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data. 2007;**1**:3–es. DOI: 10.1145/1217299.1217302

[60] Domingo-Ferrer J, Torra V. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery.

*High-Fidelity Synthetic Data Applications for Data Augmentation DOI: http://dx.doi.org/10.5772/intechopen.113884*

2005;**11**:195-212. DOI: 10.1007/ s10618-005-0007-5

[61] El Emam K et al. A globally optimal k-anonymity method for the de-identification of health data. Journal of the American Medical Informatics Association. 2009;**16**(5):670-682. DOI: 10.1197/jamia.M3144

[62] Zemel R et al. Learning fair representations. In: Proceedings of the 30th International Conference on International Conference on Machine Learning – Volume 28 (ICML'13). GA, USA: Atlanta; 2013 JMLR.org, III–325–III–333

[63] Shokri R et al. Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy (SP). CA, USA: San Jose; 2017. pp. 3-18

## **Chapter 8**

6G Physical Layer Security

*Israt Ara and Brian Kelley*

#### **Abstract**

Securing the proliferation of wireless networks in 6G requires security-based signaling as a native component. This chapter analyzes Physical Layer Security (PLS) applied to 6G Radio Access Networks (RAN) to enhance Layer-1 security. It defines a PLS air interface and system model with AI/ML-based intelligent codebook generation and detection schemes. The chapter also proposes an operational overview of AI/ML-integrated PLS with a shared key-agreement protocol in an O-RAN architecture for 6G security. Results include codebook generation details, the impact of MIMO antenna array size, and the key Bit Error Rate (BER) of 6G-PLS detection in the presence of eavesdroppers and Rayleigh fading plus noise.

**Keywords:** 6G, security, physical layer security, O-RAN, AI/ML, deep learning

#### **1. Introduction**

6G wireless systems under development provide advanced next-generation mobile communication capability with deep integration of distributed neural networks and joint communication and sensing. Integrating AI/ML with 6G fuses capabilities across high-rate communications, high-speed computing, cyber-physical systems, and biologically inspired frameworks, ushering in an era of true Intelligence of Everything (IoE) [1]. Emerging 6G industries include smart grids, Factory 5.0, automated transportation, 3D immersive XR, and remote surgical robotics.

Prior-generation 5G systems control machines and Internet of Things (IoT) devices, providing connectivity to industrial applications in agriculture, construction, smart grids, healthcare, transportation, and satellites. A significant achievement of 5G technology is operation across a vast expanse of cellular bands (e.g., 3GPP FR1, FR2) from 600 MHz to 71 GHz. The FR2 frequencies of 5G support broadband millimeter wave (mmWave) applications; early fixed-access mmWave 5G roll-outs still use sub-6 GHz to support mobility. Many 5G enhanced mobile broadband (eMBB) and ultra-reliable low latency communication (URLLC) applications jointly mandate high data rates and low latency, and tailored 6G wireless systems inherently overcome these challenges [2]. **Table 1** presents the key performance indicators that enable 6G applications, with a comparison against 5G technology.


#### **Table 1.**

*Key performance indicators of 5G and 6G [3, 4].*

**Table 1** shows that 6G technology will offer almost a 1 Tbps peak data rate and an area traffic capacity of 1 Gb/s/m², roughly 100 times more than the prevailing 5G technology. This enormous migration of information traffic from wired Ethernet to 6G wireless necessitates high levels of security. The network slicing taxonomy of 6G wireless systems typically delineates Enhanced Mobile Broadband Plus (eMBB-Plus), Ultra-High-Speed with Low Latency Communications (uHSLLC), and Secure Ultra-Reliable Low-Latency Communications (SURLLC) [5]. In addition, with AI as the critical enabler of 6G, AI-generated configurations across multiple 6G system layers support a massive, densely populated network infrastructure that accommodates an exponential increase in radio access nodes. Therefore, security, especially at air interfaces, is essential. Physical layer security offers the potential for intrinsic, low-latency security at Layer-1 as a native component. Within the context of 6G intelligence, AI/ML also enhances parameter estimation, detection, mobility optimization, and the detection of malicious actors [6].

AI/ML in 6G ecosystems requires a sophisticated infrastructure to support the development and testing of advanced algorithms and systems, including high-performance computing and GPU clusters, which are essential for training and validating models. AI/ML algorithms require vast data collection and analytics tools for training and validation. Integrated intelligence in the 6G Core, 6G Radio Access Network (RAN), and E2E management occurs through AI/ML [7–9]. The 6G research community actively proposes integrated AI/ML design within the 6G system infrastructure for intelligent allocation and management of the network, spectrum, computing, data storage resources, and security. References [10–13] delineate 6G intelligent use-case applications in control, sensing, automated operation, and security.

This chapter presents an AI/ML-integrated, shared key-based Layer-1 Physical Layer Security (PLS) protocol as a candidate for 6G SURLLC. With transmission designs based on the intrinsic randomness of the wireless medium to achieve secrecy, PLS ensures lower complexity and incurs less latency than traditional cryptography [14]. In addition, intelligent optimization produces a more secure and, thus, more reliable air interface protocol leveraging AI/ML-based PLS. Along with security, the proposed PLS method offers a fundamentally lower-latency exchange of secret information when operating in machine-to-machine (M2M) mode. Our approach to an intelligent and optimized PLS also applies to 4G and 5G systems; however, many existing 5G and prior 4G physical layer infrastructures would need to approve standardization changes. Furthermore, the ITU's recommendations on "IMT for 2030 and Beyond" [15] and the 3GPP 6G work items [16] actively investigate new air interfaces for the evolution from 5G to 6G, leading to a focus on Layer-1 6G security [17, 18].


The overview and organization of this chapter are as follows: Section 2 studies the fundamental concepts of Physical Layer Security, including a review of the shared key-based PLS system. Section 3 introduces practical schemes for integrating PLS into 6G communication. This section applies Deep Learning (DL) in the context of PLS to jointly optimize and provision higher-tier security for control, data, and management channels. Specifically, it describes AI/ML integration in the shared key-based PLS model, contributing an optimized approach to codebook generation and intelligent shared secret key decoding schemes. In addition, this section introduces, for the first time to our knowledge, an overview of the AI/ML-integrated PLS scheme workflow for 6G O-RAN. The new solution adds latency improvements within an O-RAN Alliance framework and enhanced schemes for PLS integration within 6G O-RAN. Section 4 illustrates PLS models, protocols, simulations, and results demonstrating improved AI/ML-based security performance. Finally, Section 5 concludes with a discussion of the future scope and prospects of this research topic.

#### **2. Physical layer security for 6G**

Physical Layer Security protocols overlay secure transmission schemes onto Physical Layer (PHY) data links with the goal of shared secret information exchange. Time Division Duplex (TDD) wireless channels, typical in 6G wireless, support uplink-downlink (UL-DL) channel reciprocity. The legitimate users exchange information over spatial channel statistics that non-legitimate users and eavesdroppers can approximate only if they reside near both ends of the link.

Physical-layer security classification categories generally consist of SINR-based (keyless) or key-based approaches. Keyless methods include beamforming, power allocation, and injection of artificial noise algorithms [19]. The second key-based category involves complexity-based schemes utilizing shared secret keys between legitimate users at the physical layer [20]. The shared key protocols represent a significant underpinning of advanced cryptographic engineering.

Keyless PLS schemes have several drawbacks. The additive noise in Artificial Noise (AN)-based PLS schemes degrades detection. In beamforming-based PLS, the message signal is steered toward the legitimate receiver; transmit power concentrates within the main lobe but also radiates in the antenna's minor side lobes. A finite number of transmitting antennas provides only limited spatial directivity, and this side-lobe leakage allows nearby eavesdroppers to decipher the message signal [21].

Key-based physical layer security systems integrate the wireless transmission medium as a promising source of randomness. The rich scattering in wireless environments results in stochastically varying multipath fading at each mobile antenna. TDD channel reciprocity applies to legitimate users with channels defined by their joint spatial channel statistics. Non-proximate positioning by eavesdroppers results in uncorrelated channel statistics, preventing malicious users from duplicating secret key generation protocols. For this reason, shared key-based PLS schemes have gained significant research interest.

Hence, in this chapter, we adopted a key-based PLS scheme. To prevent eavesdroppers from being able to estimate the reference signals, we have adopted a PHY layer key generation scheme utilizing a precoding matrix index (PMI) and rotated reference signals [22]. The PMI method aligns with the 6G OFDM requirements.

Precoding is an operation that lets a MIMO system exploit the best subchannel gains. Codebook-based precoding balances feedback overhead, equalizer complexity, and system performance [22]. In the shared key and codebook-based PLS model, a global codebook shared among the communication terminals contains a finite number of precoding matrices. Each precoding matrix in the codebook has an index, the PMI. The secret information from legitimate transmitters and receivers maps to a precoding matrix, and the precoding matrix indices, in turn, map to secret keys transmitted from legitimate information sources. The method formulates codebook elements [23] by applying a DFT codebook; operators drawn from the field of complex unitary matrices generate the precoders. The PLS system's complete secret key concatenates the transmitter's and receiver's private information. Hence, an eavesdropper cannot extract the secret key by placing itself close to only the transmitter or only the receiver.
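To make the codebook and PMI mapping concrete, here is a minimal sketch, assuming a DFT-based construction of unitary precoders in the spirit of [23]; the per-entry phase rotation, function names, and the 2-antenna, 2-bit parameters are illustrative choices, not the authors' implementation.

```python
import numpy as np

def dft_codebook(n_ant: int, n_bits: int) -> list:
    """Build a codebook of 2**n_bits unitary precoding matrices from a
    DFT base matrix, one entry per PMI (illustrative construction)."""
    n_entries = 2 ** n_bits
    base = np.exp(2j * np.pi * np.outer(np.arange(n_ant), np.arange(n_ant)) / n_ant)
    base /= np.sqrt(n_ant)                        # unitary DFT matrix
    codebook = []
    for pmi in range(n_entries):
        # Rotate the DFT matrix by a per-entry diagonal phase shift.
        phases = np.exp(2j * np.pi * pmi * np.arange(n_ant) / n_entries)
        codebook.append(np.diag(phases) @ base)   # product stays unitary
    return codebook

def secret_to_pmi(bits: str) -> int:
    """Map a block of secret bits to a precoding matrix index (PMI)."""
    return int(bits, 2)

codebook = dft_codebook(n_ant=2, n_bits=2)        # 4 precoders, PMIs 0..3
F = codebook[secret_to_pmi("10")]                 # secret bits '10' -> PMI 2
print(np.allclose(F.conj().T @ F, np.eye(2)))     # True: F is unitary
```

Because each terminal's secret bits select one unitary precoder, concatenating the transmitter's and receiver's PMI contributions yields the full shared key described above.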

#### **3. Practical schemes for integrating physical layer security in 6G**

**Figure 1** illustrates the proposed security model for 6G cellular, consisting of three users: a first user, Alice, represents the Radio Access Network (RAN); a second user, Bob, corresponds to the User Equipment (UE); and a third, illegitimate user, Eve, passively eavesdrops on the secret bidirectional information exchange between Alice and Bob (the legitimate users). The successful exchange of private information implies securely transmitting secret information. The scheme leverages machine learning-based shared key-based PLS techniques for Bob (UE) and Alice (RAN) across 6G communication channels.

#### **Figure 1.**

*Shared key block ciphers, an important class of block ciphers, require one key for both encryption and decryption. The PLS system model secures the 6G wireless channel between Alice, Bob, and Eve for shared key-based ciphers.*

Transmission performance for legitimate channels improves significantly, with AI/ML automatically mitigating the concern for eavesdropper channels.

In **Figure 1**, the TDD channel from Alice to Bob, denoted $H_{AB}$, and the Bob-to-Alice channel, denoted $H_{BA}$, have the well-known transpose relationship. $H_{AE}$ and $H_{BE}$ denote the channels between Alice and Eve and between Bob and Eve, respectively. In the MIMO-OFDM transceiver, the transmitter, Alice, first sends out a reference signal for the legitimate receiver, Bob, to estimate the channel matrix $H_{AB}$, as illustrated in **Figure 1**. During transmission, Alice and Bob obfuscate the channel matrix by applying a random channel sounding operator; indirectly, they observe a singular vector obtained by performing Singular Value Decomposition (SVD) of the channel matrix. $G$ represents a random reference signal operator in **Figure 1**, with subscripts indicating the steps in the transmission process and their corresponding channels. In the secret key-based PLS scheme, the transmitter and receiver each contribute their own secret information to a shared secret. Bob sends his secret information after encoding it with a codebook; when operating key-exchange protocols, this encoded secret information should be considered the 'secret key.' Bob sends $S_B$, his secret contribution, to Alice over the channel. Alice estimates $\hat{S}_B$ and similarly sends her encoded secret information, or secret key, $S_A$, to Bob. Bob's estimated version of $S_A$ is $\hat{S}_A$. Concatenating $\hat{S}_B$ with her own $S_A$ gives Alice the full secret information. The length of the secret key is pre-agreed, and the codebook is known to all parties, which makes it 'universal'.

#### **3.1 Framework of proposed shared key based PLS system model**

The framework analyzes Alice and Bob, each equipped with multi-antenna systems. An elementary secrecy problem involves the wiretap channel [24] at the eavesdropper, a cascaded second discrete memoryless channel. The universal codebook available to Alice, Bob, and Eve contains the precoding matrices and the corresponding PMIs. Reconstruction of the full wireless environment by the eavesdropper remains an ongoing risk; hence, the scheme applies a rotation operator to the reference signal instead of transmitting the unaltered reference signal. Embedding the secret information within the wireless channel conceals Alice's and Bob's wireless signals from Eve's eavesdropping. The detailed formulation and procedure are explained later in the chapter.

Prior PLS publications (see [25]) explained the shared key-based PLS framework and introduced the use of AI/ML within the decoding scheme. This chapter describes improved security performance within the context of a MIMO wiretap channel, where a passive eavesdropper, Eve, monitors the channel between Alice and Bob. The protocol focuses on two primary areas to optimize with the help of ML algorithms: (a) generation of an optimum codebook, and (b) decoding of the secret key at the receiver end.

#### *3.1.1 Codebook generation based on the optimum-PLS capacity using AI/ML*

The optimum codebook selection process uses a Feed-Forward Neural Network (FFNN) algorithm. The FFNN was chosen for its simple architecture and low computational cost; otherwise, processing delay would become a factor impacting the performance of the model.

First, Alice sends out a rotated reference signal for Bob. As the legitimate receiver, Bob estimates the channel matrix. Bob then finds the precoding matrix and its corresponding PMI from the codebook that maximizes the channel capacity shown in Eq. (1) [22]:

$$\mathcal{C}_{H,\mathcal{F}} = \log_2 \det \left[ I_p + \frac{E_s}{n_s \sigma^2} F^{\dagger} H^{\dagger} H F \right] \tag{1}$$

where $I_p$ is the identity matrix, with $p$ denoting the number of transmit and receive antennas, $E_s$ is the total power of the transmitted signal vector, $n_s$ is the number of data streams, $\sigma^2$ is the noise variance, $H$ is the channel between the legitimate transmitter and receiver at the time of observation, and $F$ is a precoding matrix from the universal codebook $\mathcal{F}$, constructed following [23], such that $F \in \mathcal{F}$. The estimated optimum precoding matrix, denoted $\hat{F}$, must satisfy the maximum capacity requirement:

$$\hat{F} = \underset{F \in \mathcal{F}}{\arg\max}\, \mathcal{C}_{H,\mathcal{F}} \tag{2}$$

The FFNN model generates the optimum precoder matrices and the codebook. The parameters used are shown in Algorithm 1, where $m$ is the number of codebook bits and $P$ is the number of antennas. The inputs to the ML model are precoders from the universal DFT codebook, $F \in \mathcal{F}$; these pass through the ML model, which is trained to choose the optimum codebook elements that satisfy the MIMO secrecy capacity in Eq. (1). The FFNN algorithm determines the optimum precoding matrix, $\hat{F}$, that maximizes the capacity as shown in Eq. (1).
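As a concrete reading of Eqs. (1) and (2), the following sketch scores each precoder in a codebook for one channel realization and returns the capacity-maximizing PMI; this exhaustive search is the baseline that the FFNN is trained to approximate. The codebook construction and all parameter values are assumptions for illustration, not the authors' code.

```python
import numpy as np

def dft_codebook(n_ant, n_bits):
    """DFT-based codebook of 2**n_bits unitary precoders (illustrative)."""
    W = np.exp(2j * np.pi * np.outer(np.arange(n_ant), np.arange(n_ant)) / n_ant) / np.sqrt(n_ant)
    return [np.diag(np.exp(2j * np.pi * k * np.arange(n_ant) / 2 ** n_bits)) @ W
            for k in range(2 ** n_bits)]

def capacity(H, F, es=1.0, n_s=2, sigma2=0.1):
    """Eq. (1): log2 det(I_p + Es/(n_s sigma^2) F^dag H^dag H F)."""
    p = F.shape[1]
    M = np.eye(p) + (es / (n_s * sigma2)) * (F.conj().T @ H.conj().T @ H @ F)
    return np.log2(np.linalg.det(M).real)    # det is real for Hermitian PSD M

def optimum_pmi(H, codebook):
    """Eq. (2): index of the capacity-maximizing precoder."""
    return int(np.argmax([capacity(H, F) for F in codebook]))

# Example: one Rayleigh channel realization for a 2x2 system.
rng = np.random.default_rng(0)
H = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)
print("optimum PMI:", optimum_pmi(H, dft_codebook(2, 2)))
```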

#### *3.1.2 Intelligent PLS decoding process using AI/ML*

Alice and Bob exchange their secret keys using the optimum codebook obtained above, decoded by a Deep Neural Network (DNN)-based algorithm. A main goal is maximizing the sum secret rate of the Bob-to-Alice transmission while simultaneously minimizing the eavesdropper's signal capacity. Perfect secrecy is achieved when the transmitter and the legitimate receiver communicate at some positive rate while ensuring that the eavesdropper receives zero bits of information. Our method successfully achieved optimal-secrecy PLS detection. When maximum secrecy is obtained between the legitimate transmitter and receiver, the threat from a passive eavesdropper drops.

Among ML techniques, the choice of a DNN for decoding offers the following advantages: (i) once the training process is finished, a DNN provides accurate solutions within a very short computational time [26]; in our time-varying channel, computational latency occurs in near real time with much-reduced processing time. (ii) Other ML algorithms, such as CNNs and LSTMs, are used for classifying sequence data; in our case, however, the data is not sequential but rather consists of features that predict a class (the PMI). Hence, a DNN is the best match for our model. Algorithm 2 presents the parameters for the deep learning neural network codebook detection.
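A minimal sketch of the detection idea, using scikit-learn's MLPClassifier as a stand-in for the authors' DNN: the features are the real and imaginary parts of a noisy, channel-equalized observation of the precoder, and the class label is the transmitted PMI. The feature construction, noise level, and network size are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def dft_codebook(n_ant, n_bits):
    W = np.exp(2j * np.pi * np.outer(np.arange(n_ant), np.arange(n_ant)) / n_ant) / np.sqrt(n_ant)
    return [np.diag(np.exp(2j * np.pi * k * np.arange(n_ant) / 2 ** n_bits)) @ W
            for k in range(2 ** n_bits)]

rng = np.random.default_rng(2)
codebook, n_bits = dft_codebook(2, 2), 2
X, y = [], []
for _ in range(4000):
    pmi = int(rng.integers(2 ** n_bits))          # transmitted PMI = class label
    # Noisy, channel-equalized observation of the precoder (assumed feature set).
    obs = codebook[pmi] + 0.2 * (rng.standard_normal((2, 2))
                                 + 1j * rng.standard_normal((2, 2)))
    X.append(np.concatenate([obs.real.ravel(), obs.imag.ravel()]))
    y.append(pmi)

X, y = np.asarray(X), np.asarray(y)
clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0)
clf.fit(X[:2800], y[:2800])                       # ~70/30 split, as in Section 4
print("validation accuracy:", clf.score(X[2800:], y[2800:]))
```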

Using these two ML algorithms, a detailed formulation and working procedure of the shared key based PLS model is shown in **Figure 2** and is described as follows:


**Figure 2.** *6G signal privacy using AI/ML for secure decoding of information in physical layer security.*


**Step 0: AI initialization stage.** Alice first initiates a request to the legitimate receiver for private information exchange. Alice transmits a reference signal $r$, rotated by a random unitary matrix $G$. This random unitary matrix $G$ obscures the channel and generates uniformly distributed signaling; its dimension depends on the number of antennas used.

TDD implies that the transmitter and receiver channels ($H_{AB}$ and $H_{BA}$) are transposes of each other, such that $H_{BA} = (H_{AB})^{T}$. The protocol applies TDD reciprocity.

**Step 1: Bob-to-Alice**


#### **Step 2: Alice-to-Bob**

1. Alice performs an SVD, such that $H_{BA} G_1 = \left( V_A^{*} \Sigma_A U_A^{T} \right) G_1$, and applies channel coding to $S_A$ to generate $C_A$. Alice looks up $C_A$ in the optimum codebook and finds the corresponding codebook element $F_A$. Next, she transmits a rotated reference signal $G_2 r$ to Bob, where $G_2 = V_A F_A^{\dagger}$.

2. Bob estimates the $H_{AB} G_2$ channel parameters (see Step 1) and feeds the noisy received information into the AI/ML model, denoted as the Bob AI/ML detector in **Figure 2**. Bob obtains the estimated version of Alice's secret information, $\hat{S}_A$, as the model output and concatenates it with his own $S_B$ to recover the full secret.

Bob and Alice inject additive random noise into the received secret key. The PLS procedure tests the performance of the model by comparing the estimated version of the transmitted random signal information to the actual secret key, and the performance analysis concludes with Key BER results. Bob then repeats his steps, generating new secret information and transmitting it to Alice for detection. The process repeats iteratively.

Both Alice and Bob hold half of the secret information, which they generated themselves, and the other half, which they estimated in the form of PMIs. Even if Eve moves into the spatial proximity of Alice, she eavesdrops only on $S_B$ rather than on both $S_A$ and $S_B$.
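A minimal sketch of Alice's side of Step 2 under the notation above: SVD of the sounded channel $H_{BA}G_1$, codebook lookup of $F_A$, and construction of the rotation $G_2 = V_A F_A^{\dagger}$. The channel, sounding operator, and secret bits are random placeholders, and the SVD convention follows NumPy rather than the chapter's exact notation.

```python
import numpy as np

def dft_codebook(n_ant, n_bits):
    W = np.exp(2j * np.pi * np.outer(np.arange(n_ant), np.arange(n_ant)) / n_ant) / np.sqrt(n_ant)
    return [np.diag(np.exp(2j * np.pi * k * np.arange(n_ant) / 2 ** n_bits)) @ W
            for k in range(2 ** n_bits)]

rng = np.random.default_rng(1)
# Random unitary sounding operator G1 and a Rayleigh channel H_BA (placeholders).
G1 = np.linalg.qr(rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2)))[0]
H_BA = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)

# Alice takes the SVD of the sounded channel H_BA @ G1.
U, s, Vh = np.linalg.svd(H_BA @ G1)
V_A = Vh.conj().T

# Alice maps her (channel-coded) secret C_A to a codebook element F_A by PMI.
C_A = "10"                                         # placeholder coded secret bits
F_A = dft_codebook(2, 2)[int(C_A, 2)]

# She rotates the reference signal r with G2 = V_A F_A^dagger and transmits it.
G2 = V_A @ F_A.conj().T
r = np.ones(2, dtype=complex) / np.sqrt(2)         # placeholder reference signal
tx = G2 @ r                                        # rotated reference sent to Bob
```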

#### **3.2 Integration of open RAN solutions with key based PLS**

This section describes the proposed PLS scheme in the context of 6G communications. A vital capability enabled by PLS is extensible adaptation to the new technologies of 6G cellular. Emerging technologies include massive MIMO (massive Multiple Input, Multiple Output) [27], millimeter wave communications, sub-terahertz communications [28], network-based sensing [29], network slicing [30], and ML-based digital signal processing. Managing and optimizing these new network systems requires flexible security solutions that integrate across the Radio Access Network (RAN). Open RAN (O-RAN), the most prominent 6G RAN configuration, disaggregates, virtualizes, and enables the "softwarization" of infrastructure resources and components via open standards, open interfaces, and interoperability across private vendors and open-source software communities [31]. Disaggregation and virtualization enable flexible deployments based on cloud-native principles. Reliance on cloud frameworks increases the resiliency and reconfigurability of Open RAN and allows operators to aggregate technology across equipment vendors of different sizes and varieties.

The O-RAN Alliance formed a next Generation Research Group (nGRG) to carry out research on O-RAN and future 6G networks [32]. The research community has a strong incentive to pair 6G O-RAN with strong security. The authors in [33] introduced the O-RAN building blocks and architecture, with use cases related to the application of ML to the RAN. To our knowledge, no work has demonstrated how Layer-1 PLS will practically operate with embedded intelligence in a 6G O-RAN architecture context. The security scheme discussed here aims to extend the functional disaggregation paradigm proposed by 3GPP for 6G PLS gNBs [34]. RAN disaggregation splits base stations into different functional units: (a) high-layer splits, (b) low-layer splits, and (c) double splits. The functional splits are represented in **Figure 3**. When the functional split concept defines a fronthaul interface, two competing interests arise.


Many groups have standardized on a subset of allowable O-RAN split points. The "7-2x" [35] split option allows a variation, with the precoding function located either "above" the interface in the O-DU or "below" the interface in the O-RU. Following the O-RAN split option 7-2x shown in **Figure 4**, the O-RU logically hosts radio frequency (RF) processing, and the low-PHY layer consists of D/A conversion, cyclic prefix insertion, and the IFFT. The O-RAN Distributed Unit (O-DU) is a logical node hosting the high-PHY layer, consisting of Resource Element (RE) mapping, PLS precoding, layer mapping, modulation, scrambling, and coding, along with the MAC and RLC layers. A Centralized Unit (CU) runs the SDAP, RRC, and PDCP layers. The interface between the O-DU and O-RU is known as the Open Fronthaul (O-FH) interface. O-RAN has defined and standardized the F1 interface for communication between the O-CU and O-DU.

In O-RAN, control, optimization, and AI/ML algorithms can be trained and deployed in two logical functions: the non-real-time (non-RT) RAN Intelligent Controller (RIC) and the near-real-time (near-RT) RIC, shown in **Figure 4**.

**Figure 3.** *Overview of functional split for O-RAN.*

#### **Figure 4.** *Distributed O-RAN with physical layer security for 6G.*

The non-RT RIC is a logical function internal to the Service Management and Orchestration (SMO) framework that complements the near-RT RIC for intelligent RAN operation and optimization on a time scale larger than 1 s. It hosts data management and exposure and supports AI/ML model training within applications denoted rApps. The near-RT RIC is a logical function deployed at the edge of the network that operates control loops with a periodicity between 10 ms and 1 s. It interacts with the O-DU and O-CU and consists of multiple applications supporting custom logic, called xApps: microservices that provide near-RT controllable operation of the O-DU RAN for PLS through the O-RAN E2 interface. The xApps receive measurements from the DU node and respond with control actions.

However, to ensure SURLLC, control decisions and execution need to be realized in real time. Limiting the execution of control applications to the near-RT and non-RT RICs prevents the use of data-driven solutions where control decisions and inference must be made in real time, or within temporal windows shorter than the 10 ms supported by near-RT control loops [33, 36, 37], because the RICs have limited access to low-level information. The near-RT RIC brings network control closer to the edge, but it primarily executes in cloud facilities [38]. Therefore, data needs to travel from the DUs to the near-RT RIC, and the output of the inference needs to go back to the DUs/RUs. This additional communication results in increased latency and overhead over the E2 interface to support data collection, inference, and control. To mitigate this challenge, the authors in [39] introduced the notion of dApps: custom, distributed applications that complement xApps/rApps by implementing RAN intelligence at the CUs/DUs for real-time use cases outside the timescales of the current RICs. The authors demonstrated that using dApps can result in a 3.57× reduction in overhead.

Although dApps have been introduced and proposed in [39], the required interfaces are yet to be standardized by the O-RAN Alliance. In this chapter, we adopt dApps to implement a low-latency security scheme operating in the lower layers of the protocol stack, and we extend dApps to propose a functional integration and working-procedure overview for the PLS use case: to our knowledge, the first low-latency, secure, intelligent Layer-1 security scheme design proposed for 6G O-RAN. We call these applications 'security Apps', or 'sApps': applications for exchanging low-latency information securely between the UE and the O-RAN network over physical layer security channels.

The proposed PLS scheme integrates with O-RAN interfaces. This section describes extensions for the management, deployment and execution of sApps:


**Figure 5.** *Procedure for the setup of an E2 session in the E2 node.*

**Figure 6.** *Flow diagram of communication in E2 nodes in O-RAN.*


In the 6G-PLS scheme, Alice (RAN) initiates the first step, the ML initialization stage, and sends the rotated reference signal $Gr$ to Bob (UE). This is proposed to take place in an sApp located in the E2 nodes. The E2 setup procedure is illustrated in **Figure 5**. The E2 interface runs on top of the SCTP protocol. The E2 node (in this case, the O-DU) transmits an E2 setup request that lists the RAN functions and configurations it supports, along with the identifiers for the node; the CU node processes this information and replies with an E2 setup response.

After the connection is established, an E2 Service Model (SM) RAN controller implements the 6G PLS protocol. To send the rotated reference signal to Bob (UE), Alice (RAN) publishes data to the CU node to initialize the PLS procedures. Algorithm 3, illustrated in **Figure 6**, describes the PLS control-service message exchange between the RAN CU node and the DU node. For transmission, or Downlink (DL), the O-DU node computes resource element (e.g., 5G RE or subcarrier) mapping, PLS precoding, layer mapping, modulation, scrambling, and coding, and eventually sends $Gr$ to Bob (UE). After receiving the reference signal, Bob initiates the Step 1 transmission as described earlier in this section. For the Uplink (UL) transmission, Bob sends his encoded secret key to Alice, and the O-DU performs PLS decoding instead of PLS precoding, along with all the other steps mentioned.


Alice receives Bob's secret information, decodes it, and estimates $\hat{S}_B$. The AI/ML training data collected in the SMO applies training protocols in xApps of the near-RT RIC or in sApps in the O-CU-CP; the location of this process depends on the operator's intent, as explained in Algorithm 3. Both xApps and sApps reside as cloud-native containers (e.g., Docker). In the former case, data for inference is received from the xApp via the E2 interface, while in the latter, data is locally available at the sApp in the E2 node (in this case, the CU node). The operator's intent determines how to split and distribute intelligence among xApps and sApps and how to dispatch them. To better orchestrate this and to mitigate any possible conflicts, an sApp controller and monitor is hosted in the near-RT RIC. The AI/ML life cycle consists of the following steps, which follow the guidelines provided by the O-RAN Alliance (see WG1 [40]):


In the O-DU, real-time KPIs report observations from the environment, performance evaluation, and ML performance feedback. Based on the results, the sApp makes a control decision on whether to trigger retraining of the AI/ML agent. When retraining is required, the AI/ML model either retrains in the O-DU on these real-time data or, based on the operator's intent, the real-time data is fed via the O1 interface to the SMO for training in an xApp in the near-RT RIC. The AI/ML lifecycle then iterates.
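A minimal sketch of the sApp retraining decision just described, with hypothetical KPI fields and thresholds; the O1/E2 plumbing is abstracted into callbacks, since those interfaces are standardized elsewhere.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PlsKpis:
    """Real-time KPIs reported by the O-DU (hypothetical fields)."""
    key_ber: float          # measured key bit error rate
    val_accuracy: float     # AI/ML model performance feedback

def sapp_control_step(kpis: PlsKpis,
                      retrain_local: Callable[[], None],
                      retrain_via_smo: Callable[[], None],
                      operator_prefers_smo: bool = False,
                      ber_threshold: float = 0.05) -> str:
    """Decide whether to trigger AI/ML agent retraining, and where."""
    if kpis.key_ber <= ber_threshold and kpis.val_accuracy >= 0.95:
        return "no retraining required"
    if operator_prefers_smo:
        retrain_via_smo()   # feed real-time data over O1 to an xApp in the near-RT RIC
        return "retraining in xApp (SMO path)"
    retrain_local()         # retrain in the O-DU using locally available data
    return "retraining in sApp (O-DU)"

# Example: degraded KPIs trigger local retraining in the O-DU.
print(sapp_control_step(PlsKpis(key_ber=0.12, val_accuracy=0.90),
                        retrain_local=lambda: None, retrain_via_smo=lambda: None))
```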

The integration of sApps occurs in the lower layers of the RAN architecture. The scheme thereby realizes 'AI at the edge', significantly improving network performance and latency.

#### **4. Simulation results**

Analysis of the MATLAB simulation results for the Key Bit Error Rate (BER) of the AI/ML-based PLS scheme confirms the efficacy of the approach. Bob and Alice both transmit secret keys under noise-limited scenarios in 2×2 and 4×4 MIMO channels with Rayleigh fading. The model applies a 960 kHz sampling rate and a 130 Hz Doppler shift. The simulations applied a Rayleigh fading channel model for 2-, 4-, and 5-bit codebooks across a Monte Carlo transmission model containing 5000 information bits. The PLS scheme applies a secret key agreement protocol. The analysis employs Key BER metrics and measures the detector Key BER probability at the receiver.
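The following sketch shows the shape of such a Monte Carlo Key BER measurement, replacing the AI/ML detector with simple nearest-codeword detection on a channel-equalized observation; the 5000 information bits follow the text, while the codebook, noise model, and SNR grid are illustrative assumptions.

```python
import numpy as np

def dft_codebook(n_ant, n_bits):
    W = np.exp(2j * np.pi * np.outer(np.arange(n_ant), np.arange(n_ant)) / n_ant) / np.sqrt(n_ant)
    return [np.diag(np.exp(2j * np.pi * k * np.arange(n_ant) / 2 ** n_bits)) @ W
            for k in range(2 ** n_bits)]

def key_ber(n_bits, snr_db, n_info_bits=5000, seed=0):
    """Monte Carlo Key BER with nearest-codeword detection (illustrative)."""
    rng = np.random.default_rng(seed)
    codebook = dft_codebook(2, n_bits)
    sigma = 10 ** (-snr_db / 20)                   # noise std for unit-power signal
    n_keys, errors = n_info_bits // n_bits, 0
    for _ in range(n_keys):
        pmi = int(rng.integers(2 ** n_bits))
        obs = codebook[pmi] + sigma * (rng.standard_normal((2, 2))
                                       + 1j * rng.standard_normal((2, 2)))
        pmi_hat = int(np.argmin([np.linalg.norm(obs - F) for F in codebook]))
        errors += bin(pmi ^ pmi_hat).count("1")    # bit errors in this key block
    return errors / (n_keys * n_bits)

for snr in (10, 20, 30):
    print(f"SNR {snr} dB -> Key BER {key_ber(2, snr):.4f}")
```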

In the ML algorithm, the data split was 70–30% for training and validation, and the model was trained for 40 epochs. The resulting validation accuracy was 98.33% for the Bob-to-Alice (UL) transmission and 95% for the Alice-to-Bob (DL) transmission. These results indicate that the model neither underfits nor overfits and is well trained for pattern recognition and classification.

We first compare our proposed model with a traditional shared key-based PLS model with a DFT codebook, where no ML algorithm was adopted.

**Figure 7.**

*Plot of Key BER (raw BER prior to error correction decoding) vs. SNR (dB) for non-ML and ML-based PLS for Bob and Alice transmission in the presence of Eve, using a 2×2 MIMO system with 2-bit and 4-bit codebooks.*

#### **Figure 8.**

*ML model performance for Bob and Alice transmission: plot of Key BER (raw BER prior to error correction decoding) vs. SNR (dB) in 2×2 and 4×4 MIMO systems, using 2-bit codebooks.*

#### **Figure 9.**

*ML model performance for Bob and Alice transmission: plot of Key BER (raw BER prior to error correction decoding) vs. SNR (dB) in 2×2 and 4×4 MIMO systems, using 4-bit codebooks.*

#### **Figure 10.**

*ML model performance for Bob-to-Alice transmission: plot of Key BER (raw BER prior to error correction decoding) vs. SNR (dB) using 5-bit codebooks in 2×2 and 4×4 MIMO systems.*

5G systems employ polar codes for the control channel and LDPC codes for the data channel; **Figures 7**–**10** therefore illustrate the raw BER prior to error correction decoding. **Figure 7** shows, for a 2×2 MIMO system, that as SNR increases from 10 dB to 50 dB, Key BER decreases with increasing signal power and decreasing noise variance for both the non-ML and ML models. However, for a 4-bit codebook, the Key BER penalty is higher than for a 2-bit codebook, owing to the higher detection precision required by a higher-order codebook. For example, at an SNR of 10 dB for Alice-to-Bob (DL) transmission, the Key BER for a 2-bit codebook is 17.5% for the ML model versus 32.8% for the non-ML model, while for a 4-bit codebook the Key BER is 19.5% for the ML model versus 31.96% for the non-ML model. This analysis shows that, compared to the existing non-ML shared key-based PLS model, the proposed ML model, with an optimum codebook and an ML-based decoding scheme, performs better and can guarantee better Key BER performance, and hence better security and reliability.

**Figure 7** also demonstrates an analysis of Eve's performance in comparison with the Alice-to-Bob transmission for a 2×2 MIMO, multi-codebook system. The assumption is that Eve knows the same universal DFT codebook that Alice and Bob use and that Eve's channel is uncorrelated. The figure clearly shows that Eve cannot perfectly decode the secret key simply by placing herself closer to Bob or Alice. For example, comparing Alice, Bob, and Eve's performance for the 4-bit codebook, at an SNR of 25 dB, the Alice-to-Bob transmission achieves perfect secrecy (0 BER), whereas Eve has a Key BER of 38.80% when placed close to Bob and 41.28% when placed close to Alice. Generally, a signal with an SNR of 20 dB to 25 dB is recommended for data networks. This analysis shows that perfect security has been achieved.

The reliability promised by 6G systems is also manifested in the improved BER of UL and DL transmission sessions, achieved by increasing the antenna count, as illustrated in **Figures 8**–**10**.

Comparing **Figures 8** and **9** with **Figure 10**, it is evident that BER in the Alice-to-Bob transmission degrades with higher-order codebooks as SNR increases from 10 dB to 35 dB, as also demonstrated in **Figure 7**. The reason is that higher-order codebooks provide a larger number of precoders and hence a higher precision requirement. **Figure 10** shows that for a 2×2 MIMO system with a 5-bit codebook at SNR = 10 dB, the DL and UL Key BER are 30% and 21.01%, respectively. When the codebook is reduced to 4 bits and then to 2 bits, as shown in **Figures 8** and **9**, for the same 2×2 MIMO antenna configuration the DL and UL Key BER drop to 21.80% and 12.10% for the 4-bit codebook, and to 20.10% and 10.20% for the 2-bit codebook, respectively, at SNR = 10 dB. For a 5-bit codebook, 0 Key BER is achieved at 30 dB, whereas for the 4-bit and 2-bit codebooks, 0 Key BER is achieved at even lower SNRs of 20 dB and 15 dB, respectively.

Key BER can be compensated by increasing the number of MIMO antennas. **Figure 8** shows that for a 2-bit codebook at SNR = 10 dB, increasing the number of transmit and receive antennas from 2 to 4 decreases the Key BER significantly, from 10.20% to 0.71% for UL transmission and from 20.10% to 1.43% for DL transmission. Similarly, comparing the results in **Figures 9** and **10**, the Key BER improves significantly as the number of transmit and receive antennas increases, demonstrating the better security and reliability of the proposed ML-based PLS model. **Figure 11** presents a summarized overview of the ML-based PLS system performance over multi-codebook and MIMO antenna systems at 10 dB SNR.


#### **Figure 11.**

*Comparison of the proposed ML-based system performance, demonstrating improved security and reliability in terms of Key BER (raw BER prior to error correction decoding) over multi-codebook and MIMO antenna systems.*


**Figure 11** comparatively analyzes the BER performance for an OFDM system with 64 subchannels. The $L$ subchannels have a 15 kHz subchannel spacing, and the sampling frequency $f_s$ is 960 kHz. The bandwidth satisfies $W < f_s$, and the cyclic prefix (CP) was selected for 25% overhead. The symbol duration is $T_{\mathrm{symb}} = (L + CP)/f_s$, giving a symbol rate $1/T_{\mathrm{symb}} = f_s/(L + CP)$. The models focus on a 4-bit codebook scenario, as shown in **Figure 11**. For the same codebook size, larger MIMO arrays transmit at lower data rates, resulting in lower spectral efficiency $\eta$; however, larger MIMO sizes also result in improved BER for both uplink and downlink. For a codebook size of 16, a 2×2 MIMO system yields BERs of 21.80% and 12.10% with a spectral efficiency of 0.76 bits/s/Hz, whereas a 4×4 MIMO system yields much-improved BERs of 10.20% and 9.80% with a spectral efficiency of 0.18 bits/s/Hz, for the downlink and uplink transmission cases, respectively.
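As a small worked check of the OFDM numbers above, assuming $L = 64$ subchannels, 25% cyclic-prefix overhead, and $f_s = 960$ kHz, the symbol-rate formula gives 12 kSym/s:

```python
L, cp_overhead, fs = 64, 0.25, 960e3      # subchannels, CP overhead, sampling rate
cp = int(L * cp_overhead)                 # 16 cyclic-prefix samples
symbol_rate = fs / (L + cp)               # 1/T_symb = fs / (L + CP)
print(f"symbol rate: {symbol_rate / 1e3:.0f} kSym/s")   # 12 kSym/s
```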

#### **5. Conclusions**

In conclusion, AI/ML-based Physical Layer Security is a promising candidate for a secure 6G. AI/ML can optimize the parameters and contribute to superior transmission and secrecy performance, resulting in high levels of security and reliability in 6G communication. The proposed sApps contribute significantly to reducing latency by operating in the lower layers of the communication stack. The model realizes the goal of 6G SURLLC. Future work includes extending the structure to time-varying channel models, higher mobile velocities, and higher levels of security through intelligent PLS infrastructures.

#### **Acknowledgements**

This research was supported by the National Science Foundation REU program for AI Powered Robotics in 5G Networks, OUSD R&E FutureG Advanced Research, and the UTSA Klesse College of Engineering at the University of Texas at San Antonio.

#### **Acronym definitions**



#### **Author details**

Israt Ara\* and Brian Kelley\* University of Texas at San Antonio, San Antonio, USA

\*Address all correspondence to: isratmist@gmail.com; brian.kelley@utsa.edu

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] Bertin E, Crespi N, Magedanz T. Toward 6G: Collecting the research visions. In: Bertin E, Crespi N, Magedanz T, editors. Shaping Future 6G Networks: Needs, Impacts, and Technologies. Wiley; 2022. pp. 1-8

[2] Saad W, Bennis M, Chen M. A vision of 6G wireless systems: Applications, trends, technologies, and open research problems. IEEE Network. 2020;**34**(3):134-142

[3] Sharma A, Aswani K. Top Companies and Universities Mapping the 6G Technology. Available from: https://www.greyb.com/blog/6gcompanies/

[4] 5G KPIs vs 6G KPIs — Difference between 5G and 6G KPIs. Available from: https://www.rfwireless-world.com/Terminology/5G-KPIs-vs-6G-KPIs.html

[5] Adhikari M, Hazra A. 6G-enabled ultra-reliable low-latency communication in edge networks. IEEE Communications Standards Magazine. 2022;**6**(1):67-74

[6] Yang H, Alphones A, Xiong Z, Niyato D, Zhao J, Wu K. Artificial-intelligence-enabled intelligent 6G networks. IEEE Network. 2020;**34**(6):272-280

[7] Letaief KB, Chen W, Shi Y, Zhang J, Zhang Y-JA. The roadmap to 6G: AI empowered wireless networks. IEEE Communications Magazine. 2019;**57**(8):84-90. Available from: https://ieeexplore.ieee.org/document/8808168/

[8] Sheth K, Patel K, Shah H, Tanwar S, Gupta R, Kumar N. A taxonomy of AI techniques for 6G communication networks. Computer Communications. 2020;**161**:279-303. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0140366420318478

[9] Akyildiz IF, Kak A, Nie S. 6G and beyond: The future of wireless communications systems. IEEE Access. 2020;**8**:133995-134030. Available from: https://ieeexplore.ieee.org/document/9145564/

[10] Viswanathan H, Mogensen PE. Communications in the 6G era. IEEE Access. 2020;**8**:57063-57074. Available from: https://ieeexplore.ieee.org/document/9040431/

[11] Sun Y, Liu J, Wang J, Cao Y, Kato N. When machine learning meets privacy in 6G: A survey. IEEE Communications Surveys & Tutorials. 2020;**22**(4):2694-2724. Available from: https://ieeexplore.ieee.org/document/9146540/

[12] Pin Tan DK, He J, Li Y, Bayesteh A, Chen Y, Zhu P, et al. Integrated sensing and communication in 6G: Motivations, use cases, requirements, challenges and future directions. In: 2021 1st IEEE International Online Symposium on Joint Communications & Sensing (JC&S). IEEE; 2021. pp. 1-6. Available from: https://ieeexplore.ieee.org/document/9376324/

[13] Matthaiou M, Yurduseven O, Ngo HQ, Morales-Jimenez D, Cotton SL, Fusco VF. The road to 6G: Ten physical layer challenges for communications engineers. IEEE Communications Magazine. 2021;**59**(1):64-69. Available from: https://ieeexplore.ieee.org/document/9356519/

[14] Chen R, Li C, Yan S, Malaney R, Yuan J. Physical layer security for ultrareliable and low-latency communications. IEEE Wireless Communications. 2019;**26**:6-11

[15] Jiang W, Schotten HD. The kick-off of 6G research worldwide: An overview. In: 2021 7th International Conference on Computer and Communications (ICCC). 2021. pp. 2274-2279


[16] 3GPP List of Work Items. Available from: https://www.3gpp.org/dynareport?code=WIList.htm

[17] Yerrapragada AK, Eisman T, Kelley B. Physical layer security for beyond 5G: Ultra secure low latency communications. IEEE Open Journal of the Communications Society. 2021;**2**:2232-2242. Available from: https://ieeexplore.ieee.org/document/9519720/

[18] Porambage P, Gur G, Osorio DPM, Liyanage M, Gurtov A, Ylianttila M. The roadmap to 6G security and privacy. IEEE Open Journal of the Communications Society. 2021;**2**:1094-1122. Available from: https://ieeexplore.ieee.org/document/9426946/

[19] Kumar MS, Ramanathan R, Jayakumar M. Keyless physical layer security for wireless networks: A survey. Engineering Science and Technology, an International Journal. 2022;**35**:101260

[20] Nazzal T, Mukhtar H. Evaluation of key-based physical layer security systems. In: 2021 4th International Conference on Signal Processing and Information Security (ICSPIS). 2021. pp. 84-87

[21] Bjornson E, Bengtsson M, Ottersten B. Optimal multiuser transmit beamforming: A difficult problem with a simple solution structure [lecture notes]. IEEE Signal Processing Magazine. 2014;**31**(4):142-148. DOI: 10.1109/MSP.2014.2312183

[22] Wu C-Y, Lan P-C, Yeh P-C, Lee C-H, Cheng C-M. Practical physical layer security schemes for MIMO-OFDM systems using precoding matrix indices. IEEE Journal on Selected Areas in Communications. 2013;**31**(9):1687-1700. Available from: http://ieeexplore.ieee.org/document/6584930/

[23] Samsung. MIMO for Long Term Evolution. 3GPP TSG RAN WG1 #42; 2005. pp. 1-6

[24] Wyner AD. The wire-tap channel. The Bell System Technical Journal. 1975; **54**(8):1355-1387

[25] Kelley B, Ara I. An intelligent and private 6G air interface using physical layer security. In: MILCOM 2022 - 2022 IEEE Military Communications Conference (MILCOM). 2022. pp. 968-973

[26] Zhang M, Cumanan K, Thiyagalingam J, Tang Y, Wang W, Ding Z, et al. Exploiting deep learning for secure transmission in an underlay cognitive radio network. IEEE Transactions on Vehicular Technology. 2021;**70**(1):726-741

[27] Marzetta TL. Noncooperative cellular wireless with unlimited numbers of base station antennas. IEEE Transactions on Wireless Communications. 2010;**9**(11): 3590-3600

[28] Akyildiz IF, Jornet JM, Han C. Terahertz band: Next frontier for wireless communications. Physical Communication. 2014;**12**:16-32

[29] de Lima C, Belot D, Berkvens R, Bourdoux A, Dardari A, Guillaud M, et al., editors. 6G White Paper on Localization and Sensing [White paper]. (6G Research Visions, No. 12). University of Oulu; 2020. Available from: http://urn.fi/urn:isbn:9789526226743

[30] D'Oro S, Bonati L, Restuccia F, Melodia T. Coordinated 5G network slicing: How constructive interference can boost network throughput. IEEE/ACM Transactions on Networking. 2021;**29**(4):1881-1894

[31] O-RAN Alliance. O-RAN Architecture Description v05.00. Technical Specification; 2021

[32] Lee H, Cha J, Kwon D, Jeong M, Park I. Hosting AI/ML workflows on O-RAN RIC platform. 2020;**12**:1-6

[33] Bonati L, D'Oro S, Polese M, Basagni S, Melodia T. Intelligence and learning in O-RAN for data-driven NextG cellular networks. IEEE Communications Magazine. 2021;**59**(10):21-27

[34] 3rd Generation Partnership Project (3GPP). NG-RAN; Architecture Description. 3GPP, Technical Specification (TS); 2022

[35] 5G NR logical Architecture and its Functional Splits. Parallel Wireless, Inc., White Paper; 2021

[36] Orhan O, Swamy VN, Tetzlaff T, Nassar M, Nikopour H, Talwar S. Connection management xAPP for O-RAN RIC: A graph neural network and reinforcement learning approach. In: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA). 2021. pp. 936-941

[37] Abdalla AS, Upadhyaya PS, Shah VK, Marojevic V. Toward next generation open radio access networks: What O-RAN can and cannot do! IEEE Network. 2022;**36**(6):206-213

[38] Polese M, Bonati L, D'Oro S, Basagni S, Melodia T. Understanding O-RAN: Architecture, interfaces, algorithms, security, and research challenges. IEEE Communications Surveys & Tutorials. 2023;**25**(2):1376-1411

[39] D'Oro S, Polese M, Bonati L, Cheng H, Melodia T. dApps: Distributed applications for real-time inference and control in O-RAN. IEEE Communications Magazine. 2022;**60**(11):52-58. DOI: 10.1109/MCOM.002.2200079

[40] O-RAN Alliance: O-RAN Working Group 1 Massive MIMO Use Cases Technical Report, O-RAN.WG1. MMIMO-USE-CASES-TR-v01.00. 2022. pp. 11-87

#### **Chapter 9**
