• Many mechanisms of differential privacy require noise to be added to the data in cases where the original data is highly skewed, resulting in reduced utility of the outputs, and in some cases rendering the whole exercise useless.

• Use-cases demonstrate that due to concerns about risk, most implementations end up with a much higher budget than is necessary.

• In many specific fields of statistical analysis, users of micro-data are highly trained to use specific tools (STATA, SAS, R, and Python) and query procedures, which often do not support the complexity of differential-privacy-protected mechanisms. This presents a behavior-change challenge whereby analysts need to be convinced to abandon their familiar methods and tools (which they may have been using for decades) in favor of the interactive system where the privacy-protected data is available.

Fortunately, deep generative models – a recent and novel approach in deep neural networks – provide an alternative for direct sharing of micro-data without privacy risk.

**2. Generative models for synthetic data**

With generative models, a deep neural network algorithm uses the existing micro-data to approximate, with high accuracy, the underlying probability distribution of the data in some high-dimensional latent space. Once the probability distribution is approximated, the trained model can be used to generate any number of *synthetic* records by randomly sampling from that distribution. Those generated records are related to the original data only through the shared underlying probability distribution, and thus do not include any information that can be linked back to the original (private) records.

To further illustrate how synthetic data generation works, consider CelebA,<sup>1</sup> a dataset with more than 200,000 celebrity face images, each with 40 automatically extracted attribute annotations. Using generative models, researchers have demonstrated the ability to learn the underlying distribution well enough to generate photorealistic celebrity faces, as shown in **Figure 1**.

**Figure 1.** *Fake celebrity images created using generative modeling; none of these images are real people.*

<sup>1</sup> http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

Generative models are a class of mathematical models that approximate a probability distribution of some dataset and can be used to generate samples of data according to the modeled (or approximated) distribution. Such generated data is often called "synthetic data," "fake data," or "realistic but not real."

For a given data domain, consider a dataset A with N data records. For most practical cases, the dataset can be assumed to be drawn from some (usually unknown) probability distribution P(x). A *synthetic* dataset S is a dataset similar to A in terms of fields or structure, where records in S are randomly drawn from some probability distribution Q(x).

In an ideal world where Q(x) = P(x), we can clearly use S in place of A for any analysis or modeling purpose, because both datasets are sampled from the same distribution. The key question behind generating synthetic data is therefore: can we estimate the probability distribution P(x) accurately enough that Q(x) ≈ P(x)?

Let us look at a simple example – consider a one-dimensional series of values A, where A is drawn from a normal (Gaussian) distribution with mean *μ* and standard deviation *σ*. In other words, we know in this case that P(x) is the normal density

$$P_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}},$$

and that the values in A should fit this distribution. We can then use Gaussian fitting to estimate the values of *μ* and *σ* from the data, as demonstrated in **Figure 2**.

**Figure 2.** *Sample Gaussian fitting.*

Once we have a good approximation for the parameters of the distribution (*μ* and *σ*), we can sample from this distribution to generate completely new data points that are fully consistent with the Gaussian distribution describing the original data.
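To make this concrete, here is a minimal sketch of the fit-then-sample loop in Python, using NumPy and SciPy (the true parameter values and sample sizes are illustrative assumptions, not taken from the text):

```python
import numpy as np
from scipy import stats

# Stand-in for the "real" one-dimensional dataset A. In practice its
# distribution is unknown; here we secretly draw it from a Gaussian.
rng = np.random.default_rng(seed=0)
A = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Step 1: estimate mu and sigma from the data (maximum-likelihood fit).
mu_hat, sigma_hat = stats.norm.fit(A)

# Step 2: sample brand-new synthetic records S from the fitted distribution.
S = stats.norm.rvs(loc=mu_hat, scale=sigma_hat, size=10_000, random_state=rng)

print(f"fitted: mu={mu_hat:.2f}, sigma={sigma_hat:.2f}")
print(f"synthetic: mean={S.mean():.2f}, std={S.std():.2f}")
```

Note that no synthetic point is copied from A; every record in S is an independent draw from the estimated distribution, which is exactly the privacy argument made above.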

This is of course a simplified example, for two reasons. First, with a real generative model we do not know the actual form of the distribution function (e.g., Gaussian in this case); instead, we use a neural network to estimate that function. Second, real-world data is not one-dimensional but of much higher dimension.

So how do we approximate an unknown probability distribution from high-dimensional data?

The traditional approach to approximating a data distribution is simple frequency counting (histograms), but this approach does not work in high dimensions due to the curse of dimensionality: as the number of dimensions grows, the data becomes increasingly sparse, and most statistical methods fail. Frequency counting is no exception; the number of histogram bins required grows exponentially with the number of dimensions, quickly making the method infeasible.
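A quick back-of-the-envelope calculation shows how fast the bin count explodes (the choice of 10 bins per dimension is an illustrative assumption):

```python
# With k bins per dimension and d dimensions, a histogram needs k**d bins.
k = 10  # bins per dimension
for d in (1, 2, 5, 10, 50, 100):
    print(f"{d:>3} dimensions -> {k**d:.1e} bins")
```

Already at 10 dimensions the histogram needs ten billion bins, far more than the number of records in any realistic micro-dataset.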

Instead, the approach used in modern generative modeling research is to assume a functional form of the distribution *Pθ*(*x*) and learn the parameters *θ* of the function from the data. This set of parameters *θ* is in essence a compressed representation of the original dataset, often called the "latent space representation."

To further illustrate this, let us go back to our example of celebrity images. Assume that the images are black and white (so that each pixel is represented by either 0 or 1) and of size 28 × 28 = 784 pixels. If we represent each image as a vector of 784 binary values, the number of possible values for a vector in this space is 2<sup>784</sup> ≈ 10<sup>236</sup>; if we wanted to approximate P(x) directly for each possible vector x in this space, we would need to estimate it for 10<sup>236</sup> such vectors, which is clearly not realistic in practice (thus "the curse of dimensionality"). Instead, we can define some *Pθ*(*x*) with a much smaller set of parameters *θ* and estimate those parameters in such a way that *Pθ*(*x*) ≈ *P*(*x*). It turns out that deep neural networks are a good match for this kind of problem and can be used to accurately estimate the parameters of the distribution *Pθ*(*x*); there are many possible neural network architectures suitable for this task, the most common of which are auto-encoders and generative adversarial networks.
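As a toy illustration of "assume a functional form with few parameters," consider the crudest possible choice for binary images: a fully factorized Bernoulli model with one parameter per pixel, i.e., 784 parameters instead of ~10<sup>236</sup> table entries. This is a deliberately naive sketch: it treats every pixel as independent, which is precisely the dependency structure that deep generative models are needed to capture.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for a dataset of binarized 28x28 images, shape (N, 784).
X = (rng.random((1000, 784)) < 0.3).astype(np.uint8)

# Factorized Bernoulli model:
#   P_theta(x) = prod_i theta_i**x_i * (1 - theta_i)**(1 - x_i)
# The maximum-likelihood estimate of theta is just the per-pixel mean.
theta = X.mean(axis=0)  # 784 parameters describe the whole model

# Generating synthetic images is now a single vectorized sampling step.
synthetic = (rng.random((16, 784)) < theta).astype(np.uint8)
```

Because real faces are nothing like pixel-independent, samples from this model look like structured noise; auto-encoders and GANs exist precisely to learn a *θ* that also captures the dependencies between pixels.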

Images are a very vivid (pun intended) demonstration of the power of generative models and how they can generate high-utility synthetic data; but these techniques can also be successfully applied to many other fields, such as music, poetry, cartoon characters, or even synthetic "video miles" for self-driving cars.

The performance of recent techniques in generative modeling is quite impressive, and their success led to a growth in applications of generative models in industry. For example, self-driving car companies use synthetic data to significantly increase the size of training data they have available, covering many more scenarios and edge-cases for improving their self-driving algorithms.

The usefulness of synthetic data generally falls into one of three important categories:

• **Replacement**. If access to the real dataset is limited or restricted (e.g., when data access is highly regulated), synthetic data often provides an excellent alternative. A good example comes from healthcare: access to medical records is often heavily restricted because of personal identifiers and the risk of linkage attacks. High-fidelity synthetic medical records can provide the medical and bio-pharma research community with a replacement dataset that accurately reflects the statistical properties of the original data. This opens up an enormous opportunity to share and aggregate medical data from various clinical care sources and unlock important insights, such as how effective various therapeutics – drugs, medical devices, or clinical care protocols – actually are.



• **Augmentation**. In many predictive modeling use-cases, the dataset available for training the model is relatively small, which often results in lower model accuracy. This phenomenon is further exacerbated when using deep learning for predictive modeling, where small datasets tend to overfit quite easily. Creating synthetic training examples and combining the real and synthetic data points ("augmenting" the real dataset with synthetic data) yields a much larger training dataset overall and can significantly improve the accuracy of the predictive models.

• **Equalization/reshaping**. An interesting aspect of using generative models is that we can generate as much data as desired – often many more records than exist in the original dataset. A key characteristic of generative models is that we can direct them to shape the output dataset to certain desired criteria. For example, if the original dataset is 60% male and 40% female, we can control the gender distribution and generate a 50%/50% synthetic dataset. This enables users of the synthetic data to combat bias in the original dataset; a small sketch of this reshaping idea follows this list.
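The sketch below illustrates one simple way to reshape an output distribution: keep drawing from the trained generator and accept records for each group only until that group's quota is filled. Here `sample_records` is a hypothetical stand-in for a trained generative model, and the quota scheme is an illustrative assumption rather than a method described in the text:

```python
import random

def sample_records(n):
    # Hypothetical stand-in for a trained generator; a real model would
    # return full synthetic records mirroring the original 60%/40% split.
    return [{"gender": "M" if random.random() < 0.6 else "F"} for _ in range(n)]

def generate_reshaped(n_total, target):
    # Simple quota-based (stratified rejection) sampling toward `target` mix.
    quotas = {g: round(n_total * p) for g, p in target.items()}
    out = []
    while any(q > 0 for q in quotas.values()):
        for rec in sample_records(100):
            g = rec["gender"]
            if quotas.get(g, 0) > 0:
                quotas[g] -= 1
                out.append(rec)
    return out

balanced = generate_reshaped(10_000, {"M": 0.5, "F": 0.5})
```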

Equipped with a basic understanding of what synthetic data is, and how it's created using generative models, let us look in more detail at two of the most common types of generative models: variational auto-encoders and generative adversarial networks.

**3. Variational auto-encoders**

An autoencoder is a specific type of deep learning architecture which is split into two distinct neural networks: one is called the "encoder" and the other the "decoder," as shown in **Figure 3**.

**Figure 3.** *Auto-encoder architecture.*

In this architecture, the encoder *Eθ* is a deep neural network that encodes the input data (X) into some intermediate representation (Z, often referred to as the "latent representation") in a reduced-dimensional space, and the decoder *Dθ* is also a deep neural network that decodes the vector Z back into the output vector Y. X and Y are of the same dimensionality. The goal of training the auto-encoder is to reconstruct the input X in the output Y while transitioning through the lower-dimensional representation Z, so that we get as close as possible to *Y* = *Dθ*(*Eθ*(*X*)). If we optimize the auto-encoder so that the reconstruction loss between input X and output Y is minimized, the latent vector Z is forced to capture the essential information of X in far fewer dimensions.
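As a concrete sketch of this structure, here is a minimal fully connected auto-encoder in PyTorch for the 784-pixel images discussed earlier (the layer widths, the latent size of 32, and the choice of binary cross-entropy as the reconstruction loss are illustrative assumptions, not taken from the text):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        # Encoder E_theta: compresses X (784 dims) into Z (32 dims).
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder D_theta: reconstructs Y (784 dims) from Z.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)      # Z = E_theta(X)
        return self.decoder(z)   # Y = D_theta(E_theta(X))

model = AutoEncoder()
loss_fn = nn.BCELoss()           # reconstruction loss between X and Y
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)          # stand-in batch; real code would load images
optimizer.zero_grad()
loss = loss_fn(model(x), x)      # push the output Y to match the input X
loss.backward()
optimizer.step()
```

A *variational* auto-encoder, the subject of this section, extends this plain auto-encoder by making the latent vector Z probabilistic, which is what makes it possible to sample new synthetic records from the latent space.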





