**3. Variational auto-encoders**

*Security and Privacy From a Legal, Ethical, and Technical Perspective*

Once we have a good approximation for the parameters of the distribution (*μ* and *σ*), we can sample from this distribution to generate completely new data points that are fully consistent with the Gaussian distribution describing the original data.

This is of course a simplified example for two reasons. First, with a real generative model we do not know the actual form of the distribution function (e.g., Gaussian in this case); instead we use the neural network to estimate that function. Second, in the real world the data is not one-dimensional, but of much higher dimension. So how do we approximate an unknown probability distribution from high-dimensional data?
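To make the one-dimensional Gaussian example concrete, here is a minimal sketch in plain Python (the data and its parameters are made up for illustration) of the two steps: estimate *μ* and *σ* from the data, then sample new points from the fitted distribution:

```python
import random
import statistics

# Hypothetical "real" 1-D data: in practice this would be observed data
# whose true distribution we do not know. Here we simulate it.
random.seed(0)
real_data = [random.gauss(170, 10) for _ in range(10_000)]

# Step 1: estimate the parameters of the assumed Gaussian from the data.
mu_hat = statistics.mean(real_data)
sigma_hat = statistics.stdev(real_data)

# Step 2: generate completely new data points by sampling the fitted
# distribution; these are synthetic, yet consistent with the original data.
synthetic = [random.gauss(mu_hat, sigma_hat) for _ in range(5)]

print(f"estimated mu = {mu_hat:.1f}, sigma = {sigma_hat:.1f}")
print("synthetic points:", [round(v, 1) for v in synthetic])
```

With a real generative model the estimation step is replaced by training a neural network, but the generate-by-sampling step stays conceptually the same.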

The traditional approach to approximating a data distribution is simple frequency counting (histograms), but of course this approach does not work in high dimensions due to the curse of dimensionality: most statistical methods fail on high-dimensional data because of its increasing sparseness. This is also the case with frequency counting, where with many dimensions the number of histogram bins needed quickly explodes, making the method unfeasible.

Instead, the approach used in modern generative modeling research is to assume a functional form of the distribution *Pθ*(*x*) and learn the parameters *θ* of that function from the data. This set of parameters *θ* is in essence a compressed representation of the original dataset, often called the "latent space representation."

To further illustrate this, let us go back to our example of celebrity images. Assume that the images are black and white (so that each pixel is represented by either 0 or 1), and of size 28 × 28 = 784 pixels. If we represent each image as a vector of 784 binary values, the number of possible values for a vector in this space is 2^784 ≈ 10^236; if we wanted to approximate P(x) for each possible vector x in this space, we would need to estimate it for about 10^236 such vectors, which is clearly not realistic in practice (hence "the curse of dimensionality"). Instead, we can define some *Pθ*(*x*) with a much smaller set of parameters *θ* and estimate those parameters in such a way that *Pθ*(*x*) ≈ *P*(*x*). It turns out that deep neural networks are a good match for this kind of problem and can be used to accurately estimate the parameters of the distribution *Pθ*(*x*); there are many possible neural network architectures suitable for this task, the most common of which are auto-encoders and generative adversarial networks.

Images are a very vivid (pun intended) demonstration of the power of generative models and of how they can generate high-utility synthetic data; but these techniques can also be successfully applied to many other fields such as music, poetry, cartoon characters, or even synthetic "video miles" for self-driving cars.

The performance of recent techniques in generative modeling is quite impressive, and their success has led to a growth in applications of generative models in industry. For example, self-driving car companies use synthetic data to significantly increase the size of the training data they have available, covering many more scenarios and edge cases for improving their self-driving algorithms.

The usefulness of synthetic data generally falls into one of three important categories:

• **Replacement**. If access to the real dataset is limited or restricted (e.g., when data access is highly regulated), synthetic data often provides an excellent alternative. A good example comes from healthcare: access to medical records is often heavily restricted because of personal identifiers and the risk of linkage attacks. Synthetic medical records with high fidelity can provide the medical and bio-pharma research community with a replacement dataset that accurately reflects the statistical properties of the original data. This opens up an enormous opportunity to share and aggregate medical data from various clinical care sources and unlock important insights, such as the effectiveness of various therapeutics (drugs, medical devices, or clinical care protocols).
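Returning to the dimensionality argument above, a quick back-of-the-envelope check in Python shows why histograms break down (the 10-bins-per-axis resolution is an arbitrary illustration) and verifies the count for the 28 × 28 binary-image example:

```python
from math import log10

BINS_PER_AXIS = 10  # a modest resolution along each dimension

# The number of histogram cells grows exponentially with dimension d.
for d in (1, 2, 3, 10, 784):
    print(f"d = {d:>3}: ~10^{d * round(log10(BINS_PER_AXIS))} cells")

# The 28x28 binary-image space from the text: 2**784 distinct vectors.
digits = len(str(2 ** 784))
print(f"2^784 is a {digits}-digit number, i.e., about 10^{digits - 1}")
```

Filling 10^784 cells, or estimating 10^236 separate probabilities, with any realistic dataset is hopeless; this is exactly the sparseness that the curse of dimensionality refers to.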


An auto-encoder is a specific type of deep learning architecture that is split into two distinct neural networks: one is called the "encoder" and the other the "decoder," as shown in **Figure 3**.

In this architecture, the encoder *Eθ* is a deep neural network that encodes the input data (X) into some intermediate representation (Z, often referred to as the "latent representation") in a reduced-dimensional space, and the decoder *Dθ* is also a deep neural network, one that decodes the vector Z back into an output vector Y of the same dimensionality as X. The goal of training the auto-encoder is to reconstruct the input X in the output Y while transitioning through the lower-dimensionality representation Z, so that we get as close as possible to *Y* = *Dθ*(*Eθ*(*X*)). If you optimize this auto-encoder so that the discrepancy between input X and output Y (the reconstruction error) is minimized, then you are in effect searching for an optimal compressed representation of the input data.

**Figure 3.** *Auto-encoder architecture.*

**Figure 4.** *Variational auto-encoder architecture.*
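The reconstruction objective just described can be sketched with a toy *linear* auto-encoder trained by gradient descent (NumPy; the data sizes, learning rate, and iteration count are all illustrative, and a real auto-encoder would use deep nonlinear networks for *Eθ* and *Dθ*):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 8 dimensions that lie near a 2-D subspace,
# so a 2-dimensional latent code Z can reconstruct them well.
n, d, k = 200, 8, 2
basis = rng.normal(size=(k, d))
X = rng.normal(size=(n, k)) @ basis + 0.01 * rng.normal(size=(n, d))

E = 0.1 * rng.normal(size=(d, k))  # encoder weights: d -> k
D = 0.1 * rng.normal(size=(k, d))  # decoder weights: k -> d

def mse(E, D):
    """Reconstruction error between X and Y = X @ E @ D."""
    return float(np.mean((X @ E @ D - X) ** 2))

before = mse(E, D)
lr = 0.01
for _ in range(2000):
    Z = X @ E                       # latent representation, shape (n, k)
    err = Z @ D - X                 # Y - X, shape (n, d)
    grad_D = Z.T @ err / n          # MSE gradient w.r.t. D (up to a constant)
    grad_E = X.T @ (err @ D.T) / n  # MSE gradient w.r.t. E (up to a constant)
    D -= lr * grad_D
    E -= lr * grad_E

print(f"reconstruction MSE: {before:.3f} -> {mse(E, D):.5f}")
```

Driving the reconstruction error down forces the 8-D inputs through the 2-D bottleneck Z; in the linear case this recovers the same subspace as PCA, which is the "optimal compressed representation" intuition.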

Traditional auto-encoders have been around since the early days of neural networks, but in their basic form they cannot be used to generate synthetic data. In 2013 the idea of *variational* auto-encoders (VAEs) started to take shape, primarily with the work of [2, 3], as a way to use auto-encoders as generative models.

With VAEs, instead of mapping the input vector X to a fixed vector Z, we want to map it into a distribution *q*θ(*z*|*x*), often assumed to be a multivariate normal distribution with mean *μ* and standard deviation *σ*; then, to generate a synthetic output Y, we simply sample from this learned distribution and decode the sampled vector, as shown in **Figure 4**.
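A minimal sketch of this sampling step (NumPy; the `encoder` and `decoder` here are hypothetical stand-ins for trained networks, with made-up weights, not a real VAE):

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 2

def encoder(x):
    # A trained VAE encoder would output the parameters of q(z|x);
    # here we fake them: a mean vector and a log-standard-deviation.
    mu = x[:latent_dim]
    log_sigma = np.full(latent_dim, -1.0)
    return mu, log_sigma

def decoder(z):
    # A trained decoder maps latent vectors back to data space;
    # here it is just a fixed linear map for illustration.
    W = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, -0.5]])
    return z @ W

x = np.array([0.3, -0.7, 0.1])

# Reparameterization: sample z = mu + sigma * eps with eps ~ N(0, I).
mu, log_sigma = encoder(x)
eps = rng.standard_normal(latent_dim)
z = mu + np.exp(log_sigma) * eps

# To generate brand-new data, sample z from the prior N(0, I) and decode.
y = decoder(rng.standard_normal(latent_dim))
print("synthetic output Y:", np.round(y, 3))
```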

VAEs, being among the first deep neural network architectures for practical generative models, created a lot of excitement about synthetic data and were used primarily to generate synthetic images. Although elegant and theoretically pleasing, the synthetic images generated by VAEs tend to be blurry, which quickly became a limiting factor for their use in synthetic imaging. Various improvements to the basic VAE approach, such as beta-VAE [4] and VQ-VAE [5], have been proposed to address these issues; this line of work also led researchers to the idea of generative adversarial networks, which we discuss next.

**4. Generative adversarial networks**

The idea of a generative adversarial network is inspired by game theory: we build two models, a generator and a discriminator, that compete with each other in an adversarial manner to collaboratively optimize the whole system. The generator G is a generative neural network that outputs synthetic samples given a noise variable Z. The discriminator D is a different neural network that is trained to discriminate between real and synthetic samples. During training, the generator is trying to generate samples that mimic as much as possible the real data, so that it can fool the discriminator, whereas the discriminator is trained not to be fooled and be able to distinguish between real and synthetic data samples. This is shown in **Figure 5**.

As you can see from **Figure 5**, a key idea in this architecture is that the discriminator D shares gradient updates with the generator, so that the generator can "understand" how its generated data fails to fool the discriminator and improve its generation over time, resulting in better and better synthetic samples.
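This adversarial game can be written compactly as the minimax objective introduced in [6]:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

where *D*(*x*) is the discriminator's estimated probability that *x* is real: the discriminator is trained to maximize this value, while the generator is trained to minimize it by making *D*(*G*(*z*)) large.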


*Beyond Differential Privacy: Synthetic Micro-Data Generation with Deep Generative Neural…*

**Figure 5.** *Generative adversarial networks.*

GANs were first formulated by Ian Goodfellow and colleagues [6], and since then have been an active area of research; they have demonstrated the ability to generate significantly better synthetic images than VAEs, and have been used in a variety of applications such as generating synthetic celebrity faces, fake Pokémon characters, time-series medical events [7], and electronic medical records [8]. Due to the impressive realism of the synthetic data generated by GANs, they have also initiated an active and important discussion of malicious uses of generative models and their privacy implications. We will discuss this important aspect of generative models in Section 6.

One difficulty with GANs is that they are quite hard to train, and often require significant time and effort to manually tune until they reach the desired outcome; some of the most common issues when training GANs are:

• **Nash equilibrium:** the Generator and Discriminator work against each other in a competitive manner, and it is often rather difficult to reach the Nash equilibrium of this two-player minimax game. Training GANs to achieve this equilibrium tends to require extensive experimentation and good intuition about how GANs work.

• **Vanishing gradient:** when the Discriminator is doing very well in its role of discriminating between real and synthetic data, its gradients are very close to 0, and thus learning in the Generator slows down significantly or sometimes even stops completely.

• **Mode collapse:** a common failure mode in GANs where the Generator produces samples that "fool" the Discriminator but fails to generate the full breadth of possible samples, getting stuck in a small sub-space of the possible synthetic samples. For example, consider a face generator that produces excellent photorealistic images, but only of people with gray hair. Since the images are of great quality, the discriminator will consider them indistinguishable from real images; however, they represent only a fraction of the types of images in the training set, which include many more hair colors.

*DOI: http://dx.doi.org/10.5772/intechopen.92255*



Various approaches and hacks have been proposed to address these training problems, with varying levels of success. One important improvement over the basic GAN approach is the Wasserstein GAN (WGAN) [9], which uses a different loss function based on the Wasserstein distance and has been shown to be more robust to mode collapse.
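For reference, the WGAN formulation replaces the standard GAN loss with the Wasserstein objective

$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[D(x)\big] - \mathbb{E}_{z \sim p_z}\big[D(G(z))\big]$$

where $\mathcal{D}$ is the set of 1-Lipschitz functions (enforced in [9] by clipping the critic's weights). Because this objective provides meaningful gradients even when the real and synthetic distributions barely overlap, training tends to be more stable.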
