**1. Introduction**

Differential privacy, introduced more than a decade ago, continues to play an important role in protecting the privacy of micro-data while enabling statistical analysis. Initially applied by statistical agencies such as the US Census Bureau, it is now well recognized that, although useful for some applications, differential privacy comes with significant limitations (e.g., [1]).

To understand some of the limitations of differential privacy, consider the following: in practice, due to concerns about risk, most implementations end up with a much higher privacy budget than is necessary.

**Figure 1.** *Fake celebrity images created using generative modeling; none of these images are real people.*


Fortunately, deep generative models – a recent advance in deep neural networks – provide an alternative to directly sharing micro-data, one that avoids its privacy risk.

With generative models, a deep neural network uses the existing micro-data to approximate, with high accuracy, the underlying probability distribution of the data in some high-dimensional latent space. Once the probability distribution is approximated, the trained model can be used to generate any number of *synthetic* records by randomly sampling from that distribution. Those generated records are related to the original data only through the shared underlying probability distribution, and thus do not include any information that can be linked back to the original (private) records.
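The fit-then-sample loop described above can be sketched in miniature. The snippet below is a toy stand-in for a deep generative model, using a fitted multivariate Gaussian as the "learned" distribution and invented numeric fields as the private records; a real deep model would learn a far richer distribution, but the two-step structure (approximate the distribution, then sample synthetic records from it) is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the private micro-data: 1,000 records with 3 numeric fields.
# (The true distribution below is known only because we build the example.)
private_data = rng.multivariate_normal(
    mean=[50.0, 120.0, 0.8],
    cov=[[9.0, 2.0, 0.0], [2.0, 25.0, 0.1], [0.0, 0.1, 0.04]],
    size=1000,
)

# "Training": approximate the underlying distribution P(x) from the data.
# Here the model is just a Gaussian parameterized by sample mean/covariance.
mu_hat = private_data.mean(axis=0)
cov_hat = np.cov(private_data, rowvar=False)

# "Generation": draw any number of synthetic records from the fitted Q(x).
# The synthetic rows share only the distribution, not individual records.
synthetic_data = rng.multivariate_normal(mu_hat, cov_hat, size=5000)

print(synthetic_data.shape)  # (5000, 3)
```

Note that the synthetic sample can be larger than the original dataset: once the distribution is fitted, generation is unlimited.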

To further illustrate how synthetic data generation works, consider CelebA,<sup>1</sup> a dataset with more than 200,000 real celebrity face images, each with 40 automatically extracted attribute annotations. Using generative models, researchers have demonstrated the ability to learn the underlying distribution well enough to generate photorealistic – but entirely fake – celebrity faces, as shown in **Figure 1** above.


**Figure 2.** *Sample Gaussian fitting.*

*Beyond Differential Privacy: Synthetic Micro-Data Generation with Deep Generative Neural…*

This same technique can be applied to many other types of data – music, text, and video, as well as healthcare, financial, or insurance data. In this chapter we will explore synthetic data generation and its applications, and how releasing synthetic micro-data can provide an alternative to differential privacy.

In Section 2, we explore synthetic data in more detail, and how generative models can create synthetic data. In Section 3, we discuss using variational autoencoders as generative models, followed by Section 4, where we discuss generative adversarial networks. In Section 5, we discuss the application of generative models to healthcare data, and in Section 6, we discuss privacy in the context of synthetic data, along with some approaches that combine differential privacy with synthetic data generation. Section 7 is a summary and discussion of future directions in synthetic data generation.

Generative models are a class of mathematical models that approximate the probability distribution of some dataset and can be used to generate samples of data according to the modeled (or approximated) distribution. Such generated data is often called "synthetic data," "fake data," or "realistic but not real."

For a given data domain, consider a dataset A with N data records. For most practical cases, the dataset can be assumed to be drawn from some (usually unknown) probability distribution P(x). A *synthetic* dataset S is a dataset similar to A in terms of fields and structure, where records in S are randomly drawn from some probability distribution Q(x).

In an ideal world where Q(x) = P(x), we could clearly use S for various purposes of analysis and modeling, because both datasets are sampled from the same distribution. The key idea behind generating synthetic data is thus a question: can we estimate the probability distribution P(x) with high enough fidelity that Q(x) ≈ P(x)?

Let us look at a simple example: consider a one-dimensional series of values A, where A is drawn from a normal (Gaussian) distribution with mean *μ* and standard deviation *σ*. In other words, we know in this case that P(x) is the normal distribution with

*P*<sub>*μ*,*σ*</sub>(*x*) = (1 / (*σ*√(2*π*))) · *e*<sup>−½((*x*−*μ*)/*σ*)²</sup>

and that the values in A should fit this distribution. We can then use Gaussian fitting to estimate the values of *μ* and *σ* from the data, as is demonstrated in **Figure 2**.
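Concretely, Gaussian fitting here amounts to estimating *μ* and *σ* from A alone, then sampling the synthetic dataset S from the fitted distribution. A minimal sketch (the true *μ* = 10 and *σ* = 2 are chosen arbitrarily, only so the example can generate A):

```python
import numpy as np

rng = np.random.default_rng(42)

# The "real" data A, drawn from N(mu, sigma). In practice mu and sigma
# are unknown; they are fixed here only to construct the example.
mu, sigma = 10.0, 2.0
A = rng.normal(mu, sigma, size=10_000)

# Gaussian fitting: estimate mu and sigma from the data alone.
mu_hat = A.mean()
sigma_hat = A.std(ddof=1)  # sample standard deviation

# Synthetic dataset S: sampled from the fitted distribution Q(x).
S = rng.normal(mu_hat, sigma_hat, size=10_000)

print(f"mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}")
```

With 10,000 samples the estimates land very close to the true parameters, so S is statistically interchangeable with A even though no value in S is copied from A.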


*DOI: http://dx.doi.org/10.5772/intechopen.92255*

**2. Generative models for synthetic data**





<sup>1</sup> http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html



