2. Anonymization

### 2.1. Anonymization

Generally, the fundamental theme of data anonymization [2–5] is the use of one or more techniques designed to make it impossible, or at least more difficult, to identify a particular individual from stored data related to them. The purposes of data anonymization are:


It is a privacy-preservation technique used for static data. Techniques implemented in anonymization are:


### 2.2. K-anonymity

A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release. For example, suppose you try to identify a person in a released dataset, but you only know his or her birth date and gender; if at least k people in the release share those values, the release satisfies k-anonymity [6, 7].
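As an illustration (not taken from the original chapter), the k-anonymity property can be checked by grouping records on their quasi-identifier values and verifying that every group contains at least k records. The dataset and helper function below are hypothetical:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check whether every combination of quasi-identifier values
    appears in at least k records of the release."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical release: birth year and gender are the quasi-identifiers.
release = [
    {"birth_year": 1975, "gender": "M", "disease": "flu"},
    {"birth_year": 1975, "gender": "M", "disease": "asthma"},
    {"birth_year": 1980, "gender": "F", "disease": "diabetes"},
    {"birth_year": 1980, "gender": "F", "disease": "flu"},
]

print(is_k_anonymous(release, ["birth_year", "gender"], 2))  # prints True
```

Here each (birth year, gender) combination occurs in two records, so the release is 2-anonymous but not 3-anonymous.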

Data Privacy for Big Data Publishing Using Newly Enhanced PASS Data Mining Mechanism

http://dx.doi.org/10.5772/intechopen.77033

### 2.2.1. Classification of attributes

Key attributes, such as name, address, and cell phone number, can uniquely identify an individual directly. They are always removed before release.

Quasi-identifiers, such as zip code, birth date, and gender, are a set of attributes that can potentially be linked with external information to re-identify entities. According to the 1991 census summary data, 87% of the population of the USA can be uniquely identified based on these attributes. Two tables are shown below: Table 1 is a hospital dataset and Table 2 is a voter dataset.

Table 1. Medical dataset.

Table 2. Voter dataset.

From the above tables, we can conclude that Andre has heart disease; here, heart disease is the sensitive attribute. Combining two different tables in this way is known as a linking attack. The solution is to consider all previously released tables before releasing a new one and to try to avoid linking. Furthermore, k-anonymity does not provide privacy if the sensitive values in an equivalence class lack diversity [8, 9].

3. Related work

### 3.1. FANNST algorithm

Parameters used in the algorithm are k, μ, and δ:

k defines the parameter for cluster anonymization.

μ defines the processing window size.

δ defines the number of clusters which can be used later.

### 3.1.1. Algorithm

When the number of tuples in the processing window reaches μ, one round of the clustering algorithm is started; the window then slides again in order to accumulate more tuples for the next round [10].

### 3.1.2. Drawback

The main drawback of FANNST is that some tuples may remain in the system for longer than the allowable time constraint. In addition, the time and space complexity of the algorithm is O(S*S), which is not efficient for a data-streaming algorithm. Another weakness of FANNST is that it does not support categorical data.

### 3.2. FADS algorithm

The algorithm uses a set as a buffer and saves at most δ tuples in it [11, 12]. Another set (setkc) is used to hold newly created clusters for later reuse. Each k-anonymized cluster remains in setkc up to the reuse constraint Tkc, after which the cluster is removed.

### 3.2.1. Drawbacks

The main drawback of FADS is that the algorithm does not check the remaining time of the tuples held in the buffer in each round, so it may output results for tuples that should already be considered expired. The other important weakness of FADS is that it is not parallel and cannot handle a large number of data streams in tolerable time.
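To make the buffer-and-reuse scheme of the FADS algorithm (Section 3.2) more concrete, the sketch below models it in miniature. This is a hypothetical illustration, not the published implementation: the class name, the use of a deque, and the explicit `now` arguments are all assumptions made for the example.

```python
import time
from collections import deque

class FadsStyleBuffer:
    """A simplified, hypothetical model of FADS-style buffering:
    hold at most delta tuples awaiting clustering, and keep reusable
    k-anonymized clusters for at most t_kc time units (the reuse constraint)."""

    def __init__(self, delta, t_kc):
        self.delta = delta      # maximum number of tuples held in the buffer
        self.t_kc = t_kc        # reuse constraint for kept clusters
        self.buffer = deque()   # tuples awaiting clustering
        self.setkc = []         # (creation_time, cluster) pairs kept for reuse

    def add_tuple(self, tup):
        """Buffer a tuple; report whether the buffer is still within delta."""
        self.buffer.append(tup)
        return len(self.buffer) <= self.delta

    def keep_cluster(self, cluster, now=None):
        """Remember a newly created k-anonymized cluster for later reuse."""
        now = time.monotonic() if now is None else now
        self.setkc.append((now, cluster))

    def expire_clusters(self, now=None):
        """Drop clusters that have outlived the reuse constraint t_kc."""
        now = time.monotonic() if now is None else now
        self.setkc = [(t, c) for (t, c) in self.setkc if now - t <= self.t_kc]

buf = FadsStyleBuffer(delta=3, t_kc=10)
buf.keep_cluster("cluster_a", now=0)
buf.keep_cluster("cluster_b", now=5)
buf.expire_clusters(now=12)   # cluster_a is 12 units old and is dropped
```

Note that this sketch also exhibits the drawback described in Section 3.2.1: nothing here re-checks how long individual tuples have been waiting in the buffer between rounds.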
