5. Proposed PASS algorithm

### 5.1. Details of the PASS algorithm

S = total number of tuples in the dataset.

K = anonymization parameter (each published generalized set contains at least K tuples).

\$ = number of tuples to be read before processing.

SetTp = set of \$ tuples.

SetKc = set of all unique generalized sets.

Snew = set of K tuples.

4.2.6.1.3. The distance between two tuples

Let t = {N1,…,Nm, C1,…,Cn} be the quasi-identifier of table T, where Ni (i = 1,…,m) is an attribute with a numeric domain and Cj (j = 1,…,n) is an attribute with a categorical domain.

The distance d(r1,r2) (i.e., the distance between two tuples r1, r2) is defined as:

d(r1,r2) = sum of the distances between the numerical attributes of the two tuples + sum of the distances between the categorical attributes of the two tuples.

Information loss: generalization leads to information loss, but we have to group tuples into clusters in such a way that the information loss is minimum. It is calculated as follows:

Information loss of a numerical attribute = (range of the generalized value of the attribute)/(range of the domain of the attribute).

Information loss of a tuple = sum of the information loss of all its attributes (categorical attributes and numerical attributes).

Information loss of a cluster = sum of the information loss of all the tuples in the cluster.

Total information loss = sum of the information loss of all the clusters.

Figure 3. Country taxonomy tree.

Figure 4. Occupation taxonomy tree.
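The distance definitions above can be sketched in Java as follows. Note that the chapter only states that d(r1,r2) sums per-attribute distances; the concrete per-attribute measures used below (|a − b| divided by the domain range for numeric attributes, and the normalized height of the lowest common ancestor in the taxonomy tree for categorical ones) are common choices assumed here, and the taxonomy values are hypothetical:

```java
import java.util.*;

// Sketch of d(r1, r2) = sum of numeric-attribute distances
// + sum of categorical-attribute distances. The per-attribute
// measures are assumptions, not the chapter's exact formulas.
public class TupleDistance {

    // Normalized distance between two numeric values over the attribute's domain.
    static double numericDistance(double a, double b, double domainMin, double domainMax) {
        return Math.abs(a - b) / (domainMax - domainMin);
    }

    // Walk a child -> parent taxonomy map from a node up to the root.
    static List<String> pathToRoot(Map<String, String> parent, String node) {
        List<String> path = new ArrayList<>();
        for (String n = node; n != null; n = parent.get(n)) path.add(n);
        return path;
    }

    // Distance between two categorical values: how far up the taxonomy their
    // lowest common ancestor sits, normalized by the tree height.
    static double categoricalDistance(Map<String, String> parent, int treeHeight,
                                      String a, String b) {
        if (a.equals(b)) return 0.0;
        Set<String> ancestorsOfA = new HashSet<>(pathToRoot(parent, a));
        String lca = null;
        for (String n : pathToRoot(parent, b)) {
            if (ancestorsOfA.contains(n)) { lca = n; break; }
        }
        int lcaDepth = pathToRoot(parent, lca).size() - 1; // root has depth 0
        return (double) (treeHeight - lcaDepth) / treeHeight;
    }

    // d(r1, r2): numeric parts plus categorical parts, summed.
    static double distance(double[] num1, double[] num2, double[][] numDomains,
                           String[] cat1, String[] cat2,
                           Map<String, String> parent, int treeHeight) {
        double d = 0.0;
        for (int i = 0; i < num1.length; i++)
            d += numericDistance(num1[i], num2[i], numDomains[i][0], numDomains[i][1]);
        for (int j = 0; j < cat1.length; j++)
            d += categoricalDistance(parent, treeHeight, cat1[j], cat2[j]);
        return d;
    }

    public static void main(String[] args) {
        // Tiny hypothetical occupation taxonomy:
        // Any -> {Technical -> {Engineer, Professor}, Admin -> {Clerk}}
        Map<String, String> parent = new HashMap<>();
        parent.put("Technical", "Any");
        parent.put("Admin", "Any");
        parent.put("Engineer", "Technical");
        parent.put("Professor", "Technical");
        parent.put("Clerk", "Admin");

        // Age in [20, 70]: |25 - 35| / 50 = 0.2; Engineer and Professor meet
        // at Technical, one level below the root: (2 - 1) / 2 = 0.5.
        double d = distance(new double[]{25}, new double[]{35}, new double[][]{{20, 70}},
                            new String[]{"Engineer"}, new String[]{"Professor"},
                            parent, 2);
        System.out.println(d); // ≈ 0.7
    }
}
```

Identical categorical values contribute zero, and values that only meet at the root contribute the maximum of 1, so the measure rewards grouping tuples that generalize to a low node in the taxonomy.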

Gs = generalized set of Snew.

The algorithm reads \$ tuples at a time and inserts them into SetTp. For each tuple t in SetTp, the procedure first finds t's K-1 nearest tuples in SetTp; t together with these K-1 nearest tuples forms a new set, Snew, which is generalized into Gs. Then, from SetKc, the set with minimum information loss that covers tuple t, called Sk-best, is chosen. If Sk-best exists and has smaller information loss than Gs, tuple t is published with the Sk-best generalization.

If no set in SetKc that covers tuple t has less information loss than Gs, then tuple t is published with the Snew generalization, i.e., Gs, and Gs is inserted into SetKc.
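The per-tuple flow just described (build Snew, generalize it into Gs, and compare Gs against the best covering set already in SetKc) can be sketched as below. This is an illustrative reconstruction for a single numeric quasi-identifier (age), with interval generalization and width/domain-width information loss assumed as the measures; it is not the authors' implementation:

```java
import java.util.*;

// Illustrative reconstruction of one PASS publishing step for a single
// numeric quasi-identifier (age). The age domain and the loss measure
// (interval width / domain width) are assumptions for this sketch.
public class PassStep {
    static final double DOMAIN_MIN = 20, DOMAIN_MAX = 70; // assumed age domain

    // Generalize a set of ages into the interval Gs = [min, max].
    static double[] generalize(List<Double> ages) {
        return new double[]{Collections.min(ages), Collections.max(ages)};
    }

    // Information loss of a generalized interval: width / domain width.
    static double infoLoss(double[] g) {
        return (g[1] - g[0]) / (DOMAIN_MAX - DOMAIN_MIN);
    }

    // Publish tuple t: reuse the best covering set from SetKc if it loses
    // less information than Gs; otherwise publish Gs and insert it in SetKc.
    static double[] publish(double t, List<Double> kNearest, List<double[]> setKc) {
        List<Double> snew = new ArrayList<>(kNearest);
        snew.add(t);                        // Snew = t plus its K-1 nearest tuples
        double[] gs = generalize(snew);     // step 2: Gs

        double[] skBest = null;             // step 3: best covering set in SetKc
        for (double[] g : setKc)
            if (g[0] <= t && t <= g[1]
                    && (skBest == null || infoLoss(g) < infoLoss(skBest)))
                skBest = g;

        if (skBest != null && infoLoss(skBest) < infoLoss(gs))
            return skBest;                  // step 4: publish t with Sk-best
        setKc.add(gs);                      // else publish Gs and remember it
        return gs;
    }

    public static void main(String[] args) {
        List<double[]> setKc = new ArrayList<>();
        // First tuple: nothing in SetKc covers it, so Gs = [25, 35] is published.
        double[] first = publish(25, List.of(35.0), setKc);
        // Second tuple: [25, 35] covers t = 30 and loses less than Gs = [30, 50].
        double[] second = publish(30, List.of(50.0), setKc);
        System.out.println(Arrays.toString(first) + " " + Arrays.toString(second));
    }
}
```

The second call shows the point of SetKc: the previously published generalization [25, 35] is reused for t = 30 because its information loss (0.2) beats that of the freshly built [30, 50] (0.4).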

In the following, a simple example is given for better understanding. Table 3 shows a portion of a university person data stream, in which the quasi-identifiers are age and job. \$ and K are assumed to be \$ = 3 and K = 2. Suppose that in thread n, the values of the variables are as follows:


Table 3. University person.


In this stage, the information loss of Sk-best is compared with the information loss of Gs. Since the information loss of Sk-best is less than that of Gs, the tuple with id n is published with the Sk-best generalization. Table 4 represents the two anonymized university persons.

Table 4. Two anonymized university persons.

### 5.2. Proposed PASS algorithm

```
Big data Anonymization (S, K, $)
{
  While S != 0 do
    Read $ tuples and insert them into SetTp
    For each tuple t do
      1. Select the K-1 unique tuples that are closest to t among
         the tuples in SetTp and insert them, together with t,
         into the set Snew
      2. Generalize Snew into Gs
      For each set in SetKc which covers t do
        Calculate the information loss
      End for
      3. Select the set which includes the least information loss
         and call it Sk-best
      4. If (Sk-best exists and Sk-best generates less information
         loss than Gs) then
        Publish t with the Sk-best generalization
      Else
        Publish t with Gs and insert Gs into SetKc
      End if
    End for
  End while
}
```

Data Privacy for Big Data Publishing Using Newly Enhanced PASS Data Mining Mechanism

http://dx.doi.org/10.5772/intechopen.77033

6. Result and discussion

6.1. Experiment environment

This experiment is performed on a system with an Intel i5 processor running at 2.2 GHz and 4.0 GB of main memory, on the Linux platform. The algorithm is implemented in Java and executed with the help of the Hadoop MapReduce framework.

Figure 5. Taxonomy tree.
