The distance between two numerical values v1 and v2 is

d(v1, v2) = |v1 - v2| / |D|

where D is the domain of the values and |D| is the size of that domain.
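As a minimal sketch, assuming the domain D is an interval described by its minimum and maximum values (so |D| is simply its width, an assumption not stated in the text), the distance can be computed as:

```python
# Numerical-value distance d(v1, v2) = |v1 - v2| / |D|.
# The domain is assumed to be an interval [domain_min, domain_max].
def numeric_distance(v1, v2, domain_min, domain_max):
    domain_size = domain_max - domain_min
    return abs(v1 - v2) / domain_size if domain_size else 0.0

# Example: ages 22 and 37 over an (assumed) age domain of [20, 50] give 15/30 = 0.5.
print(numeric_distance(22, 37, 20, 50))  # 0.5
```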

### 4.2.6.1.2. The distance between two categorical values

All the categorical values are arranged in the form of a tree, where the root is the most generalized value and the lowest levels contain the most specialized values. Examples of such taxonomy trees are shown in Figure 3 (country taxonomy tree) and Figure 4 (occupation taxonomy tree).

The distance between two categorical values v1 and v2 is

d(v1, v2) = (height of the subtree rooted at the lowest common ancestor of v1 and v2) / (height of the tree).

For example, the distance between India and Egypt (using the country taxonomy tree in Figure 3) is:

d(India, Egypt) = (height of the subtree rooted at the lowest common ancestor of India and Egypt) / (height of the tree)

= (height of the subtree rooted at East) / (height of the tree) = 1/3 = 0.33.
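The same computation can be sketched in Python on a small, hypothetical fragment of a country taxonomy (the actual trees are shown in Figures 3 and 4); the node names and tree shape below are illustrative assumptions only:

```python
# Hypothetical taxonomy fragment stored as a child -> parent map; the root has no parent.
PARENT = {
    "India": "East", "Egypt": "East", "East": "Country",
    "Germany": "Europe", "France": "Europe",
    "Europe": "West", "West": "Country",
}
CHILDREN = {}
for child, par in PARENT.items():
    CHILDREN.setdefault(par, []).append(child)

def height(node):
    """Height of the subtree rooted at node (a leaf has height 0)."""
    kids = CHILDREN.get(node, [])
    return 0 if not kids else 1 + max(height(k) for k in kids)

def ancestors(node):
    """Path from node up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lowest_common_ancestor(v1, v2):
    on_path = set(ancestors(v1))
    return next(a for a in ancestors(v2) if a in on_path)

def categorical_distance(v1, v2, root="Country"):
    """d(v1, v2) = height(subtree at LCA(v1, v2)) / height(whole tree)."""
    return height(lowest_common_ancestor(v1, v2)) / height(root)

# In this toy tree the LCA of India and Egypt is "East" (subtree height 1) and the
# whole tree has height 3, so d(India, Egypt) = 1/3 ≈ 0.33, matching the text.
print(round(categorical_distance("India", "Egypt"), 2))  # 0.33
```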



The algorithm reads \$ tuples continuously and inserts them into SetTp. First, for each tuple t in SetTp, the procedure finds t's K-1 nearest tuples in SetTp. Tuple t and its K-1 nearest tuples form a new set, Snew, which is generalized into Gs. Then the set with minimum information loss (Sk-best) that covers tuple t is chosen from SetKc. If Sk-best exists and has smaller information loss than Gs, tuple t is published with the Sk-best generalization.

If no set in SetKc that covers tuple t has smaller information loss than Gs, then tuple t is published with the Snew generalization, i.e., Gs, and Gs is inserted into SetKc.
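A rough sketch of this publishing loop is given below, assuming numeric quasi-identifiers only (categorical attributes, handled with the taxonomy trees of Figures 3 and 4, are omitted for brevity). The helper names `distance`, `generalize`, `covers`, and `info_loss`, as well as the example domain, are illustrative assumptions rather than identifiers from the chapter.

```python
# Assumed attribute domains and quasi-identifier attributes (illustrative only).
DOMAINS = {"Age": (20, 60)}
QI = ["Age"]

def distance(t1, t2):
    return sum(abs(t1[a] - t2[a]) / (DOMAINS[a][1] - DOMAINS[a][0]) for a in QI)

def generalize(tuples):
    """Generalize a set of tuples to one [min, max] interval per QI attribute."""
    return {a: (min(t[a] for t in tuples), max(t[a] for t in tuples)) for a in QI}

def covers(gen, t):
    return all(gen[a][0] <= t[a] <= gen[a][1] for a in QI)

def info_loss(gen):
    return sum((hi - lo) / (DOMAINS[a][1] - DOMAINS[a][0]) for a, (lo, hi) in gen.items())

def pass_publish(set_tp, K, set_kc):
    """Publish every tuple in SetTp with a K-member generalization."""
    published = []
    for t in set_tp:
        # t together with its K-1 nearest neighbours forms Snew; generalize it to Gs.
        nearest = sorted((u for u in set_tp if u is not t),
                         key=lambda u: distance(t, u))[:K - 1]
        gs = generalize([t] + nearest)
        # Pick the covering set with minimum information loss from SetKc, if any.
        candidates = [g for g in set_kc if covers(g, t)]
        sk_best = min(candidates, key=info_loss, default=None)
        if sk_best is not None and info_loss(sk_best) < info_loss(gs):
            published.append((t, sk_best))        # publish with the reused generalization
        else:
            published.append((t, gs))             # publish with Gs
            if gs not in set_kc:
                set_kc.append(gs)                 # keep SetKc a set of unique generalizations
    return published

# Usage: three buffered tuples ($ = 3) anonymized with K = 2.
set_kc = []
batch = [{"Age": 22}, {"Age": 24}, {"Age": 37}]
for tup, gen in pass_publish(batch, K=2, set_kc=set_kc):
    print(tup, "->", gen)
```

Reusing a covering set from SetKc, when it is cheaper than Gs, keeps the published generalizations consistent across the stream while avoiding unnecessary information loss.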

In the following, a simple example is illustrated for better understanding. Table 3 is a portion of a university person data stream in which the quasi-identifiers are age and job. The parameters are assumed to be \$ = 3 and K = 2. Suppose that in thread n, the values of the variables are as follows:

| Pid | Age | University person |
| --- | --- | --- |
| Id1 | 22 | Bachelor |
| Id2 | 24 | Master |
| Id3 | 37 | Nonacademic |
| … | … | … |
| Idn | 45 | Academic |
| Idn+1 | 26 | Nonacademic |
| Idn+2 | 39 | PhD |
| … | … | … |

Table 3. University person.


## 5. Proposed PASS algorithm

### 5.1. Details of the PASS algorithm

K = anonymization parameter.

SetTp = set of \$ tuples.

Snew = set of K tuples.


Gs = generalized set of Snew.

S = total number of tuples in the dataset.

SetKc = set of all unique generalized sets.

\$ = number of tuples to be read before processing.

Figure 3. Country taxonomy tree.

Figure 4. Occupation taxonomy tree.

### 4.2.6.1.3. The distance between two tuples

Let t = {N1, …, Nm, C1, …, Cn} be the quasi-identifier of table T, where Ni (i = 1, …, m) is an attribute with a numeric domain and Cj (j = 1, …, n) is an attribute with a categorical domain.

The distance d(r1, r2) between two tuples r1 and r2 is defined as:

d(r1, r2) = (sum of the distances between the numerical attributes of the two tuples) + (sum of the distances between the categorical attributes of the two tuples).
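A minimal sketch of this combined distance is shown below; the attribute names, the numeric domain, and the precomputed categorical distance (standing in for the taxonomy-tree computation above) are assumptions for illustration:

```python
# Assumed numeric domain and a precomputed categorical distance (from Figure 3's tree).
NUMERIC_DOMAINS = {"Age": (20, 60)}
CATEGORICAL_DIST = {frozenset({"India", "Egypt"}): 1 / 3}
CATEGORICAL_QI = ("Country",)

def tuple_distance(r1, r2):
    """d(r1, r2) = sum of numeric distances + sum of categorical distances."""
    d = 0.0
    for attr, (lo, hi) in NUMERIC_DOMAINS.items():
        d += abs(r1[attr] - r2[attr]) / (hi - lo)
    for attr in CATEGORICAL_QI:
        d += CATEGORICAL_DIST.get(frozenset({r1[attr], r2[attr]}), 0.0)
    return d

print(tuple_distance({"Age": 22, "Country": "India"},
                     {"Age": 37, "Country": "Egypt"}))  # 0.375 + 0.333... ≈ 0.71
```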

Information loss: generalization leads to information loss, so tuples must be grouped into clusters in such a way that the information loss is minimized. It is calculated as follows:

Total information loss = sum of the information losses of all the clusters.

Information loss of a cluster = sum of the information losses of all the tuples in the cluster.

Information loss of a tuple = sum of the information losses of all its attributes (categorical and numerical).

Information loss of a numerical attribute = (size of the generalized value of the attribute)/(size of the domain of the attribute).

Information loss of a categorical attribute = k/h, where h is the height of the categorical attribute's taxonomy tree and k is the height of the subtree rooted at the generalized value of the attribute.
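These formulas can be sketched as follows, assuming each cluster has already been generalized to one interval per numerical attribute and one taxonomy node per categorical attribute; the attribute names, domain, and tree heights below are illustrative assumptions:

```python
def numeric_loss(interval, domain):
    """(width of the generalized interval) / (width of the attribute domain)."""
    return (interval[1] - interval[0]) / (domain[1] - domain[0])

def categorical_loss(subtree_height, tree_height):
    """k / h: height of the subtree rooted at the generalized value over the tree height."""
    return subtree_height / tree_height

def tuple_loss(numeric_gens, cat_subtree_heights, domains, tree_heights):
    """Sum the losses of all numerical and categorical attributes of one tuple."""
    loss = sum(numeric_loss(numeric_gens[a], domains[a]) for a in numeric_gens)
    loss += sum(categorical_loss(k, tree_heights[a]) for a, k in cat_subtree_heights.items())
    return loss

def cluster_loss(cluster_size, numeric_gens, cat_subtree_heights, domains, tree_heights):
    """Every tuple in the cluster is published with the same generalization."""
    return cluster_size * tuple_loss(numeric_gens, cat_subtree_heights, domains, tree_heights)

# Example: a 2-tuple cluster generalized to Age [22, 24] over an (assumed) domain [20, 60]
# and Country generalized to "East" (subtree height 1, tree height 3).
print(cluster_loss(2, {"Age": (22, 24)}, {"Country": 1},
                   {"Age": (20, 60)}, {"Country": 3}))  # 2 * (0.05 + 0.333...) ≈ 0.77
```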
