4. Terminology of proposed algorithm

### 4.1. Data stream

A sequence of tuples is defined as $\langle s_n \rangle_{n \in \mathbb{N}}$, where $\mathbb{N}$ is the set of natural numbers. The $k$th term of $\langle s_n \rangle$ is the ordered pair $(k, t_k)$, where $k$ is a natural number and $t_k$ is a tuple.

A data stream $S$ is a potentially infinite sequence of tuples, denoted $\langle t_i \rangle$, where every tuple $t_i$ follows the schema $t_i = \langle ID, a_1, \dots, a_m, q_1, \dots, q_n, TS \rangle$. $ID$ is an identifier attribute, $q_1$ to $q_n$ are quasi-identifiers, and $TS$ is the time stamp.
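As a minimal sketch of this schema, a stream can be modeled in Python as a generator that never terminates and pairs each tuple with its position in the sequence. The class name `StreamTuple`, its field names, and the generated values are assumptions made for illustration only; they are placeholders, not data from the chapter.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterator, Tuple


@dataclass(frozen=True)
class StreamTuple:
    """One stream tuple <ID, a_1..a_m, q_1..q_n, TS> (field names are assumed)."""
    id: str            # identifier attribute ID
    attrs: tuple       # a_1 .. a_m
    quasi_ids: tuple   # quasi-identifiers q_1 .. q_n
    ts: int            # time stamp TS


def stream() -> Iterator[Tuple[int, StreamTuple]]:
    """Yield pairs (k, tuple) forever, mimicking a potentially infinite stream."""
    k = 0
    while True:
        k += 1
        # Placeholder values only: a profession and an age as quasi-identifiers.
        yield k, StreamTuple(id=f"id{k}", attrs=(),
                             quasi_ids=("Academic", 40 + k % 10), ts=k)


# A consumer can only ever look at a finite prefix of the stream.
first_three = list(islice(stream(), 3))
```

Because the generator is lazy, a consumer reads only the prefix it asks for, which is why stream algorithms work on tuples as they arrive.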

### 4.2. Cluster

A cluster is a set of tuples in a stream [12]. Suppose that $PS$ is a set of tuples in the stream. The cluster $C$ can be defined as follows:

$$C = \{\, t \mid t \in PS \,\}$$

Data Privacy for Big Data Publishing Using Newly Enhanced PASS Data Mining Mechanism

http://dx.doi.org/10.5772/intechopen.77033

### 4.2.1. K-anonymized cluster

If a cluster $C$ is built from the data stream and the number of unique tuples in the cluster is greater than $k$, the cluster is called a k-anonymized cluster.

### 4.2.2. Generalization

Generalization is a function that maps a cluster to a tuple. More formally, the generalization function $G$ is defined as $G: \mathrm{PowerSet}(TUPLE) \to TUPLE$, where $TUPLE$ is the set of all possible tuples.

### 4.2.3. Numerical value generalization

Numerical values are generalized to the interval between their minimum and maximum values, i.e., they are generalized within their domain.

### 4.2.4. Categorical value generalization

Categorical values are generalized to their lowest common ancestor, as shown in Figure 2.

### 4.2.5. Example of the above two types of generalization

Consider a cluster of three tuples that contain both numerical and categorical values: the name, profession, and age of employees.

C = <"Prof. Young", Academic, 43>,

<"Mr. Zhou", non-Academic, 39>,

<"Prof. Chung", Academic, 46>.

Figure 2. University taxonomy tree.

The above cluster can be generalized as follows: gc = <∗, staff, [39–46]>. Since we do not want to disclose the name, we keep ∗ in the first column. Profession is a categorical value and age is a numerical value: age is generalized to [min, max], and profession is generalized to the lowest common ancestor of academic and non-academic.

### 4.2.6. Distance

Distance is used to measure the similarity or dissimilarity between two tuples. This function is the heart of the clustering: clustering is generally based on distance calculation, and the tuples with the closest distance are placed in the same cluster.

### 4.2.6.1. Types of distances

### 4.2.6.1.1. The distance between numerical values

Let $v_1$ and $v_2$ be two numerical values. The distance between $v_1$ and $v_2$ is

$$d(v_1, v_2) = \frac{|v_1 - v_2|}{|D|}$$

where $D$ is the domain of the values.

### 4.2.6.1.2. The distance between two categorical values

Suppose that all the categorical values are arranged in a tree whose root is the most general value and whose lowest level contains the most specialized values, e.g., the categorical trees shown in Figure 3 (Country taxonomy tree) and Figure 4 (Occupation taxonomy tree). The distance between two categorical values $v_1$ and $v_2$ is then

$$d(v_1, v_2) = \frac{\text{height of the subtree rooted at the lowest common ancestor of } (v_1, v_2)}{\text{height of the tree}}$$

For example, in the country taxonomy tree, the distance between India and Egypt = height of the subtree rooted at the lowest common ancestor of India and Egypt / height of the tree = height of the subtree with East as root / height of the tree = 1/3 = 0.33.
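The definitions in Section 4.2 (k-anonymized cluster, generalization, and the numerical and categorical distances) can be sketched in Python as follows. This is an illustration under stated assumptions, not the chapter's implementation: the function names are hypothetical, $|D|$ is interpreted as the width of the numerical domain, height is counted in edges, and both taxonomy trees are stand-ins. In particular, the country tree below is not Figure 3; it is chosen only so that East is the lowest common ancestor of India and Egypt and the tree has height 3, as in the worked example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Child -> parent maps. Both trees are stand-ins (assumptions), not the figures.
UNIVERSITY: Dict[str, Optional[str]] = {
    "staff": None, "Academic": "staff", "non-Academic": "staff",
}
COUNTRY: Dict[str, Optional[str]] = {
    "Any": None, "Region": "Any", "East": "Region",
    "India": "East", "Egypt": "East",
}


def path_to_root(tree: Dict[str, Optional[str]], v: str) -> List[str]:
    """Return v followed by all of its ancestors, ending at the root."""
    path = [v]
    while tree[path[-1]] is not None:
        path.append(tree[path[-1]])
    return path


def lca(tree: Dict[str, Optional[str]], values: Sequence[str]) -> str:
    """Lowest common ancestor of one or more categorical values."""
    common = path_to_root(tree, values[0])
    for v in values[1:]:
        ancestors = set(path_to_root(tree, v))
        common = [n for n in common if n in ancestors]
    return common[0]  # the path is ordered bottom-up, so the first hit is lowest


def height(tree: Dict[str, Optional[str]], node: str) -> int:
    """Number of edges on the longest downward path from node to a leaf."""
    children = [c for c, p in tree.items() if p == node]
    return 0 if not children else 1 + max(height(tree, c) for c in children)


def numerical_distance(v1: float, v2: float, domain_width: float) -> float:
    """|v1 - v2| / |D|, with |D| taken as the domain width (an assumption)."""
    return abs(v1 - v2) / domain_width


def categorical_distance(tree: Dict[str, Optional[str]], v1: str, v2: str) -> float:
    """Height of the subtree rooted at lca(v1, v2) divided by the tree height."""
    root = next(n for n, p in tree.items() if p is None)
    return height(tree, lca(tree, [v1, v2])) / height(tree, root)


def is_k_anonymized(cluster: Sequence[tuple], k: int) -> bool:
    """More than k unique tuples, following the wording of Section 4.2.1."""
    return len(set(cluster)) > k


def generalize(cluster: Sequence[Tuple[str, str, int]],
               tree: Dict[str, Optional[str]]) -> Tuple[str, str, Tuple[int, int]]:
    """Map a cluster of (name, profession, age) tuples to one generalized tuple."""
    professions = [t[1] for t in cluster]
    ages = [t[2] for t in cluster]
    return ("*", lca(tree, professions), (min(ages), max(ages)))


cluster = [("Prof. Young", "Academic", 43),
           ("Mr. Zhou", "non-Academic", 39),
           ("Prof. Chung", "Academic", 46)]

print(generalize(cluster, UNIVERSITY))                          # ('*', 'staff', (39, 46))
print(round(categorical_distance(COUNTRY, "India", "Egypt"), 2))  # 0.33
```

Storing each taxonomy as a child-to-parent map keeps the lowest-common-ancestor lookup to a walk up two short paths, which is sufficient for small, fixed taxonomies such as these.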
