6. Results and discussion

### 6.1. Experiment environment

The experiment is performed on a system with an Intel i5 processor running at 2.2 GHz and 4.0 GB of main memory, on the Linux platform. The algorithm is implemented in Java and executed on the Hadoop MapReduce framework.
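The chapter states that the algorithm is implemented in Java on Hadoop MapReduce but does not reproduce the job itself. As a rough single-process illustration of the map/shuffle/reduce pattern such a job follows (the class, record layout, and key choice here are hypothetical, not the actual PASS job):

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    // "Map" step: emit one (key, value) pair per record, here (profession, 1).
    static Map.Entry<String, Integer> mapRecord(String record) {
        String[] fields = record.split(",");
        return Map.entry(fields[1].trim(), 1);
    }

    // "Shuffle" + "reduce" steps: group pairs by key, then sum each group.
    static Map<String, Integer> run(List<String> records) {
        return records.stream()
                .map(MapReduceSketch::mapRecord)
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> records = List.of("34, Engineer", "29, Teacher", "41, Engineer");
        System.out.println(run(records)); // prints the per-profession counts
    }
}
```

In a real Hadoop job, `mapRecord` and the summing step would be `Mapper` and `Reducer` classes distributed across nodes; the data flow is the same.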

Figure 5. Taxonomy tree.

### 6.2. Dataset description

In this experiment, we evaluated the performance of the proposed algorithm on the Adult dataset from the UCI repository [13]. This dataset is widely used for privacy-preservation research. The taxonomy tree is defined as per Figure 5. The sensitive attributes in the dataset are age (numerical) and profession (categorical).
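To make the role of the taxonomy tree concrete, here is a minimal sketch of generalizing a categorical value up such a tree; the node labels are illustrative stand-ins, not the exact tree of Figure 5:

```java
import java.util.*;

public class TaxonomySketch {
    // Parent links of a toy taxonomy for the profession attribute
    // (labels are illustrative, not the actual Figure 5 hierarchy).
    static final Map<String, String> PARENT = Map.of(
        "Engineer", "Technical",
        "Scientist", "Technical",
        "Teacher", "Nontechnical",
        "Clerk", "Nontechnical",
        "Technical", "Any",
        "Nontechnical", "Any");

    // Generalize a value up the tree by 'levels' steps, stopping at the root "Any".
    static String generalize(String value, int levels) {
        String v = value;
        for (int i = 0; i < levels && PARENT.containsKey(v); i++) {
            v = PARENT.get(v);
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(generalize("Engineer", 1)); // Technical
        System.out.println(generalize("Engineer", 2)); // Any
    }
}
```

Each step up the tree trades precision for anonymity: more tuples share the coarser value, at the cost of higher information loss.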

### 6.3. Results and discussion

The dataset used for the experiment contains 32,599 tuples. The efficiency of the proposed algorithm is evaluated using the information-loss metric. The average information loss of the proposed PASS algorithm, FADS, and FAST is presented in Figure 6. The proposed PASS algorithm publishes data with lower information loss because SetKc in the proposed approach, as shown in Figure 7, has more entities, so a data tuple has more options to select from; this decreases the information loss, as shown in Figure 8, and hence the results of the algorithm show improvement. The average execution time also decreases drastically because the MapReduce-based newly enhanced PASS mechanism is used.
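The chapter quantifies efficiency through information loss without spelling out the metric. A common NCP-style formulation (an assumption here, not necessarily the exact measure used by PASS) can be sketched as:

```java
public class InfoLossSketch {
    // Loss for a numeric attribute generalized to the interval [lo, hi]
    // over a domain [min, max]; 0 = exact value kept, 1 = fully suppressed.
    static double numericLoss(double lo, double hi, double min, double max) {
        return (hi - lo) / (max - min);
    }

    // Loss for a categorical value generalized to a taxonomy-tree node
    // covering 'covered' leaf categories out of 'total' leaves.
    static double categoricalLoss(int covered, int total) {
        return total == 1 ? 0.0 : (double) (covered - 1) / (total - 1);
    }

    public static void main(String[] args) {
        // age 34 generalized to [30, 40] over an assumed domain [17, 90]
        System.out.println(numericLoss(30, 40, 17, 90));
        // profession generalized to a node covering 2 of 5 leaf professions
        System.out.println(categoricalLoss(2, 5));
    }
}
```

Averaging such per-attribute losses over all published tuples gives a curve like the one compared across PASS, FADS, and FAST in Figure 6.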

Data Privacy for Big Data Publishing Using Newly Enhanced PASS Data Mining Mechanism

http://dx.doi.org/10.5772/intechopen.77033

Figure 6. Information loss in FAST and FADS algorithms.

Figure 7. Number of tuples vs. running time.

Figure 8. Attribute vs. information loss.

7. Conclusion

The existing algorithms for data stream processing are not capable of processing big data, i.e., data of high volume and velocity. Nonparallel data anonymization algorithms use older languages (Java, SQL) and techniques, which are not very effective because they take a long time for computation and sometimes publish tuples that have already expired; this leads to loss of accuracy as well as loss of privacy, which is very dangerous. Static algorithms require all computations to be performed on a single node, due to which the data and processing requirements are very high and the computers used are prone to failure, which is very expensive to recover from.

In this paper, we have proposed the PASS algorithm, which uses the Hadoop framework to process the data. Using Hadoop, the computer's resources are used to the maximum extent, which reduces the time required for computation and in turn prevents the publishing of expired tuples. Another advantage of this algorithm is that computations can be performed on nodes with less computational power and less storage capacity than the computers used for nonparallel data processing. The proposed PASS algorithm publishes data with lower information loss. Using Hadoop, failures of both data and processors can be recovered. These features drastically reduce the maintenance cost and the initial setup cost.

Author details

Priyank Jain\*, Manasi Gyanchandani and Nilay Khare

\*Address all correspondence to: priyankjain1984@gmail.com

Department of Computer Science and Engineering, MANIT, Bhopal, India
