**5. Results and discussion**

This section presents the findings and discusses the three datasets and their preparation through data cleansing. The datasets were obtained from the data repository, and **Table 1** summarizes the Excel-file datasets before the data mining tool was applied. The machine data file contains 30 000 records, of which 10% have missing values, along with 7 duplicate records. The alarm data file contains 45 000 records, with 25% missing values and 28 duplicate records. Finally, the sensor data file contains 100 000 records, with 45% missing values and 100 duplicate records. All files were stored in Microsoft Excel format.


#### **Table 1.**

*Raw data.*

| Dataset | Records | Missing values | Duplicate records |
| --- | --- | --- | --- |
| Machine data | 30 000 | 10% | 7 |
| Alarm data | 45 000 | 25% | 28 |
| Sensor data | 100 000 | 45% | 100 |
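As a minimal sketch of how such a profile could be produced, the following Python snippet counts missing values and duplicate records in each Excel file using pandas. The file names are hypothetical, as the source does not name the files, and this is an illustration rather than the tool used in the study.

```python
import pandas as pd

# Hypothetical file names; the source does not name the dataset files.
files = ["machine_data.xlsx", "alarm_data.xlsx", "sensor_data.xlsx"]

for path in files:
    df = pd.read_excel(path)                          # load one Excel dataset
    missing_pct = df.isna().any(axis=1).mean() * 100  # share of rows with any missing value
    duplicates = df.duplicated().sum()                # exact duplicate rows
    print(f"{path}: {len(df)} records, "
          f"{missing_pct:.0f}% missing, {duplicates} duplicates")
```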


#### **Table 2.**

*Data mining uses.*

| Dataset | Records after cleansing | Missing values | Duplicate records |
| --- | --- | --- | --- |
| Machine data | 26 993 | 0 | 0 |
| Alarm data | 33 722 | 0 | 0 |
| Sensor data | 54 900 | 0 | 0 |

A data mining tool was then applied to the datasets. **Table 2** shows the effect of data mining in removing errors and inconsistencies from the records. As **Table 2** shows, the machine data records decreased from 30 000 to 26 993, with no missing values and no duplicate records remaining in the dataset. The alarm data records decreased from 45 000 to 33 722, with no missing values or duplicate records. The sensor data records decreased from 100 000 to 54 900, also with no missing values or duplicate records.
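A minimal sketch of the corresponding cleansing step, again using pandas with a hypothetical file name, is shown below. Dropping rows with missing values and removing exact duplicates yields the kind of record reduction reported in **Table 2**, though the internal logic of the actual tool is not described in the source.

```python
import pandas as pd

def cleanse(path: str) -> pd.DataFrame:
    """Drop records with missing values, then remove exact duplicates."""
    df = pd.read_excel(path)
    df = df.dropna()               # discard records with any missing value
    return df.drop_duplicates()    # discard exact duplicate records

cleaned = cleanse("sensor_data.xlsx")  # hypothetical file name
print(len(cleaned), "records remain after cleansing")
```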

The missing-values criterion measures how effectively the tool finds missing values in a file. Other features assessed were whether the tool could detect duplicates, illegal values, and misspellings, and whether it could merge records. The file formats supported for these records and the ease of use were also considered.
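To illustrate the illegal-value check mentioned above, the sketch below flags records whose values fall outside a plausible range. The column name `temperature` and its bounds are hypothetical, since the source does not specify the dataset schema.

```python
import pandas as pd

df = pd.read_excel("sensor_data.xlsx")  # hypothetical file name

# Hypothetical legality rule: sensor temperatures must lie in [-40, 125].
legal = df["temperature"].between(-40, 125)
illegal_rows = df[~legal]

print(f"{len(illegal_rows)} records contain illegal temperature values")
```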

#### **5.1 Discussion**

This paper aims to investigate data cleansing in big data. Based on the available data cleansing methods discussed in the previous section, data cleansing for big data needs to be improved to cope with the massive amount of data. The traditional data cleaning method remains important as a foundation for developing a data cleaning framework for big data applications. In the review of Potter, this method focused only on solving data transformation challenges [13]. Excel spreadsheets offer only limited support for problems such as duplicate record detection, so the user needs other approaches to deal with duplicate records [27].

Data mining can involve both manual and automatic procedures, but this approach focuses on eliminating duplicates and missing values despite the various other data quality challenges in a dataset. Traditional data cleansing tools tend to solve only one data quality problem at a time and require human intervention to resolve data cleansing conflicts. In the big data era, the traditional data cleansing process is no longer acceptable, as data needs to be cleansed and analyzed quickly. Data is also growing more complex, since it may include structured, semi-structured, and unstructured data, whereas the methods discussed focus only on structured data. Existing methods therefore have limitations when working with dirty data. The data mining approach performs the computation of each stage locally in each Excel spreadsheet, and data exchange at the stage boundaries is done by broadcast or hash partitioning.
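As a rough illustration of hash partitioning at a stage boundary, the sketch below routes records to partitions by hashing a key column, so that records sharing a key always meet in the same partition. The key column `machine_id` is hypothetical, and this is a sketch of the general technique rather than the tool's actual implementation.

```python
import pandas as pd

def hash_partition(df: pd.DataFrame, key: str, n_parts: int) -> list[pd.DataFrame]:
    """Split a DataFrame into n_parts partitions by hashing a key column,
    so records sharing a key value always land in the same partition."""
    part_ids = pd.util.hash_pandas_object(df[key], index=False) % n_parts
    return [df[part_ids == i] for i in range(n_parts)]

# Hypothetical usage: exchange machine records between stages on "machine_id".
df = pd.read_excel("machine_data.xlsx")
partitions = hash_partition(df, key="machine_id", n_parts=4)
print([len(p) for p in partitions])  # records per partition
```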
