#### **3.2 Data cleansing**

Data cleansing is an operation within data mining software that is performed on existing data to remove anomalies and produce a usable data collection. It involves removing errors and inconsistencies from the dataset and transforming the data into a uniform format [12]. Given the amount of data collected, manual data cleansing is impractical because it is time-consuming and prone to errors. The data cleansing process consists of two main stages: detecting data errors and repairing them [13]. Although it is often regarded as a tedious exercise, establishing a process and template for data cleansing provides assurance that the method applied is correct. Data cleansing therefore focuses on errors beyond small technical variations, those that constitute a significant shift within or beyond the distribution of the data [14].

Data cleansing draws on knowledge of typical technical errors and of the normal values expected in the dataset. Missing values, for instance, may be caused by interruptions of the data flow. Predefined rules for dealing with errors and with true missing and extreme values are therefore part of good practice. It is also more efficient to detect errors by actively searching for them in the dataset in a planned way. A lack of data after cleansing can arise when analysts and users do not fully understand the dataset, including its skips and filters [14].
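As a minimal illustration of such predefined rules, the following Python/pandas sketch flags missing and extreme values in a hypothetical sensor-readings table; the column names and plausibility limits are assumptions made for the example only.

```python
import pandas as pd
import numpy as np

# Hypothetical readings table; column names and limits are illustrative only.
df = pd.DataFrame({
    "sensor_id": ["A1", "A1", "B2", "B2", "C3"],
    "temperature": [21.4, np.nan, 19.8, 410.0, 22.1],  # 410.0 is implausible
})

# Predefined rules: an expected value range and a missing-value check.
TEMP_MIN, TEMP_MAX = -40.0, 60.0

missing = df["temperature"].isna()
out_of_range = ~missing & ~df["temperature"].between(TEMP_MIN, TEMP_MAX)

# Active search: report every row that violates a rule instead of waiting
# for downstream failures.
print("Missing values:\n", df[missing])
print("Extreme values:\n", df[out_of_range])
```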

Moore and McCabe [15] emphasized that serious strategic decision errors persist when data quality is poor, leading to low data utilization efficiency. Once data collection is complete, the data are therefore thoroughly checked for errors, and inconsistencies are corrected before future use [16]. Although the importance of data-handling procedures is underlined in good clinical practice and data management guidelines, gaps in knowledge about optimal data handling methodologies and standards for quality data remain [14].

Detecting and correcting corrupted or inaccurate records helps the dataset meet standard data quality. The incorrect, inaccurate, or irrelevant parts of the data are identified, and coarse data are replaced, modified, or deleted [14]. In reality, data can rarely be used as-is and must be prepared before use. Removing anomalies during the data cleansing process is required to achieve higher data quality in preparation. Thus, data cleansing can be defined as assessing the correctness of data and improving it. To enhance data quality, pre-processing data mining techniques are used to understand the data and make it more easily accessible.
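The repair step can be sketched in the same way. The following example, using hypothetical customer records, modifies an inconsistent format, replaces a sentinel value with a proper missing value before imputation, and deletes records that remain unusable; all values and rules are illustrative.

```python
import pandas as pd
import numpy as np

# Hypothetical customer records; values and rules are illustrative only.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "country": ["ZA", "za", "ZA", "??"],  # inconsistent case, unknown code
    "age": [34, -1, 28, 41],              # -1 used as a sentinel for "unknown"
})

# Modify: normalize an inconsistent format.
df["country"] = df["country"].str.upper()

# Replace: turn sentinel values into proper missing values, then impute.
df["age"] = df["age"].replace(-1, np.nan)
df["age"] = df["age"].fillna(df["age"].median())

# Delete: drop records that remain unusable (unknown country code).
df = df[df["country"] != "??"]

print(df)
```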

#### **3.3 Data validation**

Data validation is described as the process of confirming that data has undergone cleaning and is both correct and useful. It is intended to guarantee the fitness and consistency of the data in the dataset. Failures or omissions in data validation can lead to data corruption. Catching data errors early is important because it helps trace the root cause and roll back to a working state [17]. It is therefore important to rely on mechanisms specific to data validation rather than on the detection of second-order effects.

Errors are bound to happen during the data collection process, and data is seldom 100% correct. Data validation helps to minimize erroneous data in the dataset, and data validation rules help organizations follow standards that make it efficient to work with data. Duplicate data, however, poses a challenge to many organizations. Duplication often arises from data entry by the machines and operators that capture data in production. An organization needs a powerful matching solution to overcome the challenge of duplicate records and ensure clean and usable data.
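A simple matching pass can be sketched as follows: exact duplicates are dropped first, and near-duplicates are matched on a normalized key. Production matching solutions typically go further (for example, fuzzy matching), and the records below are invented for illustration.

```python
import pandas as pd

# Hypothetical records captured by different operators and machines.
df = pd.DataFrame({
    "name":  ["Acme Ltd", "ACME LTD ", "Beta Corp", "Beta Corp"],
    "city":  ["Durban", "durban", "Pretoria", "Pretoria"],
    "value": [100, 100, 250, 250],
})

# Exact duplicates are trivial to drop.
df = df.drop_duplicates()

# Near-duplicates need a matching key: normalize case and whitespace first.
df["match_key"] = (df["name"].str.strip().str.lower() + "|" +
                   df["city"].str.strip().str.lower())
deduplicated = df.drop_duplicates(subset="match_key").drop(columns="match_key")

print(deduplicated)
```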

Data validation checks the accuracy and quality of source data, and is usually performed before the data is processed. It can be seen as a form of data cleansing. Data validation ensures that the data is complete (no blanks or empty values), unique (key values are not repeated), and within ranges consistent with expectations. When moving and merging data, it is important to ensure that data from different sources and repositories conform to organizational rules and do not become corrupted by inconsistencies in type or context. Data validation is a general term and can be performed on any data, including data within a single application, such as Microsoft Excel, or when merging simple data within a single data store.
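A minimal sketch of these three checks, assuming a hypothetical orders table with an `order_id` key and an expected quantity range of 1 to 100, might look as follows.

```python
import pandas as pd

# Hypothetical orders table; column names and ranges are assumptions.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "quantity": [5, None, 3, 120],
})

checks = {
    # Complete: no blank or empty values.
    "complete": orders["quantity"].notna().all(),
    # Unique: the key column contains no repeated values.
    "unique_key": orders["order_id"].is_unique,
    # Range: values fall inside the expected interval.
    "in_range": orders["quantity"].dropna().between(1, 100).all(),
}

print(checks)  # {'complete': False, 'unique_key': False, 'in_range': False}
```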

The data validation process is a significant aspect of filtering a large dataset and improving the efficiency of the overall process. However, every technique or process has benefits and challenges; it is therefore crucial to understand them fully. Data handling becomes easier if analysts and users adopt this technique with an appropriate process, and data validation can then provide the best possible outcome for the data. Data validation can be broken down into the following categories: data completeness and data consistency.

#### *3.3.1 Data integrity*

Data integrity refers to the completeness and trustworthiness of the data. For the data to be valid and truly complete, there should be no gaps or missing information. Occasionally incomplete data is unusable; more often it is used despite the missing information, leading to costly errors and miscalculations.

Incomplete data is usually the result of unsuccessful data collection. Completeness denotes the degree to which all required data are available in the dataset [18]. A simple measure of data completeness is the percentage of missing data entries. The true goal of data completeness, however, is not a perfect 100% but to ensure that the data essential to the purpose at hand are present. Completeness is therefore a necessary component of the data quality framework and is closely related to validity and accuracy.
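Measured this way, completeness per column is simply the share of entries that are present. The following sketch uses an invented survey table to illustrate the calculation.

```python
import pandas as pd
import numpy as np

# Hypothetical survey responses; the gaps in 'age' and 'income' are illustrative.
df = pd.DataFrame({
    "respondent": [1, 2, 3, 4, 5],
    "age":        [25, 31, np.nan, 47, 52],
    "income":     [np.nan, 42000, np.nan, 58000, np.nan],
})

# Completeness per column: percentage of entries that are present.
completeness = df.notna().mean() * 100
print(completeness.round(1))
# respondent    100.0
# age            80.0
# income         40.0
```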

#### *3.3.2 Data consistency*

Data consistency means that variables are measured consistently throughout the datasets. This becomes a concern primarily when data are aggregated from multiple sources. Discrepancies in the meaning of data between sources can create inaccurate, unreliable datasets. Since data inconsistency arises from storage formats, semantic expressions, and numerical values, a quantification method can assess the degree of data consistency once that degree has been defined.
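One simple way to quantify this, assuming two hypothetical sources that store the same dates in different formats, is to normalize the formats and report the fraction of records whose values then agree.

```python
import pandas as pd

# Two hypothetical sources describing the same customers with different formats.
source_a = pd.DataFrame({"id": [1, 2, 3],
                         "joined": ["2021-03-01", "2021-07-15", "2022-01-09"]})
source_b = pd.DataFrame({"id": [1, 2, 3],
                         "joined": ["01/03/2021", "15/07/2021", "10/01/2022"]})

# Normalize the storage format before comparing values.
source_a["joined"] = pd.to_datetime(source_a["joined"], format="%Y-%m-%d")
source_b["joined"] = pd.to_datetime(source_b["joined"], format="%d/%m/%Y")

merged = source_a.merge(source_b, on="id", suffixes=("_a", "_b"))

# Degree of consistency: fraction of records whose values agree after normalization.
consistency = (merged["joined_a"] == merged["joined_b"]).mean()
print(f"Consistency: {consistency:.0%}")  # 67% - the third record disagrees
```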

Data consistency can be the difference between business success and failure. Data is the foundation of successful organizational strategic decisions, and inconsistent data can lead to misinformed business decisions. To be confident and successful in their strategic decision-making, organizations must ensure data consistency, especially when aggregating data from multiple internal or external sources without changing their structure.

Data consistency also requires that the data values seen by all instances of an application are the same. Such data belong together and describe a specific process at a specific time, which means the data must remain unchanged during processing and transmission. Synchronization and protection measures help to ensure data consistency during multi-stage processing [19]. Data consistency is essential to the operation of programs, systems, applications, and databases. Locking measures prevent data from being altered by two applications simultaneously and ensure the correct processing order. Controlling simultaneous operations and handling incomplete data are essential for maintaining and restoring data consistency after power failures.
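The locking idea can be illustrated with a minimal sketch (here in Python, using a thread lock around a shared value); it is not tied to any particular system or database discussed in [19].

```python
import threading

balance = 0
lock = threading.Lock()

def deposit(amount: int) -> None:
    global balance
    # The lock ensures only one thread updates the shared value at a time,
    # so the read-modify-write sequence stays consistent.
    with lock:
        current = balance
        balance = current + amount

threads = [threading.Thread(target=deposit, args=(10,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balance)  # always 1000 with the lock in place
```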

#### **3.4 Data preparation**

Data preparation is the process of cleaning and transforming raw data before processing and analysis. It is an important step that often involves reformatting data, correcting data, and combining datasets to enrich the data [20]. Its task is to blend, shape, clean, and consolidate data into one file or data table so it is ready for analytics or other organizational purposes.

To reach the final preparation stage, the data must be cleaned, formatted, and transformed into something digestible by data mining software. This involves a wide range of steps, such as consolidating or separating fields and columns, changing formats, deleting unnecessary or junk data, and making corrections to the data.
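These steps can be illustrated with a short pandas sketch on an invented raw export: one field is separated into two columns, a date format is changed, and a junk column is deleted.

```python
import pandas as pd

# Hypothetical raw export; the column names and junk field are illustrative.
raw = pd.DataFrame({
    "full_name":  ["Jane Doe", "John Smith"],
    "order_date": ["03/05/2023", "17/06/2023"],
    "_debug":     ["x1", "x2"],  # junk column from the export tool
})

prepared = raw.copy()

# Separate one field into two columns.
prepared[["first_name", "last_name"]] = prepared["full_name"].str.split(" ", n=1, expand=True)

# Change formats into something the mining software can digest.
prepared["order_date"] = pd.to_datetime(prepared["order_date"], format="%d/%m/%Y")

# Delete unnecessary or junk data.
prepared = prepared.drop(columns=["full_name", "_debug"])

print(prepared)
```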

In this literature review, several studies have applied data preparation and data mining to messy data in a dataset for future use, but few have examined the data quality check itself. This is the gap this paper addresses, as it aims to review the available data mining preparation methods for messy data. A data preparation framework needs to meet data quality criteria expressed through quality dimensions such as accuracy, completeness, timeliness, and consistency [21]. The data quality check is crucial because it is automated and reports the number of valid, missing, and mismatched values in each column; the result is displayed as a quality indicator above each column in the dataset. Data mining software helps remove errors and inconsistencies from the dataset so that the data quality check percentage is met [22].
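A per-column quality report of this kind can be sketched as follows; the `units_sold` column and its expected numeric type are assumptions for the example.

```python
import pandas as pd

# Hypothetical column with an expected numeric type; '12a' is a mismatched entry.
df = pd.DataFrame({"units_sold": ["10", "25", None, "12a", "7"]})

# Values that cannot be parsed into the expected type become NaN.
parsed = pd.to_numeric(df["units_sold"], errors="coerce")

report = {
    "missing":    int(df["units_sold"].isna().sum()),
    "mismatched": int((df["units_sold"].notna() & parsed.isna()).sum()),
    "valid":      int(parsed.notna().sum()),
}
print(report)  # {'missing': 1, 'mismatched': 1, 'valid': 3}
```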

When the quality check on the dataset reveals problems, it may be better to apply a transformation. These quality checks can create data quality rules that persist and continue to check columnar data against the defined rules. Performing a variety of checks while transforming the data automatically shows the effect of the transformations on the overall quality of the data. Only with high-quality data can an organization provide its various services and achieve top service levels [13].
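A minimal sketch of such a persisted rule, and of measuring quality before and after a transformation, might look as follows; the price limits and the clipping/imputation transformation are illustrative only.

```python
import pandas as pd
import numpy as np

# A persisted quality rule: the share of values that are present and in range.
def quality(series: pd.Series, low: float, high: float) -> float:
    return float(series.between(low, high).mean())

prices = pd.Series([19.99, -3.0, np.nan, 24.50, 999999.0])

before = quality(prices, 0, 10_000)

# Transformation: clip out-of-range values and impute the missing one.
cleaned = prices.clip(lower=0, upper=10_000).fillna(prices.median())

after = quality(cleaned, 0, 10_000)
print(f"quality before: {before:.0%}, after: {after:.0%}")
```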
