## **3. Measuring data quality**

Based on the definitions of data quality, several DQ measurement methods have been developed that can generally be divided into objective and subjective methods. While objective methods tend to evaluate data quality from the perspective of the data producer, based on hard criteria, subjective methods take the users' perspectives and beliefs into account.

### **3.1 Objective DQ measurement methods**

Measurements of data quality are generally intended to assess the dimensions of data quality as defined in the previous section. As a first step, a framework must be set up with the indicators that one wants to assess. Next, a proper reference for verification of the data within the data systems must be determined.

Ideally, the data are compared with real-world data, which allows for validation and, if required, immediate corrective actions. This method is termed *data auditing* and is the only way of measuring the quality level of dimensions such as accuracy and completeness. Furthermore, by going through the data itself, one can discover unexpected data quality issues, which are of great value for taking corrective measures to improve data quality. However, data auditing comes at a high cost, as it is very time-consuming and requires experts in the respective field. Furthermore, data auditing can also be very labor-intensive and requires that data controllers have access to the actual data.

For example, consider the metadata of publications contained in publication databases. If a data controller validates the content of the metadata fields against the metadata as indicated on the publications themselves, inaccuracies can be detected. These can include expected flaws like spelling errors, but can also provide valuable information on unexpected errors that might be highly relevant in the context of bibliometric analyses.
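
To illustrate how such an audit could be operationalized, the minimal Python sketch below compares a small, hypothetical sample of database records against a manually verified reference for the same publications and derives field-level completeness and accuracy scores; the field names and records are assumptions made purely for illustration.

```python
# Minimal data-auditing sketch: compare database metadata against a
# manually verified reference sample (both dictionaries are hypothetical).

database_records = {
    "pub-001": {"title": "Data Qualtiy in Practice", "year": 2018, "doi": "10.1000/xyz123"},
    "pub-002": {"title": "Measuring Research Output", "year": None, "doi": "10.1000/abc456"},
}

verified_reference = {
    "pub-001": {"title": "Data Quality in Practice", "year": 2018, "doi": "10.1000/xyz123"},
    "pub-002": {"title": "Measuring Research Output", "year": 2020, "doi": "10.1000/abc456"},
}

def audit(records, reference):
    """Return per-field completeness and accuracy over the audited sample."""
    fields = ["title", "year", "doi"]
    stats = {f: {"checked": 0, "correct": 0, "filled": 0} for f in fields}
    for pub_id, truth in reference.items():
        record = records.get(pub_id, {})
        for f in fields:
            stats[f]["checked"] += 1
            if record.get(f) is not None:
                stats[f]["filled"] += 1
                if record[f] == truth[f]:
                    stats[f]["correct"] += 1
    return {
        f: {
            "completeness": s["filled"] / s["checked"],
            "accuracy": s["correct"] / s["checked"],
        }
        for f, s in stats.items()
    }

# The misspelled title and the missing year show up as reduced accuracy
# and completeness for those fields.
print(audit(database_records, verified_reference))
```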

If the conditions for data auditing are not met, data controllers can use *rule-based checking* to determine data quality. This method relies heavily on business rules that are drafted based on the domain knowledge and experience that the data controllers have with regard to the data. Consequently, these rules can only check for flaws that were anticipated by the data controllers. However, rule-based checking also offers important advantages, especially as the business rules can be automated after conversion to validation rules, which allows for the identification of errors (or possibly correct outliers!) via data mining techniques. Nevertheless, the presumed errors still need to be corrected, which remains labor-intensive.
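
As an illustration of rule-based checking, the sketch below encodes a few hypothetical business rules as automated validation rules and applies them to publication metadata records; the specific rules, field names and thresholds are assumptions rather than rules prescribed by any particular framework, and flagged records would still require manual review.

```python
import re

# Hypothetical validation rules derived from business rules; each returns
# True when the record passes the rule.
RULES = {
    "doi_format": lambda r: bool(re.fullmatch(r"10\.\d{4,9}/\S+", r.get("doi", ""))),
    "year_plausible": lambda r: isinstance(r.get("year"), int) and 1900 <= r["year"] <= 2030,
    "title_present": lambda r: bool(r.get("title", "").strip()),
}

def check_record(record):
    """Return the names of the rules a record violates."""
    return [name for name, rule in RULES.items() if not rule(record)]

records = [
    {"id": "pub-001", "title": "Data Quality in Practice", "year": 2018, "doi": "10.1000/xyz123"},
    {"id": "pub-002", "title": "", "year": 3018, "doi": "not-a-doi"},
]

for record in records:
    violations = check_record(record)
    if violations:
        # Flagged records still need manual review: a violation may be a
        # genuine error or a correct outlier.
        print(record["id"], "violates:", violations)
```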

### **3.2 Subjective DQ measurement methods**

Some dimensions, however, cannot be measured objectively because of their intrinsic properties. For example, the dimension relevancy pertains to the extent to which data are applicable and helpful for the stated objective. Obviously, this dimension can only be evaluated using the *perception of the users*. Although this results in a subjective scoring, user evaluations are the only way to measure dimensions that describe external data quality attributes. Internal data quality dimensions, on the contrary, are preferably measured using the objective DQ measurement methods described above.
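
A minimal way to turn such user perceptions into a score is sketched below: hypothetical survey responses on a 1–5 scale are aggregated into a mean score per external dimension. The chosen dimensions and rating scale are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical user survey: each respondent rates external DQ dimensions
# on a 1 (poor) to 5 (excellent) scale.
responses = [
    {"relevancy": 4, "interpretability": 3, "believability": 5},
    {"relevancy": 5, "interpretability": 4, "believability": 4},
    {"relevancy": 3, "interpretability": 4, "believability": 4},
]

dimensions = responses[0].keys()
scores = {dim: round(mean(r[dim] for r in responses), 2) for dim in dimensions}
print(scores)  # e.g. {'relevancy': 4.0, 'interpretability': 3.67, 'believability': 4.33}
```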

Regardless of which methodology is chosen to measure data quality, it is always important to provide information about the measurement method and parameters in addition to the dimension under evaluation, so that the measurement results can be interpreted correctly by everyone. Furthermore, although a lot of attention typically goes to correcting errors, it is important to stress that eliminating the root cause should always be the ultimate goal [7].
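
One simple way to follow this recommendation is to store every measurement result together with the method and parameters that produced it, as in the hypothetical record below; the field names and values are illustrative assumptions.

```python
# Hypothetical measurement result that documents the method and its
# parameters alongside the score, so others can interpret it correctly.
measurement = {
    "dimension": "accuracy",
    "score": 0.94,
    "method": "data auditing",          # or "rule-based checking", "user survey"
    "parameters": {
        "sample_size": 200,
        "reference": "metadata printed on the publications",
        "fields": ["title", "year", "doi"],
    },
    "measured_on": "2023-05-10",
}
```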

## **4. Data quality management**

### **4.1 Data quality frameworks**

As data are extremely valuable resources in today's society, a plethora of data quality management frameworks have been published in the last decades that all strive to preserve the quality of data and to make it accessible for future use.

The most popular models are listed below; however, more DQM frameworks that show slight differences can be found throughout the literature.

• DAMA DMBOK's Data governance model [8]

• EWSolutions' EIM Maturity Model [9]

• Oracle's Data Quality Management Process [10]

All frameworks are essentially centered around three basic elements, that is, the metadata associated with the data, the processes involved in the registration, organization and (re)use of the data, and the organizational context in relation to the data (**Figure 1**). The quality of each individual element, as well as the interplay between them, ultimately determines the quality and thus the true value of an organization's data heritage. Ideally, an organization uses metadata standards that are understandable throughout the organization and aligned with the organization's processes, business strategies and goals. Rather than describing all popular frameworks, we will describe the critical success factors that are useful for developing effective DQ management strategies and that can be found in all DQ frameworks.

**Figure 1.**
*The cornerstone of data quality frameworks.*

### **4.2 Critical success factors**

Critical success factors, also termed CSFs, have been defined by Milosevic and Patanakul as '*characteristics, conditions, or variables that can have a significant impact on the success of i.e., a company or a project when properly sustained, maintained, or managed*' [11]. In 2014, Baskarada described 11 CSFs in the field of information quality management that provide valuable means for developing effective DQ management strategies [12]. These CSFs can be clustered into four major groups, that is, training, governance, management and operational processes, that have inter-dependencies with each other.

#### *4.2.1 Operational processes*

The first group of critical success factors deals with the operational processes involved in the collection, storage, analysis and security of the data, which are all highly interdependent. As data is a valuable good, its quality should be managed throughout its entire lifecycle. In practice, this comes down to taking measures that maximize, whenever possible, the **automated capture** of data in **real time**, directly from its **original source**. This minimizes the risk of errors introduced by manual data entry, which can result in typos, inaccuracies, missing values and other erroneous data.
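
As one possible illustration of automated, source-based capture, the sketch below retrieves publication metadata for a given DOI directly from the public Crossref REST API instead of relying on manual entry. The selected fields are an assumption, and a production pipeline would add error handling, retries and provenance logging; the DOI used as an example is that of this chapter.

```python
import requests  # third-party HTTP client (pip install requests)

def fetch_crossref_metadata(doi: str) -> dict:
    """Capture publication metadata directly from the Crossref REST API."""
    response = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    response.raise_for_status()
    message = response.json()["message"]
    date_parts = message.get("issued", {}).get("date-parts", [[None]])
    # Keep only a few illustrative fields; a real pipeline would map many more.
    return {
        "doi": message.get("DOI"),
        "title": (message.get("title") or [None])[0],
        "year": date_parts[0][0] if date_parts and date_parts[0] else None,
        "container": (message.get("container-title") or [None])[0],
    }

if __name__ == "__main__":
    # Example: this chapter's DOI, captured from the registration agency
    # rather than retyped by hand.
    print(fetch_crossref_metadata("10.5772/intechopen.86819"))
```

Capturing the metadata from the registration agency rather than retyping it removes an entire class of manual-entry errors, which is exactly the point of this success factor.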
