### **5.1 Definition of the DQ project**

The DQ improvement workflow starts with defining the scope of the DQ improvement project. This includes the selection of a dataset relevant to a specific business goal and the determination of the data attributes required. When collecting this information, it is very important to discuss the meaning of the required metadata with all stakeholders in order to identify any discrepancies between the interpretation of the required data attributes and the meaning of the existing metadata, as this prevents erroneous data collection, analysis and interpretation. All obtained information should be documented using domain modeling techniques that capture information on the data and the associated operations on the data [19]. Examples of such techniques include Business Process Model and Notation (BPMN) diagrams [20] and data flow diagrams; the resulting information should be contained in data governance tools together with the accompanying semantics. In addition, the data quality dimensions important to the specific use purposes of the data should be determined and, if possible, defined in a measurable manner, which facilitates the further steps in the DQ improvement process.
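
Where such attribute definitions are agreed upon, it can help to record them in a lightweight, machine-readable form next to the data. The sketch below is a hypothetical Python representation of one attribute definition; the fields and example values are illustrative assumptions, not a prescribed data-governance format.

```python
# Hedged sketch: a minimal, machine-readable attribute definition capturing the
# agreed semantics of one data attribute. Field choices and example values are
# illustrative, not a prescribed data-governance schema.
from dataclasses import dataclass, field

@dataclass
class AttributeDefinition:
    name: str                     # technical field name in the dataset
    business_definition: str      # meaning agreed upon with all stakeholders
    datatype: str                 # e.g. "string", "integer", "date"
    owner: str                    # accountable steward for this attribute
    quality_dimensions: list = field(default_factory=list)  # dimensions that matter here

# Hypothetical example for the author-name attribute used later in this section.
author_name = AttributeDefinition(
    name="author_name",
    business_definition="Full name of the publishing researcher, as agreed with the stakeholders",
    datatype="string",
    owner="university library",
    quality_dimensions=["accuracy", "completeness"],
)
```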

For example, consider the use of bibliometric data as part of a researcher's evaluation in the context of career-wise promotions. In order to provide an adequate, high-quality data analysis, a clear framework should be defined by an organization's management comprising what should be evaluated, that is, which publications (books, journals, etc.) and which validation criteria (peer reviewed, group author, etc.) are to be used, as well as the accompanying processes. This information should be discussed with all stakeholders, that is, researchers, librarians, data analysts and IT staff, in order to harmonize the data flow, the accompanying semantics, procedures and models in accordance with the management's goals. Next, the *As Is* situation should be evaluated with regard to these intentions and according to the relevant data dimensions. In bibliometric analyses, accuracy, completeness, timeliness, relevance, accessibility and traceability of the data are all relevant dimensions, of which the accurate and complete collection and analysis of a researcher's published works are the foremost ones.

**Figure 2.** *Data quality improvement workflow.*

### **5.2 Measurement DQ**

In order to determine the quality level of the current data in relation to the organization's objectives, the quality dimensions need to be expressed in a measurable manner. While the internal dimensions can be scored in a quantitative manner by expressing the errors in the data set in terms of magnitude, number of errors or missing records, the external dimensions are measured in a qualitative manner based on the context of the data's use purposes. Independent of the dimension under analysis, measurements must always be relevant for the purpose for which the data will be used and according to the task's requirements. Although in most cases common sense will be used to identify task requirements, in other cases specific techniques like sensitivity analysis might be used, which allow for identifying critical factors and errors in data models [21, 22]. Furthermore, data profiling is another technique frequently used in DQ assessment as a method to discover the true content, structure and quality of data by means of rule-based checking [23]. Obviously, this technique does not find all inaccurate data, as it can only identify violations of the predefined rules, and hence expected errors. For instance, data profiling can identify invalid data values (e.g., using column property analysis), invalid data combinations (e.g., through structure analysis) and inaccurate data (e.g., through value rule analysis). Importantly, data profiling also provides metrics on the data inaccuracies in a dataset, that is, the number of violations, the frequency of invalid data values, etc. Such metrics can be useful as a means to communicate to stakeholders on the (in)accuracy of a data set, and for the follow-up of the progression in subsequent DQ improvement programs.
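
As an illustration, the sketch below shows how a few such rule-based profiling checks and their violation metrics might be expressed. The field names (author_last, year, doi) and the rules themselves are assumptions made for this example, not part of any standard profiling tool.

```python
# Minimal sketch of rule-based data profiling for a list of publication records.
# Field names and rule thresholds are illustrative assumptions, not a prescribed schema.
import re

RULES = {
    "year_in_valid_range": lambda r: r.get("year") is not None and 1900 <= r["year"] <= 2025,
    "author_last_not_empty": lambda r: bool(str(r.get("author_last", "")).strip()),
    "doi_format_valid": lambda r: re.match(r"^10\.\d{4,9}/\S+$", str(r.get("doi", ""))) is not None,
}

def profile(records):
    """Count rule violations per rule and report them as simple DQ metrics."""
    violations = {name: 0 for name in RULES}
    for record in records:
        for name, check in RULES.items():
            if not check(record):
                violations[name] += 1
    total = len(records)
    return {name: {"violations": count, "rate": count / total if total else 0.0}
            for name, count in violations.items()}

if __name__ == "__main__":
    sample = [
        {"author_last": "Doe", "author_first": "Jane", "year": 2018, "doi": "10.1000/xyz123"},
        {"author_last": "", "author_first": "John", "year": 1825, "doi": "not-a-doi"},
    ]
    for rule, metric in profile(sample).items():
        print(rule, metric)
```

The violation counts and rates produced this way are the kind of metrics that can be reported to stakeholders and tracked across successive DQ improvement cycles.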

In our bibliometric example, the accuracy and completeness of the bibliometric records for a given author, collected in a university's database system, should be compared to a publication list provided by the author. By manually auditing the registered data found within the database system, one could indeed record the completeness of the information. Furthermore, the accuracy can be tested using a manual auditing procedure. This allows for the identification of spelling errors, the erroneous exchange of an author's last and first name, etc. In addition, manual auditing also allows for the identification of rather unexpected data entries, like changes in the author's first or last name over time. The latter example of a DQ inaccuracy can, however, not be detected through data profiling, as rule-based checking is unable to test for unexpected errors. Nevertheless, data profiling has an important role in DQ measurement as it allows for automated and thus efficient screening of DQ.
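
A minimal sketch of such an audit, assuming records can be matched on a DOI, could look as follows; a real audit would additionally handle records without identifiers and would involve manual review of the flagged cases.

```python
# Hedged sketch: compare database records against an author-supplied publication
# list to estimate completeness and (title) accuracy. Matching on DOI is an
# assumption made for the example.
def normalize(title: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())

def audit(db_records, author_list):
    db_by_doi = {r["doi"]: r for r in db_records if r.get("doi")}
    missing, mismatched = [], []
    for pub in author_list:
        record = db_by_doi.get(pub["doi"])
        if record is None:
            missing.append(pub)                      # completeness issue
        elif normalize(record["title"]) != normalize(pub["title"]):
            mismatched.append((pub, record))         # accuracy issue
    total = len(author_list)
    return {
        "completeness": 1 - len(missing) / total if total else 1.0,
        "title_accuracy": 1 - len(mismatched) / max(total - len(missing), 1),
        "missing": missing,          # returned for manual follow-up
        "mismatched": mismatched,
    }
```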

### **5.3 Analyzing DQ issues**

Once DQ inaccuracies have been detected, these should be analyzed in order to screen for the potential existence of (groups of) common underlying root causes. For example, author names can have various problems like misspelling, last names mistaken for first names, etc. The grouping of such errors that show similar patterns, also called error cluster analysis, allows for the identification of common causes and is often more efficient in terms of time and resources as compared to handling all inaccuracies in a stand-alone way. In addition, a data event analysis can be performed, which evaluates the time points when data are created and updated in order to facilitate the identification of the root causes of problems. For example, the manual entry of author names in a database system might result in misspelling, while the lack of automated verification in the recording process, the lack of domain-specific knowledge of the persons responsible for recording the data, … might further affect the occurrence of DQ inaccuracies.
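
As a sketch of error cluster analysis, flagged author-name values could be grouped by a coarse error pattern so that common causes stand out. The classification heuristics below are assumptions made for the example, not a standard taxonomy.

```python
# Illustrative sketch of error cluster analysis: group flagged author-name values
# by a coarse error pattern so that shared root causes become visible.
from collections import Counter

def classify(name: str) -> str:
    """Assign one flagged value to a coarse, example-specific error pattern."""
    if not name.strip():
        return "empty value"
    if name != name.strip() or "  " in name:
        return "whitespace problem"
    if "," not in name:
        return "missing 'last, first' separator"
    if any(ch.isdigit() for ch in name):
        return "digits in name"
    return "other"

def cluster_errors(flagged_names):
    """Count how many flagged values fall into each error pattern."""
    return Counter(classify(n) for n in flagged_names)

print(cluster_errors(["", "Doe  Jane", "Doe, J4ne", "Jane Doe"]))
```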

Commonly used techniques to identify root causes include the auditing of the data, the surveying of user perceptions and the evaluation of the data process. The identified causes can then be depicted in cause-and-effect diagrams, also termed Ishikawa or fishbone diagrams [24]. These diagrams cluster causes together in groups, which is instrumental in identifying, classifying and prioritizing the impact of root causes on a problem. In our example, root cause analysis could identify the field 'author name' as a root cause: it is a free string datatype that is completed according to the data provider's own interpretation and accuracy. Because the datatype is set as a string, multiple inaccuracies can occur during the registration process.

### **5.4 DQ improvement trajectories**

In the next phase, the focus resides on finding solutions to eliminate the root cause of the problem. These solutions, also termed remedies, are in fact changes to data systems or processes that prevent data inaccuracies from happening, or that ensure their swift detection upon occurrence. While some solutions might be oriented towards improving the data registration, others might focus on the implementation of validation rules or periodic data profiling. In addition, re-engineering of associated data processes, and even training of the data provider and user community on data quality aspects, should be considered. Data cleansing might be applied as well; however, this mostly is not a solution that eliminates the root cause itself.
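
For instance, a preventive remedy for the author-name field could be an entry-time validation rule. The sketch below assumes a local 'Last, First' convention and is only one possible formulation of such a rule.

```python
# Hedged sketch of a preventive remedy at the registration step: validate an
# author-name entry before it is written to the database, rather than cleansing
# it afterwards. The accepted "Last, First" convention is an assumed local rule.
def validate_author_name(raw: str) -> str:
    """Return a normalized name, or raise ValueError so the entry form can reject it."""
    name = " ".join(raw.split())                      # collapse stray whitespace
    if any(ch.isdigit() for ch in name):
        raise ValueError(f"Author name {raw!r} contains digits")
    last, sep, first = name.partition(",")
    if not sep or not last.strip() or not first.strip():
        raise ValueError(f"Author name {raw!r} does not follow the 'Last, First' convention")
    return f"{last.strip()}, {first.strip()}"
```

Coupled with periodic re-profiling of the table, such a check keeps the most frequent string-field errors from entering the system in the first place.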

Although solutions might be found using common sense, in most cases more effort is needed. A frequently used method encompasses the organization of topic-oriented brainstorm sessions in the presence of all stakeholders. This approach has the benefit of tackling the problem from multiple viewpoints and at the same time enables a higher engagement of the stakeholders. Importantly, all relevant solutions to the problem should be listed and the effects of the proposed solutions should be investigated carefully. In general, continuous, short-term improvements are to be preferred, as these might result in quick wins which can generate additional business benefits (as DQ improvement is mostly not a goal in itself).

In our example, many solutions can be found that focus on improving the correct registration of the author name. However, if an author ID were registered and coupled to an author name, the specific focus on registering the name perfectly in a wide variety of bibliometric sources diminishes. Although this seems an easy solution at first glance, this strategy also includes the re-engineering of business processes, that is, the authentication of research publications by an author using his or her author ID. In order to investigate the effect of this proposed solution, one could investigate, in an experimental setting, the number of publications that can be attributed to a group of authors that has registered and authenticated their research publications versus a group of authors without an author ID (i.e., the control group). By measuring the DQ of both groups in terms of accuracy and completeness, one can see the effect of the proposed solution.
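
A sketch of how such a comparison might be summarized is given below; the per-author scores are placeholder values for illustration only and would in practice come from an audit such as the one sketched in Section 5.2.

```python
# Illustrative sketch of the proposed experiment: compare average accuracy and
# completeness between authors with a registered, authenticated author ID and a
# control group without one. The numbers are placeholder values, not real measurements.
def group_metrics(audits):
    """Average the per-author DQ scores for one group."""
    n = len(audits)
    return {
        "completeness": sum(a["completeness"] for a in audits) / n,
        "accuracy": sum(a["accuracy"] for a in audits) / n,
    }

author_id_group = [{"completeness": 0.98, "accuracy": 0.96},
                   {"completeness": 0.95, "accuracy": 0.97}]
control_group = [{"completeness": 0.81, "accuracy": 0.88},
                 {"completeness": 0.77, "accuracy": 0.90}]

print("author-ID group:", group_metrics(author_id_group))
print("control group:  ", group_metrics(control_group))
```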

Based on all DQ solutions tested, the most appropriate solution(s) should be selected for implementation. It is important to note here that the success of the implementation depends on the guidance provided to all stakeholders. In essence, this comes down to providing information on the solution and its effectuation on

### **5.5 DQ control and follow-up**
