**Figure 1.**

*The cornerstone of data quality frameworks.*


*Data Quality Management*


*DOI: http://dx.doi.org/10.5772/intechopen.86819*



*Scientometrics Recent Advances*

due to misinterpretations, or multiple copies of the same data entry. Such errors have been identified in almost all existing research and innovation databases and have a significant impact on the resulting scientometric analyses. Suppose a highly cited paper is included in the Web of Science with typos in the author's name. This can erroneously lead to the omission of this paper from the bibliometric analyses performed on this author, which in turn can have a major impact on this researcher's career prospects in terms of the chances of success in obtaining grants or promotions.

In addition, these errors can be due to a lack of common **standards** for the concepts contained within the databases and of a uniform interpretation thereof by both information providers and consumers throughout the entire organization. Nevertheless, such standards are available. The Common European Research Information Format (CERIF) is a well-known standard for exchanging research information, created by the euroCRIS organization and widely used throughout Europe [13]; the CASRAI dictionary is a standard created by the Consortia Advancing Standards in Research Administration Information (CASRAI) organization in Canada [14]. Although both communities work closely together to align the concepts and meanings described in their standards, some differences remain, which might cause difficulties in exchanging information between CRIS systems. Furthermore, the inclusion of a standard in the information model of a data system does not guarantee that all data providers apply the standard in the same way, nor that the data users interpret the information as intended.

Next to using standards for aligning the concepts and meanings of research-related data, the formats of the data fields should be standardized as well. A well-known example is the variety of formats in which a (publication) date can be recorded. By standardizing this format in a data system, important gains can be obtained in terms of ease of interpretation of the data, leading to more accurate analyses. However, as described above, efforts should also be made to clarify what the concept of a (publication) date means. For instance, it could refer to the creation date, the submission date, the online publication date, the publication date of the printed version, or the date on which the material was made available.
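As a minimal sketch of the date-format standardization discussed above, the following Python snippet normalizes a few heterogeneous date strings to ISO 8601. The list of candidate formats is illustrative only, not taken from any particular database; a real system would also need to record *which* kind of date (creation, submission, online publication) the value represents.

```python
from datetime import datetime

# Candidate input formats as they might occur across data providers
# (illustrative assumption, not an exhaustive list from any real CRIS).
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%Y"]

def normalize_date(raw: str) -> str:
    """Normalize a heterogeneously formatted date string to ISO 8601."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(normalize_date("March 5, 2019"))  # 2019-03-05
print(normalize_date("05/03/2019"))     # 2019-03-05
```

Note that a bare year such as "2019" silently normalizes to January 1, which is exactly the kind of interpretation issue the surrounding text argues should be made explicit and agreed upon with all stakeholders.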

Furthermore, when storing research-related data, it is highly recommended to provide **traceability** to the raw data, which ensures that the data quality can always be verified. Most bibliometric databases, including the Web of Science and Scopus, comply with this rule by providing a link to the journal article. Research data repositories mostly refer to the creator of the datasets involved. However, over time, researchers can switch positions and thus institutions; as the data are stored in institutional repositories, it would be more meaningful to refer to the research institution in question. In addition, **versioning** should be included when storing research data, as this can be very helpful in understanding and potentially (re)using data. Although versioning is frequently observed in research data repositories, bibliometric and patent databases usually do not offer version control. Finally, **back-up** and **data recovery** processes should be ensured when storing research-related data, which is mostly realized via back-up servers at various physical locations.

Access to research information should be managed using an **information security management** plan in order to safeguard the intellectual property rights of the researchers who created the information and of their respective institutions. Although large bibliometric, innovation and research data repositories control access rights, researchers themselves do not always closely follow the measures taken to control access. Particularly when it comes to research data that may contain sensitive information [15], strict adherence to information security measures is needed, as emphasized by EU Regulation 2016/679, also known as the General Data Protection Regulation (GDPR), which protects natural persons with regard to the processing of personal data and governs the free movement of such data.


Although the GDPR only applies to personal data *in se*, it nicely underpins some elements present in information security management plans.

Such information security management plans indeed entail not only the access rights of individuals, including user authentication and a regular review of their access rights, but also the secure storage, archival, transmission and, if required, destruction of the information. In the case of research data on natural persons, secure handling can be achieved via pseudonymization, for example through encryption, or via anonymization of the research information residing in data systems or on data carriers. Obviously, when transmitting research information, the proper legal agreements should be put in place; non-disclosure agreements with third parties are well-known examples used to secure research information. Finally, information security management plans should also contain audit trails in order to continuously monitor and adjust the security of research-related information.
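One common way to implement the pseudonymization mentioned above is a keyed hash: direct identifiers are replaced by stable pseudonyms, and only the holder of the key could link pseudonyms back to persons. The sketch below is a generic illustration, not the method of any specific CRIS; the key name and record fields are hypothetical.

```python
import hmac
import hashlib

# Hypothetical secret key; in practice it would live in a key vault, separate
# from the data. Because the key permits re-identification, this is
# pseudonymization rather than anonymization in the GDPR sense.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier (e.g., a researcher's name) with a
    stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability

record = {"name": "Jane Doe", "grant_amount": 150000}
safe_record = {**record, "name": pseudonymize(record["name"])}
```

The same input always yields the same pseudonym, so analyses that group or link records per person remain possible on the pseudonymized data.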

### *4.2.2 Management processes*

A second group of CSFs encompasses the managerial processes that are imposed on these operational processes and that are primarily aimed at aligning the data quality with the organization's goals with regard to the data and the resulting data analyses. Consider, for example, the information requirement of a university that wants to monitor the research funds obtained by its researchers. In order to answer this question, the concepts of research funds and researchers should be clear and uniform between information providers and users. Although this might seem straightforward, the interpretation of 'researcher' may well differ between stakeholders, that is, while some might include PhD students, others might omit this group. Furthermore, it could well be that the university does not have a specific label for clustering funds as belonging to the 'research' category, or that the information is only partially provided by the researchers. These examples clearly illustrate that a lack of management of the operational DQ processes has a devastating effect on the data analyses and the conclusions based thereon.

Managerial processes of data quality essentially focus on four sequential activities, that is, the determination of the information quality requirements, the assessment of the risks associated with DQ issues, the assessment or monitoring of DQ, and the continuous improvement of the related DQ processes [16]. First, the **information quality requirements** of the collected data should be determined, considering all stakeholders. Next, a conceptual information model should be drafted using high-level data constructs, generally described in non-technical terms in order to be understandable by executives and managers. This model should then be translated into a logical data model that uses entities, attributes and relationships customized to the organization's use of the data, in terms of the organization's terminology and semantics as well as the prevailing business rules. Finally, the logical model should be handed over to developers, who can derive a physical data model in line with this logical model, including validation rules based upon the business rules that are useful for automating data quality control. Obviously, the constructed models must consider the importance of the data within the organization. For example, certain data will be more important than others, and poor DQ of those data might have a larger negative impact in terms of reputational or financial loss for the organization. The explicit **management of these DQ risks** is a must as a means to guarantee data quality. As stated by Baskarada, '*using gut feeling will result in inefficiency and an ineffective use of resources*' [16].
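The translation of business rules into validation rules for automated data quality control can be sketched as follows. The field names and rules are illustrative assumptions, not taken from any real information model; the point is only that each machine-checkable rule mirrors a business rule agreed with the stakeholders.

```python
# Hypothetical validation rules derived from business rules such as
# "every publication must have a plausible year and a well-formed DOI".
VALIDATION_RULES = {
    "publication_year": lambda v: isinstance(v, int) and 1900 <= v <= 2100,
    "doi": lambda v: isinstance(v, str) and v.startswith("10."),
    "author_count": lambda v: isinstance(v, int) and v >= 1,
}

def validate(record: dict) -> list[str]:
    """Return the fields that are missing or violate a validation rule."""
    return [field for field, rule in VALIDATION_RULES.items()
            if field not in record or not rule(record[field])]

good = {"publication_year": 2019, "doi": "10.5772/intechopen.86819",
        "author_count": 3}
bad = {"publication_year": "2019", "doi": "not-a-doi", "author_count": 3}
```

Here `validate(good)` returns an empty list, while `validate(bad)` flags the year (stored as text rather than an integer) and the malformed DOI, so the record could be rejected and the provider notified.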

Next, a framework of key performance DQ indicators needs to be set up in line with the organization's goals, in order to **assess the DQ performance**. This assessment must be performed on a regular basis in order to allow for the **continuous improvement of data quality** in terms of analyzing the root cause of the errors as well as cleansing erroneous data.
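A typical key performance DQ indicator is a completeness score over a dataset. The following minimal sketch (with invented example records) computes the share of records in which all required fields are filled in; tracked over time, such a score supports the regular assessment described above.

```python
def completeness(records: list[dict], required_fields: list[str]) -> float:
    """Share of records in which every required field is present
    and non-empty — one possible DQ key performance indicator."""
    def is_complete(record: dict) -> bool:
        return all(record.get(f) not in (None, "") for f in required_fields)
    return sum(is_complete(r) for r in records) / len(records)

# Invented publication records for illustration.
records = [
    {"title": "Paper A", "doi": "10.1000/a", "year": 2018},
    {"title": "Paper B", "doi": "", "year": 2019},        # missing DOI
    {"title": "Paper C", "doi": "10.1000/c", "year": None},  # missing year
]
score = completeness(records, ["title", "doi", "year"])  # 1 of 3 complete
```

Analogous ratio-style indicators can be defined for accuracy, timeliness or validity, as long as each dimension is given a measurable definition.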

The application of such DQ managerial processes has already been implemented to some extent in CRIS systems that contain research information. For example, the Flanders Research Information Space, also termed FRIS, is a research information portal maintained by the Department of Economy, Science and Innovation in Flanders, Belgium, that collects research information from a wide range of Flemish stakeholders in the research field, that is, research universities, higher education colleges, strategic research centers and research institutions (www.researchportal.be) [17]. Underlying the FRIS architecture, a conceptual metamodel was developed in order to model all concepts, attributes and relationships contained within FRIS. This conceptual model is based on the CERIF standard, but customized to the Flemish context. In addition, in line with the use purposes of this CRIS system, business rules were drafted to safeguard the quality of the contained information. These business rules were translated into validation rules that are used for the automated quality control of the incoming research information. If non-compliance with these rules is detected, the research information is rejected and the information providers receive a notification, thereby allowing for immediate data cleansing. Furthermore, the Flemish government also performs manual quality checks on a regular basis in order to validate the contained research information, as validation rules in general are not well suited for detecting unpredicted errors. Such errors generally provide valuable input for root cause analyses that can identify important underlying problems caused by human, process, organizational or technological factors.

### *4.2.3 Governance process*

A third group of CSFs encompasses the governance processes associated with DQ management. These processes can be largely summarized as the **commitment of an organization's top management** to set DQ management as a priority and to stimulate a corresponding culture change throughout the entire organization. Gartner Research defined information governance as '*the specification of decision rights and an accountability framework to encourage desirable behavior in the valuation, creation, storage, use, archival and deletion of information*' [18]. In practice, information governance basically comes down to allocating budget and resources to the process of DQ management by defining roles and responsibilities and making agreements on the related concepts, terms and associated DQ processes, including the monitoring, control and improvement thereof. The FRIS system described above has included data governance in order to ensure proper DQ management [17].

### *4.2.4 Training*

Although an organization might have all operational, managerial and governance processes perfectly in place, a complete implementation of DQ management also requires investment in training throughout the organization. A first and foremost important goal is to inform people of the importance of high-quality data to the organization. Secondly, people should be trained, via training programs, course series and mentorships, on the rules set out in the operational, managerial and governance processes in order to ensure a systematic implementation of DQ throughout the entire organization. Finally, continuous follow-up is also needed, which allows for swift adjustments in the case of unpredicted errors, adjustment of business rules, etc.


**5. Data quality improvement**

In order to safeguard the continuous monitoring of data quality and the adoption of measures to improve it, a DQ improvement workflow needs to be established. This workflow essentially comprises a repetitive cycle of five consecutive phases, that is, the definition, measurement, analysis, improvement and control phases, as depicted in **Figure 2**. A best practice is to formalize this data quality improvement process by properly documenting all related processes and activities in each phase, as this allows for the tracking of progress throughout the entire DQ improvement workflow.

**Figure 2.**

*Data quality improvement workflow.*

**5.1 Definition of the DQ project**

The DQ improvement workflow starts with defining the scope of the DQ improvement project. This includes the selection of a dataset relevant to a specific business goal and the determination of the required data attributes. When collecting this information, it is very important to discuss the meaning of the required metadata with all stakeholders in order to identify any discrepancies between the interpretation of the required data attributes and the meaning of the existing metadata, as this prevents erroneous data collection, analysis and interpretation. All obtained information should be documented using domain modeling techniques that capture both the data and the associated operations on the data [19]. Examples of such techniques include Business Process Model and Notation (BPMN) diagrams [20] and data flow diagrams, the resulting information of which should be stored in data governance tools together with the accompanying semantics. In addition, the data quality dimensions important to the specific use purposes of the data should be determined and, if possible, defined in a measurable manner, which facilitates further steps in the DQ improvement process.

For example, consider the use of bibliometric data as part of a researcher's evaluation in the context of career promotions. In order to provide an adequate, high-quality data analysis, the organization's management should define a clear framework comprising what should be evaluated, that is, which publications (books, journals, ...) and validation criteria (peer reviewed, group author, ...) are to be used, as well as the accompanying processes. This information should be discussed with all stakeholders, that is, researchers, librarians, data analysts and IT staff, in order to harmonize the data flow, the accompanying semantics, procedures and models in accordance with the management's goals. Next, the *As Is* situation should be evaluated with regard to these intentions and according to the relevant data dimensions. In bibliometric analyses, accuracy, completeness, timeliness, relevance,
