*Data Integrity and Quality*

*Analysis and Curation of the Database of a Colo-Rectal Cancer Screening Program*
*DOI: http://dx.doi.org/10.5772/intechopen.95899*

**4.2 Colonoscopy**

The variable that indicates whether a colonoscopy was performed (the colonoscopy variable) is also strictly related to the result of the FIT and, naturally, to the variables associated with the colonoscopy. Below we present the inconsistencies found and the steps followed for their curation:

• Records were detected in which the FIT result was negative and the colonoscopy variable took the value "Yes". After studying these cases, it was concluded that the error was likely due to data being overwritten. Since this error does not allow realistic values to be recovered from the record, it was decided to remove these records from the data set.

• Records were detected that had a value in the variable that indicates the result of the colonoscopy (normal colonoscopy, polyps, cancerous lesion, …) and a completed colonoscopy date, and yet the colonoscopy variable did not take the value "Yes". These cases are a clear example of an error that arose because the colonoscopy variable was not defined as mandatory in the database. In these cases the value of the colonoscopy variable was reset to "Yes".

• All records with no value in the colonoscopy variable were reset to 0. These cases arose because no default value was defined for this field in the database.

• Records with no information on colonoscopy-related variables were detected, yet the colonoscopy variable was equal to "Yes". This error was possible because of the lack of constraints in the database. In these cases the colonoscopy variable was reset to "No".

• In the cases in which the colonoscopy variable was equal to "No", the variable that records the reason for the exploration was reset to null.

**4.3 Follow-up**

When all the process information of a record has been filled in, its "end of cycle" variable should be set to "Yes". Therefore, the analysis must take the records that have finished their cycle ("end of cycle" = "Yes"). However, this variable was not defined as mandatory in the database and, therefore, presents inconsistencies that were solved according to the following rules:

• All records with the colonoscopy variable equal to "No" should be closed cycles and, therefore, the variable "end of cycle" should take the value "Yes".

• All records in which the variable that records the reason for the end of the cycle is completed refer to closed cycles of the patient and, therefore, the variable "end of cycle" should take the value "Yes".

• All colonoscopies prior to the last year (2018) with "end of cycle" equal to "No" were considered closed cycles ("end of cycle" = "Yes"), since the information should take less than one year to be filled in.

The variables related to dates are important since they allow the recorded information to be followed up by year. Therefore, their completion should be ensured as far as possible, in particular the date of the sample result and the date of the colonoscopy, which are the most crucial ones. The dates considered for completion are the following: date of invitation to the program, date of interview at primary care, date of reception of the sample, date of result of FIT, date scheduled for the colonoscopy, date of colonoscopy. Chronological completion was done as follows:

• In the records where the FIT result date was not completed, it was set as follows, in order of preference: (1) result date = colonoscopy date − 1 month; (2) result date = sample receipt date; (3) result date = OMI date; (4) result date = colonoscopy schedule date − 1 month; (5) result date = program invitation date + 1 month.

• In the records where the colonoscopy date was not completed, it was established either as the same date as the programmed date or as the result date of the FIT − 1 month.

This section has introduced the most important aspects of the curation process carried out to obtain a clean data set with reliable information for future analyses. Although these cases are specific to this database, they show the wide range of errors that may arise from a poorly designed database.
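
The consistency rules described in Sections 4.2 and 4.3 can be sketched in code. The following is only an illustrative Python sketch, not the procedure actually used on the program's database: the field names (`fit_result`, `colonoscopy`, `colonoscopy_result`, `colonoscopy_date`, `exploration_reason`, `end_reason`, `end_of_cycle`) are hypothetical, the order in which the rules are applied is an assumption, and the reset "to 0" of empty colonoscopy values is assumed here to be equivalent to "No".

```python
from datetime import date
from typing import Optional

LAST_YEAR = 2018  # last year covered by the data, as stated in the text


def curate_record(record: dict) -> Optional[dict]:
    """Return a curated copy of `record`, or None if it must be discarded.

    All field names are hypothetical; the real schema is not given in the text.
    """
    rec = dict(record)  # never modify the caller's record
    colo_result = rec.get("colonoscopy_result")
    colo_date = rec.get("colonoscopy_date")

    # 4.2: negative FIT but colonoscopy "Yes" -> unrecoverable, drop the record.
    if rec.get("fit_result") == "negative" and rec.get("colonoscopy") == "Yes":
        return None

    # 4.2: result and date are recorded -> the colonoscopy was performed.
    if colo_result is not None and colo_date is not None:
        rec["colonoscopy"] = "Yes"
    # 4.2: "Yes" without any colonoscopy information -> reset to "No".
    elif rec.get("colonoscopy") == "Yes" and colo_result is None and colo_date is None:
        rec["colonoscopy"] = "No"
    # 4.2: empty value -> default (the text says 0; "No" is assumed here).
    elif rec.get("colonoscopy") is None:
        rec["colonoscopy"] = "No"

    # 4.2: no colonoscopy -> the reason for the exploration is reset to null.
    if rec["colonoscopy"] == "No":
        rec["exploration_reason"] = None

    # 4.3: no colonoscopy, or an end-of-cycle reason recorded -> closed cycle.
    if rec["colonoscopy"] == "No" or rec.get("end_reason") is not None:
        rec["end_of_cycle"] = "Yes"
    # 4.3: colonoscopy before the last year and still open -> closed cycle.
    elif (colo_date is not None and colo_date.year < LAST_YEAR
          and rec.get("end_of_cycle") == "No"):
        rec["end_of_cycle"] = "Yes"
    return rec
```

Returning a copy and signalling deletion with `None` keeps the original records untouched, in line with the conservative clean-up approach described in this chapter.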
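
The order of preference used to complete a missing FIT result date (Section 4.3) can likewise be illustrated with a short sketch. It is written under stated assumptions: the field names are hypothetical, "1 month" is interpreted as one calendar month (the text does not say whether a fixed number of days was used), and the day is clamped to the length of the target month.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Shift `d` by whole calendar months, clamping the day if needed."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# (hypothetical field name, shift in months), in the order of preference
# given in the text for completing the FIT result date.
FIT_DATE_SOURCES = (
    ("colonoscopy_date", -1),
    ("sample_receipt_date", 0),
    ("omi_date", 0),
    ("colonoscopy_scheduled_date", -1),
    ("invitation_date", +1),
)


def impute_fit_result_date(rec: dict) -> Optional[date]:
    """Return the recorded FIT result date, or the first available estimate."""
    if rec.get("fit_result_date") is not None:
        return rec["fit_result_date"]
    for field, shift in FIT_DATE_SOURCES:
        value = rec.get(field)
        if value is not None:
            return add_months(value, shift)
    return None  # no date available to derive an estimate from
```

For example, a record with only an invitation date of 10 December 2016 would receive 10 January 2017 as its estimated FIT result date.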

### **5. Conclusions**

Good data management, and consequently good data quality, must comply with some basic principles. This is especially important when the data are used to support decision makers in public health services. In this chapter, we analyzed the database of a regional CRC screening program to identify weaknesses in the data collection process and provided some guidelines for future maintenance. We also identified flaws in the database design that may lead to data errors. General and specific recommendations were suggested to meet the requirements of data integrity, consistency and coherence.

However, most of these recommendations are forward-looking: they will improve the quality of future data from the moment they are adopted. At the same time, in order to exploit the information retrospectively, it was necessary to curate the historical data. To do this, a clean-up process was followed in the most conservative way possible, re-establishing values, cleaning up some data and discarding repetitive or non-essential data, with the aim of eliminating as many errors as possible and guaranteeing good-quality data both prospectively and retrospectively. This process is time-consuming and tedious, but it is an essential first step in data governance for extracting reliable knowledge and making correct decisions.

In summary, this analysis showed the importance of data quality and curation for obtaining a robust, consistent and reliable database, as well as the need for a good design of the data acquisition process and, finally, a proper and coherent maintenance system, especially in health systems, where the decisions derived from the analysis of databases may be critical.

#### **Acknowledgements**

We acknowledge the help provided by the Department of Health of the Aragón Government, which granted us access to anonymized patient data, and the EU Program for Employment and Social Innovation (EaSI) for partial financial support.

