**2.2 Entity-relationship model**

The entity-relationship model facilitates the representation of the relationships between the entities. The main objectives of having an entity-relationship model are [14, 15]:


**Figure 1** shows the entity-relationship diagram of our database that represents the relationships between the entities. For simplicity, these relationships are shown

**25**

about 140 fields.

gram are:

**Figure 1.**

*Analysis and Curation of the Database of a Colo-Rectal Cancer Screening Program*

in the diagram by means of the Chen notation [16]. The main relationships among data that may be found in the database of the colo-rectal cancer screening pro-

• One patient belonging to the target population may not have any invitation since their invitation to the program has not been processed yet, or may have received more than one invitation to the program over the years (1–0:n).

• The target patient decides whether (or not) to undergo an FIT and also a

• In each colonoscopy, findings can be detected (or not) such as cancer lesions

The fields for each entity are presented below. In total, the database contains

negative FIT result or colonoscopy findings (0:1–0:1).

colonoscopy if the FIT result is positive (1–0:n).

and/or polyps (1–0:n).

*Entity–relationship model using Chen notation.*

• A patient may (or may not) be excluded from the program (0:1–0:n) for several reasons. As mentioned above, this exclusion may (or may not) be due to a

*DOI: http://dx.doi.org/10.5772/intechopen.95899*

*Analysis and Curation of the Database of a Colo-Rectal Cancer Screening Program DOI: http://dx.doi.org/10.5772/intechopen.95899*

**Figure 1.** *Entity–relationship model using Chen notation.*

in the diagram by means of the Chen notation [16]. The main relationships among data that may be found in the database of the colo-rectal cancer screening program are:


The fields for each entity are presented below. In total, the database contains about 140 fields.

*Data Integrity and Quality*

patients) and BDU (User Database).

CRC, presence of colonic disease, colectomy, irreversible disease (e.g., Alzheimer), previous negative FIT result, or previous negative colonoscopy outcome. This information is automatically dumped into the table from different health system databases, such as OMI-AP (clinical information of patients attended in primary care), CMBDH (clinical information of patients attended in hospital), HP-His (clinical information of ambulatory

• Correspondence: Information from the letters sent to patients along their stay in the program. The process of sending letters is carried out manually by administrative staff through the platform, according to the hospital's criteria, using the target population (60–69 years), excluding those in the exclusion table. Administrative staff is in charge of setting dates, choosing the number of patients to send the letter and gathering positive results. This process is time

• Test: Information about the tests performed on patients throughout the

polyps and cancer lesions detected in the colonoscopy.

The whole information regarding the test procedure, preparation, exploration and findings is analyzed and entered into the platform by health professionals with

In summary, the information in the database comes from different external databases whose information is automatically dumped as well as from data filling by hospital staff with different roles. In these situations where several agents are involved and different information is crossed, we must ensure a good database design, proper data integration and an appropriate data checking and validation.

The entity-relationship model facilitates the representation of the relationships between the entities. The main objectives of having an entity-relationship model

• To allow a high degree of independence between the application/platform and

**Figure 1** shows the entity-relationship diagram of our database that represents the relationships between the entities. For simplicity, these relationships are shown

• To provide a solid basis for addressing data consistency and redundancy.

○ Fecal immunochemical test (FIT): The result of this test comes from several laboratories, whose information is automatically uploaded to the table. This implies the need for a homogenization process for the information provided by the different labs, which might also provoke

○ Colonoscopy: The anatomo-pathological results of this test comes from several pathology laboratories, whose information is translated to the tables by health staff, which may also imply additional errors. Regarding the findings, the tables distinguish between the information about

costly and prone to errors, requiring additional validation.

program. In particular, the tests carried out are the following:

misunderstandings and associated errors.

**24**

different roles.

are [14, 15]:

**2.2 Entity-relationship model**

the internal representation of data.

#### *Data Integrity and Quality*

Patient entity. The patient entity contains 12 attributes with basic patient demographic information in addition to his/her identifier. These fields are related to the date of birth, sex, the round in which the program is at the time in which the patient was enrolled, as well as the place of residence, the health district and the hospital to which the patient belongs.

Exclusion entity. The exclusion entity contains 6 attributes related to the period of exclusion from the program (date of exclusion and, in the case of temporary exclusion, the date of inclusion), the reason for exclusion, a number that determines the priority of exclusion, a binary field that determines whether the exclusion was entered manually and a free text field for comments. If a patient had more than one exclusion, the one with the highest priority prevails. As shown in **Figure 1**, in addition to these fields, the table contains the unique exclusion identifier, the patient identifier, and the test identifier if the exclusion was due to such test.

Correspondence entity. The correspondence entity contains 8 attributes which are as follows: the time when the correspondence was sent (date and time), the type of correspondence sent to the patient (invitation to the program, FIT result, date of scheduled colonoscopy), the round in which the patient was enrolled at the time when the correspondence was received, a binary field that determines if the patient agreed to participate in the program, a binary field that determines if the test recipient was received successfully, a binary field that determines if the patient was included in the program on demand and a free text field for notes. In addition, the table contains the unique identifier of the correspondence and the patient identifier.

Test entity. The entity related to the tests of the screening program contains a large number of attributes (>80). For simplicity, we present here only a high-level description with additional detail for the most relevant aspects. Specifically, the entity contains attributes related to the patient's condition prior to the test and after ending the patient's cycle (round, if the cycle ended, the reason for such ending and the patient's situation after finishing the round).

Regarding the FIT inheritance table, the fields are associated with the date of the interview at primary care, the date of the test and the result of the test. In particular, a field for continuous values of blood concentration in feces (ng/ml), a binary field that determines whether the test was positive and fields that determine whether the sample and the test were correct.

Concerning the colonoscopy inheritance table, it contains a big number of fields related to the following information: basic colonoscopy information (date and time of the scheduled colonoscopy, whether it was performed or not, actual date and time of the colonoscopy and the reason for being performed), colonoscopy preparation (drugs, tolerance, modality, colonic preparation and Boston scale [17]), the process during the colonoscopy (tolerance, which zone was reached, the duration, adequacy of the colonoscopy, etc.), the treatment used during the colonoscopy (type of sedation, type of endoscopic treatment, etc.), the findings found during the colonoscopy (main result: normal colonoscopy, non-neoplastic pathology, polyps, polyposis, cancer, cancer associated with polyposis; risk degree: no risk, low risk, medium risk, high risk and cancer; number of polyps, adenomas, cancer lesions), with the possible complications after the operation (type of complication, whether hospitalization was required and if the patient passed away within the following 30 days…) and possible repetitions of the colonoscopy if required.

Polyp entity. The polyp entity contains, in addition to the identifier of each polyp and the colonoscopy test identifier, 12 attributes concerning the order, size, histology, dysplasia, shape and location of the detected polyp as well as the method for the polypectomy performed, the treatment, the removal performed, etc.

Cancer lesion entity. The cancer lesions entity contains, in addition to the identifier of each lesion and the colonoscopy test, 12 attributes related to the order, size,

**27**

**3.1 Principles**

*Analysis and Curation of the Database of a Colo-Rectal Cancer Screening Program*

correct database design. This is analyzed in the next section.

histology, location of the lesion detected, as well as the stage of the cancer lesion, presence of occluding structure, the type of primary resection and the type of

The entity relation model is therefore clear and well defined. However, we should not forget that, in these situations where several agents are involved and different information is crossed, we must ensure proper data integration from a

An incorrect design of the database and/or the platform often ends up with deficiencies, noise and mistakes in the data, which might prevent a rigorous analysis. In order to have information of sufficient quality to guide appropriate clinical decisions, it is necessary to follow basic principles for data collection and

The underlying overall objective of the chapter, and of this section in particular,

The derived recommendations correspond to prospective improvement actions related to data filling, platform characteristics and database design. However, if the information is intended to be used retrospectively, it is also necessary to carry out an additional data curation action. In the following section we explain this curation

In summary, the underlying objective is to highlight the importance of an effective data governance [18, 19], a concept that refers to the ability of an organization to guarantee high quality data throughout its lifecycle, ensuring principles such as availability, easy use, consistency, integrity and security of data. The data manager

This concept is crucial as organizations rely more and more on data analysis to optimize their processes and to take relevant decisions [20]. In our particular case, quality data are essential to extract statistical information such as the screening program indicators, or to carry out studies with the objective of improving the overall healthcare system. Some examples are the establishment of the cut-off point to undergo a colonoscopy, a risk analysis to identify risk factors and decisions taken

Some of the basic principles, related to data integrity, coherence and consis-

• Information utility: All fields defined in the database (or variable to be intro-

• Maintenance of consistency: The database or data manager must ensure the stability of the information to any change in the procedure/process and/or to

duced in the platform) must be filled for some entity (or record).

is to establish from a general perspective, a set of basic principles regarding the integrity, consistency and coherence of data that must be met by any data management system. In particular, for our case study, we thoroughly analyzed whether each of these principles was met and, if not, we proposed a series of recommendations to mitigate the noise of the data and improve the quality of data management. In this analysis it was fundamental to work in a multidisciplinary team with bio-

*DOI: http://dx.doi.org/10.5772/intechopen.95899*

chemotherapy or radiotherapy if applied.

**3. Analysis and recommendations**

medical, statistical and database experts.

tency, that we analyzed are the following:

any data dump from an external database.

must ensure such data governance principles and processes.

to minimize the undetected lesions, but always based on data evidence.

process in our case study.

management.

histology, location of the lesion detected, as well as the stage of the cancer lesion, presence of occluding structure, the type of primary resection and the type of chemotherapy or radiotherapy if applied.

The entity relation model is therefore clear and well defined. However, we should not forget that, in these situations where several agents are involved and different information is crossed, we must ensure proper data integration from a correct database design. This is analyzed in the next section.
