**3. Results**

*Artificial Intelligence - Applications in Medicine and Biology*

atomic value (date of stay) or other documents (patient, medical benefit, etc.). The base of HSRs is a collection of hospital stays, and every stay corresponds to a single entry of this base. The collection includes documents for certain entities, represented through their structure: the name of the key (medical unit, patient, etc.) and its content, that is, the values of the keys (integer for the stay id, string for the patient or the ICD-10 code, etc.). The remaining entities are represented through structural nesting; for example, the medical benefit is an aggregate of medical-act values. The entity "medical benefit" is an aggregate containing a key (diag), which is itself an aggregate of the pair (key: codediag, value: an ICD-10 code). A document can thus be defined as a hierarchy of elements acting either as atomic values or as nested documents represented by a new set of (attribute, value) pairs. Simple attributes in the HSR make their values atomic, whereas the values of compound attributes are nested documents, as shown in **Figure 2**.
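As a minimal illustration of this hierarchy, a hospital-stay document might look as follows. This is a sketch only: the field names (`idstay`, `medicalbenefit`, etc.) are assumptions inferred from the description, not the actual HEGP schema.

```python
# Hypothetical hospital-stay record (HSR) document. Simple attributes hold
# atomic values; compound attributes hold nested documents, as in Figure 2.
# Field names are illustrative assumptions, not the actual HEGP schema.
stay = {
    "idstay": 424242,                    # atomic value (integer key)
    "datestay": "2015-03-12",            # atomic value (date of stay)
    "patient": {                         # nested document (compound attribute)
        "idpatient": "P-001",
        "medicalunit": "cardiology",
    },
    "medicalbenefit": {                  # aggregate of medical-act values
        "act": {"codeact": "DEQP003"},   # CCAM act code
        "diag": {                        # aggregate under the key "diag"
            "codediag": "I514",          # value is an ICD-10 code
            "typediag": "PD",            # primary diagnosis
        },
    },
}

# Navigating the hierarchy means walking through the nested keys.
print(stay["medicalbenefit"]["diag"]["codediag"])  # → I514
```

Accessing the ICD-10 code requires going through the medical-benefit level, which is exactly the hierarchization of access discussed later in the chapter.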

The relationship between the different entities is translated into nesting. The document model uses the NoSQL system's query language to query, in width and in depth, all entities present in the collection. Both mono- and multi-criteria requests can be carried out. An example could be as follows: for a given diagnostic code and type of diagnosis, give all associated diagnostic codes and act codes.

**2.5 Evaluation**

To evaluate our model, MongoDB was deployed on a single data node. The system resources were 8 GB of RAM, a four-core Intel i5 processor, and a 2 TB hard disk. MongoDB is a NoSQL system using document-oriented storage. The volume of data received by the node for the test is 1.6 million documents, representing 1 year of hospital stays; this volume can be grown to 40 times the initial size. These data are divided into two major groups: encoded and rejected data. The purpose of the rejected data is to expand the database, introduce noise, and provide cases of diagnostic code associations to avoid. Two evaluations were performed. The first measured the model's performance (elapsed time of the multi-criteria request (Query #2) and of the mono-criteria one (Query #1)); the second computed the precision and recall of the model. The initial step was to create two main groups of requests, arranged by dimensionality and selectivity and validated by the business process, so that they can be used in a real context. Dimensionality is the number of different keys of the entities queried ("typediag" and/or "typeact"). Selectivity refers to the degree of data elimination through an aggregate function on the search attribute (code = a CCAM code or an ICD-10 code).

**Figure 2.**
*Document-based schema.*

The document-oriented model of the big data-coding warehouse was implemented in the MongoDB database. Ten separate mono-criteria queries were executed with elapsed times between 75 and 90 milliseconds (ms), while elapsed times between 80 and 110 ms were obtained during the execution of 10 separate multi-criteria queries.

Query #1 requests the data warehouse to display all association codes in which the diagnosis code is equal to "Z092" in the ICD-10 coding system, corresponding to "the pharmacotherapy for other conditions." The associated codes obtained are the diagnostic code "E780," used to code "pure hypercholesterolemia," and the act code "EBQM002," used to code "Doppler ultrasonography of extracranial cervicocephalic arteries, with Doppler ultrasound of lower extremity arteries." The elapsed time of this request was 90 ms (**Table 1**).

**Table 1** presents the results of five request sequences of Query #1, where the requested code is the code being queried, the associated code is the set of obtained associated codes, typology denotes the different types of diagnosis, and elapsed time is the execution time of the request.

Query #2 requests the data warehouse to display all association codes in which the type of diagnosis is the main diagnosis and the diagnosis code is equal to "I51.4" in the ICD-10 coding system, corresponding to "cardiomegaly." The response time obtained with no index is approximately 1900 ms; with the diagnostic code indexed, it drops to approximately 110 ms. The associated codes are "I080," corresponding to "disorders of mitral and aortic valves," and "D721," corresponding to "eosinophilia," and the associated act is "DEQP003," corresponding to "electrocardiography with at least 12 leads" (**Table 2**).

**Table 2** presents the results of five request sequences of Query #2, where the requested code is the code being queried together with its typologies, the associated code is the set of obtained associated codes (diagnostic and act), typology denotes the different types of diagnosis, and elapsed time is the execution time of the request.
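The shape of the two requests can be sketched as MongoDB-style filter documents. The filters below follow MongoDB's dot-notation query syntax, but the collection content and field names are assumptions; for the sketch to be self-contained, the filters are evaluated against an in-memory list rather than a live server.

```python
# Illustrative documents standing in for the warehouse collection
# (field names are assumptions, not the actual HEGP schema).
stays = [
    {"diag": {"codediag": "Z092", "typediag": "PD"},
     "assoc": [{"codediag": "E780"}], "acts": [{"codeact": "EBQM002"}]},
    {"diag": {"codediag": "I514", "typediag": "PD"},
     "assoc": [{"codediag": "I080"}, {"codediag": "D721"}],
     "acts": [{"codeact": "DEQP003"}]},
]

# Query #1 (mono-criteria): a single search attribute.
query1 = {"diag.codediag": "Z092"}
# Query #2 (multi-criteria): diagnosis code AND type of diagnosis.
query2 = {"diag.codediag": "I514", "diag.typediag": "PD"}

def matches(doc, query):
    """Tiny matcher for dotted-path equality filters (MongoDB-style)."""
    for path, expected in query.items():
        node = doc
        for key in path.split("."):
            node = node.get(key, {})
        if node != expected:
            return False
    return True

hits = [d for d in stays if matches(d, query2)]
print([a["codeact"] for a in hits[0]["acts"]])  # → ['DEQP003']
```

On a real deployment, the equivalent call would be along the lines of `collection.find({"diag.codediag": "I514", "diag.typediag": "PD"})`, and indexing the diagnosis-code field is what explains the drop from roughly 1900 ms to 110 ms reported for Query #2.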



**Table 1.**
*Associations of diagnosis codes according to their typology and their elapsed time.*


**Table 2.**
*Associations of diagnosis codes according to their typology and their elapsed time.*


**Table 3.**
*Evaluation results of the big data model.*

The list of associated codes presented in **Tables 1** and **2** is not exhaustive; it can extend to more than 100 codes. We chose to present only a small number.

These results show that the main coding rules have been respected: an associated diagnosis must always be coupled to a Z code declared as the main diagnosis, and associated acts must be linked to the disease declared as the main diagnosis. The presence of related diagnoses demonstrates the quality of the associated codes contained in the data warehouse.

Query #1 and Query #2 were used to compute the precision and the robustness of the model (**Table 3**).
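The metrics reported in Table 3 are the standard confusion-matrix measures; a minimal computation is shown below. The counts used are placeholders for illustration only, not the study's actual values.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall (sensitivity) = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def specificity(tn, fp):
    """Specificity = TN/(TN+FP)."""
    return tn / (tn + fp)

# Placeholder counts for illustration only.
p, r = precision_recall(tp=92, fp=8, fn=13)
print(round(p, 2), round(r, 2))  # → 0.92 0.88
```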

Based on these observations, the least selective queries (those selecting more lines) required a longer execution time. According to our evaluation, the system is bijective and corresponds to the reality of the clinical-activity coding of HEGP: from the document-oriented model, we can recover the initial encoding data, and vice versa. In this regard, everything stored in the big data warehouse corresponds to the reality of the patient. The data warehouse also makes it possible to be more aware of the coding performed in the previous year.

*Using Artificial Intelligence and Big Data-Based Documents to Optimize Medical Coding*
*DOI: http://dx.doi.org/10.5772/intechopen.85749*

Based on the requests defined above and executed using the learn/control database, **Table 3** shows the results of the evaluation provided by the big data model. The 50/50% test yielded 40% precision and 25% recall, versus 92%/87% for the mono-criteria (Query #1) request. For the multi-criteria (Query #2) request, the 80/20% test yielded 80%/70%. Although the test contains some errors, the sensitivity of the conformity computation was 0.8 and its specificity was 0.7. Based on this result, the level of accuracy depended on the number of associated diagnostic codes present in the association of codes.

**4. Discussion and conclusion**

This study investigated the process of implementing a big data-coding warehouse for coding support in a document-oriented NoSQL system. We observed that flexibility is the distinguishing feature of this model, as it allows inserting redundancy into the database. A stay with four ASD codes and one PD code is split into four documents, and the number of duplicated lines grows with the number of associated diagnoses and medical acts. In return, each entity is easier to present within a single, self-contained document. The case of a stay with only a primary diagnosis, one or more associated diagnoses, and/or no medical act can easily be inserted into the database without implementing a generic code to replace the missing one. In most cases, such a generic code is only meant to let the physician understand that no diagnostic code needs to be associated with the medical act. The system is advantageous since the information is complete: the issue of missing data is solved, the information can be handled without any join, and a single read retrieves all the information. Because there is no link between the documents, the collection can be distributed without difficulty, which is an essential part of the construction of a big data-coding warehouse. However, this model also has disadvantages: besides the redundancy, the hierarchization of access does not allow reaching the ICD-10 code information without going through the type of medical benefit. Moreover, the two pseudorandom train/test splits provide effective results, while the purely random split (50/50%) produces wrong results. To generate huge volumes of data, we used the same HSR base and swapped the name ICD-10 for the concept "Obicd10" and CCAM for the concept "Obccam" (Ob for rejected). The rejected data were used to show that, in the optimization process of coding, we learn as much from accepted cases as from rejected cases.

The major interest in building the coding aid data warehouse is to use the huge volumes of coding information from a large number of hospitals, because such a corpus is more exhaustive. The implemented model allows obtaining an optimal combination of codes (diagnoses, acts) for a given reason for care. Because of the way they are structured, relational databases usually scale vertically: a single server has to host the entire database to ensure reliability and continuous availability of data. This gets expensive quickly, places limits on scale, and creates a relatively small number of failure points in the database infrastructure. This is why we propose our model to solve this problem. Indeed, our coding aid data warehouse scales horizontally: several servers host the database, which allows grouping all the data relevant to diagnosis and medical coding in a generic way, enriching the coding data by crossing coding information from other hospital sources, and exploring the code associations more easily. It remains a system subject to expert validation, which does not diminish the richness of the Clinical Data Warehouse (CDW). Our contribution consists of building a specific document-based CDW to propose an "in silico" test framework for enhancing the efficacy of coding-optimization algorithms, such as an algorithm based on a manual decision-making paper [15] and various natural language processing (NLP) tools applied to EHR in-/outpatient summary reports [16].
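The document splitting described in the discussion above (a stay with one PD and four ASD codes becoming four documents) can be sketched as follows. The pairing rule (one document per PD-ASD pair, with the acts duplicated in each) is our reading of the text, and the field names are assumptions.

```python
def split_stay(idstay, pd_code, asd_codes, act_codes):
    """One document per associated diagnosis (ASD): the primary diagnosis
    (PD) and the act codes are duplicated in each document, which is the
    redundancy the model deliberately accepts in exchange for join-free
    reads. Field names are illustrative assumptions."""
    return [
        {"idstay": idstay,
         "diag": {"codediag": pd_code, "typediag": "PD"},
         "assoc": {"codediag": asd, "typediag": "ASD"},
         "acts": act_codes}
        for asd in asd_codes
    ]

# A stay with one PD and four ASD codes yields four documents.
docs = split_stay(1, "I514", ["I080", "D721", "E780", "Z092"], ["DEQP003"])
print(len(docs))  # → 4
```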
