**3.2 ConTra**

ConTra<sup>1</sup> is an open source application developed by some of the authors of this chapter. It is used for mining contradictory data from attributes that follow a many-values pattern, where the contradictory data are objects associated with mutually exclusive attribute values. It enables its users to query the attributes of particular objects whose attribute values are mutually exclusive and to display, in a pie chart, the percentage of the values that contradict the mutual exclusion rule and the percentage of the values that abide by it. Algorithm 1 displays the pseudocode for mining objects whose attribute values are contradictory and those whose attribute values are consistent.

<sup>1</sup> https://github.com/ncjoes/contra

#### *Applications of Pattern Recognition*

ConTra was used to analyse objects in a Comma Separated Values (CSV) dataset containing over a million rows and six columns. The 'Normal Tissue' dataset is deposited in [24] and contains expression profiles for proteins in human tissues. It consists of the following columns: 'Gene', 'Gene name', 'Tissue', 'Cell type', 'Level', and 'Reliability', and it has a size of 79.5 MB. The Normal Tissue dataset reports experiments on tissues and the identified gene expression levels, such as low, medium, and high. It also indicates the annotated cell type ('Cell type') and the gene reliability. The dataset can contain multiple records for the same gene from different investigations (experiments) on the same tissues.
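As a minimal sketch of the kind of check described in this section (not ConTra's actual implementation, and not Algorithm 1), the rows of such a CSV file can be grouped by object and the distinct values of a mutually exclusive attribute collected per object. The function name `find_contradictions`, the choice of ('Gene name', 'Tissue') as the object key, and the toy rows are assumptions made for illustration only; only the column names come from the dataset description above.

```python
import csv
import io
from collections import defaultdict


def find_contradictions(rows, key_cols=("Gene name", "Tissue"), value_col="Level"):
    """Split objects into contradictory and consistent ones.

    An object is identified by the values in key_cols (an assumption made
    for this sketch). It is contradictory if its records report more than
    one distinct value for value_col, and consistent otherwise.
    """
    values = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in key_cols)
        values[key].add(row[value_col])
    contradictory = {k: v for k, v in values.items() if len(v) > 1}
    consistent = {k: v for k, v in values.items() if len(v) == 1}
    return contradictory, consistent


# Invented toy rows that only mimic the six-column layout of the dataset.
SAMPLE = """Gene,Gene name,Tissue,Cell type,Level,Reliability
G1,GENE_A,pancreas,cell type 1,Medium,toy
G1,GENE_A,pancreas,cell type 2,Low,toy
G2,GENE_B,liver,cell type 3,High,toy
G2,GENE_B,liver,cell type 3,High,toy
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
contradictory, consistent = find_contradictions(rows)
print(sorted(contradictory))  # objects reporting more than one level
print(sorted(consistent))     # objects reporting exactly one level
```

For a real file, the `io.StringIO(SAMPLE)` object would be replaced by an open file handle; because the rows are read one at a time and only the distinct levels are stored per object, the same approach should remain workable for a file with over a million rows.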


#### *3.2.1 Evaluation of ConTra*

ConTra was used to analyse the Normal Tissue dataset. Any experiment on a tissue in the Normal Tissue dataset that reports more than one level of expression (such as not detected, low, medium, or high) for its identified gene is inconsistent. As identified through the use of ConTra and discussed in [20], contradictions exist in two of the records (9.09%) of the expression levels of the gene 'TSPAN6' in the tissue 'Pancreas' in the Normal Tissue dataset. This is depicted graphically in **Figure 1**, as adopted from [20]. As **Figure 1** shows, it would be wrong to state that the expression of the gene 'TSPAN6' in the tissue 'Pancreas' follows a single level in the Normal Tissue dataset, because the associated data (TSPAN6 expression levels) contain contradictory cases. Consequently, a holistic analysis of the expression levels of TSPAN6 in Pancreas in the Normal Tissue dataset should depict the existence of the contradictory data, as shown in **Figure 1**.
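The two shares that such a pie chart reports can be computed as in the following sketch. It rests on two assumptions that the text does not confirm: the most frequent level is treated as the reference level, and every record reporting a different level is counted as contradictory. The counts used (20 records at one level, 2 at another) are illustrative values chosen only because 2 of 22 records corresponds to 9.09%; the actual number of TSPAN6/Pancreas records is not stated here. The function name `contradiction_shares` is hypothetical.

```python
from collections import Counter


def contradiction_shares(levels):
    """Return (percent_contradictory, percent_consistent) for one object.

    Assumption for this sketch: the most frequent level is the reference,
    and records reporting any other level count as contradictory.
    """
    if not levels:
        raise ValueError("no records for this object")
    counts = Counter(levels)
    _, majority_count = counts.most_common(1)[0]
    total = len(levels)
    contradictory = 100.0 * (total - majority_count) / total
    return round(contradictory, 2), round(100.0 - contradictory, 2)


# Illustrative counts only: 20 records at one level and 2 at another.
levels = ["Medium"] * 20 + ["Low"] * 2
contra_pct, consistent_pct = contradiction_shares(levels)
print(contra_pct, consistent_pct)  # 9.09 90.91
```

Under these assumptions, the two returned numbers are the slice sizes of the pie chart: the contradictory share and the consistent share always sum to 100%.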

ConTra provides a platform for visualising such inconsistencies in datasets whose objects exhibit a many-attribute-value pattern and are associated with mutually exclusive attribute values.


*Visual Identification of Inconsistency in Pattern DOI: http://dx.doi.org/10.5772/intechopen.95506*

**Figure 1.** *Result of the analysis of the normal tissue dataset by ConTra's multiple attribute values approach.*

**3.3 Datax**

Datax<sup>2</sup> is an open source application that mines missing data and associated patterns from a Comma Separated Values (CSV) dataset. It is designed to enable the visualisation of the missing data in attribute values of a dataset by generating charts which depict the incompleteness and any associated pattern. It has the following features:

• Ability to load and store CSV datasets for further visualisation

• Ability to display the statistics of incomplete data in a stored dataset

• Ability to visualise through matrix or bar plot, amount and distribution (patterns) of missingness in a selected dataset

The user of Datax can select attribute(s) or column(s) of interest from a dataset to visualise the missingness in them. Bar charts are programmed to use white lines to dynamically indicate the missingness in a dataset. Other important parameters measured in Datax include the number of columns in the investigated dataset and the percentage of missingness in each column.

Datax was used to mine incomplete data in an Amazon open source dataset<sup>3</sup>. The Amazon dataset has a size of 365.82 MB. It contains a list of over 1,500 consumer reviews of Amazon products such as the Kindle and Fire TV Stick as provided by Datafiniti's Product Database<sup>4</sup>. It has a total of 27 columns which include basic product information such as rating, review text, and more for each product. It also has a total of 1598 rows.

<sup>2</sup> https://github.com/marioJoker/Datax

<sup>3</sup> https://data.world/datafiniti/consumer-reviews-of-amazon-products

<sup>4</sup> https://datafiniti.co/products/product-data/

#### *3.3.1 Evaluation of Datax*

Datax was used to analyse the Amazon product review dataset as provided by Datafiniti's Product Database. **Figure 2** depicts the evaluation pane and shows a sneak peek into the first five rows of the investigated dataset while the right side
