**3.3 Datax**

*Applications of Pattern Recognition*

pattern where the contradictory data are objects associated with mutually exclusive attribute values. It enables its users to query the attributes of particular objects whose attribute values are mutually exclusive and to display, in a pie chart, the percentage of values that contradict the mutual exclusivity rule and the percentage that abide by it. Algorithm 1 gives the pseudocode for mining objects whose attribute values are contradictory and those whose attribute values are consistent.

Algorithm 1: ConTra's Algorithm for mining contradictory and consistent data as evident in [11]

1. Given a set of records in CSV format
2. Let G = set of objects from a selected column
3. Let M = set of attributes (titles of every column excluding the object column)
4. Let O(a, b) = empty list, where a = contradictory object index and b = contradictory attribute values
5. Let C(c, d) = empty list, where c = consistent object index and d = consistent attribute values
6. For each object 'g' in the set of objects 'G' which is associated with a set of mutually exclusive attributes 'M'
   i. If 'M' contains more than one mutually exclusive value then
      Store 'g' and also store each of the contradictory values in the list O(a, b)
   ii. Else
      Store 'g' in the set of consistent objects and also store each of the consistent values in the list C(c, d)
   End
7. Print contradictory objects O(a, b) and consistent objects C(c, d)

*3.2.1 Evaluation of ConTra*

ConTra was used to analyse objects in a Comma Separated Values (CSV) dataset containing over a million rows and six columns. The 'Normal Tissue' dataset is deposited in [24] and contains expression profiles for proteins in human tissues. It consists of the following columns: 'Gene', 'Gene name', 'Tissue', 'Cell type', 'Level', and 'Reliability'. It has a size of 79.5 MB. The Normal Tissue dataset reports experiments on tissues and the identified gene expression levels, such as low, medium, and high. It also indicates the annotated cell type ("Cell type") and the gene reliability. There can be multiple records for the same gene from different investigations (experiments) on the same tissues in the Normal Tissue dataset.

ConTra was used to analyse the Normal Tissue dataset. Any experiment on a tissue in the Normal Tissue dataset which indicates that its identified gene expresses more than one level of expression, such as not detected, low, medium, or high, is inconsistent. As identified through the use of ConTra and discussed in [20], contradictions exist in two of the records (9.09%) of the expression levels of the gene 'TSPAN6' in the tissue 'Pancreas' in the Normal Tissue dataset. This is depicted graphically in **Figure 1**, as adopted from [20]. As **Figure 1** shows, it would be wrong to state that the expression of the gene 'TSPAN6' in the tissue 'Pancreas' in the Normal Tissue dataset is of one particular level, because there are cases of contradictory expression in the associated data (TSPAN6 expression levels). Consequently, a holistic analysis of the expression levels of TSPAN6 in Pancreas in the Normal Tissue dataset should depict the existence of the contradictory data, as shown in **Figure 1**.

ConTra provides a platform for visualising such inconsistencies in datasets whose objects exhibit a many attribute value pattern and are associated with mutually exclusive attribute values.
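The steps of Algorithm 1 can be illustrated with a minimal Python sketch. It is not ConTra's actual implementation: the function names `mine_contradictions` and `value_shares`, the choice of a gene and tissue pair as the "object", and the sample rows are all assumptions made for illustration. Only the column names 'Gene name', 'Tissue', and 'Level' come from the Normal Tissue dataset described in the text.

```python
import csv
import io
from collections import defaultdict


def mine_contradictions(rows, object_cols, attribute_col):
    """Split objects into contradictory and consistent ones.

    Assumption: an object is the tuple of values in `object_cols`, and it is
    contradictory when more than one distinct value of `attribute_col`
    (a mutually exclusive attribute) is recorded for it.
    """
    values = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in object_cols)
        values[key].append(row[attribute_col])

    contradictory, consistent = {}, {}
    for key, vals in values.items():
        if len(set(vals)) > 1:      # step 6.i: more than one exclusive value
            contradictory[key] = vals
        else:                       # step 6.ii: a single value, so consistent
            consistent[key] = vals
    return contradictory, consistent


def value_shares(vals):
    """Percentage of records per attribute value (the input to a pie chart)."""
    total = len(vals)
    return {v: round(100 * vals.count(v) / total, 2) for v in sorted(set(vals))}


# Made-up rows that only mimic three of the Normal Tissue column names.
SAMPLE = """Gene name,Tissue,Level
GENE_A,pancreas,High
GENE_A,pancreas,High
GENE_A,pancreas,Low
GENE_B,liver,Medium
GENE_B,liver,Medium
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
contradictory, consistent = mine_contradictions(
    rows, ["Gene name", "Tissue"], "Level"
)
for key, vals in contradictory.items():
    print(key, value_shares(vals))
```

Grouping by a key and counting distinct values keeps the sketch to a single pass over the records, which matters for a file with over a million rows. Whether 'Cell type' should also be part of the object key is a modelling decision the sketch leaves open.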

Datax<sup>2</sup> is an open-source application that mines missing data and associated patterns from a Comma Separated Values (CSV) dataset. It is designed to enable the visualisation of missing data in the attribute values of a dataset by generating charts that depict the incompleteness and any associated pattern. It has the following features:


The user of Datax can select attribute(s) or column(s) of interest from a dataset to visualise the missingness in them. Bar charts are programmed to use white lines to dynamically indicate the missingness in a dataset. Other important parameters measured in Datax include the number of columns in the investigated dataset and the percentage of missingness in each column.
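The two parameters mentioned above, the number of columns and the percentage of missingness in each column, can be computed along the following lines. This is a hedged sketch and not Datax's own code: the function name `missingness_by_column`, the rule that an empty or whitespace-only cell counts as missing, and the sample column names and rows are assumptions for illustration.

```python
import csv
import io


def missingness_by_column(rows, columns, missing_tokens=("",)):
    """Return {column: percentage of rows whose value is missing}.

    Assumption: a cell is missing when it is absent or, after stripping
    whitespace, equals one of `missing_tokens`.
    """
    total = len(rows)
    report = {}
    for col in columns:
        missing = sum(
            1 for r in rows if (r.get(col) or "").strip() in missing_tokens
        )
        report[col] = round(100 * missing / total, 2) if total else 0.0
    return report


# Made-up review rows; the column names are hypothetical.
SAMPLE = """name,rating,review_text
Kindle,5,Great
Fire TV Stick,,Works well
Kindle,4,
Kindle,,
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
columns = reader.fieldnames

report = missingness_by_column(rows, columns)
print("columns:", len(columns))
for col, pct in report.items():
    print(f"{col}: {pct}% missing")
```

Passing only a subset of `columns` corresponds to selecting the attributes of interest. Other markers of missing data, such as "NA", can be added through `missing_tokens`.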

Datax was used to mine incomplete data in an Amazon open-source dataset<sup>3</sup>. The Amazon dataset has a size of 365.82 MB. It contains a list of over 1,500 consumer reviews of Amazon products such as the Kindle and Fire TV Stick, as provided by Datafiniti's Product Database<sup>4</sup>. It has a total of 27 columns, which include basic product information, rating, review text, and more for each product. It also has a total of 1598 rows.
