**Data Mining and Neural Networks: The Impact of Data Representation**

Fadzilah Siraj, Ehab A. Omer A. Omer and Md. Rajib Hasan

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51594

## **1. Introduction**


The extensive use of computers and information technology has led toward the creation of extensive data repositories from a very wide variety of application areas [1]. Such vast data repositories can contribute significantly towards future decision making provided appropriate knowledge discovery mechanisms are applied for extracting hidden, but potentially useful information embedded into the data [2].

Data mining (DM) is one of the phases in knowledge discovery in databases. It is the process of extracting useful information and knowledge from data that is abundant, incomplete, ambiguous and random [3], [4], [5]. DM is defined as an automated or semi-automated exploratory analysis of large, complex data sets that can be used to uncover patterns and relationships in data, with an emphasis on large observational databases [6]. Modern statistical and computational technologies are applied to the problem in order to find useful patterns hidden within a large database [7], [8], [9]. To uncover hidden trends and patterns, DM uses a combination of an explicit knowledge base, sophisticated analytical skills, and domain knowledge. In effect, the predictive models formed from the trends and patterns through DM enable analysts to produce new observations from existing data. DM methods can also be viewed as a combination of statistical computation, artificial intelligence (AI) and database approaches [10]. However, these methods do not replace traditional statistics; rather, they extend traditional techniques. For example, DM techniques have been applied to uncover hidden information and predict future trends in financial markets. Competitive advantages achieved by DM in business and finance include increased revenue, reduced cost, and improved marketplace responsiveness and awareness [11]. DM has also been used to derive new information that could be integrated into decision support, forecasting and estimation to help businesses gain competitive advantage [9]. In higher educational institutions, DM can be used to uncover hidden trends and patterns that help in forecasting students' achievement. For instance, by using a DM approach, a university could predict students' graduation status (whether or not students will graduate), together with the accuracy of that prediction, as well as a variety of outcomes such as transferability, persistence, retention, and course success [12], [13].

© 2012 Siraj et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


The objective of this study is to investigate the impact of various data representations on predictive data mining models. In a prediction task, a particular predictive model might give the best result for one data set but a poor result for another, even though the two data sets contain the same data in different representations [14], [15], [16], [17]. This study focuses on two predictive data mining models that are commonly used for prediction purposes, namely the neural network (NN) and the regression model. A medical data set (Wisconsin Breast Cancer) and a business data set (German credit), both of which have Boolean targets, are used to investigate the impact of various data representations on predictive DM models. Seven data representations are employed in this study: As\_Is, Min Max normalization, standard deviation normalization, sigmoidal normalization, thermometer representation, flag representation and simple binary representation.

This chapter is organized as follows. Section 2 describes data mining, and Section 3 describes data representation. The methodology and the experiments for carrying out the investigations are covered in Section 4. The results are discussed in Section 5. Finally, the conclusion and future research are presented in Section 6.

## **2. Data mining**

It is well known that DM is capable of providing highly accurate information to support decision making and forecasting in science, physiology, sociology, the military and business [13]. DM is a powerful technology with great potential: it helps users focus on the most important information stored in data warehouses or streamed through communication lines. DM has the potential to answer questions that were very time-consuming to resolve in the past. In addition, DM can predict future trends and behavior, allowing us to make proactive, knowledge-driven decisions [18].

NN, decision trees, and logistic regression are three classification models that are commonly used in comparative studies [19]. These models have been applied to a prostate cancer data set obtained from the SEER (Surveillance, Epidemiology, and End Results) program of the National Cancer Institute. The results from the study show that NN performed best, with the highest accuracy, sensitivity and specificity, followed by decision tree and then logistic regression. Similar models have been applied to detect credit card fraud. The results indicate that NN gives better performance than logistic regression and decision tree [20].

## **3. Data representation**

Data representation plays a crucial role in the performance of NN, "especially for the applications of NNs in a real world." In a data representation study, [14] used NNs to extrapolate the presence of mercury in human blood from animal data. The effect of different data representations such as *As-is, Category, Simple binary, Thermometer*, and *Flag* on the prediction models was investigated. The study concludes that the *Thermometer* data representation using NN performs extremely well.

[16], [21] used five different data representations (*Maximum Value*, *Maximum and Minimum Value*, *Logarithm*, *Thermometer* (powers of 10), and *Binary* (powers of 2)) on a set of data to predict maize yield at three scales in east-central Indiana in the Midwest USA [17]. The data consisted of weather data and yield data at farm, county and state levels from 1901 to 1996. The results indicate that data representation has a significant effect on NN performance.

In another study, [21] investigated the effect of data representation formats such as *Binary* and *Integer* on the classification accuracy of a network intrusion detection system. Three data mining techniques (rough sets, NN and inductive learning) were applied to the binary and integer representations. The experimental results show that the different data representations did not cause a significant difference in classification accuracy. This may be because the same phenomena were captured and merely put into different representation formats [21]. In addition, the data consisted primarily of discrete values of qualitative variables (system class), and different results could be obtained if the values were continuous variables.

Numerical encoding schemes (*Decimal Normalization and Split Decimal Digit representation*) and bit pattern encoding schemes (*Binary representation, Binary Code Decimal representation, Gray Code representation, Temperature code representation, and Gray Coded Decimal representation*) were applied to the Fisher Iris data, and the performance of the various encoding approaches was analyzed. The results indicate that the encoding approach affects the training errors (such as maximum error and root mean square error), and that encoding methods that use more input nodes to represent a single parameter result in lower training errors. Consequently, the work of [22] laid an important foundation for later research on the effect of data representation on classification performance using NN.

[22] conducted an empirical study based on a theoretical basis provided by [15] to support the finding that input data manipulation could improve neural learning in NN. In addition, [15] evaluated the impact of modified training sets and how the learning process depends on the data distribution within the training sets. NN training was performed on an input data set arranged so that three different sets were produced, each with a different number of occurrences of 1's and 0's. *Temperature Encoding* was then applied to the three data sets, which were used to train the NN again. The results show that employing *Temperature Encoding* on the data sets improves the training process by significantly reducing the number of epochs (iterations) needed for training. The findings of [15] indicate that changing the input data representation affects the performance of an NN model.

## **4. Methodology**


The methodology for this research is adapted from [14] and applies different data representations to the data sets. The steps involved in carrying out the study are shown in Figure 1 [14]. The study starts with data collection, followed by the data preparation stage, the analysis and experiment stage, and finally the investigation and comparison stage.


**Figure 1.** Steps in carrying out the study

## **4.1. Data collection**

At this stage, the data sets were acquired from the UCI Machine Learning Repository, which can be accessed at http://archive.ics.uci.edu/ml/datasets.html. The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for conducting empirical studies of machine learning algorithms. Two data sets were obtained from UCI: the Wisconsin Breast Cancer data set and the German credit data set.

## **4.2. Data preparation**

After the data has been collected in the previous stage, data preparation is performed to ready the data for the experiments in the next stage. Each attribute is examined and missing values are treated prior to training.

#### *4.2.1. Data description*

In this study, two sets of data are used, namely Wisconsin Breast Cancer and German Credit. Each data set is described in detail in the following subsections.

#### *4.2.1.1. Wisconsin breast cancer data set*

The Wisconsin breast cancer data set originates from the University of Wisconsin Hospitals, Madison, and was donated by Dr. William H. Wolberg. Each instance in the data set represents one patient record. Each record comprises information about a breast cancer patient whose cancer condition is either benign or malignant. The data set contains a total of 699 cases with nine attributes (excluding Sample Code Number) that represent the independent variables and one attribute, Class, that represents the output or dependent variable.

Table 1 lists each attribute in the data set together with its code (the short form of the attribute name), its type (the data type of the attribute), its domain (the possible range of values) and, in the last column, the number of missing values. As Table 1 shows, only one attribute, Bare Nuclei, has missing values (a total of 16 instances).


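| No | Attribute description | Code | Type | Domain | Missing value |
|---|---|---|---|---|---|
| 1 | Sample code number | CodeNum | Continuous | Id number | 0 |
| 2 | Clump Thickness | CTHick | Discrete | 1 – 10 | 0 |
| 3 | Uniformity of Cell Size | CellSize | Discrete | 1 – 10 | 0 |
| 4 | Uniformity of Cell Shape | CellShape | Discrete | 1 – 10 | 0 |
| 5 | Marginal Adhesion | MarAd | Discrete | 1 – 10 | 0 |
| 6 | Single Epithelial Cell Size | EpiCells | Discrete | 1 – 10 | 0 |
| 7 | Bare Nuclei | BareNuc | Discrete | 1 – 10 | 16 |
| 8 | Bland Chromatin | BLChr | Discrete | 1 – 10 | 0 |
| 9 | Normal Nucleoli | NormNuc | Discrete | 1 – 10 | 0 |
| 10 | Mitoses | Mito | Discrete | 1 – 10 | 0 |
| 11 | Class | Cl | Discrete | 2 for benign, 4 for malignant | 0 |

**Table 1.** Attribute of Wisconsin Breast Cancer Dataset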

In terms of the patients' condition, 65.5% (458) of the cases are benign and the remaining 34.5% (241) are malignant.

#### *4.2.1.2. German credit dataset*


The German credit data set classifies applicants as good or bad credit risks based upon a set of attributes specified by financial institutions. The original data set, provided by Professor Hofmann, contains categorical and symbolic attributes. A total of 1000 instances are provided, with 20 attributes excluding the German Credit Class (Table 2). The applicants are classified as good credit risk (700) or bad (300), and there are no missing values in this data set.




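| No. | Attribute description | Code | Type | Domain | Missing value |
|---|---|---|---|---|---|
| 1 | Status of existing checking account | SECA | Discrete | 1, 2, 3, 4 | 0 |
| 2 | Duration in month | DurMo | Continuous | 4 – 72 | 0 |
| 3 | Credit history | CreditH | Discrete | 0, 1, 2, 3, 4 | 0 |
| 4 | Purpose | Purpose | Discrete | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | 0 |
| 5 | Credit amount | CreditA | Continuous | 250 – 18424 | 0 |
| 6 | Savings account/bonds | SavingA | Discrete | 1, 2, 3, 4, 5 | 0 |
| 7 | Present employment since | EmploPe | Discrete | 1, 2, 3, 4, 5 | 0 |
| 8 | Instalment rate in percentage of disposable income | InstalRate | Continuous | 2 – 4 | 0 |
| 9 | Personal status | PersonalS | Discrete | 1, 2, 3, 4, 5 | 0 |
| 10 | Other debtors / guarantors | OtherDep | Discrete | 1, 2, 3 | 0 |
| 11 | Present residence since | PresentRe | Discrete | 1 – 4 | 0 |
| 12 | Property | Property | Discrete | 1, 2, 3, 4 | 0 |
| 13 | Age in years | Age | Continuous | 19 – 75 | 0 |
| 14 | Other instalment plans | OtherInst | Discrete | 1, 2, 3 | 0 |
| 15 | Housing | Housing | Discrete | 1, 2, 3 | 0 |
| 16 | Number of existing credits at bank | NumCBnk | Discrete | 1, 2, 3 | 0 |
| 17 | Job | Job | Discrete | 1, 2, 3, 4 | 0 |
| 18 | Number of people being liable to provide maintenance for | Numppl | Discrete | 1, 2 | 0 |
| 19 | Telephone | Telephone | Discrete | 1, 2 | 0 |
| 20 | Foreign worker | ForgnWor | Discrete | 1, 2 | 0 |
| 21 | German Credit Class | GCL | Discrete | 1 good, 2 bad | 0 |

**Table 2.** Attribute of German Credit Dataset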

#### *4.2.2. Data cleaning*

Before the collected data can be used, missing values should be identified. Several methods can be used to handle missing values, such as deleting the affected attributes or instances, replacing the missing values with the mean value of the attribute, or ignoring the missing values. Which action is appropriate depends upon the data that has been collected.

The German credit data set has no missing values (refer to Table 2); therefore, no action was taken on it. The Wisconsin breast cancer data set, on the other hand, has 16 missing values in the attribute Bare Nuclei (see Table 1). These missing values were resolved by replacing them with the mean value of the attribute. The mean value of this attribute is 3.54; because the attribute is categorical, the value was rounded to 4. All the missing values were therefore replaced by the value 4.
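The mean-replacement step described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `impute_with_rounded_mean` and the toy values are assumptions made for the example and are not taken from the study or from the actual Bare Nuclei column.

```python
import math

def impute_with_rounded_mean(values):
    """Replace missing entries (None) with the attribute mean, rounded half up."""
    observed = [v for v in values if v is not None]
    mean_value = sum(observed) / len(observed)
    # Round to the nearest integer because the attribute is categorical.
    fill = math.floor(mean_value + 0.5)
    return [fill if v is None else v for v in values]

# Toy values for illustration only (not the real Bare Nuclei data).
toy_bare_nuclei = [1, 10, None, 4, 1, None, 2]
print(impute_with_rounded_mean(toy_bare_nuclei))
```

With the study's reported mean of 3.54, the same rounding rule gives the fill value 4.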

## **4.3. Analysis and experiment**

The data representations used for the experiments are described in the following subsections.

#### *4.3.1. Data representation*


Each data set has been transformed into data representation identified for this study, namely As\_Is, Min Max Normalization, Standard Deviation Normalization, Sigmoidal Normalization, Thermometer Representation, Flag Representation and Simple Binary Representation. In As\_Is representation, the data remain the same as the original data without any changes. The Min Max Normalization is used to transform all values into numbers between 0 and 1. The Min Max Normalization applies linear transformation on the raw data, keeping the relationship to the data values in the same range. This method does not deal with any possible outliers in the future value, and the min max formula [25] is written in Eqn. (1).

$$V' = \frac{v - \operatorname{Min}(v(i))}{\operatorname{Max}(v(i)) - \operatorname{Min}(v(i))} \tag{1}$$

where V' is the new value, Min(v(i)) is the minimum value of a particular attribute, Max(v(i)) is the maximum value of that attribute and v is the old value.
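As a minimal sketch of Eqn. (1), the following Python function rescales one attribute. The function name and sample values are illustrative, and returning zeros for a constant attribute is an assumption made here to avoid division by zero; the source does not discuss that case.

```python
def min_max_normalize(values):
    """Linearly rescale the values of one attribute into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute is mapped to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_normalize([1, 4, 10])  # minimum maps to 0.0, maximum to 1.0
```

A later value above 10 or below 1 would fall outside [0, 1], which is the outlier limitation noted above.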

*Standard Deviation Normalization* is a technique based on the mean value and standard deviation of each attribute in the data set. For a variable v, the mean value Mean(v) and the standard deviation Std\_dev(v) are calculated from the data set itself. The standard deviation normalization formula [25] is written as in Eqn. (2).

$$V' = \frac{v - \text{mean}(v)}{\text{std\_dev}(v)} \tag{2}$$

where

$$\text{mean}(v) = \frac{\text{sum}(v)}{n}, \qquad \text{std\_dev}(v) = \sqrt{\frac{\text{sum}(v^2) - \text{sum}(v)^2 / n}{n - 1}}$$
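The following Python sketch applies Eqn. (2) using the sample standard deviation in its computational form, that is, the square root of (sum of squares minus squared sum over n) divided by (n - 1). Function names and sample values are illustrative; raising an error for a zero standard deviation is an assumption added here.

```python
import math


def std_dev(values):
    """Sample standard deviation via the sum and sum-of-squares form."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return math.sqrt((total_sq - total * total / n) / (n - 1))


def std_dev_normalize(values):
    """Centre each value on the mean and scale by the standard deviation."""
    mean = sum(values) / len(values)
    sd = std_dev(values)
    if sd == 0:
        raise ValueError("attribute is constant; standard deviation is zero")
    return [(v - mean) / sd for v in values]


z = std_dev_normalize([2, 4, 6])  # mean 4, standard deviation 2
```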

*Sigmoidal Normalization* transforms the input data nonlinearly into the range between -1 and 1 using a sigmoid function. It uses the mean value and standard deviation calculated from the input data. Data points within one standard deviation of the mean are mapped to the approximately linear region of the sigmoid, while outlier points are compressed along the tails of the sigmoidal function. The sigmoidal normalization formula [25] is given by Eqn. (3).

$$V' = \frac{1 - e^{-\alpha}}{1 + e^{-\alpha}} \tag{3}$$

where

$$\alpha = \frac{v - \text{mean}(v)}{\text{std\_dev}(v)}, \qquad \text{mean}(v) = \frac{\text{sum}(v)}{n}, \qquad \text{std\_dev}(v) = \sqrt{\frac{\text{sum}(v^2) - \text{sum}(v)^2 / n}{n - 1}}$$
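A hedged Python sketch of sigmoidal normalization follows. It assumes the sigmoid takes the common bipolar form V' = (1 - e^(-α)) / (1 + e^(-α)), which equals tanh(α/2) and maps into the open interval (-1, 1) described above; the function name and sample values are illustrative.

```python
import math


def sigmoidal_normalize(values):
    """Map values into (-1, 1) through a bipolar sigmoid of the standardized value."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    result = []
    for v in values:
        alpha = (v - mean) / sd  # standardized value
        # Assumed sigmoid form; algebraically equal to tanh(alpha / 2).
        result.append((1.0 - math.exp(-alpha)) / (1.0 + math.exp(-alpha)))
    return result


squashed = sigmoidal_normalize([2, 4, 6])  # standardized values are -1, 0 and 1
```

Values near the mean change almost linearly, while extreme values are pushed toward -1 or 1 and therefore have limited influence.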

In the *Thermometer* representation, each categorical value is converted into a binary form prior to analysis. For example, if the range of values for a category field is 1 to 6, the value 4 is represented in thermometer format as "111100" [15].


In the *Flag* format, the digit 1 is placed in the binary position that corresponds to the value. Thus, under the same assumption that the values of a category field range from 1 to 6, the value 4 is represented in *Flag* format as "000100". The *Simple Binary* representation is obtained by directly converting the categorical value into binary. Table 3 exhibits the different representations of the Wisconsin Breast Cancer and German Credit data sets.
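The three categorical encodings can be sketched in Python as follows. The function names are illustrative, and the fixed bit width used for the simple binary code (just enough bits for the largest category) is an assumption, since the text does not specify a width.

```python
def thermometer(value, lo, hi):
    """Thermometer code: as many leading 1s as the rank of the value in lo..hi."""
    width = hi - lo + 1
    ones = value - lo + 1
    return "1" * ones + "0" * (width - ones)


def flag(value, lo, hi):
    """Flag code: a single 1 at the position of the value in lo..hi."""
    bits = ["0"] * (hi - lo + 1)
    bits[value - lo] = "1"
    return "".join(bits)


def simple_binary(value, hi):
    """Plain binary code of the value, zero-padded to fit the largest category."""
    width = hi.bit_length()  # assumed width: enough bits for hi
    return format(value, "0{}b".format(width))


codes = (thermometer(4, 1, 6), flag(4, 1, 6), simple_binary(4, 6))
```

For a field ranging from 1 to 6, the thermometer and flag codes need six inputs each, whereas the simple binary code needs only three under the width assumed here.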


**Table 3.** Various dataset representations

### *4.3.2. Logistic regression*

Logistic regression is one of the statistical methods used in DM for non-linear problems, either for classification or for prediction. It is a statistical model that allows one to predict a discrete outcome (the dependent variable), such as group membership, from a set of variables (the independent variables) that may be continuous, discrete, dichotomous, or a combination of these. Logistic regression aims to correctly predict the category of outcome for individual cases using the most parsimonious model. To achieve this goal, a model is created that comprises all predictor (independent) variables that are useful in predicting the desired target. The relationship between the predictors and the target is not linear; instead, the logistic regression function is used, whose equation can be written as Eqn. (4) [26].


$$\theta = \frac{\exp\left(\alpha + \beta_1 x_1 + \dots + \beta_k x_k\right)}{1 + \exp\left(\alpha + \beta_1 x_1 + \dots + \beta_k x_k\right)} \tag{4}$$

where α is the constant of the equation and β denotes the coefficients of the predictor variables. Alternatively, the logistic regression equation can be written as Eqn. (5).


$$\operatorname{logit}\left[\theta(x)\right] = \log\left[\frac{\theta(x)}{1 - \theta(x)}\right] = \alpha + \beta_1 x_1 + \dots + \beta_k x_k \tag{5}$$

An odds ratio is formed from logistic regression as the probability of success over the probability of failure. For example, logistic regression is often used in epidemiological studies where the analysis result shows the probability of developing cancer after controlling for other associated risks. In addition, logistic regression provides knowledge about the relationships and strengths among the variables (e.g., smoking 10 packs a day increases the risk of developing cancer more than working in an asbestos mine) [27].
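The link between Eqns. (4) and (5) and the odds can be checked numerically with a short Python sketch. The constant, coefficients and predictor values below are hypothetical and chosen only so that the linear term equals 1.

```python
import math


def logistic(alpha, betas, xs):
    """Probability from the logistic function with constant alpha and coefficients betas."""
    z = alpha + sum(b * x for b, x in zip(betas, xs))
    return math.exp(z) / (1.0 + math.exp(z))


def logit(p):
    """Log-odds of a probability p, the inverse of the logistic function."""
    return math.log(p / (1.0 - p))


# Hypothetical values: the linear term is -1.0 + 0.5 * 2.0 + 0.25 * 4.0 = 1.0
theta = logistic(-1.0, [0.5, 0.25], [2.0, 4.0])
odds = theta / (1.0 - theta)  # probability of success over probability of failure
```

Applying `logit` to `theta` recovers the linear term, and the odds equal the exponential of that term, which is why coefficients are usually interpreted on the odds scale.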

Logistic regression is computationally simpler during training while still giving good classification performance [28]. The simple logistic regression model has the form given in Eqn. (6):

$$\operatorname{logit}(Y) = \text{natural log(odds)} = \ln\left(\frac{\pi}{1 - \pi}\right) = \alpha + \beta X \tag{6}$$

Taking the antilog of Eqn. (6) on both sides gives an equation for predicting the probability of the occurrence of the outcome of interest:

$$\pi = \text{Probability}(Y = \text{outcome of interest} \mid X = x, \text{ a specific value of } X) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} \tag{7}$$

where π is the probability of the outcome of interest or "event", α is the intercept, β is the regression coefficient, and e = 2.71828 is the base of the system of natural logarithms. X can be categorical or continuous, but Y is always categorical.
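The antilog step can be written out explicitly. The derivation below is a standard algebraic rearrangement in the notation used above, with x denoting a specific value of X; it is added here for illustration and is not reproduced from the cited sources.

$$\begin{aligned} \ln\left(\frac{\pi}{1 - \pi}\right) &= \alpha + \beta x \\ \frac{\pi}{1 - \pi} &= e^{\alpha + \beta x} \\ \pi &= (1 - \pi)\, e^{\alpha + \beta x} \\ \pi \left(1 + e^{\alpha + \beta x}\right) &= e^{\alpha + \beta x} \\ \pi &= \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} \end{aligned}$$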

For the Wisconsin Breast Cancer dataset, there are ten independent variables and one dependent variable for logistic regression, as shown in Figure 2. However, CodeNum is not included in the analysis.

**Figure 2.** Independent and dependent variables of Wisconsin Breast Cancer dataset

A similar approach is applied to the German Credit dataset.

### *4.3.3. Neural network*

An NN, or artificial neural network (ANN), is one of the DM techniques. It is defined as an information-processing system inspired by the function of the human brain, whose performance characteristics have something in common with those of biological NNs [30]. It comprises a large number of simple processing units, called artificial neurons or nodes. All nodes are interconnected by links known as connections. These nodes are linked together to perform parallel distributed processing in order to solve a desired computational task by simulating the learning process [3].


There are weights associated with the links that represent the connection strengths between two processing units. These weights determine the behavior of the network. The connection strengths determine the relationship between the input and the output of the network, and in a way represent the knowledge stored in the network. The knowledge is acquired by the NN through a process of training during which the connection strengths between the nodes are modified. Once trained, the NN keeps this knowledge, and it can be used for the particular task it was designed to do [29]. Through training, a network learns the relationship among the variables and establishes the weights between the nodes. Once learning has occurred, a new case can be presented to the network to produce a more accurate prediction or classification [31].

NN models can learn from experience, generalize and "see through" noise and distortion, and also abstract essential characteristics in the presence of irrelevant data [32]. The NN model is also described as a 'black box' approach that has great capacity in predictive modelling. NN models provide a high degree of robustness and fault tolerance since each processing node has primarily local connections [33]. NN techniques are also advocated as a replacement for statistical forecasting methods because of their capabilities and performance [33], [34]. However, the performance of NNs depends very much on the problem at hand.

NN techniques have been extensively used in pattern recognition, speech recognition and synthesis, medical applications (diagnosis, drug design), fault detection, problem diagnosis, robot control, and computer vision [36], [37]. One major application area of NNs is forecasting, and NN techniques have been used to solve many forecasting problems [33], [36], [38], [39].

There are two types of perceptron in NN, namely the simple (or linear) perceptron and the MLP. The simple perceptron consists of only two layers: the input layer and the output layer. The MLP consists of at least three layers: the input layer, the hidden layer and the output layer. Figure 3 illustrates the two types of perceptron.

**Figure 3.** Simple and MLP architecture

The basic operation of an NN node involves summing its weighted inputs and applying an activation function to the sum to yield the output. Generally, there are three types of activation functions used in NN: the threshold function, the piecewise-linear function and the sigmoid function (Figure 4). Among these, the sigmoid function is the most commonly used in NN.
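The node operation and the three activation functions can be sketched in Python as follows. The breakpoints of the piecewise-linear function (at -0.5 and 0.5) and all weights, inputs and the bias are assumptions chosen for illustration; the chapter does not give numeric definitions.

```python
import math


def threshold(v):
    """Threshold (step) activation."""
    return 1.0 if v >= 0 else 0.0


def piecewise_linear(v):
    """Piecewise-linear activation, linear between the assumed breakpoints -0.5 and 0.5."""
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v + 0.5


def sigmoid(v):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-v))


def node_output(inputs, weights, bias, activation):
    """Sum the weighted inputs, add the bias and apply the activation function."""
    net = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(net)


# Hypothetical node: net input = 0.1 + 0.4 * 1.0 - 0.2 * 0.5 = 0.4
inputs, weights, bias = [1.0, 0.5], [0.4, -0.2], 0.1
outputs = [node_output(inputs, weights, bias, f)
           for f in (threshold, piecewise_linear, sigmoid)]
```

The sigmoid is smooth and differentiable everywhere, which is what gradient-based training such as back propagation requires.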


**Figure 4.** Activation function for BP learning

Multilayer Perceptron (MLP) is one of the most common NN architectures and has been used for diverse applications, particularly in forecasting problems [40]. An MLP network is normally composed of a number of nodes or processing units organized into a series of two or more layers. The first (lowest) layer is the input layer, which receives the external information, while the last (highest) layer is the output layer, where the solution to the problem is obtained. The hidden layer is the intermediate layer between the input layer and the output layer, and may consist of one or more layers. The training of an MLP can be stated as a nonlinear optimization problem. The objective of MLP learning is to find the weights that minimize the difference between the network output and the target output. The most popular training algorithm used in NN is Back propagation (BP), and it has been used to solve many problems in pattern recognition and classification. This algorithm depends on several parameters, such as the number of hidden nodes in the hidden layers, the learning rate, the momentum rate, the activation function and the number of training epochs. Furthermore, these parameters can change the learning performance from poor to good accuracy [23].
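To make the role of the learning rate and momentum rate concrete, the following is a minimal pure-Python sketch of one BP update for a network with two inputs, two hidden nodes and one output node. It assumes sigmoid activations, a squared-error criterion and bias terms stored as the last weight of each node; the initial weights and the training pattern are hypothetical, and this is not the implementation used in the study.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, w_hidden, w_out):
    """Feed forward: return the hidden activations and the network output."""
    ext_x = x + [1.0]  # append the constant bias input
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, ext_x))) for ws in w_hidden]
    ext_h = hidden + [1.0]
    return hidden, sigmoid(sum(w * h for w, h in zip(w_out, ext_h)))


def train_step(x, target, w_hidden, w_out, prev_dh, prev_do, lr=0.1, momentum=0.8):
    """One BP update: feed forward, compute error terms, adjust weights."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error term of the output node, then of each hidden node (using the old w_out).
    delta_out = (target - output) * output * (1.0 - output)
    delta_hidden = [delta_out * w_out[j] * h * (1.0 - h) for j, h in enumerate(hidden)]
    ext_x, ext_h = x + [1.0], hidden + [1.0]
    # Weight change = learning-rate step plus momentum times the previous change.
    new_do = [lr * delta_out * h + momentum * p for h, p in zip(ext_h, prev_do)]
    new_dh = [[lr * dj * xi + momentum * p for xi, p in zip(ext_x, prev_row)]
              for dj, prev_row in zip(delta_hidden, prev_dh)]
    w_out = [w + d for w, d in zip(w_out, new_do)]
    w_hidden = [[w + d for w, d in zip(ws, ds)] for ws, ds in zip(w_hidden, new_dh)]
    return w_hidden, w_out, new_dh, new_do


# Hypothetical pattern and initial weights (last weight of each node is the bias).
x, target = [1.0, 0.0], 1.0
w_hidden = [[0.1, -0.2, 0.05], [0.3, 0.1, -0.1]]
w_out = [0.2, -0.3, 0.1]
zero_dh = [[0.0] * 3 for _ in w_hidden]
zero_do = [0.0] * 3

_, out_before = forward(x, w_hidden, w_out)
w_hidden, w_out, dh, do = train_step(x, target, w_hidden, w_out, zero_dh, zero_do)
_, out_after = forward(x, w_hidden, w_out)
```

After the single update, the output for this pattern moves closer to the target. Repeating the step over all patterns for a number of epochs, and carrying `dh` and `do` forward as the previous changes, gives the usual training loop.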

There are three stages involved in training the NN using the BP algorithm [36]. The first is the feed forward of the input training pattern, the second is the calculation of the associated error between the network output and the target output, and the last is the adjustment of the weights. The learning process starts with the feed forward stage, in which each input unit receives the input information and sends it to each of the hidden units in the hidden layer. Each hidden unit computes its activation and sends its signal to each output unit, which applies its activation to form the response of the net for the given input pattern. The accuracy of the NN is assessed with a confusion matrix, which presents the actual and the predicted values as illustrated in Table 4. Each row of the matrix represents the actual count of a target class, while each column represents the predicted class. To obtain the accuracy of the NN, the number of correctly predicted instances is divided by the total number of instances. The accuracy of the NN is calculated using Eqn. (8).

$$\text{Percentage of Correct} = \left(\frac{\text{Total of correctly predicted patterns}}{\text{Total no. of patterns}}\right) \times 100\% \tag{8}$$
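The percentage-of-correct formula above can be sketched in Python for a confusion matrix stored as a list of rows. The counts 48, 2, 11 and 39 are those used in the chapter's worked calculation based on Table 4; placing 2 and 11 in the particular off-diagonal cells shown here is an assumption, which does not affect the result.

```python
def percentage_correct(matrix):
    """Accuracy in percent: diagonal (correct) counts over all counts."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Rows are actual classes, columns are predicted classes (assumed cell layout).
confusion = [[48, 2],
             [11, 39]]
accuracy = percentage_correct(confusion)
```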

**Table 4.** Confusion matrix

Based on Table 4, the percentage of correct predictions is calculated as:

Percentage of Correct = ((48 + 39) / (48 + 2 + 11 + 39)) \* 100% = 87%

Experiments are conducted to obtain a set of training parameters that gives the optimum accuracy for both data sets. Figure 5 shows the general architecture of the NN for the Wisconsin Breast Cancer data set. Note that the ID number is not included in the architecture.

**Figure 5.** Neural Network architecture for Wisconsin Breast Cancer

A similar architecture can be drawn for the German Credit dataset; however, the number of hidden units and output units will be different from that of the Wisconsin Breast Cancer data set.

## **4.4. Investigation and comparison**

The accuracy results obtained from the previous experiments are compared and investigated further. Two models are considered for this study: logistic regression and neural network. Logistic regression is a statistical regression model for binary dependent variables [24], which is simpler in terms of computation during training while still giving good classification performance [27]. Figure 6 shows the general steps involved in performing the logistic regression and NN experiments using the different data representations in this study.

**Figure 6.** Illustration of Data Representation for NN/ Regression analysis experiments

## **5. Results**

Investigating the prediction performance on different data sets involves many uncertainties for different data types. In a prediction task, a particular predictive model might give the best result for one data set but poor results for another, even though the two data sets contain the same data in different representations [14], [15], [16], [17].

Initial experimental results of correlation analysis on Wisconsin Breast Cancer indicate that all attributes (independent variables) have a significant correlation with the dependent variable (target). However, the German Credit data set indicates otherwise. Therefore, for the German Credit data set, two different approaches (all variables and selected variables) were performed in order to complete the investigation.

Based on the results exhibited in Table 5, although NN obtained the same test accuracy (98.57%) for all representations, *As\_Is* achieved the lowest training accuracy (96.24%). On the other hand, regression exhibits the highest percentage of accuracy for the *Thermometer* and *Flag* representations (100%), followed by the *Simple Binary* representation.

Referring to the results shown in Figure 7, a similar observation has been noted for the German Credit data set when **all variables** are considered for the experiments. The *As\_Is* representation obtained the highest percentage of accuracy (79%) for the NN model. For regression analysis, the *Thermometer* and *Flag* representations obtained the highest percentage of accuracy (80.1%), similar to the earlier observation on the Wisconsin Breast Cancer dataset. The *Simple Binary* representation obtained the second highest percentage of accuracy (79.5%).

| Wisconsin Breast Cancer | Neural Network (Train) | Neural Network (Test) | Regression (Accuracy) |
|---|---|---|---|
| As\_Is representation | 96.24% | 98.57% | 96.9% |
| Min Max normalization | 96.42% | 98.57% | 96.9% |
| Standard Deviation normalization | 96.42% | 98.57% | 96.9% |
| Sigmoidal normalization | 96.60% | 98.57% | 96.9% |
| Thermometer representation | 97.14% | 98.57% | 100.0% |
| Flag representation | 97.67% | 98.57% | 100.0% |
| Simple Binary representation | 97.14% | 98.57% | 97.6% |

**Table 5.** Percentage of accuracy for Wisconsin Breast Cancer Dataset

**Figure 7.** German Credit All Variables accuracy for Neural Network and Regression

When **selected variables** of the German Credit data set were tested with NN, the highest percentage of accuracy was obtained using the *As\_Is* representation (80%), followed by *Standard Deviation Normalization* (79%), *Min Max Normalization* (78%) and the *Thermometer* representation (78%). The regression results show patterns similar to those observed when all variables were used. In other words, the *Thermometer* (77.4%) and *Flag* (77.4%) representations produce the two highest percentages of accuracy for selected variables of German Credit.

**Figure 8.** German Credit Selected Variables accuracy for Neural Network and Regression

For brevity, Table 6 exhibits the NN parameters that produce the highest percentage of accuracy for the Wisconsin Breast Cancer data set, and for the German Credit data set using all variables as well as selected variables in the experiments.

| Parameter | Wisconsin Breast Cancer | German Credit using all variables | German Credit using selected variables |
|---|---|---|---|
| Percentage of Accuracy | **98.57%** | 80.00% | 79.00% |
| Input units | 9 | 20 | 12 |
| Hidden units | 2 | 6 | 20 |
| Learning rate | 0.1 | 0.6 | 0.6 |
| Momentum rate | 0.8 | 0.1 | 0.1 |
| Number of epochs | 100 | 100 | 100 |

**Table 6.** The summary of NN experimental results using *As\_Is* representation

The logistic regression and correlation results for the Wisconsin Breast Cancer data set are exhibited in Table 7. Note that, based on the Wald statistics, variables such as *CellSize*, *Cellshape*, *EpiCells*, *NormNuc* and *Mito* are not significant in the prediction model. However, these variables have a significant correlation with the type of breast cancer. Thus, the logistic regression independent variables include all variables listed in Table 7.

For the German Credit data set, NN obtained the highest percentage of accuracy when all variables are considered for the training (see Table 6). The appropriate parameters for this data set are also listed in the same table. The summary of the logistic regression results is shown in Table 8. All shaded variables displayed in Table 8 are significant independent variables for determining whether a credit application is successful or not.


**Regression (Thermometer representation)**

| Variables | Logistic Regression B | Logistic Regression Sig. | Correlation R | Correlation p |
|---|---|---|---|---|
| CTHick | .531 | .000 | | |
| **CellSize** | .006 | .975 | .818(\*\*) | .000 |
| **CellShape** | .333 | .109 | .819(\*\*) | .000 |
| MarAd | .240 | .036 | | |
| **EpiCells** | .069 | .645 | .683(\*\*) | .000 |
| BareNuc | .400 | .000 | | |
| BLChr | .411 | .009 | | |
| **NormNuc** | .145 | .157 | .712(\*\*) | .000 |
| **Mito** | .551 | .069 | .423(\*\*) | .000 |
| Constant | -9.671 | .000 | | |

**Table 7.** List of variables included in logistic regression of Wisconsin breast cancer

Note also that the variable *Age* is not significant to the German Credit target. However, its correlation with the target is significant. Therefore, these are the variables included in the logistic regression equation that represents German credit application.

**German Credit using all variables (80%)**

| Variables | Logistic Regression B | Logistic Regression Sig. | Correlation R | Correlation p |
|---|---|---|---|---|
| SECA | -.588 | .000 | -.348(\*\*) | .000 |
| DurMo | .025 | .005 | .206(\*\*) | .000 |
| CreditH | -.384 | .000 | -.222(\*\*) | .000 |
| CreditA | -.384 | .018 | .087(\*\*) | .003 |
| SavingA | -.240 | .000 | -.175(\*\*) | .000 |
| EmploPe | -.156 | .029 | -.120(\*\*) | .000 |
| InstalRate | .300 | .000 | .074(\*\*) | .010 |
| PersonalS | -.267 | .022 | -.091(\*\*) | .002 |
| OtherDep | -.363 | .041 | -0.003 | .460 |
| Property | .182 | .046 | .141(\*\*) | .000 |
| **Age** | **-.010** | **.246** | **-.112(\*\*)** | **.000** |
| OtherInst | -.322 | .004 | -.113(\*\*) | .000 |
| Forgn Work | -1.216 | .047 | -.082(\*\*) | .005 |
| Constant | 4.391 | .000 | | |

**Table 8.** List of variables included in logistic regression of German Credit dataset

| Representation | All Variables: NN Train (%) | All Variables: NN Test (%) | All Variables: Regression (%) | Selected Variables: NN Train (%) | Selected Variables: NN Test (%) | Selected Variables: Regression (%) |
|---|---|---|---|---|---|---|
| As\_Is representation | 77.25 | 79.00 | 77.0 | 75.00 | 80.00 | 76.8 |
| Min Max normalization | 76.50 | 76.00 | 77.0 | 75.25 | 78.00 | 76.8 |
| Standard Deviation normalization | 76.75 | 77.00 | 77.0 | 75.13 | 79.00 | 76.8 |
| Sigmoidal normalization | 76.75 | 77.00 | 77.0 | 74.00 | 75.00 | 76.6 |
| Thermometer representation | 78.38 | 78.00 | 80.1 | 77.00 | 78.00 | 77.4 |
| Flag representation | 76.75 | 77.00 | 80.1 | 75.13 | 73.00 | 77.4 |
| Simple Binary representation | 75.75 | 74.00 | 79.5 | 70.63 | 70.00 | 77.1 |

**Table 9.** Summary of NN and regression analysis of German Credit dataset

## **6. Conclusion and future research**

In this study, the effect of different data representations on the performance of NN and regression was investigated on different data sets that have a binary or Boolean class target. The results indicate that different data representations produce different percentages of accuracy.

Based on the empirical results, the *As\_Is* data representation is a better approach for NN with Boolean targets (see also Table 9). NN has shown consistent performance for both data sets. Further inspection of the results exhibited in Table 6 also indicates that, for the German Credit data set, NN performance improves by 1%. This suggests that, by considering correlation and regression analysis, the NN results using both *As\_Is* and *Standard Deviation Normalization* could be improved. For regression analysis, the *Thermometer*, *Flag* and *Simple Binary* representations produce consistent regression performance. However, the performance decreases when the independent variables have been reduced through correlation and regression analysis.

As for future research, more data sets will be utilized to investigate further the effect of data representation on the performance of both NN and regression. One possible area is to investigate which cases fail during training, and how to correct the representation of those cases so that they will be correctly identified by the model. Studying the effect of different data representations on different predictive models enables future researchers or developers of data mining models to present data correctly for a binary or Boolean target in the prediction task.



## **Author details**

Fadzilah Siraj, Ehab A. Omer A. Omer and Md. Rajib Hasan *School of Computing, College of Arts and Sciences, University Utara Malaysia, Sintok, Kedah, Malaysia* 

#### **7. References**

[1] C. Li, and G. Biswas, "Unsupervised learning with mixed numeric and nominal data," *IEEE Transactions on Knowledgeand Data Engineering,* vol. 14, no. 4, pp. 673-690, 2002.

Data Mining and Neural Networks: The Impact of Data Representation 115

[17] Wessels, L.F.A., Reinders, M.J.T., Welsem, T.V. & Nederlof, P.M. (2002). Representation and classification for high-throughput data sets. SPIE-BIOS2002, Biomedial Nanotechnology Architectures and Applications, 4626, 226-237, San Jose, USA, Jan 2002. [18] Jovanovic, N. Milutinovic, V. Obradovic, Z. (2002). Neural Network Applications in Electrical Engineering. Neural Network Applications in Electrical Engineering,, pp. 53-

[19] Delen, D. &Patil, N. (2006). Knowledge Extraction from Prostate Cancer Data.*Proceedings of the 39th Annual Hawaii International Conference, HICSS '06: System* 

[20] Shen, A., Tong, R., & Deng, Y. (2007). Application of Classification Models on Credit Card Fraud Detection. International Conference: Service Systems and Service

[21] Zhu, D., Premkumar, G., Zhang, X. & Chu, C.H. (2001). Data mining for Network Intrusion Detection: A Comparison of Alternative Methods. *Decision Sciences*, 32(4), 635-

[22] Jia, J. & Chua, H. C. (1993). Neural Network Encoding Approach Comparison: An Empirical Study. *Proceedings of First New Zealand International Two-Stream Conference on* 

[23] Nawi, N. M., Ransing, M. R. and Ransing R. S. (2006). An Improved Learning Algorithm Based on The Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method For Back Propagation Neural Networks. Sixth International Conference on Intelligent Systems

[24] Yun, W. H., Kim, D. H., Chi, S. Y. & Yoon, H. S. (2007). Two-dimensional Logistic Regression. 19th IEEE International Conference, ICTAI 2007: Tools with Artificial

[25] Kantardzic, M. (2003). DATA MINING: Concepts, Models, Methods and Algorithms.

[26] O'Connor, M., Marquez, L., Hill, T., & Remus, W. (2002). Neural network models for forecast a review. *IEEE proceedings of the 25th Hawaii International Conference on System* 

[27] Duarte, L. M., Luiz, R. R., Marcos, E. M. P. (2008). The cigarette burden (measured by the number of pack-years smoked) negatively impacts the response rate to platinum-

[28] Ksantini, R., Ziou, D., Colin, B., &Dubeau, F. (2008). Weighted Pseudometric Discriminatory Power Improvement Using a Bayesian Logistic Regression Model Based on a Variational Method. *IEEE Transactions*on Pattern Analysis and Machine

[29] Chiang, L. & Wen, L. (2009). A neural network weight determination model designed uniquely for small data set learning. Expert Systems with Applications.36 (6). 9853-9858 [30] Fausett, L. (1994). Fundamentals Of Neural Networks Architectures, Algorithms, and

[31] Lippmann, R.P. (1987). An introduction to Computing with neural neural network.*IEEE* 

based chemotherapy in lung cancer patients. Lung Cancer, 61(2), 244-254.

Applications. Upper Saddle River, New Jersey07458: Prentice Hall.

*Transactions on nets, IEEE ASSP Magazine*, April, pp. 4–22.

*Artificial Neural Networks and Expert Systems.*24-26 November .38-41.

Design and Applications, October 2006, Vol. 1, pp.152-157.

Intelligence, 29-31 October 2007, Vol. 2 (pp. 349-353).

IEEE Transactions on Neural Networks, 14(2), 464-464.

*Sciences*, 4, pp. 494–498.

Intelligence.

58.

660.

*Sciences*. 04-07 Jan. Vol. 5 92b-92b.

Management, 9-11 June 2007 (pp. 1-4).


[17] Wessels, L.F.A., Reinders, M.J.T., Welsem, T.V. & Nederlof, P.M. (2002). Representation and classification for high-throughput data sets. SPIE-BIOS2002, Biomedial Nanotechnology Architectures and Applications, 4626, 226-237, San Jose, USA, Jan 2002.

114 Advances in Data Mining Knowledge Discovery and Applications

*computer Scince, National University of Singapore.*

Retrieved from http://www.crisp-dm.org/CRISPWP.pdf

Application.*Proceeding of Air Forum, Toronto, Canada*

Electrotechnical Conference, 2000,Vol. 2 (pp. 567-569).

Retrieved from www.scopus.com

*Computin*g. 8-10 April.512-517.

Engineering, 83, 31-45.

[1] C. Li, and G. Biswas, "Unsupervised learning with mixed numeric and nominal data," *IEEE Transactions on Knowledgeand Data Engineering,* vol. 14, no. 4, pp. 673-690, 2002. [2] A. Ahmad, and L. Dey, "A k-mean clustering algorithm for mixed numeric and categorical data," *Data &KnowledgeEngineering,* vol. 63, no. 2, pp. 503-527, 2007. [3] Li Kan, LuiYushu, "Agent Based Data Mining Framework for the High Dimensional Environment," Journal of Beijing institute of technology, vol. 14, pp. 113-116, Feb 2004. [4] Pan Ding, ShenJunyi, "Incorporating Domain Knowledge into Data Mining Process: An Ontology Based Framework," Wuhan University Journal of Natural Sciences, vol. 11,

[5] XianyiQian; Xianjun Wang; , "A New Study of DSS Based on Neural Network and Data Mining," E-Business and Information System Security, 2009. EBISS '09. International

[7] Tsantis, L &Castellani, J. (2001) Enhancing Learning Environment Solution-based knowledge Discovery Tools: Forecasting for Self-perpetuating Systematic Reform. JSET

[8] Luan, J (2002). Data Mining Application in Higher education. SPSS Executive Report.

[9] A. Ahmad, and L. Dey, "A k-mean clustering algorithm for mixed numeric and categorical data," *Data &KnowledgeEngineering,* vol. 63, no. 2, pp. 503-527, 2007. [10] Fernandez, G., (2003), Data Mining Using SAS Application. CRC press LLC. pp 1-12 [11] Dongsong Zhang; Lina Zhou; , "Discovering golden nuggets: data mining in financial application," *Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on*, vol.34, no.4, pp.513-522, Nov. 2004 doi: 10.1109/TSMCC.2004.829279 [12] Luan, J (2006). Data Mining and Knowledge Management in Higher education Potential

[13] Siraj, F., &Abdoulha, M. A. (2009). Uncovering hidden information within university's student enrollment data using data mining. Paper presented at the Proceedings - 2009 3rd Asia International Conference on Modelling and Simulation, AMS 2009, 413-418.

[14] Hashemi R. R., Bahar, M., Tyler, A. A. & Young, J. (2002). The Investigation of Mercury Presence in Human Blood: An Extrapolation from Animal Data Using Neural Networks. *Proceedings of International Conference: Information Technology: Coding and* 

[15] Altun, H., Talcinoz, T. &Tezekiei B. S. (2000). Improvement in the Learning Process as a Function of Distribution Characteristics of Binary Data Set. 10th Mediterranean

[16] O'Neal, M.R., Engel, B.A., Ess, D.R. &Frankenberger, J.R. (2002). Neural Network prediction of maize yield using alternative data coding algorithms. Biosystems

Conference on , vol., no., pp.1-4, 23-24 May 2009 doi: 10.1109/EBISS.2009.5137883 [6] Zhihua, X. (1998) Statistics and Data Mining. *Department of Information System and* 

**7. References** 

pp. 165-169, Jan. 2006.

Journal 6

	- [32] Wasserman, P. D. (1989). Neural Computing: Theory and Practice, Van Nostrand-Reinhold, New York.

**Chapter 5**

**Inconsistent Decision System: Rough Set Data Mining Strategy to Extract Decision Algorithm of a Numerical Distance Relay – Tutorial**

Mohammad Lutfi Othman and Ishak Aris

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/50460

© 2012 Othman and Aris, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **1. Introduction**

Modern numerical protective relays, being intelligent electronic devices (IEDs), are inevitably vulnerable to false tripping or failure of operation for faults in the power system [1]. With regular and rigorous analyses, the performance reliability of digital protective relays can be ascertained, their availability maximized and, subsequently, their misoperation risks minimized [2]. Precise relay operation analysis would normally involve assessing the relay characteristics, evaluating the relay performance and identifying the relay-power system interactions so as to ensure that the protective relays operate in accordance with their predetermined settings [3,4].

Protection engineers would in practice resort to computing technologies to automate the analysis process when the scale of event data exploration, manipulation and inferencing exceeds human manageability. The voluminous amount of data to be processed has prompted the need to use intelligent data mining, an essential constituent of the Knowledge Discovery in Databases (KDD) process [5]. This has motivated the adoption of rough set theory to data mine the protective relay event report so as to discover its decision algorithm.

## **2. Problem statement and objective**

The following two pertinent problems motivated this study of protective relay operation analysis:

- Inconsistencies in the device's event report, particularly found when, upon power system fault inception, a protective relay detects and invokes a common combination of tripping conditions in time succession but with two distinct tripping decisions

