**Knowledge in Imperfect Data**

Andrzej Kochanski, Marcin Perzyk and Marta Klebczyk *Warsaw University of Technology Poland* 

## **1. Introduction**


Data bases collecting huge amounts of information pertaining to real-world processes, for example industrial ones, contain a significant number of data which are imprecise, mutually incoherent, and frequently even contradictory. Data bases of this kind also often lack important information. All available means and resources may and should be used to eliminate or at least minimize such problems at the stage of data collection. It should be emphasized, however, that the character of industrial data bases, as well as the ways in which such bases are created and the data are collected, preclude the elimination of all errors. It is, therefore, necessary to find and develop methods for eliminating errors from already-existing data bases or for reducing their influence on the accuracy of analyses or hypotheses proposed with the application of these data bases. There are at least three main reasons for data preparation: (a) the possibility of using the data for modeling, (b) modeling acceleration, and (c) an increase in the accuracy of the model. An additional motivation for data preparation is that it offers a possibility of arriving at a deeper understanding of the process being modeled, including the significance of its most important parameters.

The literature pertaining to data preparation (Pyle, 1999, 2003; Han & Kamber, 2001; Witten & Frank, 2005; Weiss & Indurkhya, 1998; Masters, 1999; Kusiak, 2001; Refaat, 2007) discusses various data preparation tasks (characterized by means of numerous methods, algorithms, and procedures). Apparently, however, no ordered and coherent classification of the tasks and operations involved in data preparation has been proposed so far. This has a number of reasons, including the following: (a) numerous articles propose solutions to problems employing selected individual data preparation operations, which may lead to the conclusion that such classifications are not really necessary, (b) monographs deal only minimally with industrial data, which have their own specific character, different from that of business data, (c) the fact that the same operations are performed for different purposes in different tasks complicates the job of preparing such a classification.

Information on how time-consuming data preparation is appears in the works of many authors. The widely-held view expressed in the literature is that the time devoted to data preparation constitutes considerably more than half of the overall data exploration time (Pyle, 2003; McCue, 2007). Systematically conducted data preparation can reduce this time. This constitutes an additional argument for developing the data preparation methodology proposed in (Kochanski, 2010).


## **2. A taxonomy of data preparation**

Discussing the issue of data preparation, the present work uses the terms *process, stage, task,* and *operation*. Their mutual relations are diagrammatically represented in Fig. 1 below. The data preparation process encompasses all the data preparation tasks. Two stages may be distinguished within it: the introductory stage and the main stage. The stages of the process involve carrying out tasks, each of which, in turn, involves a range of operations. The data preparation process always starts with the introductory stage. Within each stage (be it the introductory or the main one) different tasks may be performed. In the introductory stage, the performance of the first task (the choice of the first task is dictated by the specific nature of the case under analysis) should always be followed by data cleaning. It is only after data cleaning in the introductory stage that performing the tasks of the main stage may be initiated. In the main stage, the choice of the first task, as well as the ordering of the tasks that follow, are not predetermined and depend on the nature of the case under consideration. As far as the data collected in real-life industrial (production) processes are concerned, four tasks may be differentiated:

- data cleaning is used in eliminating any inconsistency or incoherence in the collected data,
- data integration makes it possible to integrate data bases coming from various sources into a single two-dimensional table, thanks to which algorithmized tools of data mining can be employed,
- data transformation includes a number of operations aimed at making possible the building of a model, accelerating its building, and improving its accuracy, employing, among others, the widely known normalization or attribute construction methods,
- data reduction limits the dimensionality of a data base, that is, the number of variables; performing this task significantly reduces the time necessary for data mining.

Each of the tasks involves one or more operations. An operation is understood here as a single action performed on the data. The same operation may be a part of a few tasks.

In Fig. 1, within each of the four tasks the two stages of the process are differentiated: the introductory stage of data preparation (in the diagram, represented as the outer circles marked with broken lines) and the main stage of data preparation (represented as the inner circles marked with solid lines). These two stages have quite different characters and use quite different tools. The first stage, that of the introductory data preparation, is performed just once. Its performance in particular tasks is limited to a single operation, which is always followed by the task of data cleaning. The introductory data cleaning is a non-algorithmized stage: it is not computer-aided and it is based on the knowledge and the experience of the human agent preparing the data.

The second stage, that is, the main stage, is much more developed. It can be repeated many times at any moment of the modeling and of the analysis of the developed model. The repetition of the data preparation tasks or the change in the tools employed in particular tasks is aimed at increasing the accuracy of the analyses. The order in which the tasks are carried out may change depending upon the nature of the issue under analysis and the form of the collected data. According to the present authors, data cleaning should always be performed as the first task. However, the remaining tasks may be performed in any order. In some cases different tasks are performed in parallel: for instance, the operations performed in the task of data integration may be performed simultaneously with the operations involved in data transformation, without deciding on any relative priority. Depending on the performed operations, data reduction may either precede data integration (attribute selection) or go at the very end of the data preparation process (dimensionality selection).


Fig. 1. Tasks carried out in the process of data preparation with the distinction into the introductory and the main stage (Kochanski, 2010)

## **3. Tasks of data preparation**

A task is a separate self-contained part of data preparation which may be and which in practice is performed at each stage of the data preparation process. We can distinguish four tasks (as represented in Fig. 1): data cleaning, data transformation, data integration, and data reduction. Depending upon the stage of the data preparation process, a different number of operations may be performed in a task – at the introductory stage the number of operations is limited. These operations, as well as the operations performed in the main stage will be discussed below, in the sections devoted to particular tasks.

The data preparation process, at the stage of introductory preparation, may start with any task, but after finishing the introductory stage of this task, it is necessary to go through the introductory stage of data cleaning. It is only then that one can perform operations belonging to other tasks, including the operations of the task with which the process of data preparation has started. This procedure follows from the fact that in the further preparation, be it in the introductory or in the main stage, the analyst should have at his disposal as complete a data base as possible, and only then make further decisions.



It is only after the introductory stage is completed in all tasks that the main stage can be initiated. This is motivated by the need to avoid any computer-aided operations on raw data.

#### **3.1 Data cleaning**

Data cleaning is a continuous process, which reappears at every stage of data exploration. However, it is especially important at the introductory data preparation stage, among others because of how much time these operations take. At this phase, it involves a one-time data correction, which resides in the elimination of all kinds of errors resulting from human negligence at the stage of data collection. The introductory data cleaning is a laborious process of consulting the source materials, often in the form of paper records, laboratory logs and forms, measuring equipment printouts, etc., and filling in the missing data by hand. The overall documentation collected at this stage may also be used in the two other operations of this task: in accuracy improvement and in inconsistency removal.

Fig. 2. Operations performed in the data cleaning task (with the introductory stage represented as the squared area and the main stage represented as the white area)

As shown in Fig. 2, the data cleaning task may involve three operations, which in the main stage employ algorithmized methods:

- replacement of missing or empty values employs methods of calculating the missing or empty value with the use of the remaining values of the attribute under replacement or with the use of all attributes of the data set;
- accuracy improvement is based on algorithmized methods of replacing the current value with a newly-calculated one or of removing this current value;
- inconsistency removal most frequently employs special procedures (e.g. control codes) programmed in data collection sheets prior to the data collecting stage.

In what follows, these three operations will be discussed in detail because of their use in the preparation of the data for modeling.

#### a. The replacement of missing or empty values

Absent data may be of two kinds: an empty value – when a particular piece of data could not have been recorded because, for instance, such a value of the measured quantity was not taken into consideration at all – and a missing value – when a particular piece of data was not recorded because, for instance, it was lost. The methods of missing or empty value replacement may be divided into two groups, which use, for the purpose of calculating the new value, a subset of the data containing:

- only the values of the attribute under replacement (simple replacement), or


- either specially selected values or all the values of the collected set (complex replacement).

The methods of simple replacement include all the methods based on statistical values (statistics), such as, for instance the mean value, the median, or the standard deviation. The new values which are calculated via these methods and which are to replace the empty or the missing values retain the currently defined distribution measures of the set.
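
As an illustration of simple replacement, the sketch below (not code from the source) fills missing entries of a single attribute with that attribute's own mean or median, using pandas; the column name `UTS` is only a placeholder.

```python
import pandas as pd

def simple_replace(df: pd.DataFrame, column: str, statistic: str = "mean") -> pd.DataFrame:
    """Fill missing values of one attribute with a statistic of that same attribute."""
    out = df.copy()
    if statistic == "mean":
        value = out[column].mean()
    elif statistic == "median":
        value = out[column].median()
    else:
        raise ValueError("statistic must be 'mean' or 'median'")
    out[column] = out[column].fillna(value)
    return out

# Small example with one missing value in the placeholder attribute 'UTS'
data = pd.DataFrame({"UTS": [512.0, 498.0, None, 530.0, 505.0]})
print(simple_replace(data, "UTS", "median"))
```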

The complex replacement aims at replacing the absent piece of data with such a value as should have been filled in originally, had it been properly recorded (had it not been lost). Since these methods are aimed at recreating the proper value, a consequence of their application is that the distribution measures of the replacing data set may be and typically are changed.

In complex replacement the absent value is calculated from the data collected in the subset which contains selected attributes (Fig. 3a) or selected records (Fig. 3b) and which has been created specifically for this purpose. The choice of a data replacement method depends on the properties of the collected data – on finding or not finding correlations between or among attributes. If there is any correlation between, on the one hand, selected attributes and, on the other, the attribute containing the absent value, it is possible to establish a multilinear regression which ties the attributes under analysis (the attribute containing the absent value and the attributes which are correlated with it). In turn, on this basis the absent value may be calculated. An advantage of this method is the possibility of obtaining new limits, that is, a new minimal and maximal value of the attribute under replacement, which is lower or higher than the values registered so far. It is also for this reason that this method is used for the replacement of empty data.
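
A minimal sketch of such regression-based (complex) replacement, assuming plain numpy and hypothetical column positions: in the spirit of Fig. 3a, the absent values of one attribute are predicted from a linear fit on the attributes correlated with it.

```python
import numpy as np

def regression_impute(X: np.ndarray, target_col: int, predictor_cols: list) -> np.ndarray:
    """Fill NaNs in one column from a linear regression on correlated columns."""
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X[:, target_col])
    complete = ~missing & ~np.isnan(X[:, predictor_cols]).any(axis=1)

    # Fit y = b0 + b1*x1 + ... on the complete records only
    A = np.column_stack([np.ones(complete.sum()), X[complete][:, predictor_cols]])
    y = X[complete, target_col]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    # Predict the absent values from the fitted model
    A_missing = np.column_stack([np.ones(missing.sum()), X[missing][:, predictor_cols]])
    X[missing, target_col] = A_missing @ coef
    return X

# Hypothetical use, mirroring Fig. 3a: attribute X6 imputed from X3 and X8
# filled = regression_impute(data_array, target_col=6, predictor_cols=[3, 8])
```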

When there is no correlation between the attribute of the value under replacement and the remaining attributes of the set, the absent value is calculated via a comparison with the selected records from the set containing complete data. This works when the absent value is a missing value.

The choice of a data replacement method should take into consideration the proposed data modeling method. Simple replacement may be used when the modeling method is insensitive to the noise in the data. For models which cannot cope with the noise in the data complex data replacement methods should be used.

A frequent error in automated replacement of absent data is that no record is kept of which pieces of the data are originally collected values and which have been replaced. This piece of information is particularly important when the model quality is being analyzed.
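
A simple way of keeping that record is an explicit flag column added before the values are filled in; a short sketch with pandas and a hypothetical column name:

```python
import pandas as pd

def impute_with_flag(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Impute a column and remember which rows originally held an absent value."""
    out = df.copy()
    out[column + "_was_imputed"] = out[column].isna()   # record the flag first
    out[column] = out[column].fillna(out[column].mean())
    return out

melts = pd.DataFrame({"pouring_temp": [1395.0, None, 1402.0]})
print(impute_with_flag(melts, "pouring_temp"))
```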


Fig. 3. (a) Selected attributes (X3, X8) will serve in calculating the absent value in attribute X6 (b) Selected records (4, 14, 16) will serve in calculating the absent value in attribute 10

#### b. Accuracy improvement

The operation of accuracy improvement, in the main stage of data preparation, is based on algorithmized methods of calculating a value and either replacing the current value (the value recorded in the database) with the newly-calculated one or completely removing the current value from the base. The mode of operation depends upon classifying the value under analysis as either noisy or an outlier. This is why, firstly, this operation focuses on identifying the outliers within the set of the collected data. In the literature, one can find reference to numerous techniques of identifying pieces of the data which are classified as outliers (Alves & Nascimento, 2002; Ben-Gal, 2005; Cateni et al., 2008; Fan et al., 2006; Mohamed et al., 2007). This is an important issue, since it is only in the case of outliers that their removal from the set may be justified. Such an action is justified when the model is developed, for instance, for the purpose of optimizing an industrial process. If a model is being developed with the purpose of identifying potential hazards and break-downs in a process, the outliers should either be retained within the data set or should constitute a new, separate set, which will serve for establishing a separate pattern. On the other hand, noisy data can, at best, be corrected but never removed – unless we have excessive data at our disposal.


In the literature, there are two parallel classifications of outlier identification methods. The first employs the division into one-dimensional and multidimensional methods. The second divides the relevant methods into the statistical (or traditional) ones and the ones which employ advanced methods of data exploration.

Fig. 4. Outliers: a) one-dimensional, b) multidimensional; 1, 2 – outliers described in the text

The analysis of one-dimensional data employs definitions which recognize as outliers the data which are distant, relative to the assumed metric, from the main data concentration. Frequently, the expanse of the concentration is defined via its mean value and the standard error. In that case, an outlier is a value (the point marked as 1 in Fig. 4a) which is located outside the interval (Xm − k·2SE, Xm + k·2SE), where SE is the standard error, k a coefficient, and Xm the mean value. The differences between authors pertain to the value of the k coefficient and the labels for the kinds of data connected with it, for instance outliers going beyond the boundaries calculated for k = 3.6 (Jimenez-Marquez et al., 2002), or outliers for k = 1.5 and extreme data for k = 3 (StatSoft, 2011). Equally popular is the method using a box plot (Laurikkala et al., 2000), which is based on a similar assumption concerning the calculation of the thresholds above and below which a piece of data is treated as an outlier.
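
The two rules just described can be sketched as follows (numpy assumed; the value of k is left as a parameter, since the cited sources differ on it):

```python
import numpy as np

def interval_outliers(x, k=1.5):
    """Flag values outside (Xm - k*2*SE, Xm + k*2*SE), as described in the text."""
    x = np.asarray(x, dtype=float)
    xm = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))      # standard error of the mean
    half_width = k * 2.0 * se
    return (x < xm - half_width) | (x > xm + half_width)

def boxplot_outliers(x, k=1.5):
    """Box-plot rule: flag values more than k*IQR beyond the quartiles."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

sample = [4.9, 5.1, 5.0, 5.2, 4.8, 9.7]       # the last value is suspicious
print(interval_outliers(sample, k=1.5))
print(boxplot_outliers(sample, k=1.5))
```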

For some kinds of data, for instance those coming from multiple technological processes and used for the development of industrial applications, the above methods do not work. Data of this kind exhibit multidimensional correlation. For correlated data a one-dimensional analysis does not lead to correct results, which is clearly visible from the example in Fig. 4 (prepared from synthetic data). In accordance with the discussed method, point 1 (marked as a black square) in Fig. 4a would be classified as an outlier. However, it is clearly visible from the multidimensional (two-dimensional) data represented in Fig. 4b that it is point 2 (represented as a black circle) which is an outlier. In the one-dimensional analysis it was treated as an average value located quite close to the mean value.

Commonly used tools for finding one-dimensional outliers are statistical methods. In contrast, the methods used for finding multidimensional outliers include both statistical methods and advanced methods of data exploration. Statistical methods most frequently employ the Mahalanobis metric.


In the basic Mahalanobis method, for each vector xi the Mahalanobis distance is calculated according to formula (1) (Rousseeuw & Zomeren, 1990):

$$\mathrm{MD}_i = \left( \left( \mathbf{x}_i - T(\mathbf{X}) \right)\, \mathbf{C}(\mathbf{X})^{-1} \left( \mathbf{x}_i - T(\mathbf{X}) \right)^{\mathrm{t}} \right)^{1/2} \qquad \text{for } i = 1, 2, 3, \dots, n \tag{1}$$

where: *T*(**X**) is the arithmetic mean of the data set, **C**(**X**) is the covariance matrix.

This method takes into account not only the central point of the data set, but also the shape of the data set in multidimensional space. A case with a large Mahalanobis distance can be identified as a potential outlier. On the basis of a comparison with the chi-square distribution, an outlier can be removed, as was suggested in (Filzmoser, 2005). In the literature, further improvements of the Mahalanobis distance method can be found, for example the one called the Robust Mahalanobis Distance (Bartkowiak, 2005). Outlier detection based on the Mahalanobis distance in industrial data was performed, e.g., in (Jimenez-Marquez et al., 2002). As far as data exploration is concerned, in principle all methods are used. The most popular ones are based on artificial neural networks (Alves & Nascimento, 2002), grouping (Moh'd Belal Al-Zgubi, 2009), or visualization (Hasimah et al., 2007), but there are also numerous works which suggest new possibilities of the use of other methods of data mining, for example of the rough set theory (Shaari et al., 2007). An assumption of methodologies employing data exploration methods is that the data departing from the model built with their help should be treated as outliers.
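
A compact sketch of the basic approach with a chi-square cutoff, assuming numpy and scipy are available; the 0.975 quantile used below is only an illustrative choice, not a value prescribed by the cited works.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X: np.ndarray, quantile: float = 0.975) -> np.ndarray:
    """Flag records whose squared Mahalanobis distance exceeds a chi-square quantile."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)                     # T(X): arithmetic mean of the data set
    cov = np.cov(X, rowvar=False)               # C(X): covariance matrix
    cov_inv = np.linalg.inv(cov)
    diff = X - center
    md_sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared MD per record
    cutoff = chi2.ppf(quantile, df=X.shape[1])  # chi-square comparison (cf. Filzmoser, 2005)
    return md_sq > cutoff

# Hypothetical use on an array `data` with one row per record:
# outlier_mask = mahalanobis_outliers(data)
```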

#### c. Inconsistency removal

At the introductory stage inconsistencies are primarily removed by hand, via reference to the source materials and records. At this stage it will also encompass the verification of the attributes of the collected data. It is important to remove all redundancies at this stage. Redundant attributes may appear both in basic databases and in databases created via combining a few smaller data sets. Most often, this is a consequence of using different labels in different data sets to refer to the same attributes. Redundant attributes in merged databases suggest that we have at our disposal a much bigger number of attributes than is in fact the case. An example of a situation of this kind is labeling the variables (columns) referring to metal elements in a container as, respectively, *element mass*, *the number of elements in a container*, and *the mass of elements in the container*. If the relevant piece of information is not repeated exactly, for example, if instead of the attribute *the mass of elements in the container* what is collected is *the mass of metal in the container*, then the only problem is increasing the model development time. It is not always the case that this is a significant problem, since the majority of models have a bigger problem with the number of records, than with the number of attributes. However, in the modeling of processes aimed at determining the influence of signal groups, an increase in the number of inputs is accompanied with an avalanche increase in the number of input variable group combinations (Kozlowski, 2009). It should also be remembered that some modeling techniques, especially those based on regression, cannot cope with two collinear attributes (Galmacci, 1996). This pertains also to a small group of matrix-based methods (Pyle, 1999).

At the main stage of data preparation inconsistency removal may be aided with specially designed procedures (e.g. control codes) or tools dedicated to finding such inconsistencies (e.g. when the correlations/interdependencies among parameters are known).
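
As an illustration of such dedicated tools, the sketch below checks every record against hand-written consistency rules; the rules and column names are hypothetical stand-ins for known interdependencies among parameters.

```python
import pandas as pd

# Hypothetical rules tying together attributes known to be interdependent;
# the column names are placeholders, not taken from the chapter's data.
RULES = {
    "mass balance": lambda r: abs(r["element_mass"] * r["element_count"]
                                  - r["elements_total_mass"]) < 1e-6,
    "moisture range": lambda r: 0.0 <= r["moisture_pct"] <= 100.0,
}

def find_inconsistencies(df: pd.DataFrame) -> pd.DataFrame:
    """List every record that violates one of the consistency rules."""
    violations = []
    for idx, row in df.iterrows():
        for name, rule in RULES.items():
            if not rule(row):
                violations.append({"record": idx, "rule": name})
    return pd.DataFrame(violations)

records = pd.DataFrame({
    "element_mass": [2.5, 2.5], "element_count": [4, 4],
    "elements_total_mass": [10.0, 9.0], "moisture_pct": [3.2, 140.0],
})
print(find_inconsistencies(records))
```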

#### **3.2 Data integration**


Because of the tools used in knowledge extraction it is necessary that the data be represented in the form of a flat, two-dimensional table. It is becoming possible to analyze data recorded in a different form, but the column-row structure, as in a spreadsheet, is the best one (Pyle, 2003). The task of data integration may be either simple or complicated, since it depends on the form in which the data is collected. In the case of synthetic data, as well as in appropriately prepared industrial data collecting systems, this is unproblematic. However, the majority of industrial databases develop in an uncoordinated way. Different departments of the same plant develop their own databases for their own purposes. These databases are further developed without taking other units into consideration. This results in a situation in which the same attributes are repeated in numerous databases, the same attributes are labeled differently in different databases, the same attributes have different proportions of absent data in different bases, the developed databases have different primary keys (case identifiers), etc. The situation gets even worse when the databases under integration have been developed not within a single plant but in more distant places, for example in competing plants.

Fig. 5. The operations performed in the data integration task (with the introductory stage represented as the squared area and the main stage represented as the white area)

As represented in Fig 5, data integration consists of three operations. The introductory stage of data integration focuses on two of them: identification and redundancy removal. Both these operations are closely connected with one another. The first operation – identification – is necessary, since the task of data integration starts with identifying the quantity around which the database sets may be integrated. Also, at the data integration introductory stage the removal of obvious redundancies should be performed. Their appearance may result, for instance, from the specific way in which the industrial production data is recorded. The results of tests conducted in centrally-operated laboratories are used by different factory departments, which store these results independently of one another, in their own databases. The result of data integration performed without introductory analysis may be that the same quantities may be listed in the end-product database a number of times and, moreover, they may be listed there under different labels. A person who knows the data under analysis will identify such redundant records without any problem.
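
A minimal sketch of this introductory integration with pandas: two hypothetical departmental tables are merged around an identified common key, and an obviously redundant column (the same quantity stored under a different label) is dropped.

```python
import pandas as pd

# Hypothetical departmental databases sharing the melt identifier as the common key
lab = pd.DataFrame({"melt_id": [101, 102, 103], "UTS_MPa": [572, 541, 560]})
foundry = pd.DataFrame({"melt_id": [101, 102, 103],
                        "pouring_temp_C": [1395, 1402, 1389],
                        "tensile_strength": [572, 541, 560]})  # same quantity, different label

merged = lab.merge(foundry, on="melt_id", how="outer")

# Obvious redundancy removal: drop a column that exactly duplicates another one
if merged["UTS_MPa"].equals(merged["tensile_strength"]):
    merged = merged.drop(columns=["tensile_strength"])
print(merged)
```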


At the main stage, unlike in other data preparation tasks, the integration operations are only performed with the aid of algorithmized tools and are still based on the knowledge about the process under analysis. At this stage, the operations represented in Fig. 5 are performed with the following aims:

- identification – serves the purpose of identifying the attributes which could not have been identified at the introductory stage, for instance in those cases in which the label does not explain anything,
- redundancy removal – just like the attribute selection in the data reduction task, which is characterized below, uses algorithmized methods of comparing the attributes which have been identified only in the main stage, with the aim of removing the redundant data,
- unification – this is performed so that the data collected in different sets have the same form, for instance the same units.

The operation of identification uses methods which make it possible to identify the correlation between the attribute under analysis and other identified process attributes. In this way it is possible to establish, for instance, what a particular attribute pertains to, where the sensor making the recorded measurement could have been installed, etc.

The second performance of the redundancy removal operation pertains only to attributes which have been identified only in the second, main stage of data integration. As far as this second performance is concerned, redundancy removal may be identified with the attribute selection operation from the data reduction task and it uses the same methods as the attribute selection operation.

Unification is very close to the data transformation task. In most cases it amounts to transforming all the data in such a way that they are recorded in a unified form, via converting part of the data using different scales into the same units, for instance, converting inches into millimeters, meters into millimeters, etc. A separate issue is the unification of the data collected in different sets and originating from measurements employing different methods. In such cases one should employ the algorithms for converting one attribute form into another. These may include both analytical formulas (exact conversion) and approximate empirical formulas. When this is not possible, unification may be achieved via one of the data transformation operations, that is, via normalization (Liang & Kasabov, 2003).
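
A small sketch of unit unification, assuming a hypothetical length attribute recorded in mixed units; the conversion table maps every unit to millimetres.

```python
import pandas as pd

# Conversion factors to a common unit (millimetres)
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0, "in": 25.4}

def unify_length(df: pd.DataFrame, value_col: str, unit_col: str) -> pd.DataFrame:
    """Convert a length attribute recorded in mixed units to millimetres."""
    out = df.copy()
    out[value_col + "_mm"] = out[value_col] * out[unit_col].map(TO_MM)
    return out

sizes = pd.DataFrame({"length": [2.0, 48.0, 0.5], "unit": ["in", "mm", "m"]})
print(unify_length(sizes, "length", "unit"))
```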

The main principle that the person performing the data integration task should stick to is that the integrated database should retain all the information collected in the data sets which have been integrated. The suggestion, expressed in (Witten & Frank, 2005), that data integration should be accompanied by aggregation seems misguided. Aggregation may accompany integration, but only when this is absolutely necessary.

#### **3.3 Data transformation**

The introductory stage of data transformation usually amounts to a single operation dictated by data integration, such as aggregation, generalization, normalization, or attribute (feature) construction. Performing aggregation may be dictated by the fact that the data collected in multiple places during the production process may represent results from different periods: a different hour, shift, day, week, or month.


Finding a common higher-order interval of time is necessary for further analyses. The same reasons make necessary the performance of generalization or normalization. Generalization may be dictated, for instance, by the integration of databases in which the same attribute is once recorded in a nominal scale (e.g. ultimate tensile strength – ductile iron grade 500/07) and once in a ratio scale (ultimate tensile strength – UTS = 572 MPa). We can talk about normalization in the introductory stage only when we understand this term in its widest sense, that is, as a conversion of quantities from one range into those of another range. In that case, a transformation such as measurement unit conversion may be understood as normalization at the introductory stage.
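
As a sketch of aggregation to a common higher-order time interval, the snippet below (pandas assumed, with synthetic minute-level readings) resamples the measurements to daily means.

```python
import pandas as pd

# Synthetic minute-level sensor readings aggregated to a common daily interval
idx = pd.date_range("2023-01-01", periods=3 * 24 * 60, freq="min")
readings = pd.Series(range(len(idx)), index=idx, dtype=float)

daily = readings.resample("D").mean()   # one record per day instead of one per minute
print(daily)
```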

Data transformation encompasses all the issues connected with transforming the data into a form which makes data exploration possible. At the introductory stage it involves six operations represented in Fig. 6:


Fig. 6. Operations performed in the data transformation task (with the introductory stage represented as the squared area and the main stage represented as the white area).
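
A minimal sketch of aggregating measurements to a common higher-order time interval, written in Python with pandas; the column names (`moisture_pct`, `ucs_kpa`), the sample values and the choice of daily averaging are illustrative assumptions only, not the procedure used in the cited report.

```python
import pandas as pd

# Hypothetical example: moisture readings logged every few minutes and a
# laboratory strength test logged once per shift can only be joined after
# both are aggregated to a common, higher-order time interval (here: one day).
moisture = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2005-03-01 06:05", "2005-03-01 13:40", "2005-03-02 07:15"]),
    "moisture_pct": [3.1, 3.4, 2.9],
})
strength = pd.DataFrame({
    "timestamp": pd.to_datetime(["2005-03-01 14:00", "2005-03-02 14:00"]),
    "ucs_kpa": [152.0, 148.0],
})

# Aggregate each source to daily means, then merge on the common (date) key.
daily_moisture = moisture.set_index("timestamp").resample("D").mean()
daily_strength = strength.set_index("timestamp").resample("D").mean()
daily = daily_moisture.join(daily_strength, how="inner")
print(daily)
```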


Smoothing is an operation which follows from the assumption that the collected data are noisy with random errors or with divergence of measured attributes. In the case of industrial (also laboratory) data such a situation is a commonplace, unlike in the case of business data (the dollar exchange rate cannot possibly be noisy). Smoothing techniques are aimed at removing the noise, without at the same time interfering with the essence of the measured attributes. Smoothing methods may be of two kinds: (a) those which focus on comparing the quantity under analysis with its immediate neighborhood and (b) those which analyze the totality of the collected data.

Group (a) includes methods which analyze a specified lag or window. In the former case, the data are analyzed in their original order, while in the latter their order is changed. These methods are based on a comparison with the neighboring values and use, for instance, the mean, the weighted moving mean, or the median. Group (b) includes methods employing regression or data clustering. The method of loess – local regression – could be classified as belonging to either group.

In the first group of methods (a) a frequent solution is to combine a few techniques into a single procedure. This is supposed to bring about a situation in which further smoothing does not introduce changes into the collected data. A strategy of this kind is resmoothing, in which two approaches may be adopted: either 1) smoothing is continued up to the moment when, after a subsequent smoothing, the curve does not change, or 2) a specified number of smoothing cycles is performed, but this involves changing the window size. Two such procedures, 4253H and 3R2H, were discussed in (Pyle, 1999). A method which should also be included in the first group (a) is the PVM (peak – valley – mean) method.

Binning, which belongs to group (a) is a method of smoothing (also data cleaning) residing in the creation of bins and the ascription of data to these bins. The data collected in the respective bins are compared to one another within each bin and unified via one of a few methods, for example, via the replacement with the mean value, the replacement with the median, or the replacement with the boundary value. A significant parameter of smoothing via binning is the selection of the size of bins to be transformed. Since comparing is performed only within the closest data collected in a single bin, binning is a kind of local smoothing of the data.
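
A minimal sketch of smoothing by binning, assuming equal-frequency (equal-depth) bins; the bin count and the replacement rule (mean, median, or boundary) are the parameters discussed above, and the function and variable names are purely illustrative.

```python
import numpy as np

def smooth_by_binning(values, n_bins=4, method="mean"):
    """Local smoothing via equal-frequency bins: every value in a bin is
    replaced with the bin mean, the bin median, or the nearest bin boundary."""
    order = np.argsort(values)
    bins = np.array_split(order, n_bins)          # equal-frequency bins
    smoothed = np.asarray(values, dtype=float).copy()
    for idx in bins:
        bin_vals = smoothed[idx]
        if method == "mean":
            smoothed[idx] = bin_vals.mean()
        elif method == "median":
            smoothed[idx] = np.median(bin_vals)
        elif method == "boundary":                # snap to the closer boundary
            lo, hi = bin_vals.min(), bin_vals.max()
            smoothed[idx] = np.where(bin_vals - lo < hi - bin_vals, lo, hi)
    return smoothed

measurements = [21, 4, 8, 28, 15, 9, 24, 26, 21, 34]
print(smooth_by_binning(measurements, n_bins=3, method="mean"))
```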

As was already mentioned above, the second important group of smoothing methods (b) are the techniques employing all the available data. Among others, this includes the methods of regression and of data clustering.

Fig.7 represents the smoothing of the laboratory data attribute (ultimate compressive strength) of greensand in the function of moisture content. The data may be smoothed via adjusting the function to the measured data. Regression means the adjustment of the best curve to the distribution of two attributes, so that one of these attributes could serve for the prediction of the other attribute. In multilinear regression the function is adjusted to the data collected in the space which is more than two-dimensional.
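
Regression smoothing of the kind shown in Fig. 7 can be sketched as follows; the moisture and strength values below are invented for illustration (they are not the authors' measurements), and a low-order polynomial fitted by least squares stands in for "adjusting the best curve" to the data.

```python
import numpy as np

# Illustrative only: fit a low-order polynomial to (moisture, strength) pairs
# and use the fitted curve as the smoothed value of the strength attribute.
moisture = np.array([2.4, 2.7, 3.0, 3.2, 3.5, 3.8, 4.0])          # assumed [%]
strength = np.array([0.17, 0.16, 0.14, 0.12, 0.10, 0.07, 0.05])   # assumed [MPa]

coeffs = np.polyfit(moisture, strength, deg=2)    # least-squares fit
smoothed = np.polyval(coeffs, moisture)           # replace measurements with the fit
print(np.round(smoothed, 3))
```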

An example of smoothing with the use of data clustering techniques was discussed in (Kochanski, 2006). In the data which is grouped the outliers may easily be detected and either smoothed or removed from the set. This procedure makes possible defining the range of variability of the data under analysis.


Fig. 7. The regression method applied to the industrial data: the ultimate compressive strength of greensand in the function of moisture content (the authors' own research)

In the case of industrial data the issue of smoothing should be treated with an appropriate care. The removal or smoothing of an outlier may determine the model's capability for predicting hazards, break-downs or product defects. The industrial data often contain quantities which result from the process parameter synergy. The quantity which may undergo smoothing is, for instance, the daily or monthly furnace charge for the prediction of the trend in its consumption, but not the form moisture content for the prediction of the appearance of porosity.

Fig. 8. Measurements performed with different frequency and encompassing different time periods: from a measurement encompassing 1 minute to a measurement encompassing the data from an entire month; (the data are collected in different places: DTxx – a symbol for the process technical documentation) (Research report, 2005)


In manufacture processes, when different operations are performed in different places and records are collected in separate forms, a frequent situation is the impossibility of integrating the separate cases into a single database. The absence of a single key makes data integration impossible. (Research report, 2005) discusses data integration with the use of aggregation. Data aggregation (represented as parentheses) and data location on the time axis, which makes possible further integration, are diagrammed in Fig. 8.

Generalization in data preparation is an operation which is aimed at reducing the number of the values that an individual attribute may take. In particular, it converts a continuous quantity into an attribute which takes the value of one of the specified ranges. The replacement of a continuous quantity with a smaller number of labeled ranges makes easier grasping the nature of the correlation found in the course of data exploration. Generalization applies to two terms: discretization and notional hierarchy.

Discretization is a method which makes possible splitting the whole range of attribute variation into a specified number of subregions. Discretization methods in the two main classifications are characterized in terms of the direction in which they are performed or in terms of whether or not in feature division they employ information contained in the features other than the one under discretization. In the first case, we talk about bottom-up or top-down methods. In the second case, we distinguish between supervised and unsupervised methods.

The notional hierarchy for a single feature makes possible reducing the number of the data via grouping and the replacement of a numerical quantity, for instance the percentage of carbon in the chemical composition of an alloy, with a higher-order notion, that is, *a low-carbonic* and *a high-carbonic alloy*. This leads to a partial loss of information but, in turn, thanks to the application of this method it may be easier to provide an interpretation and in the end this interpretation may become more comprehensible.
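
The two operations above can be sketched together: an unsupervised, equal-width discretization of a continuous attribute, with the resulting ranges optionally mapped onto higher-order labels. The carbon values, the number of intervals and the labels are illustrative assumptions, not metallurgical definitions.

```python
import numpy as np

def discretize(values, n_intervals=3, labels=None):
    """Unsupervised, equal-width discretization: split the attribute range
    into n_intervals sub-ranges and replace each value with a range label."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_intervals + 1)
    idx = np.clip(np.digitize(values, edges[1:-1]), 0, n_intervals - 1)
    if labels is None:
        labels = [f"[{edges[i]:.2f}, {edges[i + 1]:.2f}]" for i in range(n_intervals)]
    return [labels[i] for i in idx]

# Hypothetical carbon contents [%] mapped onto a two-level notional hierarchy.
carbon = [3.1, 3.4, 3.6, 3.8, 3.9]
print(discretize(carbon, n_intervals=2, labels=["low-carbonic", "high-carbonic"]))
```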

There are many commonly used methods of data normalization (Pyle, 1999; Larose, 2008). The effect of performing the operation of normalization may be a change in the variable range and/or a change in the variable distribution. Some data mining tools, for example artificial neural networks, require normalized quantities. Ready-made commercial codes have special modules which normalize the entered data. When we perform data analysis, we should take into consideration the method which has been employed in normalization. This method may have a significant influence on the developed model and the accuracy of its predictions (Cannataro, 2008; Ginoris et al., 2007; Al Shalabi et al., 2006).

The following are the main reasons for performing normalization:



 a transformation of the ranges of all attributes into a single range makes possible the elimination of the influence of the feature magnitude (their order 10, 100, 1000) on the developed model. In this way we can avoid revaluing features with high values,
 a transformation of the data into a dimensionless form and the consequent achievement of commensurability of a few features makes possible calculating the case distance, for instance with the use of the Euclidean metric,
 a nonlinear transformation makes possible relieving the frequency congestion and uniformly distributing the relevant cases in the range of features,
 a nonlinear transformation makes it possible to take into consideration the outliers,
 data transformation makes possible a comparison of the results from different tests (Liang & Kasabov, 2003).

The following may be listed as the most popular normalization methods:

 min-max normalization – it resides, in the most general form, in the linear transformation of the variable range in such a way that the current minimum takes the value 0 (zero), while the current maximum takes the value 1 (one). This kind of normalization comes in many variants, which differ from one another, among others, with respect to the new ranges, for instance (-1, 1) or (-0.5, 0.5);
 standardization – it resides in such a transformation in which the new feature has the mean value equal to zero and the variance equal to one. The most frequently used standardization is the Z-score standardization, calculated according to the following formula (2):

$$Z = \frac{x - \mu}{\sigma} \tag{2}$$

where: x – feature under standardization, μ – mean value, σ – population standard deviation;

 decimal normalization, in which the decimal separator is moved to the place in which the following equation (3) is satisfied:

$$X' = \frac{X}{10^{j}} \tag{3}$$

where: j is the smallest integer value for which Max(|X'|) < 1.
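
A minimal sketch of the three methods listed above (min-max rescaling, the Z-score of eq. (2), and the decimal scaling of eq. (3)); the UTS values used for the demonstration are assumed, not taken from the chapter's data set.

```python
import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    """Linear rescaling of the variable range to [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    """Standardization, eq. (2): zero mean, unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def decimal_scaling(x):
    """Decimal normalization, eq. (3): divide by the smallest power of ten
    that brings the maximum absolute value below 1."""
    x = np.asarray(x, dtype=float)
    j = int(np.floor(np.log10(np.max(np.abs(x))))) + 1
    return x / 10 ** j

uts = np.array([406.0, 585.0, 1120.0, 1725.0])   # assumed UTS values [MPa]
print(min_max(uts, -1.0, 1.0))
print(z_score(uts))
print(decimal_scaling(uts))
```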

A weakness of the normalization methods discussed above is that they cannot cope with extreme data or with the outliers retained in the set under analysis. A solution to this problem is a linear – nonlinear normalization, for example the soft-max normalization or a nonlinear normalization, for example the logarithmic one (Bonebakker, 2007).

As was mentioned above, the nonlinear normalization makes it possible to include into the modeling data set extreme data or even outliers, as well as to change the distribution of the variable under transformation. The logarithmic transformation equations given in (4a,b) and the tangent transformation equation in (5) were employed in (Olichwier, 2011):

$$X_i' = w \cdot \lg(c + X_i) \tag{4a}$$

$$X_i' = w \cdot \lg(1 + cX_i) \tag{4b}$$

where: w, c – coefficients.

$$X_i' = \frac{1}{2}\left[\tanh\left(k \cdot \frac{X_i - \mu}{\sigma}\right) + 1\right] \tag{5}$$

where: k – coefficient, μ – mean value, σ – population standard deviation.

The use of the logarithmic transformation made it possible to locally spread the distributions of the selected parameters. The original skew parameter distribution after the transformation approached the normal distribution. One effect of the transformation in question was a visible increase in the prediction quality of the resulting model.
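
The nonlinear transformations of eqs. (4b) and (5) can be sketched as follows; the coefficients w, c and k are free parameters, and the elongation values are assumed for illustration only.

```python
import numpy as np

def log_transform(x, w=1.0, c=1.0):
    """Logarithmic normalization as in eq. (4b): X' = w * lg(1 + c*X)."""
    return w * np.log10(1.0 + c * np.asarray(x, dtype=float))

def tanh_transform(x, k=1.0):
    """Soft, tanh-based normalization as in eq. (5); squashes outliers into (0, 1)."""
    x = np.asarray(x, dtype=float)
    return 0.5 * (np.tanh(k * (x - x.mean()) / x.std()) + 1.0)

elongation = np.array([0.2, 1.5, 4.0, 9.0, 27.5])   # assumed elongation values [%]
print(np.round(log_transform(elongation), 3))
print(np.round(tanh_transform(elongation), 3))
```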


The data in the databases which are under preparation for mining may be collected in many places and recorded in separate sets. As a result of integrating different data sources it is possible to perform their transformation, for instance in the domain of a single manufacture process. The data represented by two or more features may be replaced with a different attribute. A classical example of this is the replacement of the two dates defining the beginning and the end of an operation, which were originally recorded in two different places, with a single value defining the duration of the operation, for instance the day and the hour of molding, as well as the day and the hour of the mould assembly are replaced with the mould drying time. A deep knowledge of the data makes possible using mathematical transformations which are more complex than a mere difference. As a consequence, new, semantically justified feature combinations are created. An example of this was discussed in (Kochanski, 2000).
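
The classical example mentioned above, replacing two dates with a single duration attribute, can be sketched in a few lines; the timestamps are hypothetical.

```python
from datetime import datetime

# Two attributes recorded in different places (illustrative values)...
molding_time = datetime(2005, 3, 1, 9, 30)      # day and hour of molding
assembly_time = datetime(2005, 3, 2, 7, 0)      # day and hour of mould assembly

# ...replaced with a single, semantically justified attribute: drying time [h].
mould_drying_time_h = (assembly_time - molding_time).total_seconds() / 3600.0
print(mould_drying_time_h)   # 21.5
```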

The knowledge of the properties of the algorithm employed in data mining is a different kind of reason behind the creation of new attributes. In the case of algorithms capable only of the division which is parallel to the data space axis (for instance in the case of the majority of decision trees) it is possible to replace the attributes pertaining to features characterized with linear dependence with a new attribute which is the ratio of the former attributes (Witten & Frank, 2005).

It is possible to perform any other transformations of attributes which do not have any mathematical justification but which follow from the general world knowledge: the names of the days of the week may be replaced with the dates, the names of the elements may be replaced with their atomic numbers, etc.

The last encountered way of creating attributes is the replacement of two attributes with a new attribute which is their product. A new attribute created in this way may have no counterpart in reality.

The common use of accommodation suggests that it should be considered as one of the issues of data transformation. What is used in data mining are databases which have been created without any consideration for their future specific application, that is, the application with the use of selected tools or algorithms. In effect, it is often the case that the collected data format does not fit the requirements of the tool used for mining, especially in a commercial code. The popular ARFF (attribute - relation file format) makes use only of the data recorded in a nominal or interval scale. The knowledge gathered in the data recorded in a ratio scale will be lost as a result of being recorded in the ARFF format.

The Excel spreadsheet, which is popular and which is most frequently used for gathering industrial data, may also be a cause of data distortion. Depending upon the format of the cell in which the registered quantity is recorded, the conversion of files into the txt format (which is required by a part of data mining programs) may result in a loss of modeling quality. This is a consequence of the fact that a given cell contains the whole number which was recorded in it, but what is saved in file conversion is only that part of this number which is displayed (Fig. 9).



Fig. 9. Data conversion – the conversion of a file from the xls to the txt format brings about a partial loss of information (the authors' own work)

#### **3.4 Data reduction**

At the introductory stage of data preparation this task is limited to a single operation – attribute selection – and it is directly connected with the expert opinion and experience pertaining to the data under analysis. It is only as a result of a specialist analysis (for example, of brainstorming) that one can remove from the set of collected data those attributes which without question do not have any influence on the features under modeling.

The aim of the main stage data reduction techniques is a significant decrease in the data representation, which at the same time preserves the features of the basic data set. Reduced data mining makes then possible obtaining models and analyses which are identical (or nearly identical) to the analyses performed for the basic data.

Data reduction includes five operations, which are represented in Fig. 10:

 attribute selection – this resides in reducing the data set by eliminating from it the attributes which are redundant or have little significance for the phenomenon under modeling;
 dimension reduction – this resides in transforming the data with the aim of arriving at a reduced representation of the basic data;
 numerosity reduction – this is aimed at reducing the data set via eliminating the recurring or very similar cases;
 discretization – this resides in transforming a continuous variable into a limited and specified number of ranges;
 aggregation – this resides in summing up the data, most frequently in the function of time, for instance from a longer time period encompassing not just the period of a single shift but e.g. the period of a whole month.

The collected industrial database set may contain tens or even hundreds of attributes. Many of these attributes may be insignificant for the current analysis. It is frequently the case that integrated industrial databases contain redundant records, which is a consequence of the specific way in which the data is collected and stored in a single factory in multiple places. Keeping insignificant records in a database which is to be used for modeling may lead not only to – often a significant – teaching time increase but also to developing a poor quality model. Of course, at the introductory stage, a specialist in a given area may identify the group of attributes which are relevant for further modeling, but this may be very time-consuming, especially given that the nature of the phenomenon under analysis is not fully known. Removing the significant attributes or keeping the insignificant ones may make it very difficult, or even impossible, to develop an accurate model for the collected data.


Fig. 10. The operations performed in the data reduction task (with the introductory stage represented as the squared area and the main stage represented as the white area)

At the main stage, attribute selection is performed without any essential analysis of the data under modeling, but only with the help of selection algorithms. This is supposed to lead to defining the intrinsic (or effective) dimensionality of the collected data set (Fukunaga, 1990). The algorithms in question are classified as either *filters* or *wrappers*, depending on whether they are employed as the operation preceding the modeling proper (filters) or as the operation which is performed alternately with the modeling (wrappers).

Attribute selection is performed for four reasons: (a) to reduce the dimensionality of the attribute space, (b) to accelerate the learning algorithms, (c) to increase the classification quality, and (d) to facilitate the analysis of the obtained modeling results (Liu et al., 2003). According to the same authors, it may – in particular circumstances – increase the grouping accuracy.
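
A minimal sketch of a filter-type selection step: attributes are ranked by the absolute Pearson correlation with the modeled output before any modeling takes place. The attribute names and the synthetic data are illustrative assumptions; the chapter does not commit to this particular filter.

```python
import numpy as np

def rank_attributes(X, y, names):
    """A simple filter: rank attributes by the absolute Pearson correlation
    with the modeled output and keep the most informative ones."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(zip(names, scores), key=lambda p: p[1], reverse=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # three candidate inputs
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)     # output driven mainly by the first
print(rank_attributes(X, y, ["Cu", "Ni", "austenitizing_T"]))
```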

Dimensionality reduction is the process of creating new attributes via a transformation of the current ones. The result of dimensionality reduction is that the current inventory of attributes X{x1,x2,…,xn} is replaced with a new one Y{y1,y2,…,ym} in accordance with the following equation (6):

$$\mathbf{Y} = F(x_1, x_2, \dots, x_n) \tag{6}$$

where F() is a mapping function and m < n. In the specific case y1 = a1x1 + a2x2, where a1 and a2 are coefficients (Liu & Motoda, 1998). The same approach, which takes into consideration in its definition the internal dimensionality, may be found in (Van der Maaten et al., 2009). The collected data are located on or in the neighborhood of a multidimensional curve. This curve is characterized with m attributes of the internal dimensionality, but is placed in the n-dimensional space. The reduction frees the data from the excessive dimensionality, which is important since it is sometimes the case that m << n.
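
One possible realization of the mapping F of eq. (6) is principal component analysis; the sketch below is only an example of such a transformation, not the specific method used by the authors. Note that the m new attributes are linear combinations which cannot be given meaningful names, which is exactly the interpretability drawback discussed below.

```python
import numpy as np

def reduce_dimensionality(X, m):
    """An example mapping F of eq. (6): principal component analysis.
    The n original attributes are replaced with m linear combinations."""
    Xc = X - X.mean(axis=0)                       # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T                          # scores on the first m components

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                     # n = 5 original attributes
Y = reduce_dimensionality(X, m=2)                 # m = 2 new, unnamed attributes
print(Y.shape)                                    # (100, 2)
```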


The operation of dimensionality reduction may be performed either by itself or after attribute selection. The latter takes place when the attribute selection operation has still left a considerable number of attributes used for modeling. The difference between dimensionality reduction and attribute selection resides in the fact that dimensionality reduction may lead to a partial loss of the information contained in the data. However, this often leads to the increase in the quality of the obtained model. A disadvantage of dimensionality reduction is the fact that it makes more difficult the interpretation of the obtained results. This results from the impossibility of naming the new (created) attributes. In some publications attribute selection and dimensionality reduction are combined into a single task of data dimensionality reduction1 (Chizi & Maimon, 2005; Fu & Wang, 2003; Villalba & Cunningham, 2007). However, because of the differences in the methods, means, and aims, the two operations should be considered as distinct.

Discretization, in its broadest sense, transforms the data of one kind into the data of another kind. In the literature, this is discussed as the replacement of the quantitative data with the qualitative data or of the continuous data with the discrete data. The latter approach is the approach pertaining to the most frequent cases of discretization. It should be remembered, however, that such an approach narrows down the understanding of the notion of discretization. This is the case since both the continuous and the discrete data are numerical data. The pouring temperature, the charge weight, the percent element content, etc. are continuous data, while the number of melts or the number of the produced casts are discrete data. However, all the above-mentioned examples of quantities are numbers. Discretization makes possible the replacement of a numerical quantity, for instance, the ultimate tensile strength in MPa with the strength characterized verbally as high, medium, or low. In common practice ten methods of discretization method classification are used. These have been put forward and characterized in (Liu et al., 2002; Yang et al., 2005). A number of published works focus on a comparison of discretization methods which takes into account the influence of the selected method on different aspects of further modeling, for instance, decision trees. These comparisons most frequently analyze three parameters upon which data preparation – discretization exerts influence: the time needed for developing the model, the accuracy of the developed model, and the comprehensibility of the model.


Comparative analyses are conducted on synthetic data, which are generated in accordance with a specified pattern (Ismail & Ciesielski, 2003; Boullé, 2006), on the widely known data sets, such as Glass, Hepatitis, Iris, Pima, Wine, etc. (Shi & Fu 2005, Boullé, 2006; Ekbal, 2006; Wu QX et al., 2006; Jin et al., 2009; Mitov et al., 2009), and on production data (Perzyk, 2005).

As has been widely demonstrated, the number of cases recorded in the database is decisive with respect to the modeling time. However, no matter what the size of the data set is, this time is negligibly short in comparison with the time devoted to data preparation. Because of that, as well as because of a small size, in comparison, for instance, with the business data sets, numerosity reduction is not usually performed in industrial data sets.

1 Data Dimensionality Reduction (DDR), Dimension Reduction Techniques (DRT)


The operation of aggregation was first discussed in connection with the discussion of the data transformation task. The difference between these two operations does not reside in the method employed, since the methods are the same, but in the aims with which aggregation is performed. Aggregation performed as a data reduction operation may be treated, in its essence, as the creation of a new attribute, which is the sum of other attributes, a consequence of this being a reduction in the number of events, that is, numerosity reduction.

#### **4. The application of the selected operations of the industrial data preparation methodology**

For the last twenty years, many articles have been published which discuss the results of research on ductile cast iron ADI. These works discuss the results of research conducted with the aim of investigating the influence of the parameters of the casting process, as well as of the heat treatment of the ductile iron casts on their various properties. These properties include, on the one hand, the widely investigated ultimate tensile strength, elongation, and hardness, and, on the other, also properties which are less widely discussed, such as graphite size, impact, or austenite fraction. The results discussed in the articles contain large numbers of data from laboratory tests and from industrial studies. The data set collected for the present work contains 1468 cases coming both from the journal-published data and from the authors' own research. They are characterized via 27 inputs, such as: the chemical composition (characterized with reference to 14 elements), the structure as cast (characterized in terms of 7 parameters), the features as cast (characterized in terms of 2 parameters), the heat treatment parameters (characterized in terms of 4 parameters), as well as via 11 outputs: the cast structure after heat treatment, characterized in terms of the retained austenite fraction and the cast features (characterized in terms of 10 parameters). 13 inputs were selected from the set, 9 of them characterizing the melt chemical composition and 4 characterizing the heat treatment parameters. Also, 2 outputs were selected – the mechanical properties after treatment – the ultimate tensile strength and the elongation. The set obtained in this way, which contained 922 cases, was prepared as far as data cleaning is concerned, in accordance with the methodology discussed above.

Prior to preparation, the set contained only 34,8 % of completely full records, containing all the inputs and outputs. The set contained a whole range of cases in which outliers were suspected.

For the whole population of the data set under preparation outliers and high leverage points were defined. This made possible defining influentials, that is, the points exhibiting a high value for Cook's distance. In Fig 11, which represents the elongation distribution in the function of the ultimate tensile strength, selected cases were marked (a single color represents a single data source). The location of these cases in the diagram would not make it possible to unequivocally identify them as outliers. However, when this is combined with an analysis of the cases with a similar chemical composition and heat treatment parameters, the relevant cases may be identified as such.
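
Leverage and Cook's distance for flagging influential cases can be computed as in the sketch below; the regression of elongation on UTS and the numeric values are assumed for illustration and the 4/n cut-off is only one common rule of thumb, not the criterion stated in the chapter.

```python
import numpy as np

def cooks_distance(x, y):
    """Leverage and Cook's distance for a simple linear fit y ~ x,
    used to flag influential observations (high leverage and large residual)."""
    X = np.column_stack([np.ones_like(x), x])
    H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat (projection) matrix
    leverage = np.diag(H)
    residuals = y - H @ y
    p = X.shape[1]
    mse = residuals @ residuals / (len(y) - p)
    return residuals ** 2 / (p * mse) * leverage / (1.0 - leverage) ** 2

uts = np.array([500.0, 700.0, 900.0, 1100.0, 1300.0, 1500.0])   # assumed UTS [MPa]
a5 = np.array([11.0, 9.0, 7.5, 6.0, 4.0, 14.0])                  # assumed elongation [%]
d = cooks_distance(uts, a5)
print(np.where(d > 4 / len(uts))[0])   # indices of suspected influential cases
```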

Fig. 12 below represents the generated correlation matrix of the collected data. The high level of the absent data in the database is visible, among others, in the form of the empty values of the correlation coefficient – for instance, the database contains no case of a simultaneous appearance of both *Al contents* and *participation of graphite nodules*.


Fig. 11. The elongation distribution in the function of the ultimate tensile strength. The colored points represent cases identified as outliers (the authors' own work)

With the use of the methodology discussed above, the database was filled in, which resulted in a significant decrease in the percentage of the absent data for the particular inputs and outputs. The degree of the absent data was between 0,5 and 28,1% for inputs and 26,9% for the UTS and 30,6% for the elongation. The result of filling in the missing data was that the degree of the absent data for inputs was reduced to the value from 0,5 to 12,8%, and for strength and elongation to, respectively, 1,3% and 0,8%. Fig. 13 represents the proportion of the number of records in the particular output ranges "before" and "after" this replacement operation. Importantly, we can observe a significant increase in the variability range for both dependent variables, that is, for elongation from the level of 0÷15% to the level of 0÷27,5%, and for strength from the level of 585÷1725MPa to the level of 406÷1725MPa, despite the fact that the percentage of the records in the respective ranges remained at a similar level for the data both "before" and "after" the filling operation. Similar changes occurred for the majority of independent variables – inputs. In this way, via an increase in the learning data set, the reliability of the model was also increased. Also, the range of the practical applications of the taught ANN model was increased, via an increase in the variability range of the parameters under analysis.

The data which was prepared in the way discussed above were then used as the learning set for an artificial neural network (ANN). Basing on the earlier research [4], two data sets were created which characterized the influence of the melt chemical composition and the heat treatment parameters (Tab. 1) of ductile cast iron ADI alloys on the listed mechanical properties. The study employed a network of the MLP type, with one hidden layer and with the number of neurons hidden in that layer equaling the number of inputs. For each of the two cases, 10 learning sessions were run and the learning selected for further analysis was the one with the smallest mean square error.
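
The training protocol described above (an MLP with one hidden layer, as many hidden neurons as inputs, ten learning sessions, keeping the run with the smallest mean square error) can be sketched as follows. The chapter does not name the software used; scikit-learn and the synthetic stand-in data below are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

# Illustrative stand-in data: 13 inputs (chemistry + heat treatment), 1 output (UTS).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = X @ rng.normal(size=13) + 0.1 * rng.normal(size=200)

# One hidden layer with as many neurons as inputs; keep the best of 10 runs.
best_model, best_mse = None, np.inf
for seed in range(10):
    mlp = MLPRegressor(hidden_layer_sizes=(X.shape[1],), max_iter=2000,
                       random_state=seed).fit(X, y)
    mse = mean_squared_error(y, mlp.predict(X))
    if mse < best_mse:
        best_model, best_mse = mlp, mse
print(best_mse)
```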

A qualitative analysis of the taught model demonstrated that the prediction quality obtained for the almost twice as numerous learning data set obtained after the absent data replacement was comparable, which is witnessed by a comparable quantitative proportion of errors obtained on the learning data set for both cases (Fig. 14).

Fig. 12. The input and output correlation matrix. The blue color represents the positive correlation while the red color represents the negative one; the deformation of the circle (ellipsoidality) and the color saturation indicate the correlation strength (Murdoch & Chow, 1996)


Fig. 13. The proportions of the numbers of records in the specified ranges of the output variables a) elongation [%], b) ultimate tensile strength [MPa] before and after replacing the missing and empty values

Fig. 14. Output prediction error distribution a) elongation [%], b) ultimate tensile strength [MPa]


The analyzed cases suggest that an ANN model taught on a data set after absent data replacement exhibits similar and, at the same time, high values of the model prediction quality measures, that is, of the mean square error and of the coefficient of determination R2. In the case of predicting A5, the more accurate model was the one based on the data set after replacement (for Rm the opposite was the case). However, the most important observable advantage following from data replacement, and from modeling with the use of the data set after replacement, is the fact that the ANN model increased its prediction range with respect to the extreme output values (in spite of a few deviations and errors in their predictions). This is extremely important from the perspective of the practical model application, since it is the extreme values which are frequently the most desired (for instance, obtaining the maximum value of Rm with, simultaneously, the highest possible A5), and, unfortunately, the sources – for objective reasons – usually do not spell out the complete data for which these values were obtained. In spite of the small number of records characterizing the extreme outputs, an ANN model was successfully developed which can make predictions in the whole possible range of the dependent variables. This may suggest that the absent data was replaced with appropriate values, reflecting the prevailing general tendencies and correlations. It should also be noted that the biggest prediction errors occur for the low values of the parameters under analysis (e.g. for A5 within the range 0÷1%), which may be directly connected with inaccuracy and noise in the measurement (e.g. a premature break of a tensile specimen resulting from a discontinuity or non-metallic inclusions).

A further confirmation of this comes from a comparative analysis of the real results and the ANN answers, based on correlation graphs and on a modified version of the correlation graph which takes into account the variable distribution density (Fig. 15 and 16). In all cases under analysis the ANN model shows a tendency to overestimate the minima and to underestimate the maxima. This weakness of ANNs was discussed in (Kozlowski, 2009).

Fig. 15. The distribution of the real results and the predictions of the elongation variable for a) the data set without replacement; b) the data set after replacement

Fig. 16. The distribution of the real results and the predictions of the UTS (ultimate tensile strength) variable for a) the data set without replacement; b) the data set after replacement
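A sketch of the correlation-graph comparison used in Fig. 15 and 16 is given below: real values are plotted against the ANN predictions, so the overestimation of minima and underestimation of maxima shows up as points pulled toward the middle of the diagonal. The variable names are illustrative.

```python
# A sketch of a correlation (parity) plot of real values versus ANN predictions.
import matplotlib.pyplot as plt

def parity_plot(y_true, y_pred, label):
    lo = min(y_true.min(), y_pred.min())
    hi = max(y_true.max(), y_pred.max())
    plt.scatter(y_true, y_pred, s=10, alpha=0.5, label=label)
    plt.plot([lo, hi], [lo, hi], color="black", linewidth=1)   # ideal-prediction line
    plt.xlabel("real value")
    plt.ylabel("ANN prediction")
    plt.legend()
    plt.show()
```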

The application of the proposed methodology makes possible a successful inclusion into data sets of pieces of information coming from different sources, including also uncertain data.

An increase in the number of cases in the data set used for modeling results in an increase in model accuracy, at the same time significantly widening the practical application range of the taught ANN model via a significant increase, sometimes even a doubling, of the variability range of the parameters under analysis.

Multiple works have suggested that ANN models should not be used for predicting results which lie beyond the learning data range. The methodology proposed above makes the absent data replacement possible and, therefore, allows the database size to be increased. This, in turn, makes it possible to develop models with a significantly wider applicability range.


## **5. Conclusion**

The proposed methodology makes possible the full use of imperfect data coming from various sources. Appropriate preparation of the data for modeling via, for instance, absent data replacement makes it possible to widen, directly and indirectly, the parameter variability range and to increase the set size. Models developed on such sets are high-quality models and their prediction accuracy can be more satisfactory.

The consequent wider applicability range of the model and its stronger reliability, in combination with its higher accuracy, open a way to a deeper and wider analysis of the phenomenon or process under analysis.

## **6. References**

Agre G. & Peev S. (2002). On Supervised and Unsupervised Discretization, *Cybernetics and Information Technologies*, Vol. 2, No. 2, pp. 43-57, Bulgarian Academy of Science, Sofia

Al Shalabi L. & Shaaban Z. (2006). Normalization as a Preprocessing Engine for Data Mining and the Approach of Preference Matrix, *Proceedings of the International Conference on Dependability of Computer Systems (DEPCOS-RELCOMEX'06)*, ISBN 0-7695-2565-2

Al Shalabi L.; Shaaban Z. & Kasasbeh B. (2006). Data Mining: A Preprocessing Engine, *Journal of Computer Science*, Vol. 2, No. 9, pp. 735-739, ISSN 1549-3636

Alves R.M.B. & Nascimento C.A.O. (2002). Gross errors detection of industrial data by neural network and cluster techniques, *Brazilian Journal of Chemical Engineering*, Vol. 19, No. 4, pp. 483-489, October-December 2002

Bartkowiak A. (2005). Robust Mahalanobis distances obtained using the 'Multout' and 'Fast-mcd' methods, *Biocybernetics and Biomedical Engineering*, Vol. 25, No. 1, pp. 7-21

Ben-Gal I. (2005). Outlier detection, In: *Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers*, edited by Maimon O. and Rokach L., Kluwer Academic Publishers, ISBN 0-387-24435-2

Bensch M.; Schröder M.; Bogdan M. & Rosenstiel W. (2005). Feature Selection for High-Dimensional Industrial Data, *ESANN 2005 Proceedings – European Symposium on Artificial Neural Networks*, Bruges, Belgium, 27-29 April 2005, ISBN 2-930307-05-6

Bonebakker J. L. (2007). Finding representative workloads for computer system design, *Sun Microsystems*, ISBN/EAN 978-90-5638-187-5

Boullé M. (2006). MODL: A Bayes optimal discretization method for continuous attributes, *Machine Learning*, Vol. 65, No. 1, pp. 131-165, ISSN 0885-6125 (Print), 1573-0565 (Online)

Cannataro M. (2008). Computational proteomics: management and analysis of proteomics data, *Briefings in Bioinformatics*, Vol. 9, No. 2, pp. 97-101

Cateni S.; Colla V. & Vannucci M. (2008). Outlier Detection Methods for Industrial Applications, In: *Advances in Robotics, Automation and Control*, edited by Aramburo J. and Trevino A.R., InTech, ISBN 978-953-7619-16-9

Chizi B. & Maimon O. (2005). Dimension reduction and feature selection, In: *Data Mining and Knowledge Discovery Handbook*, Ch. 5, pp. 93-111, edited by Maimon O. and Rokach L., Springer Science+Business Media, ISBN 978-0-387-24435-8 (Print), 978-0-387-25465-4 (Online)

Chuanng-Chien Ch.; Tong-Hong L. & Ben-Yi L. (2005). Using correlation coefficient in ECG waveform for arrhythmia detection, *Biomedical Engineering - Applications, Basis & Communications*, Vol. 17, June 2005, pp. 147-152

Ekbal A. (2006). Improvement of Prediction Accuracy Using Discretization and Voting Classifier, *18th International Conference on Pattern Recognition*, Vol. 2, pp. 695-698

Fan H.; Zaïane O.R.; Foss A. & Wu J. (2006). A Nonparametric Outlier Detection for Effectively Discovering Top-N Outliers from Engineering Data, *Advances in Knowledge Discovery and Data Mining*, Lecture Notes in Artificial Intelligence, Vol. 3918, 2006

Filzmoser P. (2005). Identification of multivariate outliers: A performance study, *Austrian Journal of Statistics*, Vol. 34, No. 2, pp. 127-138

Fu X. & Wang L. (2003). Data Dimensionality Reduction With Application to Simplifying RBF Network Structure and Improving Classification Performance, *IEEE Transactions on Systems, Man and Cybernetics – Part B: Cybernetics*, Vol. 33, No. 3, June 2003, pp. 399-409

Fukunaga K. (1990). Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, San Diego, ISBN 0-12-269851-7

Galmacci G. (1996). Collinearity detection in linear regression models, *Computational Economics*, Vol. 9, pp. 215-227

Ginoris Y.P.; Amaral A.L.; Nicolau A.; Coelho M.A.Z. & Ferreira E.C. (2007). Raw data pre-processing in the protozoa and metazoa identification by image analysis and multivariate statistical techniques, *Journal of Chemometrics*, Vol. 21, pp. 156-164

Han J. & Kamber M. (2001). Data Mining: Concepts and Techniques, Morgan Kaufmann Publisher, ISBN 1-55860-489-8

Hasimah Hj M.; Abdul Razak H. & Azuraliza Abu B. (2007). Pixel-based Parallel-Coordinates technique for outlier detection in Cardiac Patient Dataset, *Proceedings of the International Conference on Electrical Engineering and Informatics*, Institut Teknologi Bandung, Indonesia, June 17-19, 2007

Ismail M.K. & Ciesielski V. (2003). An empirical investigation of the impact of discretization on common data distributions, *Proc. of the Third International Conference on Hybrid Intelligent Systems (HIS'03)*, edited by Abraham A., Koppen M. & Franke K., pp. 692-701, IOS Press

Jimenez-Marquez S.A.; Lacroix C. & Thibault J. (2002). Statistical Data Validation Methods for Large Cheese Plant Database, *Journal of Dairy Science*, Vol. 85, No. 9, pp. 2081-2097, American Dairy Science Association, 2002

Jin R.; Breitbart Y. & Muoh Ch. (2009). Data discretization unification, *Knowledge and Information Systems*, Vol. 19, No. 1, April 2009, pp. 1-29, ISSN 0219-1377 (Print), 0219-3116 (Online)

Kochanski A. (2000). Combined methods of ductile cast iron modeling, *III Polski Kongres Odlewnictwa, Zbior Materialow*, pp. 160-165, Warszawa (in Polish)





Kochanski A. (2006). Aiding the detection of cast defect causes, In: Polish Metallurgy 2002–2006, *Komitet Metalurgii Polskiej Akademii Nauk*, red. K. Swiatkowski, ISBN 83-910-159-4-4, Krakow (in Polish)

Kochanski A. (2010). Data preparation, *Computer Methods in Materials Science*, Vol. 10, No. 1, pp. 25-29

Kozlowski J. (2009). Aiding the foundry process control with the use of advanced artificial neural network analysis methods, *PhD thesis (in Polish)*, Warsaw University of Technology, Faculty of Production Engineering

Kusiak A. (2001). Feature Transformation Methods in Data Mining, *IEEE Transactions on Electronics Packaging Manufacturing*, Vol. 24, No. 3, July 2001

Larose D. T. (2005). Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, Inc.

Laurikkala J.; Juhola M. & Kentala E. (2000). Informal Identification of Outliers in Medical Data, *Proceedings of the Fifth International Workshop on Intelligent Data Analysis in Medicine and Pharmacology IDAMAP-2000*, Berlin

Liang Goh & Kasabov N. (2003). Integrated gene expression analysis of multiple microarray data sets based on a normalization technique and on adaptive connectionist model, *Proceedings of the International Joint Conference on Neural Networks*, Vol. 3, pp. 1724-1728, 20-24 July 2003

Liu H.; Hussain F.; Tan C.L. & Dash M. (2002). Discretization: An Enabling Technique, *Data Mining and Knowledge Discovery*, Vol. 6, No. 4, pp. 393-423, Kluwer Academic Publishers, ISSN 1384-5810 (Print), 1573-756X (Online)

Liu H. & Motoda H. (1998). Feature extraction, construction and selection: a data mining perspective, Kluwer Academic Publishers, ISBN 0-7923-8196-3, print. 2001

Liu H.; Motoda H. & Yu L. (2003). Feature Extraction, Selection, and Construction, In: *The Handbook of Data Mining*, edited by Nong Ye, Lawrence Erlbaum Associates, ISBN 0-8058-4081-8

Loy Ch. Ch.; Lai MW. K. & Lim Ch. P. (2006). Dimensionality reduction of protein mass spectrometry data using Random Projection, *Lecture Notes in Computer Science*, Vol. 4233, ISBN 978-3-540-46481-5

Van der Maaten L.; Postma E. & Van den Herik J. (2009). Dimensionality Reduction: A Comparative Review, preprint

Masters T. (1993). Practical Neural Network Recipes in C++, Academic Press Inc.

McCue C. (2007). Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis, Butterworth-Heinemann, ISBN 0750677961, 9780750677967

Mitov I.; Ivanova K.; Markov K.; Velychko V.; Stanchev P. & Vanhoof K. (2009). Comparison of discretization methods for preprocessing data for Pyramidal Growing Network classification method, *New Trends in Intelligent Technologies, Supplement to International Journal - Information Technologies and Knowledge*, Vol. 3

Moh'd Belal Al-Zgubi (2009). An Effective Clustering-Based Approach for Outlier Detection, *European Journal of Scientific Research*, ISSN 1450-216X, Vol. 28, No. 2, pp. 310-316

Mohamed H. Hj.; Hamdan A.R. & Bakar A.A. (2007). Pixel-based Parallel-Coordinates technique for outlier detection in Cardiac Patient Dataset, *Proceedings of the International Conference on Electrical Engineering and Informatics*, Institut Teknologi Bandung, Indonesia, June 17-19, 2007

Murdoch D.J. & Chow E.D. (1996). A graphical display of large correlation matrices, *The American Statistician*, Vol. 50, pp. 178-180

Olichwier J. (2011). The influence of data preparation on the accuracy of the modeling with the use of the laboratory and literature ADI ductile cast iron data, *MSc thesis (in Polish)*, supervised by A. Kochanski, Warsaw University of Technology, Faculty of Production Engineering

Perzyk M.; Biernacki R. & Kochanski A. (2005). Modelling of manufacturing processes by learning systems: the naive Bayesian classifier versus artificial neural networks, *Journal of Materials Processing Technology*, Elsevier, Vol. 164-165, pp. 1430-1435

Pyle D. (1999). Data Preparation for Data Mining, Morgan Kaufmann Publisher, ISBN 1-55860-529-0

Pyle D. (2003). Data Collection, Preparation, Quality, and Visualization, In: *The Handbook of Data Mining*, edited by Nong Ye, Lawrence Erlbaum Associates, ISBN 0-80584-081-8

Refaat M. (2007). Data Preparation for Data Mining Using SAS, Morgan Kaufmann Publisher, ISBN 13-978-0-12-373577-5

Research raport KBN Nr 003353/C.T08-6/2003 (in Polish)

Rousseeuw P.J. & Zomeren B.C. van (1990). Unmasking multivariate outliers and leverage points, *Journal of the American Statistical Association*, Vol. 85, No. 411, pp. 633-651, American Statistical Association

Saeys Y.; Inza I. & Larrañaga P. (2007). A review of feature selection techniques in bioinformatics, *Bioinformatics*, Vol. 23, No. 19, pp. 2507-2517

Shaari F.; Abu Bakar A. & Razak Hamdan A. (2007). On New Approach in Mining Outlier, *Proceedings of the International Conference on Electrical Engineering and Informatics*, Institut Teknologi Bandung, Indonesia, June 17-19, 2007

Shi H. & Fu J-Z. (2005). A global discretization method based on rough sets, *Proceedings of the Fourth International Conference on Machine Learning and Cybernetics*, Guangzhou, 18-21 August 2005

StatSoft (2011). Web page http://www.statsoft.pl/textbook/stathome.html

Villalba S. D. & Cunningham P. (2007). An Evaluation of Dimension Reduction Techniques for One-Class Classification, *Technical Report UCD-CSI-2007-9*, University College Dublin, August 13th, 2007

Weiss S. M. & Indurkhya N. (1998). Predictive Data Mining: a practical guide, Morgan Kaufmann Publisher, ISBN 1-55860-403-0

Witten I.H. & Frank E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Elsevier Inc., ISBN-13: 978-0-12-088407-0

Wu QX.; Bell D.; McGinnity M.; Prasad G.; Qi G. & Huang X. (2006). Improvement of Decision Accuracy Using Discretization of Continuous Attributes, *Lecture Notes in Computer Science*, Vol. 4223, Springer Berlin / Heidelberg, ISSN 0302-9743 (Print), 1611-3349 (Online)

Yang Y.; Webb G.I. & Wu X. (2005). Discretization Methods, In: *Data Mining and Knowledge Discovery Handbook*, Ch. 6, pp. 113-130, edited by Maimon O. and Rokach L., Springer Science+Business Media, ISBN 978-0-387-24435-8 (Print)



## **A Knowledge Representation Formalism for Semantic Business Process Management**

Ermelinda Oro and Massimo Ruffolo

*High Performance Computing and Networking Institute of the National Research Council, Altilia srl Italy*

#### **1. Introduction**


Business process models are increasingly used to create clarity about the logical sequence of activities in public and private organizations belonging to different industries and areas. To improve *Business Process Management* (BPM), semantic technologies (like ontologies, reasoners, and semantic Web services) should be integrated in BPM tools in order to enable semantic BPM. Semantic Business Process Management (SBPM) approaches and tools aim at allowing more efficient and effective business process management across complex organizations. By semantic BPM decision makers can get transparent, fast, and comprehensive view of relevant business processes for better analyzing and driving processes. In defining semantic BPM tools aimed at improving the quality of process models and subsequent process analyses, a key aspect to take into account is to represent in combined way static knowledge regarding a specific application domain (i.e. domain ontologies) and dynamic knowledge related to process schemas and instances that are typically performed in a given domain. For example, in the health care domain, where the evidence-based medicine has contributed to define and apply clinical processes for caring a wide variety of diseases, a process-oriented vision of clinical practices may allow for enhancing patient safety by enabling better risks management capabilities.

In this Chapter is firstly summarized the large body of work currently available in the field of knowledge representation formalisms and approaches for representing and managing business processes. Then a novel ontology-based approach to business process representation and management, named Static/Dynamic Knowledge Representation Framework (SD-KRF), is presented. The SD-KRF allows for expressing in a combined way domain ontologies, business processes and related business rules. It supports semantic business process management and contributes to enhancing existing BPM solutions in order to achieve more flexible, dynamic and manageable business processes. More in detail, the presented framework allows methods for:


1. Creating ontologies of business processes that can be queried and explored in a semantic fashion.

2. Expressing business rules (by means of *reasoning tasks*) that can be used for monitoring processes.

3. Extracting information from business documents. Semantic information extraction allows the acquisition of information and metadata useful for the correct execution of business processes from unstructured sources and the storage of the extracted information in a structured, machine-readable form. Such a facility makes available large amounts of data on which data mining techniques can be performed to discover patterns related to adverse events, errors and cost dynamics, hidden in the structure of the business processes, that are causes of risks and of poor performance.

4. Querying directly the enterprise database in order to check activity status.

5. Executing business processes and acquiring process instances by means of either *workflow enactment* (predefined process schemas are automatically executed) or *workflow composition* (the activities to execute are chosen step-by-step by humans).

6. Monitoring business processes during the execution by running reasoning tasks.

7. Analyzing acquired business process instances, by means of querying and inference capabilities, in order to recognize errors and risks for the process and the whole organization.
SD-KRF is an homogeneous framework where the domain knowledge, the process structures, and the behavioral semantics of processes are combined in order to allow querying, advanced analysis and management of business processes in a more flexible and dynamic way.

#### **2. Semantic business process management at a glance**

BPM links processes and information systems. One of the most important aspects of BPM is the modeling of processes. Historically, process modeling has mainly been performed with general-purpose languages, such as Activity Diagrams (AD), the Business Process Modeling Notation (BPMN) or Event-driven Process Chains (EPC). Such instruments are not suitable for an automated semantic process analysis because semantic modeling of the structural elements and of the domain knowledge is missing. In recent years, different languages and approaches for semantic business process management have emerged. In this Section, languages for representing processes and their semantics are briefly described.

By considering their abilities to represent business processes, as described in van der Aalst (2009), existing languages for process modeling can be classified into:


• **Formal languages**. Processes are described by using formal models, for example Markov chains and Petri nets. Such languages have unambiguous semantics.

• **Conceptual languages**. Processes are represented by user-friendly semi-formal languages. Examples of well-known conceptual languages are UML activity diagrams, BPMN (Business Process Modeling Notation), and EPCs (Event-Driven Process Chains). Activity diagrams (or control flow diagrams) are a type of UML (Unified Modeling Language OMG (2011)) diagram. They provide a graphical notation to define the sequential, conditional, and parallel composition of lower-level behaviors, therefore they are suitable for modeling business processes. The Event-driven Process Chain (EPC) van der Aalst (1999) is a type of flowchart used for business process modeling; compared to UML activity diagrams, the EPC covers more aspects, such as a detailed description of business organization units together with their respective functions, as well as the information and material resources used in each function. These essential relationships are not explicitly shown in activity diagrams. The Business Process Model and Notation (BPMN) White (2006) is a graphical notation for drawing business processes, proposed as a standard notation. The language is similar to other informal notations such as UML activity diagrams and extended event-driven process chains. Models expressed in terms of BPMN are called Business Process Diagrams (BPDs). A BPD is a flowchart having different elements: Flow Objects, Connecting Objects, Swim-lanes, Artifacts, Events, Activities, and Gateways (a toy rendering of a BPD as a plain data structure is sketched after this list). Events are comparable to places in a Petri net; in fact, they are used to trigger and/or connect activities. Whereas in UML Activity Diagrams and in BPMN resource types are captured as swim-lanes, with each task belonging to one or more swim-lanes, in Event-driven Process Chains (EPC) resource types are explicitly attached to each task. These types of languages describe only the desired behavior of processes and do not have a formal semantics, therefore they are not suitable for enabling process execution.

• **Execution languages**. Because formal languages are too general and conceptual languages aim at the representation of processes and not directly at their execution, languages that consider process enactment have been defined. The most common language in this category is BPEL (Business Process Execution Language) Wohed et al. (2006). BPMN diagrams are refined into BPEL specifications, but such a translation is a difficult task because BPMN lacks a formal semantics. Therefore, several attempts have been made to provide semantics for a subset of BPMN Weske (2007). Other proprietary enactment languages have been defined; for example, XPDL XPDL (2011) is a very common language based on BPMN.
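As mentioned in the description of conceptual languages above, a BPD can be seen as a flowchart of typed flow objects linked by connecting objects. The toy sketch below renders such a diagram as a plain data structure; it is not a standard BPMN serialization, and all element names are invented.

```python
# A toy illustration (not a standard serialization) of the BPD vocabulary:
# flow objects are nodes typed as events, activities or gateways, and
# connecting objects are directed edges between them.
from collections import namedtuple

Node = namedtuple("Node", ["name", "kind"])          # kind: event | activity | gateway

nodes = {
    "start":   Node("start",   "event"),
    "review":  Node("review",  "activity"),
    "decide":  Node("decide",  "gateway"),
    "approve": Node("approve", "activity"),
    "reject":  Node("reject",  "activity"),
    "end":     Node("end",     "event"),
}
edges = [("start", "review"), ("review", "decide"),
         ("decide", "approve"), ("decide", "reject"),
         ("approve", "end"), ("reject", "end")]

def successors(node_name):
    """Follow the connecting objects leading out of a flow object."""
    return [dst for src, dst in edges if src == node_name]

print(successors("decide"))   # ['approve', 'reject']
```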

#### **2.1 Semantic business process management**


BPM is a difficult task because the semantics of business processes is frequently hidden in complex models obtained by different description and enactment languages. The explicit representation of domain knowledge related to business processes, combined with an explicit description of the process semantics, could help to obtain advices, alerts, and reminders. Furthermore, reasoning capabilities allow for representing and managing business rules and for better enacting and monitoring of processes Peleg (2009). Classical languages adopted for representing process models provide a low degree of automation in the BPM lifecycle. In particular, there are many difficulties in the translation of business models (prepared by business analysts) into workflow models (which are executable IT representations of business processes). Just as Semantic Web Services achieve more automation in discovery and mediation with respect to conventional Web services, BPM systems can obtain more automation by using knowledge representation and reasoning, and therefore semantic technologies Hepp et al. (2005).

Knowledge representation and reasoning was initially used for artificial intelligence tasks Newell (1980) to support humans in decision making. Then, rule-based systems were introduced. An important example is Mycin Shortliffe (1976), which represented clinical knowledge and contained if-then-else rules in order to derive diagnoses and treatments for a given disease. Later, databases were integrated for representing knowledge in decision support systems. An important example is the Arden system for Medical Logic Modules (MLMs) Hripcsak et al. (1994). MLMs, in Arden Syntax, define decision logic via a knowledge category that has data, event, logic, and action slots useful in representing processes. Finally, ontologies Gruber (1995) were used for formally representing knowledge as a set of concepts within a domain and the relationships between those concepts. An ontology may be used to describe the domain and to reason about the entities and relations within that domain in order to provide decision support. It is noteworthy that adding, deleting or modifying knowledge in rule-based systems was a difficult task, whereas ontologies are more simply modifiable. Ontologies and Semantic Web service technologies can be used throughout the BPM lifecycle Hepp et al. (2005); Wetzstein et al. (2007).
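The slot structure attributed to MLMs above (data, event, logic, action) can be illustrated schematically as follows; this is plain Python rather than Arden Syntax, and the field names and threshold are invented for the example.

```python
# A schematic illustration of the data/event/logic/action slots of a Medical
# Logic Module, written as plain Python rather than Arden Syntax.
mlm = {
    "data":   lambda record: record.get("creatinine"),         # fetch a value
    "event":  "new_lab_result",                                 # triggering event
    "logic":  lambda value: value is not None and value > 2.0,  # if-then condition
    "action": lambda: print("alert: possible renal impairment"),
}

def on_event(event_name, record):
    """Fire the module when its triggering event occurs and its logic holds."""
    if event_name == mlm["event"] and mlm["logic"](mlm["data"](record)):
        mlm["action"]()

on_event("new_lab_result", {"creatinine": 2.6})   # prints the alert
```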

The use of semantics in BPM creates Semantic Business Process Management (SBPM) System. The goal of Semantic Business Process Management is to achieve more automation in BPM by using semantic technologies. In Wetzstein et al. (2007) the SBPM lifecycle is described. There are 4 principal phases: *Process Modeling*, *Process Implementation*, *Process Execution*, and *Process Analysis*. The usage of semantic technologies increases the automation degree and the BPMS functionalities.

During the *process modeling phase*, the annotation of business process models allows for associating semantics with tasks and decisions in the process. The annotation is usually performed by using ontologies that describe domains or process components. Generally, ontologies are created by ontology engineers, domain experts and business analysts. Different types of ontologies are relevant to business process management Hepp & Roman (2007). For instance, an organizational ontology is used to specify which organizational tasks have to be performed, in combination with a Semantic Web Service (SWS) ontology that specifies the IT services that implement tasks, and domain ontologies that describe the data used in the processes. The process annotation enables additional semantic functionalities. In fact, the ontological annotation of tasks enables the reuse of process fragments in different business processes in the *implementation phase*. During the *execution phase*, semantic instances are created, and semantic checks of the obtained instances can be automatically evaluated by calling reasoning tasks. During the semantic BP *analysis phase*, two different features are distinguished: (i) process monitoring, which aims at providing relevant information about running process instances in the process execution phase, and (ii) process mining, which analyzes already executed process instances in order to detect points of improvement for the process model. Such features take advantage of the semantic annotation. For instance, business analysts can formulate semantic queries and use reasoning to deduce implicit knowledge. Analysis allows for improving business processes in order to decrease costs or risks in process executions Medeiros & Aalst (2009).
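A minimal sketch of such an annotation and of a simple semantic check is given below, using rdflib as an assumed toolkit: tasks of a process model are linked to domain-ontology concepts, and a SPARQL query then lists the tasks that are still unannotated. The vocabulary URIs are invented and do not belong to any of the ontologies cited above.

```python
# A minimal sketch of semantic annotation during process modeling and a simple
# consistency check over the annotations (rdflib assumed for illustration).
from rdflib import Graph, Namespace, RDF

BPM = Namespace("http://example.org/bpm#")
DOM = Namespace("http://example.org/domain#")

g = Graph()
g.add((BPM.CheckBloodPressure, RDF.type, BPM.Task))
g.add((BPM.CheckBloodPressure, BPM.annotatedWith, DOM.BloodPressureMeasurement))
g.add((BPM.DischargePatient, RDF.type, BPM.Task))          # left unannotated

unannotated = g.query("""
    PREFIX bpm: <http://example.org/bpm#>
    SELECT ?task WHERE {
        ?task a bpm:Task .
        FILTER NOT EXISTS { ?task bpm:annotatedWith ?concept }
    }""")
for row in unannotated:
    print(row.task)    # -> http://example.org/bpm#DischargePatient
```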

There exists a large body of work addressing the enhancement of Business Process Management Systems ter Hofstede et al. (2010) by using Semantic Web techniques and, in particular, computational ontologies Hepp et al. (2005).

Robust and effective results have been obtained from European research projects CORDIS (2011), such as SUPER SUPER (2011), PLUG-IT PLUG-IT (2011) and COIN COIN (2011). They implemented new semantics-based software tools that enhance the BPs of companies and lower costs and risks. SUPER SUPER (2011), which stands for Semantics Utilised for Process Management within and between Enterprises, is a European project financed by the European Union 6th Framework Programme, within the Information Society Technologies (IST) priority. The project successfully concluded on the 31st of March 2009. The objective of SUPER was to make BPM accessible to business experts by adding semantics to BPs. Semantic Web and, in particular, Semantic Web Services (SWS) technology allows for integrating applications at the semantic level. Based on ontologies, Semantic Web (SW) technologies provide scalable methods and tools for the machine-readable representation of knowledge. Semantic Web Services (SWS) use SW technologies to support the automated discovery, substitution, composition, and execution of software components (Web Services). BPM is a natural application for SW and SWS technology. The SUPER project combined SWS and BPM, provided a semantic-based and context-aware framework, and created horizontal ontologies which describe business processes and vertical telecommunications ontologies to support domain-specific annotation. SUPER ontologies allow telecoms business managers to search existing processes, to model new business processes, to modify process models, to search for semantic web services that compose business processes and to execute implemented business process models.

The plugIT project PLUG-IT (2011), which stands for Plug Your Business into IT, has been co-funded by the European Union (Intelligent Content and Semantics). It is based on the observation of the necessity to align Business and Information Technology (IT). The result of the project was the IT-Socket Knowledge portal. The project is based on the idea that IT can be consumed by plugging in the business, like electric power is consumed when plugging electronic devices into a power socket. The IT-Socket externalizes the expert knowledge by using graphical semi-formal models; this knowledge is then formalized in order to enable a computer-supported alignment using semantic technologies. Figure 1 shows business and IT experts who formalize the knowledge and thereby enable automated support of business and IT alignment. In particular, the alignment can be delegated to semantic technologies.

COIN, which stands for Enterprise COllaboration and INteroperability, COIN (2011) is an integrated project in the European Commission Seventh Framework Programme that started in 2008 and ended in 2011. The scope of the project is to create pervasive and self-adapting knowledge that enables enterprise collaboration and interoperability services in order to manage and effectively operate different forms of business collaborations.

Fig. 1. IT-Socket for business and IT alignment

In the literature, many ontology-based approaches to business process management have been proposed. Initially, a set of approaches was proposed to apply techniques borrowed from the Semantic Web to the BP management context SUPER (2011). In Missikoff et al. (2011) an ontology-based approach for querying business process repositories, aimed at the retrieval of process fragments to be reused in the composition of new BPs, is presented. The proposed solution is composed of an ontological framework (OPAL) aimed at capturing the semantics of a business scenario and a business process modelling framework (BPAL) used to represent the workflow logic of BPs. In Markovic (2008) a querying framework based on ontologies is presented.


In Francescomarino & Tonella (2008) a visual query language for business processes is described. Processes are represented through a BPMN meta-model ontology annotated by using domain ontologies, and SPARQL queries are visually formulated.

Other approaches based on meta-model ontologies have been presented in Haller et al. (2008; 2006). In Hornung et al. (2007) the authors present an initial idea for an automatic approach for the completion of BP models. Their system recommends appropriate completions of initial process fragments based on business rules and structural constraints. The main elements are modeled by using an OWL representation of Petri nets that allows for efficiently computing the semantic similarity between process model variants. Additionally, the approach makes use of the Semantic Web Rule Language (SWRL), which is based upon a combination of OWL DL with Unary/Binary Datalog RuleML Boley et al. (2001), to model additional constraints imposed by business rules. In the Ontoprocess system Stojanovic & Happel (2006) for semantic business process management, semantically described business processes are combined with SWRL rules by using a set of shared ontologies that capture knowledge about a business domain. These formal specifications make it possible to automatically verify whether a process description satisfies the consistency constraints defined by business rules. In order to query BPs, some graph-matching-based approaches have been proposed Awad et al. (2008); Haller et al. (2006). In Awad et al. (2008) BPs are compiled into finite state models, so that model checking techniques allow for verifying structural features of process schemas. However, the semantics of the business domain is not considered. Other approaches that allow for modeling and reasoning over workflows, based on logic programming Montali et al. (2008); Roman & Kifer (2007), have been introduced. Such approaches allow for checking and enacting BPs, but they are not used for querying.
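The rule-checking idea described above, i.e. verifying whether a process description satisfies the consistency constraints defined by business rules, can be illustrated with a deliberately simplified sketch; plain Python stands in for SWRL here, and both the process structure and the rule are invented.

```python
# A simplified sketch of rule checking over a process description: a business
# rule is applied to the model to verify a consistency constraint.
process = {
    "name": "order_handling",
    "tasks": [
        {"name": "receive_order", "role": "clerk"},
        {"name": "approve_order", "role": "manager"},
        {"name": "ship_order",    "role": None},       # violates the rule below
    ],
}

def every_task_has_role(proc):
    """Business rule: each task must be assigned to an organizational role."""
    return [t["name"] for t in proc["tasks"] if not t["role"]]

violations = every_task_has_role(process)
print(violations or "process satisfies the rule")       # -> ['ship_order']
```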

As shown, many approaches have been proposed in the literature, but none is capable of semantically managing, in a comprehensive way, all phases of the SBPM lifecycle.

#### **2.2 Business process management in the health care domain**

The health care domain is of great interest for BPM. In fact, in the recent past, a strong research effort has been made to provide standard representations of both declarative and procedural medical knowledge. In particular, in the area of medical knowledge and clinical process representation, there exists one of the richest collections of domain ontologies available worldwide, together with a wide variety of formalisms for clinical process representation. In the following, available approaches and systems for medical ontologies and clinical process representation and management are described.

A very famous and widely adopted medical thesaurus is MeSH, the Medical Subject Headings classification MESH (2011). It provides a controlled vocabulary in the fields of medicine, nursing, dentistry, veterinary medicine, etc. MeSH is used to index, catalogue and retrieve the world's medical literature contained in PubMed. Another classification, which has become the international standard diagnostic classification for all medical activities and health management purposes, is ICD10-CM ICD (2011); WHO (2011), the International Classification of Diseases Clinical Modification, now in its 10th Revision. The most comprehensive medical terminology developed to date is SNOMED-CT SNOMED (2011), the Systematized Nomenclature of Medicine Clinical Terms, based on a semantic network containing a controlled vocabulary. Electronic transmission and storage of medical knowledge are facilitated by LOINC, the Logical Observation Identifiers Names and Codes LOINC (2011), which consists of a set of codes and names describing terms related to clinical laboratory results, test results and other clinical observations. A machine-readable nomenclature for medical procedures and services performed by physicians is described in CPT, the Current Procedural Terminology CPT (2011), a registered trademark of the American Medical Association. A comprehensive meta-thesaurus of biomedical terminology is the NCI-EVS NCI-EVS (2011) cancer ontology. Some medical ontologies are also maintained by European medical organizations. For example, CCAM, the Classification Commune des Actes Medicaux CCAM (2011), is a French coding system of clinical procedures that consists of a multi-hierarchical classification of medical terms related to physician and dental surgeon procedures. A classification of the terminology related to surgical operations and procedures that may be carried out on a patient is OPCS-4, the Office of Population Censuses and Surveys Classification of Surgical Operations and Procedures 4th Revision OPCS-4 (2011), developed in the UK by the NHS. The most famous and widely used ontology in the field of healthcare information systems is UMLS, the Unified Medical Language System UMLS (2011), which consists of a meta-thesaurus and a semantic network with lexical applications. UMLS includes a large number of national and international vocabularies and classifications (such as SNOMED, ICD-10-CM, and MeSH) and provides a mapping structure between them. This collection of ontologies constitutes machine-processable medical knowledge that can be used for creating semantically-aware health care information systems.

The evidence-based medicine movement, which aims at providing standardized clinical guidelines for treating diseases Sackett et al. (1996), has stimulated the definition of a wide set of approaches and languages for representing clinical processes. A well-known formalism is GLIF, the Guideline Interchange Format GLIF (2011). It is a specification consisting of an object-oriented model that allows for representing sharable, computer-interpretable and executable guidelines. In the GLIF3 specification it is possible to refer to patient data items defined by standard medical vocabularies (such as UMLS), but no inference mechanisms are provided. Pro*forma* Sutton & Fox (2003) is essentially a first-order logic formalism extended to support decision making and plan execution. Arden Syntax HL7 (2011); Peleg et al. (2001); Pryor & Hripcsak (1993) allows for encoding procedural medical knowledge in a knowledge base that contains so-called Medical Logic Modules (MLMs). An MLM is a hybrid between a production rule (i.e. an "if-then" rule) and a procedural formalism. It is less declarative than GLIF and Pro*forma*, and its intrinsic procedural nature hinders knowledge sharing. EON Musen et al. (1996) is a formalism in which a guideline model is represented as a set of scenarios, action steps, decisions, branches and synchronization nodes connected by a "followed-by" relation. EON allows for associating conditional goals (e.g. if the patient is diabetic, the target blood pressure is 135/80) with guidelines and subguidelines. Encoding of EON guidelines is done with the Protégé-2000 Protégé (2011) knowledge-engineering environment.

#### **3. Static/dynamic knowledge representation framework**

The key idea on which the Static and Dynamic Knowledge Representation Framework (SD-KRF) is based is that elements of the workflow meta-model (i.e. processes, nodes, tasks, events, transitions, actions, decisions) are expressed as ontology classes Oro & Ruffolo (2009); Oro et al. (2009b). In this way workflow elements and domain knowledge can be easily combined in order to organize processes and their elements as an ontology. In more detail, the SD-KRF allows for representing extensional and intensional aspects of both declarative and procedural knowledge by means of:

• *Ontology and Process Schemas*. The former express concepts related to specific domains. Ontology contents can be obtained by importing other existing ontologies and thesauri or by means of direct manual definition. The latter are expressed according to the workflow meta-model illustrated in Section 3.1.

• *Ontology and Process Instances*, both expressed in terms of ontology instances. In particular, ontology class instances can be obtained by importing them from already existing ontologies or by creating them during process execution. Process instances are created exclusively during process execution. Instances are stored in a knowledge base.

• *Reasoning Tasks* that express, for instance, decisions, risks and business rules.

• *Concept Descriptors* that express information extraction rules capable of recognizing and extracting ontology instances contained in unstructured documents written in natural language. By means of concept descriptors, ontology instances can be automatically recognized in documents and used both for enriching the knowledge base and for annotating unstructured documents related to given domains and processes.

The SD-KRF constitutes an innovative approach to semantic business process management in the field of healthcare information systems. The main feature of the presented semantic approach (founded on logic programming) is that, unlike already existing systems and approaches, it enables the representation of process ontologies that can be equipped with expressive business rules. In particular, the proposed framework allows for jointly managing declarative and procedural aspects of domain knowledge and for expressing reasoning tasks that exploit the represented knowledge in order to prevent errors and risks that can take place during process execution. The framework also enables: (i) manual process execution, in which each activity to execute at a given moment is chosen by a human actor on the basis of the current configuration of patient and disease parameters, and (ii) automatic execution by means of the enactment of an already designed process schema (e.g. guideline execution). During process execution, process and ontology instances are acquired and stored in a knowledge base. The system is able to automatically acquire information from unstructured electronic medical documents by exploiting a semantic information extraction approach. Extracted information is stored in the knowledge base as concept instances. Process execution can be monitored by running (over clinical process schemas and instances) reasoning tasks that implement business rules.

#### **3.1 Modelling process**

A significant amount of research has already been done on the specification of mechanisms for process modeling (see Georgakopoulos et al. (1995) for an overview of different proposals). The most widely adopted formalism is the control flow graph, in which a workflow is represented by a labeled directed graph whose nodes correspond to the activities to be performed and whose arcs describe the precedences among them. In the SD-KRF we adopt the graph-oriented workflow meta-model shown in Figures 2.a and 2.b, inspired by the JPDL JPDL (2011) process modeling approach. The adopted meta-model: (i) covers the most important and typical constructs required in workflow specification; (ii) allows for executing processes in the SD-KRF by using the JBPM workflow engine; (iii) allows for using workflow mining techniques grounded on graph-oriented meta-models.
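
As a minimal illustration (using activity names from the running example of Section 4, simplified here and not part of the original chapter), the control flow graph for the fragment "Acceptance followed by Anamnesis" is simply

$$G = (V, A), \qquad V = \{\mathit{Acceptance}, \mathit{Anamnesis}\}, \qquad A = \{\langle \mathit{Acceptance}, \mathit{Anamnesis}\rangle\}$$

where the single arc states that *Acceptance* must precede *Anamnesis*.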

Since our scope is to allow the semantic representation of processes, we first need to formally define the process meta-model as the following 6-tuple:

$$\mathcal{P} = \langle N, Ar, Ev, An, Tk, E \rangle$$

where:

• *N* is a finite set of *nodes* partitioned in the following subsets: task nodes *NT* (which represent activities in which humans or machines perform tasks), subprocess nodes *NSP* (which model activities referring to processes external to the current one), group nodes *NG* (which represent a set of nodes that can be executed without a specific order), custom nodes *NC* (which model activities in which custom methods can be executed and handled automatically), wait nodes *NW* (which represent activities that temporarily stop the execution while they execute methods), join nodes *NJ* and fork nodes *NF* (which are used to combine or split execution paths, respectively), and decision nodes *ND* (which allow for controlling the execution flow on the basis of conditions, variables or choices performed automatically or by human actors).

• *Ar* is a set of *actors*. Actors can be human or automatic. They represent the agents that execute a given task or activity.

• *Ev* is a set of *events*. An event causes the execution of an action that constitutes the answer to the event. An event can be, for example, the throwing of an exception during the execution of a task.

• *An* is a set of *actions*. An action is a special activity that can be performed as an answer to the occurrence of an event.

• *Tk* is a set of *tasks* that represent tasks to execute in task nodes.

• *E* = {⟨*x*, *y*⟩ ∶ *x* ∈ *NFrom* ∧ *y* ∈ *NTo*} is a set of *transitions* in which the following restrictions hold: when *NFrom* ≡ *NFN* ∪ *ND* then *NTo* ≡ *NFN* ∪ *NFCN* ∪ *Nend*, and when *NFrom* ≡ *NFCN* then *NTo* ≡ *NFN* ∪ *ND*. Moreover, for each process there is a transition of the form *estart* = ⟨*Nstart*, *y*⟩ where *y* ∈ *NFN* and one of the form *eend* = ⟨*x*, *Nend*⟩ where *x* ∈ {*NFN* ∪ *ND*}. The subset *Ed* ⊂ *E* where *Ed* = {⟨*x*, *y*⟩ ∶ *x* ∈ *ND* ∧ *y* ∈ *NFN*} is the set of *decisions*. A decision relates a decision node to a flow node and can hold a *decision rule* that is used at run-time to automatically control the execution flow of a process.
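
As a hedged sketch (not from the original chapter) of how these elements can be instantiated, consider only the acceptance activity of the clinical process described in Section 4, with a single human actor and its form-filling task; events, actions and transitions are omitted for brevity:

$$\mathcal{P}_{\mathit{acc}} = \langle \{\mathit{Acceptance}\},\; \{\mathit{physician}\},\; \emptyset,\; \emptyset,\; \{\mathit{acceptance\_form}\},\; \emptyset \rangle$$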


#### **3.2 Modeling static and dynamic knowledge**

Formally the Static/Dynamic Knowledge Representation Framework (SD-KRF) is the 5-tuple having the following form:

$$\mathcal{O} = \langle D, A, C, R, I \rangle$$

Ontology/process schemas are expressed by using elements of *D*, *A*, *C* and *R* in O, which are *finite* and *disjoint* sets of entity names respectively called *data-types*, *attribute-names*, *classes* and *relations*. The set of classes *C* is organized in taxonomies and partitioned into two subsets:

• The set of process classes *CP* = *N* ∪ *Ar* ∪ *An* ∪ *Tk* ∪ *Ev* that represents elements of the workflow meta-model. It is constituted by the union of classes representing nodes, actors, actions, tasks and events.

• The set of ontology classes *CO* that represent concepts related to specific knowledge domains.

The set *R* is a set of ontological relations partitioned into two subsets: the set of transitions *RP* = *E* and the set of relations *RM* used for representing relations between ontology concepts. In the following, the meaning and usage of O are explained by describing the implementation of a running example.

#### **4. An example: representing ontologies and process schemas in the medical domain**

The medical domain offers many ontologies and thesauri describing diseases, drugs, medical examinations, medical treatments, laboratory terms, anatomy, patient administration, and clinical risks. Examples of medical ontologies are UMLS UMLS (2011), LOINC LOINC (2011), ICD10-CM ICD (2011) and SNOMED SNOMED (2011). Many of these ontologies can be freely obtained from the international organizations that maintain them, and can be automatically imported into the SD-KRF or entered by means of direct manual definition.

This section describes a clinical process for the care of breast neoplasm (Figure 2.c). This example from the medical domain, which will be used in the rest of the chapter as a running example, allows for describing the ability of the SD-KRF to represent ontologies that combine process schemas, ontological concepts and relations, and instances of both processes and ontologies.

The example considers practices carried out in the oncological ward of an Italian hospital, hence it is not general but specific to the domain of the considered ward. The clinical process is organized in the following 10 activities:


1. Task node *Acceptance* models patient enrollment. A patient arrives at the ward with an already existing clinical diagnosis of a breast neoplasm. This activity can be performed manually by an oncologist who collects the patient's personal data, or by directly acquiring information from electronic medical records in natural language. The information extraction task is performed by exploiting the semantic information extraction approach described in Section 5. Extracted information is stored as ontology instances Oro et al. (2009a).
2. Group node *Anamnesis* represents a set of anamnesis activities: *general anamnesis*, in which general physiological data (e.g. allergies, intolerances) are collected; *remote pathological anamnesis*, concerning past pathologies; *recent pathological anamnesis*, in which each datum or result derived from examinations concerning the current pathologies is acquired. These activities can be executed manually, without a specific order, or by exploiting semantic extraction rules (descriptors) that enable the recognition of information in unstructured sources like pre-existing EMRs.
3. Task node *Initial clinical evaluation* allows for acquiring the result of an examination of the patient by an oncologist.
4. Decision node *More clinical test requested* represents the decision whether or not to perform additional examinations on the patient.
5. Group node *Other exams* models possible additional clinical tests. If requested, these tests are conducted to find out general or particular conditions of the patient and the disease not fully deducible from the test results already available.
6. Task node *Therapeutic strategy definition* models the selection of a guideline with the related drug prescription. At execution time the physician picks a guideline (selected among the guidelines already available in the knowledge base) that depends upon the actual pathology state as well as other collected patient data.
7. Task node *Informed agreement sign* models the agreement of the patient concerning understanding and acceptance of the consequences (either side effects or benefits) which may derive from the chosen chemotherapy, and privacy agreements.
8. Sub-process *Therapy administration* models a subprocess that constitutes the guideline to execute for caring for the patient.
9. Decision node *Therapy ended* models a decision activity about the effects of the therapy and the possibility to stop or continue care.
10. Task node *Discharging* models the discharging of the patient from the ward and allows for acquiring the final clinical parameter values.

In activities (6) and (8) risk and error conditions can be identified. To each guideline chosen in (6) corresponds a prescription of drugs (chemotherapy). Hence the computation of doses, which may depend on the patient's biomedical parameters such as body weight or skin surface, is required. Cross-checking doses is fundamental here, because if a wrong dose is given to the patient the outcome could be lethal. Furthermore, therapy administration (activity (8)) must contain checks that aim at verifying the type and quantity of the chemotherapeutic drugs to be administered to the cared patient.

Fig. 2. (a) The process meta-model. (b) The node hierarchy. (c) A clinical process for the care of breast neoplasm

#### **4.1 Ontology and process schemas**

This section presents the syntax of the SD-KRF language by example. A class in *C* ∈ O can be thought of as an aggregation of individuals (objects) that have the same set of properties (attributes *A* ∈ O). From a syntactical point of view, a class is a name and an ordered list of attributes identifying the properties of its instances. Each attribute is identified by a name and has a type specified as a data-type or a class.

In the following, the implementation of the workflow meta-model in the SD-KRF language is presented first. In particular, *nodes* in *CP* are implemented by using the class hierarchy (built up by using the isa keyword) shown below.

```
class process(name:string).
class node(name:string, container: process, start_time:integer,
           end_time:integer).
```

```
class start_node() isa{node}.
class end_node() isa{node}.
class common_node () isa{node}.
    class flowControl_node() isa{common_node}.
        class fork() isa{flowControl_node}.
        class join() isa{flowControl_node}.
        class wait_node() isa{flowControl_node}.
    class flow_node() isa{common_node}.
        class task_node(tasks:[task], handler:human_actor)
                        isa{flow_node}.
        class custom_node(handler: automatic_actor, method:string)
                        isa{flow_node}.
        class group_node(nodes:[node]) isa{flow_node}.
        class sub_process_node(sub_proc: process) isa{flow_node}.
    class decision_node(handler:actor) isa{common_node}.
        class automatic_decision_node(handler:automatic_actor)
                        isa{decision_node}.
        class manual_decision_node(task:task, handler:human_actor)
                        isa{decision_node}.
```
Task nodes and manual decision nodes contain *tasks* that are performed by humans. A task (class task(name:string)) collects the values of the activity variables given in input by a human actor. *Actors* of a process (which can be human or automatic) represent the agents that execute a given task. They are represented by means of the following classes in *CP*:

```
class actor(name:string).
    class human_actor() isa {actor}.
    class automatic_actor(uri:string) isa {actor}.
```
During process enactment, by running risk and business rules, *events* may occur. Furthermore, an event can be generated by an exception during the execution of a task. Events, and the related actions to perform in response, are represented in *CP* by the following classes.

```
class event(relativeTo:object, timestamp:integer).
    class node_event(relativeTo:node) isa{event}.
    class task_event(relativeTo:task) isa{event}.
    class process_event(relativeTo:process) isa{event}.
class action(method:string).
```
Relationships among objects are represented by means of relations, which, like classes, are defined by a name and a list of attributes. *Transitions* and *decisions* in *RP*, which relate couples of nodes, are represented by means of the following ontology relations.

```
relation transition(name:string, from:node, to:node).
relation decision(name:string, from:decision_node, to:node).
```
When the user defines a specific process schema, s/he can specialize the original meta-model elements in order to add the new semantic attributes required by the specific process. In the following, some classes representing nodes of the running example depicted in Figure 2 are shown.


```
class acceptance_node(tasks:[acceptance_form],handler:physician)
      isa{task_node}.
class anamnesis_node(nodes:[general_anamnesis_node,
      remotePathological_anamnesis_node, recentPathological_anamnesis_node])
      isa {group_node}.
class recentPathological_anamnesis_node(tasks:[pathology_form],
      handler:physician) isa {task_node}.
class therapeutic_strategy_definition_node(tasks:[therapeutic_strategy_form],
      handler:nurse) isa {task_node}.
class therapy_administration_node(sub_process:therapy_administration_process)
      isa{sub_process_node}.
class more_tests_node(task:more_tests_form) isa{manual_decision_node}.
```

The acceptance and therapeutic\_strategy\_definition process activities are represented as subclasses of the task\_node class, since they represent activities whose tasks consist in the execution of forms filled in by humans. The anamnesis\_node class, to which the Recent Pathological Anamnesis activity belongs, is represented as a subclass of the group\_node class. therapy\_administration\_node and more\_tests\_node are specializations of sub\_process\_node and decision\_node, respectively. The human actors that operate in this clinical process can be physicians, nurses and patients. They are represented by a person hierarchy that exploits multiple inheritance capabilities in order to express that persons are also human actors of the clinical process.

```
class person(fiscalCode:string,name:string,surname:string,sex:sex_type,
             bornDate:date,address:address).
    class patient(hospitalCard:string, weight:float, heigthCm:float)
             isa {person,human_actor}.
    class healthCareEmploy(occupation:string, role:string)
             isa {person,human_actor}.
        class nurse() isa {healthCareEmploy}.
        class physician() isa {healthCareEmploy}.
```
The tasks related to task nodes can be expressed by using the following class schemas. Attribute types can be classes represented in *CM* expressing different medical concepts (e.g. diseases, drugs, body parts). During task execution, the values of the resulting class instances are obtained from the fields filled in forms.

```
class task(name: string).
    class acceptance_form(patient:patient, acc_date:date) isa{task}.
    class pathology_form(disease:disease) isa{task}.
    class chemotherapeutic_strategy_form(strategy:therapeuticStrategy)
          isa{task}.
    class more_tests_form(choice:boolean)isa{task}.
```
In a clinical process, an event can be activated by an exception during the execution of a node or by a reasoning task aimed at checking business rules. A reasoning task checks the parameter values of the running node and of the already acquired node instances and throws an event related to an error. An example of the different kinds of possible errors is shown in the following taxonomy, where the attribute msg of the class view\_msg (an action) is the message to display when the error occurs.


```
class task_event(relativeTo:task) isa{event}.
    class medicalError(msg:string) isa{task_event}.
        class drugPrescriptionError() isa {medicalError}.
class view_msg(msg:string) isa {action}.
```
Class schemas in *CM* expressing knowledge concerning anatomy, the breast neoplasm disease and the related therapies and drugs have been obtained (imported) from the Medical Subject Headings (MeSH) Tree Structures, the International Classification of Diseases (ICD10-CM) and the Anatomical Therapeutic Chemical (ATC/DDD) classification.

```
class anatomy(name:string).
    class bodyRegion() isa {anatomy}.
class disease(descr:string).
  class neoplasm() isa {disease}.
    class malignant_neoplasm() isa {neoplasm}.
       class primarySited_neoplasm(site:bodyRegion,zone:string)
             isa {malignant_neoplasm}.
          class breast_primarySited_neoplasm() isa {primarySited_neoplasm}.
class drug(name:string, ddd:float, unit:unitOfMeasure,admRoute:[string],
           notes:string).
    class antineoplasticAndImmunomodulatingAgent() isa {drug}.
        class endocrineTherapy() isa {antineoplasticAndImmunomodulatingAgent}.
            class hormoneAntagonistsAndRelatedAgents()isa {endocrineTherapy}.
                class enzymeInhibitors()
                      isa {hormoneAntagonistsAndRelatedAgents}.
            class hormoneAndRelatedAgents()isa {endocrineTherapy}.
                class estrogens() isa {hormoneAndRelatedAgents}.
class code(c:string).
    class icd10Code(chapter:integer, block:string,category:string,
          subCat:string) isa {code}.
    class mesh08Code(category:string, subCat:string) isa {code}.
class therapy(name:string, dru:drug, dose:float).
class therapeuticStrategy(patient:patient, therapy:therapy,startDate:date,
                          nDay:integer).
```
The previous classes are a fragment of a medical ontology concerning (breast) neoplasm care and are used to model the clinical process shown in Section 4. The class primarySited\_neoplasm shows the ability to specify user-defined classes as attribute types (i.e. site:bodyRegion). The class drug has a list-type attribute admRoute:[string] representing the possible routes of administration of a drug (for example inhalation, nasal, oral, parenteral). Relation schemas expressing medical knowledge can be declared by using the following syntax:

```
relation suffer (patient:patient, disease:disease).
relation relatedDrug (dis:disease, dru:drug).
relation sideEffect (dru:drug, effect:string).
relation classifiedAs (dis:disease, c:code).
```
The relation suffer asserts the diseases suffered by a patient. The relations relatedDrug and sideEffect associate, respectively, drugs to diseases and side effects to drugs. Moreover, the relation classifiedAs enables users to query the ontologies by using the codes defined in the original medical ontologies.

#### **4.2 Ontology and process instances**


Clinical process instances are expressed by ontology instances and are created exclusively during process execution. Class instances (objects) are defined by their *oid* (which starts with #) and a list of attributes. The instances obtained by executing the running example are shown in the following.

```
#1:neoplasm_process(name:"Breast Neoplasm").
#2:therapy_administration_process(name:"Therapy Administration").
#1_1:acceptance_node(name:"Acceptance", container:#1, start_time:6580,
    end_time:16580, tasks:[#1_1_1], handler:#27).
#1_2:anamnesis_node(name:"Anamnesis", container:#1, start_time:16570,
    end_time:26580, nodes:[#1_2_1, #1_2_2, #1_2_3]).
#1_2_3:recentPathological_anamnesis_node(name:"Recent Pathological Anamnesis",
       container:#1, start_time:19580, end_time:26570,tasks:[#1_2_3_1],
       handler:#27).
...
```
As described in Section 4, the instance #1\_2 of anamnesis\_node is composed of a set of anamnesis activities represented by means of their *id*s. The object #1\_2\_3 belongs to #1\_2. The tasks executed in the nodes #1\_1 and #1\_2\_3 are stored as values of their attributes. When execution arrives at a task node or at a manual decision node, task instances are created and the user input is stored as values of the task attributes. Some tasks related to task nodes are shown in the following.

```
#1_1_1:acceptance_form(name:"Acceptance", patient:#21, acc_date:#data_089).
#1_2_3_1:pathology_form(name:"Recent Pathology", disease:#neoB_01).
```
For example, the instance #1\_1\_1 of the class acceptance\_form is created by an oncologist who fills in a form. It contains an instance of the patient class.

Transition and decision tuples, created during the process execution, are shown in the following. The decision tuple in the example below is obtained as a manual choice of an oncologist, but instances of decisions could also be automatically generated by means of reasoning tasks.

```
transition(name:"Acceptance-Anamnesis",from:#1_0, to:#1_1).
decision(name:"More Clinical Tests requested - No",from:#1_4, to:#1_6).
```
Instances of the classes bodyRegion, breast\_primarySited\_neoplasm, and of the subclasses of drug and code, can be obtained by importing them from already existing medical ontologies and can be declared as follows:

```
#A01.236: bodyRegion(name:"breast").
#neoB_01: breast_primarySited_neoplasm(descr:"Malignant neoplasm of breast",
          site:#A01.236, zone:"Nipple and areola").
#L02BG03: enzymeInhibitors(name:"Anastrozole", ddd:1, unit:mg,
          admRoute:["oral"], notes:"").
#L02AA04: estrogens(name:"Fosfestrol", ddd:0.25, unit:g,
          admRoute:["oral","parenteral"], notes:"").
#icd10_C50.0: icd10Code(c:"C50.0", chapter:2, block:"C", category:"50",
          subCat:"0").
#mesh08_C04.588.180: mesh08Code(c:"C04.588.180",category:"C", subCat:"04").
```

The object having *id* #neoB\_01 is an instance of the breast\_primarySited\_neoplasm class. Its attributes descr and zone (whose type is string) have the string values "Malignant neoplasm of breast" and "Nipple and areola", whereas the attribute site has the value #A01.236, which is an *id* representing an instance of the class bodyRegion. Tuples expressing medical knowledge can be declared by using the following syntax:

```
suffer (patient:#21, disease:#neoB_01).
relatedDrug (dis:#C50.9, dru:#L02BG03).
sideEffect (dru:#L02BG03, effect:"Chest pain").
sideEffect (dru:#L02BG03, effect:"Shortness of breath").
classifiedAs (dis:#neoB_01, c:#icd10_C50.0).
classifiedAs (dis:#neoB_01, c:#mesh08_C04.588.180).
```
The tuple of the relation suffer asserts that the patient #21 suffers from the disease #neoB\_01. The same disease is classified in ICD10-CM with the identifier code #icd10\_C50.0, and is stored in the MeSH tree structure with the identifier code #mesh08\_C04.588.180. By means of the relation classifiedAs a user is enabled to query concepts in the SD-KRF ontology by using one of the identifiers of the original medical ontologies.

#### **4.3 Reasoning over schemas and instances**

Integrity constraints, business rules and complex inference rules can be expressed over schemas and instances by means of *axioms* and *reasoning tasks*, respectively. For example, the following axiom prevents the prescription of a drug to a patient who has an allergy to a particular constituent of the drug.

```
::-therapyStrategy(patient:P, therapy:T, drug:D),
   hasActivePrinciple(drug:D,constituent:C),
   allergy(patient:P,actPrin:C).
```
Axioms can also be used to specify constraints on the behavior of transitions. For example, the axiom "::-P:process(), not start\_node(container:P)." expresses that a start\_node must exist for each process. Constraints are also used for expressing that a transition links nodes belonging to the same process and corresponds to an actual edge of the process model, as shown in the following:

```
::-transition(from:N1,to:N2), N1:node(container:P1),
   N2:node(container:P2), P1!=P2.
::-transition(from:N1,to:N2), N1:node(start_time:ST1),
   N2:node(start_time:ST2), ST1>=ST2.
::-P:neoplasm_process(), transition(from:N1,to:N2),
   N1:acceptance_node(container:P),
   not N2:anamnesis_node(container:P).
...
```
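
By analogy with the start\_node axiom quoted above, a symmetric constraint requiring that every process also reaches an end node could be written as follows; this is a hedged sketch in the same axiom syntax, not part of the original set of constraints:

```
::-P:process(), not end_node(container:P).
```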
A reasoning task can be used for expressing a business rule. The following reasoning task, for instance, throws a medical error event when the prescribed dose exceeds the recommended dose based on the individual characteristics (i.e. age and weight) of the patient concerned. Such a check is useful when a therapeutic\_strategy\_form is created while the therapeutic\_strategy\_definition\_node is active.

```
ID:drugPrescription_medicalError(relativeTo:TASK,timestamp:TIME,msg:MSG):-
    TASK:chemotherapeutic_strategy_form(strategy:STR),
    STR:therapeuticStrategy(patient:P, therapy:T),
    P:patient(bornDate:DATE,weight:W), @age(DATE,AGE),
    T:therapy(dru:DRUG,dose:DOSE),
    recommendedDose(drug:DRUG, dose:RD, minAge:MA, MinWeight:MW),
    AGE<MA, W<MW, DOSE>RD,
    MSG:="Prescribed dose " + DOSE + "exceed recommend dose " + RD, @newID(ID),
    @now(TIME).
```
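
The rule above assumes a relation recommendedDose associating a drug with its recommended dose and with the minimum age and weight to which that dose applies; this relation is not declared in the excerpts above. A hedged sketch of how it might be declared and populated, following the relation syntax of Section 4.1 (the numeric values are purely illustrative):

```
relation recommendedDose(drug:drug, dose:float, minAge:integer, MinWeight:float).
recommendedDose(drug:#L02BG03, dose:1, minAge:18, MinWeight:40).
```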
The generated prescription error event must be properly handled in the process, for example by displaying an error message to the physician by means of a GUI.

```
ID:view_msg(method:"exception.jar", msg:MSG):-
   X:drugPrescription_medicalError(relativeTo:TASK, timestamp:TIME, msg:MSG),
   @newID(ID).
```
*Queries* can also be used for exploring clinical process ontologies in a semantic fashion. For instance, malNeoplasm\_f\_patient(patient:P)? returns every female patient suffering from any malignant neoplasm (e.g. the ids P=#21 and P=#34 are given as answers), where malNeoplasm\_f\_patient(patient:P) is defined as:

```
malNeoplasm_f_patient(patient:P):- P:patient(sex:#F),suffer(patient:P,disease:D),
                                   D:malignant_neoplasm().
```
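Query rules compose freely with the relations introduced above, so related pieces of knowledge can be joined in a single rule. The sketch below is an illustrative addition in the same syntax: the rule name patientSideEffects is hypothetical, and the join assumes, purely for the sake of the example, that relatedDrug is keyed by the same classification identifiers that appear in classifiedAs.

```
% patientSideEffects is a hypothetical rule; the join through C assumes that
% relatedDrug uses the classification identifiers appearing in classifiedAs
patientSideEffects(patient:P, effect:E) :- suffer(pat:P, dis:D),
    classifiedAs(dis:D, c:C), relatedDrug(dis:C, dru:DR),
    sideEffect(dru:DR, effect:E).
% example query for patient #21
patientSideEffects(patient:#21, effect:E)?
```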
#### **5. Semantic information extraction**

In the Static/Dynamic Knowledge Representation Framework, classes and instances can be enriched with *concept descriptors*, i.e. rules that allow concepts contained in unstructured documents written in natural language to be recognized and extracted. Concepts extracted by descriptors are stored in the knowledge base as instances of the related classes in the ontology.

Considering the example of a clinical process described in Section 4, semantic information extraction tasks can be applied to Electronic Medical Records (EMRs), results of examinations, medical reports, etc. coming from different hospital wards, in order to populate the clinical process instance related to a given patient.

In the following, a set of semantic extraction rules (i.e. *descriptors*) that allows the extraction of patient name, surname, age and disease is shown. The descriptors exploit concepts and relationships referring to the disease, its diagnosis, and its treatment in terms of surgical operations and chemotherapies with the associated side effects. Concepts related to persons (patients), body parts and risk causes are also represented. All the concepts related to cancer come from the ICD-10-CM disease classification system, whereas the chemotherapy drug taxonomy is inspired by the Anatomical Therapeutic Chemical (ATC) classification system. The extracted information is exploited to construct, for each treated patient, an instance of the lung cancer clinical process.

```
class anatomy ().
    class bodyRegion (bp:string) isa {anatomy}.
        class organ isa {body_part}.
            lung: organ("Lung").
            <lung> -> <X:token(), matches(X,"[Ll]ung")>.
        ...
    ...
class disease (name:string).
    tumor: disease("Tumor").
    <tumor> -> <X:token(), matches(X,"[Tt]umor")>.
    cancer: disease("Cancer").
    <cancer> -> <X:token(), matches(X,"[Cc]ancer")>.
    ...
relation synonym (d1:disease,d2:disease)
    synonym(cancer,tumor).
    ...
class body_part_disease () isa {disease}.
    lung_cancer: body_part_disease("Lung cancer").
    <lung_cancer> -> <diagnosis_section> CONTAIN <lung> &
                     <X:disease(),synonym(cancer,X)>.
    ...
collection class patient_data (){}
    collection class patient_name (name:string){}
        <patient_name(Y)> -> <T:token(),defBy(T, "name:")>
                             <X:token()> {Y := X;} SEPBY <X:space()>.
    collection class patient_surname (surname:string){}
            <patient_surname(Y)> -> <X:hiStr(),matches(X,"sur(?:name)?:")>
                                     <X:token()> {Y:=X;} SEPBY <X:space()>.
        collection class patient_age (age:integer){}
            <patient_age(Y)> -> <X:token(),matches(X,"age:")>
                                 <Z:token()>{Y := $str2int(Z);}
                                 SEPBY <X:space()>.
    ...
collection class patient_data (name:string, surname:string,
                               age:integer, diagnosis:body_part_disease){}
     <patient_data(X,Y,Z,lung_cancer)> -> <hospitalization_section>
                                           CONTAIN
                                           <P:patient_name(X1)>{X:=X1} &
                                           <P:patient_surname(Y1)>{Y:=Y1} &
                                           <P:patient_age(Z1)>{Z:=Z1} &
                                           <lung_cancer>.
...
```
The classes diagnosis\_section and hospitalization\_section, used in the above descriptors, represent text paragraphs containing personal data and diagnosis data; they are recognized by dedicated descriptors that are not shown for lack of space. The extraction mechanism can be considered a WOXM (Write Once eXtract Many) approach: the same descriptors can be used to extract metadata related to patients affected by lung\_cancer from unstructured EMRs that have different layouts. Moreover, descriptors are obtained by automatic writing methods (as happens, for example, for the cancer and tumor concepts) or by visual composition (as happens for patient\_data).

Metadata extracted by using the descriptors are stored as class instances in a knowledge base. Using the descriptors shown above, the extraction process generates the following patient\_data class instance for an EMR:

```
"#1":patient_data("Mario", "Rossi", "70", lung_cancer).
```
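Once stored, extracted instances can be queried like any other instance in the knowledge base. The query below is a minimal sketch in the chapter's query syntax; the named-attribute form used in the query is an assumption of ours. It would retrieve all patient\_data instances whose extracted diagnosis is lung\_cancer, so that the corresponding clinical process instances can be populated.

```
% illustrative query; the named-attribute form used here is an assumption
P:patient_data(name:N, surname:S, age:A, diagnosis:lung_cancer)?
```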
#### **5.1 Implementation issues**

A prototype of the SD-KRF has been implemented by combining the JBPM engine JPDL (2011) with the XONTO system Oro et al. (2009a). It is designed to follow a clinical process life-cycle model based on three phases: process and ontology *design and implementation*, *execution and monitoring*, and *analysis*.

The first module exploits the XONTO system. It provides functionalities for importing and/or representing ontologies and processes, using direct "on-screen" drawing and manual specification facilities. Moreover, semantic extraction rules (descriptors), which enable the recognition and extraction of information from unstructured sources, can be modeled by exploiting the XONTO approach. A set of business/risk/error rules can be described in terms of ontology constraints and/or reasoning tasks. Acquired and/or represented schemas and instances are stored in a knowledge base and can be queried by means of querying and meta-querying capabilities.

The *Execution & Monitoring* module is mainly constituted by the JBPM engine, which interacts with the XONTO system. Process execution is performed in two ways: (i) by a *workflow enactment* strategy, in which a process schema is imported into JBPM and automatically executed, involving actors that can be humans or machines (e.g. legacy systems supplying results of medical examinations); (ii) by a dynamic *workflow composition* strategy, in which the nodes to execute are selected step by step by choosing the most appropriate one at a given moment. Nodes are chosen by using the semantic querying capabilities of the XONTO system and are executed by JBPM (a sketch of such a node-selection query is given after this description). Queries allow specifying the data and any significant information available at that particular moment of the execution. The execution generates process instances that are stored in the knowledge base. Reasoning, querying and meta-querying over schemas and available instances are possible. In this way, process execution can be monitored by running the business rules that equip process schemas; events generated by the rules alert process actors, who can react to prevent risks and errors.

The *Analytics* module aims at allowing the analysis of clinical process instances after their acquisition. The execution of clinical processes makes available process instances that are also ontology instances, so that a large amount of semantically enriched data becomes available for retrieval, querying and analysis. Analyses are performed by creating reports obtained by semantic queries over the acquired process instances.
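As an illustration of how such node selection could look, the following is a rough sketch in the chapter's rule and query syntax; the rule names hasOutgoing and nextCandidateNode are hypothetical and are introduced here only to make the idea concrete. It asks for the nodes of a neoplasm process that are not yet the source of any transition, i.e. candidates for the next step.

```
% hasOutgoing and nextCandidateNode are hypothetical rules used only for illustration
hasOutgoing(node:N) :- transition(from:N, to:N2).
nextCandidateNode(process:P, node:N) :- P:neoplasm_process(),
    N:node(container:P), not hasOutgoing(node:N).
% example query issued during dynamic workflow composition
nextCandidateNode(process:P, node:N)?
```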

The SD-KRF constitutes an innovative approach to semantic business process management in the field of healthcare information systems. The main feature of the presented semantic approach, which is founded on logic programming, is that, unlike existing systems and approaches, it enables the representation of process ontologies that can be equipped with expressive business rules. In particular, the proposed framework allows declarative and procedural aspects of domain knowledge to be managed jointly, and reasoning tasks to be expressed that exploit the represented knowledge in order to prevent errors and risks that can occur during process execution. The framework also enables: (i) manual process execution, in which each activity to execute at a given moment is chosen by a human actor on the basis of the current configuration of patient and disease parameters, and (ii) automatic execution by means of the enactment of an already designed process schema (e.g. guideline execution). During process execution, process and ontology instances are acquired and stored in a knowledge base. The system is able to automatically acquire information from unstructured electronic medical documents by exploiting a semantic information extraction approach. Extracted information is stored in the knowledge base as concept instances and is also used to fill XML documents, enabling interoperability. Process execution can be monitored by running, over clinical process schemas and instances, reasoning tasks that implement business rules. The framework could be used in many application fields by modeling appropriate domain ontologies and processes.

#### **6. References**

Awad, A., Decker, G. & Weske, M. (2008). Efficient compliance checking using bpmn-q and temporal logic, *BPM*, pp. 326–341.

Boley, H., Tabet, S. & Wagner, G. (2001). Design rationale for ruleml: A markup language for semantic web rules, *SWWS*, pp. 381–401.

CCAM (2011). Classification commune des actes médicaux. URL: *http://www.ccam.sante.fr/*

COIN (2011). Enterprise collaboration and interoperability. URL: *http://www.coin-ip.eu/*

CORDIS (2011). Community research and development information service. URL: *http://cordis.europa.eu/*

CPT (2011). Current procedural terminology (cpt): a registered trademark of the american medical association (ama). URL: *http://www.ama-assn.org/ama/pub/category/3113.html*

Francescomarino, C. D. & Tonella, P. (2008). Crosscutting concern documentation by visual query of business processes, *Business Process Management Workshops*, pp. 18–31.

Georgakopoulos, D., Hornick, M. & Sheth, A. (1995). An overview of workflow management: from process modeling to workflow automation infrastructure, *Distrib. Parallel Databases* 3(2): 119–153.

GLIF (2011). The guideline interchange format. URL: *http://www.openclinical.org/gmm\_glif.html*

Gruber, T. R. (1995). Toward principles for the design of ontologies used for knowledge sharing?, *Int. J. Hum.-Comput. Stud.* 43(5-6): 907–928.

Haller, A., Gaaloul, W. & Marmolowski, M. (2008). Towards an xpdl compliant process ontology, *SERVICES I*, pp. 83–86.

Haller, A., Oren, E. & Kotinurmi, P. (2006). m3po: An ontology to relate choreographies to workflow models, *IEEE SCC*, pp. 19–27.

Hepp, M., Leymann, F., Domingue, J., Wahler, A. & Fensel, D. (2005). Semantic business process management: A vision towards using semantic web services for business process management, *ICEBE*, pp. 535–540.

Hepp, M. & Roman, D. (2007). An ontology framework for semantic business process management, *Wirtschaftsinformatik (1)*, pp. 423–440.

HL7 (2011). Health Level Seven. URL: *http://www.hl7.org/*

Hornung, T., Koschmider, A. & Oberweis, A. (2007). Rule-based autocompletion of business process models, *CAiSE Forum*.

Hripcsak, G., Ludemann, P., Pryor, T. A., Wigertz, O. B. & Clayton, P. D. (1994). Rationale for the arden syntax, *Comput. Biomed. Res.* 27: 291–324. URL: *http://dl.acm.org/citation.cfm?id=188778.188784*

ICD (2011). International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM). URL: *http://www.cdc.gov/nchs/about/otheract/icd9/icd10cm.htm*

JPDL (2011). jbpm process definition language. URL: *http://www.jboss.org/jbossjbpm/jpdl/*

LOINC (2011). Logical observation identifiers names and codes. URL: *http://loinc.org/*

Markovic, I. (2008). Advanced querying and reasoning on business process models, *BIS*, pp. 189–200.

Medeiros, A. K. A. D. & Aalst, W. M. (2009). Process mining towards semantics, *Advances in Web Semantics I*, number 46, pp. 35–80.

MESH (2011). Medical subject headings. URL: *http://www.nlm.nih.gov/mesh/*

Missikoff, M., Proietti, M. & Smith, F. (2011). Querying semantically enriched business processes, *DEXA (2)*, pp. 294–302.

Montali, M., Torroni, P., Alberti, M., Chesani, F., Gavanelli, M., Lamma, E. & Mello, P. (2008). Verification from declarative specifications using logic programming, *ICLP*, pp. 440–454.

Musen, M. A., Tu, S. W., Das, A. K. & Shahar, Y. (1996). Eon: A component-based approach to automation of protocol-directed therapy, *Journal of the American Medical Informatics Association* 3(6): 367–388.

NCI-EVS (2011). NCI Enterprise Vocabulary Services (EVS). URL: *http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore\_overview/vocabulary*

Newell, A. (1980). The knowledge level (presidential address), *AI Magazine* 2(2): 1–20.

OMG (2011). Unified modeling language. URL: *http://www.uml.org/*

OPCS-4 (2011). The office of population censuses and surveys classification of surgical operations and procedures, 4th revision. URL: *http://www.openclinical.org/medTermOPCS4.html*

Oro, E. & Ruffolo, M. (2009). Towards a semantic system for managing clinical processes, *ICEIS (2)*, pp. 180–187.

Oro, E., Ruffolo, M. & Saccà, D. (2009a). Ontology-based information extraction from pdf documents with xonto, *International Journal on Artificial Intelligence Tools (IJAIT)* 18(5): 673–695.

Oro, E., Ruffolo, M. & Saccà, D. (2009b). A semantic clinical knowledge representation framework for effective health care risk management, *BIS*, pp. 25–36.

Peleg, M. (2009). Executable knowledge, *Encyclopedia of Database Systems*, pp. 1073–1079.

Peleg, M., Ogunyemi, O., Tu, S., Boxwala, A. A., Zeng, Q., Greenes, R. A. & Shortliffe, E. H. (2001). Using features of arden syntax with object-oriented medical data models for guideline modeling, *in American Medical Informatics Association (AMIA) Annual Symposium, 2001*, pp. 523–527.

PLUG-IT (2011). It-socket knowledge portal - plug-it initiatives. URL: *http://plug-it.org/*

Protégé (2011). Ontology Editor and Knowledge Acquisition System. URL: *http://protege.stanford.edu*

Pryor, T. A. & Hripcsak, G. (1993). The arden syntax for medical logic modules, *Journal of Clinical Monitoring and Computing* 10(4): 215–224.

Roman, D. & Kifer, M. (2007). Reasoning about the behavior of semantic web services with concurrent transaction logic, *VLDB*, pp. 627–638.

Sackett, D. L., Rosenberg, W. M. C., Gray, M. J. A., Haynes, B. R. & Richardson, S. W. (1996). Evidence based medicine: what it is and what it isn't, *BMJ* 312(7023): 71–72. URL: *http://bmj.bmjjournals.com/cgi/content/full/312/7023/71*

WHO (2011). World health organization. URL: *http://www.who.int/classifications/en/*

Wohed, P., van der Aalst, W. M. P., Dumas, M., ter Hofstede, A. H. M. & Russell, N. (2006). On the suitability of bpmn for business process modelling, *Business Process Management*, pp. 161–176.

XPDL (2011). Xpdl2.1 complete specification. URL: *http://www.wfmc.org/xpdl.html*

**10** 

**Automatic Concept Extraction in Semantic Summarization Process** 

Antonella Carbonaro 

*Computer Science Department, University of Bologna, Mura Anteo Zamboni, Italy* 

#### **1. Introduction** 

The Semantic Web offers a generic infrastructure for the interchange, integration and creative reuse of structured data, which can help to cross some of the boundaries that Web 2.0 is facing. Currently, Web 2.0 offers poor query possibilities apart from searching by keywords or tags. There has been a great deal of interest in the development of semantic-based systems to facilitate knowledge representation and extraction and content integration [1], [2]. A semantic-based approach to retrieving relevant material can be useful for addressing issues such as determining the type or the quality of the information suggested by a personalized environment. In this context, standard keyword search has very limited effectiveness. For example, it cannot filter for the type of information, the level of information or the quality of information.

Potentially, one of the biggest application areas of content-based exploration might be personalized searching frameworks (e.g., [3], [4]). Whereas search engines nowadays provide largely anonymous information, new frameworks might highlight or recommend web pages related to key concepts. We can consider semantic information representation as an important step towards wide and efficient manipulation and retrieval of information [5], [6], [7]. In the digital library community a flat list of attribute/value pairs is often assumed to be available. In the Semantic Web community, annotations are often assumed to be an instance of an ontology. Through ontologies the system will express key entities and relationships describing resources in a formal, machine-processable representation. An ontology-based knowledge representation could be used for content analysis and object recognition, for reasoning processes, and for enabling user-friendly and intelligent multimedia content search and retrieval.

Text summarization has been an interesting and active research area since the 1960s. The underlying definition and assumption are that a small portion or several keywords of the original long document can represent the whole informatively and/or indicatively. Reading or processing this shorter version of the document saves time and other resources [8]. This property is especially valuable at present due to the vast availability of information. A concept-based approach to representing dynamic and unstructured information can be useful for addressing issues such as determining the key concepts and summarizing the information exchanged within a personalized environment.



