Data Mining for Prediction

#### **Chapter 4**

## Revealing Interesting If-Then Rules

*Abraham Meidan*

#### **Abstract**

One of the challenges of data mining is revealing the *interesting* if-then rules in the mined dataset. We present two methods for the automated discovery of unexpected, and, thereby, interesting rules within the set of all the if-then rules previously revealed in the dataset. The first method calculates, for each rule, the probability that the rule exists accidentally. The lower this probability, the more unexpected the rule is. The second method calculates the conditional probability of the event described by the rule, given the relevant more basic rules. Once again, the lower this conditional probability, the more unexpected the rule is. These two methods are independent and can be combined.

**Keywords:** data mining, if-then rules, interesting rules, unexpected rules, probability

#### **1. Introduction**

Several data mining algorithms reveal if-then rules in the mined data [1]. However, in many cases, the size of the analyzed data is enormous [2], and as a result, the number of rules is so large that the user cannot practically review all of them manually. Consequently, the user needs a method for accessing the most *interesting* rules among the discovered ones.

What makes a rule interesting? Several papers in the data mining literature have addressed this issue, and two different interpretations of the concept of "interesting rules" can be found in it. Some papers, such as [3], use "interesting rules" in the sense of "rules that are best or optimal." Other papers use "interesting rules" to mean rules that satisfy the user's curiosity. In this paper, we refer to this second meaning.

One can filter the non-interesting rules using the support and confidence levels [2, 4, 5]. The support level of an if-then rule denotes the number of records where the rule holds (both the rule's conditions and conclusion are true) relative to the total number of records in the dataset. The user may set a threshold on the support level, implying that rules having a support level below this threshold are not interesting. The confidence level denotes the frequency of cases where both the rule's conditions and conclusion hold relative to the cases where the rule's conditions hold, with or without the conclusion. In other words, the confidence level denotes the rule's probability: for example, if the value in field A is a, and the value in field B is b, then there is an 80% probability that the value in field C is c. Obviously, a rule is interesting only if its confidence level is significantly higher than the a priori frequency of the conclusion in the dataset. Once again, the user may set a threshold implying that rules having a confidence level below it are not interesting. However, this method alone cannot guarantee that all the interesting rules are revealed: when the support and/or confidence thresholds are set too high, interesting rules may be missed, and when they are set too low, non-interesting rules may be revealed.
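To make these definitions concrete, here is a minimal Python sketch of computing the support and confidence of a candidate rule over a tabular dataset; the field names and values (`A`, `B`, `C`, `a`, `b`, `c`) are hypothetical, not taken from any particular dataset.

```python
import pandas as pd

def support_confidence(df, conditions, conclusion):
    """Support: fraction of records where conditions AND conclusion hold;
    confidence: fraction of condition-satisfying records where the
    conclusion also holds (the rule's probability)."""
    cond = pd.Series(True, index=df.index)
    for field, value in conditions.items():
        cond &= df[field] == value
    concl_field, concl_value = conclusion
    both = cond & (df[concl_field] == concl_value)
    support = both.sum() / len(df)
    confidence = both.sum() / cond.sum() if cond.sum() else 0.0
    return support, confidence

# Hypothetical rule: "If A is a and B is b, then C is c"
df = pd.DataFrame({"A": ["a", "a", "x"], "B": ["b", "b", "b"], "C": ["c", "y", "c"]})
print(support_confidence(df, {"A": "a", "B": "b"}, ("C", "c")))  # (0.33..., 0.5)
```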

Following this line, it was suggested that other means should be used on top of the support and confidence levels to reveal the interesting rules. One suggestion was to measure how unexpected each rule is. The idea that unexpected phenomena are interesting was already suggested in the ancient world. In the twentieth century, it was discussed in the philosophy of science by Agassi [6]. In the field of data mining, it was presented by Silberschatz and Tuzhilin [7, 8]. According to this view, users classify a rule as interesting when it is inconsistent with their expectations. In other words, rules are interesting when they are unexpected.

How can the level of a rule's unexpectedness be measured? Liu and Hsu [9] and Liu, Hsu, and Chen [10] suggest measuring the extent of a rule's unexpectedness by comparing the rule with a predefined set of expectations. At a preliminary stage, the users enter their expectations, and based on this set, the program calculates an evaluation of unexpectedness for each rule. Such an approach can be used when the user can formulate all the expected rules, but doing so is not easy. Moreover, since some of the discovered rules might result from noise, and the user obviously does not expect these rules, they will be presented as unexpected and, thereby, interesting. Sahar [11] suggests a process where the system selects a few rules, the user marks the rules that are *not* interesting, and the program then automatically eliminates the other associated rules. Sahar reports that "by little over five iterations, this process often reduces the list of rules to half its original size." However, in many cases, such a reduction is not sufficient. When the list contains 10,000 rules, reducing it to 5000 might not satisfy the user, who might instead prefer sorting the rules by their level of interestingness and reviewing just the 10 or 100 most interesting ones. Wesley Romão, Alex A. Freitas, and Itana M. de S. Gimenes [12] suggest using a genetic algorithm for revealing surprising if-then rules based on user-defined general impressions (subjective knowledge). A similar approach was proposed by Jyoti Vashishtha, Dharminder Kumar, Saroj Ratnoo, and Kapila Kundu [13].

In this paper, we suggest two additional methods for revealing the unexpected and, thereby, interesting rules. Contrary to the previous methods, our methods reveal the unexpected rules automatically, that is, unassisted by the user. Moreover, these methods *sort* the rules by their level of unexpectedness and, thereby, interestingness.

The first method for measuring the degree of unexpectedness rests on the assumption that when the users have no preliminary knowledge about the relations among the fields and their values in the data, their "natural" expectation is that the relations are accidental. By calculating the probability that a given rule is accidental, we can therefore measure the extent to which the rule is unexpected. The logic underlying this idea is similar to that underlying the calculation of the significance level in classical statistical tests, such as the t-test or F-test. These tests refer to the concept of α, which denotes the probability that the phenomenon is accidental. The lower this probability, the more significant the phenomenon is. Following the same line of thought, we calculate the probability that a given rule is accidental. The lower this probability, the more unexpected and, thereby, more interesting the rule is. In other words, we assume that the user expects accidental relations between the fields and their values. Each rule refutes this expectation. The level of the refutation can be measured by calculating the probability that the rule is accidental. The lower this probability, the more significant and interesting the rule is.

The second method for measuring the degree of unexpectedness rests on the assumption that the set of one-condition rules can be considered the user's basic expectations. Given the one-condition rules, the conditional probability of the rules having more than one condition can then be calculated. The lower this conditional probability, the more unexpected the rule is. For example, suppose that the following three rules were discovered in the data: (1) If A then R, (2) If B then R, and (3) If A and B then NOT R. Rule (3) can be considered unexpected relative to rules (1) and (2): assuming that the user already knows rules (1) and (2), being one-condition rules, rule (3) turns out to be unexpected and, thereby, interesting. This approach can be elaborated by considering not just the one-condition rules as the basic expectations. Rather, the conditional probability of any rule having more than one condition can be calculated given any rules whose conditions are a subset of the conditions of the rule under discussion. Thus, a three-condition rule can be compared against the relevant one-condition and two-condition rules.

A similar approach was suggested by Suzuki [14]. He analyzed rule pairs such as "If A then B" and "If A and C then *not* B." However, this approach does not calculate the conditional probability of the unexpected rule given the one-condition rules, and it is limited to pairs of if-then rules whose conclusions are inconsistent. Our approach covers the abovementioned rule pairs as a special case.

In what follows, we will present the algorithms for calculating the level of the unexpectedness of the rules according to these two methods.

#### **2. If-then rules**

There are several algorithms for revealing if-then rules in the data [2, 3]. For the sake of simplicity and without loss of generality, we will limit the discussion to association rules that are related to one Boolean field in the "then" part (the dependent variable).

Consider a dataset containing *n* + 1 fields, among which one field is selected as the dependent variable, *y*. The remaining *n* fields *x*<sub>1</sub>, … , *x<sub>n</sub>* are considered as input data used for explaining the selected field *y*. Consider the most frequent case, when field *y* is Boolean. Without loss of generality, we assume *y* ∈ {0, 1}.

Let *I<sub>i</sub>* = {*a*<sub>*i*1</sub>, … , *a*<sub>*im<sub>i</sub>*</sub>} be the set of codes of values of variable (field) *x<sub>i</sub>*. For example, if field *x<sub>i</sub>* is quantitative, *a<sub>ij</sub>* denotes the code of the interval of change of values of variable *x<sub>i</sub>*. A single condition (1-condition) is a condition of the following type: *x<sub>i</sub>* = *a<sub>ij</sub>*, *j* ∈ {1, … , *m<sub>i</sub>*}. A composite condition is a conjunction of *q* single conditions, *q* = 2, … , *n*. We will call a composite condition consisting of *q* single conditions a *q*-condition. Thus, a *q*-condition is a condition of the type:

$$x\_{i\_1} = a\_{i\_1 j\_1} \wedge \ldots \wedge x\_{i\_q} = a\_{i\_q j\_q}.$$

An if-then rule is the following statement:

$$\text{If (the } q\text{-condition) then } y \text{ is } y\_1. \tag{1}$$

Rule's probability (confidence level) = *p*.

Rule's support level = *s*,

where *y*<sub>1</sub> belongs to the set of values of field *y*, that is, *y*<sub>1</sub> = 1 or *y*<sub>1</sub> = 0.

The number of records, *s*, at which both the rule's condition (the *q*-condition) and the rule's conclusion (*y is y*<sub>1</sub>) are fulfilled is called the *rule's support level*. (Contrary to other definitions of "support level," we refer to the *number* of records satisfying the rule rather than their *percentage* out of the total number of records in the dataset.) The *rule's probability*, *p*, is the ratio of the rule's support to the number of records satisfying the rule's condition.

Let *p<sub>a</sub>* be the a priori probability that *y* = 1: *p<sub>a</sub>* = *M*/*N*, where *N* is the total number of records in the mined dataset, and *M* is the number of records at which *y* = 1. Rules of type (1) can be interesting only if their probability significantly deviates from *p<sub>a</sub>*. Formally, only the rules of type (1) in which *p* ≥ *p*<sub>1</sub> or *p* ≤ *p*<sub>0</sub> can be interesting, where *p*<sub>1</sub> > *p<sub>a</sub>* and *p*<sub>0</sub> < *p<sub>a</sub>* are the predetermined boundary values of the probability that *y is y*<sub>1</sub>. Here we suppose that all the rules of type (1) are represented such that they have *y is 1* in the "then" part. (Note that a rule having probability *p* ≤ *p*<sub>0</sub> that *y is y*<sub>1</sub> is equivalent to the rule containing the same condition in the "if" part and having a probability 1 − *p* ≥ 1 − *p*<sub>0</sub> that *y is 0*.) Moreover, the support level of an interesting rule must be not less than the pre-given value *s*<sub>min</sub>, that is, *s* ≥ *s*<sub>min</sub>.

Assume that all the rules satisfying the abovementioned requirements have already been found. Then, a new problem arises: the number of such rules can be so huge that it is quite impractical for the user to review all of them in order to select the interesting ones. The user can decrease the number of revealed rules by raising the minimum support level *s*<sub>min</sub> and/or the minimum probability (confidence level) *p*<sub>1</sub>, or by decreasing the value *p*<sub>0</sub>, but this can lead to missing rules that have a low support level, or whose probability does not strongly deviate from *p<sub>a</sub>*, but which are nevertheless interesting.

#### **3. Calculating the probability that a rule is accidental**

This section will present the first method for measuring the degree of unexpectedness. We will present an algorithm for calculating the probability that a given if-then rule is accidental. As mentioned, the lower this probability, the more unexpected and interesting the rule is.

Let *J* be the set of all the discovered association rules. Consider a rule *j* ∈ *J*. Let us evaluate the *significance level* of rule *j*, namely, the probability that rule *j* does *not* exist accidentally in the investigated dataset.

The significance level of rule *j* may be determined as 1 − *α<sub>j</sub>*, where *α<sub>j</sub>* is the probability that rule *j* exists in the investigated dataset by chance. To precisely define *α<sub>j</sub>*, assume that rule *j* has probability *p<sub>j</sub>* ≥ *p*<sub>1</sub> (that *y is 1*) and support level *s<sub>j</sub>*. It is intuitively clear that there is a certain a priori probability that this rule, or any such rule where the probability is not less than *p<sub>j</sub>*, will be revealed; *α<sub>j</sub>* is this a priori probability. Formally, assume that *m<sub>j</sub>* records have been chosen at random from a dataset containing *N* records, among which there are *p<sub>a</sub>N* records where *y is 1*. Here *m<sub>j</sub>* = *s<sub>j</sub>*/*p<sub>j</sub>* is the number of records satisfying the condition of rule *j*. *α<sub>j</sub>* is the probability that there are not less than *s<sub>j</sub>* records where *y is 1* among these *m<sub>j</sub>* records. *α<sub>j</sub>* is calculated by the formula of the hypergeometric distribution, that is:


$$\alpha\_j = \sum\_{k=s\_j}^{m\_j} P\_{N,M}(m\_j, \ k), \tag{2}$$

where

$$P\_{N,M}(m\_j, \ k) = \frac{\binom{M}{k} \cdot \binom{N-M}{m\_j - k}}{\binom{N}{m\_j}} \tag{3}$$

For a rule *j* with probability *p<sub>j</sub>* ≤ *p*<sub>0</sub> that *y* is *1* (i.e., for a rule having the probability 1 − *p<sub>j</sub>* that *y* is *0*, where 1 − *p<sub>j</sub>* ≥ 1 − *p*<sub>0</sub>), *α<sub>j</sub>* denotes the a priori probability that this rule, or any such rule where the probability that *y* is *1* is not greater than *p<sub>j</sub>*, will be revealed. In this case, *α<sub>j</sub>* is calculated as follows:

$$\alpha\_j = \sum\_{k=0}^{s\_j} P\_{N,M}(m\_j, \ k), \tag{4}$$

where *P*<sub>*N*,*M*</sub>(*m<sub>j</sub>*, *k*) is calculated by formula (3).

To calculate *P*<sub>*N*,*M*</sub>(*m<sub>j</sub>*, *k*) by formula (3), ln *P*<sub>*N*,*M*</sub>(*m<sub>j</sub>*, *k*) is first computed using the approximation formula ln(*n*!) ≈ (*n* + 0.5) ln(*n*) − *n* + 0.5 ln(2*π*). Then *e*<sup>ln *P*<sub>*N*,*M*</sub>(*m<sub>j</sub>*, *k*)</sup> is calculated.
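As an illustration, *α<sub>j</sub>* from formulas (2)–(4) can be computed with SciPy's hypergeometric distribution, which handles the large factorials internally (making an explicit Stirling approximation unnecessary); this is a sketch in the notation of the text, and the example numbers are loosely borrowed from the Diabetes analysis in Section 5.

```python
from scipy.stats import hypergeom

def alpha_j(N, M, s_j, p_j, high_side=True):
    """Probability that rule j exists by chance.
    N: records in the dataset; M: records with y = 1;
    s_j: rule support (a record count); p_j: rule probability."""
    m_j = round(s_j / p_j)          # records satisfying the rule's condition
    rv = hypergeom(N, M, m_j)       # population N, M successes, m_j draws
    if high_side:                   # rule with p_j >= p_1: formula (2)
        return rv.sf(s_j - 1)       # P(at least s_j successes among m_j)
    return rv.cdf(s_j)              # rule with p_j <= p_0: formula (4)

# 768 records, 268 of them with y = 1; a rule with support 172 and
# probability 0.593 (compare rule 1 in Section 5).
print(alpha_j(768, 268, 172, 0.593))
```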

The smaller *α<sub>j</sub>* is, the greater the significance level of rule *j*. The significance level of rule *j* may be interpreted as a measure of the unexpectedness of this rule and, thereby, as a designation of the extent to which the rule is interesting.

Thus, we can calculate the significance level for each rule *j* ∈ *J* and sort the array of all discovered rules in descending order of significance level. This ordering ranks the rules by their level of unexpectedness according to the first criterion of unexpectedness.

Now, the first *k* rules of the sorted array can be considered unexpected and interesting. The number *k* may be determined by requiring that the significance level of a rule be not less than a pre-given value. Another way of determining the boundary value for the significance level is as follows: the average value *a* of the significance levels of all rules *j* ∈ *J* and the corresponding standard deviation *σ* are calculated, and the value *a* + *σ* is accepted as the boundary value for the significance level of an interesting rule.
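A sketch of this selection procedure, assuming each rule already carries its significance level 1 − *α<sub>j</sub>*:

```python
import statistics

def select_interesting(rules):
    """Sort rules by descending significance (1 - alpha) and keep those
    whose significance is not less than mean + standard deviation."""
    ranked = sorted(rules, key=lambda r: r["significance"], reverse=True)
    sigs = [r["significance"] for r in ranked]
    boundary = statistics.mean(sigs) + statistics.pstdev(sigs)
    return [r for r in ranked if r["significance"] >= boundary]

rules = [{"id": i, "significance": s}
         for i, s in enumerate([0.999, 0.99, 0.6, 0.5, 0.4, 0.3])]
print([r["id"] for r in select_interesting(rules)])  # [0, 1]
```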

#### **4. Calculating the conditional probability of a rule given the basic trends**

The previous section referred to the first method for measuring the degree of unexpectedness. We turn now to the second method. This method refers to rules having more than one condition. It calculates the conditional probability of each of these rules given the basic trends in the data.

Consider first an example. Assume that the a priori probability that *y* = 1 is 0.3, and we wish to find all the if-then rules having a probability (1) not less than *p*<sub>1</sub> = 0.4, or (2) not greater than *p*<sub>0</sub> = 0.2. Assume also that the following two rules were discovered (among others):

1. If *x*<sub>1</sub> *is 3*, then there is a probability of 0.45 that *y is 1*.

2. If *x*<sub>1</sub> *is 3* and *x*<sub>2</sub> *is 5*, then there is a probability of 0.85 that *y is 0* (i.e., a probability of 0.15 that *y is 1*).

This pair of rules may be considered unexpected because (1) the first rule states that under the condition *x*<sub>1</sub> *is 3*, the probability that *y is 1* deviates significantly upward from *p<sub>a</sub>*, and (2) when the condition *x*<sub>2</sub> *is 5* is added to *x*<sub>1</sub> *is 3*, the probability that *y is 1* unexpectedly deviates from *p<sub>a</sub>* in the opposite direction.

If the probability that *y is 1* in the second rule were not 0.15 but, for example, 0.9, this fact would also be considered unexpected, since adding the second condition to the first one leads to a sudden leap of probability. In this case, even if the first rule had not been revealed, the second rule would still be considered unexpected. One can easily see why by considering the reason the first rule was not discovered. The first rule's support cannot be less than the second rule's support, *s* ≥ *s*<sub>min</sub>. Consequently, the only possible reason for not discovering the first rule is that neither the condition *p* ≥ *p*<sub>1</sub> nor the condition *p* ≤ *p*<sub>0</sub> is fulfilled, where *p* is the first rule's probability. Hence, *p*<sub>0</sub> < *p* < *p*<sub>1</sub>. Thus, analogously to the previous case, the addition of the condition *x*<sub>2</sub> *is 5* to *x*<sub>1</sub> *is 3* leads to an even greater leap of the probability that *y is 1*.

Let us now give a formal definition for an unexpected rule (following the second method).

**Definition:** A rule containing a *q*-condition (*q* > 1) in the "if" part, *y is y*<sub>1</sub> in the "then" part, and having probability *P* that *y is y*<sub>1</sub> is called *unexpected* if at least one of the following two requirements is fulfilled:


The second requirement means that the probability that *y is y*<sub>1</sub> for any discovered rule with the above-defined *q*<sub>1</sub>-condition in the "if" part and *y is y*<sub>1</sub> in the "then" part must be much less than *P*.

To reveal the unexpected rules according to the above definition, it is first necessary to precisely define the relation "<<". Consider the following possible definition of the relation *p* << *P*.

Assume that *p* is relatively small (for definiteness, *p* < 0.2). To determine the lower boundary value for *P*, let us add a third of the length of the segment [*p*, 1] to *p*. We get the inequality *P* ≥ (2*p* + 1)/3. Let 0.2 ≤ *p* < 0.5. In this case, to determine the lower boundary value for *P*, let us add a half of the length of the segment [*p*, 1] to *p*. We get *P* ≥ (*p* + 1)/2. Let 0.5 ≤ *p* < 0.7. To determine the lower boundary value for *P*, let us add two thirds of the length of the segment [*p*, 1] to *p*. We get *P* ≥ (*p* + 2)/3. For the case when 0.7 ≤ *p* < 0.9, let us define the relation *p* << *P* as *P* = 1. If *p* ≥ 0.9, we do not define the relation *p* << *P*; that is, in this case, the fulfillment of the first of the two abovementioned requirements is the necessary condition for accepting the rule having probability *P* as unexpected.
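This piecewise definition translates directly into code; the following sketch returns whether *p* << *P* holds under the boundaries just listed:

```python
def much_less(p, P):
    """The piecewise relation p << P defined above."""
    if p < 0.2:
        return P >= (2 * p + 1) / 3   # add a third of [p, 1]
    if p < 0.5:
        return P >= (p + 1) / 2       # add a half of [p, 1]
    if p < 0.7:
        return P >= (p + 2) / 3       # add two thirds of [p, 1]
    if p < 0.9:
        return P == 1.0
    return False                      # for p >= 0.9 the relation is undefined

print(much_less(0.15, 0.50))  # True:  0.50 >= (2*0.15 + 1)/3 ~ 0.433
print(much_less(0.45, 0.60))  # False: 0.60 <  (0.45 + 1)/2 = 0.725
```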

Another definition uses the notion of an expected probability for a rule containing a *q*-condition in the "if" part (*q* > 1) and suspected of being unexpected. Consider the set of 1-conditions entering the *q*-condition. Let *p<sub>i</sub>* be the probability that *y is 1* under the *i*th 1-condition, *i* = 1, … , *q*. (Note that if, for a fixed *i*, *p<sub>i</sub>* ≥ *p*<sub>1</sub> or *p<sub>i</sub>* ≤ *p*<sub>0</sub>, then we have the corresponding rule with this 1-condition.) Let *s<sub>i</sub>* be the number of records satisfying both the *i*th 1-condition and the condition *y is 1*. Then the number of records satisfying the *i*th 1-condition is *m<sub>i</sub>* = *s<sub>i</sub>*/*p<sub>i</sub>*. The probability that the *i*th 1-condition is fulfilled on the set of records of the investigated dataset is *m<sub>i</sub>*/*N*. Assume that all the events, each of which is defined by the set of records satisfying the *i*th 1-condition, are independent. Then the expected number *k<sub>ind</sub>* of records satisfying all of these conditions can be calculated as follows:

$$k\_{ind} = N \cdot \prod\_{i=1}^{q} \frac{m\_i}{N} \tag{5}$$

or

$$k\_{ind} = \frac{\prod\_{i=1}^{q} m\_i}{N^{q-1}} \tag{6}$$

Assume that all the events, defined as "both the *i*th 1-condition and *y is 1* are fulfilled," are independent. Then the expected number of records satisfying all these events is

$$k\_{ind}^{(1)} = M \cdot \prod\_{i=1}^{q} \frac{s\_i}{M} \, \, = \, \frac{\prod\_{i=1}^{q} s\_i}{M^{q-1}} \tag{7}$$

Assume that all the events, defined as "both the *i*th 1-condition, and *y is 0* are fulfilled," are independent. Then the expected number of records satisfying all these events is

$$k\_{ind}^{(0)} = (N - M) \cdot \prod\_{i=1}^{q} \frac{m\_i - s\_i}{N - M} = \frac{\prod\_{i=1}^{q} (m\_i - s\_i)}{(N - M)^{q - 1}} \tag{8}$$

Let us now calculate the expected probability *Pind* that *y is 1* under the abovementioned assumptions:

$$P\_{ind} = \frac{k\_{ind}^{(1)}}{k\_{ind}^{(1)} + k\_{ind}^{(0)}}\tag{9}$$

After the transformations, we get:


$$P\_{ind} = \frac{1}{1+A},\tag{10}$$

where

$$A = \left(\frac{p\_a}{1 - p\_a}\right)^{q - 1} \cdot \prod\_{i = 1}^{q} \frac{1 - p\_i}{p\_i} \tag{11}$$
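For illustration, formulas (10) and (11) reduce to a few lines of Python; the inputs *p<sub>a</sub>* and the *p<sub>i</sub>* values below are hypothetical, echoing the example at the start of this section.

```python
import math

def p_ind(p_a, p_list):
    """Expected probability that y is 1 when the 1-conditions are
    independent (formulas 10-11); p_list holds the p_i values."""
    q = len(p_list)
    A = (p_a / (1 - p_a)) ** (q - 1) * math.prod((1 - p) / p for p in p_list)
    return 1 / (1 + A)

# p_a = 0.3 and two 1-conditions with p_1 = 0.45, p_2 = 0.15
print(p_ind(0.3, [0.45, 0.15]))  # ~0.252
```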

Assume now that all the events, each of which is defined by the set of records satisfying the *i*th 1-condition, are dependent to a maximal degree. In this case, the number *k<sub>dep</sub>* of records satisfying all the 1-conditions is equal to the minimum of the numbers of records satisfying the individual 1-conditions, that is:

$$k\_{dep} = \min\_{i=1,\ldots,q} m\_i \tag{12}$$

Let *i*<sub>0</sub> = arg min<sub>*i*</sub> *m<sub>i</sub>*. The probability *P<sub>dep</sub>* that *y is 1* for the rule containing all the 1-conditions is then equal to *p*<sub>*i*<sub>0</sub></sub>, i.e.,

$$P\_{dep} = p\_{i\_0} \tag{13}$$

Thus, we have considered two extreme cases, in which the 1-conditions are (1) independent and (2) dependent as much as possible. The actual extent of the dependency between the 1-conditions can be determined by the number of records satisfying all the 1-conditions, that is, the number of records satisfying the *q*-condition of the rule suspected of being unexpected. Let us denote this number by *K*. Obviously, *K* ≤ *k<sub>dep</sub>*. The nearer *K* is to *k<sub>dep</sub>*, the greater the extent of dependency between the 1-conditions. If *K* is near *k<sub>ind</sub>*, then it can be assumed that the 1-conditions are independent or "almost" independent. When *K* < *k<sub>ind</sub>*, we can say that the events, each of which is defined by the set of records satisfying the *i*th 1-condition, are more inconsistent than independent. But in this case, there is a very low chance of establishing the abovementioned rule with the *q*-condition, owing to the nonfulfillment of the necessary condition *K* ≥ *s*<sub>min</sub>.

Consider now the function *P*<sub>exp</sub> = *P*<sub>exp</sub>(*K*), where *P*<sub>exp</sub> is the expected probability that *y is 1* for the above rule with the *q*-condition. For the sake of simplicity, assume that this function is linear for *k<sub>ind</sub>* ≤ *K* ≤ *k<sub>dep</sub>*, with *P*<sub>exp</sub> = *P<sub>ind</sub>* if *K* = *k<sub>ind</sub>*, and *P*<sub>exp</sub> = *P<sub>dep</sub>* if *K* = *k<sub>dep</sub>*. For the extremely rare cases when *K* < *k<sub>ind</sub>*, let us set *P*<sub>exp</sub> = *P<sub>ind</sub>*. Then, *P*<sub>exp</sub> is calculated as follows:

$$P\_{\rm exp} = \frac{P\_{\rm dep} - P\_{\rm ind}}{k\_{\rm dep} - k\_{\rm ind}} \cdot (K - k\_{\rm ind}) \quad + \quad P\_{\rm ind}, \text{if } k\_{\rm ind} \le K \le k\_{\rm dep} \tag{14}$$

$$P\_{\text{exp}} = P\_{ind}, \text{if } K < k\_{ind} \tag{15}$$

where *k<sub>ind</sub>*, *P<sub>ind</sub>*, *k<sub>dep</sub>*, and *P<sub>dep</sub>* are calculated by formulas (6)–(13).
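The interpolation of formulas (14) and (15) can be sketched as follows, with hypothetical values for *k<sub>ind</sub>*, *k<sub>dep</sub>*, *P<sub>ind</sub>*, and *P<sub>dep</sub>*:

```python
def p_exp(K, k_ind, k_dep, P_ind, P_dep):
    """Expected probability for the q-condition rule (formulas 14-15):
    linear interpolation between the independent case and the maximally
    dependent case, driven by K, the observed number of records
    satisfying the q-condition."""
    if K < k_ind:
        return P_ind                  # formula (15)
    if k_dep == k_ind:
        return P_dep                  # degenerate case: the bounds coincide
    t = (K - k_ind) / (k_dep - k_ind)
    return P_ind + t * (P_dep - P_ind)

print(p_exp(K=100, k_ind=40, k_dep=190, P_ind=0.25, P_dep=0.45))  # 0.33
```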

Thus, a rule containing the *q*-condition in the "if" part and having probability *P* that *y is 1* can be defined as unexpected if *P*<sub>exp</sub> << *P*. The definition of the relation "<<" was presented above.

Assume that we have found an unexpected rule containing a *q*-condition in the "if" part, whose support level is *S* and whose probability (confidence level) is *P*. We can determine the probability of the event described by this rule given the events described by each set of records defined by the *i*th 1-condition, where this 1-condition belongs to the set of 1-conditions entering the *q*-condition, *i* = 1, … , *q*. Note that the corresponding rule with the *i*th 1-condition has perhaps been revealed for all *i* or for some of them. If such a rule having the *i*th 1-condition exists, let us denote the support level, the probability, and the number of records satisfying the rule's condition by *s<sub>i</sub>*, *p<sub>i</sub>*, and *m<sub>i</sub>*, respectively. In the case when a rule having the *i*th 1-condition does not exist, we will use the same notations.

Let us define the level of unlikelihood *U<sub>i</sub>* of the unexpected rule relative to rule *i* as the probability of the existence of this unexpected rule provided that rule *i* holds (more precisely, provided that the *i*th 1-condition is characterized by the three above parameters with values *s<sub>i</sub>*, *p<sub>i</sub>*, *m<sub>i</sub>*). Hence, the level of unlikelihood *U<sub>i</sub>* can be calculated by the formula of the hypergeometric distribution:

$$U\_i = \frac{\binom{s\_i}{S} \cdot \binom{m\_i - s\_i}{K - S}}{\binom{m\_i}{K}} \tag{16}$$

The maximum (or the minimum) of *U<sub>i</sub>* over *i* can be taken as the level of unlikelihood of the unexpected rule.

The rules can then be sorted by their level of unlikelihood: the lower the probability *U<sub>i</sub>*, the more unlikely, and hence the more unexpected, the rule is according to the second criterion.
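A sketch of formula (16) with SciPy; the example numbers are back-of-the-envelope reconstructions from rules 3 and 4 of Section 5 (22 supporting records out of *K* = 40 for the unexpected rule, against roughly *s<sub>i</sub>* = 48 out of *m<sub>i</sub>* = 238 for the one-condition rule), so they should be read as illustrative only.

```python
from scipy.stats import hypergeom

def unlikelihood(S, K, s_i, m_i):
    """Level of unlikelihood U_i (formula 16): the probability of seeing
    exactly S records with y = 1 among the K records of the q-condition,
    drawn from the m_i records of the i-th 1-condition, s_i of which
    have y = 1."""
    return hypergeom(m_i, s_i, K).pmf(S)

print(unlikelihood(S=22, K=40, s_i=48, m_i=238))  # a very small probability
```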

#### **5. Experiments with real data**

The abovementioned two algorithms have been implemented in a commercial data mining tool, WizWhy, which can be downloaded from www.wizsoft.com.

In what follows, we present the reports issued by this tool for the Diabetes dataset (www.ics.uci.edu/~mlearn/MLRepository.html). We ran an analysis with the following thresholds:

Minimum support level: 20 records (The total number of records was 768.)

Minimum confidence level for if-then rules: 0.48 (The a priori probability was 0.349.)

Minimum confidence level for if-then-not rules: 0.79.

Two hundred twenty rules were discovered. The two rules having the lowest error probability (α being smaller than 0.0000001) were:

1.*If* **Plasma glucose** *is* **108.00 ... 197.00** (average = **142.80**)

and **Age** *is* **29.00 ... 81.00** (average = **42.39**)

*Then* **Predict** *is* **2**

*Confidence level: Rule's probability is* **0.593**

*Support level: The rule exists in* **172** *records.*

2.*If* **Plasma glucose** *is* **0.00 ... 107.00** (average = **90.65**)
*Then* **Predict** *is not* **2**

*Confidence level: Rule's probability is* **0.875**

*Support level: The rule exists in* **253** *records.*

These rules are the most interesting rules according to the first method. Thirty-four rules (out of the total 220 rules that were discovered) were unexpected according to the second method. The first rule in this list was:

3.*If* **Number of times pregnant** *is* **1.00 ... 2.00** (*average* = **1.35**)

and **Diastolic blood pressure** *is* **64.00 ... 88.00** (average = **74.20**)

and **Age** *is* **29.00 ... 62.00** (average = **36.83**)

*Then* **Predict** *is* **2**

*Confidence level: Rule's probability is* **0.550**

*Support level: The rule exists in* **22** *records.*

*Significance Level: Error probability <* 0.01

This rule was found to be unlikely relative to the following three rules and trends:

4.*If* **Number of times pregnant** *is* **1.00 ... 2.00** (*average* = **1.43**)

*Then* **Predict** *is not* **2**

*Confidence level: Rule's probability is* **0.798**

*Support level: The rule exists in* **190** *records.*

5.*If* **Diastolic blood pressure** *is* **64.00 ... 88.00**

*Then* **Predict** *is* **2**

*Confidence level: Trend's probability is* **0.369**

*Support level: The trend exists in* **190** *records.*

6.*If* **Age** *is* **29.00 ... 81.00**

*Then* **Predict** *is* **2**

*Confidence level: Trend's probability is* **0.491**

*Support level: The trend exists in* **197** *records.*

Note that rule # (4) says that if the number of times pregnant is 1 or 2, then there is a high probability (0.798) that the value in the Predict field is *not* 2. However, following rule # (3), if the conditions presented in rules # (5) and (6) are added to the condition of rule # (4), then there is a high probability that the Predict field *is* 2. On the basis of the events described in rules # (4), (5), and (6), we should have expected the confidence level of rule # (3) to be 0.332, while in fact it is 0.55. In this sense, rule # (3) is unexpected and, as such, interesting.

Judging by experiments with the datasets in the abovementioned repository, these results are quite representative. On average, about 5% of the rules have an error probability lower than 0.01, and about another 5% are unexpected according to the second method.

#### **6. Conclusion**

We have presented two methods for the automated discovery of unexpected and, thereby, interesting rules within the set of all the if-then rules previously revealed in the mined dataset.

The first method calculates, for each rule, the probability that the rule exists accidentally. The lower this probability, the more unexpected the rule is. The second method calculates the conditional probability of each rule having more than one condition, given the relevant more basic rules and trends. The lower this conditional probability, the more unexpected the rule is.

These two methods are independent, and it always makes sense to use both. One can calculate a combined score by multiplying the scores of the two methods. And since there are additional methods for revealing interesting rules (some of them were mentioned in the first section of this paper), one can develop a score that combines all of them. The search for additional methods for revealing interesting rules, and for ways of combining these methods, is a project to be continued.

The author would like to thank Boris Levin and Ilya Vorobyov for helping to write this paper and to develop the software program presented in this paper.

### **Author details**

Abraham Meidan WizSoft Inc., Syosset, New York, United States

\*Address all correspondence to: abraham@wizsoft.com

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery: An overview. Advances in Knowledge Discovery and Data Mining. 1996:1-34

[2] Agrawal R, Imielinski T and Swami A. Mining association rules between sets of items in large databases. Proceedings of the ACM SIGKDD Conference on Management of Data. 1993. pp. 207-216

[3] Bayardo R and Agrawal R. Mining the most interesting rules. Proceedings: KDD-99 The Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1999. pp. 145-154

[4] Agrawal R, Mannila H, Srikant R, Toivonen H and Verkamo AI. Fast discovery of association rules. In: Advances in Knowledge Discovery and Data Mining; 1995

[5] Piatetsky-Shapiro G and Matheus CJ. The interestingness of deviations. Proceedings of the AAAI-94 Workshop on Knowledge Discovery in Databases. 1994. pp. 25-36

[6] Agassi J. Science in Flux. Reidel; 1975:**28**

[7] Silberschatz A, Tuzhilin A. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering. 1996; **8**(6):970-974

[8] Silberschatz A and Tuzhilin A. On subjective measures of interestingness in knowledge discovery. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. 1997. pp. 259-262

[9] Liu B and Hsu W. Post-Analysis of Learned Rules, AAAI-96. 1996. pp. 828-834

[10] Liu B, Hsu W, and Chen S. Using general impressions to analyze discovered classification rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. 1997. pp. 31-36

[11] Sahar S. Interestingness via what is not interesting. Proceedings: KDD-99 The Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1999. pp. 332-336

[12] Romão W, Freitas AA, Gimenes IM de S. Discovering interesting knowledge from a science and technology database with a genetic algorithm. Applied Soft Computing. 2004;**4**(2):121-137

[13] Vashishtha J, Kumar D, Ratnoo S, Kundu K. Mining comprehensible and interesting rules: A genetic algorithm approach. International Journal of Computer Applications (0975–8887). 2011;**31**(1):39-47

[14] Suzuki E. Autonomous discovery of reliable exception rules, Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. 1997. pp. 259-262

#### **Chapter 5**

## COVID-19 Social Lethality Characterization in Some Regions of Mexico through the Pandemic Years Using Data Mining

*Enrique Luna-Ramírez, Jorge Soria-Cruz, Iván Castillo-Zúñiga and Jaime Iván López-Veyna*

#### **Abstract**

In this chapter, an analysis of the data provided by the Federal Government of Mexico on the COVID-19 disease during the pandemic years is described. For this study, nineteen significant variables were considered, including the test result for detecting the presence of the SARS-CoV-2 virus, the alive/deceased people cases, and different comorbidities that affect a person's health, such as diabetes, hypertension, obesity, and pneumonia, among other variables. Thus, based on the KDD (Knowledge Discovery in Databases) process and data mining techniques, we undertook the task of preprocessing such data to generate classification models for identifying patterns in the data, or correlations among the different variables, that could influence COVID-19 deaths. The models were generated using different classification algorithms, were selected based on a high correct classification rate, and were validated with the help of the cross-validation test. In this way, the period corresponding to the five SARS-CoV-2 infection waves that occurred in Mexico between March 2020 and October 2022 was analyzed with the main purpose of characterizing the COVID-19 social lethality in the most contagious regions of Mexico.

**Keywords:** SARS-CoV-2 infection waves, COVID-19 lethality, KDD process, data mining, classification models

#### **1. Introduction**

Since the first case of SARS-CoV-2 in Mexico, diagnosed on February 28, 2020, five infection waves of this virus have occurred until October 2022, as shown in **Figure 1**. Associated with these infection waves are the COVID-19 death waves, which are shown in **Figure 2**, where it is observed that the second wave had the highest lethality although this wave was not the one with the highest infection rate. This fact can be observed by comparing both figures, obtained from [1].

**Figure 1.**

*Five SARS-CoV-2 virus infection waves in Mexico.*

**Figure 2.** *Behavior of COVID-19 deaths in Mexico.*

From the previous figures, it can also be inferred that although the last two infection waves were the highest, they had at the same time the lowest lethality, which was a natural consequence of the growing application of anti-COVID vaccines, supplied every day to the different sectors of the Mexican population.

Thus, based on the data published by the Federal Government of Mexico during the pandemic years, which can be consulted on the General Directorate of Epidemiology website [2], a study was carried out to analyze them using data mining techniques, specifically classification algorithms, to detect patterns related to COVID-19 social lethality, particularly in the regions of Mexico with the highest SARS-CoV-2 virus infection rate.

#### **2. Theoretical framework**

In principle, this work was based on data mining techniques [3–6] and on the Knowledge Discovery in Databases (KDD) process [7–9], which together make it possible to extract hidden knowledge from large data volumes. Thus, by using these techniques and this process, we managed to extract knowledge from the COVID-19 dataset provided by the Federal Government of Mexico, which will be described later.

It is important to point out that we used classification algorithms to generate models containing knowledge in the form of rules, highlighting the use of the Naïve Bayes and J48 classifiers, which are among the most widely used algorithms in the field of data mining research [10]. According to Taheri et al. [11], the Naïve Bayes classifier is useful for high-dimensional data, as the probability of each variable is estimated independently. Therefore, if C denotes the class of an observation of a set *X* of variables, *X* = {*X*1, *X*2, …, *Xn*}, then the class C can be predicted by using Bayes' rule:


$$P(C \mid X) = \frac{P(C) \prod\_{i=1}^{n} P(X\_i \mid C)}{P(X)} \tag{1}$$

In this way, we could predict different classes in our dataset, for instance, the class associated with alive or deceased patients, fundamental in our work.
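As a hedged illustration only (the chapter itself used WEKA's Naïve Bayes implementation), the following Python sketch applies formula (1) to a toy frame with hypothetical comorbidity columns and the ALIVE_OR_DECEASED class:

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB

# Toy data; the column values are hypothetical stand-ins for the real records.
df = pd.DataFrame({
    "DIABETES":     [1, 0, 1, 0, 1, 0],
    "HYPERTENSION": [1, 1, 0, 0, 1, 0],
    "PNEUMONIA":    [1, 0, 1, 0, 0, 0],
    "ALIVE_OR_DECEASED": ["deceased", "alive", "deceased",
                          "alive", "deceased", "alive"],
})
X = df.drop(columns="ALIVE_OR_DECEASED")
y = df["ALIVE_OR_DECEASED"]
model = CategoricalNB().fit(X, y)
print(model.predict_proba(X.iloc[:1]))  # P(C | X) as in formula (1)
```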

Regarding the J48 classifier, it is worth mentioning that it is an implementation of the C4.5 algorithm [12], which builds decision trees from a set of training data based on information entropy. This concept refers to the measurement of the uncertainty of an information source, such that the source elements with lower probability (lower frequency) are those that provide more information. Thus, Shannon's formula [13] for calculating the entropy of a random variable *X* that can take on the states *x*1, *x*2, …, *xn* is given by:

$$H(X) = -\sum\_{i=1}^{n} p\_i \log p\_i \tag{2}$$

where *pi* is the probability of *xi*, *i* = 1, 2, …, *n*. This is how the J48 classifier operated on the different variables of our dataset to predict a certain class, considering that a lower entropy implies a lower level of information uncertainty.
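Formula (2) is a one-liner in code; for example, a fair coin has maximal entropy, while a biased one is more predictable:

```python
import math

def entropy(probabilities):
    """Shannon entropy (formula 2); log base 2 measures bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0: maximal uncertainty
print(entropy([0.9, 0.1]))  # ~0.47: lower uncertainty
```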

On the other hand, to measure the classification reliability of nominal variables, Cohen's Kappa coefficient [14, 15] and Fleiss' Kappa [16, 17] are commonly used measures, based on the agreement between what is observed in a dataset and what could happen randomly. Like most correlation statistics, the Kappa statistic can vary from −1 to +1, with negative values indicating disagreement and positive values indicating agreement, so that values as low as 0.41 could be acceptable in terms of reliability according to Cohen [18]. Thus, by using different classifiers, several models were generated and validated with the cross-validation test, considered a widespread validation strategy because of its simplicity [19].
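For instance, Cohen's Kappa on a pair of hypothetical label vectors can be computed with scikit-learn (a stand-in here for the Kappa statistic WEKA reports):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical actual vs. predicted classes for six patients.
actual    = ["alive", "alive", "deceased", "deceased", "alive", "deceased"]
predicted = ["alive", "alive", "deceased", "alive",    "alive", "deceased"]
print(cohen_kappa_score(actual, predicted))  # ~0.67; above ~0.41 may be acceptable [18]
```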

#### **3. Related work**

There are some interesting works related to the use of data mining and machine learning techniques focused on developing algorithms and models to analyze and forecast SARS-CoV-2 infections and COVID-19 disease behavior, some of which made use of epidemiological data referring to Mexico [20–22]. Thus, in [20, 21], classifiers such as decision tree, support vector machine, naïve Bayes, and random forest were used to generate forecasting models, while a multi-objective evolutionary algorithm was used in [22] for retrieving high-quality rules to identify the most susceptible groups to COVID-19 disease. Also, in [23], logistic regression models were employed to assess the association between demographic factors, comorbidities, wave and vaccination, and the risk of severe disease and in-hospital death. This work was carried out during the five COVID-19 waves in Mexico.

On the other hand, important works using the WEKA machine learning tool [24] (used in our work) were identified. In a generic way, such works [25–27] used several supervised machine learning algorithms (classifiers) available in this tool to build classification models from COVID-19 datasets. Other relevant works used different strategies and tools: one used the Python programming language to develop data mining models for predicting COVID-19 infected patients' recovery using an epidemiological dataset of South Korea [28]; another generated its own dataset with the help of specialist physicians for predicting mortality in patients with COVID-19 based on data mining techniques [29]; another developed a model to predict the COVID-19 incidence rate in different regions of the world through a least-squares classification algorithm [30]; another discovered rules on factors interrelated with the COVID-19 pandemic using data mining methodologies [31]; and one more used the RapidMiner Studio software [32] to create a model to analyze and forecast the existence of COVID-19 using the so-called Kaggle dataset [33].

#### **4. Methodology**

As mentioned before, to carry out this work, the KDD process was used, shown in **Figure 3**. Thus, following the stages marked out in this process, the starting point was the data retrieval from the databases provided by the Federal Government of Mexico on SARS-CoV-2 infection cases and COVID-19 deaths.

It is important to note that the provided data were basically numbers associated with a catalog of codes, and they contained omissions and errors. Therefore, it was necessary to preprocess the raw data so that they could be exploited with WEKA, the main analysis tool used in our study. A preprocessed data sample is shown in **Figure 4**.
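The following sketch illustrates the kind of code-to-label preprocessing described here, with hypothetical file names, column names, and code catalogs (the actual ETL combined Microsoft Power BI and Python, as described next):

```python
import pandas as pd

# Hypothetical code catalog in the spirit of the official data dictionary.
YES_NO = {1: "yes", 2: "no", 97: None, 98: None, 99: None}

raw = pd.read_csv("covid_open_data.csv")          # hypothetical file name
clean = pd.DataFrame({
    "DIABETES": raw["DIABETES"].map(YES_NO),
    "PNEUMONIA": raw["NEUMONIA"].map(YES_NO),     # hypothetical source column
    "ALIVE_OR_DECEASED": raw["DATE_OF_DEATH"].notna()
                             .map({True: "deceased", False: "alive"}),
})
clean = clean.dropna(subset=["DIABETES", "PNEUMONIA"])  # drop coded omissions
clean.to_csv("covid_preprocessed.csv", index=False)
```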

To preprocess the data, an extraction, transformation, and load (ETL) process was carried out using Microsoft Power BI and the Python programming language. That is, with the combination of these tools, the different COVID-19 databases were integrated into a unique dataset containing clean, standardized, and transformed data, which were used to generate the classification models. First, some preliminary models were generated

#### **Figure 3.** *Used methodology for extracting knowledge from COVID-19 data.*


### **Figure 4.**

*A COVID-19 preprocessed data sample.*


**Figure 5.** *Variable importance analysis.*

to carry out a Variable Importance analysis using R and WEKA, whose results are shown in **Figure 5**.

This analysis was performed taking the ALIVE\_OR\_DECEASED variable as the class to predict and considering only the SARS-CoV-2 positive cases. Thus, new models were generated, some of them using all the previous variables and others using only the most significant ones. The best models are presented in the next section.
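A rough analogue of this Variable Importance step (which the chapter performed with R and WEKA) can be sketched with a random forest's impurity-based importances; the file name and the TEST_RESULT column are hypothetical:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("covid_preprocessed.csv")
positives = df[df["TEST_RESULT"] == "positive"]   # SARS-CoV-2 positive cases only
X = pd.get_dummies(positives.drop(columns=["TEST_RESULT", "ALIVE_OR_DECEASED"]))
y = positives["ALIVE_OR_DECEASED"]
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, score in sorted(zip(X.columns, forest.feature_importances_),
                          key=lambda t: t[1], reverse=True)[:10]:
    print(f"{name}: {score:.3f}")                 # ten most important variables
```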

#### **5. Results**

In **Figure 6**, a preliminary analysis of all SARS-CoV-2 positive cases of the Mexican population corresponding to the pandemic years (2020, 2021, and 2022) is shown. This analysis was performed by residence place, which made it possible to identify the regions of Mexico with the highest SARS-CoV-2 virus infection rate.

About half of the 2,425,514 cases were concentrated in five states: Mexico City, Mexico State, Guanajuato, Nuevo Leon, and Jalisco. Thus, our work focused on these five regions, as well as on the country as a whole, generating classification models through different classification algorithms with different parameters, including WEKA's default parameters. **Figure 7** shows an example of how the classifiers were tuned to improve their accuracy.

In this way, a summary of the best classification models found for 2020 is shown in **Figure 8**. The models were selected based on the highest accuracy and validated with a 10-fold cross-validation test. It is important to point out that the selected models were compared with the preliminary models used for the Variable Importance analysis, and the best classifier identified in the preliminary models changed in some regions (Guanajuato and Jalisco) after a new round of classifier tuning.
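The selection loop can be sketched as follows, using scikit-learn's CART decision tree as a stand-in for WEKA's J48 (related, but not identical, tree learners) and 10-fold cross-validated accuracy as the selection criterion:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical tuning grid; X and y as in the previous sketch.
params = {"max_depth": [None, 5, 10], "min_samples_leaf": [2, 25, 50]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      params, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```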

With respect to the findings identified in the best model for each region, related to deceased patients, the highlights are summarized below.

#### **Figure 6.**

*Mexican population with a positive SARS-CoV-2 test.*


#### **Figure 7.**

*Example of tuning a classifier algorithm.*

The main findings on deceased people in Guanajuato include 6.86% lethality with respect to SARS-CoV-2 positive cases, in an approximate ratio of 2 to 1 between men and women (62.3% men and 37.7% women); 17.5% were intubated cases, 63% hospitalized cases (without intubation), and 19.5% ambulatory cases; pneumonia, diabetes, hypertension, and obesity emerge as the main comorbidities associated with lethality, and July and November appear as the months with the highest lethality.

In the case of Jalisco, the main findings on deceased people include 12.31% lethality with respect to SARS-CoV-2 positive cases, in an approximate ratio of 2 to 1 between men and women (63.6% men and 36.4% women); 26.6% were intubated cases, 63.4% hospitalized cases (without intubation), and 10% ambulatory cases; just as in Guanajuato, pneumonia, diabetes, hypertension, and obesity emerge as the main comorbidities associated with lethality, and June, July, August, November, and December appear as the months with the highest lethality. By comparing the findings of Guanajuato and Jalisco, it can be inferred that the main comorbidities associated with lethality are the same, due in a certain way to their proximity and similar weather conditions.



#### **Figure 8.** *Summary of best classification models in 2020.*

Other regions with a high similarity, not only in weather but also in urban and demographic characteristics, are Mexico City and Mexico State. This is because Mexico City and the most populated areas of Mexico State form the same urban region (the metropolitan area of Mexico City). Therefore, as could be expected, much of the knowledge contained in their classification models is similar with respect to the characteristics of deceased patients. For instance, while in Mexico City the risk of death for an intubated patient older than 56 years was 85%, in Mexico State this risk for an intubated patient older than 54 years was 86%.

In the case of Nuevo Leon, the most important knowledge contained in its classification model with respect to deceased people refers to both ambulatory and hospitalized (without intubation) patients older than 63 years. In the first case, the risk of death was 79%, and in the case of hospitalized patients, this risk was 74%, mainly in the period July–November.

For the year 2021, a summary of the best classification models found is shown in **Figure 9**. Again, the models were selected based on the highest accuracy and validated with a 10-fold cross-validation test.

As can be observed, like in 2020, the best classifier in all cases was J48 compared to Random Forest and Naïve Bayes classifiers. The most important findings related to deceased patients were identified in these models and are described below.

In the case of Guanajuato, the risk of death for an intubated patient was 82%, regardless of any other factor, while for a not-intubated (hospitalized) patient older than 64 years and suffering from pneumonia, the risk of death was 66%, mainly in January (winter). In Jalisco, also in January, the risk of death for a hospitalized patient older than 67 years was 66%, the same as in Guanajuato. In Mexico City, the risk of death for an intubated patient older than 50 years was 79%; this pattern remains in a certain


#### **Figure 9.**

*Summary of best classification models in 2021.*

way like in 2020. In Mexico State, for an intubated patient older than 53 years and suffering from pneumonia, the risk of death was 85%, compared to the risk of 86% in 2020 for patients with similar characteristics; it can be inferred that this pattern remains the same. Finally, in Nuevo Leon, for an intubated patient older than 51 years and suffering from pneumonia, the risk of death was 85%.

Finally, a summary of the best classification models found for the year 2022 is shown in **Figure 10**. As in the years 2020 and 2021, the models were selected based on the highest accuracy and validated with a 10-fold cross-validation test.

Again, the best classifier in all cases was J48; however, in this case, the Kappa statistic could be considered low for the best two models in Mexico City, Mexico State, and Nuevo Leon. Therefore, considering that the third-best model, built with the Naïve Bayes classifier, has a more acceptable Kappa value [18] and a high enough classification percentage, the best rules were mainly sought in this model for the three mentioned places.

In the case of Guanajuato, the risk of death for an intubated patient older than 48 years was 85%, mainly in January, which is usually the coldest month in central Mexico. In Jalisco, the risk of death for an intubated patient older than 43 years and suffering from pneumonia was 86%. With respect to Mexico City, Mexico State, and Nuevo Leon, pneumonia, diabetes, and hypertension emerge as the main comorbidities associated with lethality, and January as the most lethal month. Practically 100% of the deceased cases were hospitalized, mostly without being intubated. Besides, the mean age of death was 71 years in Mexico City, with a standard deviation of 16, which suggests people over 50 years as the most affected. For Mexico State and Nuevo Leon, the mean and standard deviation values were 67 and 18, and 69 and 16, respectively,



#### **Figure 10.**

*Summary of best classification models in 2022.*

#### **Figure 11.**

*Subtree of the classification model for 2020.*

which suggests people over 49 years in Mexico State and people over 53 years in Nuevo Leon were the most affected by death.

To finish the description of our work, as a tree-based representation example, **Figure 11** shows a subtree of the classification model corresponding to the year 2020 (generated with the J48 classifier), in which the presence of most variables that have the highest impact on the ALIVE\_OR\_DECEASED class can be observed.

This model, like the classification models for 2021 and 2022, contains overall rules, in contrast with the models generated for the five analyzed regions, which contain more specific rules.

#### **6. Conclusions**

As stated in the title of this chapter, the main objective of this work was to characterize the COVID-19 disease lethality in Mexico throughout the five SARS-CoV-2 virus infection waves that occurred between March 2020 and October 2022, for which classification algorithms were used, as part of data mining techniques, to extract knowledge from the pandemic databases provided by the Government of Mexico. As a first stage, an ETL process was executed on such databases to integrate them into a unique minable dataset.

Thus, from the consolidated dataset, some preliminary classification models were generated and used to realize a Variable Importance analysis for identifying the variables with the highest impact on the ALIVE\_OR\_DECEASED class, which was our base feature to characterize the COVID-19 disease lethality.

Our study focused on the five regions with the highest contagion rate in Mexico, identified through an analysis of the preprocessed data. In this way, Guanajuato, Jalisco, Mexico City, Mexico State, and Nuevo Leon emerged as the case studies, which together represented 42% of total SARS-CoV-2 infection cases in the pandemic period. For each of these regions, various classification models were generated in 2020, 2021, and 2022 using different classifiers, which were tuned by varying their parameters so that the best models could be found based on their accuracy and other metrics.

As an important part of the knowledge extracted from the classification models, various characteristics and conditions were identified in patients who died and whose test result had been confirmed as positive for SARS-CoV-2. For example, in 2020, the risk of death for intubated patients older than 54 years in Mexico City and Mexico State was 85 and 86%, respectively, in an approximate 1 to 2 women–men ratio. In 2021, this rule remained to some extent for Mexico City, with the risk of death dropping to 79% for intubated patients older than 50 years, but it remained practically unchanged for Mexico State, with an 85% death risk for patients older than 53 years with pneumonia. As mentioned previously, the similarity of the lethality behavior in these populations is mainly because the most populated regions of Mexico State are part of the metropolitan area of Mexico City.

In this way, the COVID-19 lethality in Mexico was characterized using different classifiers, highlighting WEKA's J48 as the classifier with the best performance in all cases. Nonetheless, Random Forest and Naïve Bayes classifiers also helped extract important knowledge from the pandemic dataset.

To conclude, it is important to point out that in 2022, the COVID-19 lethality decreased drastically throughout Mexico as a natural consequence of the constant anti-COVID vaccination campaigns. However, the most affected population by this disease continued to be people over 50 years, according to what was described in this chapter.

#### **Acknowledgements**

We want to thank the TECNOLÓGICO NACIONAL DE MÉXICO for its contribution and support to this work.


#### **Author details**

Enrique Luna-Ramírez<sup>1</sup>\*, Jorge Soria-Cruz<sup>1</sup>, Iván Castillo-Zúñiga<sup>1</sup> and Jaime Iván López-Veyna<sup>2</sup>

1 National Technological Institute of Mexico Campus El Llano Aguascalientes, Mexico

2 National Technological Institute of Mexico Campus Zacatecas, Mexico

\*Address all correspondence to: enrique.lr@llano.tecnm.mx

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Our World in Data. Coronavirus (COVID-19) cases. Available from: https://ourworldindata.org/covid-cases [Accessed: August 17, 2023]

[2] General Directorate of Epidemiology (Mexico). Historical COVID-19 databases. Available from: https://www.gob.mx/salud/documentos/datos-abiertos-bases-historicas-direccion-general-de-epidemiologia [Accessed: August 2, 2023]

[3] Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical Machine Learning Tools and Techniques. Fourth ed. USA: Morgan Kaufmann Publishers; 2016. DOI: 10.1016/C2009-0-19715-5

[4] Frank E, Hall MA, Witten IH. The WEKA workbench. Data Mining: Practical Machine Learning Tools and Techniques. Fourth ed. USA: Morgan Kaufmann Publishers; 2016

[5] Singh J, Dhiman G. A survey on machine-learning approaches: Theory and their concepts. Materials Today Proceedings. 2021. DOI: 10.1016/j.matpr.2021.05.335

[6] Yu B, Mao W, Lv Y, Zhang C, Xie Y. A survey on federated learning in data mining. WIREs: Data Mining and Knowledge Discovery. 2021;**12**(1):1-20. DOI: 10.1002/widm.1443

[7] Fayyad U, Piatetsky-Shapiro G, Smyth P. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM. 1996;**39**(11):27-34. DOI: 10.1145/240455.240464

[8] Safhi HM, Frikh B, Ouhbi B. Assessing reliability of big data knowledge discovery process. Procedia Computer Science. 2019;**148**:30-36. DOI: 10.1016/j.procs.2019.01.005

[9] Plotnikova V, Dumas M, Milani F. Adaptations of data mining methodologies: A systematic literature review. PeerJ Computer Science. 2020;**6**:1-43. DOI: 10.7717/ PEERJ-CS.267

[10] Wu X, Kumar V, Ross Quinlan J, et al. Top 10 algorithms in data mining. Knowledge and Information Systems. 2008;**14**:1-37

[11] Taheri S, Mammadov M. Learning the naive Bayes classifier with optimization models. International Journal of Applied Mathematics and Computer Science. 2013;**23**:787-795

[12] Salzberg SL. C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Machine Learning; 1994;**16**:235-240. DOI: 10.1007/bf00993309

[13] Shannon CE. A mathematical theory of communication. Bell System Technical Journal. 1948;**27**(3):379- 423. DOI: 10.1002/j.1538-7305.1948. tb01338.x

[14] Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 1960;**20**(1):37-46. DOI: 10.1177/001316446002000104

[15] Cohen J. A power primer. Psychological Bulletin. 1992;**112**(1):155- 159. DOI: 10.1037/0033-2909.112.1.155

[16] Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin. 1971;**76**(5):378- 382. DOI: 10.1037/h0031619

*COVID-19 Social Lethality Characterization in Some Regions of Mexico through the Pandemic… DOI: http://dx.doi.org/10.5772/intechopen.113261*

[17] Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement. 1973;**33**(3):613-619. DOI: 10.1177/001316447303300309

[18] McHugh ML. Interrater reliability: The kappa statistic. Biochemia Medica (Zagreb). 2012;**22**(3):276-282. DOI: 10.11613/bm.2012.031

[19] Arlot S, Celisse A. A survey of cross-validation procedures for model selection. Statistics Surveys. 2010;**4**:40- 79. DOI: 10.1214/09-SS054

[20] Muhammad LJ, Algehyne EA, Usman SS, Ahmad A, Chakraborty C, Mohammed IA. Supervised machine learning models for prediction of COVID-19 infection using epidemiology dataset. SN Computer Science. 2021;**2**(11):1-13. DOI: 10.1007/ s42979-020-00394-7

[21] Abrol P, Kalrupia N, Kaur J. Hybrid voting classifier model for COVID-19 prediction by embedding machine learning techniques. Turkish Journal of Computer and Mathematics Education. 2022;**13**(2):171-183

[22] Sinisterra-Sierra S, Godoy-Calderón S, Pescador-Rojas M. COVID-19 data analysis with a multi-objective evolutionary algorithm for causal association rule mining. Mathematical and Computational Applications. 2023;**28**(12):1-15. DOI: 10.3390/ mca28010012

[23] Ascencio-Montiel IJ, Ovalle-Luna OD, Rascón-Pacheco RA, Borja-Aburto VH, Chowell G. Comparative epidemiology of five waves of COVID-19 in Mexico, March 2020– August 2022. BMC Infectious Diseases.

2022;**22**(813):1-11. DOI: 10.1186/ s12879-022-07800-w

[24] Waikato University. Weka 3 - Data mining with open source machine learning software in Java. Available from: https://www.cs.waikato.ac.nz/ml/weka/ [Accessed: August 14, 2023]

[25] Villavicencio CN, Macrohon JJE, Inbaraj XA, Jeng JH, Hsieh JG. Covid-19 prediction applying supervised machine learning algorithms with comparative analysis using weka. Algorithms. 2021;**14**(7):1-22. DOI: 10.3390/a14070201

[26] Kalezhi J, Chibuluma M, Chembe C, Chama V, Lungo F, Kunda D. Modelling Covid-19 infections in Zambia using data mining techniques. Results in Engineering. 2022;**13**:1-7. DOI: 10.1016/j. rineng.2022.100363

[27] Vig V, Kaur A. Time series forecasting and mathematical modeling of COVID-19 pandemic in India: A developing country struggling to cope up. International Journal of System Assurance Engineering and Management. 2022;**13**(6):2920-2933. DOI: 10.1007/s13198-022-01762-7

[28] Muhammad LJ, Islam MM, Usman SS, Ayon SI. Predictive data mining models for novel coronavirus (COVID-19) infected patients' recovery. SN Computer Science. 2020;**1**(4):1-7. DOI: 10.1007/s42979-020-00216-w

[29] Moulaei K, Ghasemian F, Bahaadin-Beigy K, Sarbi RE, Taghiabad ZM. Predicting mortality of COVID-19 patients based on data mining techniques. Journal of Biomedical Physics & Engineering. 2021;**11**(5):653-662. DOI: 10.31661/jbpe.v0i0.2104-1300

[30] Ahouz F, Golabpour A. Predicting the incidence of COVID-19 using data mining. BMC Public Health.

2021;**21**(1087):1-12. DOI: 10.1186/ s12889-021-11058-3

[31] Yavuz Ö. A data mining analysis of COVID-19 cases in states of United States of America. International Journal of Electrical and Computer Engineering. 2022;**12**(2):1754-1758. DOI: 10.11591/ ijece.v12i2.pp1754-1758

[32] RapidMiner. The RapidMiner Platform. Available from: https:// rapidminer.com/ [Accessed: August 26, 2023]

[33] Sher T, Rehman A, Kim D. COVID-19 outbreak prediction by using machine learning algorithms. Computers, Materials & Continua. 2023;**74**(1):1561- 1574. DOI: 10.32604/cmc.2023.032020

### **Chapter 6**

## Data Mining Strategy to Prevent Adverse Drug Events: The Cases of Rosiglitazone and COVID-19 Vaccines

*Maria-Isabel Jimenez-Serrania*

### **Abstract**

This chapter analyzes how a simple strategy for the early detection of safety signals using data mining can prevent the potential risk of adverse events with new or established drugs. We first present the case of an active antidiabetic ingredient, rosiglitazone. The capability of the strategy to detect the risk of heart failure among the data reported during the first 8 years of commercialization is demonstrated: the signal appears 2 years before rosiglitazone was withdrawn from the market in 2010 due to that risk. Ten years later, the agility of obtaining safety signals after marketing a drug was put to the test with COVID-19 vaccines. Among the adverse events notified during only 2 months of follow-up, we detected early signals of thrombosis following COVID-19 vaccines. Several weeks later, these events were in the spotlight of the vaccination campaign and drove changes in the type of vaccine administered according to susceptible age groups. This early analysis strategy for reported suspected adverse drug reactions can provide useful information for decision-making faster than the standard data mining methodology.

**Keywords:** data mining, early detection, adverse reaction, rosiglitazone, heart failure, COVID-19 vaccines



#### **1. Introduction**

We want to present and analyze two situations, differing in the extent and duration of the adverse drug reactions (ADRs) involved, that led to regulatory actions years later, and to assess the capability of a specific strategy to detect them in time.

The 'looking back' example is about rosiglitazone, an active ingredient used only for a restricted population; here we analyze cumulative data over the first 8 years after commercialization.

The 'looking at the present' example is about viral vector COVID-19 vaccines, used worldwide; here we analyze data from only the first 2 months after commercialization.

First, we present the background of each situation.

#### **1.1 Looking back: the case of rosiglitazone and cardiovascular risk**

Rosiglitazone is an active ingredient used to treat type 2 diabetes. There has been much debate about the cardiovascular risks, particularly heart failure, associated with its use.

In the 2000s, several studies suggested that rosiglitazone may increase the risk of cardiovascular events such as heart attacks and strokes. For example, a meta-analysis published in 2007 in the *New England Journal of Medicine* found that rosiglitazone was associated with a significant increase in the risk of heart attacks, as well as an increased risk of death from cardiovascular causes [1]. Another meta-analysis found rosiglitazone associated with a significant increase in the risk of myocardial infarction [2], and a subsequent science advisory reported a nearly twofold increase in the risk of congestive heart failure [3].

However, other studies found no significant increase in cardiovascular risk with rosiglitazone. For example, a study published in 2009 in *The Lancet* found no significant difference in the risk of cardiovascular events between patients treated with rosiglitazone and those treated with other diabetes medications [4].

Rosiglitazone was withdrawn from the European market in 2010 due to its cardiovascular risk. The European Medicines Agency (EMA) implemented an immediate suspension of the drug, meaning that it was no longer available in Europe [5, 6]. The suspension of the marketing authorizations of rosiglitazone across the European Union was recommended by the EMA's Committee for Medicinal Products for Human Use (CHMP) [6]. The withdrawal of rosiglitazone from clinical use was also recommended in the UK [6, 7]. The withdrawal of rosiglitazone-containing medicines was the result of a Europe-wide review of the available data on the risks and benefits of rosiglitazone.

Although the FDA has also warned that rosiglitazone causes or exacerbates congestive heart failure in some patients [8], some scientific uncertainty about the cardiovascular safety of rosiglitazone medicines remains. In light of the re-evaluation of the Rosiglitazone Evaluated for Cardiovascular Outcomes and Regulation of Glycemia in Diabetes (RECORD) trial, the concern was substantially reduced, and the requirements of the rosiglitazone Risk Evaluation and Mitigation Strategy (REMS) program were to be modified [9].

#### **1.2 Looking at the present: the case of COVID-19 vaccines and immune thrombotic thrombocytopenia (VITT)**

The Regulatory Agency of the United Kingdom (Medicines and Healthcare Products Regulatory Agency, MHRA), followed by those of the United States (Food and Drug Administration, FDA) and the European Union (European Medicines Agency, EMA), issued an emergency use authorization for the first COVID-19 vaccine during December 2020 [10–12].

The main difference between a conditional marketing authorization in Europe (or an Emergency Use Authorization, EUA, in the United States) and full approval is the amount of data required by the regulatory agencies to grant approval. An EUA may be issued based on interim results from clinical trials, while a Biologics License requires the completion of clinical trials.

For example, for an EUA for a COVID-19 vaccine, the FDA requires that at least half of the clinical trial participants be followed for at least 2 months after vaccination. For full FDA approval of a COVID-19 vaccine, participants are followed for at least 6 months.

The first COVID-19 vaccines were Pfizer-BioNTech (later Comirnaty) [13] and Moderna (later Spikevax) [14], both mRNA vaccines. These vaccines contain mRNA that encodes the surface glycoprotein S (spike) of the SARS-CoV-2 virus and is encapsulated in a lipid shell that helps stabilize the RNA and facilitates entry of the vaccine into cells. To maintain the stability of the mRNA, they must be stored and transported under ultra-low temperature (ULT) conditions of −90°C to −60°C.

Both the Oxford-AstraZeneca (later Vaxzevria) [15] and Janssen (later JCOVDEN) [16] vaccines, as well as the Gam-COVID-Vac (Sputnik V) vaccine [17] in Russia, are carrier or vector vaccines, which instruct human cells to make the SARS-CoV-2 spike protein. For this vaccine technology, scientists engineer a harmless, inactivated common adenovirus (which can cause colds and other illnesses when active) that carries genetic code (DNA) into a vaccine recipient's cells. The code then instructs the cells to produce the spike protein that trains the body's immune system, which then creates antibodies and memory cells to protect against an actual SARS-CoV-2 infection.

Viral vector vaccines for COVID-19 (Oxford-AstraZeneca, Janssen-Johnson & Johnson, Sputnik V) are more robust than mRNA vaccines (Pfizer, Moderna): DNA is not as fragile as RNA, and the tough protein coat of the adenovirus helps protect the genetic material it contains. As a result, viral vector vaccines do not have to remain frozen and are expected to last at least 6 months if refrigerated at 2–8°C.

Thrombosis, the formation of blood clots in blood vessels, has been observed in a small number of people who have received certain COVID-19 vaccines. This has led to concerns and investigations into the potential link between thrombosis and COVID-19 vaccines.

In late February 2021, a prothrombotic syndrome was observed in a few individuals who had received the adenoviral vector-based Oxford-AstraZeneca vaccine. Subsequently, similar findings were observed in individuals who received the Janssen (Johnson & Johnson) vaccine, also based on an adenoviral vector [18, 19].

The specific type of blood clot that has been observed in a small number of people who have received the AstraZeneca and Janssen COVID-19 vaccines is called cerebral venous sinus thrombosis (CVST). This type of blood clot occurs in the veins that drain blood from the brain and can lead to serious complications if not treated promptly. Additionally, CVST and thrombocytopenia together are called thrombosis-thrombocytopenia syndrome (TTS); and TTS associated with COVID-19 vaccination has been termed vaccine-induced immune thrombotic thrombocytopenia (VITT) [20].

VITT is characterized by the presence of single or multiple thromboses, mainly venous but also arterial, with a certain predilection for unusual locations, such as the splanchnic territory or the cerebral venous sinuses. The presence of anti-platelet factor 4 (anti-PF4) antibodies causes platelet aggregation and micro- and macrothrombosis, producing marked thrombocytopenia and the characteristic thrombotic manifestations of the syndrome. VITT has been associated with nonreplicating adenovirus vector vaccines [21].

To mitigate the risks associated with thrombosis and COVID-19 vaccines, health authorities and vaccine manufacturers closely monitored the situation and implemented measures such as age restrictions and enhanced warning labels [22]. Additionally, individuals who receive COVID-19 vaccines are advised to be aware of the symptoms of thrombosis, which can include severe headache, chest pain, leg swelling, and shortness of breath, and to seek medical attention immediately if these symptoms occur.

It is important to note that the risk of developing thrombosis after receiving a COVID-19 vaccine appears to be much lower than the risk of developing thrombosis after contracting COVID-19 itself [23]. COVID-19 infection is associated with a higher risk of blood clots, and the benefits of vaccination in preventing COVID-19 and its serious complications far outweigh the risks.

#### **2. Materials and methods**

#### **2.1 Methodological bases**

First, it is important to review some current definitions concerning adverse reactions and signals.

There is currently no unified definition of the term 'signal' in pharmacovigilance. According to the World Health Organization (WHO), a signal or alert is a 'notification of a possible causal relationship between an adverse event and a drug, previously unknown or incompletely documented' [24], while according to the Pharmaceutical Research and Manufacturers of America-Food and Drug Administration (PhRMA-FDA) Collaborative Working Group on Safety Evaluation Tools, a signal is 'a relationship between a drug and an event that is strong enough, using a predetermined threshold or analyst-defined set of criteria, to warrant further evaluation' [25].

The spontaneous notification systems in pharmacovigilance maintain large databases and are mainly focused on the early detection of adverse reactions to marketed drugs, using data that accumulate over time [26].

In the past, this signal detection was based on case-by-case analysis. In recent years, data mining techniques, understood as the analysis of data from different perspectives and the extraction of relevant information from them, have become a more efficient method.

In the case of signal detection, these automated methods use algorithms to discover unexpected events within entire, large pharmacovigilance databases. The algorithms analyze how much the number of cases observed through notifications differs from the number of expected cases; that is, they calculate estimators of the disproportionality of notifications [27]. That is why, currently, in addition to the alerts generated in the Regional Pharmacovigilance Centers, active searches for signals can be carried out using these automated methods.

Since 1998, the WHO's Uppsala Monitoring Centre has been using a specific Bayesian method as an automated system for signal detection in the WHO database of suspected adverse drug reactions [28].


The method, called the Bayesian Confidence Propagation Neural Network (BCPNN), is based on the idea that, for each individual notification in a database of suspected ADRs, there is a probability that a specific reaction is already present in that database; that is, a prior probability is available. Conditioning on the reports that contain a specific drug yields the posterior probability. If the posterior probability is greater than the prior probability, it means, on the one hand, that the presence of the drug in a notification increases the probability that the reaction is present and, on the other hand, that the drug-ADR pair appears in the database more frequently than expected [28, 29].

In more mathematical detail, the BCPNN model is based on the calculation of the so-called Information Component (IC) for each drug-ADR combination in the entire database [30].

This IC is the logarithmic measure of disproportionality used by the BCPNN method. It is defined mathematically by

$$\mathrm{IC} = \log_2 \frac{P_{XY}}{P_X \, P_Y},$$

where $P_X$ is the probability of finding a certain drug in a notification, $P_Y$ is the probability of finding a given adverse reaction in a notification, and $P_{XY}$ is the probability of finding the drug-adverse reaction combination in a report [31]. The source of comparison is the entire database.

When calculated from a finite number of reports, this is really an estimate of the true value of the IC. Drug-ADR combinations with positive IC values are reported more frequently than expected, while those with negative ICs are reported less frequently than expected [31].
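To make the estimator concrete, here is a minimal Python sketch of the IC point estimate; the report counts are hypothetical, and the shrinkage priors that the full BCPNN places on these probabilities [28, 29] are omitted.

```python
import math

def information_component(n_xy: int, n_x: int, n_y: int, n: int) -> float:
    """Point estimate of IC = log2(P_XY / (P_X * P_Y)), with each
    probability estimated as a proportion of the n reports in the database."""
    p_xy = n_xy / n  # reports containing both the drug and the ADR
    p_x = n_x / n    # reports containing the drug
    p_y = n_y / n    # reports containing the ADR
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: the pair appears in 40 of 100,000 reports, the drug
# in 800, and the ADR in 1,500; IC > 0 means "reported more than expected".
print(information_component(40, 800, 1500, 100_000))  # ~1.74
```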

The BCPNN model adopts a Bayesian approach, assuming a prior distribution centered around a relative risk of 1 (RR0 = 1, interpreted as no relationship between drug and ADR), based on empirical evidence. The IC values are, therefore, averages of the posterior distribution of the true relative risk [28, 29]. From the IC distribution obtained, the exact variance and standard deviation are calculated, the latter being the measure of the IC's robustness [32].

Once the IC is calculated, a signal is generated if the 2.5% quantile of the IC distribution is greater than 0 ($Q_{0.025}(\mathrm{IC}) > 0$). The IC distribution was initially approximated by a normal distribution [28]; a more accurate model was subsequently proposed based on empirical evidence from the WHO database [29] and on extensions using Monte Carlo simulations [33].
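A sketch of this decision rule follows; it assumes a generic Dirichlet posterior over the four cells of the drug × ADR contingency table (a simplification, not the exact BCPNN posterior of [29, 33]) purely to show how a simulated 2.5% quantile feeds the criterion.

```python
import numpy as np

def ic_q025(n_xy: int, n_x: int, n_y: int, n: int,
            nb_mc: int = 10_000, seed: int = 0) -> float:
    """Approximate the 2.5% quantile of the IC posterior by Monte Carlo:
    draw cell probabilities from a Dirichlet posterior (uniform prior)
    over the 2x2 table and compute one IC value per draw."""
    rng = np.random.default_rng(seed)
    # Cells: (drug & ADR), (drug only), (ADR only), (neither).
    counts = np.array([n_xy, n_x - n_xy, n_y - n_xy, n - n_x - n_y + n_xy])
    draws = rng.dirichlet(counts + 1, size=nb_mc)
    p_xy = draws[:, 0]
    p_x = draws[:, 0] + draws[:, 1]
    p_y = draws[:, 0] + draws[:, 2]
    return np.quantile(np.log2(p_xy / (p_x * p_y)), 0.025)

# A signal is generated when the 2.5% quantile is above zero.
print(ic_q025(40, 800, 1500, 100_000) > 0)
```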

It should be noted that the IC value only provides a quantitative indication of the correlation between a medication and an ADR. To establish a causal relationship between them, the strength of the clinical diagnosis must be estimated by studying individual reports or through controlled trials [34].

#### *2.1.1 Extension to the multiple comparison setting*

The original decision rules for the automatic generation of signals in pharmacovigilance, including those of models such as the BCPNN, are based on arbitrary limits; that is, there is no signal evaluation measure associated with the adopted decision rule.

To address this, a review carried out within the general Bayesian decision framework of the BCPNN model applied in pharmacovigilance resulted in a new ordering procedure for drug-ADR pairs based on the posterior probability of the null hypothesis of interest [26]. This approach makes it possible to obtain, indirectly, Bayesian estimators of the false discovery rate (FDR) and the false negative rate (FNR) that serve as evaluation measures for the detected signals [35]. The key estimator is the calculated Bayesian false discovery rate, with the threshold for a positive signal fixed at FDR < 0.05. Bayesian estimators of sensitivity (Se) and specificity (Sp) are also considered useful [36].
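As an illustration of this ordering procedure, the sketch below ranks hypothetical drug-ADR pairs by post.H0 and retains the largest set whose estimated Bayesian FDR (the mean post.H0 of the retained pairs) stays below the threshold; it captures the general principle rather than the exact computations of [26, 35].

```python
def signals_by_fdr(post_h0, threshold=0.05):
    """Rank pairs by the posterior probability of the null hypothesis and
    keep the top-k set whose estimated Bayesian FDR (the running mean of
    post.H0 over the retained pairs) stays below the threshold."""
    ranked = sorted(post_h0.items(), key=lambda item: item[1])
    signals, total = [], 0.0
    for k, (pair, p0) in enumerate(ranked, start=1):
        total += p0
        if total / k > threshold:  # estimated FDR of the top-k set
            break
        signals.append(pair)
    return signals

# Hypothetical posterior probabilities of H0 for three pairs.
post_h0 = {("rosiglitazone", "cardiac failure"): 0.01,
           ("drug_b", "headache"): 0.03,
           ("drug_c", "nausea"): 0.40}
print(signals_by_fdr(post_h0))  # the first two pairs pass FDR < 0.05
```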

These Bayesian methods have been shown to outperform other data mining methods that use the proportional reporting ratio (PRR) or the reporting odds ratio (ROR) as estimators of disproportionality [37]. Additionally, the capacity of the BCPNN method for the early detection of new adverse drug reactions (ADRs) has been widely demonstrated [26, 28, 30, 34, 38–41].
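For context, the PRR and ROR mentioned above can be computed directly from a 2×2 table of reports; the sketch below uses hypothetical counts consistent with the earlier example.

```python
def prr_and_ror(a: int, b: int, c: int, d: int) -> tuple:
    """Classical disproportionality estimators from spontaneous reports:
    a: reports with the drug and the event
    b: reports with the drug, without the event
    c: reports with other drugs and the event
    d: reports with other drugs, without the event
    """
    prr = (a / (a + b)) / (c / (c + d))  # proportional reporting ratio
    ror = (a / b) / (c / d)              # reporting odds ratio
    return prr, ror

print(prr_and_ror(40, 760, 1460, 97_740))  # hypothetical counts; both ~3.4-3.5
```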

#### **2.2 A strategy for early detection of safety signals**

Adaptations of this methodology can clearly be valuable and trustworthy when the signals are interpreted correctly [42]. The new strategy consists of contrasting all the ADRs of a specific Anatomical Therapeutic Chemical (ATC) Classification System subgroup, isolated from the integral database.

The algorithm was performed with the following arguments: the value of the relative risk (RR) to be proven higher than 1 (RR > 1 or RR > 2); the minimum number of cases per [drug-adverse reaction] pair to be potentially considered as a signal (N = 1); the decision rule for the generation of signals: the false discovery rate (FDR); the limit or threshold for the decision rule: FDR < 0.05; the statistic used for ordering the drug-ADR pairs: the posterior probability of the null hypothesis (post.H0); and the calculation of the distribution of the statistic of interest: by approximation to the normal distribution [28, 32] and by empirical estimation through Monte Carlo simulations (NB.MC = 10,000 or NB.MC = 50,000) [33]. The estimators FDR < 0.05 and specificity (Sp) ≥ 0.99 are considered when interpreting the results; since sensitivity (Se) values are typically low in the BCPNN approach [43], Se ≥ 0.20 is taken as a reference. These settings are summarized in the sketch below.
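For orientation, these settings can be gathered into a single configuration; the key names below are illustrative, not those of any particular pharmacovigilance package.

```python
# Illustrative encoding of the settings listed above; names are hypothetical.
strategy_settings = {
    "RR0": 1,                     # null relative risk (runs repeated with RR0 = 2)
    "min_cases_per_pair": 1,      # N = 1 report per [drug-adverse reaction] pair
    "decision_rule": "FDR",       # signal if estimated FDR < 0.05
    "fdr_threshold": 0.05,
    "rank_statistic": "post.H0",  # posterior probability of the null hypothesis
    "distribution": "normal approximation or Monte Carlo",
    "nb_mc": 10_000,              # also run with 50_000 simulations
    "interpretation": {"FDR": "< 0.05", "Sp": ">= 0.99", "Se": ">= 0.20"},
}
```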

An FDR below 0.05 assures that at least 95% of the signals detected are true positives (at most 5% false positives). Moreover, if the estimator of false negatives (FNR) is 50% or lower, at least half of the signals rejected are effectively negative. In the results presented, all the FNRs were lower than 49%.

All signals were obtained and categorized according to standard terminology, namely the preferred terms (PT) of the Medical Dictionary for Regulatory Activities (MedDRA) [44].

Applying the strategy described to our two cases of interest, the algorithms were performed as follows.

#### **2.3 Looking back: early detection strategy for rosiglitazone**


The data analyzed were the suspected ADR notifications for antidiabetics reported in Spain until 2008, drawn from the FEDRA database of the Spanish Pharmacovigilance System [45].

This information was requested from a Spanish Regional Pharmacovigilance Center (Valladolid) with prior permission from the Spanish Medicines Agency.

For the treatment of the information on suspected ADRs, the 'Criteria for the use of data from the FEDRA Database of the Spanish Pharmacovigilance System' (SEFV/1/CT) and the 'Rules for the correct interpretation and use of the data of the SEFV' (SEFV/2/CT) were followed.

#### **2.4 Looking at the present: early detection strategy for COVID-19 vaccines**


The data analyzed were the suspected ADR notifications for COVID-19 vaccines recorded until the end of January 2021 in VigiAccess™ [47], the public interface to the WHO global database VigiBase® [46]. This database allows searching by active ingredient (not brand names) across data coming from over 110 countries, after undersigning a statement of responsibility for the appropriate use and interpretation of the data ('Important points to consider') [46]. It is not possible in VigiAccess to separate the numbers for specific vaccines.

This information was requested in the form of a free consultation. To access the search function, one must confirm having read and understood the statements for the treatment of information on suspected ADRs [47].

#### **3. Results**

#### **3.1 Looking back for rosiglitazone**

The only signals reported for heart failure appeared for the active ingredient rosiglitazone alone or for the combination rosiglitazone plus metformin (see **Table 1**).


#### **Table 1.**

*Heart failure, cardiovascular and related positive safety signals detected among notifications of antidiabetics in Spain, until 2008.*

Given that the data considered for this approach ran only until 2008, this shows that the cardiovascular risk of rosiglitazone could have been detected 2 years ahead of its international alert and subsequent withdrawal in 2010.

No other cardiac risk was detected. Among the vascular signals, for RR > 1 there were peripheral edema for detemir and for rosiglitazone plus metformin, edema for pioglitazone, and angioedema for rosiglitazone; for RR > 2, edema for rosiglitazone and for rosiglitazone plus metformin, and peripheral edema for pioglitazone were obtained.

Complete results for the 'heart failure, cardiovascular, and related' signals during that period are reported in **Table A1** for RR0 > 1 and in **Table A2** for RR0 > 2. It is relevant that signals appeared for both RR0 > 1 and RR0 > 2, and without more Monte Carlo simulations than in the referenced method.

#### **3.2 Looking at the present for COVID-19 vaccines**

Considering the standard relative risk (RR0 > 1) and Monte Carlo simulations, we can only detect thrombotic events with the smallpox vaccine. It makes more sense to increase the number of Monte Carlo simulations than the relative risk, because the data reported cover only 2 months, without cumulative information. So, keeping the same relative risk (RR0 > 1) and increasing the number of Monte Carlo simulations, COVID-19 vaccines appear in the results with five different types of thrombotic events, while the smallpox vaccine signals remain almost the same. That the latter are almost the same in both situations supports the consistency of the results for the COVID-19 vaccines (see **Table 2**). Especially interesting are the signal of CVST and those of other unusual thrombotic locations, such as the pelvic veins.

Considering the data and study periods, this shows that the risk of CVST could have been detected 2 months before its international alert and subsequent management.

Complete results for the 'thrombosis, thrombocytopenia, and related' events are reported in **Table A3** (Monte Carlo simulations = 10,000) and in **Table A4** (Monte Carlo simulations = 50,000).


*N (count), number of pairs 'active ingredient-ADR' reported; RR0, Relative Risk; NB.MC, Number of Monte Carlo simulations; and FDR, False Discovery Rate.*

#### **Table 2.**

*Thrombosis, thrombocytopenia and related positive safety signals detected among notifications of COVID-19 vaccines in VigiAccess™, until the end of January 2021.*


#### **4. Discussion**

#### **4.1 Looking back: positive signal of heart failure and rosiglitazone**

This thiazolidinedione was marketed in 2001 and was exclusively indicated for combination with other antidiabetics in patients with type 2 diabetes mellitus in whom treatment with metformin or sulfonylureas is ineffective or contraindicated. It was presented with the potential advantage of a better safety profile at the cardiac level [48] and was considered an active principle of 'eventual utility' (terminology indicating that the novelty brings some modest but real improvement, which may be useful in some eventual clinical situation) [49]. Subsequently, since 2007, rosiglitazone was the subject of multiple safety information notes related to cardiac risk [50–52], the risk of fractures in women [53], and its benefit-risk ratio [54]. In 2008, a warning about its cardiovascular risk was issued once again [55].

The safety warnings issued during its commercialization did not affect the availability of its three presentations until its suspension in December 2010.

It is striking that, although no fatal cases of heart failure related to rosiglitazone were reported in the study period, severe cases were reported.

Finally, in 2010, after more results were available on its benefit-risk relationship [56, 57], its commercialization was suspended [58–60]; specific recommendations were issued for patients receiving treatment with rosiglitazone so that, under medical supervision, they received an alternative treatment appropriate to their case [61, 62].

It is important to note the marketed fixed-dose combinations of rosiglitazone and the related results obtained in this study. Since the start of their commercialization, both the combination of rosiglitazone and metformin (2005) and glimepiride-rosiglitazone (2007) have been classified as novelties that 'do not contribute anything new' [63] and 'do not represent a therapeutic advance' [64, 65], respectively. Both combinations could cause heart failure, among other adverse reactions, as we also found for rosiglitazone plus metformin, and the advantage of their use was limited to simplifying treatment to facilitate therapeutic compliance.

Finally, and at the same time as rosiglitazone alone, in 2010, the marketing of rosiglitazone-metformin and glimepiride-rosiglitazone was suspended due to evidence of cardiovascular risk associated with rosiglitazone [51, 54–56, 58, 59].

Nowadays, the availability of rosiglitazone varies depending on the country and its health regulation. In the United States, for example, rosiglitazone is still available, but it can only be prescribed to patients who cannot control their diabetes with other medications and who have been informed about the associated cardiovascular risks. In Europe, some countries have allowed its use in special circumstances. In the United Kingdom, rosiglitazone can be prescribed in exceptional cases when other treatments are ineffective or contraindicated.

In other countries, such as Australia, Canada, and Japan, rosiglitazone is available, but its use has been recommended with caution due to the associated cardiovascular risks.

Ongoing research has been conducted to better understand the cardiovascular risks associated with rosiglitazone. A more recent systematic review and meta-analysis of the effects of rosiglitazone treatment on cardiovascular risk and mortality found that rosiglitazone is associated with an increased cardiovascular risk, especially for heart failure events [66]. This study also found that the strength of the evidence varied, that effect estimates were attenuated when data sources and analytical approaches were varied, and that conclusions were corrected subsequently.

Another study found that rosiglitazone is associated with a significantly increased risk of heart failure and a slightly increased risk of myocardial infarction, without a significantly increased risk of stroke, cardiovascular mortality, or all-cause mortality compared with placebo or active controls [67].

It is important to note that information about the availability of rosiglitazone may change over time, so it is always advisable to consult with a healthcare professional or local regulatory authority for the most up-to-date information.

#### **4.2 Looking at the present: positive signal of thrombosis-thrombocytopenia syndrome associated with COVID-19 vaccination (VITT)**

The viral vector-based vaccines most similar to the COVID-19 vaccines existing up to that time were the smallpox and Ebola vaccines; in fact, in 2021 the COVID-19 vaccines were first included in the same ATC group, J07BX.

The COVID-19 vaccine signal obtained that is most closely related to actual VITT was cerebral venous sinus thrombosis (CVST). The smallpox vaccine also shows signals for thrombotic events and thrombocytopenia at both low and high numbers of simulations. This result can act as a control, because all the COVID-19 signals appear only with a high number of simulations (see **Table 2**), in accordance with the limited spontaneous-report data available in the first month of worldwide vaccination.

In essence, at the beginning of 2021 it would have been useful to monitor these types of ADRs related to vector-based vaccines in order to detect early signals with COVID-19 vaccines, as the results discussed here show.

As previously mentioned, this CVST, after being renamed VITT, consists of a rare autoimmune response, more common in women under 60 years of age, which presents as thrombus formation in the cerebral sinuses (intracranial) or in abdominal veins, associated with a low platelet count. It occurs between the third and twenty-first day postvaccination. Some authors recommended close monitoring of at-risk patients every 2–3 days, especially during the above-mentioned interval, and particularly the first 15 days after vaccination [68].

COVID-19 itself also carries a high risk of thrombosis and coagulation abnormalities in hospitalized individuals [20]. However, it is important to note that the benefits of COVID-19 vaccination in preventing severe illness and death outweigh the risks of these rare adverse events.

In March 2021, countries in Europe and elsewhere paused the AstraZeneca vaccine after a handful of people – mostly women younger than 60 – developed VITT. The European Medicines Agency (EMA) investigated the situation, concluded that these complications should be listed as very rare side effects of the AstraZeneca vaccine, and stated that the benefits still outweighed the risks [69]. Several countries nonetheless restricted the use of the vaccine because of the clots.

Even though data for individual branded COVID-19 vaccines cannot be extracted from VigiAccess™, it is striking that VITT cases only began to appear in March 2021 in Europe and not in the United States, where the AstraZeneca vaccine was not authorized at that time. In turn, 75% of the spontaneous reports as of January 25, 2021, had been notified from Europe and 25% from the Americas.

Indeed, in the United States a small number of serious blood clots was also reported, in people who may have received the AstraZeneca vaccine abroad.


At that point, this observation and the signals detected could suggest that mRNA-based vaccines do not present the same risk of generating VITT as adenovirus-based vaccines. Since vaccination advanced almost in parallel in Europe and the United States, the difference in adverse reactions would be explained by the types of vaccines authorized.

#### **4.3 Looking at our early signal detection strategy**

Data mining can be very useful for detecting adverse drug reactions. With the increasing availability of electronic health records and other digital health data sources, data mining techniques can help to identify previously unknown or poorly understood ADRs by analyzing large datasets [70–72].

Overall, data mining can help to improve drug safety by detecting adverse reactions that may not have been identified through traditional methods, such as clinical trials or spontaneous reporting systems. These methods can help to detect patterns and relationships within the data that may not be immediately apparent, including potential associations between drugs and adverse events.

Traditional data mining algorithms can perform disproportionality analysis on spontaneous reporting system data to improve drug safety surveillance [71], but this requires access to huge, complete databases.

This strategy brings agility and has fewer requirements than analysis of an extended database. The approach opens the possibility of sustainable follow-up for specific ATC groups of interest using small databases.

The strategy presented can be extended to other groups of active ingredients that, owing to their mechanism of action or therapeutic approach, already present some associated risks, allowing a preliminary study of possible adverse reactions to new drugs being included in the same ATC group.

#### **5. Limitations of the study**

In the preliminary analyses, the sensitivity values of the BCPNN methodology are, as is known, typically low [43]. Nonetheless, this is acceptable given its very high specificity and its low but conservative sensitivity.

VigiAccess™ only allows searching by active ingredient, which makes it impossible to separate the numbers for branded vaccines using this free database.

The full list of the other events reported in each case study, and of the signals obtained, is not provided in this manuscript, but all the algorithms were performed taking all of them into account.

#### **6. Conclusions**

The strategy presented for the early detection of adverse drug reaction signals has been shown to be useful: looking back, it would have detected the signal of rosiglitazone and cardiovascular risk; looking at the present, it detected the signal of thrombosis-thrombocytopenia syndrome associated with COVID-19 vaccination.

The advantage of this data mining approach compared with the standard IC-based BCPNN, or with other methods based on the proportional reporting ratio (PRR) or the reporting odds ratio (ROR) as estimators of disproportionality, is the versatility shown in using ATC-group records from specific studies or from international databases; this also validates it as a useful method for the early detection of ADRs. Its application could help to improve drug and vaccine safety and reduce health risks to patients.

Definitively, ADR signals should be given more consideration as a basis for study and for regulatory risk-minimization actions in pharmacovigilance, also reducing financial costs.

#### **Acknowledgements**

To the European University of Miguel de Cervantes (UEMC, Valladolid, Spain), for giving me time and permission to perform this study.

#### **Conflict of interest**

The author declares no conflict of interest.


### **A. Appendix**

*N (count), Number of pairs 'active ingredient-ADR' reported; post.H0, posterior probability of the null hypothesis; FDR, False Discovery Rate; FNR, False Negative Rate; Se, Sensitivity; and Sp, Specificity.*

#### **Table A1.**

*Heart failure, cardiovascular and related positive safety signals detected among notifications of antidiabetics in Spain, until 2008, and with relative risk (RR)>1.*

*Data Mining Strategy to Prevent Adverse Drug Events: The Cases of Rosiglitazone and COVID-19… DOI: http://dx.doi.org/10.5772/intechopen.112412*


*N (count), Number of pairs 'active ingredient-ADR' reported; post.H0, posterior probability of the null hypothesis; FDR, False Discovery Rate; FNR, False Negative Rate; Se, Sensitivity; and Sp, Specificity.*

#### **Table A2.**

*Heart failure, cardiovascular and related positive safety signals detected among notifications of antidiabetics in Spain, until 2008, and with relative risk (RR)>2.*


*N (count), Number of pairs 'active ingredient-ADR' reported; post.H0, posterior probability of the null hypothesis; FDR, False Discovery Rate; FNR, False Negative Rate; Se, Sensitivity; and Sp, Specificity.*

#### **Table A3.**

*Thrombosis, thrombocytopenia and related positive safety signals detected among notifications of COVID-19 vaccines in VigiAccess™, until the end of January 2021, and with relative risk (RR)>1.*



*N (count), Number of pairs 'active ingredient-ADR' reported; post.H0, posterior probability of the null hypothesis; FDR, False Discovery Rate; FNR, False Negative Rate; Se, Sensitivity; and Sp, Specificity.*

#### **Table A4.**

*Thrombosis, thrombocytopenia and related positive safety signals detected among notifications of COVID-19 vaccines in VigiAccess™, until the end of January 2021, and with relative risk (RR)>1 and Monte Carlo simulations NB.MC = 50,000.*

### **Author details**

Maria-Isabel Jimenez-Serrania

ADViSE Group, Department of Health Science, Miguel de Cervantes European University (UEMC), Valladolid, Spain

\*Address all correspondence to: ijimenez@uemc.es

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] Nissen SE, Wolski K. Effect of rosiglitazone on the risk of myocardial infarction and death from cardiovascular causes. New England Journal of Medicine. 2007;**356**(24):2457-2471. DOI: 10.1056/NEJMoa072761

[2] Kaul S, Diamond GA. Rosiglitazone and cardiovascular risk. Current Atherosclerosis Reports. 2008;**10**(5):398- 404. DOI: 10.1007/s11883-008-0062-7

[3] Kaul S, Bolger AF, Herrington D, Giugliano RP, Eckel RH. Thiazolidinedione drugs and cardiovascular risks: A science advisory from the American Heart Association and American College of Cardiology Foundation. Circulation. 2010;**121**(16):1868-1877. DOI: 10.1161/ CIR.0b013e3181d34114

[4] Home PD, Pocock SJ, Beck-Nielsen H, et al. Rosiglitazone evaluated for cardiovascular outcomes in oral agent combination therapy for type 2 diabetes (RECORD): A multicentre, randomised, open-label trial. Lancet. 2009;**373**(9681):2125-2135. DOI: 10.1016/ S0140-6736(09)60953-3

[5] Cheung BM. Behind the rosiglitazone controversy. Expert Review of Clinical Pharmacology. 2010;**3**(6):723-725. DOI: 10.1586/ecp.10.126

[6] Medicines and Healthcare products Regulatory Agency. Rosiglitazone: recommended withdrawal from clinical use [Internet]. 2010. Available from: www.gov.uk/drug-safety-update/ rosiglitazone-recommended-withdrawalfrom-clinical-use [Accessed: February 23, 2023]

[7] National Prescribing Centre (NPC). NPC Archive Item: Rosiglitazone: withdrawal from clinical use [Internet]. 2010. Available from: www.centreformedicinesoptimisation.co.uk/rosiglitazone-withdrawal-from-clinicaluse/ [Accessed: February 23, 2023]

[8] Food and Drug Administration. FDA Drug Safety Communication: Ongoing review of Avandia (rosiglitazone) and cardiovascular safety [Internet]. 2010. Available from: www.fda. gov/drugs/postmarket-drug-safetyinformation-patients-and-providers/ fda-drug-safety-communicationongoing-review-avandia-rosiglitazoneand-cardiovascular-safety [Accessed: February 23, 2023]

[9] Food and Drug Administration. FDA Drug Safety Communication: Ongoing review of Avandia (rosiglitazone) and cardiovascular safety. FDA Drug Safety Communication: FDA requires removal of some prescribing and dispensing restrictions for rosiglitazone-containing diabetes medicines [Internet]. 2013. Available from: www.fda.gov/ drugs/drug-safety-and-availability/ fda-drug-safety-communication-fdarequires-removal-some-prescribingand-dispensing-restrictions [Accessed: February 23, 2023]

[10] Medicines and Healthcare products Regulatory Agency (MHRA). Regulatory approval of Pfizer/BioNTech vaccine for COVID-19. [Internet]. 2020. Available from: www.gov.uk/government/ publications/regulatory-approval-ofpfizer-biontech-vaccine-for-covid-19 [Accessed: May 3, 2023]

[11] U.S. Department of Health and Human Services. COVID-19 Vaccine Milestones. [Internet]. 2023. Available from: www.hhs.gov/coronavirus/covid-19-vaccines/index.html [Accessed: May 3, 2023]

[12] European Medicines Agency. COVID-19 vaccines: Authorised. [Internet]. 2023. Available from: www. ema.europa.eu/en/human-regulatory/ overview/public-health-threats/ coronavirus-disease-covid-19/ treatments-vaccines/vaccines-covid-19/ covid-19-vaccines-authorised [Accessed: May 3, 2023]

[13] Committee for Medicinal Products for Human Use (CHMP). Comirnaty. Annex I. Summary of product characteristics [Internet]. 2021. Available from: https://www.ema.europa.eu/ documents/product-information/ comirnaty-epar-product-information\_ en.pdf [Accessed: October 17, 2022]

[14] Committee for Medicinal Products for Human Use (CHMP). Spikevax. Annex I. Summary of product characteristics [Internet]. 2021. Available from: https:// www.ema.europa.eu/documents/ product-information/spikevaxpreviously-covid-19-vaccine-modernaepar-product-information\_en.pdf [Accessed: October 17, 2022]

[15] Committee for Medicinal Products for Human Use (CHMP). Vaxzevria. Annex I. Summary of product characteristics [Internet]. 2021. Available from: https://www.ema.europa.eu/ documents/product-information/ vaxzevria-previously-covid-19-vaccineastrazeneca-epar-product-information\_ en.pdf [Accessed: October 17, 2022]

[16] Committee for Medicinal Products for Human Use (CHMP). Jcovden. Annex I. Summary of product characteristics [Internet]. 2021. Available from: https://www.ema.europa.eu/ documents/product-information/ jcovden-previously-covid-19-vaccinejanssen-epar-product-information\_ en.pdf [Accessed: October 17, 2022]

[17] Central Drugs Standard Control Organisation (CDSCO). Sputnik V. Summary of product characteristics [Internet]. 2021. Available from: https:// cdsco.gov.in/opencms/resources/ UploadCDSCOWeb/2018/UploadSmPC/ SMPCsputinikdr.Reddys.pdf [Accessed: October 17, 2022]

[18] Medicines and Healthcare products Regulatory Agency (MHRA). Yellow Card. COVID-19 Vaccine AstraZeneca. [Internet]. 2021. Available from: https://yellowcard.mhra.gov.uk/idaps/ CHADOX1%20NCOV-19 [Accessed: May 1, 2023]

[19] Greinacher A, Thiele T, Warkentin TE, Weisser K, Kyrle PA, Eichinger S. Thrombotic thrombocytopenia after ChAdOx1 nCov-19 vaccination. The New England Journal of Medicine. 2021;**384**(22):2092-2101. DOI: 10.1056/ NEJMoa2104840

[20] Warkentin TE, Cuker A. COVID-19: vaccine-induced immune thrombotic thrombocytopenia (VITT) [Internet]. 2023. Available from: www.uptodate. com/contents/covid-19-vaccine-inducedimmune-thrombotic-thrombocytopeniavitt [Accessed: May 3, 2023]

[21] García-Azorín D, Lázaro E, Ezpeleta D, et al. Thrombosis with Thrombocytopenia Syndrome following adenovirus vector-based vaccines to prevent COVID-19: Epidemiology and clinical presentation in Spain. Neurologia. 2022. DOI: 10.1016/j.nrl.2022.04.010 [Online ahead of print]

[22] World Health Organization (WHO). Guidance for clinical case management of thrombosis with thrombocytopenia syndrome (TTS) following vaccination to prevent coronavirus disease (COVID-19). [Internet]. 2021. Available from: apps.who.int/iris/bitstream/handle/10665/342999/WHO-2019-nCoV-TTS-2021.1-eng.pdf?sequence=1&isAllowed=y [Accessed: May 1, 2023]

[23] European Medicines Agency (EMA). COVID-19 vaccine AstraZeneca: benefits still outweigh the risks despite possible link to rare blood clots with low blood platelets. [Internet]. 2021. Available from: www.ema.europa. eu/en/news/covid-19-vaccineastrazeneca-benefits-still-outweighrisks-despite-possible-link-rare-bloodclots#:~:text=COVID%2D19%20 Vaccine%20AstraZeneca%20is- ,blood%20to%20clot)%20after%20 vaccination [Accessed: May 1, 2023]

[24] Lindquist M, Edwards IR, Bate A, et al. From association to alert: A revised approach to international signal analysis. Pharmacoepidemiology and Drug Safety. 1999;**8**(Suppl 1):S15-S25. DOI: 10.1002/ (sici)1099-1557(199904)8:1+3.3.co;2-2

[25] Almenoff J, Tonning JM, Gould AL, et al. Perspectives on the use of data mining in pharmaco-vigilance. Drug Safety. 2005;**28**:981-1007. DOI: 10.2165/00002018-200528110- 00002

[26] Ahmed I, Haramburu F, Fourrier-Reglat A, et al. Bayesian pharmacovigilance signal detection methods revisited in a multiple comparison setting. Statistics in Medicine. 2009;**28**:1774-1792. DOI: 10.1002/sim.3586

[27] Figueiras A, editor. New Era in Drug Safety: A Challenge for Public Health. Santiago de Compostela: Táktika Comunicación; 2009

[28] Bate A, Lindquist M, Edwards IR, et al. A Bayesian neural network method for adverse drug reaction signal generation. European Journal of Clinical Pharmacology. 1998;**54**:315-321. DOI: 10.1007/s002280050466

[29] Norén GN, Bate A, Orre R, et al. Extending the methods used to screen the WHO drug safety database towards analysis of complex associations and improved accuracy for rare events. Statistics in Medicine. 2006;**25**:3740-3757. DOI: 10.1002/sim.2473

[30] Bate A, Lindquist M, Orre R, et al. Data-mining analyses of pharmacovigilance signals in relation to relevant comparison drugs. European Journal of Clinical Pharmacology. 2002;**58**:483-490. DOI: 10.1007/ s00228-002-0484-z

[31] Rodríguez JL. Data mining in the WHO database. Knowledge detection. Information Centers for Medicines and Pharmacovigilance. 2023. Available from: evirtual.uaslp.mx/FCQ/ farmaciahospitalaria/Paginas/modulo5. aspx [Accessed: April 4, 2023]

[32] Gould AL. Practical pharmacovigilance analysis strategies. Pharmacoepidemiology and Drug Safety. 2003;**12**(7):559-574. DOI: 10.1002/ pds.771

[33] Norén N, editor. A Monte Carlo Method for Bayesian Dependency Derivation. Gothenburg: Chalmers University of Technology; 2002

[34] Bate A, Lindquist M, Edwards IR. The application of knowledge discovery in databases to post-marketing drug safety: Example of the WHO database. Fundamental & Clinical Pharmacology. 2008;**22**(2):127-140. DOI: 10.1111/j.1472-8206.2007.00552.x

[35] Müller P, Parmigiani G, Robert C, et al. Optimal sample size for multiple testing: The case of gene expression microarrays. Journal of the American Statistical Association. 2004;**99**:990-1001. DOI: 10.1198/ 016214504000001646

[36] McLachlan GJ, Bean RW, Jones LB. A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics. 2006;**22**:1608-1615. DOI: 10.1093/bioinformatics/btl148

[37] Roux E, Thiessard F, Fourrier A, et al. Evaluation of statistical association measures for the automatic signal generation in pharmacovigilance. IEEE Transactions on Information Technology in Biomedicine. 2005;**9**:518-527. DOI: 10.1109/titb.2005.855566a

[38] Bate A. Bayesian confidence propagation neural network. Drug Safety. 2007;**30**:62. DOI: 10.2165/00002018- 200730070-00011

[39] Zorych I, Madigan D, Ryan P, et al. Disproportionality methods for pharmacovigilance in longitudinal observational databases. Statistical Methods in Medical Research. 2013;**22**(1):39-56. DOI: 10.1177/0962280211403602

[40] Bate A, Evans SJ. Quantitative signal detection using spontaneous ADR reporting. Pharmacoepidemiology and Drug Safety. 2009;**18**:427-436. DOI: 10.1002/pds.1742

[41] Lindquist M, Stahl M, Bate A, et al. A retrospective evaluation of a data mining approach to aid finding new adverse drug reaction signals in the WHO international database. Drug Safety. 2000;**23**:533-542. DOI: 10.2165/00002018-200023060- 00004

[42] Norén GN, Edwards IR. Modern methods of pharmacovigilance: Detecting adverse effects of drugs. Clinical Medicine (London, England). 2009;**9**(5):486-489. DOI: 10.7861/clinmedicine.9-5-486

[43] Tada K, Maruo K, Isogawa N, Yamaguchi Y, Gosho M. Borrowing external information to improve Bayesian confidence propagation neural network. European Journal of Clinical Pharmacology. 2020;**76**(9):1311-1319. DOI: 10.1007/s00228-020-02909-w

[44] International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH). MedDRA Hierarchy. In: Medical Dictionary for Regulatory Activities [Internet]. 2018. Available from: www. meddra.org/how-to-use/basics/hierarchy [Accessed: March 26, 2023]

[45] Spanish System of Pharmacovigilance of Medicines for Human Use. FEDRA database (Spanish Pharmacovigilance, Adverse Reaction Data) 2000-2008. Spanish Medicines Agency. [Internet]. 2008. Available from: www.aemps.gob. es/medicamentos-de-uso-humano/ farmacovigilancia-de-medicamentosde-uso-humano/informacion-sobre-elacceso-a-los-datos-de-fedra/ [Accessed: March 26, 2023]

[46] Uppsala Monitoring Centre. WHO Collaborating Centre for International Drug Monitoring. VigiBase® [Internet]. 2019. Available from: www.who-umc. org/vigibase/vigibase [Accessed: April 21, 2023]

[47] Uppsala Monitoring Centre. WHO Collaborating Centre for International Drug Monitoring. VigiAccessTM [Internet]. 2023. Available from: https:// www.vigiaccess.org/ [Accessed: February 15, 2023]

[48] Cuesta MT, Martínez M. New active principles. Rosiglitazone. Inf Ter Sis Nac Salud. 2001;**25**:112-113


[49] CADIME, EASP. [Rosiglitazone (DCI)]. Ficha de novedad terapéutica. 2001;1-2

[50] European Medicines Agency (EMEA). EMEA statement on recent publication on cardiac safety of rosiglitazone (Avandia, Avandamet, Avaglim). Press release. [Internet]. 2009. Available from: www.ema.europa. eu/docs/en\_GB/document\_library/ Press\_release/2009/11/WC500013467. pdf [Accessed: February 13, 2023]

[51] Spanish Agency for Medicines and Health Products. Informative note 2007/08. [Cardiac risk associated with rosiglitazone: Communication from the AEMPS on the recently published data] [Internet]. 2007. Available from: www.aemps.gob. es/informa/notasInformativas/ medicamentosUsoHumano/ seguridad/2007/docs/NI\_2007-08\_ rosiglitazona.pdf [Accessed: February 13, 2023]

[52] Ministerio de Sanidad PSeI. [Rosiglitazone: Cardiac risk] (ref.: 2007/08, May). Inf Ter Sis Nac Salud. 2007;**31**:101-102

[53] General Subdirectorate of Medicines for Human Use – Spanish Agency of Medicines and Health Products. Informative note 2007/05. [Rosiglitazone and pioglitazone: increased risk of fractures in women] [Internet]. 2007. Available from: www.aemps. gob.es/informa/notasInformativas/ medicamentosUsoHumano/ seguridad/2007/docs/NI\_2007- 05\_rosiglitazona.pdf [Accessed: March 8, 2023]

[54] General Subdirectorate of Medicines for Human Use – Spanish Agency of Medicines and Health Products. Informative note 2007/13. [Pioglitazone and rosiglitazone: Conclusions from the evaluation of the benefit-risk balance in Europe] [Internet]. 2007. Available from: www.aemps.gob.es/informa/notasInformativas/medicamentosUsoHumano/seguridad/2007/docs/NI\_2007-13\_glitazonas.pdf [Accessed: March 8, 2023]

[55] General Subdirectorate of Medicines for Human Use – Spanish Agency of Medicines and Health Products. Informative note 2008/02. [Rosiglitazone and cardiovascular risk: New contraindications and restrictions of use] [Internet]. 2008. Available from: www.aemps.gob. es/informa/notasInformativas/ medicamentosUsoHumano/ seguridad/2008/docs/NI\_2008- 02\_rosiglitazona.pdf [Accessed: March 8, 2023]

[56] General Subdirectorate of Medicines for Human Use – Spanish Agency of Medicines and Health Products. Informative note 2010/08. [Information on the benefit-risk assessment of rosiglitazone] [Internet]. 2010. Available from: www.aemps. gob.es/informa/notasInformativas/ medicamentosUsoHumano/ seguridad/2010/docs/NI\_2010- 08\_rosiglitazona.pdf [Accessed: March 8, 2023]

[57] Spanish Ministry of Health PSeI. Rosiglitazone: information on the benefit-risk balance (ref.: 2010/08, July). Inf Ter Sis Nac Salud. 2010;**34**:131

[58] Spanish Agency for Medicines and Health Products. Informative note 2010/12. [Rosiglitazone (Avandia®, Avaglim®, Avandamet®): marketing suspension] [Internet]. 2010. Available from: www.aemps.gob.es/informa/notasInformativas/medicamentosUsoHumano/seguridad/2010/docs/NI\_2010-12\_rosiglitazona.pdf [Accessed: March 8, 2023]

[59] Spanish Agency for Medicines and Health Products. Informative note 2010/18. [Rosiglitazone (Avandia®, Avaglim®, Avandamet®): Marketing suspension on December 29. Information for healthcare professionals] [Internet]. 2010. Available from: www.aemps.gob.es/informa/notasInformativas/medicamentosUsoHumano/seguridad/2010/docs/NI\_2010-18\_rosiglitazona.pdf [Accessed: March 8, 2023]

[60] Spanish Ministry of Health PSeI. Rosiglitazone: Marketing suspension (ref: 2010/12 and 18, September and December). Inf Ter Sis Nac Salud. 2011;**35**:67

[61] Spanish Agency for Medicines and Health Products. Informative note 2010/05. [Rosiglitazone (Avandia®, Avaglim®, Avandamet®): marketing suspension. Information for patients on drug safety] [Internet]. 2010. Available from: www.aemps.gob.es/informa/ notasInformativas/medicamentos UsoHumano/seguridad/ciudadanos/ 2010/docs/NIP\_2010-05\_rosiglitazona. pdf [Accessed: March 8, 2023]

[62] Spanish Agency for Medicines and Health Products. Informative note 2010/06. [Rosiglitazone (Avandia®, Avaglim®, Avandamet®): marketing suspension on December 29. Information for patients on drug safety] [Internet]. 2010. Available from: www.aemps. gob.es/informa/notasInformativas/ medicamentosUsoHumano/seguridad/ ciudadanos/2010/docs/NIP\_2010- 06\_rosiglitazona.pdf [Accessed: March 8, 2023]

[63] CADIME, EASP. Rosiglitazone (DCI)/Metformin (DCI). Ficha de novedad terapéutica. 2005;**2005**:1-2 [64] CADIME, EASP. Glimepiride (DCI)/ Rosiglitazone (DCI). Ficha de novedad terapéutica. 2008;**2008**:1-2

[65] CADIME, EASP. Pioglitazone (DCI)/ Metformin (DCI). Ficha de novedad terapéutica. 2009;**2009**:1-2

[66] Wallach JD, Wang K, Zhang AD, et al. Updating insights into rosiglitazone and cardiovascular risk through shared data: Individual patient and summary level meta-analyses. BMJ. 2020;**368**:l7078. DOI: 10.1136/bmj.l7078

[67] Cheng D, Gao H, Li W. Long-term risk of rosiglitazone on cardiovascular events – A systematic review and meta-analysis. Endokrynologia Polska. 2018;**69**(4):381-394. DOI: 10.5603/ EP.a2018.0036

[68] Bilotta C, Perrone G, Adelfio V, et al. COVID-19 vaccine-related thrombosis: A systematic review and exploratory analysis. Frontiers in Immunology. 2021;**12**:729251. DOI: 10.3389/ fimmu.2021.729251

[69] European Medicines Agency. AstraZeneca's COVID-19 vaccine: EMA finds possible link to very rare cases of unusual blood clots with low blood platelets. [Internet]. 2021. Available from: https://www.ema.europa.eu/en/ news/astrazenecas-covid-19-vaccineema-finds-possible-link-very-rarecases-unusual-blood-clots-low-blood [Accessed: February 15, 2023]

[70] Bone A, Houck K. The benefits of data mining. Elife. 2017;**6**:e30280. DOI: 10.7554/eLife.30280

[71] Hauben M, Horn S, Reich L. Potential use of data-mining algorithms for the detection of 'surprise' adverse drug reactions. Drug Safety. 2007;**30**(2):143- 155. DOI: 10.2165/00002018-200730020- 00004

*DOI: http://dx.doi.org/10.5772/intechopen.112412 Data Mining Strategy to Prevent Adverse Drug Events: The Cases of Rosiglitazone and COVID-19…*

[72] Guan Y, Qi Y, Zheng L, et al. Data mining techniques for detecting signals of adverse drug reaction of cardiac therapy drugs based on Jinan adverse event reporting system database: A retrospective study. BMJ Open. 2023;**13**(1):e068127. DOI: 10.1136/ bmjopen-2022-068127

Section 3

## Optimization Techniques Developed in Data Mining

#### **Chapter 7**

### On the Selection of Power Transformation Parameters in Regression Analysis

*Haithem Taha Mohammed Ali and Azad Adil Shareef*

#### **Abstract**

In multiple linear regression, several classical methods are used to estimate the parameters of the power transformation models employed to transform the response variable. Traditionally, these parameters are estimated using either Maximum Likelihood Estimation or Bayesian methods in conjunction with the other model parameters. In this chapter, attention is paid to four indicators of the efficiency and reliability of regression modeling, and to the possibility of treating them as decision rules through which the optimal power parameter can be chosen. The indicators are the coefficient of determination and the p-value of the general linear F-test statistic, together with the p-values of the Shapiro-Wilk test (SWT) statistic for residual normality of the estimated linear regression of the transformed response vector and of the estimated nonlinear regression of the original response vector resulting from the back-transform of the power transformation model. Real data were used, and a computational algorithm was proposed to estimate the optimal power parameter. The authors conclude that the multiplicity of indicators does not lead to a single optimal value for the power parameter, but this multiplicity may be useful in fortifying the decision-making ability.

**Keywords:** Box-Cox transformation, multiple linear regression, Shapiro-Wilk test, general linear F-test statistic, Maximum Likelihood Estimation

#### **1. Introduction**

It is known that when some conditions of statistical analysis are not met in the linear regression inputs, the outputs of the statistical inference will be unreliable. The two most important conditions that must be fulfilled in the estimated linear regression model are the normality of the residuals and the constancy of their variance, and these are also the most frequently violated conditions [1]. Failure to fulfill these conditions also means that the estimated response mean function does not have a straight-line shape in its relationships with the explanatory variables. The lack of these conditions likewise becomes evident in complicated nonlinear models when the residuals in the original model are additive [2]. Therefore, data transformation tools to linearity, especially those belonging to the power transformation (PT) family, have been

used to greatly enhance the utility of statistical modeling and to obtain a better fit as a general goal. That is, the main goal of data transformation is to make the data compatible with the requirements of statistical inference tools [3]. In short, the conditions confirming the best estimate of a linear regression model are that (i) the transformed response should be normally distributed with constant variance for each value of the predictor variables [4], or (ii) it should at least come closer to normality [5].

A large body of literature provides various suggestions and developments concerning the use of PT for continuous variables in regression models, whether for the dependent variable, the independent variables, or both. In this regard, two main research directions can be distinguished. The first is concerned with various proposals and strategies for developing the mathematical functions of PT models to address more complex data patterns; for example, see [6–11]. The second direction, on which this chapter focuses, is concerned with methods for selecting optimal power parameters in different PT families and datasets; for example, see [8, 12–18]. There are many methods used to estimate the power parameters in Multiple Linear Regression (MLR). Traditionally, these parameters are estimated using either Maximum Likelihood Estimation (MLE) or Bayesian methods in conjunction with the other model parameters [13]. It is also known that MLE is very sensitive to outliers [8]. Therefore, in addition to the traditional estimation methods, some other proposed methods are based on indicators of statistical modeling efficiency. These indicators are used as decision rules to choose the optimal value of the PT parameter [14, 15]. In general, the multiplicity of criteria used for a particular dataset does not lead to a single value, or even a closed feasible region, for the power parameter. The values of the power parameters also differ according to the transformation models.

Outside the traditional methods, Bartlett's approach was to choose a transformation that minimizes some measure of the heterogeneity of variance [16]. Tukey, 1949 [17] used efficiency indicators of ANOVA such as minimization of the F-test value for non-additivity, minimization of the F ratio for interaction versus error, and maximization of the F ratio for treatments versus error [18]. Anscombe, 1961 and Anscombe and Tukey, 1963 indicated how a certain function of the residuals can provide insight into the PT model [19]. Other authors proposed algorithms for power parameter selection using goodness-of-fit tests of the normality of the transformed data [12, 20] or the coefficient of determination of the estimated linear regression of the transformed response [15, 21, 22].

The chapter is divided into four sections. The second section gives a short review of PT models. The third section presents the application and the computational algorithm, while the fourth section presents the conclusions.

#### **2. Power transformation: short review**

Finney, 1947 [23] assumed the following simple family of PT to transform both sides of the dose-response regression $Y = \eta(x, \beta) + \varepsilon$,

$$\Psi(y, x) = \begin{cases} \left(y^{\lambda_1}, x^{\lambda_2}\right) & \lambda \neq 0 \\ \left(\ln y, \ln x\right) & \lambda = 0 \end{cases} \tag{1}$$

*On the Selection of Power Transformation Parameters in Regression Analysis DOI: http://dx.doi.org/10.5772/intechopen.112297*

to form a monotonic simple linear regression $E(\psi(y)) = \eta(\psi(x), \beta)$ for the nonlinear relationship of the positive response $Y$ given the positive dose $X$. $\lambda_1$ and $\lambda_2$ are the power parameters that can be estimated from the data.

Tukey, in 1957, developed another simple family of PT to accommodate negative $y$'s by assuming [24],

$$\psi(y) = \begin{cases} (y + a)^{\lambda} & \lambda \neq 0 \\ \ln(y + a) & \lambda = 0 \end{cases} \tag{2}$$

where the value of $a$ can be chosen such that $(y + a) > 0$. In general, it is assumed that for each $\lambda$, $\psi(y)$ is a monotonic function of $y$ over the admissible range [13].

Considering the common family of Box-Cox transformation (BCT) [13], it is possible to propose the following generalized approach,

$$\psi(y) = \begin{cases} \dfrac{(y + a)^{\lambda} - b}{\lambda \left\{ gm(y + a) \right\}^{\lambda - 1}} & \lambda \neq 0 \\ gm(y + a) \ln(y + a) & \lambda = 0 \end{cases} \tag{3}$$

where $a$ and $b$ are constants, and $a$ is chosen so that $(y + a) > 0$. $gm(y + a)$ denotes the geometric mean of the shifted response $(y + a)$. Eq. (3) of the BCT family holds for $(y + a) > 0$, that is, for $y > -a$. A number of PT models have been derived from this family; the following PT is equivalent to the simple version of the Finney transformation Eq. (1) when $a = 0$, $b = 1$ and $gm(y + a) = 1$,

$$\psi(y) = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \ln y & \lambda = 0 \end{cases} \tag{4}$$

The following PT is an extended form of Eq. (4), obtained when $a \neq 0$, $b = 1$ and $gm(y + a) = 1$, and is equivalent to the Tukey transformation in Eq. (2), since the analysis of variance is unchanged by a linear transformation [24],

$$\psi(y) = \begin{cases} \dfrac{(y + a)^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \ln(y + a) & \lambda = 0 \end{cases} \tag{5}$$

While for $a = 0$ and $b = 1$, we get a PT model equivalent to Eq. (4),

$$\psi(y) = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda \left\{ gm(y) \right\}^{\lambda - 1}} & \lambda \neq 0 \\ gm(y) \ln y & \lambda = 0 \end{cases} \tag{6}$$

Finally, for $a \neq 0$ and $b = 1$, we get the following PT model,

$$\psi(y) = \begin{cases} \dfrac{(y + a)^{\lambda} - 1}{\lambda \left\{ gm(y + a) \right\}^{\lambda - 1}} & \lambda \neq 0 \\ gm(y + a) \ln(y + a) & \lambda = 0 \end{cases} \tag{7}$$


The PT family has three main properties. The first is continuity as $\lambda$ goes to zero: considering the BCT in Eq. (4) and using L'Hospital's rule, it can be shown that $\lim_{\lambda \to 0} (y^{\lambda} - 1)/\lambda = \ln y$. The second property is the concavity of the transformation function $\psi(y)$, which leads to a nonlinear regression model for the original data after back-transforming the model fitted to the transformed data. The third property is flexibility: transformation by powers is suitable for dealing with many data structures and for achieving a number of goals.

In BCT family models, a negative transformation parameter reverses the order of the variable; that is, when $Y$ is increasing, $\psi(y)$ is decreasing for $\lambda < 0$. Tukey, 1977 therefore proposed the following model to maintain the order of the transformed variable [24],

$$\psi(\mathbf{y}) = \begin{cases} \mathbf{y}^{\lambda} & \lambda > \mathbf{0} \\ \ln \mathbf{y} & \lambda = \mathbf{0} \\ - \left(\mathbf{y}^{\lambda}\right) & \lambda < \mathbf{0} \end{cases} \tag{8}$$

BCT, according to Eq. (4) and Eq. (6), is applicable only to positive data. Yeo and Johnson, 2000 [25] therefore generalized BCT to include both negative and positive values in datasets. They used a smoothness condition to combine the transformations for positive and negative observations, obtaining a one-parameter transformation family. For $Y \in \mathbb{R}$, the Yeo-Johnson Transformation (YJT) is given by,

$$\Psi(y) = \begin{cases} \left((y + 1)^{\lambda} - 1\right)/\lambda & \lambda \neq 0 \text{ and } y \geq 0 \\ \ln(y + 1) & \lambda = 0 \text{ and } y \geq 0 \\ -\left((-y + 1)^{2 - \lambda} - 1\right)/(2 - \lambda) & \lambda \neq 2 \text{ and } y < 0 \\ -\ln(-y + 1) & \lambda = 2 \text{ and } y < 0 \end{cases} \tag{9}$$

YJT has three properties, namely [26]: (i) for $Y \geq 0$, $\Psi(y) \geq 0$, and for $Y < 0$, $\Psi(y) < 0$; (ii) $\Psi(y)$ is continuous at $\lambda \to 0$ and $\lambda \to 2$; (iii) $\Psi(y)$ is convex in $y$ for $\lambda > 1$ and concave for $\lambda < 1$.
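As a concrete illustration, the following minimal R sketch implements the YJT of Eq. (9); the function name, tolerance and example values are ours, not taken from the chapter.

```r
# Minimal sketch of the Yeo-Johnson transformation, Eq. (9).
# The function name and tolerance are illustrative, not from the chapter.
yeo_johnson <- function(y, lambda, tol = 1e-8) {
  psi <- numeric(length(y))
  pos <- y >= 0
  psi[pos] <- if (abs(lambda) > tol)
    ((y[pos] + 1)^lambda - 1) / lambda else log(y[pos] + 1)
  psi[!pos] <- if (abs(lambda - 2) > tol)
    -((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda) else -log(-y[!pos] + 1)
  psi
}

yeo_johnson(c(-3, -0.5, 0, 2, 10), lambda = 0.5)  # signs are preserved
```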

In MLR, for all the previous PT families, the optimal power parameter $\lambda^* = 1$ confirms the linearity of the regression relationship, so no transformation is required; $\lambda^* < 1$ indicates that the regression relationship of the original data is not linear because the response distribution is skewed to the right, and vice versa for $\lambda^* > 1$ [8].

The main idea of using PT models in data processing is the assumption that the transformed response variable in MLR follows a normal distribution. As a result, the original response follows an unknown and somewhat complex Probability Density Function (PDF) in the exponential family; in this sense, the response transformation changes the shape of the data and its original unit of measure [27]. Thus, the optimal power parameter and the other model parameters are estimated for the transformed data by the common estimation methods. In the end, the back-transformation represents the fitted nonlinear regression model of the original data. Mathematically, for the univariate $Y > 0$, under the main assumption $Y^{(\lambda)} = \psi(y) \sim N(\mu, \sigma^2)$, the PDF of $Y$ is given by $f_Y(y; \lambda, \mu, \sigma^2) = f_{Y^{(\lambda)}}(\psi(y); \lambda, \mu, \sigma^2) \cdot J(Y, \lambda)$, where $J(Y, \lambda) = \left|d\psi(y)/dy\right|$ is the Jacobian factor of the transformation $(Y_1, \ldots, Y_n) \to (\psi(Y_1, \lambda), \ldots, \psi(Y_n, \lambda))$.


Consider the MLR model $Y^{(\lambda)} = X\beta + \varepsilon$, where $Y^{(\lambda)} = \psi(y)$ represents the $(n \times 1)$ column vector of transformed values of the response vector $Y$, $X$ is the $n \times (p + 1)$ known information matrix, $\beta$ is the $(p + 1) \times 1$ unknown parameter vector, and $\varepsilon$ is the $(n \times 1)$ column vector of residuals, normally distributed with mean equal to the $(n \times 1)$ zero vector and covariance matrix $\sigma^2 I_n$. Based on the main assumption $\psi(y) \sim N(X\beta, \sigma^2 I_n)$, the joint PDF of the response vector $Y$ is given by the following likelihood function,

$$L(\lambda, \beta, \sigma^2 \mid y, X) = f_Y(y) = \left(2\pi\sigma^2\right)^{-n/2} \exp\left\{-\frac{\left(Y^{(\lambda)} - X\beta\right)^T\left(Y^{(\lambda)} - X\beta\right)}{2\sigma^2}\right\} J(Y, \lambda) \tag{10}$$

where $J(Y, \lambda) = \prod_{i=1}^{n} \left| dy_i^{(\lambda)}/dy_i \right|$. Applying the method of MLE to Eq. (10) and solving $\partial \ln L / \partial \beta = 0$ and $\partial \ln L / \partial \sigma^2 = 0$, we get the following estimates for each value of $\lambda$,

$$\hat{\beta}(\lambda) = \left(X^T X\right)^{-1} X^T Y^{(\lambda)} \tag{11}$$

$$\hat{\sigma}^2(\lambda) = \frac{1}{n}\left(Y^{(\lambda)}\right)^T H\, Y^{(\lambda)} \tag{12}$$

where $H = I - X\left(X^T X\right)^{-1} X^T$. Substituting the estimates $\hat{\beta}(\lambda)$ and $\hat{\sigma}^2(\lambda)$ into the logarithm of the likelihood function Eq. (10) gives what might be called the Box-Cox objective function, after ignoring the constant term,

$$L(\lambda, y) = -\frac{n}{2}\log \hat{\sigma}^2(\lambda) + \log J(Y, \lambda) \tag{13}$$

Note that for a given $\lambda$, the likelihood is inversely proportional to the sum of squared residuals $SS_{res}(\lambda)$ of the regression of $\psi(y)$ on $X$; the likelihood function is maximized when $SS_{res}(\lambda)$ is minimized. The value of the power parameter $\lambda$ is optimal when $L(\lambda, y)$ is at its maximum.
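To make the search concrete, the following R sketch profiles the objective function of Eq. (13) over a grid of candidate $\lambda$ values, using Eq. (11) and Eq. (12) for the inner estimates; the simulated data, grid and names are illustrative assumptions rather than the authors' code.

```r
# Sketch: profile the Box-Cox objective function, Eq. (13), over a grid.
# Function and variable names are illustrative, not the authors' code.
boxcox_loglik <- function(y, X, lambda) {
  z <- if (abs(lambda) > 1e-8) (y^lambda - 1) / lambda else log(y)
  fit <- lm.fit(X, z)                        # beta_hat(lambda), Eq. (11)
  sigma2 <- mean(fit$residuals^2)            # sigma2_hat(lambda), Eq. (12)
  log_jac <- (lambda - 1) * sum(log(y))      # log J(Y, lambda)
  -(length(y) / 2) * log(sigma2) + log_jac   # Eq. (13)
}

set.seed(1)                                  # simulated positive response
X <- cbind(1, runif(50))
y <- as.vector(exp(X %*% c(0.5, 2) + rnorm(50, sd = 0.2)))
grid <- seq(-2, 2, by = 0.1)
ll <- sapply(grid, function(l) boxcox_loglik(y, X, l))
grid[which.max(ll)]                          # lambda maximizing L(lambda, y)
```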

#### **3. An application and computational algorithm**

We consider a real economic dataset that includes a set of five explanatory variables affecting the Current Account of the Republic of Iraq in the period 2004–2020 (**Table 1**). The dataset was obtained from the Central Bank of Iraq and is also available at https://cbiraq.org/. The R program was used to analyze the data.

It is evident from **Figure 1** that there are three outliers among the values of the response variable, namely $y_9$, $y_{10}$ and $y_{11}$. Moreover, regarding BCT and the conditions for its implementation, the positivity constraint on the response is not fulfilled due to the presence of some negative values. Therefore, estimating the MLR for these data directly would be risky, and the diagnostic and inference tools might give misleading results. There is thus a definite need for some mathematical preparation to shift the data to another space.



#### **Table 1.**

*The current account and some explanatory variables of Republic of Iraq for the period 2004–2020 "Million IQD".*

#### **Figure 1.**

*Box plot of response variable values.*

The following MLR model was therefore chosen; it addresses the presence of negative values in the data and may have some robustness against the effects of the outliers,

$$Z^{(\lambda)} = U\beta + \varepsilon \tag{14}$$


$Z^{(\lambda)}$ represents the $(17 \times 1)$ column vector of transformed values of the Simple Index Numbers (SIN) of the original response vector $Y$. $Z^{(\lambda)}$ is defined according to the following simplified version of the BCT family,

$$Z^{(\lambda)} = \begin{cases} \dfrac{z^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \ln z & \lambda = 0 \end{cases} \tag{15}$$

The $i$th value of the vector $Z$ is defined as the following SIN, taking the first year as the base year,

$$z_i = \frac{y_{i+1} + a}{y_1 + a} \times 100 \tag{16}$$

$a$ is a constant that shifts the location of the response vector to positive space; it is chosen to ensure the BCT constraint $(Y + a) > 0$. $U$ is the $(17 \times 5)$ known information matrix of the SIN of the explanatory variables, taking the first year as the base year, where,

$$u_{ik} = \frac{x_{(i+1)k}}{x_{1k}} \times 100 \tag{17}$$

and $u_{1k} = 100\%$ for $k = 2, 3, 4, 5$. The SIN for the first explanatory variable is defined as,

$$u_{i1} = \frac{x_{(i+1)1} + b_k}{x_{11} + b_k} \times 100 \tag{18}$$

where $u_{11} = 100\%$ and $b_k$ is a constant that shifts the location of the explanatory variable to positive values; it is chosen so that $(X + b_k) > 0$. $\beta$ is the $(6 \times 1)$ unknown parameter vector, and $\varepsilon$ is the $(17 \times 1)$ column vector of residuals, normally distributed with mean equal to the $(17 \times 1)$ zero vector and covariance matrix $\sigma^2 I_n$.
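A small R sketch of the index-number construction of Eqs. (16)–(18) follows; the rule used here for choosing the shift constant (one plus the absolute minimum whenever a series contains non-positive values) is our assumption for illustration.

```r
# Sketch of Eqs. (16)-(18): simple index numbers (SIN) with the first
# year as base, after shifting to ensure positivity. Names and the
# shift rule are illustrative assumptions.
to_sin <- function(v) {
  shift <- if (min(v) <= 0) abs(min(v)) + 1 else 0  # ensures v + shift > 0
  100 * (v + shift) / (v[1] + shift)                # base year = 100%
}

y <- c(-120, 35, 80, 210)   # hypothetical response series
to_sin(y)                   # first element is 100 (the base year)
```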

Finally, the nonlinear multiple regression model for the original data, regressing $Z$ on $U$, is derived from the following back-transform of the BCT,

$$Z = \begin{cases} \left(\lambda Z^{(\lambda)} + 1\right)^{1/\lambda} & \lambda \neq 0 \\ \exp\left(Z^{(\lambda)}\right) & \lambda = 0 \end{cases} \tag{19}$$

Thus, we obtain the estimated multiple nonlinear regression model for the original data from the estimated MLR of the transformed data,

$$\hat{Z} = \begin{cases} \left(\lambda U\hat{\beta} + 1\right)^{1/\lambda} & \lambda \neq 0 \\ \exp\left(U\hat{\beta}\right) & \lambda = 0 \end{cases} \tag{20}$$

A number of modeling efficiency indicators are included in our search algorithm to obtain the optimal power parameter $\lambda^*$. The first is the traditional MLE. The second, third and fourth are the coefficient of determination (CoD), the p-value of the SWT statistic for residual normality, and the p-value of the general linear F-test statistic of the estimated linear regression of the transformed response vector. The fifth is the p-value of the SWT statistic for residual normality of the estimated nonlinear regression of the original response vector resulting from the back-transform of the BCT. The proposed computational algorithm is as follows (an illustrative R sketch is given after the steps):

Step 1: Transform the original response vector $Y$ to the SIN vector $Z$ according to Eq. (16), and the original information matrix $X$ to the SIN matrix $U$ according to Eq. (17).

Step 2: Choose a set of candidate values for the power parameter; for example, fix $\lambda \in \Lambda$, where $\Lambda = \{-2, -1.9, \ldots, 0, \ldots, 1.9, 2\}$. $\Lambda$ can be expanded to an acceptable range over which a convex curve for the MLE is obtained, and the same applies to the CoD. Obtaining a minimum of the p-value of the general linear F-test statistic within $\Lambda$ can also indicate that the candidate range is acceptable.

Step 3: Transform the SIN vector $Z$ to $\psi(Z)$ using the simple version of the BCT family according to Eq. (15), with the first candidate $\lambda^*$ in $\Lambda$.

Step 4: Estimate the parameters $\hat{\beta}(\lambda^*)$ and $\hat{\sigma}^2(\lambda^*)$ of the MLR of $Z^{(\lambda)}$ given $U$ according to Eq. (14), using Eq. (11) and Eq. (12).

Step 5: Estimate the log-likelihood function $L(\lambda^*, z)$ according to Eq. (13). Calculate the CoD, the p-value of the SWT statistic testing the normality of the residual vector, and the p-value of the general linear F-test statistic.

Step 6: Estimate the multiple nonlinear regression model for the original data using Eq. (20).

Step 7: Calculate the p-value of the SWT for the normality of the residual vector of the estimated multiple nonlinear regression model of the original data.

Step 8: Repeat all the steps from 3 to 7 for all values of λ∈Λ.
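The following R sketch illustrates Steps 2–8 for a single candidate $\lambda$, assuming the SIN response vector `z` and the SIN matrix `U` of explanatory variables (without intercept) have already been built in Step 1; all names are ours, not the authors' code.

```r
# Illustrative sketch of Steps 2-8: compute the five indicators for one
# candidate lambda; 'z' and 'U' are assumed built in Step 1.
evaluate_lambda <- function(z, U, lambda) {
  zt  <- if (abs(lambda) > 1e-8) (z^lambda - 1) / lambda else log(z) # Eq. (15)
  fit <- lm(zt ~ U)                          # MLR of transformed SIN on U
  f   <- summary(fit)$fstatistic
  zb  <- if (abs(lambda) > 1e-8) (lambda * fitted(fit) + 1)^(1 / lambda)
         else exp(fitted(fit))               # Eq. (20); NaN if lambda*fit+1 < 0
  c(loglik  = -(length(z) / 2) * log(mean(residuals(fit)^2)) +
              (lambda - 1) * sum(log(z)),                            # Eq. (13)
    CoD     = summary(fit)$r.squared,
    sw_lin  = shapiro.test(residuals(fit))$p.value,                  # Step 5
    p_F     = unname(pf(f[1], f[2], f[3], lower.tail = FALSE)),
    sw_back = shapiro.test(z - zb)$p.value)                          # Step 7
}

Lambda <- seq(-3, 3, by = 0.1)   # Step 2 grid; repeat for all lambda (Step 8)
# indicators <- t(sapply(Lambda, function(l) evaluate_lambda(z, U, l)))
```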

The tables below show the results of applying the computational algorithm. **Table 2** shows the optimal values of $\lambda$ against each indicator in its optimal state. **Table 3** shows the estimates of the power parameters according to the five indicators for all $\Lambda = \{-3, -2.9, \ldots, 0, \ldots, 2.9, 3\}$.

Based on the p-values of the general linear F-test statistic for all $\lambda \in \Lambda$ (**Table 3**), we conclude that the full estimated models, whether the nonlinear multiple regression models when $\hat{\lambda} \neq 1$ or the MLR when $\hat{\lambda} = 1$, are appropriate for the data. It is also clear from the p-value of the SWT for residual normality that the residuals are close to normal for the transformed-data models, except in the case $\ln Z$ (**Table 2**).


**Table 2.**

*The optimal values of* λ *against each indicator in its optimal state.*



**Table 3.**

*Estimates of the power parameter according to the five indicators for all* λ ∈Λ*.*

As for the MLE, the highest point corresponds to a parameter value close to zero (**Figure 2(a)**); that is, the optimal transformation is $\ln y$. On the other hand, according to the p-value of the SWT for residual normality, it is quite clear that the residuals are not normal. Therefore, the results of the general linear F-test statistics are not reliable.

As mentioned above, the value of the optimal $\lambda$ varies according to the different estimation methods and indicators. Confirming this, the optimal cases of two of the five indicators led to identical values of the optimal power parameter at $\hat{\lambda} = -0.5$, namely the CoD (**Figure 2(b)**) and the p-value of the general linear F-test statistic (**Figure 2(c)**).

#### **4. Conclusions**

The use of power transformation models to transform the response variable in regression relationships is, in fact, a way to create a nonlinear model for the data when the requirements of linear regression analysis are not met. In this sense, the statistical modeling of the transformed data is an intermediate station: the statistical analysis does not succeed unless the operations at this station are accurate and meet the requirements of model construction. Accordingly, there are many indicators of the success of the statistical analysis, corresponding to the multiplicity of its reliability conditions. When using PT models, there are likewise many methods for selecting the optimal power parameters, and two common directions can be identified: the first is the use of well-known estimation methods such as MLE; the second is the use of efficiency criteria of regression modeling as decision rules for estimating the power parameter. We conclude that the multiplicity of criteria for selecting the power parameter does not lead to a single value. However, the multiplicity of decision rules can help characterize optimal solutions and support the decision to choose the optimal power parameter.

#### **Author details**

Haithem Taha Mohammed Ali1,2\* and Azad Adil Shareef3

1 Department of Economic Sciences, University of Zakho, Kurdistan Region, Iraq


\*Address all correspondence to: haithem.taha@uoz.edu.krd

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] Chatterjee S, Price B. Regression Analysis by Example. New York: John Wiley and Sons, Inc.; 1977. pp. 19-22

[2] Cook RD, Weisberg S. Diagnostics for heteroscedasticity in regression. Biometrika. 1983;**70**(1):1-10. DOI: 10.1093/biomet/70.1.1

[3] O'Hara RB, Kotze DJ. Do not logtransform count data. Methods in Ecology and Evolution. 2010;**1**:118-122. DOI: 10.1111/j.2041-210X.2010.00021.x

[4] van Albada SJ, Robinson PA. Transformation of arbitrary distributions to the normal distribution with application to EEG test-retest reliability. Journal of Neuroscience Methods. 2007;**161**(2):205-211. DOI: 10.1016/j.jneumeth.2006.11.004

[5] Box GEP, Cox DR. An analysis of transformations revisited, rebutted. Journal of the American Statistical Association. 1982;**77**(377):209-210

[6] Klein Entink RH, van der Linden WJ, Fox JPA. Box–Cox normal model for response times. British Journal of Mathematical and Statistical Psychology. 2009;**62**(Pt 3):621-640

[7] Fischer C. Comparing the logarithmic transformation and the box-cox transformation for individual tree basal area increment models. Forest Science. 2016;**62**(3):297-306. DOI: 10.5849/ forsci.15-135

[8] Raymaekers J, Rousseeuw PJ. Transforming Variables to Central Normality. Machine Learning. 2021. DOI: 10.1007/s10994-021-05960-5

[9] Ferrari SLP, Fumes G. Box-Cox symmetric distributions and applications to nutritional data. AStA Advances in Statistical Analysis. 2017;**101**:321-344. DOI: 10.1007/s10182-017-0291-6

[10] Yeo IK, Johnson RA. A new family of power transformations to improve normality or symmetry. Biometrika. 2000;**87**(4):954-959

[11] Vélez JI, Correa JC, Marmolejo-Ramos F. A new approach of the Box-Cox transformation. Frontiers in Applied Mathematics and Statistics. 2015;**1**(12):1-10. DOI: 10.3389/fams.2015.00012

[12] Asar Ö, Ilk O, Dag O. Estimating Box-Cox power transformation parameter via goodness of fit tests. Communications in Statistics - Simulation and Computation. 2017;**46**(1):91-105. DOI: 10.1080/03610918.2014.957839

[13] Box GEP, Cox DR. An Analysis of Transformations. Journal of the Royal Statistical Society. Series B (Methodological). 1964;**26**(2):211-252. DOI: 10.1111/j.2517-6161.1964.tb00553.x

[14] Alyousif HT, Abduahad FN. Develop a nonlinear model for the conditional expectation of the Bayesian probability distribution (Gamma – Gamma). Al-Nahrain Journal of Science. 2018;**17**(2): 205-212 Available from: https://anjs.edu. iq/index.php/anjs/article/view/462/408

[15] Al-Saffar A, Mohammed Ali HT. Using power transformations in response surface methodology. In: 2022 International Conference on Computer Science and Software Engineering (CSASE), Iraq: IEEE; 2022. pp. 374-379. DOI: 10.1109/CSASE51777.2022.9759781

[16] Tukey JW. Dyadic ANOVA, an analysis of variance for vectors. Human Biology. 1950;**21**:65-110

[17] Box GEP, Tidwell PW. Transformation of the independent variables. Technometrics. 1962;**4**:531-550

[18] Tukey JW. One degree of freedom for non-additivity. Biometrics. 1949; **5**(3):232-242

[19] Velez JI, Marmolejo RF. A new approach to the Box-Cox transformation. Frontiers in Applied Mathematics and Statistics. 2015;**1**(12):1- 10. DOI: 10.3389/fams.2015.00012

[20] Chen G, Lockhart RA, Stephens MA. Box–Cox transformations in linear models: large sample theory and tests of normality. Canadian Journal of Statistics. 2002;**30**(2):1-59. DOI: 10.2307/3315946

[21] Draper NR, Smith H. Applied Regression Analysis. NY: John Wiley and Sons Inc.; 1981

[22] Atkinson AC, Riani M, Corbellini A. The box-cox transformation: Review and extensions. Statistical Science. 2021;**36** (2):239-255. DOI: 10.1214/20-STS778

[23] Finney DJ. The Principles of Biological Assay. Supplement to the Journal of the Royal Statistical Society. 1947;**9**(1):46-81. DOI: 10.2307/2983571

[24] Tukey JW. On the comparative anatomy of Transformations. The Annals of Mathematical Statistics. 1957; **28**(3):602-632

[25] Yeo IK, Johnson RA. A new family of power transformations to improve normality or symmetry. Biometrika. 2000;**87**(4):954-959. DOI: 10.1093/biomet/87.4.954

[26] Samira, S. Exact Box-Cox analysis. Electronic thesis and dissertation repository. 2018. Available from: https:// ir.lib.uwo.ca/etd/5308

[27] Cook RD, Weisberg S. Residuals and Influence in Regression. New York: Chapman and Hall; 1982

#### **Chapter 8**

## Modified Bagging in Linear Discriminant Analysis: Machine Learning

*Yousef El Gimati*

#### **Abstract**

The main idea of this work is the machine learning technique of BAGGING, or Bootstrap AGGregatING, which is extended to combine classifiers on the basis of a distance function. The idea of this function is to find the shortest distance from each data point to the classification boundary, using the 'Manhattan' distance for decision trees; an alternative distance measure, the 'Mahalanobis' distance, is used for Linear Discriminant Analysis (LDA), and this is called modified bagging in this work. By providing a weighted voting system instead of equal-weight voting, the classification error is reduced. Modified bagging is a viable option for reducing the variance, which is a component of the classification error. Based on the analysis, we conclude that modified bagging gives a statistically significant improvement on Ripley's data set with different bootstrap sample sizes.

**Keywords:** machine learning, bootstrap aggregating, modified bagging, classification error, LDA

#### **1. Introduction**

Advances in data collection and processing technology have led to a revolution in database management. Classification plays a key role in understanding, summarising and finding useful patterns in data, for example in medicine, agriculture, archaeology, taxonomy and industry. Statistical classification has traditionally been developed not only by statisticians but also by engineers and computer scientists, in fields such as machine learning and neural networks.

There are two main aspects to classification: cluster analysis and discriminant analysis. The goal of cluster analysis or unsupervised learning is to identify classes from data, that is, to determine the number of clusters and assign each observation to a specific class. Each cluster can be defined as a region of observations that are similar enough to each other to be grouped together. Cluster analysis is applied in medicine to cluster the incidences of specific types of tumours and in history to group archaeological findings. Recent applications of clustering techniques have come from biology, for instance discovery of tumour classes using gene expression data (bioinformatics). A popular clustering algorithm is the *k*-means algorithm [1], which aims to partition a given data set into *k* clusters based on squared distance.

By contrast, in discriminant analysis or supervised learning, we have a training sample (or learning sample) of data, in which we observe the outcome and feature measurements of a set of objects. Using this data, the aim then is to construct (or build) a classification rule to predict the outcome for unseen objects (test set). Contexts in which discriminant analysis is fundamental include, as examples: (i) Mechanical procedure for sorting a collection of coins into several classes (e.g. 10, 20, 50 pence or 1 pound), in which the obtained measurements could be diameter, shape or weight. A measurement on which the coins differ can be used for sorting the coins into pre-specified classes. (ii) Patients who experience a heart attack admitted to a hospital; data are often obtained, such as heart rate, blood pressure, patient's age and medical history, and a wide variety of other information. It would be useful to develop criteria to differentiate low-risk from high-risk patients. (iii) In social sciences, people use information collected from polls to predict the outcome of elections. Note that in the statistical literature, supervised learning problems (classification problems) are usually, but not always, referred to as discrimination.

All of these problems have in common the requirement of using knowledge of previously classified data to build a classifier in order to assign a class to a new observation. We refer to the construction or building of classification rules from data as discrimination, classification or learning. The quality of a classifier is characterised by the misclassification error rate on the test set. The main assumption is that observations in the test set are assumed to be generated from the same underlying distribution as the observations in the training set. Almost all methods of discriminant analysis can be seen as ways for the construction of a classifier.

Two contrasting points of view have been taken. We could take a parametric family, which can only be applied when the general parametric form of the probability density functions (pdfs) is known or assumed from either a theoretical knowledge or from studying a training set. This method's most well-known use is linear discriminant analysis (LDA), which assumes that the class-conditional densities are multivariate normal distributions with distinct means but a similar covariance matrix. This is in contrast to a *non-parametric* or *distribution-free* approach in which no assumption is made about the type of distribution from which the samples are drawn. An example is to use kernel density or *k*-nearest neighbour. Classification decision trees provide another example of a successful non-parametric method.

Compared with cluster analysis, more information is available on the observations in discriminant analysis. Applying discrimination after a cluster analysis makes sense because the clusters (classes) produced by the analysis can ultimately be used for prediction. Although discriminant analysis and cluster analysis provide a useful dichotomy of classification problems, numerous real-world problems include characteristics of both situations. Note that the focus of this chapter is on supervised learning problems only.

In this work, bagging is applied to linear discriminant analysis (LDA). Aggregating several decision classifiers in a committee is generally more accurate than using a single classifier, possibly ignoring the opinion of some classifiers. Thus, it is promising to use the bagging technique to obtain a better classifier with a more stable solution. Here, we use the LDA classifier with the usual bagging, the averaging combining rule and a modified version. The modified bagging is based on averaging the probability of LDA classifiers by aggregating posterior distance. We consider these techniques from the perspective of a two-class linear discriminant function. Historically, [2] was the first to propose a procedure for a two-group problem based on maximising the separation between the groups, and it is widely applied in assigning future examples to these groups. Note that LDA is usually stable when it is constructed on large training samples and unstable when small training samples are used, that is, when *n* is small relative to *d*. Here, we show that LDA is not always as stable as one would hope; consequently, applying modified bagging can be useful in some situations (see [3]).

The term discrimination or separation refers to the process of deriving classification rules from data of classified observations, whereas classification or allocation refers to the application of the rules to new observations of unknown class. In practice, discrimination and classification frequently overlap, and the distinction between them becomes blurred.

#### **1.1 Linear discriminant analysis: overview**


Consider modeling each class density as a multivariate normal distribution with the following forms:

$$f_i(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mu_i)'\,\Sigma_i^{-1}(\mathbf{x} - \mu_i)\right], \quad i = 1, 2 \tag{1}$$

with mean vector $\mu_i$, covariance matrix $\Sigma_i$, and $d$ the dimension of the feature space. Further, if we suppose the special case where each of the classes has a common covariance matrix $\Sigma$, this leads to a particularly simple linear discriminant analysis (or LDA).

A general approach to a classification rule is to consider the posterior probabilities. The Bayes discriminant rule with respect to the posterior is to allocate a new observation $\mathbf{z}$ to the population for which $\pi_i f_i(\mathbf{z})$ is maximised, where the $\pi_i$ are the prior probabilities. Then, the allocation rule that allocates $\mathbf{z}$ to $G_1$ is

$$(\mu_1 - \mu_2)'\,\Sigma^{-1}\mathbf{z} - \frac{1}{2}(\mu_1 - \mu_2)'\,\Sigma^{-1}(\mu_1 + \mu_2) \geq \log\left(\frac{\pi_2}{\pi_1}\right) \tag{2}$$

In most practical situations, the population quantities $\mu_i$, $\Sigma_i$ and $\pi_i$ are unknown and replaced by their sample counterparts. Hence, the samples of $n_i$ observations from each $G_i$ are used to define a sample-based rule by replacing:

$\mu_i$ with $\bar{\mathbf{x}}_i$, the estimated mean vector in $G_i$; $\Sigma$ with $\mathbf{S}_p$, the pooled sample covariance matrix; and $\pi_i$ with $\hat{\pi}_i = n_i/n$, the estimated prior probability. These estimates are given by

$$\bar{\mathbf{x}}_j = \frac{1}{n_j}\sum_{i:\, y_i = j} \mathbf{x}_i, \qquad \mathbf{S}_j = \frac{1}{n_j - 1}\sum_{i:\, y_i = j}\left(\mathbf{x}_i - \bar{\mathbf{x}}_j\right)\left(\mathbf{x}_i - \bar{\mathbf{x}}_j\right)', \quad j = 1, 2 \tag{3}$$

$$\mathbf{S}_p = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2} \tag{4}$$

With these estimates inserted to give parametric estimates of $f_1$ and $f_2$, rule (2) now becomes a sample linear discriminant rule: allocate $\mathbf{z}$ to $G_1$ if

$$(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\,\mathbf{S}_p^{-1}\mathbf{z} - \frac{1}{2}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)'\,\mathbf{S}_p^{-1}(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2) \geq \log\left(\frac{\hat{\pi}_2}{\hat{\pi}_1}\right) \tag{5}$$
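A minimal R sketch of this sample rule follows, with illustrative names and a toy example; the default equal priors are our assumption.

```r
# Minimal sketch of the sample linear discriminant rule, Eq. (5).
# Names and the default priors are illustrative.
lda_rule <- function(z, x1, x2, prior = c(0.5, 0.5)) {
  xbar1 <- colMeans(x1); xbar2 <- colMeans(x2)
  n1 <- nrow(x1); n2 <- nrow(x2)
  Sp <- ((n1 - 1) * cov(x1) + (n2 - 1) * cov(x2)) / (n1 + n2 - 2)  # Eq. (4)
  w  <- solve(Sp, xbar1 - xbar2)              # S_p^{-1}(xbar1 - xbar2)
  lhs <- sum(w * z) - 0.5 * sum(w * (xbar1 + xbar2))
  if (lhs >= log(prior[2] / prior[1])) 1 else 2   # allocate z to G1 or G2
}

set.seed(3)   # toy example: two bivariate normal samples
x1 <- matrix(rnorm(40, mean = -2), ncol = 2)
x2 <- matrix(rnorm(40, mean =  2), ncol = 2)
lda_rule(c(-1.5, -1.8), x1, x2)   # should return 1
```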

#### **2. Bootstrap aggregating or bagging: overview**

The quest for a 'good' learner is an important issue in discriminant analysis. Usually, a single classifier constructed on a training set is biased and has large variance. Consequently, such a classifier may be misleading and will typically generalise poorly. One can improve the classifier by combining multiple classifiers instead of using a single one. The cooperation of several classifiers as a decision 'committee' has been proposed as a way of reducing misclassification error.

In Breiman's paper [4], bagging, which has been shown to increase classifier accuracy, is used to enhance the performance of a learning algorithm. Bagging is a general technique that combines, by majority vote, the outputs of several classifiers produced by a single learning algorithm on bootstrap resampled versions of the training set.

Bagging produces random and independent training sets by sampling with replacement from the original data set. This procedure is repeated $B$ times (say $B = 50$), yielding $B$ classifiers [5]. A final classifier or prediction is obtained by averaging the individual classifiers in regression, or by taking the majority vote over the individual classifiers in classification.

Bootstrapping and aggregating are used in bagging as follows. Take a training set consisting of $n$ observations ℓ = {($x_i$, $y_i$), $i = 1 \ldots n$}, with $x_i$ the feature vector and $y_i$ the response, taking values in $y \in \{1, 2, \ldots, J\}$. Draw bootstrap samples, each of size $m$, with replacement from the original training set. Construct a classifier from each bootstrap sample; the final classifier's output is the class predicted most often by its sub-classifiers (for more details, see [6]). **Figure 1** contains full details:

#### **2.1 Mechanics of bagging algorithm on LDA**

Bagging is usually investigated for decision trees. Breiman [4] showed that bagging can reduce the classification error of decision trees, and he noticed that bagging is useful for unstable procedures only. For stable rules like LDA, [7] found bagging useless, mainly because [4] used large training sets, for which LDA is very stable. In this work, we study linear discriminant analysis to investigate whether applying bagging can be useful in some situations. Aggregating techniques are implemented on LDA in the following way (an illustrative R sketch follows Algorithm 1):

#### **Algorithm 1**. Bagging algorithm on LDA

1. **for** *b =* 1 to *B* **do**

*ℓ\*b =* bootstrap sample of size *m* from *ℓ* (*iid* sample with replacement)

estimate the parameters used by LDA in Eq. (4).

*Modified Bagging in Linear Discriminant Analysis: Machine Learning DOI: http://dx.doi.org/10.5772/intechopen.113260*

**Figure 1.** *Illustration of a majority vote proposed by the researcher depicts: BAGGING = Bootstrap + AGGregatING.*

construct the LDA classifier *c <sup>ℓ</sup>\*b* (*x*)

#### **endfor**

store the classifiers {*c <sup>ℓ</sup>\*b* (*x*), *b* = 1,2 … , *B*}

2. Each observation in *ℓ* or in a test set is then classified using a majority vote of *c\**1, *c\**2, … , *c\*B* classifiers.

> A final decision rule: $c_{\text{bag}}(x) = \arg\max_{y \in \{0,1\}} \sum_{b=1}^{B} I\left(c_{\ell^{*b}}(x) = y\right)$,

where $I[\cdot]$ is the indicator function: $I[\cdot] = 1$ if $c_{\ell^{*b}}(x) = y$ and $0$ if $c_{\ell^{*b}}(x) \neq y$.
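A compact R sketch of Algorithm 1 using `MASS::lda` follows, as mentioned above; the data layout (a data frame with a factor column `y` plus feature columns) and the defaults are our assumptions, not the chapter's code.

```r
# Sketch of Algorithm 1: bagged LDA by majority vote, using MASS::lda.
# The data layout (factor column 'y' plus features) is an assumption.
library(MASS)

bagged_lda <- function(train, test, B = 50, m = nrow(train)) {
  votes <- sapply(seq_len(B), function(b) {
    boot <- train[sample(nrow(train), m, replace = TRUE), ]  # l*b
    # (a bootstrap sample could, rarely, miss a class; ignored in this sketch)
    fit  <- lda(y ~ ., data = boot)           # LDA classifier c_l*b(x)
    as.character(predict(fit, test)$class)
  })
  # majority vote over the B stored classifiers
  apply(votes, 1, function(v) names(which.max(table(v))))
}
```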

#### **2.2 Illustration of bagging algorithm on LDA**

Linear discriminant analysis is optimal for providing a classification rule that minimises the misclassification error rate if each class is a sample from a multivariate normal population and the population covariance matrices are all equal. Decision tree classification can achieve lower error rates when these assumptions are violated, for example in the case of bimodality.

To illustrate this, consider the situation where LDA is applied to a two-class problem with one-dimensional data: sixteen values drawn from the first population $N(-2, 1.2)$ and sixteen values from the second population $N(2, 1.2)$. The separation of these two sets of univariate $x$s is assessed by the sample linear discriminant rule in Eq. (5).

#### **Figure 2.**

*LDA classifier vertical (solid) line, bootstrap classifiers (dotted lines) and majority vote vertical (dashed) line. The two classes are shown by ᴼ and \*. Note that there are 16 observations in each class.*

In **Figure 2**, one can see that the LDA classifier built on the complete data is unable to accurately segregate the data. Making classifiers with bootstrapped samples based on 10 replicates sometimes gives a better classifier, sometimes a worse one. Aggregating bootstrap versions using a simple majority vote of classifiers could allow us to get a better classifier than the original classifier. An interesting observation from this example is that the bagged classifier is closer to Bayes' rule (where densities intersect) than the LDA classifier.

In one dimension, the majority rule is in a special case equivalent to a 'median' procedure.

Let us consider the above classification example with a two-class problem. In this example, we have 32 training examples. We also have classifiers $c_b(x)$, $b = 1, 2, \ldots, B$, as shown in **Figure 1**, such that if $x < c_b(x)$ then class 1, otherwise class 2. For a given new observation $z$, classify $z$ to class 1 if $\#\{b = 1, \ldots, 10 : c_b(x) > z\} > \#\{b = 1, \ldots, 10 : c_b(x) < z\}$, as a majority vote.

The majority vote is usually constituted by 50% + 1 of the class that has the power to make decisions binding upon the whole. Here, the majority criterion is equivalent to the median, since $\#\{b = 1, \ldots, 10 : c_b(x) > \tilde{c}\} = 10/2$, so that if $z < \tilde{c}$ then $z$ is classified to class 1, otherwise to class 2, where $\tilde{c}$ is the median of $c_b(x)$, $b = 1, \ldots, 10$. Notice that it is very difficult to define the median for more than one-dimensional data. However, when comparing the averaging classifier with the median classifier, it is necessary to evaluate moments such as the variance, to make sure that the classification rule has low variance and hence a low error rate. Let $\tilde{\mu}$ denote the median of the population. Standard asymptotic theory shows that $\left[4n f(\tilde{\mu})^2\right]^{-1}$ estimates $\mathrm{var}(\tilde{\mu})$, if the density function is known. For example, in the case of the normal distribution, we have $\tilde{\mu} = \mu$, so that $f(\tilde{\mu}) = f(\mu) = 1/(\sigma\sqrt{2\pi})$. So, the variance of the mean here is smaller than that of the median.
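A quick simulation (ours, purely illustrative) confirms this: under normality, the sample mean varies less than the sample median.

```r
# Illustrative check: under normality the sample mean varies less than
# the sample median (asymptotic variance ratio is about pi/2).
set.seed(2)
sims <- replicate(2000, {x <- rnorm(31); c(mean = mean(x), median = median(x))})
apply(sims, 1, var)   # variance of the median is roughly pi/2 times larger
```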

#### **3. Averaging combining rule on LDA**

Based on the above observations on the variances of the mean and the median, we introduce the averaging combining rule for applying bagging to LDA classifiers, which averages the coefficients of classifiers built on bootstrap replicates to create a final decision rule. As seen in the above example, averaging has an advantage over the median procedure, which is in some sense equivalent to a majority vote; this makes it possible to use the averaging combining rule. Notice, however, that the averaging combining rule makes sense only when the rules are 'consistent' at the boundaries. Thus, bagging in this case is organised as follows:

**Algorithm 2**. Averaging combining algorithm on LDA


1. **for** *b =* 1 to *B* **do**: draw a bootstrap sample *ℓ\*b* of size *m* from *ℓ* and construct the LDA classifier *c <sup>ℓ</sup>\*b* (*x*), as in step 1 of Algorithm 1; **endfor**

2. combine the LDA classifiers by averaging their coefficients into a final decision rule:

$$c_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} c_{\ell^{*b}}(x) \tag{6}$$

#### **3.1 Modified bagging algorithm on LDA**

As mentioned previously, the averaging combining rule performs well only if the rules constructed on the bootstrap samples are consistent at the boundaries. In practice, this may not hold, because bootstrap samples have different features, which lead to inconsistent classifiers at the boundaries, especially if the bootstrap sample size is small. For this reason, we introduce an alternative procedure based on the posterior distance, or Mahalanobis distance. The posterior distance treats the respective mean vectors and covariance matrix as specifying a probability distribution for each class. For each new observation, we calculate the probability that the observation came from each class; the observation is then classified to the class giving the highest probability. To illustrate, consider the one-dimensional example shown in **Figure 3**.

Suppose, first of all, that we have two probability distributions with a common covariance matrix: observations drawn from class 1 have a known distribution with mean 4, whereas those from class 2 have a known distribution with mean 9. The classifier $c_{\ell^{*b}}(z_1)$ gives a higher probability in class 1 than in class 2, and consequently $z_1$ would be classified to class 1. In contrast, the classifier $c_{\ell^{*b}}(z_2)$ gives a larger probability in class 2 than in class 1; consequently, the observation $z_2$ would be classified to class 2. The posterior distance differs from the Euclidean distance in that it takes into account the variance of the probability distribution of each class.

**Figure 3.**

*Two probability densities from* N*(4, 1.5²) and* N*(9, 1.5²). The classifiers* $c_{\ell^{*b}}(z_1)$ *and* $c_{\ell^{*b}}(z_2)$ *(dashed lines) assign* $z_1$ *and* $z_2$ *to class 1 and class 2, respectively, according to the posterior distance.*

#### **4. Mechanics of modified (weighted) bagging algorithm on LDA**

A commonly used distance measure is the Mahalanobis distance, defined as $d(x_i, x_j) = \sqrt{(x_i - x_j)'\,S_p^{-1}(x_i - x_j)}$, where $S_p$ is the pooled sample covariance matrix, and $x_i$ and $x_j$ are the respective vectors of measurements on observations $i$ and $j$. Compared to the Euclidean or Manhattan distance, this measure has the advantage of explicitly accounting for any correlations that might exist between variables. It is similar to the posterior distance when the data for each class are similarly distributed. Thus, a second, modified version of bagging is implemented in the following way:

**Algorithm 3**. Modified bagging algorithm on LDA.

**for** *b =* 1 to *B* **do**

*ℓ\*b =* bootstrap sample of size *m* from *ℓ* (*iid* sample with replacement)

estimate the parameters used by LDA in Eq. (4)

construct the LDA classifier *c <sup>ℓ</sup>\*b* (*x*)

for each new observation *z\**, let

$$D(z^*) = \min\left[\min_{x}\left\{d(z^*, x) : c_{\ell^{*b}}(x) \neq c_{\ell^{*b}}(z^*)\right\},\ \min_{x}\left\{d(z^*, x) : x \text{ on edge of } X\right\}\right] \tag{7}$$

**endfor**

$$c_{\text{bag}}(z^*) = \operatorname{sign}\left[\sum_{b=1}^{B} D(z^*) \times c_{\ell^{*b}}(z^*)\right] \tag{8}$$

where $\operatorname{sign}(\cdot) = \pm 1$ depending on the sign of its argument.

The procedure for modified (or weighted) bagging is performed by the following steps (Algorithm 3). Given a training set ℓ = {($x_i$, $y_i$), $i = 1 \ldots n$}, consisting of $n$ observations with $x_i$ the two-dimensional feature vector and $y_i \in \{-1, 1\}$, draw bootstrap samples $\ell^{*b}$, $b = 1, \ldots, B$, each of size $m$, with replacement from the original training set ℓ. Let $Z$ be a 2D array covering the sample values $x$, and define the (Manhattan) distance between $z_1$ and $z_2$ as

$$d(z_1, z_2) = |z_{11} - z_{21}| + |z_{12} - z_{22}| \tag{9}$$
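A small R sketch of the distance-weighted vote follows, assuming labels coded in {−1, +1} and per-classifier weights D already computed as in Eq. (7); all names are illustrative.

```r
# Sketch of the weighted vote in Eq. (8), assuming labels in {-1, +1}
# and per-classifier weights D computed as in Eq. (7); names are ours.
manhattan <- function(z1, z2) sum(abs(z1 - z2))      # Eq. (9)

weighted_vote <- function(preds, D) {
  # preds: B predictions in {-1, +1} for one new observation z*
  # D: the corresponding boundary distances used as voting weights
  sign(sum(D * preds))                               # Eq. (8)
}

weighted_vote(preds = c(1, 1, -1), D = c(0.2, 0.3, 1.5))  # returns -1
```

In this toy call, the single far-from-boundary classifier outweighs the two near-boundary ones, which is the intended effect of the weighting.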

#### **4.1 Ripley's simulated data**

The data set is a two-class classification problem in two features. A training set of 250 observations is used, and the error rates are estimated using a further sample of 1000 observations. We use the same data as in [8], as illustrated in **Figure 4**. The class distributions are chosen to allow a best-possible error rate of about 8% and are in fact equal mixtures of two normal distributions. The LDA classifier is constructed on the full training set and tested on the test set, with a test error rate of about 10.8%, and has been superimposed (solid line) on the graph.

In this example, we apply weighted bagging, the averaging rule, and usual bagging to LDA. One reason for choosing the LDA classifier is ease of comparison. Another is that it nicely illustrates the small-sample case: when the training sample for LDA is small, the classifier becomes less stable. Here, we consider Ripley's data set with different bootstrap sample sizes (*m* = 75 and 100). By their statistical properties, bootstrap replicates of small training samples differ more from each other than those of large training samples. Therefore, bagging may be advantageous for an LDA classifier built on a relatively small training sample, especially when the classifier is strongly influenced by perturbations in the composition of the training sample. However, a very small training sample may be very misleading. As shown in the left panel of **Figure 5**, which compares weighted bagging, the averaging rule, and usual bagging,

#### **Figure 4.**

*Two-class problem data from [8]. The two classes are displayed as 0 and 1; note that there are 100 observations in each class. The LDA classifier constructed on the full data is superimposed with an error rate of 10.8% (solid line), along with usual bagging (dashed line) and weighted bagging (dotted line).*

#### **Figure 5.**

*Weighted bagging (solid line) and usual bagging (dashed line) for different training sample sizes from Ripley's data set. The left panel is based on* m = *75, while the right panel is based on* m = *100. Note that the test error rates for both procedures are estimated over 50 repetitions.*

constructed on a training sample of size *m* = 75 and evaluated on the test set, weighted bagging is clearly superior to the other two procedures as the number of iterations increases. However, the optimal error rate is reached by usual bagging after 7 iterations, which gave a better improvement than the later iterations. The test error rate was about 10.3% for modified bagging and 10.6% for both the averaging rule and usual bagging. When the training sample size increased to *m* = 100, as shown in the right panel of the same figure, all procedures are less accurate than in the left panel, which was produced with a training sample of size 75. With a further increase of the training sample size, up to *m* = 200, modified bagging and usual bagging become more stable and nearly identical to the averaging rule. This result is consistent with the simulation study performed by Breiman [7] on LDA.

As a result, a classifier built on a large training sample may be comparable to the classifier built on the entire training set. Therefore, some features of the classifier built on the whole training set may also be present in the bagged classifiers. This phenomenon is most pronounced for the LDA classifier. In general, the performance of modified bagging and usual bagging, as well as of the averaging combining rule, is strongly affected by the training sample size and the distribution of the data.

#### **5. Conclusion**

Combining several decision classifiers in a committee can provide an improvement over the use of a single classifier. Bagging is a general technique that combines the benefits of bootstrapping and aggregating. However, bagging has mainly been investigated for decision trees and much less for linear discriminant analysis (LDA). Breiman [4] claims that bagging works well for unstable procedures, which often have high variance. This variance is a major component of the classification error and can be reduced by aggregating, often producing good results.


The idea of bagging is extended to weighting the classifiers based on the Manhattan distance in decision trees (see [9]) or the Mahalanobis distance in linear discriminant analysis (LDA), instead of a simple majority vote. In the first direction, we focused on decision trees; the experimental findings and analysis demonstrate that, for all sample sizes, the weighted bagged classifier outperforms conventional bagging at various tree depths. It is interesting to note that weighted bagging with stump trees somewhat outperforms standard bagging. In the second direction, we concentrated on linear discriminant analysis (LDA). Using Ripley's simulated data, we have shown that weighted bagging can provide a good classifier, better than the averaging rule and the usual bagged classifier when constructed on a small training sample. As the training sample size increases, the classifier becomes stable, and both modified bagging and usual bagging become identical to the averaging rule. These techniques depend on the training sample size as well as the distribution of the data used to construct the classifier. Therefore, it is crucial to consider both when using this strategy to enhance classifier performance.

#### **5.1 Potential research directions**

We have concluded that weighted bagging could sometimes be preferable to usual bagging. In general, a simple majority-vote strategy is not a good choice for the aggregating rule, whereas the weighted-vote technique is often a good choice for a bagged classifier in decision trees [9].

Furthermore, weighted bagging may also perform well with the LDA classifier. However, the choice of aggregating technique may be important, and it strongly depends on the training sample size and the distribution of the data.

In this study, we use weighted bagging only in the case of a two-dimensional predictor space; the practically more important scenario of many variables remains open. We leave this as an unsolved research issue.

#### **Acknowledgements**

I would like to thank Professor Charles Taylor of Leeds University for his scientific input on this work.

### **Author details**

Yousef El Gimati

1 Statistics Department, University of Benghazi, Libya

2 Faculty of Business Administration, Libyan International Medical University, Libya

\*Address all correspondence to: yousef.elgimati@uob.edu.ly

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] MacQueen J. Some methods for classification and analysis of multivariate observations. In: LeCam LM, Neyman J, editors. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press; 1967. pp. 281-297

[2] Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936;**7**:179-188

[3] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning – Data mining, inference, and prediction. In: Springer Series in Statistics. New York, NY: Springer; 2017

[4] Breiman L. Bagging predictors. Machine Learning. 1996a;**24**:123-140

[5] Efron B, Tibshirani R. An Introduction to the Bootstrap. London: Chapman & Hall; 1993

[6] Davison AC, Hinkley DV. Bootstrap Methods and their Application. Cambridge: Cambridge University Press; 1997

[7] Breiman L. Bias, variance and arcing classifiers. Technical report, Statistics Department, University of California, Berkeley; 1996b

[8] Ripley BD. Neural networks and related methods for classification (with discussion). Journal of the Royal Statistical Society B. 1994;**56**:409-456

[9] El Gimati Y. Weighted bagging in decision trees: Data mining. JINAV: Journal of Information and Visualization. 2020;**1**(1):1-14. DOI: 10.35877/454RI. jinav149

#### **Chapter 9**

## Application of Process Mining and Sequence Clustering in Recognizing an Industrial Issue

*Hamza Saad*

#### **Abstract**

Process mining has become one of the best tools for outlining the event logs of production processes in visualized detail. We address an important problem that easily occurs in industrial processes: the bottleneck. The analysis focused on extracting the bottlenecks in the production line in order to improve the flow of production. Given enough stored history logs, process mining can provide a suitable answer for optimizing production flow by mitigating bottlenecks in the production stream. Process mining diagnoses the production processes by mining event logs, which can help expose opportunities to optimize critical production processes. We found a considerable bottleneck in the process caused by the weaving activities. Through discussions with specialists, it was agreed that the main problem lies in the weaving processes, especially the machines that were exhausted by overloading. The improvement in the system has been measured by the team: the cycle time for the processes has improved to 91%, the workers' performance has improved to 96%, productivity has improved to 91%, product quality has improved by 85%, and lead time has been optimized from days and weeks to hours.

**Keywords:** process mining, event-logs, clustering, bottlenecks, production processes

#### **1. Introduction**

Process mining is a set of analysis techniques that provide a data-based overview of how a business process is executed in the real environment [1]. Typically, people who conduct business work have a clear idea of how the processes are working; process mining uses the historical data to refute or confirm this belief. Process mining applications have confirmed, in several cases, that a process mining project can achieve real improvements in a process by depicting all activities and events in a dynamic map [2]. Process mining is a data-based approach to answering important questions about business processes. This means that the real data of the business process execution must be recorded in an IT system: a business activity includes many small task activities or subtask activities, so the IT system needs to record the events associated with these activities [3].

These activities require a case ID to reference the business process instance; the events are parts of the activities, and a timestamp represents the time at which each event is executed. Events can also hold other relevant attributes. Different types of process mining can be distinguished according to their main aims. Broadly, all types of process mining aim to extract meaningful information, in terms of patterns, from a set of process data logs that is mapped and analyzed. According to Van der Aalst [4], process mining comprises the following three primary types: (i) business process conformance, which refers to comparing a known business process model with an event log of the same business process, to detect whether reality, as recorded in the log, is in line with the model and vice versa; (ii) business process discovery, which refers to producing a still-unknown business process model from an event data log, using no prior information; and (iii) business process enhancement, which refers to modifying an existing business process model based on an event log of the same business process [5].
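As a toy illustration of this case-ID/activity/timestamp structure, the following sketch builds a miniature event log in pandas and measures the waiting (idle) time between consecutive activities, the quantity used later to locate bottlenecks. The data and column names are hypothetical, not the chapter's:

```python
import pandas as pd

# A toy event log: every event carries a case ID, an activity name,
# and a timestamp (hypothetical data, for illustration only).
log = pd.DataFrame({
    "case_id":  [1, 1, 1, 2, 2, 2],
    "activity": ["blending", "weaving", "final shape",
                 "blending", "weaving", "final shape"],
    "timestamp": pd.to_datetime([
        "2019-01-02 08:00", "2019-01-05 09:30", "2019-01-06 14:00",
        "2019-01-03 10:00", "2019-01-09 11:15", "2019-01-10 16:45",
    ]),
})

# Sort each case by time, then measure the waiting (idle) time between
# consecutive activities -- long gaps point to candidate bottlenecks.
log = log.sort_values(["case_id", "timestamp"])
log["idle"] = log.groupby("case_id")["timestamp"].diff()

# Mean idle time preceding each activity, across all cases
print(log.groupby("activity")["idle"].mean())
```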

Process mining is widely utilized in healthcare, business analysis, and service activities [6]. However, in industry, process mining is still working to gain as good a position as machine learning and data mining, which have already been applied for decades. Process mining is built on algorithms from data mining and machine learning [7]; lately, soft computing has also been successfully applied, the best-known algorithm employed being the genetic algorithm [8]. Many algorithms are plugged into the ProM software, such as the Fuzzy miner [9], the Heuristic miner [10], and the Genetic miner [11]. However, some algorithms are unable to provide high enough performance to confirm whether there is a significant bottleneck in the process, to build tracks among activities, or to improve the process. Some soft computing algorithms can predict a proper sequence for the activities in a process, calculate exact frequencies, and calculate the exact idle time between activities in the same process mining diagram. Data in the industrial field includes many processes that need closer attention; this cannot be done by studying the whole system, but it can be done by clustering the data into a number of homogeneous splits.

If all the required data is uploaded into the process mining algorithms, the data is extracted and preprocessed, and the event logs are constructed. An event data log is a group of events that follow the same business process; it is the implicit data type upon which all process mining algorithms are built. Once the event data log is constructed, process mining techniques can be applied to analyze the data.

As a field, process mining is still relatively young. To delimit it, one can say it is bounded by soft computing, machine learning, and data mining on one side and by process analysis and process modeling on the other [4]. It has been adopted by various industries over the last years. Some examples of process mining software are Celonis PI and Fluxicon Disco. Disco is the more straightforward software, and it can accept big data depending on the license provided by the company; using it does not require much knowledge to interpret the results. There is no previous use of data mining in the textile field, especially for this case study. ProM has many algorithms plugged in, but not all of them work well with this data.

This study presents an evaluation method based on process mining to overcome bottlenecks in production processes. The method is based on a defined framework, guidelines, and methodologies for process mining projects [12, 13] and adopts a question-driven process mining project (the management is asked to confirm and suggest changes in the process). It provides the main stages of a production process evaluation to improve the understandability and usability of process mining in unstructured processes for non-experts [14]. The proposed process mining methodology seeks ways of displaying simple processes by splitting the data into smaller, similar datasets and focusing on specific similarity clusters.



**Table 1.**

*Statistics from Event-Logs for whole process.*

#### **2. Data exploration using process mining**

Data is collected regarding cases, activities, units, shifts, and timeframes. In this process, the event data is automatically extracted from the processes of the company job shop. Because process mining had not been applied before in textile production, especially in the country of the case study, we spent 6 months waiting to obtain this data in the form of event logs. It contains information about 443 events, 33 cases, and 14 activities from the real production process. The data is stored as CSV Excel sheets and flat files. The real-world event log is formatted as MXML (Mining eXtensible Markup Language) when the ProM tools are used [15], and as a CSV Excel sheet when Disco is applied. **Table 1** shows the statistics of the event logs after process mining is applied.

The method is proposed to control, monitor, and motivate the improvement of textile production processes and is based on the principles of the goal-driven process mining project [12]. The ultimate goal of the study is to investigate the applicability of process mining and its potential in improving the process of textile production.

#### **3. Methodology**

Agile project management has been followed to build the methodology for improving the production process. The short-term goal was to improve the current production process by removing any unnecessary work and extracting the knowledge needed to manage the process in the job shop. The problem was defined and, based on this, data in the form of event logs was collected to fit process mining applications. Data was collected from the company archive and from current job shop processes. The collected data was heterogeneous, with many missing values and high dimensionality; it was filtered and preprocessed for use in Disco and ProM. Disco offers professional visualization and analysis, but it provides only one option for analyzing the data. Many algorithms from data mining, machine learning, soft computing, and fuzzy logic are plugged into the ProM platform; in addition, this platform has open access for plugging in any algorithm updates in process mining. The data format was converted to be accepted by ProM. Sequence clustering based on Markov chains in ProM was used to split the original data into three clusters. Since ProM has only basic visualization, the clustered data was returned to Disco to get more details about the bottlenecks in the process by analyzing each cluster. By comparing the results from the clustered data with the original data, an initial evaluation was built for the management. The main issue was in the weaving processes, which showed massive bottlenecks. Using agile project management and lean principles, some features were chosen to change management and improve the process. The lean principles that best fit the job shop were selected, such as cellular manufacturing and VSM in the production line, and leadership training for the supervisors of the production line. We could apply VSM to visualize the process, but we found some limits to VSM; these limits were overcome by process mining when visualizing the whole process and extracting the main bottlenecks. We worked hard to close the project on time, and after 5 months we obtained improvement in the process. The methodology development is presented in **Figure 1**.

#### **4. Data preprocessing and analyzing**

Many data mining algorithms, such as K-means and mean-shift clustering, are applied to cluster data. In process mining, however, some activities follow one track on the map; this agreement is called a variant. It is not exactly like the clustering technique in data mining, but it can be used to solve the problem based on homogeneous activities. The data has 25 variants, based on the number of activity paths in the whole map. The original period chosen for collecting this data was 5 months, from 01/01/2019 to 05/31/2019, but to provide an understandable and appropriate map, the process mining algorithms generated the process over two years and 93 days, beginning on 01/01/2019 and finishing on 06/05/2021. Variant 2, which has 11 activities, recorded the lowest time duration (5 hours, 55 minutes), whereas the highest time duration was recorded by variants 7 and 10, which have 14 and 16 activities, respectively; both have the same time duration of 2 years and 93 days.
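As a rough sketch of how variants arise from an event log (reusing the toy `log` frame from the introduction; the data is illustrative, not the chapter's), each case's activities are concatenated in time order, and identical paths are counted as one variant:

```python
# A variant is one full ordered path of activities through a case.
traces = (log.sort_values("timestamp")
             .groupby("case_id")["activity"]
             .agg(tuple))                       # one activity path per case
variants = traces.value_counts()               # identical paths form a variant
durations = (log.groupby("case_id")["timestamp"]
                .agg(lambda t: t.max() - t.min()))  # per-case elapsed time
print(variants)
print(durations.groupby(traces).max())         # longest duration per variant
```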

Process mining uses event logs to explain the real behavior in the conduction of the business process. Event logs consist of a set of traces that represent the process


**Figure 2.**

*Time duration for each variant.*

instances recorded in the information system. Each trace contains one activity track and is therefore called a process case. A case is typically traceable by one or more fields known as case identifiers. Process mining identifies which activities have a high frequency and a long idle time.

Each track or path includes some activities, and each activity takes a specific time to be completed before the next one starts. **Figure 2** shows the variants and their time durations.

The results of process mining using Disco are expressed in 25 variants; each variant represents one full path or track for a group of events or activities on the map (the activities comprise parts of the production steps). Each variant consists of a particular set of activities and a specific time duration.

Most of the frequency goes to the weaving activity, which, based on the management's answers, contains the main bottleneck in the whole process mining map. Process mining cannot provide a complete solution to the management, but it can provide explicit detail about what and where the problem is in the real process. Workers in the real system and data analysts need to work together to remove bottlenecks and improve the work process. **Table 2** shows the frequency of each activity.

The frequency shows that the most recurrent process is the weaving process, with an absolute frequency of 162; the only important absolute frequency is located at the weaving activity. In terms of case frequency, however, there are several important activities, such as weaving, final shape, and drawing.

Furthermore, the bottleneck is fully exposed in the weaving tracks and the weaving activity, with a total duration reaching 66.1 months.

On the other hand, bottlenecks are exposed in most tracks with huge time durations. The initial conclusion is that there is an abnormal activity in the weaving process, and the questions that should be answered by management are what the work efficiency and machine productivity in the textile department are and, moreover, how the textile process relates to the rest of the processes in the whole production process. The performance mode in Disco measures the time that is not utilized in the process (idle time), because the process waits until the next activity starts its job. In the total duration view, the total duration between the same activities in the textile process is 66.1 months. This is a significant bottleneck exposed in one weaving


#### **Table 2.**

*Activity and its frequency in the process (these are all activities in the process ranked based on the highest frequency).*

activity from the total duration performance. However, by changing the mode from total duration to mean duration, bottlenecks were exposed in three locations: a small bottleneck between sample testing and washing with a total duration of 39.7 weeks, a bottleneck between washing and silver package with a total duration of 51.4 weeks, and a bottleneck between blending and reeling with a total duration of 43.7 weeks.

There is a difference between total duration performance and mean duration performance. In the mean duration view, no bottleneck is exposed on the weaving activity, because the event log data was clarified based on the clusters in order to compare the clusters instead of the performance duration. In any case, there was little doubt, because all results and bottlenecks exposed in the process mapping were sent to the management to provide final suggestions and details about the process, in order to solve the problem, improve productivity, and estimate the next step for improving the process.
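The difference between the two performance modes can be sketched on the same toy `log` frame from the introduction: for each pair of consecutive activities, the idle time can be aggregated both as a total and as a mean, mirroring Disco's total-duration and mean-duration views (column names are ours):

```python
# Aggregate the waiting time between each pair of consecutive activities,
# both as a total and as a mean -- mirroring Disco's two performance modes.
log["prev_activity"] = log.groupby("case_id")["activity"].shift()
pairs = log.dropna(subset=["prev_activity"])
perf = pairs.groupby(["prev_activity", "activity"])["idle"].agg(["sum", "mean"])
print(perf.sort_values("sum", ascending=False))  # biggest total-duration gaps first
```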

#### **5. Sequence clustering**

Sequence clustering is a machine learning technique that takes a number of sequences and gathers them into clusters so that each cluster groups similar sequences. The development of these methods is an active area of research, especially in connection with challenges in the fields of bioinformatics [16] and healthcare [17]. A simple sequence clustering algorithm is based on first-order Markov chains [18]. In this algorithm, every cluster is associated with a first-order Markov chain, where the current state depends only on the previous state. The probability that an observed sequence belongs to a given cluster is, in effect, the probability that the observed sequence was generated by the Markov chain associated with that cluster [19].

For a sequence *x* = {*x*<sub>0</sub>, *x*<sub>1</sub>, *x*<sub>2</sub>, …, *x*<sub>*L*−1</sub>} of length *L*, this can be simply expressed as:


$$P(\mathbf{x} \mid c_k) = P(x_0 \mid c_k) \cdot \prod_{i=1}^{L-1} P(x_i \mid x_{i-1}, c_k) \tag{1}$$

*P*(*x*<sub>0</sub> | *c<sub>k</sub>*) refers to the probability that *x*<sub>0</sub> occurs as the first state of the Markov chain associated with cluster *c<sub>k</sub>*, and *P*(*x<sub>i</sub>* | *x*<sub>*i*−1</sub>, *c<sub>k</sub>*) refers to the transition probability from state *x*<sub>*i*−1</sub> to state *x<sub>i</sub>* in the same Markov chain. Given the means to calculate *P*(*x* | *c<sub>k</sub>*), the sequence clustering algorithm can be carried out as an extension of the well-known Expectation-Maximization algorithm [20]: each sequence is (re)assigned to the cluster whose Markov chain gives it the highest probability, and each cluster's initial-state and transition probabilities are then re-estimated from its assigned sequences, iterating until the assignments stabilize.
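A minimal hard-assignment version of this EM procedure can be sketched as follows (the function name, Laplace smoothing, and hard E-step are our simplifications; ProM's implementation may differ):

```python
import numpy as np

def markov_sequence_clustering(seqs, n_states, K=3, iters=20, seed=0):
    """Minimal hard-EM sketch of first-order Markov chain clustering.

    seqs: list of integer sequences (states coded 0..n_states-1).
    Returns a cluster assignment per sequence.
    """
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, K, size=len(seqs))          # random init

    for _ in range(iters):
        # M-step: re-estimate initial and transition probabilities
        init = np.ones((K, n_states))                    # Laplace smoothing
        trans = np.ones((K, n_states, n_states))
        for s, k in zip(seqs, assign):
            init[k, s[0]] += 1
            for a, b in zip(s[:-1], s[1:]):
                trans[k, a, b] += 1
        init /= init.sum(axis=1, keepdims=True)
        trans /= trans.sum(axis=2, keepdims=True)

        # E-step: assign each sequence to its most probable chain (Eq. 1)
        for i, s in enumerate(seqs):
            loglik = np.log(init[:, s[0]])
            for a, b in zip(s[:-1], s[1:]):
                loglik += np.log(trans[:, a, b])
            assign[i] = int(np.argmax(loglik))
    return assign
```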


The clustering produces a solution based on three splits of the original dataset: clusters 0, 1, and 2. Cluster 0 includes 164 events, cluster 1 has 117 events, and cluster 2 includes 229 events; the sum of these events is 510. However, 510 is not equal to the original total of 443 events, because the clusters are not pure splits: some events recur in more than one cluster. The main map is generated from Disco process mining and the Markov chain in ProM; the clusters from the Markov chain are:

Cluster 0 includes 11 instances (**Table 3** and **Figure 3**).

Cluster 1 includes 5 instances (**Table 4** and **Figure 4**).

Cluster 2 includes 15 instances (**Table 5**, **Figure 5**).


#### **Table 3.** *Instances, events, and cases in cluster 0.*

**Figure 3.**

*Cluster 0 results.*


#### **Table 4.**

*Instances, events, and cases in cluster 1.*

**Figure 4.** *Cluster 1 results.*



#### **Table 5.**

*Instances, events, and cases in cluster 2.*

**Figure 5.** *Cluster 2 results.*

#### **6. Discussion**

Clustering using Markov chains in ProM is not clear enough to find bottlenecks or exact idle times. To solve this problem, the clustered data was moved to another software, Disco, treating each cluster separately in order to extract more information that can help improve the current processes; Disco can give full details about each cluster by mapping the activities and providing full process details.

In cluster 0, the largest idle time is from assembly winding to weaving, with a total duration of 61.3 days. The highest relative frequency, 22.09%, is recorded for the weaving activity.

In cluster 1, the largest idle time is from silver package to weaving, with a total duration of 52.1 weeks. The highest relative frequency, 28.21%, is recorded for the weaving activity.

In cluster 2, the largest idle time is from weaving to weaving, with a total duration of 38 months, and the highest relative frequency, 40.61%, is recorded for the weaving activity. The time from blending to sample testing is also huge, but the management said that this time is reasonable because the tests are not always required if production keeps the same design.

Clustering plays a vital role in extracting more information by splitting the data into homogeneous groups. Disco does not have a clustering technique, but it can show the impact and the bottlenecks by using variants. These variants are close to clusters, with more details, because they give details for each track or path within one variant that includes many activities, depending on how much data is uploaded into the system.

Taking the data clustered with sequence clustering in ProM and uploading it to Disco provides more details: cluster 0 has no bottleneck based on total duration performance. Cluster 1 includes a long idle time, reaching 52.1 weeks, between silver package and weaving; besides that, other occasional bottlenecks do not affect the process. In cluster 2, however, the algorithm failed to match all activities in one integrated path because the data in this case included a high-dimensional clustered dataset; in this split there is a bottleneck between blending and sample testing, with an idle time of 51.2 weeks, but the considerable bottleneck is between weaving and weaving, whose idle time is 38 months for the same activity.

#### **7. Process improvement**

We applied the changes and allowed time to evaluate their effect on the process, as presented in **Table 6**.

The scope of the project focused on the main bottlenecks in the weaving processes. The deliverables of the project were the optimization of the main bottlenecks by considering a group of factors to improve the production process: improving workers' efficiency, reducing lead time, reducing cycle time, and improving quality. All training was planned to be done in 1 month in order to start seeing improvement in the process. The people who work in the job shop managed to work on some tasks according to agile project management until completing the final project. In the brainstorming meeting, the team scheduled leadership training for worker efficiency; reduction of lead time and cycle time by applying VSM and Kaizen events; and productivity and machine management through cellular manufacturing and visual management.

Agile project management helped a lot to finish the training in 1 month by prioritizing the most important features to finish on time. The changes in the system were measured by the team to confirm the improvements: the workers' performance improved from 59 to 96% after 3 months, lead time was optimized from days and weeks to days and hours, productivity improved from 65 to 91%, product quality improved by 80% through the building of a quality assurance system, and the cycle time for the whole process improved to 91%.



#### **Table 6.**

*Apply change management to improve the process.*

#### **8. Conclusion**

Process mining was applied to improve industrial productivity by optimizing bottlenecks. Disco was used to analyze the original data and found a big bottleneck in the process, so we decided to extract more information by splitting the original dataset. Disco has only one option for analyzing data, but there are many options available in ProM. Each software has specific requirements for accepting data, so the data format was converted to fit ProM. Sequence clustering based on Markov chains was used to split the data. Because ProM has poor visualization, the clustered data was moved to Disco to visualize and analyze each cluster. By analyzing the clusters, huge bottlenecks were found in the weaving processes. Agile project management was adopted to tackle these bottlenecks by considering factors that can play a vital role in improving the process, and the team received internal training in lean principles to manage the project tasks. After 5 months, the improvement in the system was measured by the team: the cycle time for the whole process improved to 91%, the workers' performance improved to 96%, productivity improved to 91%, product quality improved by 85%, and lead time was optimized from days and weeks to hours.


### **Author details**

Hamza Saad

Technology, Art and Design Department, Bemidji State University, MN, USA

\*Address all correspondence to: hamza.saad@bemidjistate.edu

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] Loyola-González O. Process mining: software comparison, trends, and challenges. International Journal of Data Science and Analytics. 2023;**15**(4):407-420

[2] Grisold T, Wurm B, Mendling J, Vom Brocke J. Using process mining to support theorizing about change in organizations. In: HICSS. 7 Jan 2020. pp. 1-10

[3] Cook JE, Wolf AL. Automating process discovery through event-data analysis. In: Proceedings of the 17th International Conference on Software Engineering. 23 Apr 1995. pp. 73-82

[4] Van der Aalst W. Process Mining: Discovery, Conformance, and Enhancement of Business Processes. Berlin: Springer Verlag; 2011a. ISBN 978-3-642-19344-6

[5] van Zelst SJ, Mannhardt F, de Leoni M, Koschmider A. Event abstraction in process mining: Literature review and taxonomy. Granular Computing. Jul 2021;**6**:719-736

[6] Liu C, Zhang J, Li G, Gao S, Zeng Q. A two-layered framework for the discovery of software behavior: A case study. IEICE TRANSACTIONS on Information and Systems. 2018;**101**(8):2005-2014

[7] Leemans SJ, van Zelst SJ, Lu X. Partial-order-based process mining: A survey and outlook. Knowledge and Information Systems. 2023;**65**(1):1-29

[8] Van der Aalst W, Reijers H, Song M. Discovering social networks from event logs. Computer Supported Cooperative Work. 2005;**14**(6):549-593

[9] Günther CW, Van Der Aalst WM. Fuzzy mining–adaptive process

simplification based on multi-perspective metrics. In: International Conference on Business Process Management. Berlin, Heidelberg: Springer; 2007. pp. 328-343

[10] Weijters AJ, Van der Aalst WM. Rediscovering workflow models from event-based data using little thumb. Integrated Computer-Aided Engineering. 1 Jan 2003;**10**(2):151-162

[11] Van Der Aalst W. Process mining: Overview and opportunities. ACM Transactions on Management Information Systems (TMIS). 2012;**3**(2):1-17

[12] Van der Aalst WM. Process Mining: Data Science in Action. Berlin/ Heidelberg, Germany: Springer; 2016

[13] Van Eck ML, Lu X, Leemans SJ, Van der Aalst WM. A Process Mining Project Methodology. In: International Conference on Advanced Information Systems Engineering. Berlin/Heidelberg, Germany: Springer; 2015. pp. 297-313

[14] Qafari MS, Van der Aalst W. Case Level Counterfactual Reasoning in Process Mining. arXiv preprint arXiv:2102.13490; 2021

[15] Van der Aalst WM, Van Dongen BF, Herbst J, Maruster L, Schimm G, Weijters AJ. Workflow mining: A survey of issues and approaches. Data & Knowledge Engineering. 2003;**47**(2):237-267

[16] Chen Y, Reilly KD, Sprague AP, Guan Z. SEQOPTICS: A protein sequence clustering system. BMC Bioinformatics. Dec 2006;**7**(4):1-9

[17] Pika A, Wynn MT, Budiono S, Ter Hofstede AH, van der Aalst WM, Reijers HA. Privacy-preserving process mining in healthcare. International Journal of Environmental Research and Public Health. 2020;**17**(5):1612

[18] Cadez I, Heckerman D, Meek C, Smyth P, White S. Model-based clustering and visualization of navigation patterns on a web site. Data Mining and Knowledge Discovery. 2003;**7**(4):399-424

[19] Ferreira D, Zacarias M, Malheiros M, Ferreira P. Approaching process mining with sequence clustering: Experiments and findings. In: Business Process Management: 5th International Conference, BPM 2007, Brisbane, Australia, September 24-28, 2007. Proceedings 5. Berlin Heidelberg: Springer; 2007. pp. 360-374

[20] Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological). 1977;**39**(1):1-22

### *Edited by Yves Rybarczyk*

For contemporary societies, data mining has emerged as a serious challenge. Thanks to more advanced analytical tools, the Big Data explosion has enabled businesses to assess their performance more thoroughly and accurately. For example, transitioning from using a basic spreadsheet to using data lake modeling offers more flexibility in terms of consulting and summarizing vast amounts of data from many business angles. Data mining, which is the foundation for this optimization of data analysis, has been strengthened by artificial intelligence and machine learning to find patterns in this deluge of data and build future prediction models, turning it into a critical tool for decision-making. This book provides an understanding of the most modern techniques and uses for data mining. It examines data mining in order to classify datasets, predict outcomes, and optimize analyses. Furthermore, the book demonstrates these technological developments by highlighting relevant applications of data mining in industry, biology, education, medicine, and health.

*Andries Engelbrecht, Artificial Intelligence Series Editor*

Published in London, UK © 2024 IntechOpen © your\_photo / iStock

Research Advances in Data Mining Techniques and Applications

IntechOpen Series

Artificial Intelligence, Volume 25
