**3. Genetic algorithm for document clustering**

#### **3.1 Individuals**

The population consists of a set of individuals, each of which is a linear chromosome represented as a tree (a hierarchical structure). An individual is built as a binary tree whose top level *clusters all documents*, where each document is described by a *feature vector*. The vector contains the weighted frequencies of the stemmed terms selected by the document processing scheme [4]. This representation is evolved so that the chromosome undergoes genetic changes and finds the groups ("clusters") that best fit all the documents of the IRS. The root node holds the fitness function *(fitness)*, which measures the quality of the resulting clustering. Depending on the number of documents to be processed and the depth (height) of the tree to be created, the chromosome may be of variable length.

Figure 3 shows the initial generation (generation 0). A tree-based representation is adopted because it allows sufficiently complex logical structures to be encoded within a chromosome. The search space of the GA is the space of all possible trees that can be generated from the full set of relevant functions and terminals. In this way individuals of various shapes and sizes can evolve [8], leaving evolution to decide which settings are best.

Although the initial population is random, a defined set of parameters governs the creation of its individuals. For example, *the initial set must not contain two equal individuals*; production rules are created to ensure compliance with this condition. These rules require that the nodes of each individual be built in preorder.

Fig. 3. Initial Population of Individuals GA (generation "0")
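The construction described above can be sketched as follows. This is a minimal illustration, not the authors' production rules: it assumes documents are numbered from 1, that `0` marks a function (internal) node as in the chains of Figure 5, and that duplicates are avoided by simple rejection. The names `random_preorder_tree` and `initial_population` are hypothetical.

```python
import random

FUNCTION_NODE = 0  # marker for an internal (function) node; documents are 1..N


def random_preorder_tree(docs, rng):
    """Return the preorder chain of a random binary tree whose leaves are `docs`."""
    if len(docs) == 1:
        return [docs[0]]
    # Split the documents so that each subtree receives at least one of them.
    split = rng.randint(1, len(docs) - 1)
    left = random_preorder_tree(docs[:split], rng)
    right = random_preorder_tree(docs[split:], rng)
    return [FUNCTION_NODE] + left + right


def initial_population(n_docs, size, seed=0):
    """Build `size` distinct individuals over documents 1..n_docs.

    Assumes `size` does not exceed the number of distinct trees,
    otherwise the rejection loop would never finish.
    """
    rng = random.Random(seed)
    population, seen = [], set()
    while len(population) < size:
        docs = list(range(1, n_docs + 1))
        rng.shuffle(docs)
        chain = tuple(random_preorder_tree(docs, rng))
        if chain not in seen:  # no two equal individuals in generation 0
            seen.add(chain)
            population.append(list(chain))
    return population
```

Every chain produced this way has N leaves and N - 1 function nodes, so its length is 2N - 1.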

#### **3.2 Production operators**

Fig. 2. Experimental environment used in the tests with the GA.

The production operators are applied to each new generation. One or two individuals are taken to produce new individuals for the next generation by applying the transformations imposed by the operator. Both the mutation and the crossover operators are implemented and applied interchangeably. Each depends on a mutation and/or crossover probability assigned to the GA [7].

A *mutation operator* is applied to nodes (documents): an individual is selected from the population by the tournament method, a pair of its terminal nodes is chosen at random, and a new individual is generated by transposing the chosen nodes (see Figure 4).

Fig. 4. Basic mutation operator applied to terminal nodes
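A minimal sketch of this transposition, assuming the chromosome is stored as a preorder chain in which `0` is a function node and any other value is a document (terminal node). The tournament step that picks the individual is omitted, and the function name `mutate_swap_terminals` is hypothetical.

```python
import random


def mutate_swap_terminals(chain, rng):
    """Return a copy of `chain` with two randomly chosen terminal nodes transposed."""
    terminals = [i for i, gene in enumerate(chain) if gene != 0]
    i, j = rng.sample(terminals, 2)  # two distinct terminal positions
    child = list(chain)
    child[i], child[j] = child[j], child[i]
    return child


parent = [0, 0, 0, 1, 3, 0, 2, 5, 4]
child = mutate_swap_terminals(parent, random.Random(1))
```

Because only terminals are exchanged, the shape of the tree is unchanged and every document still appears exactly once.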


For the crossover operator, an operator based on *mask crossover* [9] is applied. It selects two parent individuals by the tournament method and randomly chooses the chromosome of one of them to serve as the *"crossover mask of the selected individual"*. The crossing is carried out by analyzing the chromosomes of both parents. If both chromosomes have a function node (node 0), the node of the chosen parent's mask is kept; if instead documents are found in the chromosomes of both parents, the document of the *"non-elected"* parent is selected and used as a pivot on the *"elected"* parent (the mask) to perform the crossing, interchanging the corresponding nodes within the chromosome of that parent. This *creates a new individual* and ensures that the resulting chromosome keeps the structural characteristics of the parents; however, the child is incorporated into the population only if its fitness is better than that of its parents (see Figure 5).

Fig. 5. Crossover operator (crossover mask). In the figure, Parent 2 is encoded by the chain 000530214 and Child 1 by the chain 000130254.

#### **3.3 Selection**

After the fitness of the population has been evaluated, the next step is chromosome selection. Selection embodies the principle of 'survival of the fittest' [5]. Chromosomes with satisfactory fitness are selected for reproduction; for this we apply tournament selection with a tournament size of 2, and we apply elitism in each generation [2].
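Binary tournament selection with elitism can be sketched as below. The sketch assumes that a lower fitness value is better, since the chapter minimizes its fitness; the number of elite individuals (`n_elite`) is a hypothetical parameter that the text does not specify.

```python
import random


def tournament_select(population, fitness, rng, k=2):
    """Draw k individuals at random and return the fittest (lowest fitness)."""
    contenders = rng.sample(population, k)
    return min(contenders, key=fitness)


def elite(population, fitness, n_elite=1):
    """Return the n_elite best individuals, to be copied unchanged into the next generation."""
    return sorted(population, key=fitness)[:n_elite]
```

With `k=2` each selection compares only two randomly drawn individuals, which keeps selection pressure low; elitism compensates by guaranteeing that the best individuals found so far are not lost.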

#### **4. Parameter control**

Given the size of the parameter set and the influence that small changes have on the behavior of the GA during the experiments [1], the choice of the parameter values to be used is a critical factor. To choose them, we observed how the GA performance indicators varied when the value of any parameter changed, specifically the evolution of the number of hits and the evolution of the *"fitness"*. These parameters are therefore very important, as they directly influence the performance of the GA [13]. They can be treated independently, but the overall performance of the algorithm does not depend exclusively on a single parameter but on a combination of all of them. Many researchers pay more attention to some parameters than to others, but most agree that the parameters that should be kept under control are: selection schemes, population size, genetic variation operators and their probability rates.

Because GAs have several **parameters** that must be chosen carefully to obtain good performance and avoid premature convergence, in our case, *after much testing*, we opted for parameter control and for strategies such as the following:

To control the *population size* we use the strategy called GAVaPS (Genetic Algorithm with Varying Population Size) proposed by Michalewicz [9], which uses the concepts of age and lifetime. When the first generation is created, all individuals are assigned an age of zero, marking their birth, and every time a new generation is born the age of each individual increases by one. When an individual is born it is also assigned a lifetime, which represents how long it will live within the population; the individual dies when it reaches that age. The lifetime of each individual depends on its fitness compared with the average fitness of the entire population. Thus, an individual with better fitness has more time to live, which gives it a greater chance of generating new individuals that carry its features. In our case, this strategy allows each generation to produce new individuals with similar characteristics.
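The age and lifetime bookkeeping can be illustrated with a short sketch. It is deliberately simplified: the lifetime bounds `MIN_LIFETIME` and `MAX_LIFETIME` and the two-level allocation rule are assumptions made for illustration, not the allocation strategy used by the authors, and a lower fitness is taken to be better.

```python
from dataclasses import dataclass

MIN_LIFETIME, MAX_LIFETIME = 1, 7  # hypothetical bounds, not taken from the chapter


@dataclass
class Individual:
    fitness: float  # lower is better (the fitness is minimized)
    age: int = 0    # every individual is born with age zero
    lifetime: int = 0


def assign_lifetime(ind, avg_fitness):
    """Give better-than-average individuals a longer lifetime (simple two-level rule)."""
    ind.lifetime = MAX_LIFETIME if ind.fitness < avg_fitness else MIN_LIFETIME


def next_generation(population):
    """Age every individual by one generation and drop those that reached their lifetime."""
    for ind in population:
        ind.age += 1
    return [ind for ind in population if ind.age < ind.lifetime]
```

In this sketch an individual with above-average quality survives several generations and can keep reproducing, while a poor one is removed after a single generation, so the population size varies over time.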

We therefore adopt this approach essentially for the best individuals of each generation and use it to maintain *elitism* in the following generations, which ensures a thorough intensification of the available search space while keeping those individuals during their lifetime [9]. However, to ensure diversity we *randomly* generate *the remaining individuals* in each generation. In this way we explore many different regions of the search space and strike a balance between intensification and diversity over the feasible regions.

In all cases the population size was set to 50 individuals for the experiments conducted with the samples, following the suggestion of [1], which advises a population size between *l* and *2l* in most practical applications, where *l* is the length of the chromosome. In our case, the chromosome length *l* is always equal to:

**2 \* (number of documents to cluster) - 1.**

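On the other hand, we use two measures in the fitness function to calculate the distance and the similarity between documents and so form better clusters (see Table 1).

| Measure | Function |
|---|---|
| Euclidean distance | $d_{ij} = \sqrt{\sum_{k=1}^{t} (x_{ik} - x_{jk})^2}$ |
| Pearson correlation coefficient (similarity) | $r = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}_i}{\sigma_{x_i}}\right) \left(\frac{x_j - \bar{x}_j}{\sigma_{x_j}}\right)$ |
| Global fitness | $\min \left( \alpha \, \text{Distance}(\text{Documents}_i) + (1 - \alpha) \, \frac{1}{\text{Similarity}(\text{Documents}_i)} \right)$ |

Table 1. Measures of the Function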
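To make the composite fitness concrete, the sketch below evaluates α · distance + (1 − α) · (1 / similarity) for a single pair of feature vectors, with the Euclidean distance as the distance and the Pearson correlation coefficient as the similarity. How the chapter aggregates these measures over the documents of a cluster is not stated, so the pairwise form and the function names are assumptions; the default α = 0.85 is the value the chapter reports as best.

```python
import math
from statistics import fmean, pstdev


def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation using population standard deviations (1/n form)."""
    mx, my = fmean(x), fmean(y)
    sx, sy = pstdev(x), pstdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / len(x)


def pair_fitness(x, y, alpha=0.85):
    """Weighted sum of distance and inverse similarity; lower values are better."""
    return alpha * euclidean(x, y) + (1 - alpha) * (1 / pearson(x, y))
```

The sketch does not guard against degenerate inputs: a constant vector has zero standard deviation, and a correlation of zero makes the inverse similarity undefined, so a practical implementation would need to handle both cases.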


Here $x_i$ and $x_j$ are the feature vectors of the documents being grouped, $n$ is the number of examples, $\sigma_{x_i}$ and $\sigma_{x_j}$ are the standard deviations of $x_i$ and $x_j$, and α is the parameter that adjusts the balance between distance and similarity. The fitness function is used to minimize the distance between the documents and maximize the similarity between them.

Therefore, in our experimental environment we used samples of "very few (20), few (50), many (80) and enough (150)" documents, with the requirement that they belonged to only two categories of the Reuters or Editorials collections. Each sample was processed with five different seeds, and each result was compared with the *"Kmeans"* method. Each experiment was then repeated while varying the probability rates of the genetic operators, using all the parameters shown in Table 2, until we found the value of α that best fits the two metrics combined in our fitness function.


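| Parameters | Values |
|---|---|
| Number of evaluations (generations) | 5000 maximum |
| Population size (number of trees) | 50 |
| Tournament size | 2 |
| Mutation probability (Pm) | 0.01, 0.03, 0.05, 0.07, 0.1, 0.3, 0.5, 0.7 |
| Crossover probability (Pc) | 0.70, 0.75, 0.80, 0.85, 0.90, 0.95 |
| Document quantity | Very few, few, many, enough |
| α coefficient | 0.85 (best value found) |
| Depth threshold | 7 / 10 |

Table 2. Parameters taken into consideration for the Genetic algorithm with composite function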
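As an illustration of the size of this experimental design, the sketch below enumerates the runs that would result if every combination of the Table 2 values were executed with five seeds. Whether the authors ran the full grid is not stated; the seed values and the dictionary keys are hypothetical.

```python
from itertools import product

MUTATION_RATES = [0.01, 0.03, 0.05, 0.07, 0.1, 0.3, 0.5, 0.7]
CROSSOVER_RATES = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
SAMPLE_SIZES = [20, 50, 80, 150]  # very few, few, many, enough
SEEDS = [1, 2, 3, 4, 5]           # hypothetical seed values


def configurations():
    """Yield one parameter dictionary per (sample, Pm, Pc, seed) combination."""
    for n_docs, pm, pc, seed in product(SAMPLE_SIZES, MUTATION_RATES,
                                        CROSSOVER_RATES, SEEDS):
        yield {
            "n_docs": n_docs,
            "pm": pm,
            "pc": pc,
            "seed": seed,
            "alpha": 0.85,
            "population_size": 50,
            "tournament_size": 2,
            "max_generations": 5000,
            # Depth threshold: 7 for very few/few documents, 10 for many/enough.
            "depth_threshold": 7 if n_docs <= 50 else 10,
        }
```

Under these assumptions the full grid contains 4 × 8 × 6 × 5 = 960 runs.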

#### **4.1 Studies to determine the value of α in the GA**

We used distribution 21 of the Reuters collection, because it has the greatest dispersion across its documents, and applied the GA while varying the value of α in each test with the usual parameters, always with the aim of testing the effectiveness of the GA. We analyzed the relationship between the fitness and the value of α using the values in Table 2 (the results are shown in Table 3 and Figure 6).

In Figure 6 we can see that the fitness values are more dispersed for α above 0.85, owing to the larger contribution of the Euclidean distance, which makes the fitness less sensitive when finding the clusters. The results suggest that a value of α close to 0.85 gives better results, because it is more effective in terms of the number of hits and yields a better fitness of the algorithm. This was corroborated with another distribution.




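| Documents | α | Generation | Best fitness | Average fitness | Hits | Effectiveness (%) |
|---|---|---|---|---|---|---|
| 20 | 0.75 | 1436 | 0.25291551 | 0.46489675 | 15 | 75.0 |
| 20 | 0.80 | 1592 | 0.20298477 | 0.47026890 | 16 | 80.0 |
| 20 | 0.85 | 2050 | 0.15255487 | 0.24504483 | 17 | 85.0 |
| 20 | 0.90 | 3694 | 0.15266796 | 0.25909582 | 17 | 85.0 |
| 20 | 0.95 | 1520 | 0.15319261 | 0.24596829 | 17 | 85.0 |
| 50 | 0.75 | 3476 | 0.25290429 | 0.28744261 | 35 | 70.0 |
| 50 | 0.80 | 3492 | 0.20285265 | 0.27862528 | 36 | 72.0 |
| 50 | 0.85 | 3355 | 0.15312467 | 0.29128428 | 36 | 72.0 |
| 50 | 0.90 | 2256 | 0.15318358 | 0.28347470 | 36 | 72.0 |
| 50 | 0.95 | 2222 | 0.15345986 | 0.27863789 | 36 | 72.0 |
| 80 | 0.75 | 3049 | 0.25704660 | 0.36871676 | 61 | 76.2 |
| 80 | 0.80 | 1371 | 0.20782096 | 0.33303315 | 61 | 76.2 |
| 80 | 0.85 | 2131 | 0.15784449 | 0.34447947 | 62 | 77.5 |
| 80 | 0.90 | 1649 | 0.15815252 | 0.32398087 | 62 | 77.5 |
| 80 | 0.95 | 2986 | 0.17796620 | 0.36009861 | 61 | 76.2 |
| 150 | 0.75 | 2279 | 0.26194273 | 0.29866150 | 91 | 60.6 |
| 150 | 0.80 | 1273 | 0.20636391 | 0.22933754 | 93 | 62.0 |
| 150 | 0.85 | 3257 | 0.15468909 | 0.27518240 | 94 | 62.6 |
| 150 | 0.90 | 1136 | 0.25482251 | 0.28218144 | 94 | 62.6 |
| 150 | 0.95 | 2452 | 0.25456480 | 0.26788158 | 91 | 60.6 |
| 250 | 0.75 | 3617 | 0.25754282 | 0.31144435 | 120 | 48.0 |
| 250 | 0.80 | 3274 | 0.20844638 | 0.25112189 | 121 | 48.4 |
| 250 | 0.85 | 3066 | 0.15805103 | 0.19299910 | 121 | 48.4 |
| 250 | 0.90 | 2343 | 0.20634355 | 0.20432140 | 121 | 48.4 |
| 250 | 0.95 | 2047 | 0.25541276 | 0.27844937 | 120 | 48.0 |

Table 3. Results of tests with the GA, taking different samples of documents from distribution 21 of the Reuters collection, to determine the best value for α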

Fig. 6. Best Fitness versus α values for different samples of documents of the Reuters Collection: Distribution 21
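The effectiveness values reported for the α study (Table 3) appear to be the number of hits expressed as a percentage of the sample size. This reading is inferred from the tabulated values rather than from a definition given in the text. As a worked check, with $N$ the number of documents in the sample:

```latex
\text{Effectiveness}(\%) = 100 \cdot \frac{\text{hits}}{N},
\qquad
100 \cdot \frac{17}{20} = 85.0,
\qquad
100 \cdot \frac{36}{50} = 72.0,
\qquad
100 \cdot \frac{120}{250} = 48.0
```

Under the same reading, 61 hits out of 80 documents give 76.25 %, which the table reports as 76.2, so the published figures seem to be truncated to one decimal place.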


#### **4.2 Tests to determine the value of the rate of mutation operator and crossover operator rate**

We began by analyzing the behavior of the system while varying the rate of the mutation operator over a wide range of values, so as to cover all possible situations, in experiments that used different samples of the Reuters distributions. For the mutation rate we considered the values 0.01, 0.03, 0.05, 0.07, 0.1, 0.3, 0.5 and 0.7, which allowed us to apply the mutation operator of the GA in different circumstances and study its behavior. To determine the optimal rate of the crossover operator, we covered the interval from 0.70 to 0.95; these values are high, but they are intended to apply the operator we designed frequently. Because an optimal value for the *mutation probability* is much more important than one for the crossover probability, we chose to study that rate in more detail in our experiments. The number of hits of the GA was used as the quality index of each operator.

As for the *size of the tournament*, the value 2 was chosen because the binary tournament has shown very good performance in a large number of applications of EAs. Although determining an optimal *fitness* function is not one of the fundamental objectives of this experiment, we have tried to combine in a single value two measures as powerful and as distinct as the Euclidean distance and the Pearson correlation coefficient (based on cosine similarity).

Therefore, to find the adjustment coefficient α, which governs the weight given to the distance and to the inverse of the similarity of the documents in a cluster, we carried out many controlled parameter tests in order to obtain a value that allows an adequate contribution of both metrics to the fitness, finally arriving at a value of α = 0.85.

The **maximum number of generations** was set to 5000, but this parameter may vary depending on the convergence of the algorithm. As for the **number of stemmed terms** used to represent the feature vector of each document, we used the terms selected through the NZIPF processing method [6][11].

Finally, we established a limit, called the **depth threshold**, for individuals (trees). For *"very few and few documents"* the threshold takes the value 7, and for *"many and enough documents"* it is set to 10. To analyze the results and verify their effectiveness, we compared the results of the GA with the real groups existing in the document collection [6], and also with another clustering algorithm of the supervised type run under optimal conditions (Kmeans). We analyzed the following:

a. **Cluster effectiveness:** the most important indicator when comparing results, as it reflects the quality of the cluster. We analyzed the hits achieved with the best fitness of the GA, and also the average scores over all executions of the GA.

b. **Fitness evolution:** we analyzed how the fitness evolved in each run, assessing its behavior and the hits of the GA when varying the probability rate.

c. **Convergence of the algorithm:** the point in the process at which the GA obtains the best fitness (best cluster).

Since the GA parameters directly affect the behavior of the fitness, before the experiments we performed a comprehensive analysis of all GA runs in order to determine its robustness and adjust each of its parameters. Finally, we experimentally used the parameters discussed in Table 1 and analyzed the behavior of the algorithm. Figure 7 shows the average number of hits returned by the GA for samples of 20, 80 and 150 documents as the mutation rate changes, together with the hit factor of the GA against the mutation rate. The best performance was obtained with a rate of 0.03; this result shows that the best average fitness could also be obtained with this rate. We corroborated this behavior with another collection.

Fig. 7. Average hits of the GA for samples of 20, 80 and 150 documents when varying the mutation rate, and hit factor of the GA.

In addition, we analyzed the effect of the crossover operator on the final results. Figure 8 shows the crossover rate versus the average number of hits for very few (20), many (80) and enough (150) documents, respectively, together with a comparative analysis of the hit factor of the GA when varying the crossover rate. It is clear that the GA performed better with a crossover rate of 0.80, regardless of the sample. This value therefore appears to be ideal for maximizing the efficiency of the algorithm, which is why we conclude that it is the rate that gives the best results.

Tune Up of a Genetic Algorithm to Group Documentary Collections 135

*Distribution 2, Reuters Collection 1. Categories: Acq and Earn.*

| Samples of documents | Best fitness | Effectiveness | Convergence | Average fitness | Fitness deviation | Average convergence | Kmeans |
|---|---|---|---|---|---|---|---|
| Very few documents (20 documents) | 0.155447570 | 85% (18 hits) | 886 | 0.15545476 | 0.00000828 | 1086 | 16.6 |
| Few documents (50 documents) | 0.156223280 | 94% (47 hits) | 3051 | 0.15624280 | 0.00002329 | 2641 | 45.8 |
| Many documents (80 documents) | 0.159009400 | 89% (71 hits) | 2500 | 0.15921181 | 0.00020587 | 2246 | 67.8 |
| Enough documents (150 documents) | 0.165013920 | 77% (115 hits) | 2342 | 0.16508519 | 0.00007452 | 2480 | 121.6 |
| More documents (246 documents) | 0.174112100 | 69% (170 hits) | 2203 | 0.17430502 | 0.00033602 | 2059 | 202.8 |

Table 4. Comparative results Evolutionary System with various samples of documents showing the best results and the average results of evaluations with the "Distribution 2" of the Reuters 21578 collection.

*Distribution 8, Reuters Collection 2. Categories: Acq and Earn.*

| Samples of documents | Best fitness | Effectiveness | Convergence | Average fitness | Fitness deviation | Average convergence | Kmeans |
|---|---|---|---|---|---|---|---|
| Very few documents (20 documents) | 0.151163560 | 85% (17 hits) | 555 | 0.15116356 | 0.00000000 | 679 | 15.8 |
| Few documents (50 documents) | 0.154856500 | 96% (48 hits) | 1615 | 0.15485650 | 0.00000000 | 1334 | 43.8 |
| Many documents (80 documents) | 0.157073880 | 85% (68 hits) | 746 | 0.15708362 | 0.00000898 | 1360 | 66.2 |
| Enough documents (150 documents) | 0.162035070 | 69.3% (104 hits) | 1989 | 0.16242664 | 0.00033091 | 2283 | 117.6 |
| More documents (188 documents) | 0.163014600 | 68.63% (129 hits) | 2293 | 0.16334198 | 0.00027325 | 1773 | 140.6 |

Table 5. Comparative results Evolutionary System with various samples of documents showing the best results and the average results of evaluations with the "Distribution 8" of the Reuters 21578 collection.

*Distribution 20, Reuters Collection 3. Categories: Acq and Earn.*

| Samples of documents | Best fitness | Effectiveness | Convergence | Average fitness | Fitness deviation | Average convergence | Kmeans |
|---|---|---|---|---|---|---|---|
| Very few documents (20 documents) | 0.153027060 | 85% (17 hits) | 1092 | 0.15321980 | 0.00018398 | 1108 | 16.8 |
| Few documents (50 documents) | 0.156198620 | 92% (46 hits) | 2173 | 0.15666137 | 0.00030077 | 2635 | 44.8 |
| Many documents (80 documents) | 0.158069980 | 81.25% (65 hits) | 2196 | 0.15810383 | 0.00001884 | 1739 | 66.8 |
| Enough documents (108 documents) | 0.159031080 | 69.4% (75 hits) | 1437 | 0.15927630 | 0.00026701 | 2636 | 82.2 |

Table 6. Comparative results Evolutionary System with various samples of documents showing the best results and the average results of evaluations with the "Distribution 20" of the Reuters 21578 collection.
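The effectiveness percentages reported for Distribution 20 are consistent with the ratio of hits to documents in each sample. A short check in Python, using the hit counts and sample sizes given for that distribution (the helper name `effectiveness` is ours, and the ratio is inferred from the reported figures rather than stated as a formula in the text):

```python
def effectiveness(hits, documents):
    """Percentage of documents in the sample that were grouped correctly."""
    if documents <= 0:
        raise ValueError("documents must be positive")
    return 100.0 * hits / documents


# (hits, sample size) pairs reported for Distribution 20.
distribution_20 = [(17, 20), (46, 50), (65, 80), (75, 108)]
percentages = [round(effectiveness(h, n), 2) for h, n in distribution_20]
```

The computed values agree with the 85%, 92%, 81.25% and 69.4% shown for that distribution.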

Fig. 8. Average hits of the GA for samples of 20, 80 and 150 documents when varying the crossover rate, and hit factor of the GA.

To corroborate the results of the GA, we compared them with those of the *Kmeans* algorithm, which was run on *the same samples*, receiving as input the number of groups to be obtained. This algorithm uses the Euclidean distance exclusively as its measure and, since the number of groups is supplied to it (in that sense it is supervised), the only parameter adjusted was the *number of groups to process*; *Kmeans* was therefore executed under optimal conditions. We found that *the average effectiveness* of the GA is very acceptable, being in most cases better than that of the supervised *Kmeans* algorithm [10] when using these mutation and crossover parameters, with the added advantage that the GA processed the documents in an unsupervised way, letting evolution perform the clustering with our adjustment. Figures 7 and 8 show this behavior graphically, including a comparison for each type of operator used in our experiments with the evolutionary algorithm proposed in this work.
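For reference, the baseline can be sketched as a plain Lloyd-style K-means that takes the number of groups as input and uses the Euclidean distance. This is a minimal illustration in pure Python, not the implementation used in the experiments. The initialization (first `k` vectors as centroids), the iteration cap and the function names are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, max_iter=100):
    """Cluster feature vectors into k groups; returns one label per vector."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    centroids = [list(v) for v in vectors[:k]]  # assumed initialization
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Here `k` is the only parameter the caller has to choose, which mirrors the way the baseline is described: the number of groups is supplied, and everything else follows from the distance measure.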

Tables 4, 5, 6 and 7 show the comparative results obtained with our algorithm, using the optimal mutation and crossover parameters, for the main distributions of the Reuters 21578 documentary collection.




Table 7. Comparative results of the evolutionary system with various samples of documents, showing the best results and the average results of the evaluations with "Distribution 21" of the Reuters 21578 collection.

These results are displayed graphically in Figure 9.

Fig. 9. Comparison of the results obtained with the composite function against Kmeans (four Reuters collections).

Finally, to corroborate the results, we compared them with those obtained on the other collection, in Spanish, which was processed in the same way using all the values of Table 2 (see Figure 10).

Fig. 10. Comparison of the results obtained with the composite function against Kmeans (Spanish collection).

**5. Conclusion**

In this study, we have proposed a new taxonomy of GA parameters, numerical and structural, and examined the effects of the numerical parameters on the performance of the algorithm in a GA-based simulation optimization application, using a test clustering problem. We start with the characteristics of the problem domain.

The main characteristic features of our problem domain are:

• There is a dominance of a set of decision variables with respect to the objective function value of the optimization problem: the objective function value is directly related to the combination of this dominant set of variables, with a value of α close to 0.85.
• The good solutions are highly dominant over other solutions with respect to the objective function value, but not significantly diverse among each other.

These properties of the problem domain generate a rapidly convergent behavior of the GA. According to our computational results, lower mutation rates give better performance. The GA mechanism creates a lock-in effect in the search space, hence lower mutation rates decrease the risk of premature convergence and provide diversification in the search space in this particular problem domain. Due to this dominance, the crossover operator does not have a significant impact on the performance of the GA. Moreover, starting with a seeded population generates more efficient results.

We can conclude that the GA had a favourable evolution, offering optimal document clusters in an acceptable and robust manner, based on a proper adjustment of the parameters. We found that *the average effectiveness* of the GA is very acceptable, being in most cases better than that of the supervised Kmeans algorithm, with the added advantage that the documents were processed in an unsupervised way, letting evolution perform the clustering with our adjustment. In our experiments, the best performance was obtained with a rate of 0.03 for the mutation operator and a rate of 0.80 for the crossover operator; these values appear to be ideal for maximizing the efficiency of the genetic algorithm.
