**2. Documentary base**

In this study we use two collections: the "Reuters 21578" collection and a Spanish documentary base that includes editorials of "El Mundo" from 2006 and 2007 in an open-access format.

The Reuters documentary base consists of real newswire stories that appeared on the Reuters service in 1987. This collection has become a standard within the domain of automatic document categorization and is used by many authors in the area. The collection consists of 21,578 documents distributed across 22 files. We developed a documentary process named NZIPF [6] [11] to generate the documentary vectors that feed the system.

The documentary process consists of several stages, each of which represents an operation carried out on the documents of the base in order to obtain documentary vectors more efficiently.

The first step is the *Filter* process, whose main objective is to select the documents of the documentary base so that only documents belonging to a single category are kept, which reduces the complexity of their subsequent treatment. Next, the *Zoning* process extracts the free text of each document. Then a *Stop List* process is applied: the terms of the document text are extracted and each word is compared against a list of empty words, eliminating those that carry no interest or lack meaning of their own. The remaining words then undergo a *stemming* process that cuts them down to their roots; in our case we implemented and used Porter's algorithm for English and an equivalent one for Spanish. At this point the *frequency* of the resulting terms is computed over all the documents of the base, in order to know which terms are most used in each document and, with this information, to select the most representative ones. The next step selects the terms with the greatest discriminatory power and normalizes them: we apply *Zipf's* law and calculate Goffman's transition point [3] and the transition area, which yields the terms of the documentary base. Finally, we assign weights using the *IDF* (Inverse Document Frequency) function developed by Salton [4], which uses the frequency of a word in the document. After all these processes we obtain the characteristic vectors of the documents in the collection.

The process is outlined in Figure 1.



Fig. 1. Documentary process conducted
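
As a rough illustration of this pipeline, the sketch below builds tf-idf document vectors with a toy stop list and a placeholder stemmer; the real NZIPF process (Porter stemming for English and Spanish, Zipf/Goffman term selection) is more elaborate than what is shown.

```python
# Minimal sketch of the vectorization steps described above (Stop List,
# stemming, term frequency, Salton's IDF weighting). The stop list and the
# stemmer are crude placeholders, not the chapter's actual implementation.
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "that"}  # toy stop list

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words (Stop List step)."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def stem(word):
    """Placeholder for the Porter stemmer used in the chapter."""
    return word[:6]  # crude truncation, for illustration only

def build_vectors(docs):
    """Return one term-weight vector per document using tf * idf."""
    term_freqs = [Counter(stem(w) for w in tokenize(d)) for d in docs]
    doc_freq = Counter()
    for tf in term_freqs:
        doc_freq.update(tf.keys())
    n_docs = len(docs)
    # Salton's IDF, log(N / df), scaled by the within-document frequency.
    return [{t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
            for tf in term_freqs]

if __name__ == "__main__":
    docs = ["Oil prices rose in early trading.",
            "Grain exports fell as oil prices rose."]
    for vector in build_vectors(docs):
        print(vector)
```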

On the other hand, within the testing environment there should be a user who provides the documents that are meant to be grouped. This role is represented by samples of "very few (20), few (50), many (80) and enough (150)" documents, with the requirement that they belong to only two categories of Reuters or of the Spanish editorials, each document represented by its stemmed feature vector. Figure 2 shows the documentary environment [10] used for the experiments. It is important to note that, unlike supervised algorithms, where the number of groups to be obtained must be known in advance, our algorithm evolves to find the most appropriate structure, forming the groups by itself.
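
A minimal sketch of how these samples could be drawn, assuming corpus records with a `category` field (the field name and helper are hypothetical):

```python
# Hypothetical sketch: draw a sample of the given size restricted to exactly
# two categories, mirroring the "very few (20) ... enough (150)" samples.
import random

SAMPLE_SIZES = {"very few": 20, "few": 50, "many": 80, "enough": 150}

def draw_sample(corpus, two_categories, label, rng=random):
    """Pick SAMPLE_SIZES[label] documents belonging to the two categories."""
    pool = [doc for doc in corpus if doc["category"] in two_categories]
    return rng.sample(pool, SAMPLE_SIZES[label])
```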

Due to the simulated nature of a GA, its evolution is pseudo-random; this translates into the need for multiple runs with different seeds to reach the optimal solution. The seed is generated from the system time. For this reason, the experiments were carried out by running the GA five times on each of the samples taken from the experimental collections [1]. The result of each experiment is the best fitness obtained and its convergence. To measure the quality of the algorithm, both the best solution obtained and the average of the five runs of the GA must be analyzed.
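
A sketch of this protocol, with `run_ga` as a hypothetical stand-in for the chapter's GA:

```python
# Sketch of the experimental protocol: five GA runs per sample, each seeded
# from the system time, reporting the best fitness and the run average.
import random
import time

def run_ga(documents, seed):
    """Hypothetical stand-in for the GA; returns the fitness it reached."""
    rng = random.Random(seed)
    return rng.uniform(0.0, 1.0)  # placeholder fitness value

def experiment(documents, runs=5):
    fitnesses = []
    for _ in range(runs):
        seed = time.time_ns()          # seed generated from the system time
        fitnesses.append(run_ga(documents, seed))
    return max(fitnesses), sum(fitnesses) / len(fitnesses)

best, average = experiment(["doc1", "doc2"])
print(f"best fitness: {best:.3f}, average over runs: {average:.3f}")
```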


Fig. 2. Experimental environment used in the tests with the GA (documentary base "Reuters" / Editorials, document processing (NZIPF) model, documentary vectors, creation of the initial population, individuals 1 to N).

The chromosome of each individual stores its grouping tree in preorder: a gene 0 indicates a function node (f), and a non-zero numerical gene indicates a terminal node (document Dx, x = document number). F denotes the global fitness of the individual and f the local fitness of a function node. The individuals of the initial population are encoded, for example, as Chromosome 1: 00506400123, Chromosome 2: 00401502036, ..., Chromosome N: 00035400621.

Fig. 3. Initial population of individuals of the GA (generation "0").
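
To make the encoding concrete, the following sketch decodes such a preorder string into a tree; the nested-tuple representation and the single-digit document identifiers are assumptions for illustration, matching the chromosomes shown in Figure 3.

```python
# Sketch of the preorder chromosome decoding described above: gene 0 opens
# a function node (f) with two children; a non-zero gene is a terminal node
# (a document). Single-digit document identifiers are assumed.
def decode(genes):
    """Turn a preorder gene list into a nested ('f', left, right) tuple."""
    gene = genes.pop(0)
    if gene == 0:                      # function node: recurse for two children
        return ("f", decode(genes), decode(genes))
    return f"D{gene}"                  # terminal node: document D<gene>

chromosome = "00506400123"             # Chromosome 1 from Figure 3
tree = decode([int(g) for g in chromosome])
print(tree)
# -> ('f', ('f', 'D5', ('f', 'D6', 'D4')), ('f', ('f', 'D1', 'D2'), 'D3'))
```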

**3.2 Production operators**

The production operators are applied at each new generation: one or two individuals are taken and the transformations imposed by the operator are applied to them to produce new individuals for the next generation. Both the mutation operator and the crossover operator can be applied interchangeably; each depends on a mutation and/or crossover probability assigned to the GA [7].

A *mutation operator* is applied on terminal nodes (documents): an individual is selected from the population using the tournament method, a pair of its terminal nodes is then chosen at random, and a new individual is generated by transposing the chosen nodes (see Figure 4).

Fig. 4. Basic mutation operator applied to terminal nodes (Individual X, chromosome 00506400123, selected by tournament; the new individual after mutation has chromosome 00506200143).
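
A minimal sketch of this operator, assuming string chromosomes like those in Figure 4 and a toy fitness for the tournament:

```python
# Minimal sketch of the basic mutation operator: select an individual by
# tournament, pick two terminal (non-zero) genes at random, and transpose
# them. Tournament size and the toy fitness are assumptions.
import random

def tournament(population, fitness, k=2, rng=random):
    """Return the fittest of k randomly chosen individuals."""
    return max(rng.sample(population, k), key=fitness)

def mutate(chromosome, rng=random):
    """Transpose two randomly chosen terminal (document) genes."""
    genes = list(chromosome)
    terminals = [i for i, g in enumerate(genes) if g != "0"]
    i, j = rng.sample(terminals, 2)
    genes[i], genes[j] = genes[j], genes[i]
    return "".join(genes)

rng = random.Random(7)
population = ["00506400123", "00401502036", "00035400621"]
toy_fitness = lambda c: int(c)         # placeholder, not the chapter's fitness
parent = tournament(population, toy_fitness, rng=rng)
child = mutate(parent, rng)
print(parent, "->", child)             # e.g. two document genes transposed
```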
