**2. SKC analysis module**

The knowledge process that the Analysis Module (AM) performs for the Semantic KnowCat (SKC) system (Moreno-Llorena, 2008; Moreno-Llorena & Alamán, 2005; Moreno-Llorena et al., 2009a, 2009b) may be considered a digestion, because its intention is to extract something new from the existing knowledge and to assimilate it into the system in a way that it can be used. For this reason, the module uses the texts associated with the knowledge, the way the latter is structured and the way its components interact. As a result, the knowledge developed provides new opportunities for access to, and interaction with, the system and its knowledge.

The knowledge tree of a KC node represents the common, shared understanding of the corresponding community about the domain dealt with by the node. This tree may be considered a representation of an ontology underlying the domain (Gruber, 1993). Assigning documents to topics in the knowledge tree therefore involves the semantic annotation of these documents within the scope of an ontology. This is the AM view of the knowledge tree of the node where it works. It is in this context that one should be interested in the automatic annotation of documents -the automatic assignment of documents to topics- (Kiryakov et al., 2004) or in the mapping between ontologies -trees- of different nodes (Noy & Musen, 2002).

The AM uses text mining techniques that allow the processing of poorly structured textual data by means of vectorial models (Baeza & Ribeiro, 1999; Chang et al., 2001). These techniques are currently very popular, especially for the automatic indexing of Web contents. In addition, the AM uses language analysis (Carreras et al., 2004) appropriate for natural language processing. The application of such analysis in the field of information retrieval is not widespread, because the computing effort it requires does not justify the benefit it provides in the most common cases, where some of the texts compared are small and the repository to be dealt with is big -as happens with conventional search engines on the Web-. However, the situation may be different when comparing larger texts over moderate-sized repositories, which is the case we are concerned with and the reason we thought it would be a good idea to use this technique (Brants, 2004). There are other alternatives to this approach (Baeza & Ribeiro, 1999) that could be included in the prototype in the future to contrast the results.

The ultimate aim of the module is to convert the result obtained into something useful for interacting with the system and its contents. For this reason, it is essential to resolve problems related to the filtering of the information to be shown, typical of recommendation systems (Adomavicius & Tuzhilin, 2005), and to data visualization (Geroimenko & Chen, 2002).

To check the viability of the proposed approach, several experiments have been performed with four KC nodes in learning activities carried out at the Universidad Autónoma de Madrid (Spain). The experimental results have shown evidence of how to take advantage of latent knowledge to enrich the knowledge base and to facilitate the management task fulfilled by the system, the interaction among its entities and users' access to the contents that have been processed, among other interesting applications (Moreno-Llorena, 2008; Moreno-Llorena & Alamán, 2005; Moreno-Llorena et al., 2009a, 2009b). The proposed enrichment of content seems to provide very powerful support for the automatic exchange of knowledge among knowledge management systems, opening a way to the development of the latter in the semantic Web field (Berners-Lee, 2000).


The AM works on the system knowledge base, depositing the results of its activity in the same repository, so that both the system and the users may take advantage of the module's contributions in a transparent way. The AM considers that the knowledge is formed by items. These items may be documents, topics, knowledge trees, nodes, users, etc.

With the new knowledge the system may improve the management it carries out in different ways: for instance, providing different views of the repository and new access services; simplifying users' classification of knowledge items in the system; or informing users about implicit relations between items, given the context of interaction.

Each knowledge item considered by the AM must have an associated description text, which may be assigned either manually or automatically; for the second option, the module itself may handle the assignment on some occasions. The descriptive texts associated with the documents that SKC currently manages are the documents themselves, given that they contain textual information. In the case of topics -each a collection of documents- the descriptive texts are given by the texts of the documents classified within them or of the subtopics they contain, although initially model texts, which do not necessarily have to be part of the system knowledge repository, may be used. Nodes are treated in the same way, since they may be considered the root topics of a knowledge tree constituted by the topics and documents included within it. Regarding users, several description texts may be associated with them, considering, for instance, the documents or topics that they provide or use frequently.

The AM carries out two fundamental tasks: on the one hand, it develops knowledge that is latent in the system; on the other hand, it incorporates that knowledge into the system itself in an explicit way in order to allow its exploitation. Implicit knowledge is found in the relations established among the different knowledge items, for instance within the contents they include or within the interactions they establish with one another. Explicit knowledge is incorporated into the system in its new, clear state, either describing the existing knowledge items or in the form of new knowledge items added to the repository.

In this approach, the link through contents is established by obtaining vectorial descriptors of term weights from the text documents associated with the items. With these descriptors, items may be compared, the distance that separates them may be determined, and groups may be formed among them.

Associations based on the interaction between knowledge items are determined by analyzing how items relate to one another. In this way, one may consider how topics group documents and other topics in the knowledge tree of a node, and how users provide documents to the system.

The knowledge incorporated into the system as a result of the analysis provides new opportunities for exploiting the repository. On the one hand, the enriched knowledge items may be shown from new perspectives thanks to their new attributes. On the other hand, the items incorporated into the system through the knowledge it has assimilated allow offering users different views of the repository and new services.

#### **2.1 Linking by content**

In our approach we have initially considered four types of knowledge items: nodes, which are system instances in charge of managing the knowledge about an area with the help of a user community; topics, structured in the form of a knowledge tree, which develop the different aspects of the main node topic; users, who constitute the community that participates in the node; and documents, which describe the different topics and are provided by the users, searched by them and are the object of their consideration.

Digestion of Knowledge in a KM System to Reveal Implicit Knowledge 105


It is possible to associate with each knowledge item considered in the system a text document that describes it. These associations may have several origins. First, they may come from the nature of the items themselves; for instance, the documents used in the experiments are of a textual type. Secondly, they may stem from the explicit relations of knowledge items with other items that already have associated texts; this occurs with topics that organize documents, users that provide documents to the system, or the node that contains both. Thirdly, descriptive texts for the items may be inferred from more dynamic relations, like the one established between users and the documents they visit most frequently or give their opinion about, or like the ones shown between documents that reference one another. Lastly, it is always possible to associate descriptive texts with the items from some aspect of utility, such as users' curricula, their topics of interest, keywords associated with documents or descriptions of topics. This last case is completely general and may be applied to non-textual documents, such as images, sound, etc.

Once a descriptive text has been associated with one of the items considered, it must be put into a form in which it can be used as an instrument of comparison. This is achieved by converting the text into a descriptor connected to the aspect it refers to. For instance, if the text associated with a user describes that user's topics of interest, the corresponding descriptor refers to the user's preferences; but if the text describes the documents the user has elaborated, the corresponding descriptor refers to his or her creative work. In this way, items may have as many descriptors as aspects of them are taken into account.

```
<html> 
 <head> 
 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> 
 <title>Arquitectura de Linux</title> 
 </head> 
 <body text="#330033" bgcolor="#CCCCCC" link="#3366FF" vlink="#000099" alink="#FF0000"> 
 <center><table COLS=2 WIDTH="73%" ><tr><td> 
 <h1><font color="#CC0000">Arquitectura de&nbsp;</font></h1> 
 <center><h1><font color="#CC0000">&nbsp;&nbsp;Linux</font></h1></center></td> 
 <td><img SRC="img4.jpg" height=114 width=102></td></tr> 
 </table></center> 
 <hr WIDTH="100%" color="#FFFFFF"> 
 <h2>Introducci&oacute;n</h2> 
         Por arquitectura de Linux, y 
 de cualquier otro sistema operativo en general, podemos entender que es 
 la relación estructurada que tienen los distintos componentes del 
 sistema entre ellos, para cumplir su función: proporcionar al usuario 
 una <b>maquina virtual</b> sobre la cual trabajar. 
...
```
Fig. 2. HTML file example

In our approach, descriptors are word-weight vectors that may be used to determine similarities with other vectors of the same kind and thus relate the corresponding knowledge items (Baeza & Ribeiro, 1999; Chang et al., 2001). The process for obtaining these vectors starts from the texts associated with the items. As the texts may be in different formats, they need to be treated in order to obtain their contents "naked" in the form of flat text. In our approach, text files in PDF and HTML format (see Fig. 2) have been considered, although both are transformed into flat text files before starting the process.

```
Arquitectura de Linux 
Introducción Por arquitectura de Linux, y de cualquier otro sistema operativo en general, 
podemos entender que es la relación estructurada que tienen los distintos componentes del 
sistema entre ellos, para cumplir su función: proporcionar al usuario una maquina virtual 
sobre la cual trabajar. 
...
```
Fig. 3. HTX file example

104 New Research on Knowledge Management Technology

[Fig. 1 depicts the processing pipeline: HTML is converted by **htm2htx** into flat text (HTX); **freeling** produces the tagged text (FTG); **ftg2dwf** generates the document word frequency file (DWF); **dwfs2cwf** builds the collection word frequency file (CWF) from a set of DWF files; and **cdwf2dww** combines CWF and DWF data into the word weight vector (DWW).]

Fig. 1. Process obtaining weight of words vectors for knowledge items from HTML


After the text format has been eliminated -creating flat text files, HTX (see Fig. 3)- the lemmas to which the terms refer must be identified (disregarding the grammatical forms in which they appear) and the grammatical categories to which they belong must be determined. With that, references to concepts are unified, the number of different words considered is reduced, and terms with no utility are identified.

In our approach we have used the FreeLing language analysis tool (Carreras et al., 2004), which facilitates obtaining all the information necessary to achieve the previous objectives. FreeLing analyses a text to identify the grammatical categories to which its words belong and to determine the lemmas to which these words correspond in a reference dictionary. When FreeLing cannot find an appropriate lemma for some word, it considers the word itself a new lemma. With all this, the tool may establish the most probable morphological interpretation of each word in the text, which is useful for determining a semantic approximation of the latter. As a result of the analysis, FreeLing provides a tagged version of the text (FTG files), indicating for each appearance of a word its original form together with the lemma and morphological interpretation considered most feasible. Fig. 4 (left) shows an FTG file example where, in each row, the first string is the word's original form, the second is the corresponding lemma, and the third is the encoding of the grammatical category.


Fig. 4. FTG (left) and DWF (right) file examples


The text tagged by FreeLing is then processed according to its grammatical categories, in order to eliminate entries that are not considered relevant for the comparison of texts, such as determiners, conjunctions or prepositions. The tags and the original forms of the remaining entries are also removed. In this way, the original text is reduced to a sequence of lemmas that either already exist in the reference dictionary or have been minted from outstanding terms that do not appear in it. In this sequence, different forms of the same word in the original text appear as repetitions of the same lemma. Each lemma included in this sequence may be ascribed a semantic interest, in order to contribute to the creation of the descriptor that is the objective of the process.
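As an illustration, this filtering step -keeping only content lemmas from FreeLing-style tagged output- might be sketched as follows. The function name, the sample text and the exact set of discarded tag prefixes are assumptions of the sketch (EAGLES-style tags begin with D for determiners, C for conjunctions, S for prepositions and P for pronouns), not the SKC implementation.

```python
# Sketch of the FTG -> lemma-sequence step. Relies only on the
# three-column format shown in Fig. 4 (word form, lemma, category tag).

# Tag prefixes treated as function words, dropped before weighting
# (assumed convention: D=determiner, C=conjunction, S=preposition, P=pronoun).
FUNCTION_TAGS = ("D", "C", "S", "P")

def ftg_to_lemmas(ftg_text: str) -> list[str]:
    """Turn FreeLing-style tagged text into the sequence of content lemmas."""
    lemmas = []
    for line in ftg_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        word, lemma, tag = parts[0], parts[1], parts[2]
        if tag.startswith(FUNCTION_TAGS):
            continue  # drop determiners, conjunctions, prepositions, pronouns
        lemmas.append(lemma)
    return lemmas

sample = """relación relación NCFS000
estructurada estructurar VMP00SF
que que PR0CN000
de de SPS00
sistema sistema NCMS000"""

print(ftg_to_lemmas(sample))  # ['relación', 'estructurar', 'sistema']
```

Note how the inflected form "estructurada" is replaced by its lemma "estructurar", while the pronoun and preposition rows disappear.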

By counting the appearances of each term in the sequence of lemmas, it is possible to establish the frequency of each of them. In this way, a word frequency file is generated for each text associated with a knowledge item (DWF files). DWF files include a single entry for each lemma, containing the corresponding identifier and its frequency, normalised with regard to the maximum number of appearances among the words taken into account in the document. Fig. 4 (right) shows a DWF file example, where each row corresponds to a lemma and the columns are: identifier, number of appearances, normalised frequency, lemma and grammatical categories.
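The DWF computation just described -counting lemma appearances and normalising by the most frequent one- can be sketched as follows; names are illustrative, not taken from the SKC code.

```python
# Sketch of the DWF step: count each lemma and normalise by the
# maximum count in the document, as described in the text.
from collections import Counter

def lemma_frequencies(lemmas: list[str]) -> dict[str, float]:
    """Map each lemma to its frequency normalised by the most frequent one."""
    counts = Counter(lemmas)
    max_count = max(counts.values())
    return {lemma: n / max_count for lemma, n in counts.items()}

freqs = lemma_frequencies(["proceso", "ejecutar", "proceso", "acceso"])
print(freqs)  # {'proceso': 1.0, 'ejecutar': 0.5, 'acceso': 0.5}
```

This matches the pattern in Fig. 4 (right), where lemmas with two appearances out of a maximum of three carry a normalised frequency of 0.66.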

Following a process similar to the one described -but working on a collection of texts representative of the general use of the language in question- a reference file is generated with the frequency of words in this collection (CWF file), which represents the frequency of words in the common use of the language (Baeza & Ribeiro, 1999). The document collection is processed as if it were the text associated with a knowledge item. For the words found and their frequencies to be representative of the general use of the language, the collection must be broad enough and cover general themes. In our approach we have used the 748 articles included in the El País newspaper annuals of four different years, which deal with the most outstanding events of that period in the main fields of information, such as society, culture, sports and so on.

CWF files are similar to DWF files; they include one entry for each lemma, with the identifier in question and its inverse document frequency coefficient. This coefficient is the base-ten logarithm of the quotient of the total number N of documents in the collection by the number nk of documents in which the term appears (see For. 1). It is an indicator of the frequency of the term in the general use of the language represented by the collection, and indicates the rareness of the term. Fig. 5 (left) shows a CWF file example, where each row corresponds to a lemma and the columns are: identifier, number of appearances, inverse frequency normalised on the document collection, inverse document frequency and lemma.

$$p_{k,i} = f_{k,i} \times fdi_{k} = f_{k,i} \times \log \frac{N}{n_{k}} \tag{1}$$

Considering the word frequency file of each knowledge item (DWF) and the word frequency file of the reference collection (CWF), a weight is established for each term in the text associated with the item. The weight of a word in a text represents the relevance of the term in it: a term is more characteristic of a text the more frequent it is in that text and the less frequent it is in the general use of the language in which it is written. Specifically, the weight pk,i of a term k in a document i is the product of the normalised frequency fk,i of the word k in text i by the inverse document frequency fdik of the term in the collection used as reference (see For. 1).
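Formula (1) can be expressed directly in code; the function names are illustrative assumptions of this sketch.

```python
# Sketch of formula (1): the weight of lemma k in document i is its
# normalised in-document frequency times the inverse document frequency
# measured on the reference collection (the CWF data).
import math

def inverse_document_frequency(total_docs: int, docs_with_term: int) -> float:
    """fdi_k = log10(N / n_k), the rarity of the term in the collection."""
    return math.log10(total_docs / docs_with_term)

def term_weight(norm_freq: float, total_docs: int, docs_with_term: int) -> float:
    """p_{k,i} = f_{k,i} * fdi_k, as in formula (1)."""
    return norm_freq * inverse_document_frequency(total_docs, docs_with_term)

# A lemma with normalised frequency 0.5 appearing in 10 of 1000 reference
# documents gets weight 0.5 * log10(100) = 1.0.
print(term_weight(0.5, 1000, 10))  # 1.0
```

A term appearing in every reference document gets fdi = 0 and thus weight 0, which is exactly the intent: words common in the general use of the language do not characterise any document.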



Fig. 5. CWF (left) and DWW (right) file examples.

The vector formed by the terms that appear in the text associated with a knowledge item, together with their respective weights, constitutes the descriptor resulting from the process, which is kept in the form of a file (DWW files). Fig. 5 (right) shows a DWW file example, where each row corresponds to a lemma and the columns are: numeric identifier, number of appearances, absolute and normalised frequency in the document, inverse document frequency, word weight and lemma.

DWW files are used to compare items with one another, calculating the degree of similarity between the word-weight vectors they represent. The similarity between two vectors is a measurement of the distance between them. In this approach, the distance between vectors is estimated according to the cosine of the angle they form.

$$\text{sim}(\boldsymbol{\upsilon}\_{i}, \boldsymbol{\upsilon}\_{j}) = \frac{\vec{\boldsymbol{\upsilon}}\_{i} \bullet \vec{\boldsymbol{\upsilon}}\_{j}}{|\vec{\boldsymbol{\upsilon}}\_{i}| \times |\vec{\boldsymbol{\upsilon}}\_{j}|} = \frac{\sum\_{k=1}^{t} p\_{k,i} \times p\_{k,j}}{\sqrt{\sum\_{k=1}^{t} p\_{k,i}^{2}} \times \sqrt{\sum\_{k=1}^{t} p\_{k,j}^{2}}} \tag{2}$$

Therefore, the similarity between two vectors vi and vj is the scalar product of the two vectors, divided by the product of their respective modules. The scalar product of these vectors is calculated as the sum, over each of the t dimensions, of the products of the components pk,i and pk,j. The module of a vector is calculated as the square root of the sum of the squares of its components (see For.2).
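For.2 admits a direct implementation over sparse word-weight vectors. The sketch below is ours, not SKC's: descriptors are represented as dictionaries mapping lemmas to weights, and the sample weights are invented.

```python
import math

def cosine_similarity(vi, vj):
    """sim(v_i, v_j) of For.2: dot product divided by the product of modules."""
    dot = sum(w * vj.get(lemma, 0.0) for lemma, w in vi.items())
    mod_i = math.sqrt(sum(w * w for w in vi.values()))
    mod_j = math.sqrt(sum(w * w for w in vj.values()))
    if mod_i == 0.0 or mod_j == 0.0:
        return 0.0  # an empty descriptor is taken as similar to nothing
    return dot / (mod_i * mod_j)

# Two toy DWW descriptors sharing only the lemma "ontology".
doc_a = {"ontology": 0.10, "annotation": 0.07}
doc_b = {"ontology": 0.08, "system": 0.03}
print(round(cosine_similarity(doc_a, doc_a), 2))  # identical vectors -> 1.0
```

Because all weights are non-negative, the coefficient always falls between zero and one, which is the property the threshold discussion below relies on.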


Fig. 6. CDF file example.

Digestion of Knowledge in a KM System to Reveal Implicit Knowledge 109

The descriptors added to the knowledge items provide new data that reveal hidden aspects of the elements they describe. For instance, the interest a particular item arouses may be used to make it stand out among the other items, or to put them all in order. In addition, the most characteristic terms an item includes may serve as an interesting reference to search for information related to it in other information repositories.

In our approach, the knowledge produced by the analysis process is incorporated into the system in the form of a new category of knowledge element that represents relations among items of all the kinds previously considered (documents, topics, users and nodes). The links incorporated this way into the repository provide the base to offer users new multidimensional views of the knowledge and new services that facilitate its exploitation. In particular, to demonstrate this proposal we have implemented an interactive view of the graph of relations among knowledge items in the system (see Fig. 7 left), as well as a context-sensitive recommendation service that provides reference items -of different kinds- related to the item the user is working with at each moment (see Fig. 7 right).

Fig. 7. Interactive view of the knowledge as a graph of related items (left window) and a context-sensitive recommendation (inferior centre window and right window).

The view in the form of a graph integrates the static relations established in the system with other, dynamic ones that evolve through time. Among the former we can find the hierarchical links that join the topics of the knowledge tree, or the authorship links that connect users to the documents they provide to the knowledge base. Among the dynamic relations, we may mention those derived from the character of the items present in the repository at each moment, and those due to the interactions established among the items as a result of the system activity. An example of this kind of view can be seen in the illustration (see Fig. 7 left), where the topics are represented by orange circles, the documents by clear squares, the static relations by black lines, and the dynamic relations between the items taken into account.
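The underlying structure -items of several kinds joined by static and dynamic relations- can be sketched as a small typed graph. The class and field names below are ours, chosen for illustration; they do not correspond to SKC's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    """Hypothetical sketch of the graph behind the interactive view:
    typed items joined by static and dynamic relations."""
    items: dict = field(default_factory=dict)  # id -> kind ("topic", "document", ...)
    edges: list = field(default_factory=list)  # (source, target, relation, kind)

    def add_item(self, item_id, kind):
        self.items[item_id] = kind

    def add_relation(self, src, dst, relation, kind):
        self.edges.append((src, dst, relation, kind))

g = KnowledgeGraph()
g.add_item("T107", "topic")
g.add_item("D113", "document")
g.add_relation("T107", "D113", "contains", "static")     # knowledge-tree link
g.add_relation("D113", "T107", "similarity", "dynamic")  # derived by the AM
```

Keeping the relation kind on every edge is what lets a viewer render static links (black lines) and dynamic ones differently, as in Fig. 7.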

The level of similarity between two vectors is a coefficient between zero and one: the closer the value is to one, the more similar the vectors are, and the closer it is to zero, the less similar they are. The relation of similarity established between two knowledge items is described by this coefficient. In our approach, two knowledge items are considered related when the similarity coefficient between them exceeds a specific threshold. Unfortunately, this threshold can be established neither in a fixed way nor in a general way for all cases, since depending on circumstances such as the theme of the nodes or the nature of the documents taken into account, the choice of its value may vary greatly.
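Thresholding the coefficient can be sketched as a one-line filter. The threshold value 0.30 below is purely illustrative; as noted above, no single value fits all nodes or document types.

```python
def related_items(similarities, threshold):
    """Keep only the item pairs whose similarity coefficient exceeds the
    (domain-dependent) threshold. `similarities` maps (item_a, item_b) pairs
    to their sim() coefficient from For.2."""
    return [(a, b, s) for (a, b), s in similarities.items() if s > threshold]

# Toy coefficients; D113/T107 and D113/T53 follow the Fig. 6 example.
sims = {("D113", "T107"): 0.85, ("D113", "T53"): 0.03, ("D200", "T107"): 0.41}
print(related_items(sims, 0.30))
# -> [('D113', 'T107', 0.85), ('D200', 'T107', 0.41)]
```

Raising the threshold prunes weak links from the relation graph; lowering it admits more, noisier, relations.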

The similarities between two sets of DWW files are summarised in a CDF file. Fig. 6 shows a CDF file example, where the rows fi and the columns ci represent DWW files (of documents and topics, respectively, in this case) and the numbers are the similarities sim(fi, ci) between them. In Fig. 6, for instance, sim(f6, c5) (the similarity between document D113 and topic T107) is 0.85, which is much higher than sim(f6, c10) (the similarity between document D113 and topic T53), which is 0.03.
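Reading one row of such a matrix, the most plausible topic for a document is simply the column with the highest coefficient. A minimal sketch, where a CDF row is assumed to be a dictionary from topic identifiers to similarities (0.85 and 0.03 follow the Fig. 6 example; T21 and its value are invented):

```python
def best_topic(cdf_row):
    """Given one CDF row {topic_id: sim(f_i, c_j)}, return the topic most
    similar to the document, e.g. D113 -> T107 in the Fig. 6 example."""
    return max(cdf_row, key=cdf_row.get)

cdf_d113 = {"T107": 0.85, "T53": 0.03, "T21": 0.12}
assert best_topic(cdf_d113) == "T107"
```

In practice this choice would still be subject to the similarity threshold discussed above, so a document whose best coefficient is too low remains unassigned.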
