2. Vector model

One of the most common methods in text mining for automatic indexing is the vector model. In it, every document, and every information need or query, is encoded as a vector whose components reflect the importance of a particular term in its meaning or semantics.

#### 2.1. The term document matrix

A database containing n documents described by m terms is represented as an m × n matrix A, called the term document matrix, where the element a<sub>ij</sub> denotes the weight of term i in document j. A natural choice for the components of a document vector is a function of the frequency with which each term occurs in it, that is to say, a<sub>ij</sub> = f<sub>ij</sub>, where f<sub>ij</sub> is the number of times term i appears in document j. There are more sophisticated schemes, such as those given in [1, 2], that may lead to better results, and in general, as is done in [2], the procedure is to define the entries of A as

$$a\_{ij} = l\_{ij}\, g\_i\, d\_j, \tag{1}$$

where l<sub>ij</sub> is the local weight of term i in document j, g<sub>i</sub> is the global weight of term i in the collection of documents, and d<sub>j</sub> is the normalization component, which specifies whether the columns of A (that is, the documents) are normalized or not. Local and global weights are applied to increase or decrease the importance of terms within or between documents. In [1], the term-weighting schemes recommended for different characteristics of the document collection are explained. In the language of the vector model, the columns and rows of A are called document vectors and term vectors, respectively. Because each document vector contains only a small part of all the terms that describe the entire collection of documents, the term document matrix is normally sparse; i.e., most of its entries are zero.
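To make Eq. (1) concrete, the following is a minimal Python sketch using only NumPy; the toy corpus, the term-frequency local weight, the idf-style global weight, and the unit-norm column scaling are illustrative choices of ours, not the specific schemes recommended in [1, 2]:

```python
import numpy as np

# Toy corpus: n = 3 documents over a small vocabulary (illustrative only).
docs = ["gospel of mark", "gospel of john", "mark and john"]
terms = sorted({w for d in docs for w in d.split()})
m, n = len(terms), len(docs)

F = np.zeros((m, n))                 # f_ij: times term i occurs in document j
for j, doc in enumerate(docs):
    for w in doc.split():
        F[terms.index(w), j] += 1

# Eq. (1): a_ij = l_ij * g_i * d_j
L = F                                # local weight l_ij: raw term frequency
df = np.count_nonzero(F, axis=1)     # number of documents containing each term
g = np.log(n / df)                   # global weight g_i: idf, one common choice
A = L * g[:, None]                   # scale each row i by g_i
d = 1.0 / np.linalg.norm(A, axis=0)  # d_j: normalize each column to unit length
A = A * d[None, :]

print(terms)
print(np.round(A, 3))                # sparse m x n term document matrix
```

Even on this tiny example, most entries of A are zero, which is the sparsity pattern described above.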

#### 2.2. Latent semantic indexing


Latent semantic indexing (LSI) [3–5], also called latent semantic analysis (LSA), is an automatic indexing method based on the semantics of documents, which attempts to overcome the two main problems of traditional lexical-matching indexing schemes: polysemy and synonymy. The first has to do with the fact that a word can have multiple meanings, and therefore, the words of a query may not coincide in meaning with those of the documents; the second means that several terms can have the same meaning, and hence the words used in queries can match nonrelevant documents.

LSI is based on the assumption that there is some latent semantic structure underlying the data that is corrupted by the variety of words used [4], but this semantic structure can be discovered and enhanced by approximating the term document matrix by a summation of matrices of particular structure, for example, by a low-rank approximation obtained from some matrix decomposition.
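As a toy illustration of this idea (our own invented example, not data from this chapter), the following NumPy sketch shows how a low-rank approximation lets a query on one term retrieve a document that only contains a co-occurring, synonym-like term:

```python
import numpy as np

# Toy term document matrix: "car" and "automobile" never co-occur,
# but both co-occur with "engine" (counts are invented).
#              d0   d1   d2
A = np.array([[1.,  0.,  0.],    # car
              [0.,  1.,  0.],    # automobile
              [1.,  1.,  0.],    # engine
              [0.,  0.,  1.]])   # flower

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1
Ak = U[:, :k] * s[:k] @ Vt[:k, :]   # rank-k approximation of A

q = np.array([1., 0., 0., 0.])      # query containing only "car"
print(np.round(q @ A, 2))           # literal matching: d1 scores 0.0
print(np.round(q @ Ak, 2))          # latent structure: d1 now scores 0.5
```

The rank-1 approximation merges "car" and "automobile" into one latent direction through their shared neighbor "engine", which is exactly the synonymy effect LSI exploits.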

#### 2.3. Queries and measures of performance

In the vector model, the queries are also seen as vectors, and then matching a query q means finding, in the column space of A (the subspace generated by the document vectors), the documents a<sub>j</sub> that are most similar to it in meaning. In [2], it is explained that it is possible to associate a weighting scheme with the query, so that

$$q\_k = l\_k g\_k, \tag{2}$$

where q<sub>k</sub> is the kth entry of q, and l<sub>k</sub> and g<sub>k</sub> are the local and global weight components, respectively. The documents considered relevant are those that are geometrically closest to the query according to some measure, and often the cosine of the angle between the query vector and each of the document vectors is used as the measure of similarity, so that the largest values correspond to the most relevant documents. Then, a<sub>j</sub> is retrieved if

$$\cos\left(\theta\left(q, a\_j\right)\right) = \frac{q^t a\_j}{||q||\_2\, ||a\_j||\_2} > \text{tol}, \tag{3}$$

where tol is a predefined tolerance. Another commonly used measure of similarity is the dot product between the query vector and each document vector, which is computed as

$$p = q^t A, \tag{4}$$

where the ith entry of p represents the score of document i. Thus, the documents can be ordered from most to least relevant to the query according to their score.
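The following sketch implements Eqs. (3) and (4) directly; the matrix, query, and tolerance value are invented for illustration:

```python
import numpy as np

def retrieve(A, q, tol=0.5):
    """Indices of documents satisfying the cosine test of Eq. (3)."""
    cos = (q @ A) / (np.linalg.norm(q) * np.linalg.norm(A, axis=0))
    return np.where(cos > tol)[0], cos

# Toy 3-term x 3-document matrix with unit-length columns (illustrative).
A = np.array([[0.8, 0.0, 0.3 ],
              [0.6, 1.0, 0.0 ],
              [0.0, 0.0, 0.95]])
q = np.array([1.0, 1.0, 0.0])       # query vector built as in Eq. (2)

hits, cos = retrieve(A, q)
print(hits, np.round(cos, 2))       # documents passing the tolerance test

p = q @ A                           # scores of Eq. (4)
print(np.argsort(-p))               # documents ordered from most to least relevant
```

Here documents 0 and 1 pass the cosine test, and the score vector p produces the same ranking.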

In [6], it is described that a good query-matching process is one in which the intersection between the set of retrieved documents and the set of relevant documents is as large as possible and the number of irrelevant documents retrieved is small. Accordingly, to measure the performance of an information retrieval system, one must evaluate the ability of the system to retrieve relevant information (recall) and to reject irrelevant information (precision).

Other measures frequently used to evaluate the quality of an IR system are the pseudoprecision, the average pseudoprecision, and the mean average pseudoprecision (MAP). Let r<sub>i</sub> be the number of relevant documents retrieved up to position i in the sorted list of documents, and let P<sub>i</sub> = r<sub>i</sub>/i be the precision at position i:

• The pseudoprecision at recall level x is defined by

$$\tilde{P}(x) = \max P\_i, \quad \text{where } x \le \frac{r\_i}{r\_n},\ i = 1, 2, \cdots, n. \tag{5}$$

• The average pseudoprecision for a single query is defined by

$$P\_{av} = \frac{1}{n} \sum\_{i=0}^{n-1} \tilde{P}\left(\frac{i}{n-1}\right). \tag{6}$$

• The mean average pseudoprecision (MAP), used to evaluate performance on a set of queries, is defined by

$$MAP = \frac{1}{M} \sum\_{j=1}^{M} \left[ \frac{1}{n} \sum\_{i=0}^{n-1} \tilde{P}\left(\frac{i}{n-1}\right) \right], \tag{7}$$

where M is the number of queries.
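A minimal NumPy sketch of Eqs. (5)–(7) follows; the relevance judgments below are invented for illustration, whereas in practice they come from the test collection:

```python
import numpy as np

def average_pseudoprecision(relevant):
    """Eqs. (5)-(6): relevant[i] is 1 if the document ranked i+1 is relevant."""
    rel = np.asarray(relevant, dtype=float)
    n = rel.size
    r = np.cumsum(rel)                # r_i: relevant documents up to position i
    P = r / np.arange(1, n + 1)       # P_i = r_i / i, precision at position i
    recall = r / r[-1]                # r_i / r_n

    def p_tilde(x):
        # Eq. (5): best precision attained at recall level x or beyond.
        return P[recall >= x].max()

    # Eq. (6): average over the n recall levels i/(n-1), i = 0, ..., n-1.
    return np.mean([p_tilde(i / (n - 1)) for i in range(n)])

# Eq. (7): MAP averages Eq. (6) over a set of M queries.
judgments = [[1, 0, 1, 0, 0],
             [0, 1, 1, 0, 1]]
MAP = np.mean([average_pseudoprecision(j) for j in judgments])
print(round(MAP, 3))
```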

#### 2.4. Singular value decomposition

The singular value decomposition (SVD) of a matrix A ∈ R<sup>m×n</sup>, where without loss of generality m ≥ n, is a factorization

$$A = U \begin{bmatrix} \Sigma \\\\ 0 \end{bmatrix} V^t,$$

where U ∈ R<sup>m×m</sup> and V ∈ R<sup>n×n</sup> are orthogonal matrices whose columns are called, respectively, the left and right singular vectors of A, and Σ ∈ R<sup>n×n</sup> is a diagonal matrix that contains the singular values σ<sub>1</sub> ≥ σ<sub>2</sub> ≥ ⋯ ≥ σ<sub>n</sub> ≥ 0 of A in decreasing order on its diagonal. This factorization exists for any matrix A, and numerical linear algebra texts commonly include it in their content [7, 8]. Methods to calculate the SVD of dense and sparse matrices are well documented [1, 6, 7].

The following two theorems show how the SVD reveals important information about the structure of a matrix.

Theorem 1. Let A<sub>m×n</sub>, where without loss of generality m ≥ n, let A = UΣV<sup>t</sup> be the SVD of A, and let σ<sub>1</sub> ≥ σ<sub>2</sub> ≥ ⋯ ≥ σ<sub>r</sub> > σ<sub>r+1</sub> = ⋯ = 0 be the singular values of A. If R(A) and N(A) denote the column space and null space of A, respectively, and if U = [u<sub>1</sub>, u<sub>2</sub>, ⋯, u<sub>m</sub>] and V = [v<sub>1</sub>, v<sub>2</sub>, ⋯, v<sub>n</sub>], then:

• rank(A) = r

• R(A) = span{u<sub>1</sub>, u<sub>2</sub>, …, u<sub>r</sub>}

• R(A<sup>t</sup>) = span{v<sub>1</sub>, v<sub>2</sub>, …, v<sub>r</sub>}

• N(A) = span{v<sub>r+1</sub>, v<sub>r+2</sub>, …, v<sub>n</sub>}

• N(A<sup>t</sup>) = span{u<sub>r+1</sub>, u<sub>r+2</sub>, …, u<sub>m</sub>}

and, moreover,

$$A = \sum\_{i=1}^{r} u\_i \sigma\_i v\_i^t.$$

Proof: See [6–8].

The theorem reveals that the SVD gives orthogonal bases for the four fundamental subspaces associated with a matrix and, in particular, in the context of term document matrices, the second part indicates that generating the semantic content of a database does not require all the document vectors but only the subset of left singular vectors corresponding to the rank of the matrix. The sum in the last part of the theorem is usually called the expansion in singular values of A.

In LSI, A is replaced by the truncated expansion

$$A\_k = \sum\_{i=1}^{k} u\_i \sigma\_i v\_i^t, \tag{8}$$

with k < r, and the following theorem states that A<sub>k</sub> is the rank-k matrix closest to A.

Theorem 2 (Eckart and Young). Suppose A ∈ R<sup>m×n</sup> has rank r > k. Then,

$$\min\_{\text{rank}(B)=k} ||A - B||\_F^2 = ||A - A\_k||\_F^2, \tag{9}$$

where

$$A\_k := U\_k \Sigma\_k V\_k^t, \tag{10}$$

with U<sub>k</sub> and V<sub>k</sub> the first k columns of U and V, and Σ<sub>k</sub> the leading k × k block of Σ.
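As a quick numerical illustration of Theorem 2 and Eqs. (8)–(10), the following sketch uses a random matrix of our choosing rather than data from this chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))      # random 8 x 5 test matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] * s[:k] @ Vt[:k, :]    # A_k = U_k Sigma_k V_k^t, Eqs. (8) and (10)

# Eq. (9): the squared Frobenius error of the best rank-k approximation
# equals the sum of the squared discarded singular values.
err2 = np.linalg.norm(A - Ak, 'fro') ** 2
print(np.isclose(err2, np.sum(s[k:] ** 2)))   # True
```

This is the property LSI relies on: truncating the expansion in singular values discards the smallest singular values first, so A<sub>k</sub> keeps as much of the structure of A as any rank-k matrix can.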
