In [6], it is described that a good query-matching process is one in which the intersection between the set of retrieved documents and the set of relevant documents is as large as possible and the number of irrelevant documents retrieved is small. Accordingly, to measure the performance of an information retrieval system, one must evaluate the ability of the system to retrieve relevant information (recall) and to reduce irrelevant information (precision).

Other measures frequently used to evaluate the quality of an IR system are the pseudoprecision, average pseudoprecision, and mean average pseudoprecision (MAP). Let r<sub>i</sub> be the number of relevant documents retrieved up to position i in the sorted list of documents:

• The recall for the i-th document from the list, R<sub>i</sub>, is the ratio of relevant documents seen so far, that is, R<sub>i</sub> = r<sub>i</sub>/r<sub>n</sub>, where r<sub>n</sub> is the number of relevant documents retrieved.

• The precision for the i-th document, P<sub>i</sub>, is the proportion of documents up to position i that are relevant to a given query, that is, P<sub>i</sub> = r<sub>i</sub>/i.

• The pseudoprecision for a recall level x, P̃(x), is defined by

$$\tilde{P}(x) = \max P_i, \quad \text{where } x \le R_i,\ i = 1, 2, \cdots, n. \tag{5}$$

• The average pseudoprecision for a single query is defined by

$$P_{av} = \frac{1}{n}\sum_{i=0}^{n-1} \tilde{P}\left(\frac{i}{n-1}\right). \tag{6}$$

• The mean average pseudoprecision (MAP), used to evaluate performance over a set of queries, is defined by

$$\mathrm{MAP} = \frac{1}{M}\sum_{j=1}^{M}\left[\frac{1}{n}\sum_{i=0}^{n-1} \tilde{P}_j\left(\frac{i}{n-1}\right)\right], \tag{7}$$

where M is the number of queries.
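As a quick illustration of these measures, here is a minimal sketch for a single query; the ranked list and its relevance judgments are made up purely for illustration, and MAP would simply average P_av over the M queries of a test collection:

```python
import numpy as np

# Ranked retrieval list for one query; True marks a relevant document.
# These relevance judgments are made up for illustration only.
relevant = np.array([True, False, True, True, False, False, True, False])
n = len(relevant)
r_n = relevant.sum()                       # relevant documents retrieved

r = np.cumsum(relevant)                    # r_i: relevant docs up to position i
R = r / r_n                                # recall    R_i = r_i / r_n
P = r / np.arange(1, n + 1)                # precision P_i = r_i / i

def pseudo_precision(x):
    """Pseudoprecision, Eq. (5): max P_i over positions with x <= R_i."""
    return P[R >= x].max()

# Average pseudoprecision for this query, Eq. (6).
P_av = np.mean([pseudo_precision(i / (n - 1)) for i in range(n)])
print(P_av)

# MAP, Eq. (7), would average P_av over the M queries of a collection.
```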

#### 3. Latent semantic indexing via singular value decomposition

In this section, we first present the singular value decomposition (SVD) and two theorems that show how the SVD gives useful information about the structure of a matrix. We also explain why these theorems are important for IR and, in particular, for LSI. Subsequently, the LSI method based on the SVD is presented.

#### 3.1. Singular value decomposition

The SVD of a matrix A ∈ R<sup>m×n</sup>, with m ≥ n, is a factorization of the form


$$A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^t, \tag{8}$$

where U ∈ R<sup>m×m</sup> and V ∈ R<sup>n×n</sup> are orthogonal matrices whose columns are called, respectively, the left and right singular vectors of A, and Σ ∈ R<sup>n×n</sup> is a diagonal matrix that contains the singular values σ<sub>1</sub> ≥ σ<sub>2</sub> ≥ ⋯ ≥ σ<sub>n</sub> ≥ 0 of A in decreasing order on its diagonal. This factorization exists for any matrix A and is a standard topic in numerical linear algebra texts [7, 8]. Methods to compute the SVD of dense and sparse matrices are well documented [1, 6, 7].
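The factorization in Eq. (8) is easy to check numerically. The following minimal sketch, with a small random matrix standing in for a term-document matrix, verifies the orthogonality of the factors and the ordering of the singular values:

```python
import numpy as np

# A small random matrix standing in for a term-document matrix (m >= n).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# full_matrices=True yields U (m x m) and V^t (n x n), as in Eq. (8).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Singular values come back nonnegative and in decreasing order.
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)

# Rebuild A = U [Sigma; 0] V^t, where Sigma = diag(s) sits on top of zeros.
Sigma_padded = np.zeros((6, 4))
np.fill_diagonal(Sigma_padded, s)
assert np.allclose(A, U @ Sigma_padded @ Vt)

# U and V are orthogonal: U^t U = I, V^t V = I.
assert np.allclose(U.T @ U, np.eye(6)) and np.allclose(Vt @ Vt.T, np.eye(4))
```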

The following two theorems show how the SVD reveals important information about the structure of a matrix.

Theorem 1. Let A ∈ R<sup>m×n</sup>, where without loss of generality m ≥ n, let A = UΣV<sup>t</sup> be the SVD of A, and let σ<sub>1</sub> ≥ σ<sub>2</sub> ≥ ⋯ ≥ σ<sub>r</sub> > σ<sub>r+1</sub> = ⋯ = σ<sub>n</sub> = 0. If R(A) and N(A) denote the column space and null space of A, respectively, and if U = [u<sub>1</sub>, u<sub>2</sub>, ⋯, u<sub>m</sub>] and V = [v<sub>1</sub>, v<sub>2</sub>, ⋯, v<sub>n</sub>], then:

• rank (A) = r

• R(A) = span{u<sub>1</sub>, u<sub>2</sub>, ⋯, u<sub>r</sub>}

• N(A) = span{v<sub>r+1</sub>, ⋯, v<sub>n</sub>}


• $$A = \sum_{i=1}^{r} u_i \sigma_i v_i^t$$

Proof: See [6–8].

The theorem reveals that the SVD gives orthogonal bases for the four fundamental subspaces associated with a matrix. In particular, in the context of term-document matrices, the second part indicates that generating the semantic content of a database does not require all the document vectors, but only the subset of left singular vectors corresponding to the rank of the matrix. The sum in the last part of the theorem is usually called the singular value expansion of A.

Theorem 2 (Eckart and Young). Suppose A ∈ R<sup>m×n</sup> has rank r > k. Then,

$$\min_{\operatorname{rank}(B)=k} \|A - B\|_F^2 = \|A - A_k\|_F^2, \tag{9}$$

where

$$A_k = \sum_{i=1}^{k} u_i \sigma_i v_i^t := U_k \Sigma_k V_k^t. \tag{10}$$

Proof: See [6–8].

In this case, the theorem states that A<sub>k</sub> is the rank-k matrix closest to A. The columns of U<sub>k</sub> live in the semantic space and are used to approximate the documents. As is well known, the truncated SVD is useful for "eliminating noise" present in a matrix and therefore, in the case of matrices representing a database, for removing term-document associations that obscure its real meaning.
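Theorem 2 can be illustrated numerically. In the sketch below (random data, purely illustrative), no randomly generated rank-k matrix beats A_k in Frobenius norm, and the optimal error equals the sum of the squared discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # truncated SVD, Eq. (10)

# The optimal error equals the sum of the squared discarded singular values.
err_k = np.linalg.norm(A - A_k, "fro") ** 2
assert np.isclose(err_k, np.sum(s[k:] ** 2))

# No rank-k matrix B does better, per Eq. (9); spot-check with random B.
for _ in range(1000):
    B = rng.standard_normal((6, k)) @ rng.standard_normal((k, 4))
    assert np.linalg.norm(A - B, "fro") ** 2 >= err_k
```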

#### 3.2. LSI via SVD

As mentioned earlier, LSI is an IR method based on the vector model that approximates a term-document matrix by a sum of matrices of particular structure. In this regard, according to Theorem 2, LSI via SVD uses the singular value decomposition to obtain a rank-k approximation of the original term-document matrix A ∈ R<sup>m×n</sup> in order to eliminate the noise present in it and to project the m terms, n documents, and the queries into a k-dimensional space, where k ≪ min(m, n). It is important to keep in mind that term-document matrices are commonly well conditioned, that is, their singular values have no gaps and do not decay rapidly to zero, so a suitable k at which to truncate the SVD cannot be estimated a priori [6]; experimentally, it has been concluded that for very large databases, k should be taken between 100 and 300 [3].

As in the vector model, in LSI it is possible to match a query q through operations between vectors, namely, the cosine of the angle or the dot product between the query vector and the document vectors. In this case, we calculate

$$p = \tilde{q}^t \tilde{A}, \tag{11}$$


where q̃ = U<sub>k</sub><sup>t</sup>q, Ã = Σ<sub>k</sub>V<sub>k</sub><sup>t</sup>, and A<sub>k</sub> = U<sub>k</sub>Σ<sub>k</sub>V<sub>k</sub><sup>t</sup> is the rank-k approximation of A obtained from the truncated SVD. Therefore, the retrieved documents will be those corresponding to the largest components of p.
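Putting Eqs. (10) and (11) together, a minimal LSI-via-SVD sketch looks as follows; the tiny term-document matrix and the query are made up for illustration:

```python
import numpy as np

# Tiny term-document matrix: rows are terms, columns are documents.
# The counts are made up purely for illustration.
A = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 1.],
              [1., 0., 0., 1.]])
q = np.array([1., 1., 0., 0., 0.])     # query containing the first two terms

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vkt = U[:, :k], np.diag(s[:k]), Vt[:k, :]

q_tilde = Uk.T @ q                     # q~ = U_k^t q
A_tilde = Sk @ Vkt                     # A~ = Sigma_k V_k^t
p = q_tilde @ A_tilde                  # Eq. (11): p = q~^t A~

print(np.argsort(-p))                  # documents ranked by components of p
```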

#### 4. Latent semantic indexing via semidiscrete matrix decomposition

In this section, we present the semidiscrete decomposition (SDD) of a matrix and the method LSI via SDD. For more details, see [2, 9].

#### 4.1. Semidiscrete decomposition

A semidiscrete decomposition (SDD) expresses a matrix as a weighted sum of outer products formed by vectors whose entries are taken from the set S = {−1, 0, 1}; it is given as

$$A_k = \underbrace{[x_1 \; x_2 \cdots x_k]}_{X_k} \underbrace{\begin{bmatrix} d_1 & 0 & \cdots & 0\\ 0 & d_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & d_k \end{bmatrix}}_{D_k} \underbrace{\begin{bmatrix} y_1^t\\ y_2^t\\ \vdots\\ y_k^t \end{bmatrix}}_{Y_k^t} = \sum_{i=1}^k d_i x_i y_i^t, \tag{12}$$

where the x<sub>i</sub> ∈ R<sup>m</sup> and y<sub>i</sub> ∈ R<sup>n</sup> are formed by elements of the set S = {−1, 0, 1}, and d<sub>i</sub> is a positive scalar, called the i-th SDD value. The matrix A<sub>k</sub> is called the semidiscrete decomposition of rank k (or k-term SDD). The algorithm for computing the SDD of a matrix, together with some of its properties, for example its convergence, is described in [9].
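The structure of Eq. (12), and the query matching of Eq. (13) below, can be sketched with hypothetical SDD factors; the X, Y, and d values here are made up, and a real SDD would be produced by the algorithm of [9]:

```python
import numpy as np

# Hypothetical k=2 SDD factors for a 5x4 matrix; the entries of X and Y
# are restricted to {-1, 0, 1} and each d_i > 0, as in Eq. (12).
X = np.array([[ 1,  0],
              [ 1,  1],
              [ 0,  1],
              [ 1, -1],
              [ 0,  1]], dtype=np.int8)
Y = np.array([[ 1,  1],
              [ 0,  1],
              [ 1,  0],
              [-1,  1]], dtype=np.int8)
d = np.array([2.5, 1.0])               # the SDD values

# A_k = X_k D_k Y_k^t = sum_i d_i x_i y_i^t
A_k = (X * d) @ Y.T
assert np.allclose(A_k, sum(d[i] * np.outer(X[:, i], Y[:, i])
                            for i in range(2)))

# Query matching as in Eq. (13): p = q~^t A~, q~ = X_k^t q, A~ = D_k Y_k^t.
q = np.array([1., 0., 1., 0., 0.])
p = (X.T @ q) @ (np.diag(d) @ Y.T.astype(float))
print(p, np.argsort(-p))               # documents ranked by components of p
```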

#### 4.2. LSI via SDD


The truncated SVD produces the best rank-k approximation of a matrix; however, even for a very low-rank approximation, it generally requires more storage than the original matrix if the latter is sparse, that is, if the majority of its components are zero. As term-document matrices that correspond to real databases are commonly large and sparse, using the truncated SVD can be extremely expensive in terms of storage. For this reason, to save space (and query time), the SDD is proposed in [2] as an alternative to the SVD in LSI.

In this sense, LSI via SDD consists of replacing the term-document matrix by an approximation that allows, as pointed out in [10], identifying the clusters formed by the documents present in databases and, at the same time, saving a considerable amount of storage with respect to other factorizations. In [2], it is shown that for equal values of k, the SVD requires about 32 times more storage than the SDD. Specifically, LSI via SDD consists of approximating the term-document matrix by a sum of rank-1 outer products, as in the SVD, but whose vectors consist only of elements of the set S = {−1, 0, 1}. For more details on LSI via SDD, see [2, 11].
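The 32× figure is plausible from a back-of-the-envelope count, assuming 64-bit floats for the SVD factors and about two bits per {−1, 0, 1} entry for the SDD; the actual encoding used in [2] may differ:

```python
# Rough storage comparison for equal k (assumed encodings: 64-bit floats
# for the SVD factors, ~2 bits per {-1, 0, 1} entry of X_k and Y_k).
m, n, k = 100_000, 50_000, 100

svd_bits = 64 * k * (m + n + 1)        # U_k, V_k, and k singular values
sdd_bits = 2 * k * (m + n) + 64 * k    # X_k, Y_k entries plus k SDD values

print(svd_bits / sdd_bits)             # roughly 32, consistent with [2]
```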

To match the queries with the documents using LSI via SDD, we proceed in the same way as in LSI via SVD, that is, by calculating the product

$$p = \tilde{q}^t \tilde{A}, \tag{13}$$

where q̃ = X<sub>k</sub><sup>t</sup>q and Ã = D<sub>k</sub>Y<sub>k</sub><sup>t</sup>. Relevant documents are those that correspond to the largest components of p.

#### 5. The CLIR problem

Different documents can contain information that is conceptually the same without using similar words. When people pose a query to an IR system, for example, a search engine such as Google, they do so by concept, and the words they use generally do not match those of the relevant documents. In this way, the main objective of CLIR, which is the retrieval of information relevant to the query in the same and in other languages, is strongly affected, because terms of different languages must be compared.

To address this situation, cross-language databases have been created, which are collections of documents that combine low percentages of several languages; for their construction, it is necessary to take into account two concepts closely related to CLIR: parallel aligned corpus and fusion strategies.
