5. The CLIR problem

Different documents can contain information that is conceptually the same without using similar words. When people query an IR system, for example a search engine such as Google, they do so by concept, and the words they use generally do not match those of the relevant documents. As a result, the main objective of CLIR, which is to retrieve relevant information both in the language of the query and in other languages, is strongly affected, because terms from different languages must be compared.

To address this situation, multilingual databases have been created, that is, collections of documents that combine small proportions of different languages. Their construction relies on two concepts closely related to CLIR: the parallel aligned corpus and fusion strategies.

#### 5.1. Parallel aligned corpus

A parallel text is a text accompanied by its translations into other languages. A large collection of parallel texts is called a parallel corpus. To use a parallel corpus correctly, the original text must be aligned with its translation or translations, that is, the phrases or words in the original text must be matched with their corresponding translations in the other languages. The result is known as a parallel aligned corpus.
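The alignment step can be illustrated with a minimal sketch, assuming each translation is stored as a dictionary keyed by a shared segment identifier such as a verse reference. The function name `align_by_key` and the placeholder texts are hypothetical and are not part of the original work.

```python
def align_by_key(source, target):
    """Pair segments of two translations that share the same key.

    `source` and `target` map a segment identifier (for example a verse
    reference) to its text. Segments present in only one translation are
    skipped, since they have no counterpart to align with.
    """
    return [(key, source[key], target[key]) for key in source if key in target]


# Placeholder texts, not real verse content.
spanish = {"Mt 3:13": "texto es 1", "Mt 3:14": "texto es 2", "Mt 3:15": "texto es 3"}
english = {"Mt 3:13": "text en 1", "Mt 3:14": "text en 2"}

aligned = align_by_key(spanish, english)
for key, es, en in aligned:
    print(key, "|", es, "|", en)
```

Keying on an explicit identifier sidesteps the harder problem of aligning free text sentence by sentence, which is only possible when the corpus already carries such identifiers.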

As stated by Kolda et al. in [12], perhaps the biggest decision to make when implementing multilanguage LSI is which parallel aligned corpus to use. In this work, we have adopted the Bible, for the following reasons: (i) it is probably the most translated book in the world, which gives us many translations of the same documents; (ii) its presentation by chapters and verses facilitates parallel alignment; and (iii) in the Gospels (Matthew, Mark, Luke, John), it is easy to identify events related to the life of Jesus and thus to recognize the relevant documents for queries made in this context.

Cross-Language Information Retrieval Using Two Methods: LSI via SDD and LSI via SVD

http://dx.doi.org/10.5772/intechopen.74171

stories and gives each one a title. The queries used are those given in Table 1, which describe parables and miracles in the life of Jesus. Table 2 shows the biblical citation where each of these queries is located, that is, the documents that are relevant.

| No. | Query | No. | Query | No. | Query |
|---|---|---|---|---|---|
| 1 | El bautizo de Jesús | 5 | Niño epiléptico curado | 9 | Vino nuevo viejo odres |
| 2 | Impuesto al Cesar | 6 | La alimentación a cinco mil | 10 | El sembrador y la tierra |
| 3 | Limpieza al templo | 7 | La higuera maldita | 11 | Grano de mostaza |
| 4 | Entrada a Jerusalén | 8 | Tela nueva vestido Viejo | 12 | Higuera |

Table 1. Queries used for the case studies.

| Query | Matthew | Mark | Luke | John |
|---|---|---|---|---|
| 1 | Mt 3:13-17 | Mc 1:9-11 | Lc 3:21-23 | Jn 1:29-39 |
| 2 | Mt 22:15-22 | Mc 12:13-17 | Lc 20:20-26 | |
| 3 | Mt 21:12-13 | Mc 11:12-14 | | Jn 2:14-22 |
| 4 | Mt 21:1-11 | Mc 11:1-10 | Lc 19:29-44 | Jn 12:12-19 |
| 5 | Mt 17:14-18 | Mc 9:17-29 | Lc 9:38-43 | |
| 6 | Mt 14:15-21 | Mc 6:35-44 | Lc 9:12-17 | Jn 6:5-13 |
| 7 | Mt 21:18-22 | Mc 11:12-14,20-25 | | |
| 8 | Mt 9:16 | Mc 2:21 | Lc 5:36 | |
| 9 | Mt 9:17 | Mc 2:22 | Lc 5:37-38 | |
| 10 | Mt 13:3-8,18-23 | Mc 4:3-8,14-20 | Lc 8:5-8,11-15 | |
| 11 | Mt 13:31-32 | Mc 4:30-32 | Lc 13:18-19 | |
| 12 | Mt 24:32-35 | Mc 13:28-29 | Lc 21:29-31 | |

Table 2. Location of queries in the gospels.

#### 6.1. Case 1: identification of the LSI model

An LSI model is the set of parameters considered when applying the latent semantic indexing method, that is, the local and global weighting schemes, the number of factors (k), the fusion strategies, etc., that are chosen for the retrieval experiments. A four-letter string is used to distinguish LSI models. The first three letters indicate the local weight, the global weight and the use of normalization in the term-document matrix, respectively, and the last corresponds to the query matrix and refers to the local weight of its terms. For example, the nomenclature fex.l means that, in the term-document matrix, the frequency (f) was used for the local weight and an entropy value (e) (see [1]) for the global weight of the terms; the columns of the term-document matrix were not normalized (x); and a logarithmic value (l) was used for the local weight of the query terms.
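The four-letter nomenclature used for LSI models (for example `fex.l`) can be decoded mechanically. The following is a minimal sketch; the names `LSIModelName`, `parse_model_name` and `MEANINGS` are hypothetical, and only the four codes spelled out in the text (f, e, x, l) are given a meaning. Any other letter is reported as unknown rather than guessed.

```python
from typing import NamedTuple


class LSIModelName(NamedTuple):
    local_weight: str        # local weight in the term-document matrix
    global_weight: str       # global weight in the term-document matrix
    normalization: str       # normalization of the matrix columns
    query_local_weight: str  # local weight of the query terms


def parse_model_name(name):
    """Split a name such as 'fex.l' into its four one-letter codes."""
    matrix_part, sep, query_part = name.partition(".")
    if sep != "." or len(matrix_part) != 3 or len(query_part) != 1:
        raise ValueError(f"expected a name of the form 'abc.d', got {name!r}")
    return LSIModelName(*matrix_part, query_part)


# Only the codes spelled out in the text are listed here.
MEANINGS = {
    "local_weight": {"f": "frequency"},
    "global_weight": {"e": "entropy"},
    "normalization": {"x": "columns not normalized"},
    "query_local_weight": {"l": "logarithmic"},
}

model = parse_model_name("fex.l")
for field, code in model._asdict().items():
    print(field, code, MEANINGS[field].get(code, "unknown code"))
```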

#### 5.2. Fusion strategies

The central purpose of CLIR is to develop tools that allow the terms of a query to match those of documents that describe the same or a similar meaning, even if they are in different languages [13]. The goal is the construction of a parallel aligned corpus using the languages of the documents. For two languages, for example, this can be done by taking portions of documents in one language and adding them to the corresponding documents in the other language. This is called a fusion strategy. Related works are [14, 15].

This work seeks to retrieve relevant documents in Spanish and English when queries are made in Spanish, using fusion strategies that combine approximately 10% of the documents. The central idea behind each fusion strategy used is to take a specific number of verses in one language and add them to the corresponding verses in the other language.
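The idea of a fusion strategy can be sketched as follows. This is only an illustration under stated assumptions: verses are stored in dictionaries keyed by a shared reference, and the verses to fuse are picked at a regular stride so that roughly 10% are combined. The function name `fuse`, the stride-based selection and the placeholder texts are hypothetical; the actual selection rules of the strategies used in this work are not reproduced here.

```python
def fuse(primary, secondary, fraction=0.10):
    """Append a share of `secondary` verses to the matching `primary` verses.

    Both arguments map a verse reference to its text. Roughly `fraction` of
    the shared verses are fused, chosen at a regular stride (an assumption
    made for this sketch).
    """
    shared = [key for key in primary if key in secondary]
    stride = max(1, round(1 / fraction))
    chosen = set(shared[::stride])
    return {
        key: f"{text} {secondary[key]}" if key in chosen else text
        for key, text in primary.items()
    }


# Placeholder corpus of 20 aligned verses (hypothetical keys and texts).
spanish = {f"v{i}": f"es{i}" for i in range(20)}
english = {f"v{i}": f"en{i}" for i in range(20)}

fused = fuse(spanish, english)
print(fused["v0"])  # a fused verse: Spanish text followed by English text
print(fused["v1"])  # a verse left in Spanish only
```

Because a fused verse contains terms from both languages, it gives the model co-occurrence evidence that links a Spanish term to its English counterpart, which is what allows a Spanish query to reach English documents.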
