128 Multilingualism and Bilingualism

#### 5.1. Parallel aligned corpus

A parallel text is a text accompanied by its translations into other languages. Large collections of parallel texts are called parallel corpora. To use a parallel corpus correctly, the original text must be aligned with its translation or translations, that is, the phrases or words in the original text must be matched with their corresponding translations in the other languages. The result is known as a parallel aligned corpus.

As stated by Kolda et al. in [12], perhaps the biggest decision to make when implementing multilanguage LSI is which parallel aligned corpus to use. In this work, we have adopted the Bible, for the following reasons: (i) it is probably the most translated book in the world, which gives us many translations of the same documents; (ii) its presentation by chapters and verses facilitates its parallel alignment; and (iii) within the Gospels (Matthew, Mark, Luke, and John), it is easy to identify facts related to the life of Jesus and thus recognize relevant documents for queries made in this context.

#### 5.2. Fusion strategies

The central purpose of CLIR is to develop tools that allow the terms of a query to match those of documents that describe the same or a similar meaning, even if they are in different languages [13]. The goal is the construction of a parallel aligned corpus using the languages of the documents. For two languages, this can be done, for example, by taking portions of the documents in one language and adding them to the corresponding documents in the other language. This is called a fusion strategy. Related works are [14, 15].

This work seeks to retrieve relevant documents in Spanish and English when queries are made in Spanish, using fusion strategies that combine approximately 10% of the documents. The central idea behind each fusion is to take a specific number of verses in one language and add them to the corresponding verses of the other language.

6. Study cases

Four case studies are developed to evaluate the performance of two LSI methods, LSI via SVD and LSI via SDD, applied to CLIR. The first identifies the LSI model that gives the best results in terms of the mean average pseudoprecision (MAP). For this case, two English translations of the Gospels were used: the King James and the New Living Bible. The second case starts from the model chosen in the first and develops two experiments, each with a different fusion strategy that combines small portions of the Gospels in English and Spanish, using the King James and Reina Valera 1966 versions. Cases 3 and 4, respectively, compare the LSI methods computationally and analyze their performance as the collection of documents increases.

In all cases, each document consists of a group of verses that form a story; the verses were taken from the New International Version of the Bible, which organizes the verses by stories and gives each one a title. The queries used are those given in Table 1, which describe parables and miracles in the life of Jesus. Table 2 shows the biblical passage where each of these queries is located, that is, the documents that are relevant.

#### 6.1. Case 1: identification of the LSI model

An LSI model is the set of parameters considered when applying the latent semantic indexing method, that is, the local and global weighting schemes, the number of factors (k), the fusion strategies, and so on, that are chosen for the retrieval experiments. A four-letter string is used to distinguish LSI models. The first three letters indicate the local weight, the global weight, and the use of normalization in the term-document matrix, respectively; the last letter refers to the query matrix and indicates the local weight of its terms. The nomenclature fex.l, for example, means that in the term-document matrix the frequency (f) was used for the local weight and an entropy value (e) (see [1]) for the global weight of the terms; in addition, the columns of the term-document matrix were not normalized (x), and a logarithmic value (l) was used for the local weight of the query terms.
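To make the nomenclature concrete, the following minimal sketch splits a model code into its four roles and applies a log local weight with an entropy global weight to one row of term counts. The formulas used here (log(1 + f) for the local weight and 1 + Σ p·log(p) / log(n) for the entropy weight) are the commonly used forms and are an assumption; the exact definitions adopted in the chapter are those of [1]. The function names are hypothetical.

```python
import math


def parse_lsi_model(code):
    """Split a model code such as 'fex.l' into the four roles described above."""
    matrix_part, _, query_part = code.partition(".")
    if len(matrix_part) != 3 or len(query_part) != 1:
        raise ValueError(f"unexpected model code: {code!r}")
    local, global_weight, normalization = matrix_part
    return {
        "local": local,
        "global": global_weight,
        "normalization": normalization,
        "query_local": query_part,
    }


def entropy_global_weight(term_counts):
    """Assumed entropy weight of one term: 1 + sum_j p_j*log(p_j) / log(n)."""
    n = len(term_counts)
    total = sum(term_counts)
    if n < 2 or total <= 0:
        raise ValueError("need at least two documents and one occurrence")
    acc = sum((f / total) * math.log(f / total) for f in term_counts if f > 0)
    return 1.0 + acc / math.log(n)


def log_entropy_row(term_counts):
    """One row of the weighted matrix: log local weight times entropy weight."""
    g = entropy_global_weight(term_counts)
    return [math.log(1 + f) * g for f in term_counts]


print(parse_lsi_model("fex.l"))
print(log_entropy_row([5, 0, 0, 0]))  # term concentrated in one document
print(log_entropy_row([2, 2, 2, 2]))  # evenly spread term, weights near zero
```

Under these assumed formulas, a term that appears in a single document keeps its full local weight, while a term spread evenly over all documents is weighted down to approximately zero.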



Table 1. Queries used for the case studies.

Table 2. Location of queries in the gospels.

In this case, different LSI models are tested and the best one is determined from the MAP for k = 100. Table 3 reports the results.

With both methods, LSI via SVD and LSI via SDD, the highest MAP values, marked in bold, were achieved with the models len.f, len.b, and len.l. This means that the log-entropy scheme performs best and that the local weight of the terms in the query matrix does not affect the quality of the retrieval. For this reason, only the len.f model is used with both methods in all subsequent experiments.
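The two observations above can be checked directly against the values of Table 3. The sketch below transcribes the table into a dictionary and looks up the best models per method, then groups the scores by the first three letters of the model code; the helper names are hypothetical.

```python
# MAP at k = 100, transcribed from Table 3: model -> (SVD, SDD).
TABLE_3 = {
    "cxn.f": (0.6948, 0.6440), "fxn.f": (0.6762, 0.5920),
    "lxn.f": (0.7444, 0.6276), "lxn.b": (0.7444, 0.6276),
    "lxn.l": (0.7444, 0.6276), "len.f": (0.7591, 0.7035),
    "len.b": (0.7591, 0.7035), "len.l": (0.7591, 0.7035),
    "lin.f": (0.7375, 0.6811), "lin.b": (0.7375, 0.6811),
    "lin.l": (0.7375, 0.6811), "lpn.f": (0.7058, 0.6957),
    "lpn.b": (0.7058, 0.6957), "lpn.l": (0.7058, 0.6957),
}
METHODS = {"SVD": 0, "SDD": 1}


def best_models(table, method):
    """Return the highest MAP for a method and the models that reach it."""
    column = METHODS[method]
    best = max(scores[column] for scores in table.values())
    winners = sorted(m for m, scores in table.items() if scores[column] == best)
    return best, winners


def scores_by_matrix_scheme(table):
    """Group the (SVD, SDD) scores by the three-letter matrix scheme."""
    groups = {}
    for model, scores in table.items():
        groups.setdefault(model.split(".")[0], set()).add(scores)
    return groups


print(best_models(TABLE_3, "SVD"))
print(best_models(TABLE_3, "SDD"))
# One distinct score pair per scheme: the query weight letter changes nothing.
print({scheme: len(s) for scheme, s in scores_by_matrix_scheme(TABLE_3).items()})
```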

#### 6.2. Case 2: fusion strategies

Two experiments are developed that merge the English documents with their corresponding Spanish versions. Each uses a different fusion strategy and retrieves 20 documents per query. For each query in each experiment, the choice of k is analyzed in order to establish a range from which to select it. The errors obtained are illustrated in terms of the average pseudoprecision, and tables give details of what was retrieved for each query. The errors were calculated with the formula

$$Error = 1 - \frac{1}{n} \sum_{i=0}^{n-1} \check{P}\left(\frac{i}{n-1}\right). \tag{14}$$
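As a small worked illustration of Eq. (14), the sketch below computes the error from n values of the pseudoprecision sampled at the levels i/(n − 1). The pseudoprecision function itself is defined earlier in the chapter and is not reproduced here, so it is passed in as an argument; the function names and the linear example curve are hypothetical.

```python
def retrieval_error(samples):
    """Eq. (14): one minus the mean of the n sampled pseudoprecision values."""
    n = len(samples)
    if n < 2:
        raise ValueError("Eq. (14) needs at least two sample points")
    return 1.0 - sum(samples) / n


def retrieval_error_from_function(pseudoprecision, n):
    """Sample a pseudoprecision function at i/(n-1), i = 0..n-1, then apply Eq. (14)."""
    if n < 2:
        raise ValueError("Eq. (14) needs at least two sample points")
    samples = [pseudoprecision(i / (n - 1)) for i in range(n)]
    return retrieval_error(samples)


# A perfect retrieval (pseudoprecision 1 everywhere) has zero error.
print(retrieval_error([1.0, 1.0, 1.0]))
# A hypothetical curve that falls linearly from 1 to 0 has an error of about 0.5.
print(retrieval_error_from_function(lambda x: 1.0 - x, 11))
```

An error of 1 therefore corresponds to the 100% error reported when a method fails completely on a query.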

| Gospels | Doc. English | Doc. Spanish |
|---|---|---|
| Matthew | Eng + Spa | Spanish |
| Mark | Eng + Spa | Spanish |
| Luke | Eng + Spa | Spanish |
| John | Eng + Spa | Spanish |

Example of a database document:

> And seeing the multitudes, he went up into a mountain: and when he was set, his disciples came unto him: Viendo la multitud, subio al monte; y sentandose, vinieron a el sus discipulos. And he opened his mouth, and taught them, saying,

Table 4. Fusion scheme of the database and example of one of its documents.

http://dx.doi.org/10.5772/intechopen.74171


Cross - Language Information Retrieval Using Two Methods: LSI via SDD and LSI via SVD

The database has 670 documents, of which, considering the 12 queries, 72 are relevant, that is, only 10.74% of the collection is relevant. The amount of storage required is 0.375 MB.

#### 6.2.1. Experiment 1

The fusion strategy increases the size of the database by approximately 10%. It consists of taking a single verse from the beginning of each document in Spanish and adding it to the end of the corresponding first verse in English. Table 4 illustrates the structure of the database and one of its documents.
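The fusion step just described can be sketched as follows, assuming that each document is represented as a list of verses. The function name and this representation are hypothetical; the sample verses are those of the example document in Table 4.

```python
def fuse_first_verse(target_doc, source_doc):
    """Append the first verse of source_doc to the end of the first verse of target_doc.

    Both documents are lists of verse strings; target_doc is not modified.
    """
    if not target_doc or not source_doc:
        raise ValueError("both documents need at least one verse")
    fused_first = target_doc[0] + " " + source_doc[0]
    return [fused_first] + list(target_doc[1:])


english = [
    "And seeing the multitudes, he went up into a mountain: "
    "and when he was set, his disciples came unto him:",
    "And he opened his mouth, and taught them, saying,",
]
# Only the first Spanish verse is needed for this fusion.
spanish = [
    "Viendo la multitud, subio al monte; y sentandose, "
    "vinieron a el sus discipulos.",
]

fused = fuse_first_verse(english, spanish)
print(" ".join(fused))
```

Joining the fused verses reproduces the example document of Table 4: the Spanish verse sits between the first and second English verses.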

| LSI model | SVD | SDD | LSI model | SVD | SDD |
|---|---|---|---|---|---|
| cxn.f | 0.6948 | 0.6440 | len.l | **0.7591** | **0.7035** |
| fxn.f | 0.6762 | 0.5920 | lin.f | 0.7375 | 0.6811 |
| lxn.f | 0.7444 | 0.6276 | lin.b | 0.7375 | 0.6811 |
| lxn.b | 0.7444 | 0.6276 | lin.l | 0.7375 | 0.6811 |
| lxn.l | 0.7444 | 0.6276 | lpn.f | 0.7058 | 0.6957 |
| len.f | **0.7591** | **0.7035** | lpn.b | 0.7058 | 0.6957 |
| len.b | **0.7591** | **0.7035** | lpn.l | 0.7058 | 0.6957 |

Table 3. MAP for different LSI models.

Figure 1 shows, for each query, the error level as a function of k. The error curves give clues for the selection of k in almost all the queries. In the first one, for example, k = 70 would be the optimum k for LSI via SVD. In Query 5, the two methods failed completely, with a 100% error; in Queries 8, 9, and 10, the errors are approximately zero for almost all values of k. The methods behave well in Queries 2, 4, 6, 7, 8, 9, 10, and 11 for many values of k, in particular for some values below 110. In Queries 1 and 12, the best results were also obtained for these values. There are usually many local minima, which makes it difficult to automate the choice of k through a parameter selection algorithm.

Figure 1. Errors versus k for each query. The asterisks and circles indicate the methods LSI via SVD and LSI via SDD, respectively.

In Table 5, we show two values of k and the corresponding errors for each query. The first, called the optimal k, is the smallest k for which the smallest error was obtained. The other, called the selected k, is defined in the same way but considers only k < 110. In all queries except q3, the optimal k matches the selected k, which suggests that the domain of choice of k can be reduced by selecting k in a narrower interval. Table 6 shows results analogous to those in Table 5 for the k selected in the interval [70, 100].

Table 5. Fusion 1. Query errors for the optimal k and the selected k (columns per query, for SVD and SDD: $k_{opt}$, Err, $k_{sel}$, Err).

Table 6. Fusion 1. Errors per query for the selected k in [70, 100] (columns per query, for SVD and SDD: $k_{sel}$, Error).

Here, in all queries except q3 and q12, the selected k increased; however, the errors in q1, q2, q5, q6, q7, q8, q9, and q10 remained the same. In q4 and q11, the errors increased from 0 to 3% and from 9 to 12%, respectively.

From the above, it is concluded that the performance of the LSI methods deteriorated only slightly when k was restricted to this interval, since the errors were maintained in 10 of the 12 queries and increased by small percentages in only two. The main contribution of these tables is to have identified a small range for the choice of the parameter k.

#### 6.2.2. Experiment 2

The fusion strategy used in this case also combines a single verse, that is, 10% of the documents, but unlike the previous experiment, it takes verses in English and adds them to the corresponding verses in Spanish and vice versa. The structure of the database is illustrated in Table 7.

| Gospels | Chapters | Doc. English | Doc. Spanish |
|---|---|---|---|
| Matthew | 1–14 | Eng + Spa | Spanish |
| Matthew | 15–28 | English | Spa + Eng |
| Mark | 1–8 | Eng + Spa | Spanish |
| Mark | 9–16 | English | Spa + Eng |
| Luke | 1–12 | Eng + Spa | Spanish |
| Luke | 13–24 | English | Spa + Eng |
| John | 1–10 | Eng + Spa | Spanish |
| John | 11–21 | English | Spa + Eng |

Table 7. Scheme of the database for Fusion 2.

Figure 2 illustrates the error curves for each query as the parameter k increases.

Again, an error of 100% was obtained for q5. In q2, q6, q7, q8, q9, and q10, the methods reached errors close to zero for some values of k. In q1 and q11, only LSI via SDD obtained errors close to that value. Again, each query shows many local minima in its error levels. Table 8 gives the optimal k, the selected k, and the k selected in the interval [70, 100] (the last two denoted by $k_{sel1}$ and $k_{sel2}$, respectively), together with the corresponding errors.
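The three selection rules (optimal k, selected k for k < 110, and selected k in [70, 100]) all pick the smallest k that attains the smallest error over some set of admissible values. A minimal sketch of that rule follows; the function name is hypothetical and the error curve is made-up illustrative data, not a result from the chapter.

```python
def smallest_best_k(errors_by_k, allowed=lambda k: True):
    """Return (k, error) for the smallest admissible k with the smallest error."""
    candidates = {k: e for k, e in errors_by_k.items() if allowed(k)}
    if not candidates:
        raise ValueError("no admissible value of k")
    best_error = min(candidates.values())
    best_k = min(k for k, e in candidates.items() if e == best_error)
    return best_k, best_error


# Hypothetical error curve for one query (k -> error), for illustration only.
errors = {20: 0.50, 55: 0.25, 70: 0.40, 85: 0.30, 100: 0.30, 355: 0.0}

k_opt = smallest_best_k(errors)
k_sel1 = smallest_best_k(errors, allowed=lambda k: k < 110)
k_sel2 = smallest_best_k(errors, allowed=lambda k: 70 <= k <= 100)
print(k_opt, k_sel1, k_sel2)
```

Because the rule only compares sampled values of k, the many local minima of the error curves do not affect it, although they do make the result sensitive to the chosen interval.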

We find that in q2, q5, q6, q7, q8, q9, q10, q11, and q12, the errors for $k_{opt}$, $k_{sel1}$, and $k_{sel2}$ did not change when using LSI via SVD; that is, for this group of nine queries, a k that attains the optimal error lies in the interval [70, 100]. With LSI via SDD, for this same group of queries, except for q2 and q11, the optimal error is also obtained in that interval; in q3, the errors increased when the selection interval of k was reduced; in q4, the errors for $k_{sel1}$ and $k_{sel2}$ were equal but higher than the corresponding ones for $k_{opt}$.

Figure 2. Errors versus k for each query. The asterisks and circles indicate the methods LSI via SVD and LSI via SDD, respectively.

| Query | Method | $k_{opt}$ | Err | $k_{sel1}$ | Err1 | $k_{sel2}$ | Err2 |
|---|---|---|---|---|---|---|---|
| q1 | SVD | 55 | 0.75 | 55 | 0.75 | 70 | 0.83 |
| q1 | SDD | 355 | 0 | 65 | 0.87 | 70 | 0.89 |
| q2 | SVD | 35 | 0 | 35 | 0 | 70 | 0 |
| q2 | SDD | 15 | 0 | 15 | 0 | 80 | 0.16 |
| q3 | SVD | 130 | 0.33 | 110 | 0.49 | 100 | 0.56 |
| q3 | SDD | 150 | 0.33 | 105 | 0.55 | 75 | 0.61 |
| q4 | SVD | 115 | 0.3 | 80 | 0.34 | 80 | 0.34 |
| q4 | SDD | 150 | 0.16 | 70 | 0.21 | 70 | 0.21 |
| q5 | SVD | 10 | 1 | 10 | 1 | 70 | 1 |
| q5 | SDD | 10 | 1 | 10 | 1 | 70 | 1 |
| q6 | SVD | 60 | 0 | 60 | 0 | 70 | 0 |
| q6 | SDD | 85 | 0 | 85 | 0 | 85 | 0 |
| q7 | SVD | 90 | 0 | 90 | 0 | 90 | 0 |
| q7 | SDD | 40 | 0 | 40 | 0 | 70 | 0 |
| q8 | SVD | 25 | 0 | 25 | 0 | 70 | 0 |
| q8 | SDD | 15 | 0 | 15 | 0 | 70 | 0 |
| q9 | SVD | 25 | 0 | 25 | 0 | 70 | 0 |
| q9 | SDD | 15 | 0 | 15 | 0 | 70 | 0 |
| q10 | SVD | 25 | 0 | 25 | 0 | 70 | 0 |
| q10 | SDD | 30 | 0 | 30 | 0 | 70 | 0 |
| q11 | SVD | 75 | 0.06 | 75 | 0.06 | 75 | 0.06 |
| q11 | SDD | 20 | 0 | 20 | 0 | 85 | 0.25 |
| q12 | SVD | 25 | 0.33 | 25 | 0.33 | 75 | 0.33 |
| q12 | SDD | 90 | 0.5 | 90 | 0.5 | 90 | 0.5 |

Table 8. Fusion 2. Errors per query for the optimal k, the selected k, and the selected k in [70, 100].

Considering the two experiments, it is concluded that the errors computed with Eq. (14) favor Fusion 1, since for k in the interval [70, 100] the LSI methods obtained smaller errors than those found with Fusion 2. Therefore, only Fusion 1 is used in the following cases to continue the comparison of the LSI methods.

#### 6.3. Case 3: computational comparison of LSI models


The experiments of the second case study do not consider the efficiency of the IR systems, that is, the running time of the LSI methods, the amount of storage each requires, the ability to obtain relevant documents quickly, and the relationship between these aspects. For this reason, this case presents computational results that allow the LSI methods to be compared in these respects. All tests were performed on a computer with an Intel(R) Core(TM) i5-3230 CPU @ 2.60 GHz and 6 GB of RAM. In Figures 3 and 4, the results obtained by SVD and SDD are marked with an asterisk (\*) and a circle (○), respectively.

Figure 3 illustrates the size in megabytes (MB) of the SVD and SDD decompositions for various values of k. For every k, the SDD clearly consumes much less space than the SVD. For k = 400, for example, the SVD occupies 22.325 MB, while the SDD occupies 0.875 MB, a saving of 21.45 MB. In addition, the lower part shows the time, in seconds, used to obtain each decomposition. For the values of k presented, the SVD evidently takes less time. However, the number of seconds each algorithm needs to build the matrices of the SVD and SDD factorizations is small: even for high values of k, the recorded time is approximately 40 seconds.

Figure 3. Size in megabytes (above) and execution time in seconds of the decompositions (below), as a function of the number of factors (k).
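The storage gap follows from what each factorization has to keep. The sketch below gives a rough theoretical estimate, assuming that a rank-k SVD stores two dense factor matrices plus k singular values as 8-byte floats, and that a rank-k SDD stores two factor matrices with entries in {−1, 0, 1} at 2 bits each plus k floats. These storage conventions, the function names, and the matrix dimensions in the example are assumptions; the sizes reported in the chapter come from its own implementation and need not match this estimate.

```python
import math


def svd_storage_bytes(m, n, k, bytes_per_float=8):
    """Rank-k SVD: U (m x k), V (n x k) and k singular values, all floats."""
    return bytes_per_float * k * (m + n + 1)


def sdd_storage_bytes(m, n, k, bytes_per_float=8, bits_per_ternary=2):
    """Rank-k SDD: X (m x k), Y (n x k) with entries in {-1, 0, 1}, plus k floats."""
    ternary_bits = bits_per_ternary * k * (m + n)
    return math.ceil(ternary_bits / 8) + bytes_per_float * k


# Hypothetical dimensions: 1000 terms, 670 documents, k = 100 factors.
svd = svd_storage_bytes(1000, 670, 100)
sdd = sdd_storage_bytes(1000, 670, 100)
print(svd, sdd, round(svd / sdd, 1))

# Saving reported in the text for k = 400, in MB.
print(round(22.325 - 0.875, 3))
```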

Finally, Figure 4 presents the MAP as a function of the time of the LSI method, calculated as Time of the LSI method = Decomposition time + Query time, and as a function of the amount of storage each method requires. In the graph on the left, there are 20 asterisks corresponding to the values k = 20, 40, …, 400 and 20 circles for the same values of k. The second asterisk (k = 40), for example, means that LSI via SVD required approximately 1.06 seconds to reach a quality of 0.5850, while the second circle shows that LSI via SDD reached a MAP of 0.6062 at 9.21 seconds.

Figure 4. MAP versus time of LSI methods (left) and size of decompositions (right).

The SVD-based method reaches its highest score, 0.7227, at k = 100 (at 4.7 seconds), and the SDD-based method reaches its highest score, 0.6838, at k = 80 (at 18.9 seconds). It should be noted that, from k = 40 onward for LSI via SDD, the qualities of the two methods were very close. On the right side, the size of the decompositions is plotted against the MAP. The last circle (k = 400) means that a quality of 0.6539 was reached with just 0.85 MB; in turn, the first asterisk shows that 1.09 MB yielded a MAP of only 0.4196. Only three asterisks are shown for the SVD because the rest fall outside the range of the graph. Likewise, the highest score for the SDD-based method required only 0.14 MB of storage (approximately one-third of the size of the term-document matrix), while LSI via SVD required 5.6 MB (approximately 15 times the size of the term-document matrix) to achieve its best performance. In terms of storage, LSI via SDD widely outperformed the other method.

#### 6.4. Case 4: adding documents to the database


So far, we have only studied the LSI methods with a fixed document collection. In practice, these collections are often dynamic; that is, new documents are added or existing ones are deleted. This case study analyzes the performance of the LSI methods when different numbers of documents are added to the database. The average pseudoprecision (see Eq. (9)) is used as the quality measure for the analysis by query, and the MAP is used to generalize the study to all queries.

Specifically, 20 and 88 documents were added to the initial collection of 670 documents to obtain two new databases with 690 and 758 documents, which contain 75 and 78 relevant documents, respectively. Figure 5 and Table 9 present the results for these three collections.
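The percentage of relevant documents in each collection follows directly from these counts (72 relevant documents in the initial collection, as stated in Case 2). A short worked computation, with a hypothetical helper name:

```python
import math

# Collection size -> number of relevant documents, as stated in the text.
COLLECTIONS = {670: 72, 690: 75, 758: 78}


def relevant_percentage(relevant, total):
    """Share of the collection that is relevant, in percent."""
    return 100.0 * relevant / total


for total, relevant in COLLECTIONS.items():
    print(total, round(relevant_percentage(relevant, total), 2))

# Truncating instead of rounding gives the 10.74% quoted for 670 documents.
print(math.floor(relevant_percentage(72, 670) * 100) / 100)
```

In all three collections, only about one document in ten is relevant, which is the context for the success rates discussed for Table 9.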

Figure 5. Comparison of the LSI methods with respect to the average pseudoprecision when adding documents to the database.

Table 9. Percentage of relevant documents and MAP for the different databases.

Figure 5 shows, for each of the 12 queries and each database, the average pseudoprecision obtained with the k in Table 6. The results of LSI via SVD are shown on the left, and those of LSI via SDD on the right. The collections with 670, 690, and 758 documents are labeled with a circle, an asterisk, and a triangle, respectively. With both methods, the average pseudoprecision is 0 for Query 5 and 1 for Queries 8, 9, and 10. In Query 2, LSI via SVD also had a score of 1, while LSI via SDD did so only for 670 documents. In the rest of the queries, the averages go up or down as documents are added to the database. In Query 12, for example, the highest score is obtained with 690 documents, a lower one with 758 documents, and the lowest with only 670 documents. From this, it is concluded that there is no direct or inverse relationship between the average pseudoprecision and the number of documents in the database.

On the other hand, to evaluate the performance of the methods over all the queries, Table 9 presents the MAP obtained in each database, again using the k in Table 6.

As a first observation, the MAP levels are higher in all databases when LSI via SVD is used. However, the six scores shown are evidence of the good performance of both methods given the low percentages of relevant documents in each collection, since all the success rates exceed 65%. On the other hand, with the SDD-based method there seems to be a direct relationship between the percentage of relevant documents and the MAP values: the higher the percentage of relevant documents, the greater the MAP.

7. Conclusions

The LSI method originally used the singular value decomposition (SVD) for its benefits in representing data in spaces of reduced dimension and for other properties related to data filtering. This makes the SVD a powerful tool in IR and in CLIR. The semidiscrete decomposition (SDD), which has been the subject of few investigations, has been used successfully in IR, and this research has shown that it is also useful in CLIR and comparable with the standard SVD-based approach. Evidence of this is that

• In Case 2, for Fusion 1, the errors for the k selected in the interval [70, 100] were the same in 7 of the 12 queries, and in the rest they differ by at most 47%.

• When efficiency was evaluated in Case 3, each method surpassed the other in a particular aspect. Specifically, when the MAP was related to the time of the methods, LSI via SVD prevailed because it requires fewer seconds to reach its highest performance; in contrast, when the MAP was related to the amount of storage, LSI via SDD performed significantly better. With SDD, only one-third of the size of the original matrix was needed to reach its highest performance; with SVD, almost 15 times the size of the term-document matrix was required to achieve its highest value. This is the true impact of the SDD: the ability to obtain good results at a very low cost in terms of storage.

• The MAP quality measure evaluates the performance of an IR method over a set of queries: the higher its value, the better the method's performance. In the fourth case study, when the number of documents in the database was increased, the SVD performed better, since it gave higher MAP values. However, both methods gave satisfactory results, because a search in Spanish retrieves relevant documents in this language and in English with a success rate of at least 65%.

Therefore, it is concluded that although LSI via SVD has been widely used and is a powerful tool in CLIR, LSI via SDD is an important and innovative alternative for information retrieval tasks: in addition to achieving results comparable to those of the other method when retrieving relevant information in multiple languages after querying in only one, it also saves large amounts of space when huge databases are stored.
