**3. Information access to Mongolian historical documents**

In recent years, the need to use digital representations of historical documents and to provide access to them has encouraged the development of various tools for transcribing, annotating and publishing historical manuscripts. To provide computer-driven solutions to the challenges facing Mongolian humanities scholarship, and to benefit from recent achievements in the digital humanities worldwide, it is necessary to analyse the requirements that Mongolian historical documents place on digital tools.

In this section, we describe our methods for implementing integrated access to historical documents that can cope with linguistic transformations from ancient times to the present. First, we propose an information extraction method for digitized ancient Mongolian historical documents. The proposed method extracts named entities from historical manuscripts by utilizing machine learning techniques. The results are used for building digital text representations that encode named entities as well as the possible alterations, corrections, errors and interpretations of ancient Mongolian words in a modern language. In later sections, we discuss how to develop a digital edition of Mongolian historical documents by considering various features and requirements of Mongolian manuscripts.

#### **3.1. Information extraction from ancient Mongolian documents**

This section discusses an information extraction method for digitized ancient Mongolian documents that uses the features of traditional Mongolian script. Named entities such as personal names and place names are extracted automatically from the digitized text of ancient Mongolian documents by a support vector machine (SVM), with the aim of reducing the labour-intensive analysis of historical text. Information extraction, named entity extraction (NEE) and tagging or annotation can turn plain text into structured data for analysis or effective use via NLP applications and analytical methods. State-of-the-art NEE systems for English achieve near-human performance in extracting named entities [8]. However, there has been little research on text mining or NEE for the Mongolian language, and none of it has considered ancient Mongolian historical documents. Therefore, proposing an information extraction method for ancient historical documents in traditional Mongolian script is crucial.

#### *3.1.1. The proposed approach*

TMSDL can be used to access and retrieve historical manuscripts written in traditional Mongolian script using a query in modern Mongolian (Cyrillic). The research achievements, as well as the experience gained from developing the TMSDL, have motivated us to carry out further research on methods for providing cross-lingual and cross-chronological information access to ancient Mongolian historical documents.

Certainly, there has been little research on text mining for the Mongolian language, and none of it has considered ancient Mongolian historical documents. Because of the notable differences between mediaeval and modern Mongolian, the existing NLP tools, which were designed for modern Mongolian, do not perform well on ancient Mongolian texts. Therefore, further computerized analyses of ancient Mongolian historical documents are necessary.

**Figure 1.** A folio of the 'little' Altan Tobchi in the TMSDL with keywords' highlights.

146 Multilingualism and Bilingualism


The flowchart in **Figure 2** shows an overview of the main steps and components of the proposed approach. The approach starts with preprocessing tasks, in which an ancient Mongolian corpus is tokenized, each token is annotated and gold standard annotations are prepared as input for SVM learning. The proposed method learns the extraction rules for personal names from annotated training corpora and then extracts personal names from ancient Mongolian texts by using the SVM. The following sections explain the three main components: (1) preprocessing, (2) annotating and (3) named entity extraction.
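As a minimal sketch of the first of these components (the tokenization detailed in Section 3.1.1.1), the following Python function splits text on ordinary whitespace only, so that the Mongolian Vowel Separator (U+180E) and the narrow no-break space (U+202F) stay inside a token and can be recorded as features. The function name, the dictionary keys and the transliterated sample are illustrative assumptions, not the implementation used in the experiments.

```python
import re
from typing import Dict, List

MVS = "\u180e"    # MONGOLIAN VOWEL SEPARATOR
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE


def tokenize(text: str) -> List[Dict[str, object]]:
    """Split on ordinary whitespace only; MVS and NNBSP stay word-internal."""
    tokens = []
    # The character class deliberately excludes MVS and NNBSP.
    for raw in re.split(r"[ \t\r\n]+", text.strip(" \t\r\n")):
        if not raw:
            continue
        tokens.append({
            "token": raw,
            "has_mvs": MVS in raw,      # separated final 'a'/'e'
            "has_nnbsp": NNBSP in raw,  # visually separated suffix
        })
    return tokens


# Hypothetical sample: Latin transliteration with the control characters
# inserted where a separated vowel or suffix would occur.
sample = "qoyin\u180ea qan eke\u202fd\u00fcr\u202fiyen"
for entry in tokenize(sample):
    print(entry)
```

Keeping the two control characters inside the token means that a word and its visually separated suffixes are treated as one unit, which is the behaviour the preprocessing step requires.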

#### *3.1.1.1. Preprocessing step*

The first step is to divide the digitized ancient Mongolian plain text into tokens. This is necessary because we want to mark up each token in the subsequent tasks. A token is quite often a word delimited by spaces, but traditional Mongolian script has some unique features. For instance, certain words with a final vowel letter 'a' or 'e' are separated visually from the preceding consonant by a narrow gap. Moreover, some suffixes are visually separated from the stem of a word or from other suffixes. However, the 'a' or 'e' is an integral part of the word stem, and any attached suffixes are considered an integral part of the word as a whole. In Unicode, the control characters Mongolian Vowel Separator (MVS) and narrow no-break space (NNBSP) handle the behaviour of Mongolian suffixes and of the vowels 'a'/'e' at the end of a word [2]. This information can be used as a feature in the SVM. Other features are discussed in Section 3.1.1.3.

**Figure 2.** Overview of the main steps and components of the proposed approach.

The next step is to annotate tokens and prepare gold standard annotations. Because of the lack of NLP tools and part-of-speech data for ancient Mongolian manuscripts, we first annotate all the personal names in the 'Little' Altan Tobchi using the manually compiled indices of personal names obtained from the 'Qad-un ündüsün quriyangγui altan tobči-Textological Study' [9]. After converting the annotations to a format suitable for a linear classifier, we input the data into the classifier for training, which returns a probability matrix (i.e., a model). The classifier is trained with gold standard annotations of tokens with known classes (i.e., personal names). It calculates a weight for each feature in relation to each class, which can be seen as the probability that an object with those specific characteristics belongs to a certain class (i.e., personal names). These weights are saved in a probability matrix (i.e., the NEE model), which is used for classifying unseen named entities in the next steps.

Cross-Lingual and Cross-Chronological Information Access to Multilingual Historical Documents

http://dx.doi.org/10.5772/intechopen.72421

#### *3.1.1.2. Annotating step*

In this step, each token of the digitized ancient Mongolian manuscript is annotated with the correct tag. We use the IOB2 format [10] for tagging tokens. The 'B' tag indicates the beginning of a personal name, and the 'I' tag indicates tokens inside a personal name. The 'O' tag indicates other tokens that do not belong to personal names. An example of the IOB2 annotation of text in traditional Mongolian script can be seen in **Table 1**.

**Table 1.** An example of the IOB2 annotation of personal names in traditional Mongolian script text.

| Transliteration | IOB2 tag |
|---|---|
| ejen | O |
| naiiman | O |
| sirɣ-a\_yi | O |
| ögelen | B |
| eke\_dür\_iyen | I |
| abču | O |
| irebei | O |
| iregsen\_ü | O |
| qoyin-a | O |
| qan | O |
| yeke | O |
| oru | O |
| saɣubai. | O |
| tengri\_eče | O |
| jayaɣ-a\_bar | O |
| törügsen | O |
| temüjin | B |
| činggis | B |
| qaɣan | I |
| buyu | O |
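To make the IOB2 scheme concrete, the sketch below decodes a sequence of (token, tag) pairs back into personal names, using rows taken from Table 1. The function name is an assumption for illustration; the handling of an 'I' tag that does not follow a 'B' or 'I' tag (it is treated as outside any name) is also a choice made here, not something the source specifies.

```python
from typing import List, Tuple


def extract_names(tagged: List[Tuple[str, str]]) -> List[str]:
    """Collect personal names from IOB2-tagged tokens."""
    names: List[str] = []
    current: List[str] = []
    for token, tag in tagged:
        if tag == "B":
            # A new name starts; close any name that is still open.
            if current:
                names.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the name opened by the preceding 'B'.
            current.append(token)
        else:
            # 'O', or an ill-formed 'I' without an open name.
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Rows from Table 1 (transliteration, IOB2 tag).
rows = [
    ("ögelen", "B"), ("eke_dür_iyen", "I"), ("abču", "O"),
    ("törügsen", "O"), ("temüjin", "B"), ("činggis", "B"),
    ("qaɣan", "I"), ("buyu", "O"),
]
print(extract_names(rows))
```

Because every name must open with 'B', two adjacent names such as 'temüjin' and 'činggis qaɣan' remain distinguishable, which is the reason for choosing IOB2 over a plain inside/outside scheme.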

Because of some unique features of traditional Mongolian script, we also use the 'Start/End' (SE) chunk tag set [11], which represents the character position in a word, along with the IOB2 tags. The 'S' tag is attached to the first character of each word, including the personal names, and the 'E' tag to the last character. Therefore, each token carries (1) an IOB2 tag and (2) an SE tag. SE tags are useful when there is a difference in word boundaries between the test data and the training data [11, 12]. In particular, an approach based on SE tags could improve the SVM prediction when there is no stemmer for traditional Mongolian. After attaching the IOB2 and SE tags to each token, we extract the features for chunking that will be used to learn the rules of personal name extraction. The features, i.e., the characteristics of a token, are explained in the next section.

#### *3.1.1.3. Named entity extraction step*

In this step, the proposed approach finds the personal names in ancient Mongolian digitized texts. The method classifies and groups tokens with the SVM. The classifier takes the extracted features as input and calculates the probability that a token belongs to a personal name. The features of a token serve as clues as to whether or not the token is a named entity. In other words, we need some features to distinguish personal names.

We consider the following features of traditional Mongolian script for distinguishing personal names.

• **Preceding information of the current token**: If the preceding token is generational or dynastic information, an inherited or lifetime title of nobility, or a traditional descriptive phrase, it could indicate that the current token is a personal name.

• **Beginning of a sentence**: For example, subjects or personal names are often at the beginning of a sentence.

• **Suffix**: In traditional Mongolian script, many proper names of living beings and humans take only certain plural suffixes, such as nar or ner, and possessive suffixes [13].

• **Special non-word boundaries**: In traditional Mongolian script, some suffixes are visually separated from the stem of a word or from other suffixes, although they are an integral part of the word. Moreover, in some words with a final vowel letter 'a' or 'e', the final vowel is separated visually from the preceding consonant by a narrow gap, although it is an integral part of the word stem.

• **End of token or special word delimiters**: A token is usually a word delimited by a space, but there exist some unique features in traditional Mongolian script.

• **Information of the preceding and following tokens**: We also extract a feature by looking at the context of the current, preceding and succeeding IOB2 annotations (currently, the window stretches from $C_{n-2}$ to $C_{n+2}$), as visualized in **Table 2**. Such a feature could correct mislabelled IOB2 annotations.

**Table 2.** A feature of the preceding and following two tokens.

| Tokens | $W_{n-3}$ | $W_{n-2}$ | $W_{n-1}$ | $W_{n}$ | $W_{n+1}$ | $W_{n+2}$ | $W_{n+3}$ |
|---|---|---|---|---|---|---|---|
| IOB2 tags | $C_{n-3}$ | $C_{n-2}$ | $C_{n-1}$ | $C_{n}$ | $C_{n+1}$ | $C_{n+2}$ | $C_{n+3}$ |

The final task in this step is to extract the tokens marked up as personal names from the ancient Mongolian digital text.
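The features described in this section can be gathered per token as in the following sketch. Everything here is illustrative: the function name, the feature keys, the padding symbol and the `TITLE_CUES` set are hypothetical, and the sketch assumes that a sentence ends with a token carrying a final full stop and that MVS and NNBSP are still present inside each token. The neighbouring tags are taken from the gold annotation or from an earlier tagging pass.

```python
from typing import Dict, List

MVS = "\u180e"    # MONGOLIAN VOWEL SEPARATOR
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE
PAD = "<PAD>"     # placeholder outside the text boundaries
PLURAL_SUFFIXES = ("nar", "ner")
# Hypothetical cue words; a real list would come from the manual indices.
TITLE_CUES = {"qan", "qaɣan", "ejen"}


def token_features(tokens: List[str], tags: List[str], n: int) -> Dict[str, object]:
    """Build the feature dictionary for the token at position n."""
    tok = tokens[n]
    parts = tok.split(NNBSP)
    feats: Dict[str, object] = {
        "w0": tok,
        "has_mvs": MVS in tok,
        "has_nnbsp": NNBSP in tok,
        "plural_suffix": len(parts) > 1 and parts[-1] in PLURAL_SUFFIXES,
        "sentence_start": n == 0 or tokens[n - 1].endswith("."),
        "prev_is_title": n > 0 and tokens[n - 1] in TITLE_CUES,
    }
    # Context window from n-2 to n+2: neighbouring tokens and IOB2 tags.
    for off in (-2, -1, 1, 2):
        i = n + off
        inside = 0 <= i < len(tokens)
        feats[f"w{off:+d}"] = tokens[i] if inside else PAD
        feats[f"c{off:+d}"] = tags[i] if inside else PAD
    return feats


tokens = ["saɣubai.", "tengri\u202feče", "jayaɣ\u180ea\u202fbar", "törügsen",
          "temüjin", "činggis", "qaɣan", "buyu"]
tags = ["O", "O", "O", "O", "B", "B", "I", "O"]
print(token_features(tokens, tags, 5))
```

Each dictionary would then be converted into the sparse vector format expected by a linear classifier; that conversion is omitted here.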

#### *3.1.2. Performance of extracting named entities from Mongolian historical documents*

The proposed method [14] extracts proper nouns from the digitized text of ancient Mongolian manuscripts with a precision of 0.6993, a recall of 0.5679 and an F-measure of 0.6268 when using the SVM tool LIBLINEAR with the L2-regularized L2-loss support vector classification (dual) solver [15].
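As a consistency check, and assuming that the reported F-measure is the balanced F1 score (the weighting is not stated here), the three figures agree with one another:

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.6993 \times 0.5679}{0.6993 + 0.5679} = \frac{0.7943}{1.2672} \approx 0.6268
$$

The recall being lower than the precision means that missed names, rather than spurious ones, account for most of the errors.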

For the experiments in extracting personal names from traditional Mongolian historical documents, we used as the experimental corpus the digitized text of the 'Little' Altan Tobchi, a chronicle of the ancient Mongolian kings and the Mongol Empire, which was made using the bamboo pen xylograph technique. The 'Little' Altan Tobchi consists of 164 pages containing approximately 16,200 words. A page holds about 100 words on average, with 115 words on the longest page and 75 on the shortest. Precision, recall and F-measure for extracting personal names were calculated by fivefold cross-validation.
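The evaluation procedure can be sketched as follows. The source does not say how the corpus was divided into folds, so splitting by page in a round-robin fashion is an assumption made only for illustration, as are the function names; the counts passed to `precision_recall_f` are toy values, not experimental results.

```python
from typing import List, Tuple


def k_fold_splits(n_items: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index splits; item i goes to fold i mod k."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = sorted(
            idx
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for idx in fold
        )
        splits.append((train, test))
    return splits


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-measure from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# 164 pages, as in the 'Little' Altan Tobchi; the fold assignment is hypothetical.
splits = k_fold_splits(164, k=5)
print([len(test) for _, test in splits])

# Toy counts for one fold: 3 correct names, 1 spurious, 2 missed.
print(precision_recall_f(tp=3, fp=1, fn=2))
```

In each round the model is trained on four folds and scored on the held-out fold, and the five scores are then averaged.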

Manually annotated named entities, extracted named entities [14], manually compiled scholars' commentaries and interpretations [9], as well as digital texts of ancient Mongolian manuscripts [7], will be used for building a digital edition of ancient Mongolian manuscripts. The next sections discuss how to develop a digital edition of Mongolian historical documents by describing some features and requirements of Mongolian manuscripts.



#### **3.2. Making a web-based system by utilizing research outcomes**

The past achievements in developing the TMSDL and the research outcomes of extracting named entities from Mongolian historical text allow us to create a digital representation that reflects ancient Mongolian historical manuscripts. This section describes the web-based prototype system that we developed for browsing ancient Mongolian historical manuscripts.

#### *3.2.1. A digital edition of Mongolian manuscripts*

We utilized Edition Visualization Technology (EVT) for creating and browsing a digital edition of Mongolian manuscripts, which is encoded according to the Text Encoding Initiative (TEI) XML schemas and guidelines [16]. The named entities, including historical figures and place names, are explicitly encoded using the TEI guidelines, along with additional data such as editorial markup, various commentaries, transcriptions and interpretations that have been suggested by researchers [9], etc. [17]. For well-known historical figures, generational or dynastic information, an inherited or lifetime title of nobility, or a traditional descriptive phrase or nickname is also marked. In the proposed digital edition, Unicode is chosen at the character level, and TEI P5 is applied at higher levels. As shown in **Figures 3** and **4**, all the personal names and place names in the 'Little' Altan Tobchi are visualized and highlighted in both the transliteration and the traditional Mongolian text. The image-to-text feature can link a column in a manuscript folio image to the corresponding text and highlight them in all edition levels. As shown in **Figure 5**, all the named entities are listed in full, with hyperlinks to the folios in which each named entity appears.

**Figure 3.** A digital edition with image-to-text link and personal names' highlights.

**Figure 4.** A digital edition with image-to-text link, a virtual keyboard and personal names' highlights in transliteration.
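To illustrate how extracted names might be carried into TEI markup, the sketch below wraps IOB2-tagged tokens in the TEI `persName` element using Python's standard `xml.etree.ElementTree` module. The choice of `p` as the container, the omission of the TEI namespace and of any attributes, and the function name are simplifying assumptions; the actual encoding of the edition is not reproduced here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def encode_names(tagged: List[Tuple[str, str]]) -> ET.Element:
    """Wrap each B/I span in <persName>; other tokens stay as plain text."""
    # A complete TEI document would also declare the TEI namespace.
    block = ET.Element("p")
    block.text = ""
    last_child = None  # most recently added <persName>
    open_name = None   # <persName> that an 'I' tag may extend
    for token, tag in tagged:
        if tag == "B":
            open_name = ET.SubElement(block, "persName")
            open_name.text = token
            open_name.tail = " "
            last_child = open_name
        elif tag == "I" and open_name is not None:
            open_name.text += " " + token
        else:
            open_name = None
            if last_child is None:
                block.text += token + " "
            else:
                last_child.tail += token + " "
    # Drop the trailing separator from the last text node.
    if last_child is None:
        block.text = block.text.rstrip()
    else:
        last_child.tail = last_child.tail.rstrip()
    return block


tagged = [("törügsen", "O"), ("temüjin", "B"), ("činggis", "B"),
          ("qaɣan", "I"), ("buyu", "O")]
xml_text = ET.tostring(encode_names(tagged), encoding="unicode")
print(xml_text)
```

Place names could be handled in the same way with the TEI `placeName` element, given a tag set that distinguishes them from personal names.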

In addition, we made the following customizations in EVT to make it suitable for Mongolian manuscripts in traditional Mongolian script.

#### *3.2.1.1. Parallel-text editions with transliteration*


The proposed prototype can present scanned image-based editions with two edition levels: (1) diplomatic interpretative and (2) transliteration. Transliteration is helpful for those who understand a language but are not familiar with its script. Transliteration of Mongolian historical documents into Latin letters is popular among scholars.


**Figure 5.** A list of named entities with hyperlinks to the folios of a Mongolian manuscript.

TEI offers only limited recommendations for encoding transliterations. Soualah and Hassoun [18] proposed implementing transliteration with a specific model, which uses an element with the *@xml:lang*, *@target* and *@type* attributes. However, we treat transliteration as a separate edition and present it in parallel-text editions, as shown in **Figure 6**.


**Figure 6.** Parallel-text editions with personal names' highlights and virtual keyboards.


#### *3.2.1.2. Supporting the traditional Mongolian script*

A unique feature of traditional Mongolian script is that it is written vertically, from top to bottom, in columns advancing from left to right. Because EVT offers poor support for traditional Mongolian script, we customized it to display the scanned images at the top and the corresponding text in traditional Mongolian script below, running from top to bottom and from left to right. We also display the text in traditional Mongolian script on the left and the corresponding transliteration in Latin letters on the right, so that the two can be compared. Additionally, as shown in **Figures 4** and **6**, we added a simple virtual keyboard composed of 22 traditional Mongolian letters and their corresponding Latin letters to help users input a Mongolian keyword for free-text search and keyword highlighting.
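The idea behind such a virtual keyboard can be sketched as a lookup from Latin keys to Mongolian letters. The key assignment below is a hypothetical subset chosen for illustration; it is not the set of 22 letters used in the prototype. The letters are looked up by their Unicode character names, so no code points are hard-coded.

```python
import unicodedata

# Hypothetical key map: Latin key -> Unicode name of a Mongolian letter.
KEY_NAMES = {
    "a": "MONGOLIAN LETTER A",
    "e": "MONGOLIAN LETTER E",
    "i": "MONGOLIAN LETTER I",
    "o": "MONGOLIAN LETTER O",
    "u": "MONGOLIAN LETTER U",
    "n": "MONGOLIAN LETTER NA",
    "b": "MONGOLIAN LETTER BA",
    "q": "MONGOLIAN LETTER QA",
    "m": "MONGOLIAN LETTER MA",
    "l": "MONGOLIAN LETTER LA",
    "s": "MONGOLIAN LETTER SA",
    "t": "MONGOLIAN LETTER TA",
    "d": "MONGOLIAN LETTER DA",
    "j": "MONGOLIAN LETTER JA",
    "y": "MONGOLIAN LETTER YA",
    "r": "MONGOLIAN LETTER RA",
}
KEYBOARD = {key: unicodedata.lookup(name) for key, name in KEY_NAMES.items()}


def type_keyword(latin_keys: str) -> str:
    """Translate a sequence of key presses into Mongolian letters."""
    # Raises KeyError for a key that is not on the keyboard.
    return "".join(KEYBOARD[key] for key in latin_keys)


keyword = type_keyword("qan")
print([unicodedata.name(ch) for ch in keyword])
```

A one-to-one key map of this kind only covers input; choosing the correct positional letter forms is left to the font and the rendering engine.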


#### **3.3. Applying and extending the proposed method across languages**

This section discusses (1) how the existing cross-language information retrieval techniques can be utilized in the proposed prototype system and (2) how the proposed approach can be applied to other languages in order to provide cross-lingual and cross-chronological information access to multilingual historical documents.

#### *3.3.1. Adopting cross-language and cross-chronological information retrieval techniques in historical documents*

There has been little research on information retrieval techniques for historical documents, and almost none of the advances in information retrieval and information access have aimed at retrieving information in the native language from ancient, cross-chronological and/or cross-script foreign-language documents. Only a few approaches that could be considered cross-chronological information retrieval have been proposed. Ernst-Gerlach and Fuhr focused on modern and archaic German and developed a retrieval method that considers spelling differences and variations over time [19]. Koolen et al. considered the spelling and pronunciation differences between ancient and modern Dutch [20], while Gotscharek et al. [21] and Hauser et al. [22] considered the spelling differences and variations between modern and archaic German. Pilz et al. considered spelling variations in English and German historical texts [23]. In general, the main challenge for historical European languages such as Dutch, English and German is spelling variation.

Japan; and 'Sumo', Japanese traditional wrestling. For instance, if the search query submitted by the user is a name of the Ukiyo-e artist, i.e., 'Utagawa Hiroshige', then the query 'Utagawa

Cross-Lingual and Cross-Chronological Information Access to Multilingual Historical Documents

http://dx.doi.org/10.5772/intechopen.72421

157

We are conducting further research to generalize the proposed method to other historical documents in various languages. We also believe that the proposed prototype could be applied to other historical documents in Todo, Manchu and Sibe, which are the derivative scripts of

In this chapter, we have described our research to achieve cross-lingual and cross-chronological information access to ancient Mongolian historical materials. More specifically, we have introduced methods for providing information access that cuts across different historical

**4. Summary and future directions**

In Section 3, we introduced an information extraction method for digitized ancient Mongolian historical manuscripts of the 13th–16th centuries. The method enables large-scale computational analysis of Mongolian historical documents and can significantly reduce the traditional, labour-intensive manual analysis of Mongolian historical text. Named entities, such as historical figures and places of ancient Mongolia, which are difficult to identify by manual examination, are recognized from historical manuscripts.

The extracted results are utilized for building a digital edition of an ancient Mongolian historical document, which is made available through a web-based system.¹

We also believe the TEI-encoded digital edition that reflects the ancient Mongolian manuscripts would help scholars of ancient history uncover knowledge about the Middle Ages of Mongolia that is preserved in ancient Mongolian historical documents but is not available in modern-language documents. Furthermore, explicitly encoded digital text enables users to search and browse ancient Mongolian manuscripts through visualization of the named entities; that is, it allows not only retrieving information but also analysing and visualizing its contents. We also hope that digital editions, together with the scanned images, would recreate the experience of encountering the original manuscripts. The information visualization feature for ancient Mongolian texts, together with the TMSDL feature that retrieves ancient manuscripts written in traditional Mongolian script using a query in modern Mongolian (Cyrillic), would help researchers who want to use digital representations of ancient historical manuscripts as scholarly tools while working in a modern language. Such a feature is very useful, since the needs of humanities researchers are diverse and may require access to information in ancient languages rather than searching and browsing limited collections in modern languages. Indeed, Mongolian ancient documents are mostly available in ancient scripts and dialects, so users who do not understand ancient Mongolian may not find the desired information.

¹ http://www.dl.is.ritsumei.ac.jp/AltanTovch/

Furthermore, Kimura and Maeda proposed a retrieval method that considers not only language differences over time but also cultural and temporal differences between modern and archaic Japanese [24]. Tripathi developed a retrieval system that accounts for the differences among the scripts and writing systems of the Brahmic (Indic) family and proposed a method to retrieve Sanskrit documents written in Sanskrit script or in scripts of the Brahmic family, using scripts such as Devanagari, Kannada, Telugu and Bengali [25]. To cope with cross-chronological and cross-script Mongolian documents, Khaltarkhuu and Maeda proposed a retrieval technique capable of searching traditional Mongolian script documents using a modern Mongolian query [26–28].

We improved Khaltarkhuu and Maeda's grammatical-rule-based approach [26–28] and proposed an 'ancient-to-modern information retrieval' method [7, 29] by adding a dictionary-based query translation technique. The technique accounts for cross-chronological differences between the writing systems of ancient and modern Mongolian, so that cross-chronological and cross-script ancient Mongolian documents can be accessed with a query in modern Mongolian written in Cyrillic. To improve translation quality, the 'ancient-to-modern information retrieval' approach [7, 29] first matches query terms against words in a dictionary. If no exact match is found, the grammatical-rule-based approach [26–28] is used. In other words, grammatical-rule-based query translation handles inflected words, words with ancient spellings or grammar, and words missing from the dictionary. For word sense disambiguation, when a word has multiple candidates, we choose the most frequent one. In our approach, we also merge spelling variants of ancient Mongolian words.
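The lookup order described in this paragraph (dictionary first, grammatical rules as a fallback, most frequent candidate on ambiguity, spelling variants merged) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the dictionary entries, the frequencies, the variant table, the romanized target forms and the function names are hypothetical, and `rule_based_translate` is only a placeholder for the grammatical-rule-based conversion of [26–28], whose actual rules are not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy dictionary: modern Mongolian (Cyrillic) word -> candidate
# traditional-script forms (romanized here) with made-up frequencies.
DICTIONARY: Dict[str, List[Tuple[str, int]]] = {
    "хаан": [("qaγan", 120), ("qan", 45)],  # invented ambiguity, for illustration
    "улс": [("ulus", 80)],
}

# Hypothetical table that maps spelling variants onto one canonical form.
VARIANTS: Dict[str, str] = {"qagan": "qaγan", "qahan": "qaγan"}


def rule_based_translate(word: str) -> str:
    """Placeholder for the grammatical-rule-based conversion [26-28].

    It only tags the word so that the fallback path is visible in the output.
    """
    return f"<rule:{word}>"


def merge_variant(form: str) -> str:
    """Collapse a known spelling variant onto its canonical form."""
    return VARIANTS.get(form, form)


def translate_term(
    word: str,
    dictionary: Dict[str, List[Tuple[str, int]]] = DICTIONARY,
    fallback: Callable[[str], str] = rule_based_translate,
) -> str:
    """Translate one query term: dictionary first, rules as a fallback."""
    candidates = dictionary.get(word)
    if candidates:
        # Word sense disambiguation: keep the most frequent candidate.
        best, _freq = max(candidates, key=lambda c: c[1])
        return merge_variant(best)
    # No exact dictionary match: fall back to the rule-based conversion.
    return merge_variant(fallback(word))


def translate_query(query: str) -> List[str]:
    """Translate a whitespace-separated keyword query term by term."""
    return [translate_term(w) for w in query.split()]


if __name__ == "__main__":
    print(translate_query("хаан улс морь"))
```

In this sketch, the first two terms are resolved from the dictionary, while the third, which is absent from the toy dictionary, goes through the placeholder fallback. Where exactly variant merging happens in the actual system is not stated in the text; applying it to the translated form is a choice made here for simplicity.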

We have already integrated the 'ancient-to-modern information retrieval' method in the TMSDL, and it can be easily applied to our digital edition for accessing ancient Mongolian historical collections written in traditional Mongolian script.

#### *3.3.2. Applying the proposed approach to other languages*

We have demonstrated a facility for cross-language searching between English and Japanese that enables English-speaking users to search Ukiyo-e databases available in Japanese by using English queries [30–32]. Such a feature is very useful because the Ukiyo-e databases in Japanese institutions are mostly available in Japanese, so users who do not understand Japanese may not find the desired information. Ukiyo-e, a genre of traditional Japanese woodblock prints, is known worldwide as one of the fine arts of the Edo period (1603–1868). The texts of Ukiyo-e databases contain archaic Japanese words that reflect the Japanese language of the Edo period.

As in the 'ancient-to-modern information retrieval' method, a dictionary-based query translation approach is adopted, using a domain-specific dictionary that contains terms related to Japanese arts and culture. The feature works well with keyword queries (i.e., not full sentences), which may include personal names and specific terms such as 'Geisha' (traditional Japanese female entertainers), 'Fuji' (Mount Fuji, the highest mountain in Japan) and 'Sumo' (traditional Japanese wrestling). For instance, if the user submits the name of a Ukiyo-e artist, e.g., 'Utagawa Hiroshige', the query is translated into Japanese as '歌川広重' and sent to the Japanese databases.
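The keyword translation step for Ukiyo-e queries can be sketched in the same way. The following Python fragment is an illustrative assumption, not the implementation used in [30–32]: the dictionary contents (apart from the 'Utagawa Hiroshige' example given above), the function name, the whole-query-first lookup and the pass-through of unknown keywords are all hypothetical choices.

```python
from typing import Dict, List

# Hypothetical domain-specific dictionary (lower-cased English -> Japanese).
DOMAIN_DICT: Dict[str, str] = {
    "utagawa hiroshige": "歌川広重",
    "geisha": "芸者",
    "fuji": "富士",
    "sumo": "相撲",
}


def translate_keywords(query: str) -> List[str]:
    """Translate an English keyword query into Japanese search terms.

    The whole query is tried first, so that multi-word names are matched
    as a unit; otherwise each keyword is looked up on its own. Keywords
    missing from the dictionary are passed through unchanged (an
    assumption of this sketch).
    """
    normalized = " ".join(query.lower().split())
    if normalized in DOMAIN_DICT:
        return [DOMAIN_DICT[normalized]]
    return [DOMAIN_DICT.get(tok.lower(), tok) for tok in query.split()]


if __name__ == "__main__":
    print(translate_keywords("Utagawa Hiroshige"))
    print(translate_keywords("Fuji Sumo"))
```

Trying the full query before the individual keywords keeps an artist's name from being split into two unrelated lookups.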

We are conducting further research to generalize the proposed method to historical documents in other languages. We also believe that the proposed prototype could be applied to historical documents in Todo, Manchu and Sibe, which are scripts derived from traditional Mongolian.
