**2. Ancient Mongolian manuscripts**

This section briefly explains certain characteristics of Mongolian manuscripts, the current situation of digitized historical materials written in ancient Mongolian, and the challenges they present in the digital era.

### **2.1. A brief introduction of Mongolian manuscripts**

Mongolian historical documents have been written in several scripts, namely the traditional Mongolian script, the Square or Phags-pa script, the Soyombo script and the Horizontal square script [1]. Among them, the traditional Mongolian script is the most popular and the longest-surviving, having been in use for over 800 years, and it has been better supported by computer systems since its integration into the Unicode Standard [2] in September 1999. On 20 June 2017, the Soyombo and Horizontal square scripts (a.k.a. Zanabazar scripts) were standardized in what was then the most recent version of the Unicode Standard [3]. Nevertheless, this research focuses on the traditional Mongolian script because of its popularity, the availability of digital texts and its better support on computers.
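As a rough illustration of what Unicode support makes possible, the following sketch classifies characters by script using code point ranges. The ranges are those of the corresponding Unicode blocks; the names `SCRIPT_BLOCKS`, `script_of` and `count_scripts` are hypothetical helpers introduced here for illustration and are not part of the tools described in this chapter. Note that the Unicode Standard encodes the Horizontal square script under the name Zanabazar Square.

```python
# Code point ranges of the relevant Unicode blocks (inclusive).
# The "Mongolian" block is shared with the Todo, Sibe and Manchu letters,
# so a range check alone cannot tell those derivative scripts apart.
SCRIPT_BLOCKS = {
    "Mongolian": [(0x1800, 0x18AF), (0x11660, 0x1167F)],  # block + supplement
    "Phags-pa": [(0xA840, 0xA87F)],
    "Zanabazar Square": [(0x11A00, 0x11A4F)],
    "Soyombo": [(0x11A50, 0x11AAF)],
    "Cyrillic": [(0x0400, 0x04FF)],
}


def script_of(ch):
    """Return the script name for a single character, or None if unknown."""
    cp = ord(ch)
    for script, ranges in SCRIPT_BLOCKS.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def count_scripts(text):
    """Count characters per script, ignoring characters outside the table."""
    counts = {}
    for ch in text:
        script = script_of(ch)
        if script is not None:
            counts[script] = counts.get(script, 0) + 1
    return counts


if __name__ == "__main__":
    # Two traditional Mongolian letters followed by two Cyrillic letters.
    sample = "\u182E\u1823 \u043C\u043E"
    print(count_scripts(sample))
```

A check of this kind could be used, for example, to separate traditional Mongolian passages from Cyrillic ones in a mixed digital text before further processing.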

In 1946, Mongolia carried out a language reform to eliminate the difference between written and spoken Mongolian, and the Cyrillic script was adapted to Mongolian. The spelling of modern Mongolian in the Cyrillic alphabet was based on pronunciation in the dialect of the Khalkha, the largest Mongol ethnic group [4, 5]. Such a radical change separated the Mongolian people from their historical archives written in traditional Mongolian script. Manuscripts in traditional Mongolian script preserve the ancient writing, while modern Mongolian reflects the pronunciations of modern dialects. Understanding historical documents in traditional Mongolian script is becoming as important for Mongolians as understanding modern Mongolian in Cyrillic script. However, reading traditional Mongolian documents with literacy in modern Mongolian alone is not a simple task. Traditional Mongolian is a distinct dialect with a grammar that differs from that of modern Mongolian.

The traditional Mongolian script is written vertically, from top to bottom, in columns advancing from left to right. This script has four derivative scripts: the Todo or Clear, Manchu, Vaghintara and Sibe (Xibe) scripts. The Todo script was used by the Oirats and Kalmyks, and the Manchu script was a writing system of the Qing dynasty. The Sibe script is used in Xinjiang, in the northwest of China. The Vaghintara script was used by the Buryats.

Moreover, the fact that manuscripts have passed through a process of copying or reprinting, with possible alterations, corrections and unintended errors, makes researchers wonder which ancient spelling is correct or what an ancient word originally meant. Scholars have pointed out from time to time that such copies cannot meet the requirements of those who want to study them as source material [6]. In addition, various commentaries, transcriptions, annotations and interpretations have been suggested by humanities researchers. Manuscripts are also vulnerable to degradation and may have lacunae, physical damage or missing parts, which require costly reconstruction of the original text.

In general, users and researchers have two main demands for making ancient Mongolian manuscripts usable in the digital era. First, a digital representation that explains a given manuscript in a modern language is helpful for users who want to read, search and browse ancient Mongolian manuscripts. Second, in the humanities, gaining knowledge by analysing various historical documents is an important task. There is increasing demand from Mongolian humanities researchers to perform text analysis at massive scale with prompt and accurate results. A digital representation that fully reflects a given manuscript is eagerly awaited by researchers who want to study it as a scholarly source using a computer.

Nevertheless, computerized text analysis of Mongolian historical documents has not yet been carried out, owing to the lack of natural language processing (NLP) tools that can handle ancient Mongolian. These demands have encouraged us to introduce our approaches to providing universal information access to ancient Mongolian historical documents.

### **2.2. Ancient Mongolian manuscripts in the digital age**

on the Internet. Recently, a number of large-scale digital library projects have been launched, e.g., Europeana, World Digital Library, HathiTrust and Google Book Search. These websites make multilingual materials covering various languages and historical periods available to the public.

There are various technical challenges, however, in implementing universal integrated access to these digital collections due to this great diversity, and difficulties occur in accessing these information sources, mainly due to the diversity of languages. Even within the same language, considerable differences exist in grammar, vocabulary and script depending on the historical period, and this is the primary cause of the difficulties in implementing universal information access. Thus, this chapter presents our approach to providing cross-lingual and cross-chronological access to historical documents that account for evolution of languages over periods ranging from ancient to modern. Particularly, in this chapter, we introduce our approach in providing cross-lingual and cross-chronological information access to historical materials in a less-researched language such as ancient Mongolian.

In Section 2, we discuss the current situation of digitized ancient historical materials written in ancient Mongolian and the challenges in providing universal information access to them in the digital era. Then, our proposed method for cross-lingual and cross-chronological information access to ancient Mongolian historical materials is discussed in Section 3. Finally, in Section 4, we discuss the future prospects of this research.

144 Multilingualism and Bilingualism

To the best of our knowledge, only a small number of digital texts of ancient Mongolian manuscripts exist. A few ancient Mongolian historical manuscripts, including (1) 'Qad-un ündüsünü quriyangγui altan tobči neretü sudur' (the Altan Tobchi or the Golden Summary: Short history of the Origins of the Khans) (written in 1604), a.k.a. the 'Little' Altan Tobchi, and (2) the 'Asaraγči neretü-yin teüke' or 'Asragch nėrtĭĭn tu̇u̇kh' (the Story of Asragch) (written in 1677), both written in traditional Mongolian script, have been converted to digital texts and made publicly available through the Traditional Mongolian Script Digital Library (TMSDL) [7]. **Figure 1** shows a folio of the 'Little' Altan Tobchi in the TMSDL with highlighted keywords.

**Figure 1.** A folio of the 'Little' Altan Tobchi in the TMSDL with highlighted keywords.

TMSDL can be used to access and retrieve historical manuscripts written in traditional Mongolian script using a query in modern Mongolian (Cyrillic). The research achievements, as well as the experience gained from the development of the TMSDL, have motivated us to share further research results in developing methods for providing cross-lingual and cross-chronological information access to ancient Mongolian historical documents.

Certainly, there has been little research on text mining for the Mongolian language, and none of it has considered text mining on ancient Mongolian historical documents. Because of the notable difference between mediaeval Mongolian and modern Mongolian, the existing NLP tools, which were designed for modern Mongolian, do not perform well on ancient Mongolian texts. Therefore, further computerized analyses of ancient Mongolian historical documents are necessary.

To solve the challenges facing Mongolian humanities scholarship, as well as to benefit from recent achievements in the digital humanities worldwide, it is necessary to analyse the requirements of Mongolian historical documents for digital tools.

Cross-Lingual and Cross-Chronological Information Access to Multilingual Historical Documents

http://dx.doi.org/10.5772/intechopen.72421

In this section, we describe our methods for implementing integrated access to historical documents that are capable of coping with linguistic transformations from ancient times to the present. First, we propose an information extraction method for digitized ancient Mongolian historical documents. The proposed method extracts named entities from historical manuscripts by utilizing machine learning techniques. The results will be used to build digital text representations that encode named entities and the possible alterations, corrections, errors and interpretations of ancient Mongolian words in a modern language. In the later sections, we discuss how to develop a digital edition of Mongolian historical documents by considering various features and requirements of Mongolian manuscripts.

### **3.1. Information extraction from ancient Mongolian documents**

This section discusses an information extraction method for digitized ancient Mongolian documents that uses the features of traditional Mongolian script. Named entities such as personal names and place names are extracted automatically from the digitized text of ancient Mongolian documents by employing a support vector machine (SVM), with the aim of reducing the labour-intensive analysis of historical text. Information extraction, named entity extraction (NEE) and tagging or annotation can turn plain text into structured data for analysis or effective use via NLP applications and analytical methods. State-of-the-art NEE systems for English achieve near-human performance in extracting named entities [8]. However, there has been little research on text mining or NEE for the Mongolian language, and none of it has considered text mining on ancient Mongolian historical documents. Therefore, proposing an information extraction method for ancient historical documents in traditional Mongolian script is crucial.

*3.1.1. The proposed approach*

The flowchart in **Figure 2** shows an overview of the main steps and components of the proposed approach. The approach starts with preprocessing tasks in which an ancient Mongolian corpus is tokenized, each token is annotated, and gold-standard annotations are prepared as input to the SVM for learning. The proposed method learns the extraction rules for personal names from annotated training corpora and then extracts personal names from ancient Mongolian texts by using the SVM. The following sections explain the three main components: (1) pre-processing, (2) annotating and (3) named entity extraction.

*3.1.1.1. Preprocessing step*

The first step is to divide the digitized ancient Mongolian plain text into tokens. This is necessary because we want to mark up each token in the subsequent tasks. A token is quite often a word delimited by spaces, but traditional Mongolian script has some unique features. For
