238 Emerging Informatics – Innovative Concepts and Applications

and characters (Yang & Kot, 2004). The main drawback of this scheme is its high computational complexity and vulnerability to noise.

Huang et al. proposed an authentication method for binary images, including text documents (Huang et al., 2004), in which the binary image is first segmented into blocks and then some pixels in each block are rearranged to enforce a given relationship between the total numbers of black and white pixels in it. During authentication, this relationship is verified for each block: if it is satisfied, the block is considered authentic; otherwise, the block is considered tampered. The principal disadvantage of this method is that the degradation introduced in the encoded binary image is noticeable.

Wu and Liu proposed a block-wise authentication scheme for binary images, in which flippable pixels in each block are manipulated to embed a watermark bit in the block (Wu & Liu, 2004). The embedded watermark is imperceptible, because flipping flippable pixels does not cause visible distortion of the binary image. However, in general, the watermark embedding payload is very low compared with the number of flippable pixels in the image.

To improve the embedding payload, Gou and Wu introduced the concept of "super-pixels" and wet paper coding into Wu and Liu's scheme (Gou & Wu, 2007). Super-pixels form a set of individually non-flippable pixels that can be removed or added together without causing visual distortion. Wu and Liu also reported that their authentication scheme is robust to printing and scanning operations. However, during the scanning process, a rotation, even by an angle smaller than one degree, may result in the loss of the embedded watermark signal.

Document authentication schemes for formats such as Portable Document Format (PDF) or PostScript have received little attention among researchers, although many official documents are stored in this type of format. In (Zhu et al., 2007), a document authentication method using render sequence encoding is proposed, in which the encoding process is based on modulating the display sequences of a Document Description Language (DDL), such as PostScript, PDF, or Printer Control Language. In the render sequence, predefined characters are permuted according to a user's secret key; during authentication, the document is considered authentic if the permutation corresponds to the secret key used in the embedding stage. This scheme correctly determines whether a document is authentic; however, two drawbacks may limit its practical use. First, the encoded document file is considerably larger than the original file. Second, the structure of the encoded render sequence is unnatural and can therefore be easily detected by an unauthorized person, which makes it possible to use reverse engineering to tamper with the document.

To solve these problems, Gonzalez-Lee et al. proposed a watermarking-based document authentication scheme in which character metrics are used to embed a watermark sequence (Gonzalez-Lee et al., 2009). The advantages of the proposed scheme are that the watermarked file has the same size as the original file and that it preserves its original appearance, which enhances its security because the presence of the watermark is not evident.

**2. Document description languages**

Computer languages such as C are general purpose: they can be used to develop a broad spectrum of applications. Others, like Fortran and Matlab, are designed for numerical calculations, so their instruction sets greatly facilitate calculations in engineering. One can easily think of many useful instructions or functions that facilitate the coding of complex programs; for example, the function sin(x) is very useful in engineering programs but is of little use in describing an electronic document.

To describe efficiently the basic elements from which a practical document is built, we need a computer language designed for that purpose. Such a language is called a Document Description Language (DDL): a computer language whose instruction set contains commands for the common tasks needed to draw a document.

A DDL is designed to facilitate the description of a document; in other words, its instruction set is very handy for common tasks such as indicating where to draw a given set of characters (e.g., a row or a paragraph), which font size to use, and other properties of the desired document layout. It is hard to imagine describing a web page using the C or Matlab instruction set, so the scope and purpose of DDLs is evident.
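To make these layout tasks concrete, the following minimal Python sketch generates a one-page PostScript program that draws the rows of a paragraph at a chosen position and font size. The function names (`ps_escape`, `paragraph_to_postscript`) and the default values for position, font, and line spacing are assumptions made for illustration; only the PostScript operators `findfont`, `scalefont`, `setfont`, `moveto`, `show`, and `showpage` come from the language itself.

```python
def ps_escape(text: str) -> str:
    """Escape the characters that are special inside a PostScript string literal."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def paragraph_to_postscript(lines, x=100, y=700, font="Helvetica", size=12, leading=14):
    """Return a one-page PostScript program that draws `lines` from top to bottom.

    x, y    : position of the first row, in points (illustrative defaults)
    font    : name of the font to select
    size    : font size in points
    leading : vertical distance between consecutive rows, in points
    """
    out = ["%!PS", f"/{font} findfont {size} scalefont setfont"]
    for i, line in enumerate(lines):
        # Each row is placed explicitly: move the current point, then draw the string.
        out.append(f"{x} {y - i * leading} moveto")
        out.append(f"({ps_escape(line)}) show")
    out.append("showpage")
    return "\n".join(out)


program = paragraph_to_postscript(["First row of the paragraph", "Second row (indented)"])
print(program)
```

Note that the generator itself is ordinary general-purpose code; the generated text is the document description, in which nearly every instruction concerns position, font, or text.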

There are many practical DDL implementations. For example, Web pages can be described with the HyperText Markup Language (HTML), and for electronic documents we can choose among PostScript, Portable Document Format (PDF), and OpenDocument Format (ODF), the latter used by the OpenOffice.org and LibreOffice projects.
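As a rough illustration of how differently these languages express the same content, the sketch below emits one short text as an HTML fragment, a PostScript fragment, and a PDF content-stream fragment. The helper names (`escape_paren_string`, `describe_text`) are hypothetical, and the PDF font resource name `/F1` is an assumption: in a real PDF file it would have to be defined in the page's resource dictionary.

```python
import html


def escape_paren_string(text: str) -> str:
    """Escape backslashes and parentheses, as PostScript and PDF string literals require."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def describe_text(text: str, x: int = 100, y: int = 50) -> dict:
    """Return the same text described in three DDLs (fragments only, not complete files)."""
    s = escape_paren_string(text)
    return {
        # HTML carries no explicit position: layout is left to the browser.
        "html": f"<p>{html.escape(text)}</p>",
        # PostScript: set the current point, then draw the string.
        "postscript": f"{x} {y} moveto\n({s}) show",
        # PDF content stream: a text object (BT ... ET) with font, position, and string.
        # /F1 is an assumed font resource name.
        "pdf": f"BT\n/F1 12 Tf\n{x} {y} Td\n({s}) Tj\nET",
    }


frags = describe_text("A & B (demo)")
for name, fragment in frags.items():
    print(f"--- {name} ---")
    print(fragment)
```

Even in this small case the escaping rules and the positioning model differ between the markup-based and the operator-based languages, which is one reason a technique designed for one DDL may not carry over directly to another.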

As discussed above, there are many DDLs, and most of them differ radically from one another, which hinders the development of a universal approach that works for every DDL. In most cases a given watermarking approach can be adapted to several DDLs, but in other cases we must design a completely different paradigm.

Authentication of Script Format Documents Using Watermarking Techniques 241

Finally, we wish to point out that a DDL is like any other computer language: it provides an instruction set, but those instructions must be properly structured. This subject is discussed in the next section.

(a)

    <office:body>
    <office:text>
    <text:p text:style-name="Standard">This is a text document showing a DDL with a xml approach </text:p>
    </office:text>
    </office:body>
    </office:document-content>

(b)

    100 50 moveto
    (this is a text document showing a DDL with a PostScript approach)
    show

(c)

    100 50 Td
    (This is a text document showing a DDL with a PDF approach) Tj

Fig. 1. Example of a DDL; one can notice how a language is used to describe the structure of an electronic document. The same text was written with a) ODF; b) the PostScript language; and c) PDF.

considered the basis of PDF; if you understand PostScript, it will in fact be easier to understand the PDF internals, whereas it will be more difficult to proceed the other way.

A typical approach is depicted in Fig. 2. In this figure we can see that the most important parts of the script file are the header and the body. The former is called Encapsulated PostScript or EPS; it contains information about the version of the standard used in the document, together with other useful data such as the number of pages and the bounding box. The latter, the body, contains the whole contents of the document organized in pages (each one can be recognized easily by the special command
