244 Emerging Informatics – Innovative Concepts and Applications

moveto and then the text "C Language History" is the content of the row. The vector that follows contains the metrics for each character in the row. Generally, the characters do not fill the page width, so a small constant is added to each metric in order to fit the page width, that is to say, to left- and right-justify the text. Next, the command xshow indicates that this row must be drawn with the given metrics; however, nothing is actually drawn until a showpage command is encountered.

50 742 moveto (C Language History)
[ 8.100947 3.930948 7.540798 5.871108 6.430798 6.430798 6.430798 5.871108 6.430798 5.871108 3.930948 8.650798 4.210798 5.320798 4.210798 6.430798 4.761107 6.430798 ] xshow

Fig. 5. Example of an actual row definition.

As depicted in Fig. 5, the character metrics are a rich source of data that can be modified either to hide information, implementing a steganographic system, or to embed digital watermarks. A natural question is whether such modifications could have side effects such as visual distortion. However, each unit of the metrics is 1/72 inch, that is to say, a metric of 1.0 equals 1/72 inch, so the changes are mostly imperceptible. More about DDS languages can be read in (Adobe, 1999), (Adobe, 2006) and (Reid, 1990).

In the next section, we discuss a watermarking system that uses character metrics to embed digital watermarks.

**4. Document watermarking approach**

Watermarking schemes for authentication differ from schemes for copyright enforcement. In the latter, the integrity of the watermark is crucial: no matter what attack is carried out on the protected material, the watermark should still be detected, damaged perhaps, yet detectable. In authentication applications, the watermark should instead be fragile: any modification of the protected media should damage the watermark so seriously that the system is unable to detect it. These kinds of applications are intended to prevent fraud or moral damages.

#### **4.1 Attack scenario to watermark**

As stated in the previous section, in watermarking for authentication applications a natural attack scenario is as follows: an attacker tries to modify a protected digital material in order to change its meaning. An example is an electronic document whose message is modified in order to commit fraud. Such an attack is feasible due to the existence of free tools such as PDFedit (Hocko, 2009).

In order to carry out a successful attack, the attacker must achieve the following goals:

- Change the meaning of the original message in the protected document so that it matches some desired meaning, usually malicious, in such a way that the modification cannot be noticed.

- Preserve as much of the watermark as possible, so that an automatic verification system is still able to detect it and thus validates the document as legitimate.

From this situation, the need for a document authentication system based on fragile watermarking is evident: even if the modification of the document is small, the watermark shall no longer be detectable.

#### **4.2 Watermarking using character metrics**

Section 3.1 described the character metrics; in this section we discuss a model for watermarking that uses them. The model is depicted in Fig. 6. In this model, some editing software takes the raw text and builds a well-formed DDS from the input data; the editing software uses the instructions in a DDL database so that the resulting DDS follows the file standard. Then the watermarking algorithm embeds a watermark, generated using some secret key, in the resulting script. The final product is a watermarked DDS.

Fig. 6. Watermarking model for electronic documents in a DDS approach.

Many software packages are capable of producing high-quality documents. We assume that such software is provided by a third party, yet the resulting documents follow some standard. The watermarking system therefore has to be designed to interpret the input DDS in order to process it under this assumption.

Next, we introduce a watermarking scheme that relies on the modification of character metrics for watermark embedding. A question may arise regarding the distortion caused by modifying the metrics; here we must consider that one unit of the metrics equals 1/72 inch, so small modifications should be negligible.
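To give a sense of scale, the unit can be converted to millimetres. The 0.1-unit change used below is an arbitrary illustrative value, not a figure taken from the experiments:

$$1\ \text{unit} = \frac{1}{72}\ \text{in} = \frac{25.4}{72}\ \text{mm} \approx 0.353\ \text{mm}, \qquad 0.1\ \text{unit} \approx 0.035\ \text{mm}$$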

The watermark $W = \{w_i\},\ i = 1, 2, \ldots, N$, is a binary (-1 or 1) pseudo-random sequence with zero mean and variance 1. Without loss of generality, we assume that we are dealing with horizontal documents; the extension to vertical and diagonal documents is easily carried out.
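As an illustration, a key-seeded sequence with these properties could be generated as in the following Python sketch. The function name `generate_watermark`, the use of Python's `random.Random` as the pseudo-random generator, and the choice to shuffle an equal number of +1 and -1 values are assumptions made for this example; the chapter does not specify the generator.

```python
import random


def generate_watermark(key, n):
    """Return a key-dependent sequence of n values drawn from {-1, +1}.

    Assumption: half of the values are +1 and the rest are -1, shuffled with
    a generator seeded by the secret key, so the mean is exactly zero (and
    the variance exactly 1) when n is even.
    """
    rng = random.Random(key)  # the same key always yields the same sequence
    half = n // 2
    bits = [1] * half + [-1] * (n - half)
    rng.shuffle(bits)
    return bits


if __name__ == "__main__":
    w = generate_watermark(key=500, n=10)
    print(w, "mean =", sum(w) / len(w))
```

Seeding with the key means the detector can regenerate the same sequence later without storing it, provided it knows the key and the number of characters.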

The whole document is interpreted and two vectors are formed, $C = \{c_i\},\ i = 1, 2, \ldots, N$, and $M = \{m_i\},\ i = 1, 2, \ldots, N$; the former is the vector of the characters of the document and the latter is the vector of their metrics. The character metrics are first modified as follows:

Authentication of Script Format Documents Using Watermarking Techniques 247

$$m'_i = m_i + \frac{\mathrm{ASCII}\left(c_i\right)}{1000} \tag{1}$$

where $c_i$ is the i-th character in the document and $\mathrm{ASCII}(c_i)$ is the ASCII value of character $c_i$. For example, if $c_i = \text{a}$, then $\mathrm{ASCII}(c_i) = 97$.

The watermark is embedded using a multiplicative rule as follows:

$$M_i = m'_i \left(1 + g w_i\right) \tag{2}$$

where $M_i$ is the watermarked metric corresponding to the i-th character, $w_i$ is the i-th watermark bit and $g$ is the gain factor. The watermarked metrics form another vector, $M' = \{M_i\},\ i = 1, 2, \ldots, N$. In our experiments, we found that a good value for $g$ is one that makes the detector response just cross the threshold, as depicted in Fig. 7; this keeps a balance between watermark imperceptibility and tamper detection capability.

Fig. 7. Watermarking detection, the watermark was generated using key number 500. The use of a gain value that barely crosses the threshold is advised.

Then the watermarked metrics vector $M'$ replaces the original metrics vector $M$. Finally, the vectors $C$ and $M'$ are used to re-assemble the document; for a better understanding, see Fig. 8.
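Equations (1) and (2) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `embed_watermark` and its argument names are hypothetical, and `ord()` is used as the ASCII value, which assumes that the characters are plain ASCII.

```python
def embed_watermark(chars, metrics, watermark, gain):
    """Apply equations (1) and (2) to one vector of character metrics.

    chars     -- the characters c_i of the document (assumed plain ASCII)
    metrics   -- the original metrics m_i, in units of 1/72 inch
    watermark -- the sequence w_i of -1/+1 values
    gain      -- the gain factor g
    """
    if not (len(chars) == len(metrics) == len(watermark)):
        raise ValueError("chars, metrics and watermark must have equal length")

    watermarked = []
    for c, m, w in zip(chars, metrics, watermark):
        m_prime = m + ord(c) / 1000.0                   # equation (1)
        watermarked.append(m_prime * (1.0 + gain * w))  # equation (2)
    return watermarked


if __name__ == "__main__":
    # First two characters and metrics of the row in Fig. 5; the watermark
    # bits and the gain are arbitrary values chosen for illustration.
    print(embed_watermark("C ", [8.100947, 3.930948], [1, -1], gain=0.01))
```

Because the ASCII term ties each metric to its character, replacing a character without recomputing its metric leaves a value that no longer matches what the embedding would have produced for the new character.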

To detect the watermark, on the other hand, we need to retrieve the watermarked metrics vector from the file, which gives the vector $\tilde{M} = \{\tilde{m}_i\},\ i = 1, 2, \ldots, N$, where $\tilde{m}_i$ is the i-th extracted metric. The presence of the watermark can then be assessed by computing the cross-correlation $d$ between the retrieved metrics vector $\tilde{M}$ and the watermark $W$ as follows:

$$d = \frac{1}{N} \sum_{i=1}^{N} \tilde{m}_i w_i \tag{3}$$
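Equation (3) amounts to a plain average of products. A minimal Python sketch follows; the function name `correlation` is hypothetical.

```python
def correlation(extracted_metrics, watermark):
    """Return d = (1/N) * sum(m~_i * w_i), as in equation (3)."""
    n = len(watermark)
    if n == 0 or len(extracted_metrics) != n:
        raise ValueError("both vectors must be non-empty and of equal length")
    return sum(m * w for m, w in zip(extracted_metrics, watermark)) / n
```

For an untampered document, substituting equation (2) gives $d = \frac{1}{N}\sum m'_i w_i + \frac{g}{N}\sum m'_i$, because $w_i^2 = 1$. With a zero-mean watermark the first term is expected to be small, so $d$ is roughly $g$ times the mean of the modified metrics.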

Fig. 8. Detailed block diagram of the Watermarking algorithm.

The value of $d$ must be compared with the threshold $Th$: if $d \geq Th$ holds, the watermark is present and the document is considered authentic; otherwise, it is considered tampered. The threshold is computed as:

$$Th = 2.8 \sqrt{2 \frac{\sigma^2}{N}} \tag{4}$$

where $\sigma^2$ is the variance of the vector of metrics $M$.

Equation (4) is a modification of the one proposed by Piva as the optimal threshold for correlation-based detectors. Since the proposed system holds the same assumptions as those presented in (Piva, 1998), equation (4) holds. However, in order to achieve accurate results for the intended application, the value 3.3 of the original equation was changed to 2.8, because in this way a lower embedding gain can be set, which helps to make the watermark very fragile. A lower value of $Th$ is thus desirable because it helps to reduce the false positive error rate (a false positive occurs when the system decides that a tampered document is authentic; a false negative occurs when the system decides that an authentic document is tampered). A block diagram of the watermark detection process is shown in Fig. 9.

Fig. 9. Watermarking detection.
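The threshold of equation (4) and the resulting decision can be sketched in Python as follows. The function names are hypothetical, the population variance (`statistics.pvariance`) is assumed as the estimate of $\sigma^2$, and treating the boundary case `d == Th` as authentic is likewise an assumption.

```python
import math
import statistics


def detection_threshold(metrics, factor=2.8):
    """Return Th = factor * sqrt(2 * sigma^2 / N), as in equation (4).

    Assumption: sigma^2 is the population variance of the metrics vector.
    """
    n = len(metrics)
    if n == 0:
        raise ValueError("metrics must not be empty")
    sigma2 = statistics.pvariance(metrics)
    return factor * math.sqrt(2.0 * sigma2 / n)


def is_authentic(d, threshold):
    """Decide 'authentic' when the correlation d reaches the threshold.

    Assumption: the boundary case d == threshold counts as authentic.
    """
    return d >= threshold
```

Passing `factor=3.3` gives the value of the constant that the text attributes to the original equation, which makes the two settings easy to compare on the same metrics vector.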

Experimental results and discussions are presented in the next section.

To further support the results of the MOS, we present a measure of the distortion of the watermarked metrics compared with the original metrics (see Fig. 10). It can be seen that when a character with a high ASCII value appears in the document, the distortion becomes larger, although it is too small to cause significant distortion.

Fig. 10. Error percentage for each character in the ASCII code for some random watermark; the maximum distortion is about 16 %.

In Fig. 11, a piece of a document and its watermarked version are shown.

**5.2 Tamper detection capability**

Let us consider two ways of tampering with a document. In the first one, the attacker changes characters at will without changing the metrics, expecting that this will not damage the watermark. If the attack is carried out this way, we can expect a document such as the one shown in Fig. 12. It is quite evident that some modifications were made, so any human can easily detect the tampering even if the original document is not available for comparison. Now consider another variant: the attacker has knowledge of the file standard and so has the skills needed to modify the document while preserving its natural look. To achieve this goal, the attacker must re-compute the metrics of the tampered characters; as expected, the more characters are tampered with, the greater the damage to the watermark. Fig. 13 shows the typical behaviour of this phenomenon: once the correlation value $d$ falls below the threshold value, it never surpasses it again; furthermore,
