**6. Testers' validation**

One of the problems with any subjective experiment is the reliability of the subjects or, more precisely, of each individual subject. If a subject proves to be unreliable, any conclusions based on his/her answers may be misleading or simply useless. Therefore the issue of detecting and eliminating unreliable subjects is important in all types of subjective experiments.

Eliminating subjects needs to be based on strict rules, otherwise there is a risk that a subject is eliminated simply because his/her answers do not fit the theory being tested. In other words, any subjective criteria need to be eliminated. The correct methodology should be objective and allow for each subject's individual preferences.

On the other hand, it is necessary to detect subjects who do not take the experiment seriously, i.e. they answer randomly and do not care about giving correct and precise answers. There may also be subjects for whom a given test is too difficult (for example video sequences appear too fast).

The most popular way to validate subjects is correlation. It is a simple and intuitively correct method. We compute the correlation between an individual subject's scores and the scores of all other subjects. We used this method in VQEG in (VQE, 2010). The problems entailed are: (1) setting the subject elimination threshold, (2) eliminating whole subjects rather than single answers, and (3) the need for all subjects to carry out the same tasks numerous times. For the first problem, an experienced scientist can specify a correct threshold fitting the problem which he/she is analyzing. The second problem is more difficult to deal with. We know that even for reliable subjects some of their answers are likely to be incorrect. This may be a simple consequence of being tired or distracted for a short period of time. The correlation methodology cannot help in dealing with this problem. The third problem is not important for quality-based subjective experiments, since the same sequences are scored in any case (e.g. the same source sequence encoded using different compression parameters). Nevertheless, in task-based subjective experiments the same source sequence should not be shown many times, because the correct answer for a particular task could be remembered. For this reason a different pool of sequences is shown to different subjects (e.g. each compression level for a given source sequence needs to be shown to a different subject).

Based on the above parameters, it is easy to determine that the whole test set consists of 900 sequences (each SRC 1-30 encoded into each HRC 1-30).


Fig. 3. Generation of HRCs.


A more formal way toward validation of subjects is the Rasch theory (Boone et al., 2010). It defines the difficulty level of each particular question (e.g. a single video sequence from a test set), as well as whether a subject is more or less critical in general. Based on this information it is possible to detect answers that do not fit not only the average, but also the individual subject's behavior. Formally, the probability of giving a correct answer is estimated by the equation (Baker, 1985)

$$P(X\_{in} = 1) = \frac{1}{1 + \exp(\delta\_i - \beta\_n)} \tag{2}$$

where *β<sub>n</sub>* is the ability of the *n*th person to perform a task and *δ<sub>i</sub>* is the *i*th task difficulty.

Estimating both the task difficulty and the subject ability makes it possible to predict the correct answer probability. This probability can then be compared with the actual task result.
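As a minimal illustration, the Rasch success probability can be computed directly from a subject's ability and a task's difficulty. The function name and parameter values below are hypothetical; this is a sketch of the standard Rasch form, not code from the study:

```python
import math

def rasch_probability(beta_n: float, delta_i: float) -> float:
    """Rasch model: probability that a subject with ability beta_n
    gives a correct answer to a task with difficulty delta_i."""
    return 1.0 / (1.0 + math.exp(delta_i - beta_n))

# When ability equals difficulty, the success probability is exactly 0.5;
# a more able subject succeeds more often on the same task.
print(rasch_probability(1.0, 1.0))                                # 0.5
print(rasch_probability(2.0, 1.0) > rasch_probability(0.5, 1.0))  # True
```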

In order to estimate the *β<sub>n</sub>* and *δ<sub>i</sub>* values, the same tasks have to be run by all the subjects, which is a disadvantage of the Rasch theory, similarly to the correlation-based method. Moreover, the more subjects involved in the test, the higher the accuracy of the method. An excellent example of this methodology in use is national high school exams, where the Rasch theory helps in detecting differences between the boards marking the pupils' tests (Boone et al., 2010). In subjective experiments, there is always a limited number of answers per question. This means that the Rasch theory can still be used, although the results need to be checked carefully. Task-based experiments are a worst-case scenario: each subject carries out a task a very limited number of times in order to ensure that the task result (for example license plate recognition) is based purely on the particular distorted video and is not remembered by the subject. This makes the Rasch theory difficult to use.

In order to solve this problem we propose two custom metrics for subject validation. They both work for partially ordered test sets (Insall & Weisstein, 2011), i.e. those for which certain subsets can be ordered by task difficulty. Additionally we assume that answers can be classified as correct or incorrect (i.e. a ground truth is available, e.g. a license plate to be recognized). Note that due to the second assumption these metrics cannot be used for quality assessment tasks, since we cannot say that one answer is better than another (as we have mentioned before, there is no ground truth regarding quality).

#### **6.1 Logistic metric**

The assumption that the test set is partially ordered can be interpreted in a numeric way: if a subject fails to recognize a license plate, and for *n* sequences with higher or equal QP the license plate was recognized correctly by other subjects, the subject's inaccuracy level is increased by *n*. Higher *n* values may indicate a better chance that the subject is irrelevant and did not pay attention to the recognition task.

Computing such coefficients for different sequences results in the total subject quality (*Sqi*) given by

$$Sq\_i = \sum\_{j \in S\_i} ssq\_{i,j} \tag{3}$$


Quality Assessment in Video Surveillance 67

where *l*(*i*, *j*) is the Levenshtein distance between the correct answer and the subject *i* answer for the *j*th sequence, *B* is set of all subjects, and *Aj* is the set of all sequences for which the task

<sup>0</sup> <sup>5</sup> <sup>10</sup> <sup>15</sup> <sup>20</sup> <sup>25</sup> <sup>30</sup> <sup>35</sup> <sup>40</sup> <sup>45</sup> <sup>50</sup> <sup>0</sup>

Sql

Figure 5 shows the histogram obtained for *Sql*. It is significantly different than the previous one obtained for *Sq*. It can be assumed that an *Sql* higher than 10 or 15 indicates a potentially irrelevant subject. One subject obtained significantly higher *Sql* value than the others (50). More detailed investigation of this case revealed that the subject provided additional text for one answer. After correction the corrected value for this subject is 25, which is still very high.

In the area of entertainment video, a great deal of research has been carried out on the parameters of the contents that are the most effective for perceptual quality. These parameters form a framework in which predictors can be created such that objective measurements can

Analysis of the traditional QoE subjective experiment data is focused on the mean subject answer modeling. In addition subject reliability is controlled by the correlation test. Nevertheless, in case of the task-based QoE (QoR) it is impossible or very difficult to use such methodology. Therefore, modeling QoR subjective data calls for new methodology which is

The first step of the subjective experiment data analysis is subject evaluation which is presented in the previous section. The next step of the data analysis is finding the probability

 <sup>0</sup> if *<sup>l</sup>*(*i*, *<sup>j</sup>*) <sup>≤</sup> *<sup>l</sup>*(*m*, *<sup>k</sup>*) *l*(*i*, *j*) − *l*(*m*, *k*) if *l*(*i*, *j*) > *l*(*m*, *k*)

(7)

*ssqli*,*<sup>j</sup>* =

∑ *m*∈*B*

∑ *k*∈*Aj*

is not easier defined at the beginning of the section.

2

Fig. 5. Histogram of subject quality *Sql* obtained for all 30 subjects.

be developed through the use of subjective testing (Takahashi et al., 2010).

4

Number

**7. Modeling approaches**

presented in this section.

 of Subjects

6

8

10

12

sequence *j* given by

where *Si* is the set of all sequences carried out by the *i*th subject, and *ssqi*,*<sup>j</sup>* is the subject sequence quality for sequence *j*, which is given by

$$ssq\_{i,j} = \begin{cases} 0 & \text{if } r(i,j) = 1\\ n & \text{if } r(i,j) = 0 \end{cases} \tag{4}$$

where

$$n = \sum\_{k \in A\_j} \sum\_{m \in B} r(m, k) \tag{5}$$

where *r*(*i*, *j*) is 1 if the *i*th subject recognized the *j*th sequence and 0 otherwise, *B* is the set of all subjects, and *Aj* is the set of all not-easier sequences as defined above. In the case of this experiment, *Aj* is the set of sequences with the same resolution and view but with a higher or equal QP than the *j*th sequence.
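Equations (3)-(5) can be sketched in a few lines of Python. The data layout below (dictionaries `r`, `sequences_of` and `not_easier`) and the toy values are hypothetical, chosen only to illustrate the computation:

```python
def subject_quality(i, r, sequences_of, not_easier):
    """Total subject quality Sq_i (Eq. 3): sum of ssq_{i,j} over the
    sequences S_i carried out by subject i."""
    total = 0
    for j in sequences_of[i]:
        if r[i][j] == 1:          # correct recognition -> no penalty (Eq. 4)
            continue
        # Eq. (5): count correct answers by all subjects on the
        # not-easier sequences A_j.
        total += sum(r[m][k] for k in not_easier[j] for m in r)
    return total

# Toy data: two subjects, two sequences; sequence 1 is not easier than 0.
r = {"s1": {0: 0, 1: 1}, "s2": {0: 1, 1: 1}}
sequences_of = {"s1": [0, 1], "s2": [0, 1]}
not_easier = {0: [0, 1], 1: [1]}
print(subject_quality("s1", r, sequences_of, not_easier))  # 3
print(subject_quality("s2", r, sequences_of, not_easier))  # 0
```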

We computed *Sq* for each subject; the results are presented in Figure 4.

Fig. 4. Histogram of subject quality *Sq* obtained for all 30 subjects.

The histogram shows that the value of 6 was exceeded for just three subjects, denoted with IDs 18, 40 and 48. Such subjects should be removed from the test.

The *Sqi* metric assumes that a task can be done correctly or incorrectly. In the case of recognition, the metric returns the same value whether one character or all characters are missed. For license plate recognition and many other tasks, the level of error can be defined. The next proposed metric takes into consideration the incorrectness level of the answer.

#### **6.2 Levenshtein distance**

Levenshtein distance is the number of edits required to obtain one string from another. Subject quality based on Levenshtein distance *Sql* is given by

$$Sql\_i = \sum\_{j=1}^{30} ssql\_{i,j} \tag{6}$$


where *ssqli*,*<sup>j</sup>* is the subject quality metric based on Levenshtein distance obtained for subject *i* and sequence *j*, given by

$$ssql\_{i,j} = \sum\_{k \in A\_j} \sum\_{m \in B} \begin{cases} 0 & \text{if } l(i,j) \le l(m,k) \\ l(i,j) - l(m,k) & \text{if } l(i,j) > l(m,k) \end{cases} \tag{7}$$

where *l*(*i*, *j*) is the Levenshtein distance between the correct answer and the subject *i* answer for the *j*th sequence, *B* is the set of all subjects, and *Aj* is the set of all sequences for which the task is not easier, as defined at the beginning of the section.
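A compact sketch of Eqs. (6)-(7), including a plain Levenshtein distance implementation. The data layout and the toy license plate values are hypothetical:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for x, ca in enumerate(a, 1):
        cur = [x]
        for y, cb in enumerate(b, 1):
            cur.append(min(prev[y] + 1,                 # deletion
                           cur[y - 1] + 1,              # insertion
                           prev[y - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def sql_metric(i, answers, truth, not_easier, subjects):
    """Sql_i of Eqs. (6)-(7): penalize subject i whenever their edit
    distance on sequence j exceeds another subject's distance on a
    not-easier sequence k."""
    total = 0
    for j in answers[i]:
        l_ij = levenshtein(answers[i][j], truth[j])
        total += sum(max(0, l_ij - levenshtein(answers[m][k], truth[k]))
                     for k in not_easier[j] for m in subjects)
    return total

# Toy example: subject s1 misreads one character of the plate.
truth = {0: "WX123"}
answers = {"s1": {0: "WX128"}, "s2": {0: "WX123"}}
print(sql_metric("s1", answers, truth, {0: [0]}, ["s1", "s2"]))  # 1
```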

Fig. 5. Histogram of subject quality *Sql* obtained for all 30 subjects.

Figure 5 shows the histogram obtained for *Sql*. It is significantly different from the previous one obtained for *Sq*. It can be assumed that an *Sql* higher than 10 or 15 indicates a potentially irrelevant subject. One subject obtained a significantly higher *Sql* value than the others (50). A more detailed investigation of this case revealed that the subject provided additional text for one answer. After correction, the value for this subject is 25, which is still very high.

#### **7. Modeling approaches**

In the area of entertainment video, a great deal of research has been carried out on the parameters of the contents that are the most effective for perceptual quality. These parameters form a framework in which predictors can be created such that objective measurements can be developed through the use of subjective testing (Takahashi et al., 2010).

Analysis of traditional QoE subjective experiment data is focused on modeling the mean subject answer. In addition, subject reliability is controlled by a correlation test. Nevertheless, in the case of task-based QoE (QoR) it is impossible or very difficult to use such a methodology. Therefore, modeling QoR subjective data calls for the new methodology which is presented in this section.

The first step of the subjective experiment data analysis is subject evaluation, which was presented in the previous section. The next step of the data analysis is finding the probability of doing a particular task correctly. Again, this is different from traditional QoE, since the model has to predict a probability, not a mean value. This calls for the use of more general models such as the Generalized Linear Model (GLZ) (Agresti, 2002).



The last open problem is the explanatory variables, i.e. the metrics which can correlate well with the probability of correct task execution. Assessment principles for task-based video quality are a relatively new field. Solutions developed so far have been limited mainly to optimizing network Quality of Service (QoS) parameters. Alternatively, classical quality models such as the Peak Signal-to-Noise Ratio (PSNR) (Eskicioglu & Fisher, 1995) or Structural Similarity (SSIM) (Wang et al., 2004) have been applied, although they are not well suited to the task. The chapter presents an innovative, alternative approach, based on modeling detection threshold probabilities.

The testers who participated in this study provided a total of 960 answers. Each answer could be interpreted as the number of per-character errors, i.e. 0 errors meaning correct recognition. The average probability of a license plate being identified correctly was **54.8%** (**526/960**), and **64.1%** of recognitions had no more than one error. **72%** of all characters were recognized correctly.

## **7.1 Answers analysis**

The goal of this analysis is to find the detection probability as a function of a certain parameter(s) i.e. the explanatory variables. The most obvious choice for the explanatory variable is bit-rate, which has two useful properties. The first property is a monotonically increasing amount of information, because higher bit-rates indicate that more information is being sent. The second advantage is that if a model predicts the needed bit-rate for a particular detection probability, it can be used to optimize the network utilization.

Moreover, if the network link has limited bandwidth, the detection probability for the achievable bit-rate can be computed directly from this function, which can be the key information a practitioner needs to decide whether the system is sufficient or not.

The Detection Probability (DP) model should predict the DP, i.e. the probability of obtaining 1 (correct recognition). In such cases, the correct model is logit (Agresti, 2002). The simplest logit model is given by the following equation:

$$p\_d = \frac{1}{1 + \exp(a\_0 + a\_1 x)}\tag{8}$$

where *x* is an explanatory variable, *a*<sub>0</sub> and *a*<sub>1</sub> are the model parameters, and *pd* is the detection probability.

The logit model can be more complicated; we can add more explanatory variables, which may be either categorical or numerical. Nevertheless, the first model tested was the simplest one.

Building a detection probability model for all of the data is difficult, so we considered a simpler case based on the HRC groups (see section 5.3). Each group of five HRCs (1-5, 6-10, etc.) can be used to estimate the threshold for a particular HRC group. For example, in Figure 6(a) we show an example of the model and the results obtained for HRCs 20 to 25.

The obtained model crosses all the confidence intervals for the observed bit-rates. The saturation levels on both sides of the plot are clearly visible. Such a model could successfully be used to investigate detection probability.
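Eq. (8), together with its inversion (the bit-rate required to reach a target detection probability), can be sketched as follows. The parameter values *a*0 = 12 and *a*1 = -0.1 are hypothetical, chosen only so that DP = 0.5 at 120 kbit/s:

```python
import math

def detection_probability(x: float, a0: float, a1: float) -> float:
    """Logit model of Eq. (8); with a1 < 0, DP grows with bit-rate x."""
    return 1.0 / (1.0 + math.exp(a0 + a1 * x))

def required_bitrate(p: float, a0: float, a1: float) -> float:
    """Invert Eq. (8): bit-rate needed to reach detection probability p."""
    return (math.log(1.0 / p - 1.0) - a0) / a1

a0, a1 = 12.0, -0.1                           # hypothetical model parameters
print(detection_probability(120.0, a0, a1))   # 0.5
print(round(required_bitrate(0.9, a0, a1)))   # 142
```

The inversion is what makes the model useful for network dimensioning: it answers "what bit-rate do I need for a 90% detection probability?" directly.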


Fig. 6. Example of the logit model and the obtained detection probabilities. (a) For HRC 20 to 25. (b) For all HRCs.

Extending the sequences analyzed to all HRCs results in the model drawn in Figure 6(b).

The result obtained is less precise. Some of the points are strongly scattered (see results for bit-rate 110 to 130 kbit/s). Moreover, comparing the models presented in Figure 6(a) and Figure 6(b) different conclusions can be drawn. For example, 150 kbit/s results in around a 90% detection probability for HRCs 20 to 25 and less than 70% for all HRCs. It is therefore evident that the bit-rate itself cannot be used as the only explanatory variable. The question then is, what other explanatory variables can be used.

In Figure 8(a) we show DP obtained for SRCs. The SRCs had a strong impact on the DP. We would like to stress that there is one SRC (number 26) which was not detected even once (see Figure 7(a)). The non-zero confidence interval comes from the corrected confidence interval computation explained in (Agresti & Coull, 1998). In contrast, SRC number 27 was almost always detected, i.e. even for very low bit-rates (see Figure 7(b)). A detailed investigation shows that the most important factors (in order of importance) are:

1. the contrast of the plate characters,
2. the characters, as some of them are more likely to be confused than others, as well as
3. the illumination, if part of the plate is illuminated by a strong light.
A better DP model has to include these factors. On the other hand, these factors cannot be fully controlled by the monitoring system, and therefore these parameters help to understand what kind of problems might influence DP in a working system. Factors which can be controlled are described by different HRCs. In Figure 8(b) we show the DP obtained for different HRCs.

For each HRC, we used all SRCs, and therefore any differences observed between HRCs should be SRC independent. HRC behavior is more stable, as detection probability decreases for higher QP values. One interesting effect is the clear threshold in the DP: in all HRC groups, two consecutive HRCs for which the DPs are strongly different can be found, for example HRC 4 and 5, HRC 17 and 18, and HRC 23 and 24. Another effect is that even for the same QP the detection probability obtained can be very different (for example HRC 4 and 24).



Fig. 7. The SRCs had a strong impact on the DP. (a) One SRC (number 26) which was not detected even once. (b) SRC number 27 was almost always detected, i.e. even for very low bit-rates.

Fig. 8. The detection probabilities obtained. (a) For different SRCs with 90% confidence intervals. (b) For different HRCs. The solid lines correspond to QP from 43 to 51, and the dashed lines correspond to QP from 37 to 45.

Different HRCs groups have different factors which can strongly influence the DP. The most important factors are differences in spatial and temporal activities and plate character size. We cropped and/or re-sized the same scene (SRC) resulting in a different output video sequence which had different spatial and temporal characteristics.


In order to build a precise DP model, differences resulting from SRCs and HRCs analysis have to be considered. In this experiment we found factors which influence the DP, but we observed an insufficient number of different values for these factors to build a correct model. Therefore, the lesson learned from this experiment is highly important and will help us to design better and more precise experiments in the future.

#### **7.2 Alternative way of modeling perceptual video quality**

For further analysis we assumed that the threshold detection parameter to be analyzed is the probability of plate recognition with no more than one error. For detailed results, please refer to Figure 9.

It was possible to fit a polynomial function in order to model the quality (expressed as the detection threshold probability) of the license plate recognition task. This is an alternative, innovative approach. The achieved *R*<sup>2</sup> is 0.86 (see Figure 9). According to the model, one may expect 100% correct recognition for bit-rates of around 360 kbit/s and higher. Obviously, recognition accuracy also depends on many external conditions and on the size of image details; therefore, 100% can be expected only if these other conditions are ideal.

Unfortunately, due to the relatively high diversity of the subjective answers, no better fit was achievable in either case. However, a slight improvement is likely to be possible by using other curves.
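A fit of this kind can be reproduced in a few lines. The bit-rate/DP pairs below are hypothetical placeholders (the chapter's measured values are not reproduced here), so the resulting coefficients and *R*<sup>2</sup> are purely illustrative:

```python
import numpy as np

# Hypothetical (bit-rate [kbit/s], detection probability) samples;
# illustrative only -- not the chapter's measurements.
bitrate = np.array([60, 90, 120, 180, 240, 300, 360, 420], dtype=float)
dp = np.array([0.05, 0.15, 0.35, 0.60, 0.80, 0.92, 0.99, 1.00])

# Fit a 2nd-degree polynomial DP ~ p(bit-rate).
model = np.poly1d(np.polyfit(bitrate, dp, deg=2))

# Coefficient of determination R^2 of the fit.
ss_res = np.sum((dp - model(bitrate)) ** 2)
ss_tot = np.sum((dp - dp.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

The same residual-based *R*<sup>2</sup> computation applies regardless of the curve family chosen, so alternative fits (e.g. higher-degree polynomials) can be compared on equal terms.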

Fig. 9. Example of the obtained detection probability and model of the license plate recognition task.

Summarizing the presentation of the results for the quality modeling case, we would like to note that a common method of presenting results can be used for any other modeling case. This is possible through the application of appropriate transformations, allowing the fitting of diverse recognition tasks into a single quality framework.

Quality Assessment in Video Surveillance 73

#### **References**

Faye, P., Brémaud, D., Daubin, M. D., Courcoux, P., Giboreau, A. & Nicod, H. (2004). Perceptive free sorting and verbalisation tasks with naive subjects: an alternative to descriptive mappings, *Food Quality and Preference* 15(7-8): 781–791. Fifth Rose Marie Pangborn Sensory Science Symposium. URL: http://www.sciencedirect.com/science/article/pii/S0950329304000540

Ford, C. & Stange, I. (2010). A framework for generalising public safety video applications to determine quality requirements, *in* A. Dziech & A. Czyzewski (eds), *Multimedia Communications, Services and Security*, AGH University of Science and Technology, Kraków.

Ghinea, G. & Chen, S. Y. (2008). Measuring quality of perception in distributed multimedia: Verbalizers vs. imagers, *Computers in Human Behavior* 24(4): 1317–1329.

Ghinea, G. & Thomas, J. P. (1998). QoS impact on user perception and understanding of multimedia video clips, *Proceedings of the sixth ACM international conference on Multimedia*, MULTIMEDIA '98, ACM, New York, NY, USA, pp. 49–54. URL: http://doi.acm.org/10.1145/290747.290754

Insall, M. & Weisstein, E. W. (2011). *Partially Ordered Set*, MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/PartiallyOrderedSet.html.

ITU-T (1999). Recommendation 910: Subjective video quality assessment methods for multimedia applications, ITU-T Rec. P.910. URL: http://www.itu.int/

ITU-T (2000). Recommendation 500-10: Methodology for the subjective assessment of the quality of television pictures, ITU-R Rec. BT.500. URL: http://www.itu.int/

ITU-T (2008). Recommendation 912: Subjective video quality assessment methods for recognition tasks, ITU-T Rec. P.912. URL: http://www.itu.int/

Janowski, L. & Romaniak, P. (2010). QoE as a function of frame rate and resolution changes, *in* S. Zeadally, E. Cerqueira, M. Curado & M. Leszczuk (eds), *Future Multimedia Networking*, Vol. 6157 of *Lecture Notes in Computer Science*, Springer Berlin / Heidelberg, pp. 34–45. URL: http://dx.doi.org/10.1007/978-3-642-13789-1\_4

Leszczuk, M. (2011). Assessing task-based video quality — a journey from subjective psycho-physical experiments to objective quality models, *in* A. Dziech & A. Czyzewski (eds), *Multimedia Communications, Services and Security*, Vol. 149 of *Communications in Computer and Information Science*, Springer Berlin Heidelberg, pp. 91–99. URL: http://dx.doi.org/10.1007/978-3-642-21512-4\_11

Leszczuk, M., Janowski, L., Romaniak, P., Glowacz, A. & Mirek, R. (2011). Quality assessment for a licence plate recognition task based on a video streamed in limited networking conditions, *in* A. Dziech & A. Czyzewski (eds), *Multimedia Communications, Services and Security*, Vol. 149 of *Communications in Computer and Information Science*, Springer Berlin Heidelberg, pp. 10–18. URL: http://dx.doi.org/10.1007/978-3-642-21512-4\_2

Leszczuk, M. I., Stange, I. & Ford, C. (2011). Determining image quality requirements for recognition tasks in generalized public safety video applications: Definitions, testing, standardization, and current trends, *Broadband Multimedia Systems and Broadcasting (BMSB), 2011 IEEE International Symposium on*, pp. 1–5.

Nyman, G., Radun, J., Leisti, T., Oja, J., Ojanen, H., Olives, J. L., Vuori, T. & Hakkinen, J. (2006). What do users really perceive — probing the subjective image quality experience, *Proceedings of the SPIE International Symposium on Electronic Imaging 2006: Imaging Quality and System Performance III*, Vol. 6059, pp. 1–7.
