**4.3 Methods**

#### **Definition of settings of methods**

Several settings can be varied, depending on the type of results the quality study is intended to produce. The most important ones are described in this section.

Single or Double Stimulus methods

In double stimulus methods, viewers are shown each pair of video sequences: the reference and the impaired one. In single stimulus methods, by contrast, viewers are shown only the impaired sequence.

The number of stimuli determines whether a comparison to a reference is possible. A reference allows the observer to detect artifacts and impairments in the image more easily than without the original signal.

In real conditions the user does not have a reference to compare against, so a single stimulus method is considered more realistic. However, a double stimulus method is more effective at avoiding errors caused by context effects. Context effects occur when subjective ratings are influenced by the severity and ordering of impairments within the test session.

In double stimulus methods there are two ways of presenting each pair of sequences, depending on the number of screens used in the study. With one screen, the two sequences of a pair must be shown one after the other. With two screens, every pair can be presented simultaneously, allowing the user to compare both sequences at the same time and detect the variation in quality.

The comparison scale is only available in double stimulus methods.

With or without repetition methods

One of the main problems affecting the results of subjective assessment is observer fatigue. An observer's scores are reliable only for a limited time.

Long sessions produce fatigue and exhaustion, which distort the results and invalidate the assessment. For that reason, each session must be kept to less than half an hour, with extended breaks between sessions.

Depending on the accuracy required of the study, each pair can be presented twice or more, i.e. with one or more repetitions. If the range of parameters to be measured is wide, session time can be reduced by presenting each pair of sequences only once, i.e. without repetition. The objective is to save time and extend the set of quality parameters (QP) under evaluation while avoiding observer fatigue.
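The trade-off between repetition and session length can be checked with simple arithmetic before a study is run. The following is a minimal sketch, not a prescribed procedure: the function name, its parameters and the example figures (60 pairs, 40 seconds per presentation) are assumptions chosen for illustration, and only the half-hour limit comes from the text above.

```python
import math


def plan_sessions(n_pairs, showings_per_pair, seconds_per_presentation,
                  max_session_minutes=30.0):
    """Estimate total viewing time and the number of sessions needed.

    All names are illustrative. seconds_per_presentation is assumed to
    cover one showing of a pair plus any voting time.
    """
    if n_pairs <= 0 or showings_per_pair <= 0 or seconds_per_presentation <= 0:
        raise ValueError("all quantities must be positive")
    total_minutes = n_pairs * showings_per_pair * seconds_per_presentation / 60.0
    # Split the total time so that no session exceeds the fatigue limit.
    n_sessions = math.ceil(total_minutes / max_session_minutes)
    return total_minutes, n_sessions


# Hypothetical plan: 60 pairs, 40 s per presentation.
print(plan_sessions(60, 2, 40))  # each pair shown twice
print(plan_sessions(60, 1, 40))  # each pair shown once (no repetition)
```

Under these assumed figures, dropping the repetition halves the viewing time, which leaves room either for fewer sessions or for more quality parameters within the same number of sessions.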

Absolute or Comparison methods

Depending on the objective of the study, the type of results expected can be defined. Absolute results are associated with single stimulus methods, whereas comparison methods are more closely associated with double stimulus methods, although it is possible to obtain absolute measurements with full reference.

The first type of method uses either the quality scale or the impairment scale, while the second type uses a scale called the "comparison scale", which expresses the relation between the members of each pair of sequences.

Additionally, it is important that the sound be synchronized with the video. This is most noticeable for speech and lip synchronization, for which time lags of more than approximately 100 ms are considered very annoying.

Continuous or discrete (non-continuous) evaluation methods

There are different options for combining video in a continuous way: one program, or a series of sequences of different or the same type of content. These programs may include one or several quality parameters under evaluation (e.g. bitrate). Each program should have a duration of at least 5 minutes.

The viewer's response must be fast enough to identify the impairment observed. Nevertheless, the varying delay may influence the assessment results if only the average over a program segment is calculated. Studies are being carried out to evaluate the impact of the response time of different viewers on the resulting quality grade.
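The effect of response delay on segment averages can be illustrated with a small sketch. It assumes a rating trace sampled at regular intervals and a single fixed delay per viewer, which is a simplification, since the text notes that the delay varies. The function name, the parameters and the sample values are hypothetical.

```python
def segment_means(samples, samples_per_segment, delay_samples=0):
    """Average a continuous rating trace over fixed-length program segments.

    delay_samples discards the first samples of the trace so that each
    rating lines up with the content that (by assumption) caused it.
    Incomplete trailing segments are dropped.
    """
    if samples_per_segment <= 0 or delay_samples < 0:
        raise ValueError("invalid segment length or delay")
    aligned = samples[delay_samples:]
    means = []
    last_start = len(aligned) - samples_per_segment
    for start in range(0, last_start + 1, samples_per_segment):
        segment = aligned[start:start + samples_per_segment]
        means.append(sum(segment) / len(segment))
    return means


# Hypothetical trace: the quality drops after 4 samples, but the viewer
# moves the slider 2 samples late.
trace = [80, 80, 80, 80, 80, 80, 40, 40, 40, 40]
print(segment_means(trace, 4))                   # no compensation
print(segment_means(trace, 4, delay_samples=2))  # fixed 2-sample delay assumed
```

Without compensation the second segment averages ratings that belong to two different quality levels, which is the distortion the paragraph above warns about.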

Fig. 5. Example of data from a continuous assessment

Type of scale

There are different types of scale, depending on the results that the researcher expects to obtain from the study. The four most representative ones appear below.

1. Quality Scale (QS)

This scale is used in different methods to evaluate the perceived quality of a video sequence in absolute terms. There are variations with different numbers of grades.


Video Quality Assessment 139


2. Impairment Scale (IS)

Unlike the QS, the IS tries to capture the effect of an artifact or other impairment on human perception.

| Grade | Impairment |
|---|---|
| 5 | Imperceptible |
| 4 | Perceptible, but not annoying |
| 3 | Slightly annoying |
| 2 | Annoying |
| 1 | Very annoying |

3. Comparison Scale (CS)

This scale is not allowed for single stimulus methods. The objective is to establish a relative judgment between a pair of sequences to evaluate impairment or degradation in the image. It is a 7-grade scale, as follows.

| Grade | Comparison |
|---|---|
| +3 | Much worse |
| +2 | Worse |
| +1 | Slightly worse |
| 0 | The same |
| -1 | Slightly better |
| -2 | Better |
| -3 | Much better |

4. Numerical Scale (NS)

The numerical scale uses numbers to obtain the opinion of the observers. The range of values depends on the number of grades in the scale.

The most frequently used numerical scale is known as the Mean Opinion Score (MOS), normalized as a five-grade scale ranging from 1 to 5. Other scales are the 10-grade scale from 1 to 10 and the 8-grade scale from 1 to 8, although it is sometimes difficult to find equivalent adjectives for each grade. Other numerical scales include, for example, the comparison scale, which uses 7 grades including zero to indicate no perceptible variation.

In other scales, zero is rarely used because of its negative connotations.

#### **Most frequent methods**

By combining the different settings described above, the most common subjective evaluation methods can be defined. Other combinations are also acceptable, even though they do not appear in the following list of the most representative methods.

The double-stimulus impairment scale (DSIS) method

This is the method used by the European Broadcasting Union (EBU) to measure the robustness of systems (i.e. failure characteristics).

The reference and the test sequence are shown only once. Subjects rate the amount of impairment in the test sequence comparing one to the other.

Fig. 6. Scheme of a DSIS system

The double-stimulus continuous quality-scale (DSCQS) method

The main purpose of the DSCQS method is to measure the quality of systems relative to a reference. Viewers are shown pairs of video sequences (the reference sequence and the impaired sequence) in a randomized order. It is widely accepted as an accurate test method with little sensitivity to context effects, as viewers are shown each pair twice. Viewers are asked to rate the quality of each sequence in the pair after the second showing. It is also used to measure the quality of stereoscopic image coding.
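The randomized presentation described above can be sketched as a small trial generator. This is an illustrative sketch only: the function name, the "A"/"B" labels and the condition names are assumptions, and a real test plan would normally add further constraints (for example, on how often the same scene appears consecutively).

```python
import random


def build_dscqs_trials(conditions, seed=0):
    """Build a randomized list of reference/impaired pairs.

    For each impaired condition, the position of the reference within the
    pair (A or B) is randomized, and the order of the pairs is shuffled.
    A fixed seed makes the plan reproducible.
    """
    rng = random.Random(seed)
    trials = []
    for condition in conditions:
        pair = ["reference", condition]
        rng.shuffle(pair)  # viewer is not told which one is the reference
        trials.append({"A": pair[0], "B": pair[1]})
    rng.shuffle(trials)  # randomize the order of the pairs in the session
    return trials


# Hypothetical impaired conditions (e.g. three bitrates of the same scene).
for trial in build_dscqs_trials(["1 Mbit/s", "2 Mbit/s", "4 Mbit/s"], seed=1):
    print(trial)
```

Using a different seed per viewer is one simple way to vary the ordering across the panel, so that ordering effects do not accumulate on the same pairs.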

Since standard double stimulus methods like DSCQS provide only a single quality score for a given video sequence, where a typical video sequence might be 10 seconds long, questions have been raised as to the applicability of these testing methods for evaluating the performance of objective real-time video quality monitoring systems.

Fig. 7. Scheme of a DSCQS system

Single-stimulus (SS) methods

The purpose of this method is to quantify the quality of systems (when no reference is available).

One method of this type, Absolute Category Rating (ACR), uses a single stimulus. Viewers see only the video under test, without the reference. They give one rating for its overall quality using a discrete five-level scale from 'bad' to 'excellent'. Because the reference is not shown with every test clip, ACR is a very efficient method compared to DSIS or DSCQS, which take almost 2 or 4 times as long, respectively.
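As an illustration of how ACR ratings are usually summarized, the sketch below computes a Mean Opinion Score on the 1 to 5 scale, together with an approximate 95% confidence interval. The mapping 1 = 'bad' to 5 = 'excellent', the normal-approximation interval (1.96 standard errors) and the sample ratings are assumptions made for this example rather than part of the method description above.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    ratings: integer scores from 1 ('bad') to 5 ('excellent'), one per viewer.
    The interval uses the normal approximation 1.96 * s / sqrt(n).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    mos = float(statistics.mean(ratings))
    if len(ratings) < 2:
        return mos, 0.0  # no spread can be estimated from one rating
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight viewers for one test clip.
mos, ci = mean_opinion_score([5, 4, 4, 3, 4, 5, 3, 4])
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With so few viewers the interval is wide, which is one reason subjective tests rely on larger panels than this example.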

Fig. 8. Scheme of a SS system

Stimulus-comparison (SC) or Pair-Comparison (PC) methods

For this method, test clips from the same scene but under different conditions (the quality parameter under evaluation) are paired in all possible combinations, and viewers make a preference judgment for each pair. This allows very fine quality discrimination between clips.

This method uses a comparison scale.

Fig. 9. Scheme of a SC system

Single stimulus continuous quality evaluation (SSCQE)

Instead of seeing separate short sequence pairs, viewers watch a program of typically 20–30 minutes' duration which has been processed by the system under test; the reference is not shown. Using a slider, the subjects continuously rate the instantaneously perceived quality on the DSCQS scale from 'bad' to 'excellent'.

The purpose of this type of study is to assess not only the basic quality of the images but also the fidelity of the information transmitted.

Fig. 10. Scheme of a SSCQE system

Simultaneous double stimulus for continuous evaluation (SDSCE) method

Two screens are necessary for this method of evaluation, placed side by side in front of the user. The left screen plays the reference sequence, while the right one plays the impaired sequence that viewers must score.

The main purpose of this method is to measure the fidelity between two video sequences. It is also used to compare different error resilience tools.

Each video pair is shown once or twice. The test session is shorter, which allows a larger number of quality parameters to be evaluated.

Fig. 11. Scheme of a SDSCE system

**5. Objective quality assessment**

In this section, the most important metrics are described to offer an overview of the techniques that develop this type of quality assessment.

There are three types of objective quality assessment, depending on the presence and availability of a reference image or any of its features: Full-Reference (FR), Reduced-Reference (RR) and No-Reference (NR).

Old metrics designed for digital imaging systems, such as MSE (Mean Squared Error) and PSNR (Peak Signal-to-Noise Ratio), which are still widely used for quality assessment, are defined in this section. They are still adequate for evaluating error measures

The analysis of the measurement of artifacts such as tiling or blurring, especially those introduced by video compression algorithms, will offer the reader a perspective on the evolution of metrics.

**5.1 Objective quality metrics**

Depending on the presence of a video reference, three kinds of analysis are defined:

- Full Reference (FR) metrics, when the original image is present and can be compared with the degraded image to obtain the reduction of quality caused by the process of encoding and decoding.
- Reduced Reference (RR) metrics. The original image is not available for the study, but some of its properties and characteristics can be used to obtain quality results.
- No-Reference (NR) metrics. There is no original image or properties of it; in this case the degraded image and its effect on the human visual system is the only tool for determining the quality of an image. This kind of metric is more complicated but

Fig. 12. Full Reference (FR) metric diagram
