*Recent Trends in Computational Intelligence*

The source sentence **f** consists of words *fj* and the target sentence **e** of words *ei*. Words *fj* belong to the source vocabulary *F* and the words *ei* to the target vocabulary *E*. The most probable target sentence is found according to Bayes' theory:

$$\mathbf{e} = \underset{\mathbf{e}}{\mathrm{argmax}}\; P(\mathbf{e})\, P(\mathbf{f}|\mathbf{e}). \tag{1}$$

Standard phrase-based SMT models consist of three components: the phrase translation model *ϕ*, the reordering model *d*, and the language model *pLM*. In the phrase-based model, the source sentence *f* is broken down into *I* phrases *fi*, and each source phrase *fi* is translated into a target phrase *ei*. Log-linear models of phrase-based SMT are most commonly used:

$$p(\mathbf{e}, a \mid \mathbf{f}) = \exp\!\left(\lambda_{\phi} \sum_{i=1}^{I} \log \phi(f_i \mid e_i) + \lambda_{d} \sum_{i=1}^{I} \log d(start_i - end_{i-1} - 1) + \lambda_{LM} \sum_{i=1}^{N} \log p_{LM}(e_i \mid e_1 \ldots e_{i-1})\right) \tag{2}$$

*ϕ* is the phrase translation probability, *d* is the probability distribution of reordering, *pLM* is the language model, and *N* is the number of words of the target sentence.

**Figure 2.**
*Statistical machine translation system using a language model based on surface forms, a language model based on MSD tags, a language model based on lemmas, and three OSMs.*

#### **2.4 Hybrid machine translation**

Over time, the differences between the two approaches have narrowed, and hybrid approaches emerged, which try to benefit from both of them. We distinguish two groups of hybrid MT: those guided by rule-based MT and those guided by statistical approaches. Hybrid systems guided by rule-based MT use statistical MT to identify the set of appropriate translation candidates and/or to combine partial translations into the final sentence in the target language. Hybrid systems guided by statistical MT use rules at the pre-/post-processing stages.

#### **2.5 Neural machine translation**

Neural MT emerged as a successor of statistical MT. It has made rapid progress in recent years, and it is paving its way into the translation industry as well. Neural MT is a deep learning-based approach to MT that uses a large neural network based on vector representations of words. Compared with statistical MT, there is no separate language model, translation model, or reordering model, but just a single sequence model, which predicts one word at a time. The prediction is conditioned on the source sentence and the already produced sequence in the target language. The prediction power of neural MT is more promising than that of statistical MT, as neural networks share statistical evidence between similar words. In **Figure 3** one of the proposed topologies for neural machine translation is given, with the same example sentence as in **Figure 2**. The input words are passed through the layers of the encoder (blue circles) to its last layer, the context vector, updating it for every input word. The context layer is then passed through the decoder layers (red circles) to output words, and it is again updated for each output word.
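The encoder-decoder flow described above can be sketched as a toy, untrained loop in pure Python. Everything here is an illustrative assumption (tiny vocabularies, random weights, a simple tanh update, greedy decoding); it shows only the information flow, not the architecture of any real neural MT system:

```python
import math
import random

random.seed(0)
D = 8  # toy hidden/context dimension

def vec():
    # random untrained parameters; a real system learns these
    return [random.uniform(-0.5, 0.5) for _ in range(D)]

src_vocab = ["I", "have", "been", "studying", "English", "for", "two", "years"]
tgt_vocab = ["angleščino", "študiram", "dve", "leti", "<eos>"]
src_emb = {w: vec() for w in src_vocab}
tgt_emb = {w: vec() for w in tgt_vocab}

def step(state, word_vec):
    # recurrent update: the state is refreshed for every word it consumes
    return [math.tanh(s + x) for s, x in zip(state, word_vec)]

def encode(words):
    state = [0.0] * D
    for w in words:              # context vector updated for every input word
        state = step(state, src_emb[w])
    return state

def decode(context, max_len=10):
    out, state = [], context
    for _ in range(max_len):
        # predict one word at a time, conditioned on the running state,
        # which summarises the source sentence and the produced prefix
        scores = {w: sum(s * e for s, e in zip(state, tgt_emb[w]))
                  for w in tgt_vocab}
        word = max(scores, key=scores.get)
        if word == "<eos>":
            break
        out.append(word)
        state = step(state, tgt_emb[word])  # state updated per output word
    return out

translation = decode(encode("I have been studying English for two years".split()))
```

With random weights the output is of course meaningless; training adjusts the embeddings and update weights so that the decoder's greedy choices form a fluent translation.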

The encoder-decoder recurrent neural network architecture with attention is currently the state of the art for machine translation.

Although effective, the neural MT systems still suffer some issues, such as scaling to larger vocabularies of words and the slow speed of training the models.


**Table 2.**

1. Angleščino študiram dve leti.
2. Dve leti študiram angleščino.
3. Študiram angleščino, dve leti.

*Word permutations of the English sentence "I have been studying English for two years" in Slovene.*

*Third example is used in colloquial speech.*

*Machine Translation and the Evaluation of Its Quality DOI: http://dx.doi.org/10.5772/intechopen.89063*


**Figure 3.**

*Neural machine translation system using the encoder-decoder topology.*

In addition, large corpus is needed to train neural MT systems with performance comparable to statistical machine translation. Researchers continue to work on solving open problems.

#### **3. Problems in machine translation**

The fast progress of MT has boosted translation quality significantly, but, unfortunately, machine translation approaches are not equally successful for all language pairs. Morphologically rich languages are problematic in MT, especially if the translation is from a morphologically less complex to a morphologically more complex language. Morphological distinctions not present in the source language need to be generated in the target language. Much work on morphology-aware approaches relies heavily on language-specific tools, which are not always available. Many morphologically rich languages fall in the category of low-resource languages.

One group of morphologically rich languages is a group of highly inflected languages. They are difficult not only for MT but also for other language technology applications [4, 5]. The main problem in highly inflected languages is that the large number of inflected word forms lead to data sparsity (see example in **Table 1**), which results in unreliable estimates in statistical MT [6]. Most words in a given corpus occur at most a handful of times. Therefore, the translation rule coverage is partial, and the estimation of translation probabilities is poor. Some approaches try to reduce the problem of data sparsity by using modelling units other than words; for example, stems and endings, lemmas and morphosyntactic tags, etc. Relaxed word order in inflectional languages poses another problem (see example in **Table 2**). Usually, very little information about the target word order is obtainable from the source sentence. Pre-ordering approaches learn to preprocess the source


#### **Table 1.**

*Inflected word forms of the word "student" (masculine) in Slovene.*


| Case | Singular | Dual | Plural |
|---|---|---|---|
| Nominative | študent | študenta | študenti |
| Genitive | študenta | študentov | študentov |
| Dative | študentu | študentoma | študentom |
| Accusative | študenta | študenta | študente |
| Locative | študentu | študentih | študentih |
| Instrumental | študentom | študentoma | študenti |


*In the example, the word has nine different endings.*


sentence during training in such a way that the words on the source side appear closer to their final positions on the target side. A frequent problem of inflectional languages is also an inaccurate translation of pronouns. There are also many cases in inflectional languages where the subject is dropped completely. Differences in the expression of negation are also problematic. Slavic languages fall into the category of highly inflected languages, and they cause many problems in machine translation [7, 8].

Another group of morphologically rich languages is a group of agglutinative languages, which are even more difficult to use in machine translation. In an agglutinative language, words may consist of more than one, and possibly many, morphemes. Each morpheme in a sequence indicates a particular grammatical meaning. Morphemes are used commonly as basic units in MT for those groups of languages. All these phenomena cause errors in translations produced by MT systems and make the use of MT questionable. It is necessary to evaluate MT quality before use in practice.

#### **4. Machine translation evaluation**

As MT emerges as an important mode of translation, its quality is becoming more and more important. Judging translation quality is called machine translation evaluation. With the exception of human (i.e. manual) evaluation, it is defined in technical terms: an algorithm that can be coded into a programme and run by a computer calculates an evaluation score, which tells the user how good a translation is. Translation evaluation methods count word- and/or sentence-based errors that can be detected automatically, while general text-level aspects are not taken into account. This weakness of automatic MT evaluation is one of the main criticisms in the translation community. Despite that, in the last decade, we have been witnessing great progress in automatic MT evaluation.

MT quality can be measured in many different ways, depending on the goal of the evaluation and the means available. Traditionally, there are two paradigms of machine translation evaluation: Glass-box evaluation and black-box evaluation. Glass-box evaluation measures the quality of a system based on internal system properties. Black-box evaluation examines only the system output, without connecting it to the internal mechanisms of the translation system. The focus in this section will be on black-box evaluation. It is concerned only with the objective behaviour of the system upon a predetermined evaluation set. An evaluation set is a set of sentences in the source language and their translations into the target language, obtained by the translation system. These sentence pairs are then exposed to the evaluation. An evaluation set needs to be selected carefully to cover all data features important for future use of the translation system. The same translation quality can then be expected on the other data that is of the same type as the evaluation set; if not, translations of quite different quality could be obtained.

The reason is that MT systems are trained on translation examples. If these examples are of a different type to the text that is afterwards translated by the system, the system has only weak knowledge about its translation and, consequently, produces poor translations. Different types of data mean variations in structure, genre, and style. Evaluation, on the other hand, can focus on testing the system's robustness. In this case, the evaluation set is composed of subsets of different data types. One should be aware that obtaining a robust MT system means at least training it with translation examples of different data types.

There is also a difference between judging and measuring the quality of MT output as a final product and judging and measuring the usability of MT output for subsequent corrections by humans, called post-editing (PE). As regards the latter, it is interesting to know how much editing effort is needed to make the MT output match a reference translation or become an acceptable translation. What counts as an acceptable translation is left to the translation expert to decide; in the MT community, there are no fixed criteria for it.


**Table 3.**

| Adequacy | Score | Fluency |
|---|---|---|
| All meaning | 5 | Flawless language |
| Most meaning | 4 | Good language |
| Much meaning | 3 | Non-native language |
| Little meaning | 2 | Disfluent language |
| None | 1 | Incomprehensible |


*Numeric scale for judging adequacy and fluency.*


For a long time, methods for evaluating human and MT quality have been disconnected. The comparison between them was impossible. In recent years, a framework called multidimensional quality metrics (MQM) [9] was developed for evaluating the quality of both human and machine translations. It includes over 100 issue types that cover all of the major translation quality evaluation metrics. For the specific translation quality judgement task, relevant issues may be chosen from MQM. The focus of this section is only on the evaluation of MT quality, whereas human translation is taken as the gold standard.

#### **4.1 Manual evaluation**

The most common option for judging and measuring machine translation quality is human evaluation. The quality of MT output is judged by experts in translation and linguistics from two different perspectives. The first perspective is the degree of adherence to the target text and target language norms, referring, for example, to features such as grammaticality and clarity. This quality evaluation perspective is known as fluency. When judging fluency, the source text is not relevant: the evaluators have access only to the translation being judged and not to the source data. Fluency requires an expert fluent only in the target language. The second perspective judges adherence to the source text norms and meaning, in terms of how well the target text represents the informational content of the source text. It is known as adequacy (also called accuracy). The evaluators have access to the source text and the translations being judged. Frequently, the context of a sentence is also taken into account. The evaluators must be bilingual in both the source and target languages. Adequacy and fluency are usually judged on a 5-point scale, as given in **Table 3**.

Human evaluation is time-consuming and expensive. It is also inherently subjective. To alleviate the problem of subjectivity, several experts are usually asked to evaluate the translations in the same evaluation set, and their evaluations are finally aggregated statistically.

#### **4.2 Automatic evaluation**

MT systems are rarely static, and they tend to be improved over time as resources grow and bugs are fixed. The evaluation needs to be repeated many times. Automatic evaluation metrics are cost-free alternatives to human evaluation. They are used commonly during the development of MT systems to estimate improvement. They are also applicable to compare different MT systems. While using automatic metrics to judge the translation quality, it is important to understand


what their scores mean. They rely on the idea that MT quality in itself should approach human quality. Automatic metrics depend on the availability of human reference translations. They evaluate the output of MT systems by comparing it to the reference translation. As there is great variability even in human translation, it is important to have several human reference translations for each machine-translated sentence to be evaluated. Evaluation metrics then provide evaluation scores based on the most similar reference translation.

Standard and recently proposed automatic metrics for MT evaluation will be discussed in the continuation of this section. Statistical correlation coefficients are used to see how close automatic evaluation is to manual judgements. Three correlation coefficients will be described later in the section. Machine translation, coupled with subsequent post-editing, has become a widely accepted method in the translation industry. This type of translation workflow will be discussed at the end of this section.

#### **5. Basic metrics for translation evaluation in MT**

An obvious method for evaluation is to look at the translation and judge by hand whether it is correct or not. To get reliable judgements, the evaluators should be appropriately qualified. From a practical point of view, manual evaluation, performed by translation experts, is expensive and takes time. What is needed are automatic metrics that are quick and cheap to use and approximate human judgements accurately. De facto standard metrics, used in the MT community, are BLEU, NIST, METEOR, and TER. All these metrics need reference translations because they compare the MT output with reference translations and provide comparison scores. If reference translations are available, these metrics can be used to evaluate the output of any number of systems quickly, without the need for human intervention. Let us take an example where the reference translation is "Dve leti že študiram angleščino" and the MT output to be evaluated is "Angleščino študiram dve leti". If we compute the precision, we get:

$$precision = \frac{correct}{length\_o} = \frac{4}{4} = 100\%.\tag{3}$$

*correct* counts the number of correctly translated words, and *lengtho* is the length of machine translation output. For the same example, the recall is:

$$recall = \frac{correct}{length\_r} = \frac{4}{5} = 80\%.\tag{4}$$

*lengthr* is the length of reference translation. F-measure results in:

$$F\_1 = \frac{precision \cdot recall}{\frac{precision + recall}{2}} \approx 89\%.\tag{5}$$
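The three measures can be sketched on the running example as follows; counting *correct* as the clipped, case-insensitive bag-of-words overlap is an assumption about how matched words are counted:

```python
from collections import Counter

def evaluate(output, reference):
    """Precision, recall, and F-measure over matched words.
    Assumption: 'correct' is the clipped bag-of-words overlap,
    compared case-insensitively."""
    out = output.lower().split()
    ref = reference.lower().split()
    correct = sum((Counter(out) & Counter(ref)).values())
    precision = correct / len(out)   # correct / length_o
    recall = correct / len(ref)      # correct / length_r
    f1 = (precision * recall) / ((precision + recall) / 2)
    return precision, recall, f1

p, r, f = evaluate("Angleščino študiram dve leti",
                   "Dve leti že študiram angleščino")
# p = 1.0 (100 %), r = 0.8 (80 %), f ≈ 0.889
```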

matching. Matching is done in three stages. The first stage is exact matching: strings that are identical in the reference and the translation are aligned. Words that are not matched are stemmed in the second stage. Stemming is the process of reducing inflected words to their word stem by cutting off the ends of words. Words with the same morphological root are aligned after stemming. In the last stage, unaligned words which are found to be synonyms are aligned, according to WordNet.
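The three matching stages can be sketched as follows; the crude suffix-stripping stemmer and the tiny synonym set (standing in for WordNet synsets) are illustrative assumptions:

```python
def crude_stem(word):
    # stand-in stemmer: strip a few common suffixes (illustrative only)
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def meteor_align(out_words, ref_words, synonyms=()):
    """Toy three-stage METEOR-style unigram alignment: exact match,
    then stem match, then synonym match. `synonyms` is a collection
    of frozensets standing in for WordNet synsets (an assumption)."""
    unmatched_out = set(range(len(out_words)))
    unmatched_ref = set(range(len(ref_words)))
    pairs = []

    def run_stage(same):
        for i in sorted(unmatched_out):
            for j in sorted(unmatched_ref):
                if same(out_words[i], ref_words[j]):
                    pairs.append((i, j))
                    unmatched_out.discard(i)
                    unmatched_ref.discard(j)
                    break

    run_stage(lambda a, b: a == b)                           # stage 1: exact
    run_stage(lambda a, b: crude_stem(a) == crude_stem(b))   # stage 2: stems
    run_stage(lambda a, b: any(a in s and b in s for s in synonyms))  # stage 3
    return sorted(pairs)

pairs = meteor_align(
    ["the", "kids", "studies", "hard"],
    ["the", "children", "studied", "hard"],
    synonyms=[frozenset({"kids", "children"})],
)
# all four words end up aligned: [(0, 0), (1, 1), (2, 2), (3, 3)]
```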

**Table 5.**

| Reference | | | dve | leti | že | študiram | angleščino |
|---|---|---|---|---|---|---|---|
| Edit | I | I | M | M | D | D | D |
| Output | angleščino | študiram | dve | leti | | | |

*WER computation for the MT output "Angleščino študiram dve leti", if the reference is "Dve leti že študiram angleščino".*

$$WER = \frac{0.5 \cdot 2 + 0.5 \cdot 3}{5} = 50\%$$
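The minimum-edit computation behind this example can be sketched with standard dynamic programming; the weights (substitution 1, insertion/deletion 0.5) follow the text, and lower-casing before comparison is an assumption:

```python
def weighted_wer(output, reference):
    """WER with substitution cost 1 and insertion/deletion cost 0.5,
    normalised by the reference length."""
    out = output.lower().split()
    ref = reference.lower().split()
    INS = DEL = 0.5
    SUB = 1.0
    # dp[i][j]: minimum edit cost to turn out[:i] into ref[:j]
    dp = [[0.0] * (len(ref) + 1) for _ in range(len(out) + 1)]
    for i in range(1, len(out) + 1):
        dp[i][0] = i * DEL
    for j in range(1, len(ref) + 1):
        dp[0][j] = j * INS
    for i in range(1, len(out) + 1):
        for j in range(1, len(ref) + 1):
            match = dp[i - 1][j - 1] + (0.0 if out[i - 1] == ref[j - 1] else SUB)
            dp[i][j] = min(match, dp[i - 1][j] + DEL, dp[i][j - 1] + INS)
    return dp[len(out)][len(ref)] / len(ref)

wer = weighted_wer("Angleščino študiram dve leti",
                   "Dve leti že študiram angleščino")
# wer == 0.5, i.e. 50 %
```

For the running example the minimum cost is 0.5·2 + 0.5·3 = 2.5 over a reference of length 5, i.e. 50%.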

WordNet [13] is a large lexical database of synonyms (called synsets). In WordNet, synsets are interlinked by means of conceptual-semantic and lexical relations. METEOR does not use higher-order *n*-grams, as *n*-gram counts do not require an explicit word-to-word matching. In METEOR, an explicit measure of the level of grammaticality is used. It captures directly how good the structure of the matched words in the machine translation is in relation to the reference.

Word error rate (WER) metric was first used to evaluate automatic speech recognition. It counts the minimum number of edits needed to change the evaluated translation so that it matches the references exactly, normalised by the average length of the references. The minimum number of edits is also called Levenshtein distance. Possible edits are insertion (*I*), deletion (*D*), and substitution (*S*) of single words. Matched words are denoted with M. Different edits can have different weights. For example, substitution is usually weighted at unity, but deletion and insertion are both weighted at 0.5:

$$WER = \frac{S + 0.5 \cdot D + 0.5 \cdot I}{length_r}. \tag{7}$$

**Table 5** contains the calculation of WER for our example.

Translation edit rate (TER) metric [14] is a derivate from the WER. It uses an additional edit step, namely, shifts of word sequences (*Shift*). A shift moves a contiguous sequence of words within the evaluated translation to another location within the translation. All edits have equal cost. If more than one reference is available, and since the minimum number of edits needed to modify the translation is called for, only the number of edits to the closest reference is measured. TER is normalised by the average length of the reference:

$$TER = \frac{S + D + I + Shift}{length_r}. \tag{8}$$

Position-independent error rate (PER) is another derivate from WER, which treats the reference and translation output as bags of words. Words from the translation are aligned to words in the reference, ignoring the position.

#### **6. Advanced metrics for translation evaluation in MT**

Although BLEU, NIST, METEOR, and TER metrics are used most frequently in the evaluation of MT quality, new metrics emerge almost every year. There is a


Based on the given measures, the quality of the translation is good, as reordering is not penalised. This is not always a good decision. For example, the MT output "Dve angleščino študiram leti" will get the same evaluation result, even though the translation is disfluent.

BLEU [10] measures the overlap of unigrams (single words) and higher-order *n*-grams between the MT output and reference translations. It is defined as follows:

$$BLEU = \min\left(1, \frac{length\_o}{length\_r}\right) \left(\prod\_{i=1}^{4} precision\_i\right)^{\frac{1}{4}}.\tag{6}$$

The main component of BLEU is *n*-gram precision, i.e. *precisioni*. It is calculated as the ratio between matched *n*-grams and the total number of *n*-grams in the evaluated translation. Precision is calculated separately for each *n*-gram order, and the precisions are combined via geometric averaging. The highest *n*-gram order is commonly defined to be four (four words in a sequence). Higher-order *n*-grams are used as an indirect measure of a translation's level of grammatical well-formedness. The BLEU metric computes the modified precision score, weighted by the brevity penalty, which punishes sentences that are shorter than the reference. The final scores range from 0 to 1. **Table 4** contains the calculation of the BLEU score for our example.

BLEU is typically computed over the entire corpus, not single sentences. It is important to point out that very few translations will attain a score of 1 unless they are identical to a reference translation. For this reason, even a human translator will not necessarily score 1, as there is great variability of possible correct translations. In this sense, it is also important to note that having more reference translations per sentence is highly welcome, as it will increase the BLEU score. NIST [11] is a close derivative of BLEU.
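As a sketch of Eq. (6), the score for the running example can be computed in a few lines of Python. This is a simplified illustration, not the official BLEU implementation: it assumes whitespace tokenisation, a single reference, and the min-based brevity penalty of Eq. (6) (the original BLEU paper uses an exponential penalty).

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """BLEU as in Eq. (6): min-based brevity penalty times the
    geometric mean of the 1..4-gram precisions (single reference)."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_ngrams.values())
        # clipped matches: an n-gram counts at most as often as in the reference
        matched = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        precisions.append(matched / total if total else 0.0)
    brevity = min(1.0, len(hyp) / len(ref))
    geo_mean = 1.0
    for p in precisions:
        geo_mean *= p
    geo_mean **= 1.0 / max_n
    return brevity * geo_mean, precisions, brevity

score, precisions, bp = bleu("Angleščino študiram dve leti",
                             "Dve leti že študiram angleščino")
# precisions 4/4, 1/3, 0/2, 0/1; brevity penalty 4/5; BLEU = 0
```

Because the 3-gram and 4-gram precisions are zero, the geometric mean, and hence BLEU, is zero for this sentence, which is why BLEU is normally computed over a whole corpus.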

Both metrics, BLEU and NIST, focus only on *n*-gram precision and disregard recall. Recall is the ratio between matched *n*-grams and the total number of *n*-grams in the reference translation. The METEOR metric [12] combines precision and recall. The authors of METEOR argue that the brevity penalty in BLEU does not compensate adequately for the lack of recall. METEOR computes a score only for unigram matching.


#### **Table 4.**

*BLEU score computation for the MT output "Angleščino študiram dve leti", if the reference is "Dve leti že študiram angleščino".*

*Machine Translation and the Evaluation of Its Quality DOI: http://dx.doi.org/10.5772/intechopen.89063*





| Metric | Score |
|---|---|
| 1-gram precision | 4/4 |
| 2-gram precision | 1/3 |
| 3-gram precision | 0/2 |
| 4-gram precision | 0/1 |
| Brevity penalty | 4/5 |
| BLEU | 0% |


**Table 5.**
*WER computation for the MT output "Angleščino študiram dve leti", if the reference is "Dve leti že študiram angleščino".*

Matching is done in three stages. The first stage is exact matching: strings that are identical in the reference and the translation are aligned. Words that are not matched are stemmed in the second stage. Stemming is the process of reducing inflected words to their word stem by cutting off the ends of words. Words with the same morphological root are aligned after stemming. In the last stage, unaligned words which are found to be synonyms are aligned, according to WordNet. WordNet [13] is a large lexical database of synonyms (called synsets). In WordNet, synsets are interlinked by means of conceptual-semantic and lexical relations. METEOR does not use higher-order *n*-grams, as *n*-gram counts do not require an explicit word-to-word matching. Instead, METEOR uses an explicit measure of the level of grammaticality, which captures directly how well the structure of the matched words in the machine translation corresponds to that of the reference.

Word error rate (WER) metric was first used to evaluate automatic speech recognition. It counts the minimum number of edits needed to change the evaluated translation so that it matches the references exactly, normalised by the average length of the references. The minimum number of edits is also called Levenshtein distance. Possible edits are insertion (*I*), deletion (*D*), and substitution (*S*) of single words. Matched words are denoted with M. Different edits can have different weights. For example, substitution is usually weighted at unity, but deletion and insertion are both weighted at 0.5:

$$WER = \frac{S + 0.5 \cdot D + 0.5 \cdot I}{length\_r}.\tag{7}$$
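Eq. (7) can be sketched as a weighted Levenshtein distance. The function below is a minimal illustration, assuming whitespace tokenisation and the 0.5 weights for insertion and deletion mentioned above:

```python
def wer(hypothesis, reference, sub=1.0, dele=0.5, ins=0.5):
    """Weighted Levenshtein distance of Eq. (7), normalised by the
    reference length (substitution 1, deletion/insertion 0.5)."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    # d[i][j] = cheapest way to align the first i hypothesis words
    # with the first j reference words
    d = [[0.0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(1, len(hyp) + 1):
        d[i][0] = i * ins
    for j in range(1, len(ref) + 1):
        d[0][j] = j * dele
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            match = 0.0 if hyp[i - 1] == ref[j - 1] else sub
            d[i][j] = min(d[i - 1][j - 1] + match,  # match / substitution
                          d[i - 1][j] + ins,        # extra hypothesis word
                          d[i][j - 1] + dele)       # missing reference word
    return d[-1][-1] / len(ref)

# best monotone alignment keeps "dve leti"; the remaining 2 extra and
# 3 missing words cost 0.5 each -> 2.5 / 5
print(wer("Angleščino študiram dve leti",
          "Dve leti že študiram angleščino"))  # 0.5
```

Note how the monotonicity of the alignment penalises the reordering that the earlier unigram measures ignored.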

**Table 5** contains the calculation of WER for our example.

Translation edit rate (TER) [14] is a derivative of WER. It uses an additional edit step, namely shifts of word sequences (*Shift*). A shift moves a contiguous sequence of words within the evaluated translation to another location within the translation. All edits, including shifts, have equal cost:

$$TER = \frac{S + D + I + Shift}{length\_r}.\tag{8}$$

If more than one reference is available, then, since the minimum number of edits is sought, only the number of edits to the closest reference is measured. TER is normalised by the average length of the references. Position-independent error rate (PER) is another derivative of WER, which treats the reference and the translation output as bags of words. Words from the translation are aligned to words in the reference, ignoring their position.

#### **6. Advanced metrics for translation evaluation in MT**

Although BLEU, NIST, METEOR, and TER metrics are used most frequently in the evaluation of MT quality, new metrics emerge almost every year. There is a

metrics shared task, held annually at the WMT Conference, where new evaluation metrics are proposed [15, 16, 17]. Those which exhibit a high correlation with human judgement will be presented from the pool of recently defined metrics.

CDER [18] is a more advanced metric that is concerned with edits and Levenshtein distance. It calculates the distance between two strings $e\_1^I$ and $\tilde{e}\_1^L$ using the auxiliary quantity $D(i,l)$, defined as:

$$D(i,l) \coloneqq d\_{CD}\left(e\_1^i, \tilde{e}\_1^l\right).\tag{9}$$

$$D(0, 0) = 0,\tag{10}$$


$$D(i,l) = \min\left\{\begin{aligned} &D(i-1,l-1) + c\_{SUB}(e\_i, \tilde{e}\_l), \\ &D(i-1,l) + 1, \\ &D(i,l-1) + 1, \\ &\min\_{i'} D(i',l) + 1. \end{aligned}\right\}\tag{11}$$

In addition to the classical edit operations (i.e. insertion, deletion, and substitution), it models block reordering explicitly as an additional edit operation. As a further improvement, it introduces word-dependent substitution costs $c\_{SUB}(e\_i, \tilde{e}\_l)$: the observation that substituting a word with a similar one is likely to affect translation quality less than substituting it with a completely different word is thus accounted for in the metric score.

Tolerant BLEU [19] and LeBLEU [20] are derivatives of BLEU with a relaxation of the strict word *n*-gram matching that is used in standard BLEU. Tolerant BLEU applies a specific distance measure that requires an exact match only in the middle of words, not in words as a whole. LeBLEU uses a distance measure based on characters. It also facilitates a fuzzy matching of longer chunks of text that allows, for example, matching two independent words with a compound.

CharacTER [21] is a derivative of TER:

$$CharacTER = \frac{S\_{char} + D\_{char} + I\_{char} + Shift\,Cost}{length\_{o,char}}.\tag{12}$$

It calculates the edit rate on the character level, whereas shift edits are still performed on the word level. First, word-level shifts are performed; words are then split into characters; and, finally, the edit distance is calculated on characters and the *Shift Cost* is added. In addition, the length of the translation in characters ($length\_{o,char}$), instead of the reference length, is used for normalising the edit distance, which effectively counters the issue that shorter translations normally achieve a lower TER. If we have two translations of different lengths but with the same edit distance, they will obtain the same TER, as the length of the reference remains unchanged. In the same case, the longer translation will obtain a lower score if the edit distance is normalised by the length of the translation.

METEOR universal [22] is a derivative of METEOR. It adds a fourth matching stage: paraphrase matching. For each target phrase $e\_1$, all source phrases $f$ that $e\_1$ translates are found. Each alternate phrase ($e\_2 \neq e\_1$) that translates $f$ is considered a paraphrase with probability $P(f|e\_1) \cdot P(e\_2|f)$. The cumulative probability of $e\_2$ being a paraphrase of $e\_1$ is the sum over all possible pivot phrases $f$:

$$P(e\_2|e\_1) = \sum\_{f} P(f|e\_1) \cdot P(e\_2|f).\tag{13}$$


Phrases are matched if they are listed as paraphrases in a language appropriate paraphrase table. Paraphrases are extracted automatically from the parallel corpora used to train statistical MT systems.
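The pivoting of Eq. (13) can be sketched with toy phrase tables. All phrases and probabilities below are invented for illustration; real tables are extracted from parallel corpora:

```python
# Toy phrase tables (made-up probabilities): P(f|e1) and P(e2|f)
p_f_given_e1 = {"maison": 0.7, "domicile": 0.3}   # pivot phrases f for e1
p_e2_given_f = {
    "maison":   {"house": 0.6, "home": 0.4},
    "domicile": {"home": 0.9, "residence": 0.1},
}

def paraphrase_prob(p_f_e1, p_e2_f):
    """Eq. (13): P(e2|e1) = sum over pivot phrases f of P(f|e1) * P(e2|f)."""
    probs = {}
    for f, pf in p_f_e1.items():
        for e2, pe2 in p_e2_f.get(f, {}).items():
            probs[e2] = probs.get(e2, 0.0) + pf * pe2
    return probs

print(paraphrase_prob(p_f_given_e1, p_e2_given_f))
# e.g. P("home"|e1) = 0.7 * 0.4 + 0.3 * 0.9 = 0.55
```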

The BEER [23] metric provides a linear combination of different features:

$$BEER(h, r) = \sum\_{i} w\_i \times \phi\_i(h, r) \tag{14}$$

where *h* is the system output and *r* is the reference. Each feature $\phi\_i(h, r)$ has a weight $w\_i$ assigned to it. The first group of features consists of adequacy features. These features use precision, recall, and F1-score for different counts. The F1-score is the harmonic mean of precision and recall multiplied by the constant 2, which scales the F1-score to 1 when both recall and precision are 1. In BEER, function words and content words (nonfunction words) are counted separately. By differentiating function and nonfunction words, a better estimate is obtained of which words are more important and which are less. The most important adequacy feature is the count of matching character *n*-grams. Using it, translations are considered partially correct even if they did not get the morphology completely right. Character *n*-grams of order 6 are used. The second group of features comprises ordering features. Word order is evaluated by presenting the reordering as a permutation and calculating the distance to the ideal monotone permutation. Permutation trees are used to estimate long-distance reordering.
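A toy sketch of the linear model in Eq. (14). The two features and the weights below are invented for illustration; the real BEER metric uses many more features and learns the weights from human rankings:

```python
def char_ngram_f1(h, r, n=6):
    """Toy adequacy feature: F1 over the sets of character n-grams (n=6)."""
    hg = {h[i:i + n] for i in range(len(h) - n + 1)}
    rg = {r[i:i + n] for i in range(len(r) - n + 1)}
    if not hg or not rg:
        return 0.0
    prec = len(hg & rg) / len(hg)
    rec = len(hg & rg) / len(rg)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def word_match_ratio(h, r):
    """Toy feature: fraction of hypothesis words present in the reference."""
    hw, rw = h.split(), set(r.split())
    return sum(w in rw for w in hw) / len(hw)

# Made-up weights; real BEER learns them from human judgements
WEIGHTS = [(char_ngram_f1, 0.6), (word_match_ratio, 0.4)]

def beer(h, r):
    """Eq. (14): a linear combination of feature scores."""
    return sum(w * phi(h, r) for phi, w in WEIGHTS)
```

A perfect translation scores 1.0 under this toy weighting, since both features reach 1 when hypothesis and reference are identical.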

ChrF [24] is another, even simpler, metric based on character *n*-grams. It computes an F-score, based on precision and recall, using character *n*-grams:

$$ChrF\_{\beta} = \left(1 + \beta^2\right)\frac{ChrP \cdot ChrR}{\beta^2 \cdot ChrP + ChrR}.\tag{15}$$

*ChrP* and *ChrR* are character *n*-gram precision and recall, averaged over all *n*-gram orders. *ChrP* is the percentage of character *n*-grams in the translation which have a counterpart in the reference. *ChrR* is the percentage of character *n*-grams in the reference which are also present in the translation. In the final score, the parameter *β* is used to give more importance to recall than to precision. In [25], the optimal value for *β* was found to be 2, and the metric is called chrF2. In this metric, recall has twice the importance of precision. WordF2 is a similar metric, where words are used instead of characters. Different weightings for *n*-grams were also investigated; uniform weights are the most promising for machine translation evaluation.
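A minimal sketch of Eq. (15), assuming precision and recall are averaged uniformly over character n-gram orders 1 to 6:

```python
from collections import Counter

def chr_f(hyp, ref, beta=2.0, max_n=6):
    """ChrF (Eq. 15): F-score over character n-grams, averaged over
    n-gram orders; beta=2 weights recall twice as much as precision."""
    def grams(s, n):
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hg, rg = grams(hyp, n), grams(ref, n)
        # clipped overlap of character n-grams
        overlap = sum(min(c, rg[g]) for g, c in hg.items())
        if hg:
            precisions.append(overlap / sum(hg.values()))
        if rg:
            recalls.append(overlap / sum(rg.values()))
    chr_p = sum(precisions) / len(precisions) if precisions else 0.0
    chr_r = sum(recalls) / len(recalls) if recalls else 0.0
    if chr_p + chr_r == 0:
        return 0.0
    return (1 + beta**2) * chr_p * chr_r / (beta**2 * chr_p + chr_r)

print(chr_f("angleščino študiram dve leti",
            "angleščino študiram dve leti"))  # identical strings -> 1.0
```

Because matching happens on characters, a hypothesis with a slightly wrong inflection still receives partial credit, which is why such metrics work well for morphologically rich languages.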

DREEM [26] is a new metric based on distributed representations of words and sentences generated by deep neural networks. Neural networks are models that imitate human brains to recognise patterns in sequences. DREEM employs three different types of word and sentence representations: One-hot representations, distributed word representations learned from a neural network model, and distributed sentence representations computed with a recursive autoencoder. The final score is the cosine similarity of the representation of the translation and the reference, multiplied with a length penalty.

RATATOUILLE [27] is a metric combination of BLEU, BEER, METEOR, and a few more metrics, out of which METEOR-WSD is a novel contribution. METEOR-WSD is an extension of METEOR that includes synonym mappings.

In this section, state-of-the-art MT evaluation metrics were reviewed briefly. Only their most important characteristics were presented. For a more elaborate description of each metric, the reader is advised to follow the provided references to the literature.

It should be noted that despite the well-known problems with BLEU, and the availability of many other metrics, MT system developers have continued to use BLEU as the primary measure of translation quality.

Today, different MT systems are available for use in practice. Usually, the qualities of different MT systems are compared by computing translation quality scores on a predetermined evaluation set. The question arises whether, if there is a difference in quality on the evaluation set, one can be confident that the MT systems indeed differ in quality. A difference in quality on an evaluation set may be just the result of happenstance. Research work on statistical significance testing for MT evaluation was done by Koehn [28], where the bootstrap resampling method is proposed to compute statistical significance intervals for evaluation metrics on evaluation data. Statistical significance usually refers to the notion of the p-value: the probability that the observed difference in quality would occur by chance given the null hypothesis.

#### **7. Correlation between automatic and human evaluation**

Human judgements of translation quality are usually trusted as the gold standard, and the aim of an automatic evaluation metric is to produce quality estimates that are as close as possible to human judgements. As there are many different evaluation metrics, the user needs to decide which automatic evaluation metric to trust the most. Correlation coefficients are commonly used to measure the closeness of automatic metric scores and manual judgements. Manual MT quality judgements on a number of test data are needed for comparison. Correlation coefficients are then computed on the system level and/or the segment level.

System-level comparison is done to compare different MT systems in general. First, each system gets a cumulative rank that reflects how high the annotators ranked that system. The metric scores of systems are also converted into ranks, and then the Spearman's rank correlation coefficient *ρ* is computed as [16]:

$$\rho = 1 - \frac{\mathbf{6} \cdot \sum\_{i=1}^{n} d\_i^2}{n \cdot (n^2 - 1)}. \tag{16}$$


$d\_i$ is the difference between the annotator's rank and the metric's rank for system *i*. The number of systems is denoted with *n*. The possible values of *ρ* range between 1 (where all systems are ranked in identical order) and −1 (where the systems are ranked in reverse order). Metrics with values of Spearman's *ρ* closer to 1 are better. The Spearman's correlation coefficient *ρ* is sometimes too harsh [17]: if a metric disagrees with humans in ranking two systems of very similar quality, the *ρ* coefficient penalises this equally as if the systems were very distant in their quality. Pearson's correlation coefficient *r* is therefore sometimes preferred [17]. It measures the strength of the linear relationship between a metric's scores and human scores:

$$r = \frac{\sum\_{i=1}^{n} \left(H\_i - \overline{H}\right) \cdot \left(M\_i - \overline{M}\right)}{\sqrt{\sum\_{i=1}^{n} \left(H\_i - \overline{H}\right)^2} \sqrt{\sum\_{i=1}^{n} \left(M\_i - \overline{M}\right)^2}}.\tag{17}$$

*H* is the vector of the annotator's scores of all systems, and *M* is the vector of the corresponding scores as predicted by the given metric. $\overline{H}$ and $\overline{M}$ are their means, respectively. **Figure 4** shows Pearson's correlations of selected system-level metrics and MT systems built for different language pairs [15]. We can see that in the majority of cases the metrics correlate well with human judgements.
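Eqs. (16) and (17) can be sketched directly. The rank and score vectors below are invented for illustration:

```python
import math

def spearman_rho(human_ranks, metric_ranks):
    """Eq. (16): rank correlation from the rank differences d_i."""
    n = len(human_ranks)
    d_sq = sum((h - m) ** 2 for h, m in zip(human_ranks, metric_ranks))
    return 1 - 6 * d_sq / (n * (n**2 - 1))

def pearson_r(human_scores, metric_scores):
    """Eq. (17): linear correlation between the two score vectors."""
    n = len(human_scores)
    mh = sum(human_scores) / n
    mm = sum(metric_scores) / n
    cov = sum((h - mh) * (m - mm) for h, m in zip(human_scores, metric_scores))
    norm_h = math.sqrt(sum((h - mh) ** 2 for h in human_scores))
    norm_m = math.sqrt(sum((m - mm) ** 2 for m in metric_scores))
    return cov / (norm_h * norm_m)

# Four MT systems: the metric swaps the ranks of the two best systems
print(spearman_rho([1, 2, 3, 4], [2, 1, 3, 4]))  # 0.8
# Three systems with made-up human and metric scores
print(pearson_r([0.2, 0.4, 0.6], [0.3, 0.5, 0.9]))
```

Note how *ρ* only sees ranks (the swap costs 0.2 regardless of how close the systems are), while *r* is sensitive to the actual score distances.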


**Figure 4.**
*Pearson's correlation coefficient* r *for selected evaluation metrics used for different MT systems.*

The quality of a metric's segment-level scores is usually measured by means of Kendall's *τ* rank correlation coefficient [17]. Let $r(\cdot)$ denote the annotator's rank and $m(\cdot)$ the metric's rank. To compute Kendall's *τ*, the annotators rank all the translations of each segment from best to worst. Pairs $(a, b)$ are then built, where one system's translation $a$ of a particular segment is judged to be (strictly) better than the other system's translation $b$:

$$\text{Pairs} := \{(a, b) \,|\, r(a) < r(b)\}. \tag{18}$$

Afterwards, all concordant and discordant pairs are counted:

$$\begin{aligned} Con &:= \{(a, b) \in Pairs \,|\, m(a) > m(b)\}, \\ Dis &:= \{(a, b) \in Pairs \,|\, m(a) < m(b)\}. \end{aligned}\tag{19}$$

In a concordant pair, a human annotator and an automatic metric agree in ranking, and in a discordant pair, they disagree. Finally, Kendall's *τ* is computed as:

$$\tau = \frac{|Con| - |Dis|}{|Con| + |Dis|}. \tag{20}$$

The *τ* value lies between −1 (a metric always predicted a different order than humans did) and 1 (a metric always predicted the same order as humans). Metrics with higher *τ* are better.
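The counting in Eqs. (18)–(20) is simple to implement. The following sketch is our own illustration (the system names and scores are invented); it treats *r*(·) as human ranks and *m*(·) as metric scores, so a pair is concordant when the metric assigns the human-preferred translation the higher score:

```python
def kendall_tau(r, m):
    """Kendall's tau, Eq. (20), from human ranks r (1 = best) and metric scores m."""
    systems = list(r)
    # Eq. (18): pairs where the human judged a strictly better than b
    pairs = [(a, b) for a in systems for b in systems if r[a] < r[b]]
    con = sum(1 for a, b in pairs if m[a] > m[b])  # Eq. (19): metric agrees
    dis = sum(1 for a, b in pairs if m[a] < m[b])  # Eq. (19): metric disagrees
    return (con - dis) / (con + dis)

r = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4}              # human ranking
m = {"sysA": 0.40, "sysB": 0.31, "sysC": 0.33, "sysD": 0.25}  # metric swaps B and C
print(kendall_tau(r, m))  # (5 - 1) / (5 + 1)
```

Here the metric disagrees with the annotators on exactly one of the six pairs, giving τ = 4/6 ≈ 0.67.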

In this section, no analysis of correlations for automatic metrics is presented, as it depends on many parameters. In general, all evaluation metrics presented in this chapter correlate well with human judgements. It is only worth mentioning that, for inflected languages, metrics that work on character level correlate better with human judgements than metrics that work only on word level.

#### **8. Post-editing machine translation**

In recent years, MT has become accepted more widely in the translation industry [29]. The most common workflow involves the use of machine-translated text as a raw translation that is corrected or post-edited by a translator. Post-editing (PE) tools and practices for such workflows are being developed in large multilingual organisations, such as the European Commission [30]. The researchers in [31] report that 30% of the companies in the translation industry currently use MT. The majority (70%) of the MT users combine MT with PE at least some of the time.

Post-editing MT is attractive because it has been shown to be faster than human translation. It is faster than translation from scratch and even faster than translation assisted by a translation memory [32]. Speed is not the only factor that should be taken into account when assessing the post-editing process. More recent studies have looked at ways of determining post-editing effort. In [33], three levels of post-editing effort are defined: temporal effort, cognitive effort, and technical effort. Temporal effort is the time needed to post-edit a given text, cognitive effort is the activation of cognitive processes during post-editing, and technical effort means operations such as insertions and deletions that are performed during post-editing. All three levels of post-editing effort are influenced greatly by the translation quality. The use of PE and MT also raises the question about the quality of final translations. Has the quality improved, or is it worse?


*Machine Translation and the Evaluation of Its Quality DOI: http://dx.doi.org/10.5772/intechopen.89063*


As PE effort is related strongly to MT quality, derivatives of standard quality metrics are developed, which are concerned more with PE effort. Human-mediated translation error rate (HTER) [14] is a human-in-the-loop variant of TER. Instead of a reference, post-edited translation is used in the comparison. HTER centres on what edits are to be made to convert a translation into its post-edited version. It is computed as the ratio between the number of edit steps and the number of words in the post-edited version. HTER can be used as a measure of technical PE effort: The fewer changes necessary to convert the translation into its post-edited version, the less the effort required from the translator.
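As a sketch of this idea (a simplification of our own: full TER also counts phrase shifts as single edits, whereas plain word-level Levenshtein distance is used here), HTER can be approximated as the number of edits divided by the length of the post-edited version:

```python
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions, and substitutions."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(hyp)][len(ref)]

def hter(mt_output, post_edited):
    """Edit steps divided by the number of words in the post-edited version."""
    mt, pe = mt_output.split(), post_edited.split()
    return word_edit_distance(mt, pe) / len(pe)

mt = "the cat sat on mat"
pe = "the cat sat on the mat"  # the post-editor inserted one word
print(hter(mt, pe))  # one insertion over six post-edited words
```

The lower the value, the less technical effort the translator had to invest.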

HTER is concerned more with the final translation and not the process. In [34] a metric called actual edit rate (AER) is proposed, which measures the translator's actual edit operations, which may involve more complex tasks, for example, applying corrections to previously post-edited parts of the text.

A study on PE of MT confirmed the relation between HTER and MT quality [34]. An increase in HTER was evident as the quality of the MT system decreased. In contrast, the authors did not establish any significant association between AER and MT quality; keyboard activity may not be as sensitive to MT quality as PE time. They also found a linear relationship between MT quality, measured by the BLEU score of the system, and post-editing speed: an increase of the BLEU score by one point resulted in a decrease in post-editing time of about 0.16 seconds/word. Their study also shows the correlation between the quality of machine translation output and the quality after post-editing. They confirmed that worse translation almost always leads to a worse result after post-editing. As the use of MT and PE workflows has increased, there is a growing demand for expertise in PE skills. Research on and teaching of skills specific to post-editing has become necessary. The authors in [31] emphasise the impact of "familiarity with translation technology" on the employability of future translators.
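To make the reported relationship concrete, here is a back-of-the-envelope calculation; the baseline post-editing speed is a made-up assumption, and only the 0.16 seconds/word-per-BLEU-point slope comes from the cited study [34]:

```python
# Hypothetical baseline: 4.0 s/word of post-editing time at the reference BLEU.
BASELINE_SECONDS_PER_WORD = 4.0
SLOPE = 0.16  # s/word saved per BLEU point, as reported in [34]

def pe_seconds_per_word(bleu_gain):
    """Post-editing time per word after improving BLEU by bleu_gain points."""
    return BASELINE_SECONDS_PER_WORD - SLOPE * bleu_gain

# A 5-point BLEU improvement on a 1000-word document:
saved_seconds = (pe_seconds_per_word(0) - pe_seconds_per_word(5)) * 1000
print(round(saved_seconds))  # roughly 800 seconds saved
```

Even a modest BLEU improvement thus translates into a measurable productivity gain over a whole document, which is why MT quality and PE effort are studied together.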

#### **9. Conclusion**

Machine translation is being used by millions of people on a daily basis. This chapter discusses different MT approaches that were developed over time. Currently, the most promising approach is neural machine translation. Although effective, it also suffers from some issues, such as scaling to larger vocabularies of words and the slow speed of training the models. Researchers continue to work on solving these problems and making translation a better service accessible to everyone.

The second part of the chapter describes how machine translation output is evaluated. The main characteristics of human and automatic MT evaluation were outlined. Human evaluation of MT output remains crucial to look for ideas to improve MT systems still further. On the other hand, automatic MT evaluation is

cheap and fast. In the chapter, traditional and advanced metrics for automatic MT evaluation were presented. Despite the well-known problems with BLEU, and the availability of many other metrics, MT system developers have continued to use BLEU as the primary measure of translation quality.

MT quality is continually improving. Despite that, there are still a number of flaws in machine translation output. To make the translation correct, post-editing machine translation output is proposed to be integrated into the translation processes. It is discussed at the end of the chapter.

Future research in MT will be devoted to neural machine translation. It is still not very well understood. Its inner workings are commonly seen as a black-box, which works as the neurons of the human brain. As the computing power NMT requires becomes more widely available, many different configurations can be examined to further improve the accuracy of machine translation.

Future effort in machine translation evaluation will be directed toward character-based metrics which show the highest correlation with human judgement at the system and segment levels.

Human translators worry about being replaced by machines. Machine translation, no matter how sophisticated, cannot match the accuracy of people. Human translators are also an important part of MT evolution, not only as post-editors but also as teachers that help MT systems become better and better.

#### **Acknowledgements**


The authors acknowledge the financial support from the Slovenian Research Agency (research core funding No. P2-0069).

#### **Author details**

Mirjam Sepesy Maučec\* and Gregor Donaj Faculty of Electrical Engineering and Computer Science, University of Maribor, Slovenia

\*Address all correspondence to: mirjam.sepesy@um.si

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Doherty S, Gaspari F, Groves D, van Genabith J. Mapping the industry. I: Findings on translation technologies and quality assessment. In: GALA. 2013

[2] Koehn P, Och FJ, Marcu D. Statistical phrase-based translation. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Association for Computational Linguistics; 2003. pp. 48-54

[3] Durrani N, Schmid H, Fraser A. A joint sequence translation model with integrated reordering. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Vol. 1. Association for Computational Linguistics; 2011. pp. 1045-1054

[4] Donaj G, Kačič Z. Language Modeling for Automatic Speech Recognition of Inflective Languages: An Applications-Oriented Approach Using Lexical Data. Springer; 2016

[5] Donaj G, Kačič Z. Contextdependent factored language models. EURASIP Journal on Audio, Speech, and Music Processing. 2017;**2017**(1):6

[6] Maučec MS, Donaj G. Morphosyntactic tags in statistical machine translation of highly inflectional language. In: Proceedings of the Artificial Intelligence and Natural Language Conference (AINL FRUCT); Saint-Petersburg, Russia. 2016. pp. 99-102

[7] Maučec MS, Brest J. Slavic languages in phrase-based statistical machine translation: A survey. Artificial Intelligence Review. 2019;**51**(1):77-117

[8] Maučec MS, Donaj G. Morphology in statistical machine translation from English to highly inflectional language. Information Technology and Control. 2018;**47**(1):63-74


[9] Lommel AR, Burchardt A, Uszkoreit H. Multidimensional quality metrics: A flexible system for assessing translation quality. In: Proceedings of ASLIB: Translating and the Computer. Vol. 35. 2013

[10] Papineni K, Roukos S, Ward T, Zhu W-J. Bleu: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics; 2002. pp. 311-318

[11] Doddington G. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the Second International Conference on Human Language Technology Research. Morgan Kaufmann Publishers Inc; 2002. pp. 138-145

[12] Lavie A, Agarwal A. Meteor: An automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of the Second Workshop on Statistical Machine Translation. Association for Computational Linguistics; 2007. pp. 228-231

[13] Fellbaum C. WordNet. In: Poli R, Healy M, Kameas A, editors. Theory and Applications of Ontology: Computer Applications. Springer; 2010. pp. 231-243

[14] Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J. A study of translation edit rate with targeted human annotation. In: Proceedings of Association for Machine Translation in the Americas. Vol. 200. 2006

[15] Ma Q, Bojar O, Graham Y. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In: Proceedings of the Third Conference on Machine Translation: Shared Task Papers. 2018. pp. 671-688


[16] Macháček M, Bojar O. Results of the WMT13 metrics shared task. In: Proceedings of the Eighth Workshop on Statistical Machine Translation. 2013. pp. 45-51

[17] Macháček M, Bojar O. Results of the WMT14 metrics shared task. In: Proceedings of the Ninth Workshop on Statistical Machine Translation. 2014. pp. 293-301

[18] Leusch G, Ueffing N, Ney H. Cder: Efficient MT evaluation using block movements. In: 11th Conference of the European Chapter of the Association for Computational Linguistics. 2006

[19] Libovickỳ J, Pecina P. Tolerant bleu: A submission to the wmt14 metrics task. In: Proceedings of the Ninth Workshop on Statistical Machine Translation. 2014. pp. 409-413

[20] Virpioja S, Grönroos S-A. Lebleu: N-gram-based translation evaluation score for morphologically complex languages. In: Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015. pp. 411-416

[21] Wang W, Peter J-T, Rosendahl H, Ney H. Character: Translation edit rate on character level. In: Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Volume 2. 2016. pp. 505-510

[22] Denkowski M, Lavie A. Meteor universal: Language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation. 2014. pp. 376-380

[23] Stanojevic M, Sima'an K. Beer: Better evaluation as ranking. In: Proceedings of the Ninth Workshop on Statistical Machine Translation. 2014. pp. 414-419

[24] Popović M. chrF: Character n-gram F-score for automatic MT evaluation. In: Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015. pp. 392-395

[25] Popović M. Chrf deconstructed: Beta parameters and n-gram weights. In: Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Volume 2. 2016. pp. 499-504

[26] Chen B, Guo H, Kuhn R. Multi-level evaluation for machine translation. In: Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015. pp. 361-365

[27] Marie B, Apidianaki M. Alignment-based sense selection in METEOR and the ratatouille recipe. In: Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015. pp. 385-391

[28] Koehn P. Statistical significance tests for machine translation evaluation. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004

[29] Way A. Quality expectations of machine translation. In: Translation Quality Assessment. Springer; 2018. pp. 159-178

[30] Bonet J. No rage against the machine. Languages and Translation. 2013;**6**(2)

[31] Gaspari F, Almaghout H, Doherty S. A survey of machine translation competences: Insights for translation technology educators and practitioners. Perspectives. 2015;**23**(3):333-358

[32] Plitt M, Masselot F. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics. 2010;**93**:7-16

[33] Krings HP, Shreve GM. Repairing Texts: Empirical Investigations of Machine Translation Post-Editing Processes. Vol. 5. Kent State University Press; 2001


[34] Sanchez-Torron M, Koehn P. Machine translation quality and post-editor productivity. In: AMTA 2016. Vol. 2016. p. 16

#### **Chapter 9**


## Pedestrian Detection and Tracking in Video Surveillance System: Issues, Comprehensive Review, and Challenges

*Ujwalla Gawande, Kamal Hajari and Yogesh Golhar*

### **Abstract**

Pedestrian detection and monitoring in a surveillance system are critical for numerous utility areas, which encompass unusual event detection, human gait, congestion or crowded vicinity evaluation, gender classification, fall detection in elderly humans, etc. Researchers' primary focus is to develop a surveillance system that can work in a dynamic environment, but there are major issues and challenges involved in designing such systems. These challenges occur at three different levels of pedestrian detection, viz. video acquisition, human detection, and its tracking. The challenges in acquiring video are, viz. illumination variation, abrupt motion, complex background, shadows, object deformation, etc. Human detection and tracking challenges are varied poses, occlusion, crowd density area tracking, etc. These result in a lower recognition rate. A brief summary of surveillance systems along with comparisons of pedestrian detection and tracking techniques in video surveillance is presented in this chapter. The publicly available pedestrian benchmark databases as well as the future research directions on pedestrian detection have also been discussed.

**Keywords:** pedestrian tracking, pedestrian detection, visual surveillance, pattern recognition, artificial intelligence

#### **1. Introduction**

In the word surveillance, the prefix sur is French for "over" and the root veiller means "to watch." In distinction to surveillance, Steve Mann in [1] introduced the term "sousveillance." Contrasting the prefix sur, sous means "under," i.e., it signifies that the camera is carried by a human physically (e.g., a camera mounted on the head). Surveillance and sousveillance are both used for continuous attentive observation of a suspect, prisoner, person, group, or ongoing behavior and activity in order to collect information. In order to improve conventional security systems, the use of surveillance systems has been increasingly encouraged by government and private organizations. Currently, surveillance systems have been widely investigated and used effectively in several applications like (a) transport systems (railway stations, airports, urban and rural motorway road networks), (b) government agencies (military base camps, prisons, strategic infrastructures, radar centers,

laboratories, and hospitals), (c) industrial environments, automated teller machines (ATMs), banks, shopping malls, and public buildings, etc. Most of the surveillance systems at public and private places depend on a human operator, who observes the video scene and detects any suspicious pedestrian activities [2, 3]. The term pedestrian refers to a person who is walking or running on the street. In some communities, a person using a wheelchair is also considered a pedestrian. The most challenging task for automatic video surveillance is to detect and track suspicious pedestrian activity. For a real-time dynamic environment, learning-based methods do not provide an appropriate solution for real-time scene analysis, because it is difficult to obtain prior knowledge about all the objects. Still, learning-based methods are adopted due to their accuracy and robust nature. In the literature, several researchers efficiently use deep-learning (DL) based models for classification in video surveillance over traditional approaches, viz. the perceptron model, probabilistic neural network (PNN), radial basis neural network (RBN), etc. Numerous learning-based techniques include the artificial neural network (ANN), support vector machine (SVM), AdaBoost, etc. These techniques require features such as histogram of oriented gradients (HOG), speeded-up robust features (SURF), local binary pattern (LBP), scale-invariant feature transform (SIFT), etc. to classify the type of object. Specifically, these features are represented by different deep learning algorithm versions such as deep belief networks (DBN), recurrent neural networks (RNN), generative adversarial networks (GANs), convolutional neural networks (CNN), restricted Boltzmann machines (RBM), AlphaGo, AlphaZero, capsule networks, bidirectional encoder representations from transformers (BERT), etc.

signals not openly transmitted in a distributed environment, (2) CCTV depends on strategic placements of cameras as per the geographical structure of workplace, (3) human observer is required for camera inputs to monitor the CCTV recorded footage [4]. The CCTV loses its primary advantage as an active, real-time medium, because the video footage can be used only after the fact or incident occurs, that can be used as a legal evidence or forensic tool. Next, in 1996, IP-based surveillance cameras were introduced by Axis, that overcomes the limitation of initial CCTV cameras such as (1) IP-based camera's transmits the raw images instead of voltage signals using the secure transmission channel of TCP/IP, (2) IP-camera comes along with the video analytics, i.e., camera itself can be used for analyzing the images, (3) Ethernet cable can be used as a medium for power supply instead of dedicated power supply, and (4) two-way bidirectional audio signals can be transmitted over a single dedicated network [5]. The recent surveillance system facilitates with

*Pedestrian Detection and Tracking in Video Surveillance System: Issues, Comprehensive Review…*

remote location monitoring on handheld device like mobile phones.

through the network and store in database.

*DOI: http://dx.doi.org/10.5772/intechopen.90810*

*A general framework of an automated visual surveillance system [7–9].*

**Figure 2.**

**165**

The video surveillance systems can be categories based on a camera system, application and architecture. The camera system includes single camera, multi camera, fixed camera, moving camera and hybrid camera systems, etc. The application-based system includes object tracking and recognition, ID reidentification, customized event notification and alert based system, behavior analysis, etc. Finally, the architecture-based system includes standalone systems, cloud-based and distributed systems [6]. A general framework of automated visual surveillance system is shown in **Figure 2** [7–9]. Normally video surveillance system is based on multiple cameras, the videos from the multiple cameras are taken

The data need to be fused before incorporating the further processing. This can be done using data fusion techniques such as multi-sensory level, track to track and

appearance to appearance [10–12]. After the data fusion following steps are performed. The traditional video surveillance system consists of various steps such

These variants of DL algorithms are used in many computer vision applications like face recognition, image classification, speech recognition, text-to-speech generation, handwriting transcription, machine translation, medical diagnosis, cars: drivable area, lane keeping, pedestrian and landmark detection for driver, digital assistants, ads, search, social recommendations, game playing, and content-based image retrieval. The advantage of DL approaches is its ability to learn complex scene features with very less processing of raw data and its capability of learning unlabeled raw data efficiently. Most recently, a new deep-learning technique called CNN have shown high performances over conventional methods in video processing research space. CNN can handle efficiently complex and large data.

During the past decade video surveillance systems have revolved from the simple video acquisition system to real-time intelligent autonomous systems. **Figure 1** shows a timeline chart of the evolution of video surveillance.

Visual surveillance systems come back into existence back in 1942. Primarily, closed-circuit television (CCTV) is used commercially as a security system, mainly for indoor environment. The main concerns of initial CCTVs were (1) voltage

**Figure 1.** *Evolution of surveillance systems.*

#### *Pedestrian Detection and Tracking in Video Surveillance System: Issues, Comprehensive Review… DOI: http://dx.doi.org/10.5772/intechopen.90810*

signals not openly transmitted in a distributed environment, (2) CCTV depends on strategic placements of cameras as per the geographical structure of workplace, (3) human observer is required for camera inputs to monitor the CCTV recorded footage [4]. The CCTV loses its primary advantage as an active, real-time medium, because the video footage can be used only after the fact or incident occurs, that can be used as a legal evidence or forensic tool. Next, in 1996, IP-based surveillance cameras were introduced by Axis, that overcomes the limitation of initial CCTV cameras such as (1) IP-based camera's transmits the raw images instead of voltage signals using the secure transmission channel of TCP/IP, (2) IP-camera comes along with the video analytics, i.e., camera itself can be used for analyzing the images, (3) Ethernet cable can be used as a medium for power supply instead of dedicated power supply, and (4) two-way bidirectional audio signals can be transmitted over a single dedicated network [5]. The recent surveillance system facilitates with remote location monitoring on handheld device like mobile phones.

The video surveillance systems can be categories based on a camera system, application and architecture. The camera system includes single camera, multi camera, fixed camera, moving camera and hybrid camera systems, etc. The application-based system includes object tracking and recognition, ID reidentification, customized event notification and alert based system, behavior analysis, etc. Finally, the architecture-based system includes standalone systems, cloud-based and distributed systems [6]. A general framework of automated visual surveillance system is shown in **Figure 2** [7–9]. Normally video surveillance system is based on multiple cameras, the videos from the multiple cameras are taken through the network and store in database.

The data need to be fused before incorporating the further processing. This can be done using data fusion techniques such as multi-sensory level, track to track and appearance to appearance [10–12]. After the data fusion following steps are performed. The traditional video surveillance system consists of various steps such

#### **Figure 2.**

*A general framework of an automated visual surveillance system [7–9].*

laboratories, and hospitals), (c) industrial environments, automated teller machine (ATM), banks, shopping malls, and public buildings, etc. The most of the surveillance systems at public and private places depend on the human operator observer, who detect any suspicious pedestrian activities in a video scene [2, 3]. The term pedestrian is a person who is walking or running on the street. In some communities, a person using wheelchair is also considered as pedestrians. The most challenging task for automatic video surveillance is to detect and track the suspicious pedestrian activity. For a real-time dynamic environment, the learning-based methods did not provide an appropriate solution for real-time scene analysis because it is difficult to obtain a prior knowledge about all the objects. Still, the learning-based methods are adopted due to their accuracy and robust nature. In the literature, several researchers use efficiently deep-learning (DL) based model for classification purpose in video surveillance over traditional approaches viz.

perceptron model, probabilistic neural network (PNN), radial basis neural network (RBN), etc. Numerous learning-based techniques include artificial neural network (ANN), support vector machine (SVM), AdaBoost, etc. These techniques require the features such as histogram of oriented gradients (HOG), speeded-up robust features (SURF), local binary pattern (LBP), scale and invariant feature transform (SIFT), etc. to classify the type of object. Specifically, these features are represented by different deep learning algorithm versions such as the deep belief networks (DBN), recurrent neural network (RNN), generative adversarial networks (GANs), convolutional neural network (CNN), restricted Boltzmann machine (RBM), AlphaGo, AlphaZero, capsule networks bidirectional encoder representations for

These variants of DL algorithms are used in many computer vision applications like face recognition, image classification, speech recognition, text-to-speech generation, handwriting transcription, machine translation, medical diagnosis, cars: drivable area, lane keeping, pedestrian and landmark detection for driver, digital assistants, ads, search, social recommendations, game playing, and content-based image retrieval. The advantage of DL approaches is its ability to learn complex scene features with very less processing of raw data and its capability of learning unlabeled raw data efficiently. Most recently, a new deep-learning technique called

CNN have shown high performances over conventional methods in video processing research space. CNN can handle efficiently complex and large data. During the past decade video surveillance systems have revolved from the simple video acquisition system to real-time intelligent autonomous systems. **Figure 1** shows a timeline chart of the evolution of video surveillance.

Visual surveillance systems come back into existence back in 1942. Primarily, closed-circuit television (CCTV) is used commercially as a security system, mainly for indoor environment. The main concerns of initial CCTVs were (1) voltage

transformers (BERT), etc.

*Recent Trends in Computational Intelligence*

**Figure 1.**

**164**

*Evolution of surveillance systems.*

as (1) motion and object detection, (2) object classification, (3) object tracking, (4) behavior understanding and activity analysis, (5) pedestrian identification and (6) data fusion. Each stage of automated visual surveillance system is described as follows.
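Before detailing each stage, the overall loop can be sketched as a minimal processing pipeline. Every stage function below is a hypothetical placeholder, not taken from any cited system; the sketch only illustrates how the six stages hand data to one another.

```python
from dataclasses import dataclass, field

@dataclass
class FrameResult:
    detections: list = field(default_factory=list)   # bounding boxes
    labels: list = field(default_factory=list)       # object classes
    tracks: list = field(default_factory=list)       # track ids
    events: list = field(default_factory=list)       # behavior descriptions
    identities: list = field(default_factory=list)   # person identities

def detect_motion_and_objects(frame):      # stage 1: toy detector
    return [(10, 20, 50, 80)]              # one illustrative bounding box

def classify_objects(frame, boxes):        # stage 2
    return ["pedestrian" for _ in boxes]

def track_objects(boxes, prev_tracks):     # stage 3
    return list(range(len(boxes)))

def analyze_behavior(tracks):              # stage 4
    return ["walking" for _ in tracks]

def identify_persons(frame, boxes):        # stage 5
    return ["unknown" for _ in boxes]

def fuse(results):                         # stage 6: merge per-camera results
    merged = FrameResult()
    for r in results:
        merged.detections += r.detections
        merged.tracks += r.tracks
    return merged

def process_frame(frame, prev_tracks=None):
    boxes = detect_motion_and_objects(frame)
    result = FrameResult(detections=boxes,
                         labels=classify_objects(frame, boxes),
                         tracks=track_objects(boxes, prev_tracks),
                         identities=identify_persons(frame, boxes))
    result.events = analyze_behavior(result.tracks)
    return result
```

In a multi-camera system, `process_frame` would run per camera and `fuse` would merge the per-camera results, mirroring the framework of **Figure 2**.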


#### **1.1 Motion and object detection**

Object detection is the first step; it deals with detecting instances of semantic objects of a certain class, such as humans, buildings, or cars, in a video sequence. Approaches to object detection include frame-to-frame differencing, background subtraction and motion analysis using optical flow [13]. These approaches typically combine extracted features with learning algorithms to recognize instances of an object category. The process is divided into two parts. First, object detection, which mainly relies on three types of methods: background subtraction, optical flow and spatiotemporal filtering. Second, object classification, which primarily uses visual features in shape-based, motion-based and texture-based methods [13]. Motion detection is one of the core problems in video surveillance: it is not only responsible for extracting moving objects but is also critical to many applications, including object-based video encoding, human motion analysis, and human-machine interaction [14, 15].

After object detection, the next step is motion segmentation, which detects regions corresponding to moving objects such as humans or vehicles. It focuses on extracting moving regions from video frames and building a database for tracking and behavior analysis. Motion detection identifies a change in the position of an object relative to its surroundings, or a change in the surroundings relative to an object; it can also be achieved with electronic motion sensors, which detect motion in the real environment.
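The frame-to-frame difference and background-subtraction ideas can be sketched in a few lines of NumPy; the threshold and averaging rate below are arbitrary illustrative values, not values from the cited work.

```python
import numpy as np

def motion_mask(prev_frame, frame, threshold=25):
    """Frame-to-frame difference: mark pixels whose grayscale
    intensity changed by more than `threshold`."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

def running_average_background(frames, alpha=0.05):
    """Simple background model: exponential running average of frames.
    Subtracting it from the current frame highlights moving regions."""
    bg = frames[0].astype(np.float64)
    for f in frames[1:]:
        bg = (1 - alpha) * bg + alpha * f
    return bg

# Toy example: a static scene in which a small "object" appears.
prev = np.zeros((8, 8), dtype=np.uint8)
curr = prev.copy()
curr[2:4, 2:4] = 200          # object covers a 2x2 pixel region
mask = motion_mask(prev, curr)
```

Real systems replace these with adaptive models (e.g., mixtures of Gaussians) that tolerate illumination change and camera noise.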

#### **1.2 Object tracking**

Tracking objects in a video sequence means identifying the same object across a sequence of frames using the object's unique characteristics, represented as features. In video surveillance systems, detection is generally followed by tracking. Tracking is performed from one frame to the next using algorithms such as kernel-based tracking, point-based tracking and silhouette-based tracking [16].
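A minimal sketch of point-based tracking, assuming detections are reduced to centroids: each current centroid is greedily matched to the nearest unused previous centroid within a distance gate (the gate value is an arbitrary choice for illustration).

```python
import numpy as np

def match_centroids(prev_pts, curr_pts, max_dist=20.0):
    """Point-based tracking sketch: greedily match each current centroid
    to the nearest unused previous centroid within `max_dist` pixels.
    Returns (prev_index, curr_index) pairs."""
    prev_pts = np.asarray(prev_pts, dtype=float)
    curr_pts = np.asarray(curr_pts, dtype=float)
    matches, used = [], set()
    for j, c in enumerate(curr_pts):
        d = np.linalg.norm(prev_pts - c, axis=1)
        for i in used:          # an already-claimed point cannot match again
            d[i] = np.inf
        i = int(np.argmin(d))
        if d[i] <= max_dist:
            matches.append((i, j))
            used.add(i)
    return matches
```

Kernel- and silhouette-based trackers follow the same frame-to-frame association idea but match appearance models or object contours instead of bare points.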

#### **1.3 Behavior and activity analysis**

In some settings it is necessary to analyze people's behavior and determine whether it is suspicious, for example the behavior of pedestrians in a crowded place (e.g., public marketplaces or government offices). In this step, the motion of objects is recognized from the video scene and a description of the action is generated. Elaiw et al. [80] proposed a critical analysis and modelling strategy for human crowds, with the aim of selecting the most relevant of three scales: (1) microscopic, where pedestrians are detected individually based on location and velocity, with motion parameters neglected; (2) mesoscopic, where pedestrians are detected based on position and velocity and depend on a distribution function; and (3) macroscopic, where pedestrians are identified based on average pedestrian quantity and movement. Such models support efficient decision-making in critical situations where the safety of human crowds is important; that safety depends on the quantity and density of pedestrians physically moving through highly crowded places.
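The macroscopic scale, which summarizes a crowd by pedestrian quantity per region rather than tracking individuals, can be illustrated by histogramming detected positions into a coarse grid (the grid size and positions below are arbitrary illustrative values).

```python
import numpy as np

def crowd_density(positions, area_shape, grid=(4, 4)):
    """Macroscopic view: histogram pedestrian positions into a
    coarse grid and report the pedestrian count per cell."""
    xs = [p[0] for p in positions]
    ys = [p[1] for p in positions]
    hist, _, _ = np.histogram2d(
        xs, ys, bins=grid,
        range=[[0, area_shape[0]], [0, area_shape[1]]])
    return hist

# Two pedestrians close together and one far away in a 100x100 scene.
positions = [(5, 5), (6, 4), (90, 90)]
density = crowd_density(positions, area_shape=(100, 100))
```

Cells whose count exceeds a safety limit would then trigger an alert in a crowd-safety application.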

#### **1.4 Person identification**


The last step is human identification. The human face and gait are the main biometric features that can be used for personal identification in visual surveillance systems after behavior analysis [8].

The goal of this chapter is to discuss the issues and challenges involved in designing a visual surveillance system. It further groups pedestrian detection and tracking methods for moving and fixed cameras into broad categories and gives an informative analysis of the methods in each category. The main contributions of this chapter are as follows:

• Analysis of the issues and challenges of pedestrian detection and tracking in video sequences captured by moving and fixed cameras

• Categorization of pedestrian detection and tracking methods, based on the general concept of the methods belonging to each category, together with the proposed improvements described for each method

• A comparative analysis of publicly available benchmark pedestrian datasets, with their uses, specifications and environmental limitations


This chapter is organized as follows. Section 1 gives an introduction, the importance of video surveillance systems, recent advancements and a general framework of video surveillance. Section 2 discusses the different benchmark pedestrian datasets used to compare methods of pedestrian detection and tracking. Section 3 presents a detailed discussion of the issues and challenges of pedestrian detection and tracking in video sequences. Section 4 groups pedestrian detection and tracking methods for moving and fixed cameras into different categories and describes their general concepts along with the improvements in each category. Section 5 discusses possible future directions. Finally, the chapter concludes with a discussion in Section 6.

#### **2. Pedestrian datasets reported in literature**

State-of-the-art methods for pedestrian detection and tracking include adaptive local binary patterns (LBP), histograms of oriented gradients (HOG) integrated into a multiple-kernel tracker, and spatiotemporal context information-based methods evaluated on benchmark databases [10]. In this section we outline the benchmark datasets that have been commonly used by researchers. **Figure 3** shows a sample image from each pedestrian dataset. Next, we discuss each database with its specification, use and environmental constraints, followed by a comparative analysis.

#### **2.1 Massachusetts Institute of Technology (MIT) pedestrian dataset**

This is one of the first pedestrian datasets; it is fairly small and relatively well solved at this point. It contains 709 pedestrian images taken in city streets: 509 training and 200 test images of pedestrians in city scenes. Each image contains either a front or a back view with a relatively limited range of poses [11, 12].


**Figure 3.**

*Examples from pedestrian datasets. (a) Caltech pedestrian dataset images consisting of unique annotated pedestrians. (b) GM-ATCI rear-view pedestrian dataset. (c) Tsinghua-Daimler Cyclist Detection Benchmark dataset images. (d) NICTA urban dataset. (e) ETH urban dataset. (f) TUD-Brussels dataset. (g) Microsoft COCO pedestrian dataset. (h) INRIA person static pedestrian detection dataset. (i) PASCAL object dataset. (j) CVC-ADAS collection of pedestrian datasets. (k) MIT pedestrian database images. (l) Mapillary Vistas research dataset.*

#### **2.2 Caltech pedestrian dataset**

The Caltech dataset consists of 640 × 480 resolution video taken from a vehicle driving through regular traffic in an urban environment. About 250,000 frames, with a total of 350,000 bounding boxes and 2300 unique pedestrians, were annotated for testing and training. The annotations include bounding boxes for each pedestrian walking on the streets and detailed occlusion labels for each object captured in the video sequences. The pedestrian annotations are used for validating the accuracy of pedestrian detection and tracking algorithms [10].
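Detector output is typically validated against such bounding-box annotations with the intersection-over-union (IoU) measure; a minimal sketch follows (the 0.5 threshold is the common convention, not something specific to the Caltech protocol).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)     # intersection corners
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(detection, ground_truth, threshold=0.5):
    """Count a detection as correct if it overlaps an annotated box enough."""
    return iou(detection, ground_truth) >= threshold
```

Sweeping the detector's confidence threshold and counting true/false positives this way yields the miss-rate and precision-recall curves used to compare methods on these benchmarks.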

#### **2.3 General Motors-Advanced Technical Center (GM-ATCI) pedestrian dataset**

The GM-ATCI dataset is a rear-view pedestrian database captured using a vehicle-mounted standard automotive rear-view display camera, for evaluating rear-view pedestrian detection. In total, the dataset contains 250 clips with a duration of 76 min and over 200K annotated pedestrian bounding boxes. It was captured at different locations, including indoor and outdoor parking lots, city roads and private driveways, in both day and night scenarios and under different weather and lighting conditions [15].

#### **2.4 Daimler pedestrian dataset**

The pedestrian images were captured from a vehicle-mounted, calibrated stereo camera rig in an urban environment. This dataset contains tracking information and a large number of labeled bounding boxes, with a float disparity map and a ground-truth shape image. The training set contains 15,560 pedestrian samples with 6744 labeled pedestrians, and the test set contains more than 21,790 images with 56,492 pedestrian labels [15].

#### **2.5 National Information and Communication Technology Australia (NICTA) pedestrian dataset**

This is a large-scale urban dataset collected in multiple cities and countries. It contains around 25,551 unique pedestrians, allowing for a set of over 50K images (with mirroring) and annotations for validating detection and tracking algorithm accuracy [16].

#### **2.6 Swiss Federal Institute of Technology (ETH) pedestrian dataset**

This is an urban dataset captured from a stereo rig mounted on a stroller, observing a traffic scene much as one would from inside a vehicle. The database is used for pedestrian detection and tracking from moving platforms in an urban scenario. It contains traffic agents such as different cars and pedestrians, so that a system can predict their further motion or even interpret their intentions, while staying clear of obstacles, remaining on the assigned road, and reading or interpreting any traffic signs at the side of the street. On top of that, a human is able to assess the situation: when close to a school or pedestrian crossing, one will ideally adapt one's driving behavior [17].

#### **2.7 TUD-Brussels pedestrian dataset**

This dataset consists of image pairs recorded in a crowded urban setting from a moving platform with an onboard camera, representing a challenging automotive safety scenario in an urban environment [18].

#### **2.8 National Institute for Research in Computer Science and Automation (INRIA) pedestrian dataset**

INRIA is currently one of the most popular static pedestrian detection datasets. It contains people with significant variation in appearance, pose, clothing, background and illumination, coupled with moving cameras and backgrounds. Each pair of images shows two consecutive frames [19].

#### **2.9 PASCAL visual object classes (VOC) 2007 and 2012 dataset**

This is a static object dataset with diverse object views and poses. The goal of the visual object classes challenge is to recognize objects from a number of visual object classes in realistic scenes. The 20 selected object classes fall into broad groups such as (1) person, (2) animal, and (3) vehicle [20].

#### **2.10 Microsoft Common Object in Context (COCO) 2018 dataset**

COCO is a recent dataset created by Microsoft [22]. It is designed to spur object detection research with a focus on detecting objects in context. The annotations include instance segmentations for objects belonging to 80 object categories, stuff segmentations for 91 categories, keypoint annotations for person instances, and five image labels per image. The COCO 2018 dataset challenges are (1) object detection with segmentation masks on the image, (2) panoptic segmentation, (3) person keypoint estimation, and (4) dense pose detection. **Figure 3(g)** shows sample images from the MS COCO dataset.
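COCO annotations ship as JSON; the sketch below parses a tiny hand-written fragment in that style (only the fields shown are assumed) and collects the person bounding boxes, which COCO stores as [x, y, width, height].

```python
import json

# A tiny, hand-written fragment in the COCO annotation style
# (category id 1 is "person" in the real dataset).
coco_fragment = json.loads("""
{
  "images": [{"id": 1, "file_name": "street.jpg"}],
  "categories": [{"id": 1, "name": "person"}],
  "annotations": [
    {"image_id": 1, "category_id": 1, "bbox": [12.0, 30.0, 40.0, 95.0]},
    {"image_id": 1, "category_id": 1, "bbox": [60.0, 25.0, 38.0, 99.0]}
  ]
}
""")

def person_boxes(coco, image_id):
    """Return the [x, y, width, height] boxes of 'person' annotations."""
    person_ids = {c["id"] for c in coco["categories"] if c["name"] == "person"}
    return [a["bbox"] for a in coco["annotations"]
            if a["image_id"] == image_id and a["category_id"] in person_ids]

boxes = person_boxes(coco_fragment, image_id=1)
```

The real annotation files carry many more fields (segmentation polygons, keypoints, crowd flags); the same filtering pattern applies.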

#### **2.11 Mapillary vistas research dataset**

The Mapillary vistas panoptic segmentation targets the full perception stack for scene segmentation in street-images [22]. Panoptic segmentation solves both stuff and thing classes, unifying the typically distinct semantic and instance segmentation tasks efficiently. **Figure 3(l)** shows a sample image of Mapillary vistas research datasets. The comparative analysis of recently utilized pedestrian database with its

**2.2 Caltech pedestrian dataset**

*Recent Trends in Computational Intelligence*

**Figure 3.**

*dataset.*

The Caltech dataset consists of 640 � 480 resolution video taken from a vehicle driving through regular traffic in an urban environment. About 250,000 with a total of 350,000 bounding boxes and 2300 unique pedestrians were annotated for testing and training purpose. The annotation includes bounding boxes for each pedestrian walking on streets and detailed occlusion labels for each object captured in a video sequence in an urban environment. The annotation of pedestrians is used for vali-

*Example of pedestrians dataset. (a) Caltech pedestrian dataset images consists of unique annotated pedestrians. (b) GM-ATCI rear-view pedestrians' dataset. (c) Tsinghua-Daimler Cyclist Detection Benchmark dataset images. (d) NICTA urban dataset. (e) ETH urban dataset. (f) TUD-Brussels dataset. (g) Microsoft COCO pedestrian dataset. (h) INRIA person static pedestrian detection datasets. (i) PASCAL object dataset. (j) CVC-ADAS collection of pedestrian datasets. (k) MIT pedestrian database images. (l) Mapillary vistas research*

**2.3 General Motors-Advanced Technical Center (GM-ATCI) pedestrian dataset**

The GM-ATCI dataset is a rear-view pedestrian database captured using a vehicle-mounted standard automotive rear-view display camera for evaluating rear-view pedestrian detection. In total, the dataset contains 250 clips with a total duration of 76 minutes and over 200K annotated pedestrian bounding boxes. The dataset has been captured at different locations, including indoor and outdoor parking lots, city roads and private driveways. This dataset was collected in both day and night scenarios, with different weather and lighting conditions [15].

**2.4 Daimler pedestrian dataset**

The pedestrian images were captured from a vehicle-mounted, calibrated stereo camera rig in an urban environment. This dataset contains tracking information and a large number of labeled bounding boxes with a float disparity map and a ground-truth shape image. The training set contains 15,560 pedestrian samples and 6744 negative samples, and the testing set contains more than 21,790 images with 56,492 pedestrian labels [15].

**2.5 National Information and Communication Technology Australia (NICTA) pedestrian dataset**

It is a large-scale urban dataset collected in multiple cities and countries. The dataset contains around 25,551 unique pedestrians, allowing for a dataset







*Pedestrian Detection and Tracking in Video Surveillance System: Issues, Comprehensive Review… DOI: http://dx.doi.org/10.5772/intechopen.90810*

| Data source | Purpose | Image or video clips | Annotation | Environment | Ref. | Year |
|---|---|---|---|---|---|---|
| Caltech pedestrian dataset | Detection and tracking of pedestrians walking on the street | 250,000 frames (in approximately 137 minute-long segments) | 350,000 bounding boxes and 2300 unique annotated pedestrians | Urban environment | [10] | 2012 |
| GM-ATCI | Rear-view pedestrian segmentation, detection and tracking | 250 video sequences | 200K annotated pedestrian bounding boxes | Day and night scenarios, different weather and lighting conditions | [11] | — |
| Daimler | Detection and tracking of pedestrians | 15,560 pedestrian samples, 6744 negative samples | 2D bounding box overlap criterion, float disparity map and ground-truth shape image | Urban environment | [15] | 2016 |
| NICTA | Pedestrian segmentation, detection and tracking | 25,551 unique pedestrians, 50,000 images | 2D ground-truth image | Urban environment | [16] | 2016 |
| MS COCO 2018 | Object detection, segmentation, keypoint detection, DensePose detection | 300,000 images, 2 million instances, 80 object categories | 5 captions per image | Urban environment | [22] | 2018 |
| Mapillary vistas dataset | Semantic understanding of street scenes | 25,000 images, 152 object categories | Pixel-accurate and instance-specific human annotations for understanding street scenes | Urban environment | [22] | 2017 |
| MS COCO 2017 | Recognition, segmentation, captioning | 328,124 images, 1.5 million object instances | Segmented people and objects | Urban environment | [22] | 2017 |
| MS COCO 2015 | Recognition, segmentation, captioning | 328,124 images, 80 object categories | Segmented people and objects | Urban environment | [22] | 2015 |
| TUD-Brussels | Detection, tracking | 1092 image pairs | 1776 annotated pedestrians | Urban environment | [17] | 2010 |
| ETH | Segmentation, detection, tracking | Videos | Consists of pedestrians and other traffic agents such as different cars | Urban environment | [18] | 2009 |
| INRIA | Detection, segmentation | 498 images | Annotations marked manually | Urban environment | [19] | 2005 |
| CVC-ADAS | Detection, tracking | 60,000 frames | 7900 annotated pedestrians | Urban environment | [20] | 2009 |
| PASCAL VOC 2012 | Detection, classification, segmentation | 11,530 images, 20 object classes | 27,450 ROI annotated, 6929 segmentations | Urban environment | [21] | 2012 |
| MIT | City street pedestrian segmentation, pose estimation and learning | 709 pedestrian images (509 training, 200 test) | No annotated pedestrians | Daylight scenario | [12] | 2000, 2005 |

**Table 1.**
*Recently used pedestrian databases by the researchers.*

The comparative analysis of recently used pedestrian databases and their applications for video surveillance systems is shown in **Table 1**. The comparison is performed in terms of the application of the dataset, the size of the dataset, the dataset creation environment, and the type of annotation details used for testing, training and validation of detection and tracking algorithm performance. These datasets are used by researchers for testing the performance of their respective pedestrian detection and tracking algorithms.

#### **3. Issues and challenges of pedestrian detection and tracking**

A moving object is a nonrigid entity that moves over time in the image sequences of a video captured by a fixed or moving camera. In a video surveillance system, the region of interest is a human being that needs to be detected and tracked in the video [23]. However, this is not an easy task, due to the many challenges and difficulties involved. These challenges occur at three different levels: video acquisition, human detection, and human tracking. The challenges in acquiring video include illumination variation, abrupt motion, complex background, shadows, object deformation, etc. Human detection and tracking challenges include varied poses, occlusion, tracking in dense crowds, etc. Each of these issues and challenges is presented in this section.

#### **3.1 Problems related to camera**

Many factors related to video acquisition systems, acquisition methods, compression techniques and the stability of cameras (or sensors) can directly affect the quality of a video sequence. In some cases, the device used for video acquisition might impose limitations on designing object detection and tracking (e.g., when color information is unavailable, or when the frame rate is very low). Moreover, block artifacts (as a result of compression) and blur (as a result of camera vibration) reduce the quality of video sequences [36]. Noise is another factor that can severely deteriorate the quality of image sequences. Besides, different cameras have different sensors, lenses, resolutions and frame rates, producing different image qualities. A low-quality image sequence can affect moving object detection algorithms. **Figure 4** shows an example of each challenge.

#### *3.1.1 Camera motion*

When dealing with detecting moving objects in the presence of moving cameras, the need to estimate and compensate for the camera motion is inevitable. However, this is not an easy task because of possible changes in the camera's depth and its complex movements. Many works address an easier scenario by considering only simple camera movements, i.e., pan-tilt-zoom (PTZ) cameras. This limited movement allows using a planar homography to compensate for camera motion, which results in creating a mosaic (or panorama) background for all frames of the video sequence [37].

**Figure 4.**
*Issues and challenges. (a) An example of the illumination variation challenge (David indoor in the Ross dataset [39]). (b) An example of the appearance change challenge (Dudek in the Ross dataset [39]). (c) An example of the abrupt motion challenge (Motocross in the Kalal dataset [50]). (d) An example of the occlusion challenge (car in the Kalal dataset [50]). (e) An example of free camera motion in the Michigan University dataset [10]. (f) An example of the dynamic background challenge (Kitesurf in the Zhang dataset [60]). (g) An example of the shadow challenge (pedestrian 4 in the Kalal dataset [50]). (h) An example of panning of the camera in the CDNET database [10]. (i) An example of zooming of the camera in the CDNET database [10]. (j) An example of a nonrigid moving object in a video sequence [67].*
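The planar-homography compensation used for PTZ-style camera motion can be sketched numerically. The helper names below are invented for illustration; `estimate_homography` is a minimal direct linear transform (DLT) fit from four background correspondences, and applying the inverse homography "undoes" a toy camera pan:

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct Linear Transform: fit H so that dst ~ H @ src (homogeneous).
    src, dst: (N, 2) arrays of matched background points, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)           # null-space vector = flattened H
    return H / H[2, 2]

def apply_homography(H, pts):
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Background corners seen in frame t and (after a small pan) in frame t+1:
src = np.array([[0, 0], [640, 0], [640, 480], [0, 480]], dtype=float)
dst = src + np.array([5.0, 2.0])       # pure translation as a toy camera motion
H = estimate_homography(src, dst)
compensated = apply_homography(np.linalg.inv(H), dst)  # undo the camera motion
print(np.allclose(compensated, src, atol=1e-6))
```

With the camera motion removed this way, consecutive frames can be stitched into the mosaic background the text describes.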

#### *3.1.2 Nonrigid object deformation*

In some cases, different parts of a moving object might have different movements in terms of speed and orientation, for instance, a walking dog wagging its tail or a moving tank rotating its turret. Most detection algorithms assume a single coherent motion per object, so such differently moving parts pose an enormous challenge, especially for nonrigid objects and in the presence of moving cameras. In Hou et al. [40], articulated models have been proposed for moving nonrigid objects to handle nonrigid object deformation. In these models, each part of an articulated object is allowed to have different movements. It can be concluded that local features of a moving object, along with updating background models, are more efficient for dealing with this challenge.
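As a minimal sketch of the articulated-model idea (all names and numbers here are invented for illustration, not the formulation of [40]), each rigid part carries its own motion and the whole object is the union of its parts:

```python
from dataclasses import dataclass

@dataclass
class Part:
    """One rigid part of an articulated object, with its own motion."""
    x: float
    y: float
    w: float
    h: float
    vx: float
    vy: float

    def step(self):
        self.x += self.vx
        self.y += self.vy

def union_box(parts):
    """Bounding box of the whole articulated object."""
    x0 = min(p.x for p in parts); y0 = min(p.y for p in parts)
    x1 = max(p.x + p.w for p in parts); y1 = max(p.y + p.h for p in parts)
    return (x0, y0, x1 - x0, y1 - y0)

# Body moves right; the tail also swings upward relative to the body.
body = Part(100, 100, 40, 20, vx=2, vy=0)
tail = Part(90, 105, 10, 5, vx=2, vy=-1)
for p in (body, tail):
    p.step()
print(union_box([body, tail]))
```

Tracking each part separately and merging their extents keeps the whole object detected even though its parts disagree on motion.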

#### **3.2 Challenges in video acquisition**

#### *3.2.1 Illumination variation*

The lighting conditions of the scene and the target might change due to the motion of the light source, different times of day, reflection from bright surfaces, whether in indoor or outdoor scenes, partial or complete blockage of the light source by other objects, etc. The direct impact of these variations is background appearance change, which causes false positive detections for methods based on background modeling. Thus, it is essential for these methods to adapt their model to illumination variation. Meanwhile, because the object's appearance changes under illumination variation, appearance-based tracking methods may not be able to track the object in the sequence [23–28]. Thus, these methods are required to use features which are invariant to illumination.
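One widely used illumination-invariant feature of this kind is the local binary pattern (LBP): it encodes only the ordering of neighbouring intensities, so any monotonic brightness change leaves it untouched. A minimal sketch (toy image, simplified 8-neighbour variant):

```python
import numpy as np

def lbp_image(img):
    """8-neighbour local binary pattern; invariant to any monotonic
    (order-preserving) change of gray levels, e.g. a global brightness shift."""
    img = np.asarray(img, dtype=float)
    c = img[1:-1, 1:-1]                       # center pixels
    neigh = [img[:-2, :-2], img[:-2, 1:-1], img[:-2, 2:], img[1:-1, 2:],
             img[2:, 2:], img[2:, 1:-1], img[2:, :-2], img[1:-1, :-2]]
    code = np.zeros_like(c, dtype=np.int32)
    for bit, n in enumerate(neigh):
        code |= ((n >= c).astype(np.int32) << bit)  # one bit per neighbour
    return code

rng = np.random.default_rng(0)
frame = rng.integers(0, 200, size=(8, 8))
brighter = frame + 40                         # global illumination change
print(np.array_equal(lbp_image(frame), lbp_image(brighter)))
```

Because only the sign of each neighbour-minus-center comparison matters, the codes of the original and brightened frames are identical.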

#### *3.2.2 Presence of abrupt motion*

Sudden changes in the speed and direction of the object's motion, or sudden camera motion, are another challenge of video acquisition that affects object detection and tracking. If the object or the camera moves very slowly, temporal differencing methods may fail to detect the portions of the object coherent with the background [31]. Meanwhile, very fast motion produces a trail of ghost detections. So, if these object or camera motions are not taken into account, the object cannot be detected correctly by methods based on background modeling. For tracking-based methods, on the other hand, the prediction of motion becomes hard or even impossible; as a result, the tracker might lose the target. Even if the tracker does not lose the target, the unpredictable motion can introduce a greater amount of error into the algorithms [32].
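The failure mode of temporal differencing under slow motion can be seen in a toy sketch (frames and threshold invented for illustration): a jumping object leaves a large changed-pixel mask, while a barely moving one changes almost nothing:

```python
import numpy as np

def moving_mask(prev, curr, thresh=15):
    """Temporal differencing: pixels whose intensity changed by more than
    `thresh` between consecutive frames are marked as moving."""
    return np.abs(curr.astype(int) - prev.astype(int)) > thresh

frame = np.zeros((10, 10), dtype=np.uint8)
obj = frame.copy(); obj[4:6, 4:6] = 200            # bright 2x2 object

fast = np.zeros_like(frame); fast[4:6, 7:9] = 200  # object jumped 3 px
slow = np.zeros_like(frame); slow[4:6, 4:6] = 205  # object barely changed

print(moving_mask(obj, fast).sum())  # old and new positions both flagged
print(moving_mask(obj, slow).sum())  # nothing exceeds the threshold
```

The fast case also shows the "ghost" trail mentioned above: the vacated old position is flagged alongside the new one.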

#### *3.2.3 Complex background*

The background may be highly textured, especially in natural outdoor environments, where high variability of textures is present. Moreover, the background may be dynamic, i.e., it may contain motion (e.g., a fountain, moving clouds, traffic lights, waggling trees, water waves). Such regions need to be treated as background by moving object detection algorithms, and their movements can be periodic or nonperiodic [34].
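A common way such algorithms absorb periodic background motion is an adaptive, exponentially weighted running-average background model. A minimal sketch with an invented oscillating "background" (the learning rate `alpha` is an illustrative choice):

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Exponential running average: the model slowly absorbs periodic
    background motion (waving trees, water) while staying stable."""
    return (1 - alpha) * bg + alpha * frame

rng = np.random.default_rng(1)
bg = np.full((6, 6), 100.0)
# Dynamic background: intensity oscillates around 100 (e.g. waggling leaves).
for t in range(200):
    frame = 100 + 10 * np.sin(0.3 * t) + rng.normal(0, 1, size=(6, 6))
    bg = update_background(bg, frame)

print(abs(bg.mean() - 100) < 5)   # model stays near the true background level
```

The low-pass behaviour damps the periodic swings, so the oscillating texture is absorbed into the model instead of being flagged as a moving object.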

#### *3.2.4 Shadows*


The presence of shadows in video image sequences complicates the task of moving object detection. Shadows are created by the occlusion of the light source by the object. If the object does not move during the sequence, the resulting shadow is static and can effectively be incorporated into the background. However, a dynamic shadow, caused by a moving object, has a critical impact on accurately detecting moving objects, since it has the same motion properties as the moving object and is tightly connected to it. Shadows can often be removed from images of the sequence using their observed properties, such as color, edges and texture, or by applying a model based on prior information such as illumination conditions and moving object shape [35, 47, 48]. However, dynamic shadows are still difficult to distinguish from moving objects, especially in outdoor environments where the background is usually complex.
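A minimal sketch of color-property-based shadow removal in HSV space (the thresholds are illustrative assumptions, not values from the cited works): a cast shadow darkens a pixel while leaving its hue and saturation roughly unchanged, whereas a genuine moving object usually changes color as well:

```python
import numpy as np

def shadow_mask(bg_hsv, frame_hsv, v_lo=0.4, v_hi=0.9, s_tol=0.15, h_tol=20):
    """Classify a pixel as shadow when it is darker than the background
    (value ratio in [v_lo, v_hi]) but similar in hue and saturation."""
    h_b, s_b, v_b = bg_hsv[..., 0], bg_hsv[..., 1], bg_hsv[..., 2]
    h_f, s_f, v_f = frame_hsv[..., 0], frame_hsv[..., 1], frame_hsv[..., 2]
    ratio = v_f / np.maximum(v_b, 1e-6)
    return ((ratio >= v_lo) & (ratio <= v_hi)
            & (np.abs(s_f - s_b) <= s_tol)
            & (np.abs(h_f - h_b) <= h_tol))

# Background: gray road (hue 0, sat 0.1, value 0.8). A cast shadow only
# darkens it, while a pedestrian changes hue/saturation as well.
bg = np.zeros((4, 4, 3)); bg[..., 1] = 0.1; bg[..., 2] = 0.8
frame = bg.copy()
frame[0, 0] = [0, 0.12, 0.5]    # darker, same colour      -> shadow
frame[1, 1] = [120, 0.6, 0.5]   # darker, different colour -> object
mask = shadow_mask(bg, frame)
print(mask[0, 0], mask[1, 1])
```

Pixels flagged by the mask are dropped from the foreground before the detector runs, leaving only the genuinely moving object.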

Next, human detection and tracking issues and challenges are discussed in brief. These include varying poses, occlusion, tracking in crowded areas, etc.

#### **3.3 Challenges in human detection and tracking**

#### *3.3.1 Pedestrian occlusion*

The object may be occluded by other objects in the scene. In this case, some parts of the object can be camouflaged or just hidden behind other objects (partial occlusion), or the object can be completely hidden by others (complete occlusion). As an example, consider the target to be a pedestrian walking on the sidewalk. It may be occluded by trees, cars in the street, other pedestrians, etc. Occlusion severely affects the detection of objects in background modeling methods, where the object is completely missing or separated into unconnected regions [33]. If occlusion occurs, the object's appearance model can change for a short time, which can cause some object tracking methods to fail.
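A minimal sketch of how a tracker can survive short occlusions (all names invented; in practice a Kalman or particle filter usually plays this role): when the detection disappears, the track coasts on its last velocity for a bounded number of frames instead of being dropped immediately:

```python
from dataclasses import dataclass

@dataclass
class Track:
    """Keep a target alive through short occlusions by coasting on its
    last known velocity instead of dropping it at the first miss."""
    x: float
    vx: float
    missed: int = 0
    max_missed: int = 5          # frames a target may stay fully occluded

    def update(self, detection):
        if detection is None:            # occluded: predict only
            self.missed += 1
            self.x += self.vx
        else:                            # visible: correct with measurement
            self.vx = detection - self.x
            self.x = detection
            self.missed = 0

    @property
    def alive(self):
        return self.missed <= self.max_missed

track = Track(x=0.0, vx=2.0)
observations = [2.0, 4.0, None, None, None, 10.0]  # 3-frame occlusion
for z in observations:
    track.update(z)
print(track.alive, track.x)
```

Because the coasted prediction lands near the target when it reappears, the track can be re-associated instead of spawning a new identity.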

#### *3.3.2 Pose variation: moving object appearance changes*

In real scenarios, most objects can occur in 3D space, but we have the projection of their 3D movement in a 2D plane. Hence, any rotation in the direction of third


**Challenge**

**175**

Illumination

1.Conditional

update scheme

2. Local

3. 4. Adaptive local binary pattern

5. Bayesian framework

6.Color modelling approach

with B-spline curves

2D-Cepstrum

 approach

representation

 model

 background

variation

 **Proposed** 

**methodology**

 **Advantage** 1. Evaluate rapidly scene changes

adaptively

2.More efficient to detect moving

objects

3. Provided good robustness to

illumination

4. Tolerant to illumination 5. Not sensitive to illumination

6. Adapt to irregular illumination

variations and abrupt changes of

brightness

Presence of abrupt

1.Kernelized

tracker based on swarm

intelligence

2.

Hamiltonian

Monte Carlo (MCMC) based

tracking algorithm 3. Bayesian filter tracking frame

framework

4.Wang-Landau

sampling method

Complex

1. Adaptive background 2. Dynamic texture modelling

methods

3. Principal features based on

statistical

4.

Auto-regressive

 model

characteristics

3. Effectively overcome the complex

background

 model

1. Highly complex detecting moving objects 2. Effectively used for moving object

detection

backgrounds

 for

1. Less accurate in background

with large

2. It is 3.Wrongly absorb a foreground

with background 4. The model is very complex

computationally

 very complex

 object

re-projection

 errors

 regions

1. Adaptive to any object

[42]

2017

[41]

2017

[40]

2005

[39]

2004

[38]

2003

[37]

1995

*…*

detection methods 2. Operates on dynamic

texture sequence 3. Fusion of information

solved the issue

background

 Monte Carlo

 Markov Chain

 method

 correlation filter

1. Effectively handle abrupt motion

1. Kernelized correlation filter tracker

computational

2. Tracker does not handle abrupt motion

in scale and position

3. Tracker not suitable in significant

motion

4. Tracker does not fully consider the

abrupt changes

 complexity

 is more

tracking in videos

2. Effective in handling different type of

abrupt motions

3. Effective against abrupt motions

4. Efficiently deals with the abrupt

motions

motion

 variations

 variations

 variation

camera

5. If the colors changes are very fast then,

this method fails to detect object

6. Speed of visual tracker is less. Visual

tracker can detect only single target

object

**Identifies gap**

1.

Computation

model is more

2. Final large noise patches

3. This method is

intensive

4. This method requires a nonmoving

computationally

segmentation

 results consist of

 time required for this

**Observation** 1. Results: TPF—7.36 ms

and Speed FPS—136

2. Post processing can

reduce large noise

patches

3.Color and texture

information

4. Texture features having

FPS—15 fps 5. Dynamic texture

coefficient is used for

texture variations with

5 fps 6. Visual tracker speed improved using good

workstation

1. A unified framework

[31]

2017

*Pedestrian Detection and Tracking in Video Surveillance System: Issues, Comprehensive Review*

[30]

2012

[29]

2012

[28]

2008

track smooth or abrupt

motion

2. Tracker can handle smooth and abrupt

motion

3. Bayesian filtering solve

the local-trap problem

4. Tracking algorithm efficiently handle abrupt

motions

 retain

**Ref. Year**

[23]

2017

[24]

2015

[25]

2010

[26]

2006

[27]

2006

[28]

2001

*DOI: http://dx.doi.org/10.5772/intechopen.90810*


#### *Pedestrian Detection and Tracking in Video Surveillance System: Issues, Comprehensive Review … DOI: http://dx.doi.org/10.5772/intechopen.90810*

**Challenge**

**174**

Problems related to

1.

Heterogeneous

technique

2.Color-based

model and the APSO based

framework

3. Pyramidal structure model

4. Blur-driven

 tracker (BLUT)

framework

5.Cascade particle filter in

tracking

Camera motion

 1. Image stabilization

based on motion

compensation

2. 3D motion models

3.Multi plane the 3D scene

4. Adaptive motion model

> Nonrigid object

> 1. Target object as a

of different segments having

different movements

2.Model of articulated objects

composed of rigid bodies

3. View-based

 eigenspace

representation

combination

1. Segments with motion consistency

1. 2. Dense articulated real-time tracker requires initial object pose. Tracker not

able to measure the similarity of the

object accurately for small object

Computationally

 intensive

> 2. Accurately produce a depth image

representing

moving object

3. Produced good results for handling

tracking

 different poses of a

deformation

representation

 of

3.Computational

decreased

4.More accurately detects moving

objects

 complexity

 is

 of features

 techniques

1. Efficiently detect moving object and

1.Motion detection fails in long video. It

computationally

2. It cannot be applied to real-time

systems due to the slowness of the SIFT

computation

3. It is

4.Camera motion detection time is more

computationally

 very complex

 complex

motion blur issue resolve 2.Can be efficiently used to compensate

camera vibrations

 appearance

feature-based

1. Detecting human movements

resolution conditions

2. Deal with low resolution video

sequences, the techniques based on

fusion can be used to better detect

moving objects

3. Fast detect moving objects 4.Can robustly track severely blurred

targets

5. Detection for low frame rate videos

 in low

1.

not cover all targets with different sizes

2. Human detection accuracy is less in

complex background

3. Not tested for real time scenarios

4.Misclassification

tracking in complex background 5. Tracker not able to distinguish between

different targets of video sequence

 in detection and

Low-resolution

 humans' images may

camera

 **Proposed** 

**methodology**

 **Advantage**

**Identifies gap**

**Observation**

1. Region of Interest

method can be used in

outdoor

2. Appearance object is significantly

fast

3. Detect the moving object with detection

accuracy are 80%

4. Effectively tracks the

blurred objects without

deblurring

5. Efficient multi-target

tracker is required 1. Adaptive particle filter

[63]

2014

[62]

2011

[61]

2009

[60]

2009

[59]

2017

[58]

2005

[57]

1994

[56]

2017

[55]

2000

[54]

[66]

2017

[65]

2015

[64]

1998

1998

framework

reduce SIFT feature

2. 3D camera motion

model requires fast

feature extraction 3. It can be used for image

analysis

4. Efficiently resolve the

issue of motion blur 1. Eigen tracker is used to

track and recognize

gestures

2. Accuracy of tracker can

be improved by more

color features 3. Tracker not able to detect the small objects

 uses PCA to

 model, the

environment

[51]

2014

[50]

2011

[49]

2008

*Recent Trends in Computational Intelligence*

**Ref. Year**

[53]

2017

[52]

2015


**Challenge**

**177**

 **Proposed**  3. Covariance

 matrix and Lie

algebra

4.

Low-dimensional

representation

**Table 2.** *Challenges*

 *of pedestrian*

 *detection and tracking with related reference works.*

 subspace

**methodology**

 **Advantage** 3. Adaptively track moving objects

under their appearance

moving camera

4. Efficiently adapts online to changes in

the appearance

 of the moving objects

 changes for

**Identifies gap** 3. Covariance

 tracker is

intensive

4. Tracker occasionally

object

 drift from a target

computationally

2.Models appearance using a mixture with 180

angles

3. Tracker handle

illumination

4. Robust object tracking. RMS error is 5.07 pixels

*DOI: http://dx.doi.org/10.5772/intechopen.90810*

per frame

*Pedestrian Detection and Tracking in Video Surveillance System: Issues, Comprehensive Review*

*…*

*Pedestrian Detection and Tracking in Video Surveillance System: Issues, Comprehensive Review… DOI: http://dx.doi.org/10.5772/intechopen.90810*

…axis may change the object appearance [29]. Tracking algorithm performance is affected by variation in pose: the same pedestrian looks different in consecutive frames if the pose changes continuously. Moreover, objects themselves may change their pose and appearance (facial expressions, changing clothes, wearing a hat, etc.). The target can also be a nonrigid object whose appearance changes over time. In many applications the goal is tracking humans or pedestrians, which makes tracking algorithms vulnerable in this challenging case [30]. **Table 2** summarizes a comparative analysis of methodologies, with their advantages, identified gaps, and observations, for handling these challenging issues in a video surveillance system.

| Challenge | Proposed methodology | Advantage | Identified gap |
|---|---|---|---|
| Shadow | 1. Shadow elimination using HSV color space and texture features. 2. Modified Gaussian mixture model with shadow detection | 1. Can effectively distinguish shadow regions. 2. Can handle a highly dynamic environment | 1. Accuracy on long videos reduces due to texture variation. 2. Misclassification in complex scenes |
| Occlusion | 1. Histogram of oriented gradients (HOG) in a multiple-kernel tracker. 2. Spatiotemporal context information. 3. Maintained appearance models. 4. Active contour model evolved from frame to frame. 5. Appearance model based on filter responses from a steerable pyramid | 1. Effectively handles occlusions under different conditions of moving cameras. 2. Tracker can distinguish the object during occlusion. 3. Occlusions handled more efficiently. 4. Minimizes occlusion problems. 5. Overcomes changing appearance | 1. Tracker fails when the object is fully occluded. 2. Method is complex. 3. Fails on long sequences with varying lighting conditions. 4. Offline object tracking not possible for all object types. 5. Tracked object moves with its background |
| Pose variation and appearance change | 1. Trainable model using optical flow. 2. Wandering-stable-lost framework | 1. Good performance in handling appearance changes. 2. Adaptive to appearance changes | 1. Difficult to identify ambiguous motion patterns of the object. 2. Sensitive to lighting changes; multiple cameras required to cope with self-occlusion |

**Table 2.**
*Challenges of pedestrian detection and tracking with related reference works.*
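Several of the shadow and background methods summarized in Table 2 build on per-pixel background models; the modified Gaussian mixture model in the shadow row, for instance, maintains a distribution per pixel. As a much-simplified, illustrative sketch (a single running-average background instead of a full mixture; the function names are ours, not from the surveyed works):

```python
def update_background(bg, frame, alpha=0.05):
    """Running-average background model: each background pixel moves a
    fraction alpha toward the corresponding pixel of the current frame."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
            for brow, frow in zip(bg, frame)]

def foreground_mask(bg, frame, thresh=25):
    """Mark pixels deviating from the background estimate by more than
    thresh as foreground (1); everything else is background (0)."""
    return [[1 if abs(f - b) > thresh else 0 for b, f in zip(brow, frow)]
            for brow, frow in zip(bg, frame)]
```

A Gaussian-mixture variant keeps several weighted means and variances per pixel and classifies shadow pixels separately, which is what lets it handle highly dynamic environments at the cost of misclassification in complex scenes.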


#### **4. Pedestrian detection and tracking**

In video-based surveillance, one of the key tasks is to detect the presence of pedestrians in a video sequence, i.e., to localize all subjects that are human [45, 68]. This problem corresponds to determining regions, typically the smallest rectangular bounding boxes, in the video sequence that enclose humans. In most surveillance systems, human behavior has been recognized by analyzing trajectories, positions of persons, and historical or prior knowledge about the scene. **Figure 5** shows some examples of pedestrian detection and tracking. Haritaoglu et al. [46] describe a combined approach of shape analysis and body tracking that models different appearances of a person. It was designed for outdoor environments using a single camera. The system detects and tracks groups of people and monitors their behavior, even in the presence of partial occlusion. However, the performance depends mainly on the detected trajectories of the concerned objects in the video, and the results are not sufficient for semantic recognition of dynamic human activities and event analysis in some cases. An advanced automatic video surveillance system provides many capabilities, such as motion detection [69, 70] and human behavior analysis, detection, and tracking [71–73]. Human tracking is quite challenging, since humans exhibit large intra-class variability in shape and appearance due to different viewing perspectives and other visual properties.

#### **Figure 5.**
*Example of pedestrian detection and tracking. (a) Detecting pedestrians outdoors, walking along the street. (b) ADAS pedestrian detection. (c) Pedestrian detector based on the aggregate channel feature detector. (d) Real-time vehicle and pedestrian detection in road scenes. (e) Pedestrian action prediction based on the analysis of human postures in the context of traffic. (f) Pedestrian detection based on a hierarchical co-occurrence model. (g) Cross-modal deep representations for robust pedestrian detection. (h) Pedestrian detection with OpenCV. (i) Object tracking with the dlib C++ library. (j) Multiple object tracking with a Kalman tracker. (k) Multi-class multi-object tracking using changing point detection. (l) Pedestrian tracking using the Deep-Occlusion Reasoning method.*
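The "smallest rectangular bounding box" formulation above can be made concrete: given a binary foreground mask for one subject (however it was obtained), the tight enclosing box falls out in a few lines. A hypothetical sketch, not code from any of the surveyed systems:

```python
def tight_bbox(mask):
    """Smallest axis-aligned box (x1, y1, x2, y2), in pixel indices, that
    encloses every non-zero entry of a 2-D binary mask; None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None  # no foreground pixels: nothing detected
    return (cols[0], rows[0], cols[-1], rows[-1])
```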

Krahnstoever et al. [75] designed a real-time control system of active cameras for a multiple-camera surveillance system. Various researchers have thus shifted focus from static, fixed-camera pedestrian detection to moving, dynamic multi-camera pedestrian detection. Pedestrian tracking with stationary cameras has been done using a shape-based method [76], which detects and compares the human body shape in consecutive frames. The cameras were calibrated using a common site-wide metric coordinate system, as described in [77, 78]. Funahashi et al. [73] developed a system for tracking the human head and face parts by means of a hierarchical tracking method using a stationary camera and a PTZ camera. Recent surveillance systems focus on human tracking by detection, as described in [72–75]. Andriluka et al. [76–78] combined initial estimates of the human pose across frames in a tracking-by-detection framework. Sapp et al. [79] coupled locations of body joints within and across frames from an ensemble of tractable submodels. Wu and Nevatia [80] proposed an approach for detection and tracking of partially occluded people using an assembly of body parts.

The tracking of humans becomes more challenging under moving cameras than under static cameras, as discussed in Section 2. Many effective techniques used with static cameras, such as background subtraction and modeling [80], rely on a constant ground plane assumption, which makes the task under camera motion more difficult. Instead of using background-modeling-based methods to extract human information, human detectors are widely used to detect humans in the video. The challenge is therefore to successfully detect humans under moving cameras and then apply tracking techniques to the detected humans. However, even effective human detectors have limitations: they may produce false detections or miss humans, and when humans are partially or fully occluded, detection can fail and tracking can be unreliable until the person reappears in the frames. Many researchers have worked on individual challenges of pedestrian detection and tracking, but a complete and reliable solution to all the challenges discussed above is still missing. Most pedestrian detection and tracking algorithms were tested in indoor and outdoor environments, and attempts were made to estimate system accuracy in terms of detection rate, time, and computational complexity. From the performance evaluations reported by the authors, deep-learning-based pedestrian detection and tracking approaches can be an efficient choice for real-time environments [45, 65]. There is still scope for improvement in existing approaches to pedestrian detection and tracking in surveillance systems.
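The data-association step at the heart of tracking-by-detection can be sketched in a few lines: match each existing track to the current frame's detections by bounding-box overlap, start new tracks for unmatched detections, and keep unmatched tracks alive for a few frames in case the person is briefly occluded. This is a deliberately minimal, greedy illustration (the surveyed systems use richer cues such as pose and body parts, and stronger assignment methods); the function names are ours:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def associate(tracks, detections, threshold=0.3):
    """Greedily pair tracks {id: box} with detections [box] by best IoU.
    Returns (matches {track_id: det_index}, unmatched detection indices);
    unmatched detections would seed new tracks, and unmatched tracks are
    kept briefly so short occlusions do not kill the track."""
    matches, unmatched = {}, list(range(len(detections)))
    for t_id, t_box in tracks.items():
        best, best_iou = None, threshold
        for d in unmatched:
            score = iou(t_box, detections[d])
            if score > best_iou:
                best, best_iou = d, score
        if best is not None:
            matches[t_id] = best
            unmatched.remove(best)
    return matches, unmatched
```

Greedy matching is the simplest choice here; optimal one-to-one assignment (e.g., the Hungarian algorithm) is the usual upgrade when tracks are dense.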

#### **5. Conclusions**


This chapter describes and reviews the methodologies, strategies, and steps involved in video surveillance. It also addresses the challenges, issues, available databases, available solutions, and research trends for human detection and tracking in video surveillance systems. Based on the literature survey, most of the available techniques proposed by earlier researchers can perform object detection and tracking either within a single camera view or across multiple cameras. However, most of them fail to resolve the trade-off between accuracy and speed: trackers with very good accuracy are often impractical because of their high computational requirements, and vice versa. Thus, an adaptive object detection and tracking method that achieves an optimal trade-off is essential for a real-time and reliable surveillance system. It is for this reason that the main aim of this chapter is to provide valuable insight into the related areas of video surveillance research and to promote new research.


### **Author details**

Ujwalla Gawande<sup>1</sup> \*, Kamal Hajari<sup>1</sup> and Yogesh Golhar<sup>2</sup>

1 Department of Information Technology, Yeshwantrao Chavan College of Engineering, Nagpur, Maharashtra, India

2 Department of Computer Science and Engineering, GH Raisoni Institute of Engineering and Technology, Nagpur, Maharashtra, India

\*Address all correspondence to: ujwallgawande@yahoo.co.in

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**


[1] Minsky M, Kurzweil R, Mann S. IEEE International Symposium on Technology and Society, Toronto, Ontario, Canada, 27th–29th June; The Society of Intelligent Veillance. 2013. pp. 13-17

[2] Foresti GL, Micheloni C, Snidaro L, Remagnino P, Ellis T. Active videobased surveillance system: The low-level image and video processing techniques needed for implementation. In: IEEE Signal Processing Magazine, Vol. 22(2). March 2005. pp. 25-37

[3] Gawande U, Golhar Y. Biometric security system: A rigorous review of unimodal and multimodal biometrics techniques. International Journal of Biometrics (IJBM). April 2018;**10**(2)

[4] Cornett B. Intro to Surveillance Camera Technologies. Available at: http://www.ezwatch.com

[5] Alexandr L. IP Video Surveillance. An Essential Guide. Alexandr Lytkin. ISBN 978-5-600-00033-9. 4th April 2012

[6] Zafeiriou S, Zhang C, Zhang Z. A survey on face detection in the wild: Past, present and future. International Journal of Computer Vision Image Understand. Sep. 2015;**138**:1-24

[7] Teddy K, Lin W. A survey on behavior analysis in video surveillance applications. Video Surveillance. 2011. pp. 281-291. Available at: http:// www.intechopen.com/books/ videosurveillance

[8] Gawande U, Golhar Y, Hajari K. Biometric-based security system: Issues and challenges, intelligent techniques in signal processing for multimedia security. In: Studies in Computational Intelligence. Vol. 660. Cham: Springer; 2017. pp. 151-176

[9] Gawande U, Zaveri M, Kapur A. A novel algorithm for feature level fusion using SVM classifier for multibiometrics-based person identification. Applied Computational Intelligence and Soft Computing. July 2013;**2013**:1-11

[10] Dollar P, Wojek C, Schiele B, Perona P. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence. April 2012;**34**(4): 743-761

[11] Papageorgiou C, Poggio T. A trainable system for object detection. International Journal of Computer Vision. June 2000;**38**(1):15-33

[12] MIT Pedestrian Dataset. Center for Biological and Computational Learning at MIT and MIT. 2005. Available at: http://cbcl.mit.edu/software-datasets/ PedestrianData.html [Accessed: 22 September 2018]

[13] Levi D, Silberstein S. Tracking and motion cues for rear-view pedestrian detection. In: 18th IEEE Intelligent Transportation Systems Conference, Spain, 15th–16th Sept. 2015. pp. 664-671

[14] Li X, Flohr F, Yang Y, Xiong H, Braun M, Pan S. A new benchmark for vision-based cyclist detection. In: IEEE Intelligent Vehicles Symposium, Sweden, 19th–22nd. June 2016. pp. 1028-1033

[15] Campbell D, Petersson L. GOGMA: Globally-optimal Gaussian mixture alignment. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, IEEE. 2016

[16] Pellegrini S, Ess A, Van Gool L. Wrong turn–No dead end: A stochastic pedestrian motion model. International Workshop on Socially Intelligent Surveillance and Monitoring (SISM'10), (CVPR), San Francisco, CA, USA, 13th–18th June. 2010

[17] Christian W, Stefan W, Schiele B. Multi-cue onboard pedestrian detection. In: IEEE Conference on Computer Vision and Pattern Recognition, Miami, Florida, USA, 20–25 June. 2009

[18] Dalal N. Finding people in images and videos [PhD thesis]. Inria Grenoble-Rhône-Alpes; 2006

[19] Everingham M, Eslami SMA, Van Gool L, Williams CKI, Winn J, Zisserman A. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision. June 2015;**111**(1):98-136

[20] CVC-ADAS Pedestrian dataset. 2012. Available at: http://adas.cvc.uab. es/site/ [Accessed: 22 September 2018]

[21] Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft coco: Common objects in context. In: European Conference on Computer Vision. Springer; 2014, 2015. pp. 740-755

[22] Yun K, Lim J, Choi JY. Scene conditional background update for moving object detection in a moving camera. Pattern Recognition Letters. 2017;**88**:57-63

[23] St-Charles PL, Bilodeau GA, Bergevin R. Subsense: A universal change detection method with local adaptive sensitivity. IEEE Transactions on Image Processing. 2015;**24**(1):359-373

[24] Cogun F, Cetin AE. Object tracking under illumination variations using 2D spectrum characteristics of the target. IEEE International Workshop on Multimedia Signal Processing. 2010: 521-526

[25] Heikkila M, Pietikainen M. A texture-based method for modelling the background and detecting moving objects. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2006;**28**(4):657-662

[26] Shen C, Lin X, Shi Y. Moving object tracking under varying illumination conditions. Pattern Recognition Letters. 2006;**27**(14):1632-1643

[27] Lee YB. A real-time color-based object tracking robust to irregular illumination variations. In: IEEE International Conference on Robotics and Automation, 21st–26th May 2001. pp. 1659-1664

[28] Tokmakov P, Alahari K, Schmid C. Learning motion patterns in videos. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 2017. pp. 531-539

[29] Balan A, Black MJ. An adaptive appearance model approach for model-based articulated object tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. June 2006. pp. 758-765

[30] Porikli F, Tuzel O. Covariance tracking using model update based on Lie algebra. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR. June 2006. pp. 728-735

[31] Lim J, Ross DA, Lin RS, Yang MH. Incremental learning for visual tracking. Advances in Neural Information Processing Systems. 2004:793-800

[32] Kwon J, Lee KM. Tracking of abrupt motion using Wang-Landau Monte Carlo estimation. In: European Conference on Computer Vision. Oct. 2008. pp. 387-400

[33] Zhou X, Lu Y, Lu J, Zhou J. Abrupt motion tracking via intensively adaptive Markov-chain Monte Carlo sampling. IEEE Transactions on Image Processing. 2012;**21**(2):789-801

[34] Wang F, Lu M. Hamiltonian Monte Carlo estimator for abrupt motion tracking. In: International Conference on Pattern Recognition, ICPR. Nov. 2012. pp. 3066-3069

[35] Zhang H, Zhang J, Wu Q, Qian X, Zhou T, Hengcheng FU. Extended kernel correlation filter for abrupt motion tracking. KSII Transactions on Internet and Information Systems. 2017;**11**(9):4438-4460

[36] Jepson AD, Fleet DJ, El-Maraghi TF. Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2003;**25**(10):1296-1311

[37] Yilmaz A, Li X, Shah M. Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2004;**26**(11):1531-1536

[38] Senior A, Hampapur A, Tian YL, Brown L, Pankanti S, Bolle R. Appearance models for occlusion handling. Image and Vision Computing. 2006;**24**(11):1233-1243

[39] Pan J, Hu B. Robust occlusion handling in object tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. 2007. pp. 1-8

[40] Hou L, Wan W, Lee KH, Hwang JN, Okopal G, Pitton J. Robust human tracking based on DPM constrained multiple-kernel from a moving camera. Journal of Signal Processing Systems. 2017;**86**(1):27-39

[41] Delagnes P, Benois J, Barba D. Active contours approach to object tracking in image sequences with complex background. Pattern Recognition Letters. 1995;**16**(2):171-178

[42] Monnet A, Mittal A, Paragios N, Ramesh V. Background modelling and subtraction of dynamic scenes. In: Ninth IEEE International Conference on Computer Vision. 2003. pp. 1305-1312

[43] Li L, Huang W, Gu IY, Tian Q. Statistical modelling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing. 2004;**13**(11):1459-1472

[44] Chetverikov D, Péteri R. A brief survey of dynamic texture description and recognition. Computer Recognition Systems. 2005:17-26

[45] Arashloo SR, Amirani MC, Noroozi A. Dynamic texture representation using a deep multi-scale convolutional network. Journal of Visual Communication and Image Representation. 2017;**43**:89-97

[46] Minematsu T, Uchiyama H, Shimada A, Nagahara H, Taniguchi RI. Adaptive background model registration for moving cameras. Pattern Recognition Letters. 2017:86-95

[47] Xia H, Shuxiang S, Liping H. A modified Gaussian mixture background model via spatiotemporal distribution with shadow detection. Signal, Image and Video Processing. 2016;**10**(2):343-350

[48] Song R, Liu M. A shadow elimination algorithm based on HSV spatial feature and texture feature. In: International Conference on Emerging Internetworking, Data Web Technologies. 2017. pp. 585-591

[49] Treptow A, Zell A. Real-time object tracking for soccer-robots without color information. Robotics and Autonomous Systems. 2004;**48**(1):41-48

[50] Hua C. A noise-insensitive object tracking algorithm. In: Conference on Computer Vision. Nov. 2007

[51] Unger M, Asdsch M, Hosten P. Enhanced background subtraction using global motion compensation and

[43] Li L, Huang W, Gu IY, Tian Q. Statistical modelling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing. 2004;**13**(11):1459-1472

[44] Chetverikov D, Péteri R. A brief survey of dynamic texture description and recognition. Computer Recognition Systems. 2005:17-26

[45] Arashloo SR, Amirani MC, Noroozi A. Dynamic texture representation using a deep multi-scale convolutional network. Journal of Visual Communication and Image Representation. 2017;**43**:89-97

[46] Minematsu T, Uchiyama H, Shimada A, Nagahara H, Taniguchi RI. Adaptive background model registration for moving cameras. Pattern Recognition Letters. 2017:86-95

[47] Xia H, Shuxiang S, Liping H. A modified Gaussian mixture background model via spatiotemporal distribution with shadow detection. Signal, Image and Video Processing. 2016;**10**(2): 343-350

[48] Song R, Liu M. A shadow elimination algorithm based on HSV spatial feature and texture feature. In: International Conference on Emerging Internetworking, Data Web Technologies. 2017. pp. 585-591

[49] Treptow A, Zell A. Real-time object tracking for soccer-robots without color information. Robotics and Autonomous Systems. 2004;**48**(1):41-48

[50] Hua C. A noise-insensitive object tracking algorithm. In: Conference on Computer Vision. Nov. 2007

[51] Unger M, Asdsch M, Hosten P. Enhanced background subtraction using global motion compensation and

mosaicing. In: International Conference on Image Processing, ICIP. Oct. 2008. pp. 2708-2711

[52] Li Y, Ai H, Yamashita T. Tracking in low frame rate video: A cascade particle filter with discriminative observers of different life spans. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2008;**30**(10):1728-1740

[53] Spampinato C, Chen-Burger YH, Nadarajan G, Fisher RB. Detecting, tracking and counting fish in low quality unconstrained underwater videos. VISAPP. 2008;**2**:514-519

[54] Wu Y, Ling H, Yu J, Li F, Mei X, Cheng E. Blurred target tracking by blur-driven tracker. In: International Conference on Computer Vision. Nov. 2011. pp. 1100-1107

[55] Dollár P, Appel R, Belongie S, Perona P. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2014;**36**(8):1532-1545

[56] Zhang X, Hu W, Xie N, Bao H, Maybank S. A robust tracking system for low frame rate video. International Journal of Computer Vision. 2015; **115**(3):279-304

[57] Chen HK, Zhao XG. Heterogeneous features fusion-based low-resolution human detection method for outdoor video surveillance. International Journal of Automation and Computing. 2017;**14**(2):136-146

[58] Irani M, Anandan P. A unified approach to moving object detection in 2d and 3d scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998;**20**(6):577-589

[59] Sawhney HS, Guo Y, Kumar R. Independent motion detection in 3d scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;**22**(10):1191-1199

[60] Zhou D, Frémont V, Quost B, Dai Y, Li H. Moving object detection and segmentation in urban environments from a moving platform. Image and Vision Computing. 2017;**68**:76-87

*Pedestrian Detection and Tracking in Video Surveillance System: Issues, Comprehensive Review…*
*DOI: http://dx.doi.org/10.5772/intechopen.90810*

[61] Available from: http://www.cvpapers.com/datasets.html

[62] Wang J. Representing moving images with layers. IEEE Transactions on Image Processing. 1994;**3**(5):625-638

[63] Xiao J, Shah M. Motion layer extraction in the presence of occlusion using graph cut. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005;**27**(10):1644-1659

[64] Chen T, Lu S. Object-level motion detection from moving cameras. IEEE Transactions on Circuits and Systems for Video Technology. 2017;**27**(11):2333-2343

[65] Hayman E, Eklundh JO. Statistical background subtraction for a mobile observer. In: IEEE International Conference on Computer Vision, ICCV. 2003. pp. 67-74

[66] Shen Y. Video stabilization using principal component analysis and scale invariant feature transform in particle filter framework. IEEE Transactions on Consumer Electronics. 2009;**55**(3):1714-1721

[67] Li-Fen T, Qi P, Si-Dong Z. A moving object detection method adapted to camera jittering. Journal of Electronics and Information Technology. 2014;**35**(8):1914-1920

[68] Black MJ, Jepson AD. Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. International Journal of Computer Vision. 1998;**26**(1):63-84

[69] Athanesious SP. Systematic survey on object tracking methods in video. International Journal of Advanced Research in Computer Engineering and Technology. 2012;**1**(8):242-247

[70] Chaaraoui A, Climent-Perez P. A review on vision techniques applied to human behavior analysis for ambient-assisted living. Expert Systems with Applications. Sep. 2012;**39**(12):10873-10888

[71] Hu W, Tan T, Wang L, Maybank S. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics, Part C. 2004;**34**(3):334-352

[72] Haritaoglu I, Harwood D, Davis L. Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;**22**(8):809-830

[73] Hu W, Tan T, Wang L. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics, Part C. Aug. 2004;**34**(3):334-352

[74] Tsai D-M, Lai S-C. Independent component analysis-based background subtraction for indoor surveillance. IEEE Transactions on Image Processing. Jan. 2009;**18**(1):158-167

[75] Stauffer C, Grimson WEL. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence. Aug. 2000;**22**(8):747-757

[76] Yilmaz A, Javed O, Shah M. Object tracking: A survey. ACM Computing Surveys. Dec. 2006;**38**(4):1-45

[77] Wu D, Shao L. Silhouette analysis-based action recognition via exploiting human poses. IEEE Transactions on Circuits and Systems for Video Technology. Feb. 2013;**23**(2):236-243

[78] Li L, Huang W, Gu IY-H, Tian Q. Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing. Nov. 2004;**13**(11):1459-1472

[79] Krahnstoever N, Yu T, Lim S-N, Patwardhan K, Tu P. Collaborative real-time control of active cameras in large scale surveillance systems. In: Proceeding Workshop, France. Oct. 2008. pp. 1-12

[80] Elaiw A, Al-Turki Y, Alghamdi M. A critical analysis of behavioural crowd dynamics—From a modelling strategy to kinetic theory methods. MDPI Symmetry Journal. July 2019;**11**(851):1-11

### *Edited by Ali Sadollah and Tilendra Shishir Sinha*

Traditional models struggle to cope with complexity, noise, and changing environments, whereas Computational Intelligence (CI) offers solutions to such complicated problems as well as to inverse problems. The main feature of CI is adaptability, and it spans the fields of machine learning and computational neuroscience. CI also comprises biologically inspired techniques such as swarm intelligence and evolutionary computation, and it encompasses wider areas such as image processing, data collection, and natural language processing. This book discusses the use of CI for the optimal solving of various applications, demonstrating its wide reach and relevance. Combining optimization methods with data mining strategies yields a strong and reliable prediction tool for handling real-life applications.

Published in London, UK © 2020 IntechOpen © raspirator / iStock

*Recent Trends in Computational Intelligence*